<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Hyatt Regency, San Francisco Airport, California, USA, March</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Fine-Grained ImageNet Classification in the Wild</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Lymperaiou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantinos Thomas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgos Stamou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AILS Lab, School of Electrical and Computer Engineering, National Technical University of Athens</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>2</volume>
      <fpage>7</fpage>
      <lpage>29</lpage>
      <abstract>
<p>Image classification has been one of the most popular tasks in Deep Learning, seeing an abundance of impressive implementations each year. However, there is a lot of criticism tied to promoting complex architectures that continuously push performance metrics higher and higher. Robustness tests can uncover several vulnerabilities and biases which go unnoticed during the typical model evaluation stage. So far, model robustness under distribution shifts has mainly been examined within carefully curated datasets. Nevertheless, such approaches do not test the real response of classifiers in the wild, e.g. when uncurated web-crawled image data of corresponding classes are provided. In our work, we perform fine-grained classification on closely related categories, which are identified with the help of hierarchical knowledge. Extensive experimentation on a variety of convolutional and transformer-based architectures reveals model robustness in this novel setting. Finally, hierarchical knowledge is again employed to evaluate and explain misclassifications, providing an information-rich evaluation scheme adaptable to any classifier.</p>
      </abstract>
      <kwd-group>
<kwd>Image Classification</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Robustness</kwd>
        <kwd>Explainable Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Even though much effort is invested in perpetually improving model performance by
employing more and more refined architectures and techniques, inevitably increasing the demand for
computational resources necessary for training, there are still some open questions regarding the
ability of such models to properly handle distribution shifts. Distribution shifts refer to testing
an already trained model on a data distribution that diverges from the one the model was trained
on. The analysis of distribution shifts has gained interest in recent years [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14">10, 11, 12, 13, 14</xref>
        ], as a
crucial step towards enhancing model robustness. Most of these endeavors apply pixel-level
perturbations to artificially influence the distribution under investigation. Nevertheless, the
highly constrained setting of artificial distribution shifts excludes various real-world scenarios,
impeding robust generalization of image classifiers. In this case, natural shifts [
        <xref ref-type="bibr" rid="ref15 ref16 ref17 ref18">15, 16, 17, 18</xref>
        ]
are more representative. They usually require the creation of a curated dataset containing
image variations such as changes in viewpoint or object background, rotations, and other minor
changes. Both synthetic and natural shifts can comprise data augmentation techniques, which
aid the development of robust models when incorporated during training [
        <xref ref-type="bibr" rid="ref19 ref20 ref21 ref22">19, 20, 21, 22</xref>
        ].
      </p>
      <p>So far, there is no approach testing image classification ’in the wild’, where uncurated images
corresponding to pre-defined dataset labels are encountered. We argue that this is a real-world
user-oriented scenario, where totally new images corresponding to ImageNet labels need to
be appropriately classified. For example, an image of a cat found on the web may significantly
differ from ImageNet cat instances, even when popular distribution shifts are taken into account.
Even though a human can identify a cat present in an image with satisfactory confidence, we
question whether an image classifier can do so; the unrestricted space of possible variations of
uncurated images demands advanced generalization capabilities to properly understand the real
discriminative characteristics of an ImageNet class without getting distracted by extraneous
features.</p>
      <p>The problem of classification ’in the wild’ becomes even more difficult when fine-grained
classification needs to be performed, as distinguishing between closely related categories relies
on detailed discriminative characteristics, which may be less prevalent in uncurated settings.
For example, the Siamese and Persian cat breeds present many visual similarities, increasing the
potential risk of learning and reproducing dataset biases, especially when distribution shifts are
present. We can attribute this risk to the fact that existing classifiers lack external or domain
knowledge, which can help humans discriminate between closely related categories.</p>
      <p>
        To sum up, in our current paper we aspire to answer the following questions:
1. How do different models, pre-trained on ImageNet or web images, behave on uncurated
image sets crawled from Google images (given ImageNet labels as Google queries)? We
target this question by producing a novel natural distribution shift based on uncurated
web images upon which we evaluate various image classifiers.
2. How does hierarchical knowledge help with evaluating classification results since several
ImageNet categories are hierarchically related? We attempt to verify to what extent the
assumption holds that the lack of external knowledge limits the generalization capabilities of
classifiers. Thus, we leverage WordNet [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] to discover neighbors of given terms
and test whether classifiers struggle with discriminating between closely related classes.
3. Can evaluation of classification be explainable? Knowledge sources, such as WordNet,
can reveal the semantic relationships between concepts (ImageNet classes), providing
possible paths connecting frequently confused classes.
      </p>
      <p>Our code can be found at https://github.com/marialymperaiou/classification-in-the-wild.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Image classifiers With the surge of neural architectures for classification tasks, Computer
Vision has been one of the fields that benefited most from recent developments. Convolutional
neural networks (CNNs) are a well-established backbone, with the first successful endeavors [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] already
paving the way for more refined architectures, such as VGG [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], Inception [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], ResNet [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ],
Xception [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], InceptionResnet [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] and others [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There is some criticism around the usage of
CNNs for image classification, even though some contemporary endeavors such as ConvNeXt
[
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] revisit and insist on the classic paradigm, providing advanced performance. The rapid
advancements that the Transformer framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] brought via the usage of self-attention
mechanisms, widely replacing prior architectures for Natural Language Processing applications,
inspired the usage of similar models for Computer Vision as an answer to the aforementioned
criticism [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. Thus, Vision Transformers (ViTs) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] built upon [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] set a new baseline in literature;
ever since, several related architectures emerged. In general, transformer-based models rely on
an abundance of training data to ensure proper generalization. This requirement was relaxed in
DeiT [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], enabling learning on medium-sized datasets. Further development introduced novel
transformer-based architectures, such as BeiT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Swin [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] and RegNets [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], which realize
specific refinements to boost performance. Overall, it has been shown that ViTs are more robust
compared to classic CNN image classifiers [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. In our work, we verify the degree to which this claim
holds by testing CNN and transformer-based classifiers in the uncurated fine-grained setting.
Robustness under distribution shifts Generalization capabilities of existing image
classifiers have been a crucial problem [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], currently addressed from a few different viewpoints.
Artificial corruptions [
        <xref ref-type="bibr" rid="ref11 ref14 ref16 ref36 ref37">36, 14, 37, 16, 11</xref>
        ] or natural shifts [
        <xref ref-type="bibr" rid="ref15 ref38">15, 38</xref>
        ] on curated data have already
exposed biases and architectural vulnerabilities. Adversarial robustness [
        <xref ref-type="bibr" rid="ref39 ref40 ref41 ref42 ref43">39, 40, 41, 42, 43</xref>
        ] is a
related field where models are tested against adversarial examples, which introduce
imperceptible though influential perturbations on images. Contrary to such attempts, we concentrate
on naturally occurring distribution shifts stemming from uncurated image data. Regarding
architectural choices, many studies perform robustness tests attempting to resolve the CNN vs
Transformer contest [
        <xref ref-type="bibr" rid="ref34 ref44 ref45">34, 44, 45</xref>
        ], while other ventures focus on interpreting and
understanding model robustness [
        <xref ref-type="bibr" rid="ref46 ref47 ref48">46, 47, 48</xref>
        ]. In our approach, by experimenting with both CNN and
transformer-based architectures we adapt such research attempts to the uncurated setting.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>The general workflow of our method (Figure 1) consists of three stages. First, the dataset is
constructed by gathering common terms (queries) and their subcategories that exist as
ImageNet classes. Images corresponding to those terms are crawled from Google search. In
the second stage, various pre-trained classifiers are utilized to classify crawled images. The
hierarchical relationships between the given classes are reported to enrich the evaluation
process. Finally, all semantic relationships between misclassified samples are gathered to extract
explanations and quantify how much falsely predicted classes diverge from their ground truth.</p>
      <p>
        Dataset creation We start by gathering user-defined common words regarding visual
concepts as queries, which will act as starting points towards extracting subcategories. The WordNet
hierarchy [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] is used to provide the subcategories, via the hypernym-hyponym (IsA)
relationships, which refer to more general or more specific concepts respectively. For example, given the
query ’car’, its hypernym is ’motor vehicle’ (’car’ IsA ’motor vehicle’), while its hyponyms are
’limousine’ (’limousine’ IsA ’car’), ’sports car’ (’sports car’ IsA ’car’) and other specific car types.
Therefore, we map queries on WordNet to obtain all their immediate hyponyms, constructing a
hyponym set. We then filter out any hyponyms that do not belong to the ImageNet class labels.
      </p>
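      <p>For illustration, this hyponym-extraction step can be sketched with NLTK's WordNet interface; the imagenet_labels set below is a hypothetical stand-in for the ImageNet-1k class names:</p>
      <preformat>
# A minimal sketch of the subcategory-extraction step, assuming NLTK's
# WordNet corpus is installed; imagenet_labels is a hypothetical set
# holding the 1000 ImageNet-1k class names.
from nltk.corpus import wordnet as wn

def imagenet_hyponyms(query, imagenet_labels):
    """Return the immediate WordNet hyponyms of query that are ImageNet classes."""
    hyponyms = set()
    for synset in wn.synsets(query, pos=wn.NOUN):
        for hypo in synset.hyponyms():          # immediate (1-hop) hyponyms only
            for lemma in hypo.lemma_names():
                hyponyms.add(lemma.replace('_', ' '))
    # keep only hyponyms that also exist as ImageNet class labels
    return sorted(h for h in hyponyms if h in imagenet_labels)
      </preformat>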
      <p>The filtered hyponyms, along with the initial query, are provided as search terms to a web
crawler suitable for searching Google images. We set a predefined threshold k for the number
of Google images returned, so that we evaluate classifiers on categories containing almost equal
numbers of samples. This is necessary since some popular categories may return far more
Google images than others. We will experiment with several values of k, thus influencing
the tradeoff between relevance to the keyword and adequate dataset size. The retrieved images
comprise a labeled dataset, with the keywords as labels.</p>
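      <p>A minimal sketch of the dataset-construction step under this threshold; the crawler call fetch_google_image_urls is a hypothetical stand-in for any Google-image crawling utility:</p>
      <preformat>
# Build the labeled dataset, capped at k crawled images per keyword.
# fetch_google_image_urls is a hypothetical crawler interface.
def build_dataset(keywords, k=50):
    """Return a list of (image_url, label) pairs with at most k images per keyword."""
    dataset = []
    for keyword in keywords:
        urls = fetch_google_image_urls(keyword)  # hypothetical crawler call
        for url in urls[:k]:                     # enforce the per-class threshold k
            dataset.append((url, keyword))
    return dataset
      </preformat>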
      <p>
        Classification We consider a variety of image classifiers to test their ability for fine-grained
classification on uncurated web images. We commence our experimentation with
convolution-based models as baselines, which have generally been considered less robust against
distribution shifts and other perturbations [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], and we proceed with recent transformer-based
architectures. We perform no further training or fine-tuning on the selected models.
      </p>
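      <p>As an illustration of this inference setup, the following sketch loads one off-the-shelf ImageNet classifier through torchvision (ResNet50 here; any of the tested models could be substituted), with no training or fine-tuning:</p>
      <preformat>
# A minimal inference sketch with a pre-trained ImageNet classifier,
# assuming torchvision 0.13 or later for the weights API.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()               # the matching eval transforms

def predict(image_path):
    """Return the predicted ImageNet class name for a single crawled image."""
    batch = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        logits = model(batch)
    return weights.meta['categories'][logits.argmax(dim=1).item()]
      </preformat>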
      <p>For each model, we perform inference on the crawled images that constitute our dataset, as
explained in the previous paragraph. We implement a rich evaluation scheme to capture various
insights into the classification process. Accuracy is useful as a benchmark metric to compare
our findings with expected classification results. WordNet similarity functions offer valuable
information about misclassifications; for example, let’s assume that the true label of a sample
is ’cat’ and the classifier predicts the label ’dog’ in one case and the label ’airplane’ in another
case. Intuitively, we hypothesize that a ’cat’ is more closely related to a ’dog’ than an ’airplane’
since they are both animals. This human intuition is reflected in the WordNet hierarchy, thus
assigning a different penalty depending on the concept relevance within the hierarchy.</p>
      <p>This concept-based evaluation can be realized using the following WordNet functions: path
similarity, Leacock-Chodorow Similarity (LCH), and Wu-Palmer Similarity (WUPS). Path
similarity evaluates how similar two concepts are, based on the shortest path that connects them
within the WordNet hierarchy. It provides values between 0 and 1, with 1 denoting the
maximum possible similarity score. LCH also seeks the shortest path between two concepts
but additionally regards the depth of the taxonomy. Specifically, equation (1) mathematically
describes LCH between two concepts c1 and c2:

LCH(c1, c2) = −log( path(c1, c2) / (2 · D) )    (1)

where path(c1, c2) denotes the length of the shortest path connecting c1 and c2, and D refers to the taxonomy
depth. Higher LCH values indicate higher similarity between concepts. WUPS takes into
account the depths at which the two concepts c1 and c2 appear in the WordNet taxonomy and the depth
of their most specific common ancestor node, called the Least Common Subsumer. Higher WUPS
scores refer to more similar concepts. For each of the path similarity, LCH, and WUPS metrics
we obtain an average value over the total number of samples of the constructed dataset.</p>
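      <p>All three functions are exposed directly by NLTK's WordNet interface; a minimal sketch follows (the synset names are illustrative and would be mapped from the actual class labels):</p>
      <preformat>
# A sketch of the three concept-based metrics with NLTK; gt and fp stand
# for the ground truth and false positive synsets of one misclassification.
from nltk.corpus import wordnet as wn

gt, fp = wn.synset('cat.n.01'), wn.synset('dog.n.01')

path = gt.path_similarity(fp)   # in (0, 1], 1 denotes identical concepts
lch = gt.lch_similarity(fp)     # equation (1): -log(path / (2 * depth))
wups = gt.wup_similarity(fp)    # based on the Least Common Subsumer depth

# Averaging any of these scores over all misclassified samples of the
# dataset yields the reported per-metric values.
      </preformat>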
      <p>Moreover, we report the percentage of sibling concepts among misclassifications. Two concepts
are considered to be siblings if they share an immediate (1 hop) parent. For example, the concepts
’tabby cat’ and ’egyptian cat’ share the same parent node (’domestic cat’). It is highly likely that
a classifier is more easily confused between two sibling classes, thus providing false positive (FP)
predictions closely related to the ground truth (GT) label. Therefore, a lower sibling percentage
denotes reduced classification capacity compared to models with a higher sibling percentage.
Explanations are provided during the evaluation stage, aiming to answer why a pre-trained
classifier cannot correctly classify uncurated images belonging to a given class.</p>
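      <p>A minimal sketch of the sibling test, again assuming NLTK synsets for the two labels:</p>
      <preformat>
# Two concepts are siblings if they share an immediate (1-hop) hypernym,
# e.g. the 'tabby cat' and 'egyptian cat' synsets both fall under
# 'domestic_cat.n.01' (sense numbers are illustrative).
from nltk.corpus import wordnet as wn

def are_siblings(synset_a, synset_b):
    """True if the two synsets share at least one immediate hypernym."""
    parents_a = set(synset_a.hypernyms())
    parents_b = set(synset_b.hypernyms())
    return not parents_a.isdisjoint(parents_b)
      </preformat>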
      <p>FP predictions contain valuable information regarding which classes are confused with the
GT. The per-class misclassification frequency (MF) refers to the percentage of occurrences of
each false positive class f within the total number of false positive instances. Thus, given a
dataset with N classes, g as the ground truth class and f as one of the false positive classes, the
misclassification frequency for the g → f misclassification is:</p>
      <p>MF(g → f) = ( n_f / Σ_i n_i ) · 100%    (2)
where the sum runs over all N classes and n_i denotes the number of samples of ground truth
class g misclassified as class i. MF(g → f) scores can be extracted for all f ≠ g FP classes so that the most
influential misclassifications are discovered. Higher MF(g → f) scores denote a classifier tendency to choose the FP
class over the GT one, therefore indicating either a classifier bias or an annotation error in the
dataset. Specifically, a classifier bias refers to consistently classifying samples from class g as
samples of class f, given that the annotation is the best possible. Of course, such a requirement
cannot be always satisfied, especially when expert annotators are needed, as may happen in the
case of fine-grained classification. On the other hand, since our explainable evaluation approach
is able to capture such misclassification patterns, it is not necessary to attribute the source of
misclassification beforehand. Human annotators can be employed at a later stage, identifying
and verifying the source of misclassifications.</p>
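      <p>A minimal sketch of how MF scores can be computed from (ground truth, prediction) pairs, following the notation of equation (2):</p>
      <preformat>
# Misclassification frequency (MF) of equation (2), computed from
# (ground_truth, prediction) pairs for one ground truth class g.
from collections import Counter

def misclassification_frequency(pairs, g):
    """Return a dict mapping each FP class to its MF percentage for class g."""
    fp_counts = Counter(pred for gt, pred in pairs if gt == g and pred != g)
    total = sum(fp_counts.values())
    if total == 0:
        return {}
    return {fp: 100.0 * n / total for fp, n in fp_counts.items()}
      </preformat>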
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In all following experiments, we selected a threshold of k = 50 crawled images per class. We will
present results on a random initial query as a proof-of-concept to demonstrate our findings. For
this reason, we provide the query ’cat’, which returns the following WordNet hyponyms (also
corresponding to ImageNet labels):</p>
      <p>{’angora cat’, ’cougar cat’, ’egyptian cat’, ’leopard cat’, ’lynx cat’, ’persian cat’, ’siamese
cat’, ’tabby cat’, ’tiger cat’}</p>
      <p>The same experimentation can be replicated for other selected queries, as long as they can be
mapped on WordNet.</p>
      <sec id="sec-4-1">
        <title>4.1. Convolutional classifiers</title>
        <p>
          We leveraged the following CNN classifiers: VGG16/19 [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], ResNet50/101/152 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ],
InceptionV3 [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], InceptionResnetV2 [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], Xception [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], MobileNetV2 [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ], NasNet-Large [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ],
DenseNet121/169/201 [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ], EfficientNet-B7 [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ], ConvNeXt [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. We present results for CNN
classifiers in Table 1. Bold instances denote lower accuracy than the best ImageNet accuracy of
each model, as reported by the authors of each model respectively (see
https://paperswithcode.com/sota/image-classification-on-imagenet). Underlined cells indicate
best accuracy/sibling percentage scores for each category. The absence of models or keywords
from Table 1 means that they correspond to zero accuracy scores. For example, we observe the
complete absence of models such as InceptionV3, InceptionResNetV2, Xception, NASNet-Large,
DenseNet121/169/201, meaning that they are completely unable to properly classify the crawled
images, even those belonging to categories that show satisfactory accuracy when other
classifiers are deployed. MobileNetV2 also shows deteriorated performance for all categories. We
will investigate later if hierarchical knowledge can help extract any meaningful information
regarding this surprisingly low performance.
        </p>
        <p>Another result that can be extracted from Table 1 is that some categories can be easily classified
(’siamese cat’, ’lynx cat’, ’cougar cat’, ’persian cat’, ’cat’) contrary to others (’tabby cat’, ’tiger
cat’, ’egyptian cat’, ’leopard cat’, ’angora cat’). Since we have no specific knowledge of animal
species, we will once again leverage WordNet to obtain explanations regarding this behavior.
Sibling percentages offer a first glance at the degree of confusion between similar classes
in the fine-grained setting. For example, even though ’siamese cat’ and ’cougar cat’ classes
demonstrate high accuracy scores, we observe a completely different behavior regarding the
sibling percentages: most CNN classifiers return some sibling false positives for ’siamese cat’
ground truth label, while the opposite happens for the ’cougar cat’ ground truth label, which
mostly receives zero sibling misclassifications. This behavior indicates that for ’siamese cat’
if a sample is misclassified, it is likely that it belongs to a conceptually similar class, while for
’cougar cat’ misclassifications, false positives belong to more semantically distant categories.</p>
        <p>Regarding model capabilities, we observe that for both ’siamese’ and ’cougar cat’ classes, all
ResNet50 false positives belong to non-sibling classes, contrary to EfficientNet false positives,
which all belong to sibling classes. By also looking at other categories, we observe that in
general, EfficientNet achieves a higher sibling percentage compared to ResNet50, meaning that
EfficientNet misclassifications are more justified compared to ResNet50 misclassifications.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Transformer-based classifiers</title>
        <p>
          The following transformer-based image classifiers were used: ViT [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], Regnet-x [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], DeiT [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ],
BeiT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], CLIP [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ], Swin Transformer V2 [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. Results for Transformer-based classifiers are
provided in Table 2. We spot a similar pattern regarding the categories upon which models
struggle to make predictions: instances belonging to ’tabby cat’, ’tiger cat’, ’egyptian cat’
categories are classified with low accuracy compared to ’siamese cat’, ’lynx cat’, ’cougar cat’,
’persian cat’, ’cat’, ’angora cat’ and ’leopard cat’. We suspect that there is a common reason
behind this behavior, probably attributed to unavoidable inter-class similarities present in the
fine-grained classification setting.
        </p>
        <p>As for model performance, we examine sibling percentage apart from exclusively evaluating
accuracy. The behavior of transformer-based models regarding sibling misclassification is harder
to interpret compared to that of CNN models, because models that return high sibling percentages
for some categories may present low sibling percentages for other categories and vice versa. For
example, BeiT scores low on sibling percentages for ’tabby cat’ (3.45%), ’siamese cat’ (0%) and
’persian cat’ (10%) compared to other models for the same classes; on the other hand, it returns
best sibling scores for ’leopard cat’ (78.72%), ’tiger cat’ (22.45%) and ’egyptian cat’ (22.50%). More
results about the explainability of results are provided in Section 4.3.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Explaining misconceptions</title>
        <p>In Tables 3,4 &amp; 5 we report the top-3 misclassifications per ground truth (GT) category and
per model, as well as the misclassification frequency (MF) for each false positive (FP) label. GT
column refers to cat species exclusively, even if the word ’cat’ is omitted (for example, ’tiger’ GT
entry refers to ’tiger cat’). We highlight with red irrelevant FP classes, which are semantically
distant compared to the GT label, while misconceptions involving sibling classes are highlighted
with blue. Moreover, magenta indicates that an FP is actually an immediate (1 hop) hypernym
of the GT. Due to space constraints, we present here all transformer-based models, but only a
subset of the CNN models tested in total; more results can be found in the Appendix.</p>
        <p>Interestingly, we can spot some surprising frequent misconceptions, such as confusing cat
species with the ’mexican hairless’ dog breed. For CNN classifiers, we spot this peculiarity for
all models under investigation: 10.53% of ResNet50 FP for ’egyptian cat’ GT label belong to the
’mexican hairless’ class; the same applies to 14.29% of ResNet101 FP, 18.18% of ResNet152 FP,
and 8.33% of VGG16 FP. More animals such as ’wallaby’, ’jaguar’, ’sea lion’, ’cheetah’, ’arctic
fox’, ’coyote’ etc. appear as frequent FPs.</p>
        <p>
          For transformer models, the ’egyptian cat’ → ’mexican hairless’ abnormality is observed for all
classifiers when ’egyptian cat’ GT label is provided, resulting in the following ’mexican hairless’
FP percentages: 26.67% for CLIP, 10% for BeiT, 15.62% for DeiT, 15.38% for xRegNet, 20.83% for
Swin, and 16.33% for ViT. Obviously, regardless of whether a CNN or a transformer classifier is
being used, images of ’egyptian cats’ are often erroneously perceived as ’mexican hairless dogs’.
A qualitative comparison between ’egyptian cat’ images and ’mexican hairless dog’ images indicates
that these animals are clearly distinct, even though they present similar ear shapes and rather
hairless, thin bodies. Therefore, we can assume that the transformer-based classifiers are biased
towards texture, verifying relevant observations reported for CNNs [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Also, ear shape acts
as a confounding factor, overshadowing other actually distinct animal characteristics. There
are more misclassifications involving animals, such as ’armadillo’, ’chihuahua’, ’soft-coated
wheaten terrier’, ’kelpie’, and others.
        </p>
        <p>Even more surprising are misclassifications not including animal species. For example, CNN
classifiers predict ’web site’ instead of ’tabby cat’, ’hatchet’ instead of ’persian cat’, ’barbershop’
instead of ’cat’, ’menu’ instead of ’cougar’ etc. All ResNet50/101/152 and VGG16 make at
least one such misclassification, which strongly calls into question which features of cat species
contribute to such predictions.</p>
        <p>Misclassifications involving non-animal classes using transformers (Tables 4, 5) provide the
following interesting abnormalities: ’cat’ is classified as ’fur coat’ for 50% of the FP instances
using DeiT. This non-negligible misclassification rate once again verifies the aforementioned
texture bias. In a similar sense, xRegNet classifies ’egyptian cat’ images as ’mask’ and as ’comic
book’, each for 7.69% of the FPs. Such categories had also appeared in CNN misclassifications.
We cannot provide a human-interpretable explanation about the ’mask’ misclassification, since
the term ’mask’ may refer to many different objects. We hypothesize that ’mask’ ImageNet
instances may contain carnival masks looking similar to cats, therefore the lack of context
confused xRegNet. ’Comic book’ appears 9.38% of the times an ’egyptian cat’ image is misclassified
by DeiT, 33.33% of the times a ’cat’ photo is misclassified by xRegNet, and 16.67% of the times
an ’egyptian cat’ is misclassified by Swin. This can be attributed to the fact that crawled images
may contain cartoon-like instances, which cannot be clearly regarded as cats. Other interesting
misclassifications involving irrelevant categories are ’cat’ →’washer’ (25% of FPs using ViT),
’leopard cat’→’web site’ (2.27% of FPs using ViT, 15% of FPs using DeiT), ’persian cat’→’plastic
bag’ (25% of FPs using ViT), ’cat’→’jersey’ (25% of FPs using Swin), ’egyptian cat’→’table lamp’
(8.33% of FPs using Swin), ’cat’→’tub’ (33.33% of FPs using xRegNet), and others.</p>
        <p>An interesting observation revolves around the ’egyptian cat’ label. For CNN models, almost
all top-3 FP of ’egyptian cat’ GT label correspond to irrelevant ImageNet categories. On the
contrary, ’tabby cat’, ’angora cat’, and ’tiger cat’ present more sensible FPs, which usually
involve sibling categories (highlighted with blue). As for transformer models, we observe that
’egyptian cat’ label is always being confused with at least one irrelevant ImageNet category,
while ’angora cat’ is only confused with other cat species, and not with conceptually distant
classes. Thus, ’egyptian cat’ crawled images seem to contain some misleading visual features
that frequently derail the classification process. Indeed, when viewing ’egyptian cat’ crawled
images, some of them are drawings or photos of cat souvenirs; however, misconceptions such as
’table lamp’ or ’armadillo’ cannot be visually explained by human inspectors, unraveling more
questions on the topic. A comparison between CNN classifiers (Table 3) and transformer-based
classifiers (Table 4, 5) denotes that transformers are more capable of retrieving similar categories
to the GT; this becomes obvious by observing the higher number of irrelevant misclassifications
highlighted with red for CNNs, compared to transformer results.</p>
        <p>By combining Tables 3, 4 &amp; 5 with Tables 1&amp; 2, we obtain some very interesting findings: how
are low classification metric scores connected to the relevance between misclassified categories?
We start with categories presenting low accuracy scores (’tabby cat’, ’tiger cat’, ’egyptian cat’),
and we compare them with categories offering frequent extraneous misclassifications (’egyptian
cat’ and ’cat’, followed by ’tabby cat’ and ’lynx’). Classifying ’egyptian cat’ images both yields
low classification scores and returns irrelevant false positives. On the other hand, even though
’cat’ images present high accuracy scores, misclassifications are highly unrelated when they
happen. ’Tiger cat’ scores low in accuracy, however, misclassifications are rather justified, since
other cat species are returned. Surprisingly, ’tiger cat’ also scores low in sibling percentage,
indicating that false positives are not immediately related to the GT ’tiger cat’ class. In this
case, we assume that false positives (’egyptian cat’, ’tabby cat’, ’leopard cat’ etc) belong to more
distant relatives of the ’tiger cat’ concept class, even though bearing some similar features.</p>
        <p>Overall, throughout this analysis we prove that classification accuracy is unable to reveal
the whole truth behind the way classifiers behave; to this end, knowledge sources are able
to shed some light on the inner workings of this process. By analyzing a constrained family
of related ImageNet labels (cat species) we already disentangled the classification accuracy
from the classification relevance: false positives can be highly relevant to the ground truth
(such as ’tiger cat’ misclassifications) or not (’cat’ misclassifications). We, therefore, argue that
fine-grained classification also demands fine-grained evaluation, which can provide insightful
information when driven by knowledge. The human-interpretable insights of Tables 4, 5 are
going to be quantified and verified in the next Section.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Knowledge-driven metrics</title>
        <p>The aforementioned claim regarding the need for fine-grained evaluation is supported by
demonstrating results using knowledge-driven metrics based on conceptual distance as provided by
WordNet (Tables 6 &amp; 7). Since higher path similarity, LCH, and WUPS scores are better, we denote
in bold the best (highest) scores for each category.</p>
        <p>By comparing path similarity, LCH, and WUPS metrics across categories, we observe that
categories having a large number of irrelevant FP (marked in red in Tables 4, 5), such as ’cougar
cat’ and ’lynx cat’, followed by ’egyptian cat’ and ’cat’, also present low knowledge-driven
metric scores in Tables 6, 7, as expected. Other categories such as ’angora cat’, ’leopard cat’, and
’tiger cat’ that present misclassifications of related (sibling or parent) categories also present
higher knowledge-driven metric scores. Therefore, we can safely assume that
knowledge-driven metrics for evaluating fine-grained classification results correlate highly
with human-interpretable notions of similarity and are therefore trustworthy.</p>
        <p>Model performance is rather clear when examining CNN classifiers. EfficientNet achieves
more relevant FP predictions compared to other classifiers for the majority of the categories.
On the other hand, it is harder to draw a similar conclusion for Transformer-based classifiers, as
different models perform better for different categories; however, compared to CNN classifiers
the results of knowledge-driven metrics are the same or higher for most categories. Even though
this difference is not striking, transformer-based models showcase an improved capability of
predicting more relevant classes when failing to return the GT one.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we implemented a novel distribution shift involving uncurated web images,
upon which we tested convolutional and transformer-based image classifiers. Selecting closely
related categories for classification is guided by hierarchical knowledge, which is again
employed to evaluate the quality of results. We prove that accuracy-related metrics can only
scratch the surface of classification evaluation since they cannot capture semantic relationships
between misclassified samples and ground truth labels. To this end, we propose an explainable,
knowledge-driven evaluation scheme, able to quantify misclassification relevance by providing
the semantic distance between false positive and real labels. The same scheme is also used to
compare the classification capabilities of CNN vs transformer-based models on the implemented
distribution shift. As future work, we plan to extend our analysis to more query terms in order to
examine the extent of our current findings, and also combine the uncurated image classification
setting with artificial corruptions to enhance our insights.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research work was supported by the Hellenic Foundation for Research and Innovation
(HFRI) under the 3rd Call for HFRI PhD Fellowships (Fellowship Number 5537).</p>
    </sec>
    <sec id="sec-7">
      <title>A. More CNN misclassifications</title>
      <p>In Table 8, we present the continuation of the results presented in Table 3 for the rest of the CNN
models with non-zero accuracy. It becomes evident that the capacity of the classifier plays
an important role in identifying relevant FP: MobileNetV2, which already demonstrated low
accuracy scores, also fails to retrieve semantically related FP classes. This can be easily observed
from the numerous red entries corresponding to this model.</p>
      <p>Other than that, the results agree with the observations analyzed in Table 3, where ’egyptian
cat’ label demonstrated many irrelevant FP, contrary to ’tabby cat’ or ’tiger cat’ labels.</p>
      <p>[Table 8: for each GT class, the top-3 FP labels and their MF percentages.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          ,
          <source>in: 2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2009</year>
          .
          <volume>5206848</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          ,
          <source>in: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12</source>
          , Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2012</year>
          , p.
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A survey of convolutional neural networks: Analysis, applications, and prospects</article-title>
          , arXiv,
          <year>2020</year>
          . URL: https://arxiv.org/abs/
          <year>2004</year>
          .02806. doi:
          <volume>10</volume>
          . 48550/ARXIV.
          <year>2004</year>
          .
          <volume>02806</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention is all you need,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1706.03762. doi:
          <volume>10</volume>
          .48550/ARXIV.1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/ abs/
          <year>2010</year>
          .11929. doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>2010</year>
          .
          <volume>11929</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seyedhosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          , Coca:
          <article-title>Contrastive captioners are image-text foundation models</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2205.
          <year>01917</year>
          . doi:
          <volume>10</volume>
          .48550/ARXIV.2205.
          <year>01917</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          , G. Ilharco,
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Gadre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roelofs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gontijo-Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Morcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Namkoong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Carmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          , L. Schmidt,
          <article-title>Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time</article-title>
          , in: K. Chaudhuri,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jegelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szepesvari</surname>
          </string-name>
          , G. Niu, S. Sabato (Eds.),
          <source>Proceedings of the 39th International Conference on Machine Learning</source>
          , volume
          <volume>162</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>23965</fpage>
          -
          <lpage>23998</lpage>
          . URL: https://proceedings.mlr.press/v162/wortsman22a.html.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          , Swin transformer v2:
          <article-title>Scaling up capacity and resolution (</article-title>
          <year>2021</year>
          ). URL: https: //arxiv.org/abs/2111.09883. doi:
          <volume>10</volume>
          .48550/ARXIV.2111.09883.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          , Beit:
          <article-title>Bert pre-training of image transformers, 2021</article-title>
          . URL: https://arxiv.org/abs/2106.08254. doi:
          <volume>10</volume>
          .48550/ARXIV.2106.08254.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rosenfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kolter</surname>
          </string-name>
          ,
          <article-title>Certified adversarial robustness via randomized smoothing</article-title>
          , in: K. Chaudhuri, R. Salakhutdinov (Eds.),
          <source>Proceedings of the 36th International Conference on Machine Learning</source>
          , volume
          <volume>97</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1310</fpage>
          -
          <lpage>1320</lpage>
          . URL: https://proceedings.mlr.press/v97/cohen19c.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Geirhos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rubisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Michaelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bethge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Wichmann</surname>
          </string-name>
          , W. Brendel,
          <article-title>Imagenettrained cnns are biased towards texture; increasing shape bias improves accuracy and robustness, 2018</article-title>
          . URL: https://arxiv.org/abs/
          <year>1811</year>
          .12231. doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>1811</year>
          .
          <volume>12231</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heinze-Deml</surname>
          </string-name>
          ,
          <article-title>Invariance-inducing regularization using worstcase transformations sufices to boost accuracy and spatial robustness</article-title>
          , in: H.
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>d'Alché-</article-title>
          <string-name>
            <surname>Buc</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Fox</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>32</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2019</year>
          . URL: https: //proceedings.neurips.cc/paper/2019/file/1d01bd2e16f57892f0954902899f0692-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Puigcerver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ruyssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Riquelme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Djolonga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschannen</surname>
          </string-name>
          , M. Michalski,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>A large-scale study of representation learning with the visual task adaptation benchmark</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1910</year>
          .04867. doi:
          <volume>10</volume>
          . 48550/ARXIV.
          <year>1910</year>
          .
          <volume>04867</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          ,
          <article-title>Benchmarking neural network robustness to common corruptions</article-title>
          and perturbations,
          <year>2019</year>
          . arXiv:
          <year>1903</year>
          .12261.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Recht</surname>
          </string-name>
          , L. Schmidt,
          <article-title>Measuring robustness to natural distribution shifts in image classification</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/
          <year>2007</year>
          . 00644. doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>2007</year>
          .
          <volume>00644</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kadavath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dorundo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parajuli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Gilmer,</surname>
          </string-name>
          <article-title>The many faces of robustness: A critical analysis of out-of-distribution generalization</article-title>
          ,
          <source>2021 IEEE/CVF International Conference on Computer Vision</source>
          (ICCV) (
          <year>2020</year>
          )
          <fpage>8320</fpage>
          -
          <lpage>8329</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barbu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alverio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gutfreund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <article-title>Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models</article-title>
          , in: H.
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Beygelzimer</surname>
          </string-name>
          , F.
          <string-name>
            <surname>d'Alché- Buc</surname>
          </string-name>
          , E. Fox, R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>32</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2019</year>
          . URL: https://proceedings.neurips.cc/paper/2019/file/ 97af07a14cacba681feacf3012730892-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          , Natural adversarial examples,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1907</year>
          .07174. doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>1907</year>
          .
          <volume>07174</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>T. DeVries</surname>
          </string-name>
          , G. W. Taylor,
          <article-title>Improved regularization of convolutional neural networks with cutout</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1708.04552. doi:
          <volume>10</volume>
          .48550/ARXIV.1708.04552.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Leung</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Goodfellow</surname>
          </string-name>
          ,
          <article-title>Improving the robustness of deep neural networks via stability training</article-title>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1604.04326. doi:
          <volume>10</volume>
          .48550/ ARXIV.1604.04326.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , G. Nitschke,
          <article-title>Improving deep learning using generic data augmentation</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1708.06020. doi:
          <volume>10</volume>
          .48550/ARXIV.1708.06020.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.-A.</given-names>
            <surname>Rebufi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gowal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Calian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Stimberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Wiles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <article-title>Data augmentation can improve robustness</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2111.05328. doi:
          <volume>10</volume>
          .48550/ARXIV. 2111.05328.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <string-name>
            <surname>Wordnet:</surname>
          </string-name>
          <article-title>An electronic lexical database (</article-title>
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>CoRR abs/1409</source>
          .1556 (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iofe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wojna</surname>
          </string-name>
          ,
          <article-title>Rethinking the inception architecture for computer vision</article-title>
          , in:
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2818</fpage>
          -
          <lpage>2826</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2016</year>
          .
          <volume>308</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2016</year>
          .
          <volume>90</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          , Xception:
          <article-title>Deep learning with depthwise separable convolutions</article-title>
          ,
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2016</year>
          )
          <fpage>1800</fpage>
          -
          <lpage>1807</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iofe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Alemi</surname>
          </string-name>
          ,
          <article-title>Inception-v4, inception-resnet and the impact of residual connections on learning</article-title>
          ,
          <source>in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence</source>
          , AAAI'
          <fpage>17</fpage>
          , AAAI Press,
          <year>2017</year>
          , p.
          <fpage>4278</fpage>
          -
          <lpage>4284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>A convnet for the 2020s</article-title>
          ,
          <source>CoRR abs/2201</source>
          .03545 (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2201.03545. arXiv:
          <volume>2201</volume>
          .
          <fpage>03545</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naseer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hayat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Transformers in vision: A survey</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>54</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>41</lpage>
          . URL: https://doi.org/10.1145%2F3505244. doi:
          <volume>10</volume>
          .1145/3505244.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <article-title>Training dataeficient image transformers &amp; distillation through attention</article-title>
          , in: M.
          <string-name>
            <surname>Meila</surname>
          </string-name>
          , T. Zhang (Eds.),
          <source>Proceedings of the 38th International Conference on Machine Learning</source>
          , volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10347</fpage>
          -
          <lpage>10357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          , Swin transformer v2:
          <article-title>Scaling up capacity and resolution</article-title>
          ,
          <source>in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>11999</fpage>
          -
          <lpage>12009</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR52688.
          <year>2022</year>
          .
          <volume>01170</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>I. Radosavovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          , Designing network design spaces,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S.</given-names>
            <surname>Paul</surname>
          </string-name>
          , P.-Y. Chen,
          <article-title>Vision transformers are robust learners</article-title>
          ,
          <source>in: AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>B.</given-names>
            <surname>Recht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roelofs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <article-title>Do imagenet classifiers generalize to imagenet?</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>R.</given-names>
            <surname>Geirhos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R. M.</given-names>
            <surname>Temme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Schütt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bethge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Wichmann</surname>
          </string-name>
          ,
          <article-title>Generalisation in humans and deep neural networks</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/
          <year>1808</year>
          .08750. doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>1808</year>
          .
          <volume>08750</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Laugros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caplier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ospici</surname>
          </string-name>
          ,
          <article-title>Using synthetic corruptions to measure robustness to natural distribution shifts</article-title>
          ,
          <source>ArXiv abs/2107</source>
          .12052 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          , Natural adversarial examples,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1907</year>
          .07174. doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>1907</year>
          .
          <volume>07174</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>B.</given-names>
            <surname>Biggio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Corona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maiorca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nelson</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <article-title>Š rndić</article-title>
          , P. Laskov,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giacinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Roli</surname>
          </string-name>
          ,
          <article-title>Evasion attacks against machine learning at test time</article-title>
          ,
          <source>in: Advanced Information Systems Engineering</source>
          , Springer Berlin Heidelberg,
          <year>2013</year>
          , pp.
          <fpage>387</fpage>
          -
          <lpage>402</lpage>
          . URL: https://doi.org/10.1007%
          <fpage>2F978</fpage>
          -
          <fpage>3</fpage>
          -
          <fpage>642</fpage>
          -40994-3_
          <fpage>25</fpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>642</fpage>
          -40994-3_
          <fpage>25</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.-A.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Benchmarking adversarial robustness on image classification</article-title>
          ,
          <source>in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>318</fpage>
          -
          <lpage>328</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR42600.
          <year>2020</year>
          .
          <volume>00040</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Engstrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madry</surname>
          </string-name>
          , Robustness may be at odds with accuracy,
          <year>2018</year>
          . URL: https://arxiv.org/abs/
          <year>1805</year>
          .12152. doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>1805</year>
          .
          <volume>12152</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>U.</given-names>
            <surname>Ozbulak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Anzaku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. D.</given-names>
            <surname>Neve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Messem</surname>
          </string-name>
          ,
          <article-title>Selection of source images heavily influences the efectiveness of adversarial attacks</article-title>
          ,
          <source>ArXiv abs/2106</source>
          .07141 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mianjy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <article-title>Adversarial robustness is at odds with lazy training</article-title>
          ,
          <source>ArXiv abs/2207</source>
          .00411 (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H. S.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Dokania</surname>
          </string-name>
          ,
          <article-title>An impartial take to the cnn vs transformer robustness contest</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Can cnns be more robust than transformers?</article-title>
          ,
          <source>ArXiv abs/2206</source>
          .03452 (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhojanapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Glasner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veit</surname>
          </string-name>
          ,
          <article-title>Understanding robustness of transformers for image classification</article-title>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>W.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gould</surname>
          </string-name>
          , L. Zheng,
          <article-title>On the strong correlation between model invariance and generalization</article-title>
          ,
          <source>ArXiv abs/2207</source>
          .07065 (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>E.</given-names>
            <surname>Mintun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>On interaction between augmentations and corruptions in natural corruption robustness</article-title>
          , in: M.
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Dauphin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          <string-name>
            <surname>Vaughan</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>34</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2021</year>
          , pp.
          <fpage>3571</fpage>
          -
          <lpage>3583</lpage>
          . URL: https://proceedings.neurips.cc/paper/2021/file/ 1d49780520898fe37f0cd6b41c5311bf-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhmoginov</surname>
          </string-name>
          , L.-C.
          <article-title>Chen, Mobilenetv2: Inverted residuals</article-title>
          and linear bottlenecks,
          <year>2018</year>
          , pp.
          <fpage>4510</fpage>
          -
          <lpage>4520</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2018</year>
          .
          <volume>00474</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Learning transferable architectures for scalable image recognition</article-title>
          ,
          <year>2018</year>
          , pp.
          <fpage>8697</fpage>
          -
          <lpage>8710</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2018</year>
          .
          <volume>00907</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , L. van der Maaten, K. Weinberger, Densely connected convolutional networks,
          <year>2017</year>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2017</year>
          .
          <volume>243</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , Eficientnet:
          <article-title>Rethinking model scaling for convolutional neural networks</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>