<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">1st Place Solution for FungiCLEF 2022 Competition: Fine-grained Open-set Fungi Recognition</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Zihua</forename><surname>Xiong</surname></persName>
							<email>xiongzihua.xzh@alibaba-inc.com</email>
							<affiliation key="aff0">
								<orgName type="department">Evaluation Forum</orgName>
								<address>
									<addrLine>September 5-8</addrLine>
									<postCode>2022</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yumeng</forename><surname>Ruan</surname></persName>
							<email>ruanyumeng.rym@alibaba-inc.com</email>
							<affiliation key="aff0">
								<orgName type="department">Evaluation Forum</orgName>
								<address>
									<addrLine>September 5-8</addrLine>
									<postCode>2022</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yifei</forename><surname>Hu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Evaluation Forum</orgName>
								<address>
									<addrLine>September 5-8</addrLine>
									<postCode>2022</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yue</forename><surname>Zhang</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Evaluation Forum</orgName>
								<address>
									<addrLine>September 5-8</addrLine>
									<postCode>2022</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yuke</forename><surname>Zhu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Evaluation Forum</orgName>
								<address>
									<addrLine>September 5-8</addrLine>
									<postCode>2022</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sheng</forename><surname>Guo</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Evaluation Forum</orgName>
								<address>
									<addrLine>September 5-8</addrLine>
									<postCode>2022</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bing</forename><surname>Han</surname></persName>
							<affiliation>
								<orgName type="institution">MYbank, Ant Group</orgName>
								<address>
									<country>China</country>
								</address>
							</affiliation>
							<affiliation key="aff0">
								<orgName type="department">Evaluation Forum</orgName>
								<address>
									<addrLine>September 5-8</addrLine>
									<postCode>2022</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">1st Place Solution for FungiCLEF 2022 Competition: Fine-grained Open-set Fungi Recognition</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">56AA3A7C37579CC3AF6CC354932118D5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>FungiCLEF</term>
					<term>fungi recognition</term>
					<term>long tail</term>
					<term>open-set</term>
					<term>fine-grained</term>
					<term>classification</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we describe our method for fine-grained fungi recognition at FungiCLEF 2022. The task is to recognize fungi belonging to 1,604 known species as well as many unknown species, which makes it a fine-grained, open-set machine learning problem. To build a strong closed-set classifier, we take MetaFormer [1] and ConvNext [2] as our baselines, then apply hyper-parameter tuning and several modern training techniques to improve them. To deal with the long-tailed class distribution, we adopt the Seesaw Loss [3] to balance the training process between head classes and tail classes. Furthermore, to avoid tail categories being misclassified as open-set categories, we design a post process to alleviate the confusion. As is common practice, test-time augmentation and model ensembling are used. With all these techniques together, our method achieves a mean F1 score of 83.78% on the public leaderboard and 80.43% on the private leaderboard, which is 1st place among the participants. The code will be made available at https://github.com/guoshengcv/fgvc9_fungiclef.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>FungiCLEF 2022 <ref type="bibr" target="#b3">[4]</ref> is a competition held jointly by the CLEF 2022 conference <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref> and the FGVC9 workshop at CVPR 2022. The competition releases training data based on Danish Fungi 2020 <ref type="bibr" target="#b6">[7]</ref>, a dataset aimed at fine-grained fungi recognition. It includes both images and meta-information such as habitat, substrate, time, longitude and latitude, and contains 295,938 samples belonging to 1,604 species. The competition also releases test data containing 59,420 observations with 118,676 images and 3,134 species; it includes the same kind of meta-information as the training data but lacks some attributes, such as longitude and latitude. After analyzing the data, we found that the category distribution of the training set is long-tailed. As a result, in this work we treat the competition as a fine-grained, long-tailed, open-set classification task.</p><p>Unlike common classification tasks, which distinguish objects with large inter-class variations, Fine-Grained Visual Classification (FGVC) aims at capturing the subtle differences between similar categories, such as bird species or car types. FGVC is acknowledged to be a challenging task because of small inter-class variations and large intra-class variations.</p><p>Many FGVC methods focus on modeling discriminative regions, such as part-based models <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref> and attention-based models <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>. 
Recently, inspired by the fact that human experts use meta-information to distinguish visually similar species, many works <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b14">15]</ref> utilize additional information to enhance fine-grained classification performance. Among them, MetaFormer <ref type="bibr" target="#b0">[1]</ref> is a recently proposed state-of-the-art method; it is a hybrid framework that uses both convolution and transformer layers. In this work, we take MetaFormer as a strong baseline and improve it progressively.</p><p>Recently, transformers have been leading research in the field of computer vision. Starting from Vision Transformer <ref type="bibr" target="#b15">[16]</ref>, many transformer backbones have achieved SoTA performance in a wide range of vision tasks, such as Swin Transformer <ref type="bibr" target="#b16">[17]</ref> and CSwin Transformer <ref type="bibr" target="#b17">[18]</ref>. On the other hand, ConvNext <ref type="bibr" target="#b1">[2]</ref> is a pure convolutional backbone; by applying modern training techniques together with macro and micro design changes to the network architecture, it achieves results comparable with transformers. In this work, to obtain models that differ substantially from each other and thereby enhance the performance of the model ensemble, we take ConvNext as another baseline backbone.</p><p>In real-world scenarios, the distribution of categories is often long-tailed. It is well known that the major classes dominate the training process and suppress the performance of the tail classes. Many works design loss functions to deal with long-tailed classification, such as Adaptive Class Suppression Loss <ref type="bibr" target="#b18">[19]</ref>, Equalization Loss <ref type="bibr" target="#b19">[20]</ref> and Seesaw Loss <ref type="bibr" target="#b2">[3]</ref>. 
In our method, we utilize Seesaw Loss to dynamically balance the training process between head classes and tail classes.</p><p>Practical applications usually face the open-set recognition challenge: the classifier should not only recognize the classes seen during training, but also detect whether an instance comes from an unknown class. Motivated by <ref type="bibr" target="#b20">[21]</ref>, in this work we train the model on the known classes to obtain a good closed-set classifier, and determine whether an instance belongs to the open set based on the maximum value of its logit score vector. Furthermore, to avoid tail categories being misclassified as open-set categories, we design a post process to alleviate the confusion.</p><p>Our main contributions in the FungiCLEF 2022 competition can be summarized as follows:</p><p>• We take MetaFormer and ConvNext as our strong baselines, then apply hyper-parameter tuning and several modern training techniques to improve their performance on the fungi dataset. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Approach</head><p>Motivated by <ref type="bibr" target="#b20">[21]</ref>, we divide the fine-grained, open-set recognition problem into two parts. First, we aim to improve the closed-set recognition performance, including network architecture, </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Overview of the Approach</head><p>As shown in Figure <ref type="figure" target="#fig_0">1</ref>, we take MetaFormer <ref type="bibr" target="#b0">[1]</ref> and ConvNext <ref type="bibr" target="#b1">[2]</ref> as our initial baselines.</p><p>MetaFormer is a hybrid framework that combines convolution and vision transformer layers; it also proposes a simple and effective way to add meta-information through the transformer layers.</p><p>In our approach, we use MetaFormer directly and modify the meta-information input. To encode temporal information, we perform the mapping</p><formula xml:id="formula_0">[\mathit{month}, \mathit{day}] \rightarrow \left[\sin\left(\frac{2\pi\,\mathit{month}}{12}\right), \cos\left(\frac{2\pi\,\mathit{month}}{12}\right), \sin\left(\frac{2\pi\,\mathit{day}}{31}\right), \cos\left(\frac{2\pi\,\mathit{day}}{31}\right)\right]</formula><p>We use one-hot encoding for categorical meta-information such as countryCode, Substrate and Habitat. To enhance model diversity for the later model ensemble, we use ConvNext as another network architecture. We apply hyper-parameter tuning to improve the performance of both architectures, and present ablation studies in Sec 3.2 to show the progressive improvement.</p></div>
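<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the meta-information encoding described above, the following Python snippet maps a month and day to the four cyclic features and one-hot encodes a categorical attribute. The function names and the habitat vocabulary are hypothetical and serve only as illustration; they are not taken from the released code.</p><p>
```python
import math


def encode_date(month, day):
    """Map (month, day) to four cyclic features, following the mapping in the text."""
    month_angle = 2.0 * math.pi * month / 12.0
    day_angle = 2.0 * math.pi * day / 31.0
    return [
        math.sin(month_angle),
        math.cos(month_angle),
        math.sin(day_angle),
        math.cos(day_angle),
    ]


def one_hot(value, vocabulary):
    """One-hot encode a categorical value against a fixed vocabulary list."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


# Hypothetical vocabulary, for illustration only.
HABITATS = ["forest", "meadow", "bog"]

# Concatenate temporal and categorical features into one meta vector.
features = encode_date(12, 31) + one_hot("meadow", HABITATS)
```
</p><p>The cyclic form places December next to January on the unit circle, which a raw month index would not do.</p></div>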
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Loss for Long Tail Classification</head><p>It is known that instances from head categories dominate the training process, and this biased learning leads to misclassification of tail categories. In this work, we borrow the idea of Seesaw Loss <ref type="bibr" target="#b2">[3]</ref> to alleviate this problem. During training, Seesaw Loss dynamically balances the positive and negative gradients for each category with a dynamic factor. It reformulates the cross entropy loss as</p><formula xml:id="formula_1">L_{seesaw}(z) = -\sum_{i=1}^{C} y_i \log(\hat{p}_i), \quad \text{with} \quad \hat{p}_i = \frac{e^{z_i}}{\sum_{j \neq i}^{C} S_{ij} e^{z_j} + e^{z_i}}<label>(1)</label></formula><p>where 𝑦 is the category label, usually represented as a one-hot vector, 𝑧 is the output of the model, and 𝑝̂ is the probability calculated by Softmax(𝑧) modified with the dynamic factor 𝑆. For more details, please refer to Seesaw Loss <ref type="bibr" target="#b2">[3]</ref>.</p></div>
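<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make Eq. (1) concrete, here is a small sketch that evaluates the loss for a single sample with a one-hot label, assuming the factor matrix S is supplied by the caller. How S is computed during training is described in the Seesaw Loss paper and is not reproduced here; the function name and the toy values are illustrative assumptions.</p><p>
```python
import math


def seesaw_loss(logits, target, S):
    """Cross entropy with the re-weighted softmax of Eq. (1) for one sample.

    logits: list of C floats (z).
    target: index i of the true class (one-hot label).
    S: C x C nested list; S[i][j] scales the negative term of class j.
    """
    z_max = max(logits)  # subtract the max for numerical stability
    exp_z = [math.exp(z - z_max) for z in logits]
    denom = exp_z[target] + sum(
        S[target][j] * exp_z[j] for j in range(len(logits)) if j != target
    )
    return -math.log(exp_z[target] / denom)


logits = [2.0, 1.0, 0.5]

# All factors equal to 1: identical to ordinary softmax cross entropy.
ones = [[1.0] * 3 for _ in range(3)]
plain_ce = seesaw_loss(logits, 0, ones)

# Toy factors: damp the negative term of class 1 for samples of class 0.
damped = [[1.0, 0.2, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
reduced = seesaw_loss(logits, 0, damped)
```
</p><p>With all factors equal to 1 the expression reduces to the ordinary softmax cross entropy; a factor below 1 shrinks the corresponding negative term and therefore the loss.</p></div>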
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Post Process</head><p>The post process is designed based on several observations and is applied to the final ensemble results. To make the process as clear as possible, we write it in Python-style pseudocode; see Algorithm 1. We detail it in the following.</p><p>Threshold for selecting open-set samples. It is acknowledged that the predicted confidence scores of open-set samples are relatively low. This phenomenon can be used as the criterion for recognizing them. Specifically, we draw the logit (the direct output of the model) frequency distribution of both the validation set and the test set, shown in Figure <ref type="figure" target="#fig_1">2</ref>. As open-set samples are only contained in the test set, we can compare the low-confidence regions of the two distributions to approximately obtain the logit threshold for open-set samples. For example, we can set the threshold to 5 as an approximation for Figure 2.</p><p>Alleviate the influence of microscopy images. In the test set, we find that one test sample may contain several images, as shown in Figure <ref type="figure" target="#fig_3">3</ref>. In such cases, we average the model outputs over the images to obtain the confidence and use argmax to obtain the predicted category. During this process, we also find images that show a huge visual discrepancy from the majority. Specifically, some test samples contain microscopy images, such as sample_c shown in Figure <ref type="figure" target="#fig_3">3</ref>. These microscopy images tend to produce low confidence because there is little training data for them, which affects the naive averaging strategy. To cope with this problem, we carefully design the post process. 
As shown in Algorithm 1, from line 29 to line 35, if the maximum logit of the averaged outputs is lower than a certain threshold, we look at the maximum logit over all images of the test sample; if it is greater than a certain threshold, we take the corresponding category as the prediction. In this case, we assume that the low averaged output may be caused by the test sample containing too many microscopy images, that a high-confidence prediction for one image of the test sample is sufficient to infer the category, and that the test sample should not be considered an open-set category.</p><p>Distinguish tail categories and open-set categories. We call the tail categories that are never predicted by the model hard tail categories, and we argue that many hard tail categories are misclassified as open-set categories. To deal with this problem, we design the post process shown in Algorithm 1. From line 18 to line 27, to avoid the misclassification, we mine hard tail categories from the top-3 predictions with a low filtering threshold.</p><p>Algorithm 1 Pseudocode of Post Process in a python-like style. </p></div>
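<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following is not a reproduction of Algorithm 1; it is a hedged sketch of two steps described in the text above: averaging the logits of all images in one observation, and falling back to the single most confident image when the averaged confidence is low. The open-set threshold of 5 follows the example in the text, while the second threshold (8), the function names and the use of -1 for unknown are assumptions made for illustration. The hard-tail mining step is omitted.</p><p>
```python
def average_logits(image_logits):
    """Average the per-image logit vectors of one observation, class by class."""
    count = len(image_logits)
    return [sum(column) / count for column in zip(*image_logits)]


def predict_observation(image_logits, open_thr=5.0, rescue_thr=8.0, unknown=-1):
    """Return a class index, or `unknown` for a presumed open-set observation."""
    mean = average_logits(image_logits)
    best = max(range(len(mean)), key=mean.__getitem__)
    if mean[best] >= open_thr:
        return best
    # Low averaged confidence: look for one confident image before giving up.
    top_value, top_class = max(
        (value, cls) for logits in image_logits for cls, value in enumerate(logits)
    )
    if top_value >= rescue_thr:
        return top_class
    return unknown


# Toy observation: one natural photo (confident) and two microscopy images.
sample_c = [[9.0, 1.0, 0.0], [1.0, 1.5, 2.0], [0.5, 2.0, 1.0]]
prediction = predict_observation(sample_c)
```
</p><p>In the toy observation the averaged maximum logit is 3.5, below the open-set threshold, yet the natural photo alone is confident enough for class 0 to be kept instead of marking the observation as unknown.</p></div>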
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>In this section, we first describe the implementation and training details. We then present ablation studies on loss functions and a bag of training settings, and list some other attempts and their results. Finally, we study different test-time augmentations and show the effectiveness of the post process for recognizing tail categories and open-set categories.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Implementation Details</head><p>We train the models on the Danish Fungi 2020 dataset <ref type="bibr" target="#b6">[7]</ref>, which contains 295,938 training images belonging to 1,604 species observed mostly in Denmark. The dataset is divided into a train set and a validation set; we use both for training in most settings. We report results on the test set, which contains 59,420 observations with 118,676 images and 3,134 species. The test set is divided into two parts: the public set contains 20% of the data and the private set contains 80%. Because the performance of open-set recognition affects the mean F1 score, to make a relatively fair comparison in the ablation studies, and based on the observation described in Sec 2.3, we use a threshold to select ∼1000 samples with low confidence scores as open-set samples in most experiments. We conduct all experiments on Tesla V100 (32G) GPUs. We use the AdamW optimizer with a cosine learning rate schedule, initialize the learning rate to 5e-5 and scale it by batch size, and follow most of the augmentation and regularization strategies of <ref type="bibr" target="#b16">[17]</ref> in training.</p></div>
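<div xmlns="http://www.tei-c.org/ns/1.0"><p>The learning-rate recipe above (a base rate of 5e-5 scaled by batch size, followed by cosine decay) might be sketched as follows. The reference batch size of 64, the inclusion of accumulate steps in the effective batch size, and the function names are assumptions for illustration; the text does not state them.</p><p>
```python
import math


def scaled_lr(base_lr, batch_size, accumulate_steps=1, reference_batch=64):
    """Linear scaling rule: the rate grows with the effective batch size.

    reference_batch is a hypothetical value at which base_lr applies unchanged.
    """
    effective_batch = batch_size * accumulate_steps
    return base_lr * effective_batch / reference_batch


def cosine_lr(peak_lr, step, total_steps, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: batch size 32 with 4 accumulate steps doubles the base rate.
peak = scaled_lr(5e-5, 32, accumulate_steps=4)
halfway = cosine_lr(peak, step=500, total_steps=1000)
```
</p><p>Halfway through training the cosine schedule has decayed the rate to half of its peak value.</p></div>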
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Ablation Studies</head><p>As shown in Table <ref type="table" target="#tab_2">1</ref>, we train MetaFormer-0 for 100 epochs with Soft Target Cross Entropy loss and mixup <ref type="bibr" target="#b21">[22]</ref> augmentation to build our baseline. For the ablation studies, it should be noted that besides the parameter being compared, a few other parameters are not fully consistent, such as the accumulate steps in the last row of Table <ref type="table" target="#tab_6">5</ref>; we argue that this does not largely affect the conclusions.</p><p>Losses. As shown in Table <ref type="table" target="#tab_3">2</ref>, we compare different losses combined with several common augmentation techniques. Specifically, we compare cross entropy loss with either mixup or label smoothing <ref type="bibr" target="#b22">[23]</ref>, and Seesaw Loss. All of them are intended to alleviate the long-tail problem in training. We find that label smoothing converges faster than mixup in our experiments. The best performance is achieved when Seesaw Loss is adopted.</p><p>Batch size. Table <ref type="table" target="#tab_4">3</ref> shows that a larger batch size improves performance. In detail, by increasing the batch size from 32 to 64, we improve the mean F1 score from 76.90% to 77.49% on the public set and from 72.48% to 74.27% on the private set. A similar technique is to increase the accumulate steps. As shown in Table <ref type="table" target="#tab_5">4</ref>, enlarging the accumulate steps consistently improves the mean F1 score on the private test set for both MetaFormer-0 and MetaFormer-1.</p><p>Training epochs. We find that training for more epochs does not necessarily improve performance. As shown in Table <ref type="table" target="#tab_6">5</ref>, for both MetaFormer-0 and MetaFormer-2 a proper number of epochs is essential for the best result. Following this line, we do not train models for very many epochs, such as 100 epochs and above.</p><p>Image size. 
Usually, training with a larger image size improves overall performance, especially for fine-grained tasks. We use 384 as the baseline image size and try several other larger settings. As shown in Table <ref type="table" target="#tab_7">6</ref>, the larger image size of 448 does not bring consistent improvements: the public score drops while the private score rises slightly. We attribute this to the training schedule being coupled with the image size, which we do not investigate further. Finally, we adopt an image size of 384 in all settings; the performance with this setting is satisfactory.</p><p>Pretrain dataset. We transfer MetaFormer-2 pretrained on different datasets, such as herbarium, imagenet22k and inaturalist21. The results are shown in Table <ref type="table" target="#tab_8">7</ref>. In practice, we do not simply choose the best-performing pretrained model. Instead, we use ensembling to combine them. We find that the ensemble yields a consistent improvement over a single model, even when the best-performing single model is combined with slightly poorer-performing models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Other Attempts</head><p>ConvNext. In addition to MetaFormer, we also train ConvNext. The results are listed in Table <ref type="table" target="#tab_9">8</ref>.</p><p>The experiments with ConvNext are not as fully explored as those with MetaFormer. Although the results of ConvNext are inferior to MetaFormer, we still add it to the model ensemble, which improves the ensemble performance.</p><p>Pseudo label. After training models with various settings, we use a model ensemble to obtain the current best model, and take its predictions on the test samples as their labels. We select the top ∼50% of test samples by confidence score. We train MetaFormer-2 and ConvNext-large on train+val+pseudo; the results are listed in Table <ref type="table" target="#tab_10">9</ref>.</p><p>For open-set recognition, we select the samples with lower confidence scores as open-set samples. As shown in Table <ref type="table" target="#tab_13">12</ref>, by increasing the number of open-set samples from ∼1000 to ∼1500, we improve the mean F1 score from 83.26% to 83.50% on the public test set and from 79.38% to 79.60% on the private test set. This demonstrates the benefit of selecting a proper open-set threshold. Comparing average ensemble (v3) with average ensemble (v2), v3 improves the ensemble performance; the only difference between them is that v3 contains the models trained with pseudo labels. It is acknowledged that tail categories tend to have lower confidence scores than head categories, so they are more easily misclassified as head categories or wrongly identified as open-set categories. As shown in Table <ref type="table" target="#tab_13">12</ref>, with our post process for tail categories applied to average ensemble (v3), termed average ensemble (v4), the mean F1 score improves considerably on both the public and the private test sets.</p></div>
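<div xmlns="http://www.tei-c.org/ns/1.0"><p>Both selections described above rank test samples by confidence: the most confident fraction receives pseudo labels, and the k least confident samples are treated as open-set. A minimal sketch under that reading follows; the function name, the toy scores and the default values are hypothetical, and in the experiments k is on the order of 1000 to 1500 rather than 2.</p><p>
```python
def split_by_confidence(confidences, pseudo_fraction=0.5, open_set_k=2):
    """Return (pseudo_label_ids, open_set_ids) from per-sample confidence scores."""
    # Sample indices sorted from least to most confident.
    order = sorted(range(len(confidences)), key=confidences.__getitem__)
    open_set_ids = order[:open_set_k]  # the k least confident samples
    n_pseudo = int(len(confidences) * pseudo_fraction)
    pseudo_ids = order[len(order) - n_pseudo:]  # the most confident fraction
    return sorted(pseudo_ids), sorted(open_set_ids)


# Toy confidence scores (maximum logit per test sample).
scores = [9.1, 2.3, 7.4, 4.0, 8.8, 1.2]
pseudo_ids, open_set_ids = split_by_confidence(scores)
```
</p><p>Changing open_set_k corresponds to moving the open-set threshold, which is the adjustment from ∼1000 to ∼1500 samples reported in the text.</p></div>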
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Test Time Augmentation and Post Process</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>In this paper, we introduce our solution for the FungiCLEF 2022 competition. To solve this challenging fine-grained, open-set problem, we try a range of techniques, including different network baselines, hyper-parameter tuning, modern training techniques, a loss for long-tailed recognition and a specially designed post process. With these efforts, we achieved 1st place among the participants. The experimental results show the progressive improvement of the single model, and the effectiveness of test-time augmentation and of the post process for tail categories. For future work, it would be valuable to study methods that fuse meta-information and visual information for Fine-Grained Visual Classification; the problem of distinguishing between tail categories and open-set categories is also worth exploring.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overview of our approach. We train MetaFormer and ConvNext with various settings; at test time, model ensemble and post process are used.</figDesc><graphic coords="3,89.29,84.19,417.48,232.93" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 .</head><label>2</label><figDesc>On this basis, we draw the rough conclusion that the test set contains approximately 1000 ∼ 2000 open-set samples. It should be noted that this rough conclusion may be wrong, since we have no information about how many open-set samples the test set actually contains. Despite this, in the rest of the experiments we treat the samples with the top-k lowest confidence as the open set, where the value of 𝑘 is set based on this rough conclusion. As shown from line 7 to line 9 in Algorithm 1, we adjust the threshold to obtain the open-set samples. For different experiment settings and models, it is hard and unnecessary to have exactly the same number of open-set samples; experimentally, we first set 𝑘 to ∼1000 and finally adjust it to ∼1500 based on the public test set performance.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Logit score frequency distribution of val and test. We draw the logit score frequency distribution on both the validation and test sets, using the output logits of a single model.</figDesc><graphic coords="5,120.07,283.53,352.37,194.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Selected test samples. We find that many test samples contain several images; we predict the category of a test sample by averaging the model outputs over its images. We also notice that some test samples, such as sample_c, contain one image taken in the natural environment while the other images come from a microscope view, which disturbs the averaged result.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc></figDesc><table><row><cell>• We find that the category distribution of the fungi dataset is long-tailed, so we adopt the Seesaw Loss to balance the training process between head classes and tail classes, which improves the baseline model performance.</cell></row><row><cell>• To avoid tail categories being misclassified as open-set categories, we design a post process to alleviate the confusion.</cell></row><row><cell>• Detailed ablation experiments have been carried out. With the techniques above, we achieve superior performance.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1</head><label>1</label><figDesc>MetaFormer-0 baseline.</figDesc><table><row><cell>loss</cell><cell>batch size</cell><cell>accumulate steps</cell><cell>epochs</cell><cell>mixup</cell><cell>train+val</cell><cell>public mean F1</cell><cell>private mean F1</cell></row><row><cell>Soft Target CE</cell><cell>32</cell><cell>1</cell><cell>100</cell><cell>yes</cell><cell>no</cell><cell>78.76%</cell><cell>74.26%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Mean F1 score on public/private test set with different losses and MetaFormer-0 as backbone.</figDesc><table><row><cell>loss</cell><cell>batch size</cell><cell>accumulate steps</cell><cell>epochs</cell><cell>mixup</cell><cell>train+val</cell><cell>public mean F1</cell><cell>private mean F1</cell></row><row><cell>Soft Target CE</cell><cell>32</cell><cell>1</cell><cell>32</cell><cell>yes</cell><cell>no</cell><cell>71.49%</cell><cell>67.60%</cell></row><row><cell>Label Smoothing CE</cell><cell>32</cell><cell>1</cell><cell>32</cell><cell>no</cell><cell>no</cell><cell>76.90%</cell><cell>72.48%</cell></row><row><cell>Label Smoothing CE</cell><cell>64</cell><cell>3</cell><cell>32</cell><cell>no</cell><cell>yes</cell><cell>79.45%</cell><cell>75.67%</cell></row><row><cell>Seesaw Loss</cell><cell>64</cell><cell>3</cell><cell>32</cell><cell>no</cell><cell>yes</cell><cell>79.79%</cell><cell>76.15%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc>Mean F1 score on public/private test set with different batch size and MetaFormer-0 as backbone.</figDesc><table><row><cell>loss</cell><cell>batch size</cell><cell>accumulate steps</cell><cell>epochs</cell><cell>mixup</cell><cell>train+val</cell><cell>public mean F1</cell><cell>private mean F1</cell></row><row><cell>Label Smoothing CE</cell><cell>32</cell><cell>1</cell><cell>32</cell><cell>no</cell><cell>no</cell><cell>76.90%</cell><cell>72.48%</cell></row><row><cell>Label Smoothing CE</cell><cell>64</cell><cell>1</cell><cell>32</cell><cell>no</cell><cell>no</cell><cell>77.49%</cell><cell>74.27%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4</head><label>4</label><figDesc>Mean F1 score on public/private test set with different accumulate steps.</figDesc><table><row><cell>loss</cell><cell>batch size</cell><cell>accumulate steps</cell><cell>epochs</cell><cell>backbone</cell><cell>train+val</cell><cell>public mean F1</cell><cell>private mean F1</cell></row><row><cell>Seesaw Loss</cell><cell>64</cell><cell>3</cell><cell>32</cell><cell>MetaFormer-0</cell><cell>yes</cell><cell>79.79%</cell><cell>76.15%</cell></row><row><cell>Seesaw Loss</cell><cell>64</cell><cell>6</cell><cell>32</cell><cell>MetaFormer-0</cell><cell>yes</cell><cell>80.22%</cell><cell>76.90%</cell></row><row><cell>Seesaw Loss</cell><cell>32</cell><cell>3</cell><cell>64</cell><cell>MetaFormer-1</cell><cell>yes</cell><cell>81.67%</cell><cell>77.62%</cell></row><row><cell>Seesaw Loss</cell><cell>32</cell><cell>6</cell><cell>64</cell><cell>MetaFormer-1</cell><cell>yes</cell><cell>81.66%</cell><cell>77.94%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5</head><label>5</label><figDesc>Mean F1 score on public/private test set with different training epochs.</figDesc><table><row><cell>loss</cell><cell>batch size</cell><cell>accumulate steps</cell><cell>epochs</cell><cell>backbone</cell><cell>train+val</cell><cell>public mean F1</cell><cell>private mean F1</cell></row><row><cell>Seesaw Loss</cell><cell>64</cell><cell>3</cell><cell>32</cell><cell>MetaFormer-0</cell><cell>yes</cell><cell>79.79%</cell><cell>76.15%</cell></row><row><cell>Seesaw Loss</cell><cell>64</cell><cell>3</cell><cell>64</cell><cell>MetaFormer-0</cell><cell>yes</cell><cell>80.46%</cell><cell>77.01%</cell></row><row><cell>Seesaw Loss</cell><cell>64</cell><cell>3</cell><cell>100</cell><cell>MetaFormer-0</cell><cell>yes</cell><cell>80.18%</cell><cell>76.77%</cell></row><row><cell>Seesaw Loss</cell><cell>24</cell><cell>4</cell><cell>32</cell><cell>MetaFormer-2</cell><cell>yes</cell><cell>81.18%</cell><cell>77.56%</cell></row><row><cell>Seesaw Loss</cell><cell>24</cell><cell>4</cell><cell>48</cell><cell>MetaFormer-2</cell><cell>yes</cell><cell>82.04%</cell><cell>77.92%</cell></row><row><cell>Seesaw Loss</cell><cell>24</cell><cell>6</cell><cell>64</cell><cell>MetaFormer-2</cell><cell>yes</cell><cell>80.45%</cell><cell>77.63%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 6</head><label>6</label><figDesc>Mean F1 score on public/private test set with different image size and MetaFormer-1 as backbone.</figDesc><table><row><cell>image size</cell><cell>batch size</cell><cell>accumulate steps</cell><cell>epochs</cell><cell>public mean F1</cell><cell>private mean F1</cell></row><row><cell>384</cell><cell>32</cell><cell>6</cell><cell>64</cell><cell>81.76%</cell><cell>78.25%</cell></row><row><cell>448</cell><cell>20</cell><cell>6</cell><cell>64</cell><cell>80.79%</cell><cell>78.48%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 7</head><label>7</label><figDesc>Mean 𝑓₁ score on public/private test set with different pretrain dataset and MetaFormer-2 as backbone.</figDesc><table><row><cell>pretrain dataset</cell><cell>batch size</cell><cell>accumulate steps</cell><cell>epochs</cell><cell>public mean 𝑓₁</cell><cell>private mean 𝑓₁</cell></row><row><cell>herbarium</cell><cell>12</cell><cell>6</cell><cell>32</cell><cell>80.90%</cell><cell>77.37%</cell></row><row><cell>imagenet22k</cell><cell>12</cell><cell>8</cell><cell>48</cell><cell>81.47%</cell><cell>77.86%</cell></row><row><cell>inaturalist21</cell><cell>24</cell><cell>6</cell><cell>48</cell><cell>82.04%</cell><cell>77.92%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 8</head><label>8</label><figDesc>Mean 𝑓₁ score on public/private test set with ConvNext-tiny, ConvNext-base and ConvNext-large as backbone. Notice that we only use image data to train ConvNext.</figDesc><table><row><cell>batch size</cell><cell>accumulate steps</cell><cell>epochs</cell><cell>+pseudo label</cell><cell>backbone</cell><cell>public mean 𝑓₁</cell><cell>private mean 𝑓₁</cell></row><row><cell>96</cell><cell>3</cell><cell>64</cell><cell>no</cell><cell>convnext-tiny</cell><cell>76.93%</cell><cell>73.46%</cell></row><row><cell>32</cell><cell>4</cell><cell>64</cell><cell>no</cell><cell>convnext-base</cell><cell>78.97%</cell><cell>75.46%</cell></row><row><cell>10</cell><cell>4</cell><cell>64</cell><cell>no</cell><cell>convnext-large</cell><cell>79.15%</cell><cell>75.59%</cell></row><row><cell>24</cell><cell>6</cell><cell>80</cell><cell>yes</cell><cell>convnext-large</cell><cell>80.65%</cell><cell>76.61%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_10"><head>Table 9</head><label>9</label><figDesc>Mean 𝑓₁ score on public/private test set with pseudo label, ConvNext-large and MetaFormer-2 as backbone.</figDesc><table><row><cell>backbone</cell><cell>batch size</cell><cell>accumulate steps</cell><cell>epochs</cell><cell>public mean 𝑓₁</cell><cell>private mean 𝑓₁</cell></row><row><cell>ConvNext-large</cell><cell>24</cell><cell>6</cell><cell>80</cell><cell>80.65%</cell><cell>76.61%</cell></row><row><cell>MetaFormer-2</cell><cell>24</cell><cell>8</cell><cell>80</cell><cell>82.45%</cell><cell>77.93%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_11"><head>Table 10</head><label>10</label><figDesc>Single model's mean 𝑓₁ score on public/private test set with different test time augmentation.</figDesc><table><row><cell>test time augmentation</cell><cell>public mean 𝑓₁</cell><cell>private mean 𝑓₁</cell></row><row><cell>center crop / five crop</cell><cell>81.66% / 81.76%</cell><cell>77.94% / 78.25%</cell></row><row><cell>center crop / five crop</cell><cell>81.67% / 81.63%</cell><cell>77.62% / 77.69%</cell></row><row><cell>five crop / multi scale &amp; ten crop</cell><cell>80.46% / 80.20%</cell><cell>77.02% / 77.31%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_12"><head>Table 11</head><label>11</label><figDesc>Ensemble model's mean 𝑓₁ score on public/private test set with different test time augmentation.</figDesc><table><row><cell>test time augmentation</cell><cell>public mean 𝑓₁</cell><cell>private mean 𝑓₁</cell></row><row><cell>center crop / multi scale &amp; ten crops</cell><cell>83.20% / 83.26%</cell><cell>79.51% / 79.38%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_13"><head>Table 12</head><label>12</label><figDesc>The effectiveness of post process for tail categories recognition and open-set categories recognition.</figDesc><table><row><cell>ensemble and post process</cell><cell>number of open-set samples</cell><cell>public mean 𝑓₁</cell><cell>private mean 𝑓₁</cell></row><row><cell>average ensemble (v1)</cell><cell>∼1000</cell><cell>83.26%</cell><cell>79.38%</cell></row><row><cell>average ensemble (v2)</cell><cell>∼1500</cell><cell>83.50%</cell><cell>79.60%</cell></row><row><cell>average ensemble (v3)</cell><cell>∼1500</cell><cell>83.65%</cell><cell>79.79%</cell></row><row><cell>average ensemble (v4)</cell><cell>∼1500</cell><cell>83.78%</cell><cell>80.43%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_14"><head></head><label></label><figDesc>Test Time Augmentation. At test time we use center crop, five crop, and multi scale &amp; ten crop. The effects of test time augmentation (TTA) are shown in Table 10 and Table 11. Note that in some experiments the mean 𝑓₁ score on the public test set is inconsistent with the score on the private test set, so it is hard to decide which TTA is better from the public score alone. For robustness, we chose multi scale &amp; ten crop, based on the public mean 𝑓₁ in Table 11. Post Process. For brevity, we name the versions of ensemble and post process v1 (initial version), v2 (v1 + proper open-set threshold), v3 (v2 + models trained with pseudo label) and v4 (v3 + post process for tail categories).</figDesc><table /></figure>
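<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration only, the following minimal Python sketch shows one way five-crop test time augmentation, average ensembling and an open-set threshold could be combined. It is not the authors' implementation: the helper names (five_crop, softmax, tta_predict), the use of a maximum-probability threshold to reject open-set samples, and the label -1 for the unknown class are all assumptions made for this example. Each model is assumed to be a callable that maps a cropped image to a list of class logits.</p>
```python
import math


def five_crop(image, size):
    """Return the four corner crops and the center crop of a 2-D image.

    The image is a list of rows; each crop is a size x size list of rows.
    """
    height, width = len(image), len(image[0])
    top, left = (height - size) // 2, (width - size) // 2
    offsets = [
        (0, 0),                            # top-left corner
        (0, width - size),                 # top-right corner
        (height - size, 0),                # bottom-left corner
        (height - size, width - size),     # bottom-right corner
        (top, left),                       # center
    ]
    return [[row[x:x + size] for row in image[y:y + size]] for y, x in offsets]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def tta_predict(models, image, size, threshold, unknown_label=-1):
    """Average class probabilities over every crop and every model.

    Returns the arg-max class, or unknown_label when the averaged top
    probability falls below the threshold (hypothetical open-set rule).
    """
    views = five_crop(image, size)
    probs = [softmax(model(view)) for model in models for view in views]
    mean = [sum(column) / len(probs) for column in zip(*probs)]
    best = max(range(len(mean)), key=mean.__getitem__)
    return best if mean[best] >= threshold else unknown_label
```
<p>In this sketch, passing several models averages their probabilities, which corresponds to an average ensemble, and raising the threshold marks more test samples as open-set.</p></div>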
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Q</forename><surname>Diao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yuan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.02751</idno>
		<title level="m">Metaformer: A unified meta framework for fine-grained recognition</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A convnet for the 2020s</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Feichtenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<publisher>CVPR</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Seesaw loss for long-tailed instance segmentation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Loy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="9695" to="9704" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Overview of FungiCLEF 2022: Fungi recognition as an open set classification problem</title>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šulc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heilmann-Clausen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Matas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Lifeclef 2022 teaser: An evaluation of machine-learning based species identification and species distribution prediction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lorieul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Durso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bolon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Information Retrieval</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="390" to="399" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Overview of lifeclef 2022: an evaluation of machine-learning based species identification and species distribution prediction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lorieul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Durso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Navine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Klinck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Eggel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bonnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šulc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hruz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference of the Cross-Language Evaluation Forum for European Languages</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šulc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Matas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heilmann-Clausen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">S</forename><surname>Jeppesen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Laessøe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Frøslev</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.10107</idno>
		<title level="m">Danish fungi 2020 - not just another image recognition dataset</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Weakly supervised complementary parts models for fine-grained image classification from the bottom up</title>
		<author>
			<persName><forename type="first">W</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3034" to="3043" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Filtration and distillation: Enhancing region attention for fine-grained visual categorization</title>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z.-J</forename><surname>Zha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="11555" to="11562" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Learning to navigate for fine-grained classification</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Computer Vision (ECCV)</title>
				<meeting>the European Conference on Computer Vision (ECCV)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="420" to="435" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-N</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kortylewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yuille</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.07976</idno>
		<title level="m">Transfg: A transformer architecture for fine-grained recognition</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Channel interaction networks for fine-grained image categorization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Scott</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="10818" to="10825" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Fine-grained image classification via combining vision and language</title>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Peng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="5994" to="6002" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Geo-aware networks for fine-grained recognition</title>
		<author>
			<persName><forename type="first">G</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Potetz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Leung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Adam</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision Workshops</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Presence-only geographical priors for fine-grained image classification</title>
		<author>
			<persName><forename type="first">O</forename><surname>Mac Aodha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Perona</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="9596" to="9606" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">An image is worth 16x16 words: Transformers for image recognition at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=YicbFdNTTy" />
	</analytic>
	<monogr>
		<title level="m">9th International Conference on Learning Representations, ICLR 2021, Virtual Event</title>
				<meeting><address><addrLine>Austria</addrLine></address></meeting>
		<imprint>
			<publisher>OpenReview</publisher>
			<date type="published" when="2021">May 3-7, 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Swin transformer: Hierarchical vision transformer using shifted windows</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Guo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision (ICCV)</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Guo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.00652</idno>
		<title level="m">Cswin transformer: A general vision transformer backbone with cross-shaped windows</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Adaptive class suppression loss for long-tail object detection</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Equalization loss for long-tailed object recognition</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yan</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR42600.2020.01168</idno>
		<ptr target="https://openaccess.thecvf.com/content_CVPR_2020/html/Tan_Equalization_Loss_for_Long-Tailed_Object_Recognition_CVPR_2020_paper.html" />
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020</title>
				<meeting><address><addrLine>Seattle, WA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Computer Vision Foundation / IEEE</publisher>
			<date type="published" when="2020-06-13">June 13-19, 2020</date>
			<biblScope unit="page" from="11659" to="11668" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Open-set recognition: a good closed-set classifier is all you need?</title>
		<author>
			<persName><forename type="first">S</forename><surname>Vaze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vedaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">mixup: Beyond empirical risk minimization</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cissé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lopez-Paz</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=r1Ddp1-Rb" />
	</analytic>
	<monogr>
		<title level="m">6th International Conference on Learning Representations, ICLR 2018</title>
				<meeting><address><addrLine>Vancouver, BC, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-05-03">April 30 - May 3, 2018</date>
		</imprint>
	</monogr>
	<note>Conference Track Proceedings, OpenReview.net</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">When does label smoothing help?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kornblith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 33rd International Conference on Neural Information Processing Systems</title>
				<meeting>the 33rd International Conference on Neural Information Processing Systems<address><addrLine>Red Hook, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Curran Associates Inc</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
