=Paper= {{Paper |id=Vol-3180/paper-163 |storemode=property |title=Does Closed-Set Training Generalize to Open-Set Recognition? |pdfUrl=https://ceur-ws.org/Vol-3180/paper-163.pdf |volume=Vol-3180 |authors=Fan Gao,Zining Chen,Weiqiu Wang,Yinan Song,Fei Su,Zhicheng Zhao,Hong Chen |dblpUrl=https://dblp.org/rec/conf/clef/GaoCWSSZC22 }} ==Does Closed-Set Training Generalize to Open-Set Recognition?== https://ceur-ws.org/Vol-3180/paper-163.pdf
Does Closed-Set Training Generalize to Open-Set
Recognition?
Fan Gao1 , Zining Chen1 , Weiqiu Wang1 , Yinan Song1 , Fei Su1 , Zhicheng Zhao1 and
Hong Chen2
1 Beijing Key Laboratory of Network System and Network Culture, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
2 China Mobile Research Institute


Abstract
Automatic classification of fungi assists scientists in species identification and biodiversity protection. The FungiCLEF 2022 challenge provides a large-scale multi-modal fine-grained dataset to contribute to this issue. This paper proposes a novel open-set image classification method called Class-wise Weighted Prototype Classifier (CWPC), which decouples closed-set training from open-set inference. Thus, it can benefit from all existing closed-set advances and transfer to the open-set setting without further modification. By ensembling meta-vision models and two different vision-only models, we achieve excellent performance with mean F1 scores of 81.02% and 77.58% on the public and private leaderboards, respectively.

Keywords
Fungi identification, Fine-grained, Open-set recognition, Metadata, Long-tailed




1. Introduction
   Fungi comprise many fine-grained classes of eukaryotic organisms that are widely distributed
in nature and play an important role in human production and life. Automatic recognition
of fungi species assists mycologists, citizen scientists and nature enthusiasts in identifying
species in the wild. However, fungi identification is difficult because of the high diversity
of fungi, the fine granularity of species and the domain gap caused by different observation tools.
As a part of LifeCLEF-2022 [1, 2], which aims at biodiversity identification and prediction,
FungiCLEF-2022 [3] calls for a robust open-set fungi identification system, which is more
practical than a closed-set recognition system in real-world scenarios.
   Thanks to large-scale labeled datasets such as ImageNet [4] and iNaturalist [5], convolutional
neural networks (CNNs) have become the mainstream of visual recognition and outperform human
experts in some fields [4, 6]. Due to the limitation of convolution structures, CNNs only take
image information as input and cannot benefit from rich metadata. Meanwhile, most existing
methods and optimizations are based on closed-set recognition and cannot be applied to open-set
recognition directly.
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
xsgaofan@bupt.edu.cn (F. Gao); chenzn@bupt.edu.cn (Z. Chen); wangweiqiu@bupt.edu.cn (W. Wang);
songyn@bupt.edu.cn (Y. Song); sufei@bupt.edu.cn (F. Su); zhaozc@bupt.edu.cn (Z. Zhao);
chenhongyj@chinamobile.com (H. Chen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
   In this paper, we propose a novel open-set classification method called Class-wise Weighted
Prototype Classifier (CWPC), which decouples closed-set training from open-set inference. On
the one hand, open-set recognition can thus benefit from all the advances in closed-set image
classification, such as large-scale pre-trained models, label smoothing and data augmentation.
On the other hand, it is cost-saving to transfer closed-set recognition models to open-set
scenarios without further modification or training. For closed-set training, we first
elaborately design a text template to compensate for the missing context of the metadata, which
gives more reasonable and complete semantic information than discrete and independent words.
We then combine textual and visual information with a meta-vision model, where convolution
is used to extract deep vision embeddings and a transformer is used to fuse image and metadata
information. We also employ two different vision-only models to complement each other. A
hard classes mining strategy and the LDAM loss [7] are used to alleviate the long-tailed distribution
of the dataset. Finally, our method achieves 81.02% and 77.58% on the public and private
leaderboards, respectively.


2. Related Work
2.1. Open-set Recognition
   Unlike traditional closed-set recognition, open-set recognition is more suitable for real-world
applications. This task was first proposed in [8], in which the authors apply a 1-vs-Set machine
to calculate an open-space risk as an indicator: when a sample is far from known samples,
the increased risk suggests it is more likely to come from an unknown class. OpenMax [9] replaces the
SoftMax layer in DNNs with an OpenMax layer that redistributes the activations to estimate the
probability of unknown classes. PROSER [10] takes the open-set problem into consideration during
training: it generates data placeholders by fusing middle hidden-layer features from different
classes as the embeddings of open-set classes, and augments the output layer with an extra
dummy classifier to better separate known and unknown classes. Although these methods make great
progress on open-set recognition, they cannot utilize metadata efficiently.

2.2. Multi-Modality
   Fine-grained classification methods using only images have been explored by many researchers.
Besides visual information, additional information can be used to improve performance. CVL [11]
proposes a two-branch network in which a vision stream learns deep visual representations
and a language stream learns text representations; the outputs of the two streams are merged at a
later stage to combine vision and language. Geo-Aware Networks [12] incorporate a geolocation
prior into fine-grained classification and examine various ways of using the geographic
prior. MetaFormer [13] is a hybrid backbone in which convolution extracts image
embeddings and introduces the inductive bias of convolution, while the transformer fuses
visual and meta information.
2.3. Long-tail Recognition
   For long-tailed recognition, re-balancing methods including re-weighting [14, 15] and re-
sampling [16, 17] are conventional approaches to alleviate dataset imbalance. However,
recent studies find that they may harm feature learning. Besides, re-balancing methods
easily over-fit the tail classes and under-fit the head classes. Multi-expert
models such as BBN [18] and RIDE [19] are also designed to solve the long-tailed problem, but
these methods have high computational complexity and are hard to optimize when a
large-scale pretrained model is chosen as the backbone. OLTR [20] is the first work proposed for
open-set long-tailed recognition, utilizing an extra attention module and a memory bank. Therefore,
considering the computation and memory cost, we apply the LDAM loss on large-scale pretrained
models to fine-tune on the competition dataset, which assigns larger margins to the low-frequency
classes and smaller margins to the high-frequency ones.


3. Challenge Description
3.1. Dataset
   The data of FungiCLEF comes from Danish Fungi 2020 [21], a novel fine-grained dataset which
consists of 266,344 images for training and 29,594 images for validation. It contains 1,604
species, mainly from the Fungi kingdom, together with a few visually similar species. While most
images are collected in natural scenes, some hand-drawn drafts and microscope observations remain,
which have a huge domain gap with the others, as shown in Figure 1. In addition to
image information and class labels, this dataset provides rich observation metadata in CSV files,
with more than 20 fields covering basic time and geographic localities, full taxonomy labels,
substrate, habitat, etc. Not all metadata is available for every image, and some fields
are missing. The class frequencies in the dataset follow an extremely unbalanced long-tailed
distribution, with a maximum of 1,913 and a minimum of 31 samples per class, as illustrated in
Figure 2. An additional set of 118,676 images from 3,134 species is used for testing; these images
are provided with less metadata (e.g. time stamp, location, substrate, habitat).

3.2. Task
   As a part of LifeCLEF-2022, which aims at biodiversity identification and prediction,
FungiCLEF-2022 is an automatic fungi recognition competition as well as an open-set machine
learning problem, which means unknown categories emerge at test time. Under these
circumstances, the open-set recognition task is to classify the known classes correctly and reject
all unknown classes as one class. Meanwhile, Danish Fungi 2020 is a fine-grained dataset with
1,604 fungi species; small inter-class variance and large intra-class variance make it even more
challenging. In contrast to traditional visual recognition, this task also provides rich metadata
acquired by citizen scientists, i.e., vision-only models are not sufficient and the combination of
metadata and images must be considered. We summarize the main difficulties of this competition
as follows:

    • Usage of rich metadata;
    • Extremely unbalanced long-tailed data distribution;
    • Open-set recognition rather than closed-set;
    • Robust recognition with noisy data, e.g. hand-drawn drafts and microscope observations.


4. Method
   In this section, we introduce our solution to the open-set fungi recognition challenge.
The key insight of our solution is to generalize models trained on the closed-set dataset to the
open-set scenario without any additional module or extra computation cost for open-set
training. Therefore, we decouple open-set recognition into closed-set training and open-
set inference, described in Section 4.1 and Section 4.2, respectively. For closed-set training,
we utilize existing closed-set advances and additionally exploit metadata with a designed text
template, merging multi-modal embeddings in feature space. For open-set inference, we design
a Class-wise Weighted Prototype Classifier (CWPC) and an ObservationId-aware Weighted
Similarity (OAWS) strategy to generalize the closed-set training models to the open-set recognition
challenge. Besides, we propose a weighted Top-5 voting strategy to ensemble diverse models
for better performance.

4.1. Closed-Set Training Improvements
4.1.1. Multi-modal Information Usage
  Metadata Preprocessing. For training and validation data, more than 20 kinds of metadata
are provided including time stamp, geographic localities, full taxonomy labels, substrate and
habitat, etc. There are plenty of choices during training, while only ten metadata fields are provided
for the test set: "eventDate", "month", "day", "countryCode", "Location_lvl0", "Location_lvl3",
"Location_lvl2", "Location_lvl1", "Substrate" and "Habitat".

Figure 1: Samples of the fungi challenge dataset.

Figure 2: Class distribution of the fungi challenge dataset.

Therefore, to keep the consistency of
training and testing, we choose from the above ten metadata fields, dropping the time-related ones in
consideration of the potential confusion caused by time. We also replace "countryCode" with the
full country name. Instead of regarding these metadata as discrete and independent words, we
design a description text template over all chosen fields and replace the missing ones with the word
"unknown". For example, if the values of "countryCode", "Location_lvl0", "Location_lvl3", "Loca-
tion_lvl2", "Location_lvl1", "Substrate" and "Habitat" are "US", "Mount Olive Baptist Church",
"United States", "Texas", "Brazoria", "bark of living trees" and "None", respectively, our template
yields the description: "Its location is Mount Olive Baptist Church, Brazoria, Texas, United
States. It lives in bark of living trees. Its habitat is unknown". This description is taken as the
caption of its corresponding image and is used in the Metadata Encoding step. Our designed
text template eliminates the distraction of missing metadata, adds contextual information and
ensures that we obtain fixed-dimension features for later stages.
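   As a concrete illustration, the template filling can be sketched in a few lines of Python. The function name and the handling of empty fields are our own assumptions; only the field names and the sentence pattern come from the example above.

```python
def build_description(meta: dict) -> str:
    """Fill the designed text template from an observation's metadata.

    Missing or empty fields are replaced with the word "unknown".
    The sentence pattern follows the example given above and is illustrative.
    """
    def get(key):
        value = meta.get(key)
        return value if value not in (None, "", "None") else "unknown"

    location = ", ".join(get(k) for k in
                         ("Location_lvl0", "Location_lvl1", "Location_lvl2", "Location_lvl3"))
    return (f"Its location is {location}. "
            f"It lives in {get('Substrate')}. "
            f"Its habitat is {get('Habitat')}.")


# Example from the text above: the habitat field is missing, so it becomes "unknown".
meta = {"countryCode": "US", "Location_lvl0": "Mount Olive Baptist Church",
        "Location_lvl3": "United States", "Location_lvl2": "Texas",
        "Location_lvl1": "Brazoria", "Substrate": "bark of living trees", "Habitat": None}
print(build_description(meta))
# Its location is Mount Olive Baptist Church, Brazoria, Texas, United States.
# It lives in bark of living trees. Its habitat is unknown.
```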
   Metadata Encoding. To obtain deep text embeddings efficiently, we directly employ pre-trained NLP
models. Intuitively, we use a multilingual BERT [22]-base model because the locations are
recorded in Danish. It is pretrained on the 104 languages with the largest Wikipedias using the
masked language modeling (MLM) objective, and it generates a 768-dimension feature for each
designed template. We further replace the multilingual BERT model with a RoBERTa [23]
large model. RoBERTa is a well-trained BERT with some modifications, including more epochs,
larger batches and more data. It generates a 1,024-dimension feature for each text template,
which contains more information and can be more representative.
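   The encoding step can be sketched with the HuggingFace transformers library as follows; the checkpoint name and the choice of the [CLS] token as the sentence feature are our assumptions, and swapping in a RoBERTa-large checkpoint would only change the feature size to 1,024.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch of the text-encoding step, assuming the public multilingual BERT-base checkpoint.
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name).eval()

@torch.no_grad()
def encode_description(text: str) -> torch.Tensor:
    """Return a fixed-size sentence embedding for one filled template."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    outputs = encoder(**inputs)
    # Use the [CLS] token of the last hidden layer as the sentence feature
    # (768-d for BERT-base); mean pooling would be an alternative choice.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

feature = encode_description("Its location is Mount Olive Baptist Church, Brazoria, Texas, "
                             "United States. It lives in bark of living trees. "
                             "Its habitat is unknown.")
print(feature.shape)  # torch.Size([768])
```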
   Meta-Vision Models. We use MetaFormer as our meta-vision backbone to incorporate meta infor-
mation and improve fine-grained classification. MetaFormer is a hybrid framework which
uses convolution to extract deep vision features and uses transformer layers to fuse vision and
meta information. The original MetaFormer designs multi-layered fully-connected networks for
each metadata field to obtain embedding vectors. However, our meta information has been merged
into one unified text template and encoded by pre-trained NLP models, as described in Metadata
Preprocessing and Metadata Encoding, respectively. After obtaining the initial text embeddings
with a pre-trained NLP model, we apply a single fully-connected layer to them, followed by a ReLU
activation and layer normalization. The transformer layers in MetaFormer are then
used to fuse the visual tokens, the meta token and the class token. Like ViT [24], only the class token is
used for the category prediction.
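   A minimal sketch of this projection step is given below; the module name and the embedding sizes are illustrative assumptions, while the Linear-ReLU-LayerNorm order follows the description above.

```python
import torch
import torch.nn as nn

class MetaTokenProjector(nn.Module):
    """Project a pre-extracted text embedding into a single meta token.

    One fully-connected layer followed by ReLU and layer normalization,
    as described in the text. The dimensions (768 for multilingual BERT,
    384 for the transformer stage) are illustrative assumptions.
    """
    def __init__(self, text_dim: int = 768, embed_dim: int = 384):
        super().__init__()
        self.proj = nn.Linear(text_dim, embed_dim)
        self.act = nn.ReLU(inplace=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        # (B, text_dim) -> (B, 1, embed_dim): one meta token per image, later
        # concatenated with the visual tokens and the class token.
        return self.norm(self.act(self.proj(text_feat))).unsqueeze(1)

meta_token = MetaTokenProjector()(torch.randn(8, 768))
print(meta_token.shape)  # torch.Size([8, 1, 384])
```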
   Vision-Only Models. While MetaFormer focuses on the fusion of multi-modal information,
models trained only on images are necessary to learn visually representative deep features. Here
we use the convolution-based ConvNeXt [25] and the transformer-based Swin Transformer [6], both
of which are pioneering works in their respective fields. We expect these two different network
structures to attend to different image patterns, bring new views into the learning process
and complement each other in the final decision. We adopt the vanilla Swin Transformer and
ConvNeXt architectures for simplicity.
   To sum up, the pipelines of the meta-vision models and the vision-only models during closed-set
training are illustrated in Figure 3.

Figure 3: The pipeline of meta-vision models and vision-only models during closed-set training.

4.1.2. Long-Tailed Solution
   LDAM Loss [7]. As analyzed in Section 3.1, the dataset shows an extremely unbalanced
long-tailed distribution, which deteriorates network performance at test time. To
alleviate this adverse effect, we train our models with the LDAM loss rather than the CE loss. The LDAM
loss enforces a class-aware margin for each class to optimize a uniform-label generalization
error bound. It encourages larger margins for minority classes and smaller margins for majority
classes. Meanwhile, the inputs of the LDAM loss should be normalized by applying the L2 norm to the
last hidden activations and to the weight vectors of the last fully-connected layer. We only use the
LDAM loss on our meta-vision model.
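   For reference, a sketch of the LDAM loss following the public reference implementation of [7] is shown below; the maximum margin and scale values are common defaults, not values tuned in this paper.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDAMLoss(nn.Module):
    """Label-Distribution-Aware Margin loss [7], sketched after the public
    reference implementation; max_m and s are common defaults."""
    def __init__(self, cls_num_list, max_m: float = 0.5, s: float = 30.0):
        super().__init__()
        n = np.asarray(cls_num_list, dtype=np.float64)
        m_list = 1.0 / np.sqrt(np.sqrt(n))           # margin proportional to n_c^(-1/4)
        m_list = m_list * (max_m / m_list.max())     # larger margin for rarer classes
        self.register_buffer("m_list", torch.tensor(m_list, dtype=torch.float32))
        self.s = s

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Subtract the class-dependent margin from the logit of the true class only.
        # Logits are assumed to be cosine similarities (features and classifier
        # weights L2-normalized), hence the scale factor s.
        one_hot = F.one_hot(target, num_classes=logits.size(1)).float()
        logits_m = logits - one_hot * self.m_list[target].unsqueeze(1)
        return F.cross_entropy(self.s * logits_m, target)
```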
   Hard Classes Mining. We design a hard class mining (HCM) strategy based on the accuracy on the
training and validation sets, and then augment the mined classes with the high-resolution data
provided by the host. Specifically, on the training set we set the threshold to 80%, and classes
whose accuracy is under 80% are defined as hard classes. On the validation set, we only consider
classes with more than 50 samples, and the threshold is set to 85%. Based on the above two
principles, we obtain 83 hard classes. We manually filter the corresponding images and remove some
dirty cases such as too-small targets or low-quality images, as shown in Figure 4. Finally, we
supplement the remaining images with the provided high-resolution ones.

Figure 4: Dirty cases of hard classes.
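   The mining rule can be sketched as follows; the thresholds are those stated above, while combining the two criteria by union is our assumption.

```python
import numpy as np

def mine_hard_classes(train_preds, train_labels, val_preds, val_labels,
                      train_thr=0.80, val_thr=0.85, min_val_samples=50):
    """Sketch of the hard-class mining rule described above.

    A class is marked hard if its training accuracy is below train_thr, or if it
    has more than min_val_samples validation samples and its validation accuracy
    is below val_thr (the "or" combination is our assumption).
    """
    def per_class_stats(preds, labels):
        preds, labels = np.asarray(preds), np.asarray(labels)
        acc, count = {}, {}
        for c in np.unique(labels):
            mask = labels == c
            acc[c] = float((preds[mask] == c).mean())
            count[c] = int(mask.sum())
        return acc, count

    train_acc, _ = per_class_stats(train_preds, train_labels)
    val_acc, val_cnt = per_class_stats(val_preds, val_labels)

    hard = {c for c, a in train_acc.items() if a < train_thr}
    hard |= {c for c, a in val_acc.items()
             if val_cnt[c] > min_val_samples and a < val_thr}
    return sorted(hard)
```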

4.1.3. Data Augmentation
  In addition to traditional data augmentation [26] such as random horizontal flip, we also use
Mixup [27] and CutMix [28] with a probability of 0.4 for model robustness. It should be pointed
out that these two data augmentation methods and the LDAM loss are not compatible because of
the mixed labels. Besides, we use random erasing with a probability of 0.2 and Auto Augment
(AA) [29], which automatically searches for improved data augmentation policies over operations
such as ShearX/Y, TranslateX/Y, Rotate and AutoContrast.
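   A possible implementation of this augmentation pipeline, assuming torchvision transforms and the timm implementation of Mixup/CutMix, is sketched below; the Mixup/CutMix alpha values, the crop choice and the normalization statistics are assumptions, while the 0.4 mixing probability and the 0.2 random-erasing probability follow the text.

```python
import torch
from torchvision import transforms
from timm.data import Mixup

# Image-level augmentations applied per sample during training.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(384),
    transforms.RandomHorizontalFlip(),
    transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),  # AA stand-in
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.2),
])

# Mixup/CutMix applied on batches with probability 0.4; incompatible with the
# LDAM loss because of the mixed soft labels, so only used with the CE loss.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, prob=0.4, num_classes=1604)
# images, soft_targets = mixup_fn(images, targets)   # applied per training batch
```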
4.2. Open-Set Inference Design
  Our models are trained on a closed-set classification task as described above; therefore, the
inference method is designed to tackle the open-set challenge of deciding whether test
images belong to the "unknown" class. All data used during the inference stage are features and
prediction scores of the training and test sets extracted by our closed-set trained models.

4.2.1. Class-wise Weighted Prototype Classifier
   The traditional open-set maximum softmax probability (MSP) method only utilizes test prediction
scores to judge the probability of the unknown class and ignores the information in the training
set. Thus, we consider a similarity-based method and propose the Class-wise Weighted Prototype
Classifier (CWPC), which constructs class centers using both the features and the prediction scores
of the training set. Specifically, we first extract the features and prediction scores of all images
in the training set. Then, we use them to compute class centers, assigning different weights to
samples with the same label instead of treating their features equally. We apply a softmax
over the maximum prediction scores of all images with the same label to compute the weight of
each sample when computing the class center. The weights for the samples of each class can be
formulated as follows:
                                    P_i = [p_1, p_2, ..., p_c],
                                    m_i = Max(P_i),
                                    M_c = [m_1, m_2, ..., m_{N_c}],                        (1)
                                    W_c = Softmax(M_c)

  where P_i is the prediction score of the i-th image after the Softmax function, m_i denotes the maximum
prediction score of the i-th image, N_c is the number of images of class c in the training set, M_c
collects the maximum prediction scores of these N_c images, and W_c denotes the weights for the images of class c.
  CWPC improves the compactness within each class, resulting in more accurate class-center
representations and providing a strong prerequisite for the subsequent similarity measurement.
Moreover, as an inference-stage strategy, CWPC consumes negligible computation resources and can
be applied at the inference stage of any open-set image classification task, so we consider it a
general-purpose component for open-set challenges.
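   A minimal PyTorch sketch of the class-center construction in Eq. (1) is given below; the function and variable names are ours, and a vectorized implementation would behave the same.

```python
import torch

def class_wise_weighted_prototypes(features, probs, labels, num_classes):
    """Build CWPC class centers (Eq. 1).

    features: (N, D) training features, probs: (N, C) softmax scores,
    labels: (N,) ground-truth labels. For each class, the images' maximum
    prediction scores are passed through a softmax to weight their features.
    """
    centers = torch.zeros(num_classes, features.size(1))
    max_scores = probs.max(dim=1).values                  # m_i = Max(P_i)
    for c in range(num_classes):
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        w = torch.softmax(max_scores[idx], dim=0)         # W_c = Softmax(M_c)
        centers[c] = (w.unsqueeze(1) * features[idx]).sum(dim=0)
    return centers
```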

4.2.2. ObservationId-awared Weighted Similarity
  To meet the submission requirement of one prediction per ObservationId, we design an
ObservationId-aware Weighted Similarity (OAWS) to fuse the images sharing the same Obser-
vationId. As CWPC outputs all class centers, the OAWS module calculates the similarity
between the feature of each ObservationId and each class center. Thus, we first fuse the images
with the same ObservationId, applying different weights to different images. The fusion weights
for images with the same ObservationId are computed as follows,

                                      P_i = [p_1, p_2, ..., p_c],
                                      m_i = Max(P_i),
                                      K_o = [m_1, m_2, ..., m_{N_o}],                        (2)
                                      W_o = Softmax(K_o)

where P_i is the prediction score of the i-th image after the Softmax function, m_i denotes the maximum
prediction score of the i-th image, K_o collects the maximum prediction scores of the N_o images with the
same ObservationId, and W_o denotes the weights for the images with the same ObservationId.
   Then, cosine similarity between the fused feature and each class center is adopted to produce the final
results. Specifically, an adjustable threshold is set: if the maximum similarity of an ObservationId falls
below it, the ObservationId is assigned to the "unknown" class on the test set. The OAWS module not only
serves as a dedicated technique for the Fungi Challenge, but is also of reference value for other open-set
challenges thanks to its similarity and threshold design, which can be further adjusted for better performance.
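   The OAWS fusion and the thresholded cosine similarity can be sketched as follows for a single ObservationId; the function name and the use of -1 to denote the "unknown" class are our conventions, and the threshold is tuned on the validation set.

```python
import torch
import torch.nn.functional as F

def oaws_predict(features, probs, centers, threshold):
    """OAWS for one ObservationId (Eq. 2).

    features: (N_o, D) features of the images sharing the ObservationId,
    probs: (N_o, C) their softmax scores, centers: (C, D) CWPC class centers.
    Returns the predicted class index, or -1 for the "unknown" class.
    """
    w = torch.softmax(probs.max(dim=1).values, dim=0)        # W_o = Softmax(K_o)
    fused = (w.unsqueeze(1) * features).sum(dim=0)           # fused observation feature
    sims = F.cosine_similarity(fused.unsqueeze(0), centers)  # similarity to each center
    best_sim, best_cls = sims.max(dim=0)
    return int(best_cls) if best_sim >= threshold else -1
```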

4.2.3. Comparisons
   As CWPC and OAWS successfully generalize closed-set training to open-set recognition and
achieve prominent improvements on the Fungi Challenge, several comparative methods are
evaluated. First, based on MSP, we calculate the maximum prediction score and set a threshold
to judge whether a sample belongs to the "unknown" class. It should be noted that, since the Fungi
Challenge requires an ObservationId-level result, we calculate the average test prediction
score within each ObservationId as follows,
                                   P_i = [p_1, p_2, ..., p_c],
                                   m_j = Mean([P_1, P_2, ..., P_{N_j}]),                         (3)
                                   P_max = Max(Softmax(m_j))

  where P_i is the prediction score of the i-th image after the Softmax function, N_j is the number of
images in the j-th ObservationId, m_j represents the mean prediction score of the j-th ObservationId,
and P_max denotes the maximum prediction score of the j-th ObservationId.
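   For completeness, the ObservationId-level MSP score of Eq. (3) can be sketched as below; we follow the equation literally, keeping the second softmax, which preserves the argmax but rescales the score compared with using the mean probability directly.

```python
import torch

def msp_observation_score(probs: torch.Tensor) -> float:
    """MSP baseline per ObservationId, following Eq. (3).

    probs: (N_j, C) softmax scores of the images sharing one ObservationId.
    ObservationIds whose returned P_max falls below a threshold are rejected
    as "unknown".
    """
    m_j = probs.mean(dim=0)                        # m_j = Mean([P_1, ..., P_{N_j}])
    return torch.softmax(m_j, dim=0).max().item()  # P_max = Max(Softmax(m_j))
```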
  Second, besides CWPC, class centers can be calculated with three other selection strategies,
described as follows.
    • Average Selection: use the average features of all images in the training set to calculate class
      centers as follows,
                                F_average^j = Mean([f_1, f_2, ..., f_{N_j}])                    (4)
      where f_i is the feature of the i-th image of the j-th class and N_j is the number of its images.
    • Filter Selection: use the average features of the training images whose maximum
      prediction score is above a threshold to calculate class centers.
    • GT Selection: use the average features of the images whose predictions equal the ground truth (GT)
      to calculate class centers.
  Third, besides OAWS, we apply two other fusion strategies to the test features.
    • Average Fusion: the test feature per ObservationId is the average of the features of all images
      with that ObservationId.
    • Filter Fusion: the test feature per ObservationId is the average of the features of the images
      whose maximum prediction score is above a threshold.

   Fourth, we consider using the features of every single image in the training set instead of class centers.
We calculate the similarity between the test feature of each ObservationId and the features of every
single training image, and extract the top-1 or top-9 predictions using the model ensemble strategy of
Section 4.2.4; these variants are named Single-Image Similarity Top1 and Single-Image Similarity Top9.
   Fifth, we evaluate OpenMax, an open-set inference strategy based on Extreme Value Theory
(EVT), to estimate the probability of an input coming from an unknown class. The key element of
estimating the unknown probability is adapting Meta-Recognition concepts to the activation
patterns in the penultimate layer of the network.

4.2.4. Inference Augmentation
   Test-time Augmentation (TTA). TTA creates multiple enhanced copies of each test image,
allowing the model to make predictions on both the original and the augmented copies to
improve the mean F1 score on the test set. Typical TTA methods such as crop, flip and color jitter
are used in the Fungi Challenge: we use random crop with an extension rate of 1.15 on the input
size, five-crop with an additional extension of 32 pixels, horizontal flip with a probability of 1,
color jitter with a magnitude of 0.2, and combinations of the above TTA methods.
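   As an illustration, the five-crop variant could be implemented as below; we interpret the "additional extension of 32 pixels" as resizing the image 32 pixels larger than the network input before cropping, and the normalization statistics are assumptions.

```python
import torch
from torchvision import transforms

to_tensor = transforms.ToTensor()
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

def fivecrop_tta(model, image, input_size=384, extension=32):
    """Five-crop TTA sketch: resize the image `extension` pixels larger than the
    network input, take the four corner crops plus the center crop, and average
    the softmax scores over the five crops."""
    crops = transforms.Compose([
        transforms.Resize(input_size + extension),
        transforms.FiveCrop(input_size),  # returns a tuple of 5 crops
    ])(image)
    batch = torch.stack([normalize(to_tensor(c)) for c in crops])  # (5, 3, H, W)
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)
    return probs.mean(dim=0)  # fused prediction for this image
```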
   Diverse Model Ensemble. As diverse network architectures, training strategies and data
augmentations are used to improve model performance, the trained models differ considerably
from one another. In order to take full advantage of the semantic information of
different models, model ensembling is essential for improving the final results, and voting is
the simplest and most efficient way to do so. We propose top-1 and top-5 voting strategies over
the diverse models in the Fungi Challenge. The top-1 strategy follows the rule that "the minority
obeys the majority" and takes the majority class index as the final result. The top-5 strategy
extracts the predicted top-5 class indices of each model, weights them by rank, and chooses the
class index with the maximum accumulated weight as the final result,

                                 W_vote = [1, 1/2, 1/3, 1/4, 1/5]                               (5)

where W_vote contains the weights for the Top-1 to Top-5 predictions.
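   The weighted Top-5 voting can be sketched as follows; the data layout (one list of top-5 class indices per model) is an assumption.

```python
from collections import defaultdict

def weighted_top5_vote(model_top5_lists):
    """Weighted Top-5 voting (Eq. 5).

    model_top5_lists: for one ObservationId, a list with each model's top-5
    class indices ordered from rank 1 to rank 5. Each rank contributes the
    weight 1/rank; the class with the largest accumulated weight wins.
    """
    weights = [1.0, 1 / 2, 1 / 3, 1 / 4, 1 / 5]      # W_vote
    scores = defaultdict(float)
    for top5 in model_top5_lists:
        for rank, cls in enumerate(top5):
            scores[cls] += weights[rank]
    return max(scores, key=scores.get)

# e.g. three models voting on one observation
print(weighted_top5_vote([[3, 7, 1, 2, 9], [7, 3, 5, 1, 0], [3, 5, 7, 9, 2]]))  # -> 3
```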


5. Experiments
5.1. Implementation Details
   We fine-tune MetaFormer-2 from ImageNet-21K pre-trained weights with an input resolution of
224×224 on 4 Nvidia T4 GPUs and an input resolution of 384×384 on 4 Nvidia V100 GPUs. The AdamW
optimizer is employed with a cosine learning rate scheduler. The learning rate is initialized
to 2 × 10−4 for 30 epochs, and the first 3 epochs are used for warm-up starting from 5 × 10−8 . As for the
vision-only models, we fine-tune the SwinTransformer-Base and ConvNeXt networks on 4 Nvidia
V100 GPUs for 30 epochs. Both are pre-trained on ImageNet-21K, and the pretrained weights are
provided by the official repositories. The optimizer, learning rate and scheduler are the same as for
MetaFormer-2, but without warm-up epochs. We choose SwinTransformer-Base and ConvNeXt-Base with an
input resolution of 384×384 to balance computational consumption. The weight decay is 10−8 for
SwinTransformer-Base and 2 × 10−5 for the others.

Table 1
Results on selection strategies.
                         Method           Model      Test Input   Macro-F1(%)
                           MSP          MetaFormer   224×224      75.39
                    Average Selection   MetaFormer   224×224      75.74
                     Filter Selection   MetaFormer   224×224      75.58
                    Average Selection   MetaFormer   384×384      77.39
                      GT Selection      MetaFormer   384×384      77.24
                          CWPC          MetaFormer   384×384      77.49
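   The optimization setup described above could be reproduced roughly as follows with standard PyTorch schedulers; the epoch-wise stepping and the exact warm-up shape are assumptions, while the learning rates, epoch counts and weight decay follow the text.

```python
import torch
import torch.nn as nn

# Rough sketch of the optimization setup, stepping the scheduler once per epoch.
model = nn.Linear(768, 1604)          # placeholder for the actual backbone
base_lr, warmup_lr, epochs, warmup_epochs = 2e-4, 5e-8, 30, 3

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=2e-5)
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=warmup_lr / base_lr, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(epochs):
    # ... one training epoch ...
    scheduler.step()
```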

5.2. Result
   We train 7 models in total, two of which are vision-only models and five of which are meta-vision
models. The evaluation metric for this competition is the mean F1 score, denoted as Macro-F1, and
the results are shown in Tab. 6. We conducted these 7 experiments with different training data,
multi-scale input sizes and loss functions. In particular, test set images with pseudo-labels given
by diverse classifiers are used to further fine-tune our trained models. These settings ensure
the diversity of the models during the model ensemble stage, so that they can complement each other
for a better result. Finally, we achieved 6th place with 81.02% on the public leaderboard and
77.58% on the private leaderboard.

5.3. Ablation Studies
   We conduct ablation studies to demonstrate the effectiveness of our strategies for selection,
fusion, similarity, augmentation and ensembling. Tab. 1 shows that CWPC is the best selection
strategy in the Fungi Challenge, and Tab. 2 shows that OAWS is the best fusion strategy. Tab. 3 and
Tab. 5 show that the combination of CWPC and OAWS is the best open-set strategy, and Tab. 4 shows
that five-crop is the best test-time augmentation strategy. Tab. 6 shows that Top-5 voting is the
best ensemble strategy; our ensembled model achieves a final result of 81.02% on the public
leaderboard and 77.58% on the private leaderboard.


6. Conclusions
  In this paper, we propose a novel open-set fine-grained image classification method called
Class-wise Weighted Prototype Classifier (CWPC), using extra text information, for the FungiCLEF-
2022 challenge. We decouple the whole process into closed-set training and open-set testing. Thus,
it can benefit from the numerous advances in closed-set image classification, such as large-scale
pre-trained models, label smoothing and data augmentation. With our method, it is also cost-saving
to generalize closed-set recognition models to open-set scenarios without any further modification.
Besides, we add extra metadata to improve the performance of fine-grained classification using a
hybrid structure, where convolution is used to extract deep vision features and a transformer is
used to fuse the vision and metadata embeddings. Together with the long-tailed solutions and data
augmentation, we achieved 6th place in this challenge with a final result of 81.02% on the public
leaderboard and 77.58% on the private leaderboard.

Table 2
Results on fusion strategies.
                         Method            Model        Test Input      Macro-F1(%)
                     Average Fusion     MetaFormer        384×384       76.83
                      Filter Fusion     MetaFormer        384×384       76.36
                          OAWS          MetaFormer        384×384       77.06

Table 3
Results on similarity strategies.
                    Method                         Model                Test Input   Macro-F1(%)
         Single-Image Similarity Top1     SwinTransformer-Base          384×384      64.98
         Single-Image Similarity Top9     SwinTransformer-Base          384×384      64.15
                   CWPC                   SwinTransformer-Base          384×384      73.00

Table 4
Results on test-time augmentation strategies.
                                 Test-Time Augmentation              Macro-F1(%)
                                         None                        77.49
                        Horizontal flip + Vertical flip + Origin     77.69
                        Horizontal flip + Color jitter + Origin      77.81
                                      CenterCrop                     77.24
                                     RandomCrop                      77.71
                                       FiveCrop                      78.28

Table 5
Results on overall strategies.
                        Method             Model           Test Input     Macro-F1(%)
                      OpenMax           Convnext-Base      384×384        77.18
                    CWPC + OAWS         Convnext-Base      384×384        79.41
Table 6
Final results.
        Model            Input size   Train set   Val set   Pseudo   HCM      Loss      Macro-F1(%)
     MetaFormer           224×224         √                                  CE Loss       78.33
     MetaFormer           384×384         √          √                      LDAM Loss      79.72
     MetaFormer           384×384         √                                  CE Loss       78.28
     MetaFormer           384×384         √          √         √            LDAM Loss      78.22
     MetaFormer           384×384         √          √                √     LDAM Loss      79.42
    Convnext-Base         384×384         √          √         √             CE Loss       79.81
 SwinTransformer-Base     384×384         √                                  CE Loss       73.00
    Ensemble Top5                                                                          81.02


7. Acknowledgments
  This work is supported by the Chinese National Natural Science Foundation (62076033, U1931202)
and the MoE-CMCC "Artificial Intelligence" Project No. MCM20190701.


References
 [1] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso,
     I. Bolon, et al., Lifeclef 2022 teaser: An evaluation of machine-learning based species
     identification and species distribution prediction, in: European Conference on Information
     Retrieval, Springer, 2022, pp. 390–399.
 [2] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso,
     H. Glotin, R. Planqué, W.-P. Vellinga, A. Navine, H. Klinck, T. Denton, I. Eggel, P. Bonnet,
     M. Šulc, M. Hruz, Overview of lifeclef 2022: an evaluation of machine-learning based
     species identification and species distribution prediction, in: International Conference of
     the Cross-Language Evaluation Forum for European Languages, Springer, 2022.
 [3] L. Picek, M. Šulc, J. Heilmann-Clausen, J. Matas, Overview of FungiCLEF 2022: Fungi
     recognition as an open set classification problem, in: Working Notes of CLEF 2022 -
     Conference and Labs of the Evaluation Forum, 2022.
 [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical
     image database, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE
     Conference on, IEEE, 2009, pp. 248–255. doi:10.1109/cvprw.2009.5206848.
 [5] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona,
     S. Belongie, The inaturalist species classification and detection dataset, in: Proceedings of
     the IEEE conference on computer vision and pattern recognition, 2018, pp. 8769–8778.
 [6] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical
     vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International
     Conference on Computer Vision, 2021, pp. 10012–10022.
 [7] K. Cao, C. Wei, A. Gaidon, N. Arechiga, T. Ma, Learning imbalanced datasets with label-
     distribution-aware margin loss, in: Advances in Neural Information Processing Systems,
     2019, pp. 1567–1578.
 [8] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, T. E. Boult, Toward open set recognition,
     IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2013) 1757–1772.
     doi:10.1109/TPAMI.2012.256.
 [9] A. Bendale, T. E. Boult, Towards open set deep networks, in: Proceedings of the IEEE
     conference on computer vision and pattern recognition, 2016, pp. 1563–1572.
[10] D.-W. Zhou, H.-J. Ye, D.-C. Zhan, Learning placeholders for open-set recognition, in:
     Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
     2021, pp. 4401–4410.
[11] X. He, Y. Peng, Fine-grained image classification via combining vision and language, in:
     Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017,
     pp. 5994–6002.
[12] G. Chu, B. Potetz, W. Wang, A. Howard, Y. Song, F. Brucher, T. Leung, H. Adam, Geo-aware
     networks for fine-grained recognition, in: Proceedings of the IEEE/CVF International
     Conference on Computer Vision Workshops, 2019, pp. 0–0.
[13] Q. Diao, Y. Jiang, B. Wen, J. Sun, Z. Yuan, Metaformer: A unified meta framework for
     fine-grained recognition, arXiv preprint arXiv:2203.02751 (2022).
[14] C. Huang, Y. Li, C. C. Loy, X. Tang, Learning deep representation for imbalanced classifica-
     tion, in: Proceedings of the IEEE conference on computer vision and pattern recognition,
     2016, pp. 5375–5384.
[15] Y.-X. Wang, D. Ramanan, M. Hebert, Learning to model the tail, in: Advances in Neural
     Information Processing Systems, 2017, pp. 7029–7039.
[16] L. Shen, Z. Lin, Q. Huang, Relay backpropagation for effective learning of deep convolu-
     tional neural networks, in: European conference on computer vision, Springer, 2016, pp.
     467–482.
[17] J. Byrd, Z. Lipton, What is the effect of importance weighting in deep learning?, in:
     International Conference on Machine Learning, PMLR, 2019, pp. 872–881.
[18] B. Zhou, Q. Cui, X.-S. Wei, Z.-M. Chen, Bbn: Bilateral-branch network with cumulative
     learning for long-tailed visual recognition, in: Proceedings of the IEEE/CVF Conference
     on Computer Vision and Pattern Recognition, 2020, pp. 9719–9728.
[19] X. Wang, L. Lian, Z. Miao, Z. Liu, S. X. Yu, Long-tailed recognition by routing diverse
     distribution-aware experts, arXiv preprint arXiv:2010.01809 (2020).
[20] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, S. X. Yu, Large-scale long-tailed recognition
     in an open world, in: Proceedings of the IEEE/CVF Conference on Computer Vision and
     Pattern Recognition, 2019, pp. 2537–2546.
[21] L. Picek, M. Šulc, J. Matas, T. S. Jeppesen, J. Heilmann-Clausen, T. Læssøe, T. Frøslev,
     Danish fungi 2020-not just another image recognition dataset, in: Proceedings of the
     IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 1525–1535.
[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
     V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint
     arXiv:1907.11692 (2019).
[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De-
     hghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Trans-
     formers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[25] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, in:
     Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
     2022, pp. 11976–11986.
[26] S. Marcel, Y. Rodriguez, Torchvision the machine-vision package of torch, in: Proceedings
     of the 18th ACM international conference on Multimedia, 2010, pp. 1485–1488.
[27] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimiza-
     tion, arXiv preprint arXiv:1710.09412 (2017).
[28] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, Cutmix: Regularization strategy to train
     strong classifiers with localizable features, in: Proceedings of the IEEE/CVF international
     conference on computer vision, 2019, pp. 6023–6032.
[29] C. Lin, M. Guo, C. Li, X. Yuan, W. Wu, J. Yan, D. Lin, W. Ouyang, Online hyper-parameter
     learning for auto-augmentation strategy, in: Proceedings of the IEEE/CVF International
     Conference on Computer Vision, 2019, pp. 6579–6588.