When Large Kernel Meets Vision Transformer: A
Solution for SnakeCLEF & FungiCLEF
Yang Shen1 , Xuhao Sun1 and Zijian Zhu1
1
    School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China


                                         Abstract
LifeCLEF 2022 is an evaluation campaign that is being organized as part of the CLEF initiative labs.
This paper records our solutions for two competitions in LifeCLEF 2022, i.e., SnakeCLEF 2022 and FungiCLEF
2022. SnakeCLEF aims at building an automatic and robust image-based system for snake species
identification, while FungiCLEF aims at automatically recognizing fungi species from both images and rich
metadata. These two competitions pose a number of challenges, such as fine-grained image recognition,
long-tailed recognition and openset recognition. In this paper, we utilize existing effective techniques
and tricks to deal with the long-tailed challenge in both SnakeCLEF and FungiCLEF. We also propose a
new backbone by combining the large kernel convolution and the vision transformer, as both have
shown superior performance in recognition tasks. For the SnakeCLEF competition, our team achieves an
85.4% Macro F1-Score on the private leaderboard, while for the FungiCLEF competition, we achieve a
78.9% Macro F1-Score. Codes are available at: https://github.com/sinbais/CLEF2022.

                                         Keywords
                                         FungiCLEF, SnakeCLEF, Openset recognition, Fine-grained image recognition, Long-tailed recognition




1. Introduction
Building accurate knowledge of the identity and the evolution of species is essential for the
sustainable development of humanity, as well as for biodiversity conservation. However,
the difficulty of identifying animals and fungi is hindering the aggregation of new data and
knowledge. Identifying and naming living organisms is almost impossible for the general public
and is often difficult even for professionals and naturalists [1]. The LifeCLEF Lab has been
promoting and evaluating advances in this domain for over 10 years and has achieved a lot
of meaningful results [2, 3, 4].
   In this paper, we combine existing effective techniques for the state-of-the-art pre-trained
models and utilize advanced methods in the long-tailed recognition task to give solutions for
both competitions (cf. Section 4). We also design a new backbone by combining two recent
hot topics, i.e., large kernel convolution and vision transformers (cf. Section 3.1). Our overall
strategy was to test as many models as possible and spend less time on fine-tuning. The goal
was to have many diverse models for ensembling rather than some highly tuned ones. In the
ensemble stage, our strategy was to choose the best combination that we thought could avoid
overfitting, and to spend the rest of the time fitting the public leaderboard. In the remainder of this
section, we introduce these two competitions once more.
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
shenyang 98@njust.edu.cn (Y. Shen); sunxh@njust.edu.cn (X. Sun); zhuzijian@njust.edu.cn (Z. Zhu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
   The SnakeCLEF competition aims at building an automatic and robust image-based system
for snake species identification, which is an important goal for biodiversity, conservation, and
global health. With recent estimates of 81,410–137,880 deaths and up to three times as many
victims of amputations, permanent disability and disfigurement (globally each year) caused by
venomous snakebite, such a system has the potential to improve eco-epidemiological data and
treatment outcomes (e.g., based on the specific use of antivenoms). This applies especially in
remote geographic areas and developing countries, where automatic snake species identification
has the greatest potential to save lives [5, 1, 6].
   The FungiCLEF competition focuses on recognizing fungi in the open world. Automatic
recognition of fungi species assists mycologists, citizen scientists and nature enthusiasts in
species identification in the wild. Its availability supports the collection of valuable biodiversity
data. In practice, species identification typically does not depend solely on the visual observation
of the specimen but also on other information available to the observer — such as habitat,
substrate, location and time. Thanks to rich metadata, precise annotations, and baselines
available to all competitors, the challenge provides a benchmark for image recognition with
the use of additional information. Moreover, the toxicity of a mushroom can be crucial for the
decision of a mushroom picker [7, 1, 6].


2. Datasets and Evaluation Protocol
2.1. Dataset for SnakeCLEF




Figure 1: Class distribution of the training set of SnakeCLEF. The head class (most samples) is illustrated by Natrix natrix, and the tail classes (most classes) by species such as Pseudonaja guttata.


  For this year's challenge, the organizers prepared a dataset based on 187,129 snake observations
with 318,532 photographs belonging to 1,572 snake species observed in 208 countries. The
data were gathered from the online biodiversity platform – iNaturalist.1
  The provided dataset has a heavy long-tailed class distribution, where the most frequent
species is represented by 6,472 images and the least frequent species by just 5 samples [5].

2.2. Dataset for FungiCLEF

Figure 2: Class distribution of the training set of FungiCLEF. The head class (most samples) is illustrated by Trametes versicolor (L.) Lloyd, and the tail classes (most classes) by species such as Podosphaera aphanis (Wallr.) U. Braun & S. Takam.


  The FungiCLEF challenge dataset is based on the data from the Danish Fungi 2020 dataset [8],
which contains 295,938 training images belonging to 1,604 species observed mostly in Denmark.
All training samples passed an expert validation process, guaranteeing high-quality labels. Rich
observation metadata about habitat, substrate, time, location, and EXIF are provided.
  The test set contains 59,420 observations with 118,676 images and 3,134 species, covering the
whole year and including observations collected across all substrate and habitat types [7].

2.3. Evaluation Protocol
The evaluation metric for both competitions is the Mean (Macro) F1-Score:

\[ \text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \text{F1}_i \,, \qquad (1) \]

where $i$ is the species index and $N$ is the number of classes/species. The F1 score, commonly
used in information retrieval, measures accuracy using the statistics precision (P) and recall (R).
   1
       www.inaturalist.org
           The macro F1 score is not biased by class frequencies and is more suitable for the long-tailed
           class distributions observed in nature. Precision is the ratio of true positives (TP) to all predicted
           positives (TP + FP). The Recall is the ratio of true positives (TP) to all actual positives (TP +
           FN). The F1 metric weights recall and precision equally, and a good retrieval algorithm will
           maximize both precision and recall simultaneously. Thus, moderately good performance on
           both will be favoured over extremely good performance on one and poor performance on the
           other.
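
To make the metric concrete, the following minimal sketch computes the Macro F1-Score of Eq. (1) from integer label vectors (the organizers' official scoring script may of course differ in details such as the handling of the unknown class):

import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Mean of per-class F1 scores, cf. Eq. (1)."""
    f1_scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom > 0 else 0.0)
    return float(np.mean(f1_scores))

# Toy example with three species: only class 0 is predicted correctly.
y_true = np.array([0, 0, 1, 2])
y_pred = np.array([0, 0, 2, 1])
print(macro_f1(y_true, y_pred, num_classes=3))  # 0.333...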


           3. Methods
           3.1. Proposed CoLKANet
Figure 3: Decomposition diagram of large-kernel convolution [9]. 𝐻, 𝑊, 𝐶 represent the height,
width and channel of a tensor. A standard convolution can be decomposed into three parts: a depth-wise
convolution (DW-Conv), a depth-wise dilation convolution (DW-D-Conv) and a 1 × 1 convolution (1 × 1
Conv). The colored grids represent the location of the convolution kernel and the yellow grid marks the
center point. The diagram shows that a 13 × 13 convolution is decomposed into a 5 × 5 depth-wise
convolution, a 5 × 5 depth-wise dilation convolution with dilation rate 3 and a 1 × 1 convolution. Note:
zero paddings are omitted in the above figure.

   We first introduce the CoLKANet. It is a combination of large kernel attention (cf. Fig. 3) and
vision transformer. Since the breakthroughs of AlexNet [10] and ResNet [11], Convolutional Neural
Networks (CNNs) have been the dominating model architecture for computer vision. Meanwhile, with
the success of self-attention models in natural language processing, many previous works have attempted
to bring the power of attention into computer vision: when pre-trained on the large-scale weakly labeled
JFT-300M dataset, ViT can achieve results comparable to state-of-the-art CNNs. This year, researchers
revisited large kernel design in CNNs and found that using a few large convolutional kernels instead of
a stack of small kernels can be a more powerful paradigm [12]. Previous research also shows that early
convolutions help transformers see better. Therefore, following this idea, we combine large kernel
attention with vision transformer: we use the large kernel attention structure of VAN [9] and, following
CoAtNet [13], use it to replace the earlier CNN stages. The overall structure of the CoLKANet is shown
in Fig. 4.




Figure 4: The proposed structure of the CoLKANet. (a) Overall structure; (b) Large Kernel Attention; (c) Self-attention.
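
As a reference for the reader, below is a minimal PyTorch sketch of a large kernel attention block following the decomposition of Fig. 3 (a 5 × 5 depth-wise convolution, a 5 × 5 depth-wise dilated convolution with dilation 3, and a 1 × 1 convolution whose output re-weights the input feature); the module and argument names are ours and the released code may differ:

import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Large kernel attention following the decomposition in Fig. 3."""
    def __init__(self, dim):
        super().__init__()
        # 5x5 depth-wise convolution (local context)
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # 5x5 depth-wise dilated convolution with dilation 3 (long-range context)
        self.dw_dilated = nn.Conv2d(dim, dim, kernel_size=5, padding=6,
                                    groups=dim, dilation=3)
        # 1x1 convolution (channel adaptability)
        self.pw_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        attn = self.pw_conv(self.dw_dilated(self.dw_conv(x)))
        return x * attn  # the attention map re-weights the input feature

# Example: a 64-channel feature map keeps its spatial size.
feat = torch.randn(2, 64, 56, 56)
print(LargeKernelAttention(64)(feat).shape)  # torch.Size([2, 64, 56, 56])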


3.2. Other Classification Models
For other classification models, we have tried EfficientNet [14], RepVGG [15], EfficientNet-
V2 [16], Swin Transformer [17], VOLO [18], ViTAE [19], ViT [20] and ConvNeXt [21]. We
choose the combination of Swin Transformer, VOLO, ConvNeXt, ViT and our CoLKANet as
the final solution. We found that CNNs are more likely to overfit in these competitions (except
ConvNeXt), while Transformers seem to be more stable. However, raising the input resolution of
ConvNeXt and EfficientNet yields a substantial improvement.


4. Details and Tricks
In this section, we report the details for different backbones and describe all the tricks that we
have used to generate the final submission. We also report the scores we have recorded on both the
public leaderboard and the private leaderboard.

4.1. Details for Vision Transformers
For different competitions, we use different settings for vision transformers.

SnakeCLEF. We use ViT, Swin Transformer, VOLO and the proposed CoLKANet as backbones.
We calculate the 5-fold cross validation accuracy while training with the image resolution of
384 × 384 for Swin Transformer and 448 × 448 for VOLO. We trained the ViT and CoLKANet
by using the whole dataset with the image resolution of 384 × 384. We choose 1.2 × 10−4 as the
initial learning rate and 10−5/7 as the minimum learning rate, and set the weight decay to 2 × 10−5.
A good technique to reduce overfitting is to stop the model from becoming overconfident. This
can be achieved by softening the ground truth using Label Smoothing [22]. We set the Label
Smoothing value to 0.1 according to the original paper [22]. All these backbones are trained for
15 epochs without warmup. The AdamW [23] optimizer is utilized for training.
FungiCLEF. We use Swin Transformer, VOLO and the proposed CoLKANet as backbones.
We calculate the 5-fold cross validation accuracy while training with the image resolution of
384 × 384 for Swin Transformer while 448 × 448 for VOLO. We trained the CoLKANet by using
the whole dataset with the image resolution of 384 × 384. We choose 1.2 × 10−4 as the initial
learning rate and 10−5/7 as the minimum learning rate, and set the weight decay to 2 × 10−5.
Label Smoothing is set to 0.1. All these backbones are trained for 15 epochs without warmup. The
AdamW [23] optimizer is utilized for training.
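
The shared recipe above can be sketched roughly as follows with timm and PyTorch (data loading and augmentation are omitted; the cosine decay towards the stated minimum learning rate is our assumption, as the exact schedule for the transformers is not spelled out):

import timm
import torch
from torch import nn, optim

# One of the checkpoints listed in Section 5.2; 1,572 classes for SnakeCLEF.
model = timm.create_model("swin_large_patch4_window12_384_in22k",
                          pretrained=False, num_classes=1572)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                 # Label Smoothing 0.1
optimizer = optim.AdamW(model.parameters(), lr=1.2e-4, weight_decay=2e-5)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15,
                                                 eta_min=1e-5 / 7)   # 15 epochs, no warmup

# One dummy optimization step to show the shape of the training loop.
images, labels = torch.randn(2, 3, 384, 384), torch.randint(0, 1572, (2,))
optimizer.zero_grad()
criterion(model(images), labels).backward()
optimizer.step()
scheduler.step()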

4.2. Details for CNNs
We only use ConvNeXt as the CNN backbone in the final submission for both SnakeCLEF and
FungiCLEF. We calculate the 5-fold cross validation scores while training with the resolution of
384 × 384. We also train a single model which uses the whole dataset with the image resolution
of 448 × 448. We choose 1.2 × 10−4 as the maximum learning rate and 10−5/7 as the minimum
learning rate, and set the weight decay to 2 × 10−5. Label Smoothing is set to 0.1. We apply warmup,
gradually increasing the learning rate for 3 epochs. Then, a Cosine Schedule is applied to adjust the
learning rate during the following 15 epochs. The AdamW [23] optimizer is utilized for training.
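
A sketch of the warmup-then-cosine schedule used for ConvNeXt (3 warmup epochs followed by 15 cosine-decay epochs); SequentialLR is our choice of implementation, not a detail stated above:

import torch
from torch import nn, optim

model = nn.Linear(10, 2)                     # stand-in for the ConvNeXt backbone
optimizer = optim.AdamW(model.parameters(), lr=1.2e-4, weight_decay=2e-5)

warmup = optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1,
                                     end_factor=1.0, total_iters=3)    # 3 warmup epochs
cosine = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15,
                                              eta_min=1e-5 / 7)        # 15 cosine epochs
scheduler = optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, cosine],
                                            milestones=[3])

for epoch in range(18):
    # ... one training epoch would go here ...
    scheduler.step()
    print(epoch, optimizer.param_groups[0]["lr"])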

4.3. Loss Functions
As the evaluation metric is Macro F1-Score, we have to deal with the long-tailed problem (if the
evaluation metric were Micro F1-Score, adopting the original Cross Entropy Loss would be enough).
Because the head classes contain many more training examples, the network makes the weight norms of
the head classes larger to approach the optimal solution. This results in predicted probabilities
mainly near 1.0. Another fact is that the distributions of predicted probability are related to instance
numbers. Unlike balanced recognition, applying different strategies to these classes is necessary
for solving the long-tailed problem. Therefore, we adopt label-aware smoothing [24] to address both
the over-confidence of cross-entropy and the varying distributions of predicted probability
for both SnakeCLEF and FungiCLEF. It is expressed as:
\[ l(q, p) = -\sum_{i=1}^{K} q_i \log p_i \,, \qquad q_i = \begin{cases} 1 - \epsilon_y = 1 - f(N_y), & i = y, \\ \dfrac{\epsilon_y}{K-1} = \dfrac{f(N_y)}{K-1}, & \text{otherwise}, \end{cases} \qquad (2) \]

where 𝜖𝑦 is a small label smoothing factor for Class-𝑦, related to its instance number 𝑁𝑦 . Details
about label-aware smoothing can be found in the original paper [24].
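
A minimal sketch of Eq. (2) in PyTorch is given below; head classes (large 𝑁𝑦) receive a larger smoothing factor than tail classes, and the linear form of 𝑓(𝑁𝑦) together with the default values of eps_head and eps_tail are our assumptions, as [24] studies several variants:

import torch
import torch.nn.functional as F

def label_aware_smoothing_loss(logits, targets, class_counts,
                               eps_head=0.1, eps_tail=0.0):
    """Cross entropy with a per-class smoothing factor eps_y = f(N_y), cf. Eq. (2)."""
    num_classes = logits.size(1)
    n = class_counts.float()
    # Linear f(N_y): the most frequent class gets eps_head, the rarest gets eps_tail.
    eps = eps_tail + (eps_head - eps_tail) * (n - n.min()) / (n.max() - n.min())
    eps_y = eps[targets]                                      # per-sample smoothing factor

    log_probs = F.log_softmax(logits, dim=1)
    q = log_probs.new_full(log_probs.shape, 1.0) * (eps_y / (num_classes - 1)).unsqueeze(1)
    q.scatter_(1, targets.unsqueeze(1), (1.0 - eps_y).unsqueeze(1))
    return -(q * log_probs).sum(dim=1).mean()

# Toy long-tailed example with 4 classes.
logits = torch.randn(3, 4)
targets = torch.tensor([0, 1, 3])
counts = torch.tensor([1000, 200, 50, 5])
print(label_aware_smoothing_loss(logits, targets, counts))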
   It is also worth noting that if we increase the weights of some tail classes and head classes, we
get a higher score on the public leaderboard (but this is useless on the private leaderboard). We explain
this phenomenon in Section 6.1.

4.4. Important Tricks
Below are 8 important tricks we used during both SnakeCLEF and FungiCLEF.
   ∙ Augmentation: Augmentation is very important during the training stage. At the beginning of the
competitions we tried combinations of RandomResizedCrop, Transpose, HorizontalFlip, VerticalFlip,
ShiftScaleRotate, IAAPiecewiseAffine, HueSaturationValue, RandomBrightnessContrast, OpticalDistortion,
GridDistortion, ElasticTransform, Cutout and CoarseDropout as implemented in Albumentations [25].
During the middle stage, we replaced them with TrivialAugment [26] and gained around 0.5% improvement
for each single model. We also used Random Erasing [27], CutMix [28] and Mixup [29] throughout the
competitions. They provide strong regularization effects by softening not only the labels but also the
images (a sketch of this pipeline is given at the end of this section).
   ∙ Confusion Matrix: A typical way to analyze model performance is using the confusion
matrix. As we split the training images into 5 equal parts (i.e., 5-fold), we get the confusion
matrix from the validation part when training those 5-fold models (we did not use the confusion matrix
for models trained on the whole dataset). This trick gains around 0.2% improvement when
ensembling different models.
   ∙ Normalization of Output: Besides the confusion matrix, we hoped to let each model vote (e.g.,
Models A, B and C predict that the image belongs to categories 1, 2 and 1 respectively, so we finally set
the image as category 1), but this performs poorly. So we instead adopted another trick, i.e., scaling the
model output by the inverse of its maximum value raised to the power 𝛼; 𝛼 cannot be one, since that
would normalize every maximum to 1 and be too close to the voting strategy. The formula is as follows:

\[ \text{Norm}(f(x)) = \left( \frac{1}{\max(f(x))} \right)^{\alpha} f(x) \,. \qquad (3) \]

We find that the ensemble performs well when 𝛼 is 0.15 or 0.20. This trick gains around 0.1%
improvement when ensembling different models (a sketch is given at the end of this section).
   ∙ Test Time Augmentation: This is a very important trick throughout the competitions. However,
it may lead to overfitting. Performing this trick is very easy: just crop each test image around
8 to 13 times and calculate the average score. It gains around 0.6% improvement for each single
model but, unfortunately, it did not work on the private leaderboard in SnakeCLEF.
   ∙ Pseudo Labelling: We only perform pseudo labelling for SnakeCLEF. This trick did not
work in FungiCLEF (we only tried it once and it did not work on the public leaderboard; this may
be caused by the openset problem). We generate pseudo labels on the test dataset only for tail
categories (fewer than 100 training images) with clustering methods, and fine-tune the model on both
the training data and these test data. It gains around 0.9% improvement on the public leaderboard but
is useless on the private leaderboard.
   ∙ Weight Decay Tuning: Our standard recipe uses ℓ2 regularization to reduce overfitting.
The Weight Decay parameter controls the degree of the regularization (the larger the stronger)
and is applied universally to all learned parameters of the model by default [30]. More about
separating the normalization parameters from the rest can be found in ClassyVision [31].
   ∙ Exponential Moving Average (EMA): EMA [32] is a technique that allows one to push
the accuracy of a model without increasing its complexity or inference time. It performs an
exponential moving average on the model weights and this leads to increased accuracy and
more stable models. The averaging happens every few iterations and its decay parameter was
tuned via grid search.
   ∙ FixRes mitigations: This is a very important trick for CNNs (Transformers use a fixed image size,
so it cannot be applied there), and we only use it for ConvNeXt. The model performs significantly better if
the resolution used during validation is increased over the training size. This effect is studied
in detail in the FixRes paper [33], and two mitigations are proposed: a) one could try to reduce
the training resolution so that the accuracy at the validation resolution is maximized, or b) one
could fine-tune the model in a two-phase training so that it adjusts to the target resolution.
Another very important phenomenon is that if we improve the resolution of the training images,
we easily gain a better score. The image scaling ratio is set to 0.758 and 0.875 according
to the experiments in the original papers [33] and [30]. It gains around 0.8% improvement in
SnakeCLEF but only 0.2% improvement in FungiCLEF.
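
For completeness, the two sketches referenced above follow. First, the augmentation pipeline after switching to TrivialAugment, shown here with torchvision transforms and timm's Mixup/CutMix helper (the Mixup hyper-parameters are assumptions, not values we reported):

import torch
from torchvision import transforms
from timm.data import Mixup

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(384),
    transforms.RandomHorizontalFlip(),
    transforms.TrivialAugmentWide(),       # replaces the hand-tuned Albumentations list
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),      # Random Erasing [27]
])

# CutMix [28] / Mixup [29] applied on whole batches; the alphas below are assumed defaults.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1572)

images = torch.randn(8, 3, 384, 384)
labels = torch.randint(0, 1572, (8,))
mixed_images, soft_labels = mixup_fn(images, labels)
print(mixed_images.shape, soft_labels.shape)   # (8, 3, 384, 384) (8, 1572)

Second, the output normalization of Eq. (3), applied per model before summing the ensemble scores (whether the scores are softmax probabilities or raw logits is not fixed here, so treat this as a sketch):

import torch

def normalize_output(scores, alpha=0.15):
    """Eq. (3): scale each score vector by (1 / max(f(x)))**alpha."""
    max_score = scores.max(dim=1, keepdim=True).values
    return (1.0 / max_score) ** alpha * scores

# Two hypothetical models with differently scaled outputs for one image.
scores_a = torch.tensor([[8.0, 1.0, 0.5]])
scores_b = torch.tensor([[0.9, 2.5, 0.1]])
ensemble = normalize_output(scores_a) + normalize_output(scores_b)
print(ensemble.argmax(dim=1))   # the ensembled prediction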


5. Main Results
5.1. Generating the Final Submission
It is very easy for us to generate the final csv file. Specifically, we generate only one prediction
for each observation. If one observation has several different test images, we use the model
to calculate the predicted value of each image and then calculate the average value for that
observation.
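
A minimal sketch of this per-observation averaging with pandas (the column names and the per-image probability vectors are hypothetical placeholders):

import numpy as np
import pandas as pd

# One row per test image; 'probs' holds the predicted class-probability vector.
df = pd.DataFrame({
    "observation_id": [101, 101, 102],
    "probs": [np.array([0.7, 0.3]), np.array([0.4, 0.6]), np.array([0.2, 0.8])],
})

# Average the per-image predictions of each observation, then take the argmax.
submission = (
    df.groupby("observation_id")["probs"]
      .apply(lambda p: int(np.mean(np.stack(list(p)), axis=0).argmax()))
      .rename("class_id")
      .reset_index()
)
submission.to_csv("submission.csv", index=False)
print(submission)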

Table 1
Results on SnakeCLEF.

  Backbone                     Train Resolution    Score        Comments
  Swin baseline                384×384             75.6%        Without any tricks.
  ViT                          384×384             79.8%        Pretrained model from MAE [34].
  Swin single model            384×384             77.4±0.5%    5-fold single model.
  Swin 5-fold ensemble         384×384             80.7%        All folds trained with LabelAwareSmoothing and CrossEntropy.
  VOLO single model            448×448             78.7±0.7%    5-fold single model.
  VOLO 5-fold + Swin 5-fold    -                   82.6%        All VOLO models trained with LabelAwareSmoothing.
  CoLKANet                     384×384             80.1%        Trained with LabelAwareSmoothing.
  ConvNeXt single model        384×384             77.9±0.8%    5-fold single model.
  ConvNeXt 5-fold ensemble     384×384             81.4%        Folds 0, 2, 4 trained with LabelAwareSmoothing; folds 1, 3 with CrossEntropy.
  ConvNeXt + CoLKANet          -                   83.9%        ConvNeXt 448 without FixRes + 5-fold CoLKANet 384 + CoLKANet; weight ratio 2:2:1.
  Ensemble                     -                   85.4%        ConvNeXt 448 + 5-fold ConvNeXt 384 + 5-fold VOLO + CoLKANet + 5-fold Swin + ViT, with all the tricks in Section 4.4.
5.2. SnakeCLEF
In this section, we report the scores we recorded in SnakeCLEF. Details can be found in
Table 1. Except for ViT and VOLO, all the pretrained models are from ImageNet-22k [35]. Each
backbone can be found in timm [36]: Swin refers to swin_large_patch4_window12_384_in22k,
ConvNeXt refers to convnext_large_in22k, and VOLO refers to volo_d4_448.
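
These checkpoints can be instantiated directly from timm; a brief sketch (the classifier head is replaced to cover the 1,572 SnakeCLEF species):

import timm

backbones = {
    "Swin": "swin_large_patch4_window12_384_in22k",
    "ConvNeXt": "convnext_large_in22k",
    "VOLO": "volo_d4_448",
}
models = {name: timm.create_model(ckpt, pretrained=True, num_classes=1572)
          for name, ckpt in backbones.items()}
print({name: sum(p.numel() for p in m.parameters()) for name, m in models.items()})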

5.3. FungiCLEF
In this section, we report the scores we recorded in FungiCLEF. Details can be found in
Table 2. Except for VOLO, all the pretrained models are from ImageNet-22k [35]. It is worth mentioning
that in this competition, combining models trained with LabelAwareSmoothing (i.e., dealing with the
long-tailed problem) and models trained with CrossEntropy (i.e., not dealing with the long-tailed
problem) brings around a 1.6% improvement over each single model.

Table 2
Results on FungiCLEF.

  Backbone                     Train Resolution    Score        Comments
  Swin baseline                384×384             71.7%        Without any tricks.
  Swin single model            384×384             73.4±0.5%    5-fold single model.
  Swin 5-fold ensemble         384×384             76.5%        All folds trained with LabelAwareSmoothing and CrossEntropy.
  VOLO single model            448×448             73.8±0.8%    5-fold single model.
  VOLO 5-fold ensemble         448×448             76.2%        Folds 0, 2, 4 trained with LabelAwareSmoothing; folds 1, 3 with CrossEntropy.
  CoLKANet                     384×384             75.1%        Trained with LabelAwareSmoothing.
  ConvNeXt single model        384×384             73.6±0.5%    5-fold single model.
  ConvNeXt 5-fold ensemble     384×384             76.1%        All folds trained with LabelAwareSmoothing and CrossEntropy.
  ConvNeXt                     448×448             75.7%        Trained with LabelAwareSmoothing.
  Ensemble                     -                   78.9%        ConvNeXt 448 + 5-fold ConvNeXt 384 + 5-fold VOLO + CoLKANet + 5-fold Swin, with all the tricks in Section 4.4.




6. Discussions
6.1. Gap Between the Public Leaderboard and Private Leaderboard.
First of all, at the beginning of these two competitions, the most important thing is to guess
the data distribution of the test dataset. Based on our experience, we guessed that the data
distribution of the test dataset should be roughly the same as that of the provided train/val dataset,
and the scores for the 20% of the data shown on the public leaderboard should also fit the
distribution of the test dataset (cf. Fig. 5 (b)). However, during these two competitions, we found
that if we artificially increased the weight of some categories (especially on SnakeCLEF, which did
not suffer from OSR), we would get a better score on the public leaderboard. These categories
were chosen at random, and the best random state gained around 1% improvement on the public
leaderboard. At the outset, we attributed this to a mismatch in the number of some tail categories
between the training dataset and the test dataset.

Figure 5: Data distribution of SnakeCLEF. (a) A sample distribution of the original training data. (b)
Our guess about the data distribution of the test dataset on the public leaderboard. (c) The actual
distribution of the test dataset on the public leaderboard. The blue dots represent the number of samples
contained in a category while the red dots indicate that the number of samples is 0.
   Towards the end of these two competitions, we made a rather important discovery. In
the FungiCLEF competition, if we submit a csv with all labels set to −1, we get a score of
about 0.0005 on the public leaderboard. As the FungiCLEF competition contains 1,604 categories,
we derived this score backwards by using the formula for Macro-F1 and concluded that the
whole dataset should contain about 60% to 70% openset images (i.e., label = −1).
   For example, assuming a total of 59,420 results are to be submitted, of which 34,000
have a ground truth of −1 and the other roughly 25,000 samples lie between 0 and 1,603, a submission
with all −1's has a macro F1 value of approximately 0.00045. We added this threshold to our
earlier submissions and, not surprisingly, the score dropped dramatically. Obviously this is an
unreasonable result. Therefore, we made a wild guess: the 20% of the data on the public leaderboard
is obtained by randomly selecting 20% of the 1,604 categories in FungiCLEF to calculate the
score (cf. Fig. 5 (c)). The same goes for SnakeCLEF.
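
The 0.00045 figure above can be reproduced with a quick back-of-the-envelope computation under the counts assumed in the example (the −1 class is treated as one additional class besides the 1,604 known species):

total = 59_420                    # submitted observations
unknown = 34_000                  # assumed number of true -1 (open-set) observations
num_classes = 1_604 + 1           # known species plus the -1 class

# Submitting -1 everywhere: only the -1 class obtains a non-zero F1.
precision = unknown / total       # all predictions are -1, ~57% of them correct
recall = 1.0                      # every true -1 is retrieved
f1_unknown = 2 * precision * recall / (precision + recall)

print(round(f1_unknown / num_classes, 5))   # ~0.00045, the observed public score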
   In this setting, we calculate that the openset images in the whole dataset should be around 10%.
However, the overall strategy of fitting the public leaderboard seems to have overfitted. The fungi
competition, on the other hand, did not show a significant reduction on the private leaderboard
after adjusting an appropriate threshold, as the −1 category has a far greater effect than the
other categories.
6.2. Only A Threshold Strategy is Sufficient for the Openset Problem.
The ability to identify whether or not a test sample belongs to one of the semantic classes
in a classifier’s training set is critical to practical deployment of the model. Sagar Vaze et
al. [37] demonstrated that the ability of a classifier to make the ‘none-of-above’ decision is
highly correlated with its accuracy on the closed-set classes. They also use this correlation
to boost the performance of the maximum softmax probability OSR ‘baseline’ by improving
its closed-set accuracy and with this strong baseline achieve state-of-the-art on a number of
OSR benchmarks. Therefore, we only use a simple threshold for the FungiCLEF competition.
We also tried post-processing with metadata (by using MetaFormer [38]), but it was useless in this
competition.


7. Conclusion
Fine-grained image recognition is an important problem in computer vision. Combined with
the long-tailed problem and the openset problem, the SnakeCLEF and the FungiCLEF become
more challenging. In this paper, we report the advanced techniques we used to deal with
these challenges. By combining two recent hot topics in computer vision, i.e., the large kernel
and the vision transformer, we also construct a new model named CoLKANet. For the SnakeCLEF
competition, our team achieves an 85.4% Macro F1-Score on the private leaderboard. For the
FungiCLEF competition, our team achieves a 78.9% Macro F1-Score on the private leaderboard.


References
 [1] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso,
     I. Bolon, et al., LifeCLEF 2022 teaser: An evaluation of machine-learning based species
     identification and species distribution prediction, in: European Conference on Information
     Retrieval, Springer, 2022, pp. 390–399.
 [2] A. Joly, H. Goëau, H. Glotin, C. Spampinato, P. Bonnet, W.-P. Vellinga, J.-C. Lombardo,
     R. Planqué, S. Palazzo, H. Müller, Lifeclef 2017 lab overview: multimedia species identifica-
     tion challenges, in: International conference of the cross-language evaluation forum for
     European languages, Springer, 2017, pp. 255–274.
 [3] D. Casanova, J. B. Florindo, W. N. Gonçalves, O. M. Bruno, Ifsc/usp at imageclef 2012:
     Plant identification task., in: CLEF (Online Working Notes/Labs/Workshop), 2012.
 [4] H. Goëau, P. Bonnet, A. Joly, Plant identification in an open-world (lifeclef 2016), in: CLEF:
     Conference and Labs of the Evaluation Forum, 1609, 2016, pp. 428–439.
 [5] L. Picek, A. M. Durso, M. Hrúz, I. Bolon, Overview of SnakeCLEF 2022: Automated snake
     species identification on a global scale, in: Working Notes of CLEF 2022 - Conference and
     Labs of the Evaluation Forum, 2022.
 [6] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso,
     H. Glotin, R. Planqué, W.-P. Vellinga, A. Navine, H. Klinck, T. Denton, I. Eggel, P. Bonnet,
     M. Šulc, M. Hruz, Overview of LifeCLEF 2022: an evaluation of machine-learning based
     species identification and species distribution prediction, in: International Conference of
     the Cross-Language Evaluation Forum for European Languages, Springer, 2022.
 [7] L. Picek, M. Šulc, J. Heilmann-Clausen, J. Matas, Overview of FungiCLEF 2022: Fungi
     recognition as an open set classification problem, in: Working Notes of CLEF 2022 -
     Conference and Labs of the Evaluation Forum, 2022.
 [8] L. Picek, M. Šulc, J. Matas, T. S. Jeppesen, J. Heilmann-Clausen, T. Læssøe, T. Frøslev,
     Danish fungi 2020-not just another image recognition dataset, in: Proceedings of the
     IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 1525–1535.
 [9] M.-H. Guo, C.-Z. Lu, Z.-N. Liu, M.-M. Cheng, S.-M. Hu, Visual attention network, arXiv
     preprint arXiv:2202.09741 (2022).
[10] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional
     neural networks, Advances in neural information processing systems 25 (2012).
[11] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Pro-
     ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp.
     770–778.
[12] X. Ding, X. Zhang, Y. Zhou, J. Han, G. Ding, J. Sun, Scaling up your kernels to 31x31:
     Revisiting large kernel design in CNNs, arXiv preprint arXiv:2203.06717 (2022).
[13] Z. Dai, H. Liu, Q. V. Le, M. Tan, Coatnet: Marrying convolution and attention for all data
     sizes, Advances in Neural Information Processing Systems 34 (2021) 3965–3977.
[14] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks,
     in: International Conference on Machine Learning, PMLR, 2019, pp. 6105–6114.
[15] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, J. Sun, RepVGG: Making VGG-style convnets
     great again, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
     Recognition, 2021, pp. 13733–13742.
[16] M. Tan, Q. V. Le, EfficientNetv2: Smaller models and faster training, arXiv preprint
     arXiv:2104.00298 (2021).
[17] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical
     vision transformer using shifted windows, arXiv preprint arXiv:2103.14030 (2021).
[18] L. Yuan, Q. Hou, Z. Jiang, J. Feng, S. Yan, VOLO: Vision outlooker for visual recognition,
     arXiv preprint arXiv:2106.13112 (2021).
[19] Y. Xu, Q. Zhang, J. Zhang, D. Tao, ViTAE: Vision transformer advanced by exploring
     intrinsic inductive bias, Advances in Neural Information Processing Systems 34 (2021).
[20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De-
     hghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth
     16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929
     (2020).
[21] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s,
     arXiv preprint arXiv:2201.03545 (2022).
[22] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture
     for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and
     Pattern Recognition, 2016, pp. 2818–2826.
[23] I. Loshchilov, F. Hutter, Fixing weight decay regularization in adam (2018).
[24] Z. Zhong, J. Cui, S. Liu, J. Jia, Improving calibration for long-tailed recognition, in:
     Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
     2021, pp. 16489–16498.
[25] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, A. A. Kalinin,
     Albumentations: Fast and flexible image augmentations, Information 11 (2020). URL:
     https://www.mdpi.com/2078-2489/11/2/125. doi:10.3390/info11020125.
[26] S. G. Müller, F. Hutter, Trivialaugment: Tuning-free yet state-of-the-art data augmentation,
     in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp.
     774–782.
[27] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, in:
     Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 13001–
     13008.
[28] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, CutMix: Regularization strategy to train
     strong classifiers with localizable features, in: Proceedings of the IEEE/CVF International
     Conference on Computer Vision, 2019, pp. 6023–6032.
[29] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimiza-
     tion, arXiv preprint arXiv:1710.09412 (2017).
[30] V. Vryniotis, How to train state-of-the-art models using torchvision's latest primitives, 2021.
     https://pytorch.org/blog/how-to-train-state-of-the-art-models-using-torchvision-latest-primitives/.
[31] A. Adcock, V. Reis, M. Singh, Z. Yan, L. van der Maaten, K. Zhang, S. Motwani, J. Guerin,
     N. Goyal, I. Misra, L. Gustafson, C. Changhan, P. Goyal, Classy vision, https://github.com/
     facebookresearch/ClassyVision, 2019.
[32] F. Klinker, Exponential moving average versus moving exponential average, Mathematis-
     che Semesterberichte 58 (2011) 97–107.
[33] H. Touvron, A. Vedaldi, M. Douze, H. Jégou, Fixing the train-test resolution discrepancy,
     Advances in neural information processing systems 32 (2019).
[34] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision
     learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
     Recognition, 2022, pp. 16000–16009.
[35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical
     image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition,
     Ieee, 2009, pp. 248–255.
[36] R. Wightman, Pytorch image models, https://github.com/rwightman/pytorch-image-models, 2019.
     doi:10.5281/zenodo.4414861.
[37] S. Vaze, K. Han, A. Vedaldi, A. Zisserman, Open-set recognition: A good closed-set classifier
     is all you need, arXiv preprint arXiv:2110.06207 (2021).
[38] Q. Diao, Y. Jiang, B. Wen, J. Sun, Z. Yuan, Metaformer: A unified meta framework for
     fine-grained recognition, arXiv preprint arXiv:2203.02751 (2022).