=Paper=
{{Paper
|id=Vol-3180/paper-180
|storemode=property
|title=Solution for SnakeCLEF 2022 by Tackling Long-tailed Categorization
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-180.pdf
|volume=Vol-3180
|authors=Lingfeng Yang,Xiang Li,Renjie Song,Kexin Zhu,Gang Li
|dblpUrl=https://dblp.org/rec/conf/clef/YangLSZL22
}}
==Solution for SnakeCLEF 2022 by Tackling Long-tailed Categorization==
<pdf width="1500px">https://ceur-ws.org/Vol-3180/paper-180.pdf</pdf>
<pre>
Solution for SnakeCLEF 2022 by Tackling Long-tailed
Categorization
Lingfeng Yang1,2 , Xiang Li3 , Renjie Song2 , Kexin Zhu2 and Gang Li1
1
  Nanjing University of Science and Technology, China
2
  Megvii Technology, China
3
  Nankai University, China


                                         Abstract
                                         SnakeCLEF 2022 is a fine-grained image classification benchmark for snake identification. Recently, the
                                         masked autoencoder (MAE) has shown superior performance on fine-grained image classification tasks.
                                         As a result, we use the MAE pretrained ViT models and refine them on the SnakeCLEF 2022. Overall, the
                                         learning process contains two difficulties: 1) dealing with fine-grained species that are visually similar and
                                         2) a long-tailed distribution. To address these issues, we propose using statistic-aware post-processing
                                         to process the metadata and refine image predictions. Next, we improve an effective logit adjustment
                                         loss (ELAL) to alleviate the classification bias toward the head class. Notably, we achieve 2nd place
                                         on the SnakeCLEF 2022 benchmark with a 0.84565 top F1 score. Codes and models are available at
                                         https://github.com/ylingfeng/snakeclef2022_fgvc9.

                                         Keywords
                                         SnakeCLEF, Fine-grained image classification, Masked autoencoder, Metadata, Long-tailed distribution


1. Introduction
Fine-grained visual categorization [1, 2, 3, 4, 5, 6, 7] is a popular task to identify fine categories
out of coarse divisions. Recently, there is an increasing necessity to develop a fine-grained
visual categorization algorithm for various species of snakes for biodiversity, conservation, and
global health. The SnakeCLEF 2022 benchmark1 [8] aims to tackle this requirement, which is
held by LifeCLEF [9, 10] jointly with FGVC92 of the CVPR 2022.
   The difficulty in fine-grained snake identification lies in the high intra-class and low inter-
class differences in appearance, and many species are visually similar to others. Moreover, the
species distribution in terms of geographical location is irregular, and some countries (e.g., US)
contain hundreds of species while some (e.g., Vatican) have only a few types. In addition, the
dataset suffers from a severe long-tailed problem in which two-thirds of categories contain less
than 100 instances.
   In terms of the above problems, we propose to solve them individually. First, as for the visually
similar samples which are confusing for image-only predictions, we utilize the metadata [11, 12,

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
$ yanglfnjust@njust.edu.cn (L. Yang); xiang.li.implus@qq.com (X. Li); songrenjie@megvii.com (R. Song);
zhukexin@megvii.com (K. Zhu); gang.li@njust.edu.cn (G. Li)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings         CEUR Workshop Proceedings (CEUR-WS.org)
                  http://ceur-ws.org
                  ISSN 1613-0073


                  1
                    https://www.kaggle.com/competitions/snakeclef2022/
                  2
                    https://sites.google.com/view/fgvc9
13, 14, 15, 16] provided in the dataset to form a prior distribution of whole species. Different
from previous multi-modal methods which embed the metadata to the feature space, we design
a parameter-free post-processing structure to refine the predictions. To be specific, we record
the number of occurrences of metadata corresponding to each species as the priors. More details
can be found in Sec. 4.1. Secondly, in Sec. 4.2 we propose the effective logit adjustment loss
(ELAL) to alleviate the prediction bias along with training the long-tailed samples by increasing
the optimization weight of the tailed classes while reducing the head.
   Our contributions can be summarized as:

    • We improve a new way to process the metadata by recording statistics referring to each
      category and a post-processing algorithm is designed to refine the image predictions.
    • We propose the effective logit adjustment loss (ELAL) to alleviate the prediction bias
      resulting from the long-tailed dataset.
    • Based on our algorithm, we achieve 2nd place on the SnakeCLEF 2022 benchmark with a
      0.84565 top F1 score.


2. Related Work
Fine-grained image classification: To deal with the fine-grained property which is hard
to recognize merely through the visual clues, there are three workable solutions: 1) to de-
tect the discriminative regions of an image and pass all parts through the networks for joint
classification[1, 3, 4, 6, 7]. 2) Design a robust feature extraction architecture to capture the
subtle representations from an image [17, 5, 2, 18, 19]. 3) Utilize the metadata (e.g., shooting
date, latitude, longitude, country, and a brief description of the image) [11, 12, 13, 14, 15, 16].
However, the region detector and feature extractor are heavily designed and thus not suitable
for our task. Meanwhile, the existing metadata fusion methods all deal with the multimodal
feature by embedding them to higher semantic representations before interaction. Specifically,
in SnakeCLEF 2022, the types of metadata are discrete (e.g., country, endemic, and code), which
is different from the continuous latitude, longitude, or date hypothesized in the previous works.
To make use of this metadata, we calculate the existence label within a certain country for all
country values in the metadata and form the prior matrix regarding all species.
Long-tailed distribution: In terms of the long-tailed classification, the data re-sampling [20, 21]
seeks to change class sampling probability based on the number of samples to get a class-
balanced dataset, which includes over-sampling and under-sampling. [22] develop a two-
stage paradigm to re-balanced the classifier in the second stage with a frozen backbone. Re-
weighting [23, 24, 25, 26] aims to assign the loss weight class-wise to reduce the optimization
bias between head-tail classes. The logit adjustment loss [23] encourages a large relative
margin between logits of rare versus dominant labels. Based on this work, we modify the
margin coefficient and propose an effective logit adjustment loss (ELAL) to solve the long-tailed
problem efficiently.
         5000

         4000
      num

         3000

         2000

         1000

                  0      0           200       400      600        800           1000            1200      1400     1600
                                                                 class
Figure 1: Visualization of the instance number for each class sorted by number in descending order.


                                                       softmax
                                                ViT               0.0    0.1   0.4   0.4   0.0


                                                                                                        Hadamard product
            sample num


                                                       one-hot
                                                                  0.0    1.0   1.0   0.0   0.0

                         a   b   c     d   e   class
                                 P                                             Po
Figure 2: Structure of the post-processing to refine image prediction by category prior extracted from
the metadata.

3. Task Description
3.1. Dataset
The SnakeCLEF 2022 dataset [8, 9, 10] is based on observations of 187,129 snakes, containing
318,532 photographs, belonging to 1,572 snake species, observed in 208 countries. The data
comes from the online biodiversity platform, iNaturalist. The provided dataset has a heavy
long-tailed class distribution (see Fig. 1), where the most frequent species (Natrix natrix) is
represented by 6,472 images and the least frequent species by just 5 samples.

3.2. Metric
The evaluation metric for this competition is Mean (Macro) F1-Score. The F1 score, commonly
used in information retrieval, measures accuracy using the statistics precision (P) and recall (R).
The macro F1 score is not biased by class frequencies and is more suitable for the long-tailed
class distributions observed in nature. This metric raises a higher requirement for classification
accuracy on tailed categories.
4. Method
4.1. Metadata-aware Post-processing
Given the metadata-label mapping, we count the instance number for all categories if it is
attached to a certain metadata value. Then, we obtain the statistic in form of a metadata-wise
category matrix P ∈ R𝑛×𝑐 , where 𝑛 is the value number within one type of metadata, and 𝑐 is
the number of classes. Next, we transform P to a one-hot form P𝑜 , known as the prior statistic,
which represents whether a specific category would appear in a certain place. Finally, P𝑜 is
utilized to refine the prediction from the visual networks via the Hadamard product. The whole
structure is illustrated in Fig. 2.

4.2. Effective Logit Adjustment Loss
In this section, we introduce our new effective logit adjustment loss (ELAL) function which
addresses the performance drop resulting from the prediction bias brought by the long-tailed
distribution. First, we give a brief review of the existing loss functions, and then we show how
ELAL is developed based on them.
   The vanilla softmax cross-entropy can be derived by:
                                             ⎛                           ⎞
                                                    ∑︁
                           ℓ(𝑦, 𝑓 (𝑥)) = log ⎝1 +        𝑒𝑓𝑦′ (𝑥)−𝑓𝑦 (𝑥) ⎠ ,                  (1)
                                                  𝑦 ′ ̸=𝑦
where 𝑦 denotes the ground-truth label. The logit adjustment loss [23] adds a label-dependent
offset to each of the logits, and modifies Eq. 1 with the shifted coefficient 𝑀 :
                                           ⎛                             ⎞
                                                 ∑︁
                         ℓ(𝑦, 𝑓 (𝑥)) = log ⎝1 +      𝑀 · 𝑒𝑓𝑦′ (𝑥)−𝑓𝑦 (𝑥) ⎠ ,               (2)
                                             𝑦 ′ ̸=𝑦
          𝜋𝑦′         𝑁
where 𝑀 = 𝜋𝑦 , 𝜋𝑦 = ∑︀ ′ 𝑦𝑁 ′ ∈ (0, 1), and 𝑁𝑦 is the total number of instances in each class.
                      𝑦    𝑦
Class-balanced Loss [24] proposes the concept of an effective number to replace the direct
label-wise instance number to represent the volume of samples. The definition of the effective
number is shown as:
                                             1 − 𝛽 𝑁𝑦
                                      𝐸𝑦 =            .                                     (3)
                                              1−𝛽
Inspired by the conception of effective number, which is an improved representation of the
vanilla number, we modify the logit adjustment loss by changing the shifted coefficient 𝑀 to
       𝜖          𝐸
𝑀 = 𝜖𝑦′𝑦 , 𝜖𝑦 = ∑︀ ′ 𝑦𝐸 ′ ∈ (0, 1) and propose ELAL. Notably, we set 𝛽 = 1𝑒 − 6 by default in
                  𝑦    𝑦
our experiments.


5. Experiments
In this section, we first elaborate on our experimental settings, then ablation studies are con-
ducted to demonstrate the performance of each component. Finally, we list the top results of
our methods and give a considerable analysis.
Table 1
Fine-tuning settings on the SnakeCLEF 2022 dataset.
                            Config                            Value
                            optimizer                         AdamW
                            base learning rate                1e-4 (ViT-L), 1e-3 (ViT-H)
                            weight decay                      0.05
                            optimizer momentum                𝛽1 , 𝛽2 =0.9, 0.999
                            layer-wise lr decay               0.75 (ViT-L), 0.8 (ViT-H)
                            global batch size (over 8 GPUs)   16
                            batch size per GPU                2
                            accumulated iteration             4
                            learning rate schedule            cosine decay
                            warmup epochs                     5
                            augmentation                      RandAug (9, 0.5)
                            label smoothing                   0.1
                            mixup                             0.8
                            cutmix                            1.0
                            random erase                      0.25
                            drop path                         0.2

5.1. Setup
In this paper, we use the Masked autoencoder (MAE) [27] pretrained ViT [28] models conducted
on ImageNet-1K [29] training set for 800 epochs. The fine-tuning codes and checkpoints refer to
the MAE repository3 . The ImageNet-1K dataset has 1.3M images with 1K categories for training
and 50K images for validation. Notably, we do not use the larger ImageNet-22K (IN22K) dataset,
which contains 14.2M images and 22K classes. Based on the MAE pretrained models, we finetune
50 epochs on the SnakeCLEF 2022 dataset, and the default setting is depicted in Table 1. We
randomly select 1/10 of the training dataset to form the validation set to update our algorithm,
and a full set is used to train models which present the final submissions. To be specific, we
set batch size per GPU to 2 to avoid exceeding the GPU memory. The effective learning rate is
obtained following MAE: lr= base_lr×globalbatchsize / 256. We apply random resizing/cropping,
random horizontal flipping [30], label-smoothing regularization [31], Mixup [32], CutMix [33],
RandomErasing [34], and RandAug [35] as the standard data augmentations. Notably, all
ablation studies are conducted under ViT-L for fair comparisons. The ViT-large and ViT-huge
models are trained on eight NVIDIA TITAN Xp GPUs (12G) and eight GeForce RTX 3090 GPUs
(24G), respectively.

5.2. Ablation Study
First, we compare the performance with different sets of metadata for post-processing. Table 2
shows that refining predictions with “endemic” and “code” metadata perform the best.
   Next, we conduct the ablation on two losses. Table 3 shows our ELAL achieves a higher
F1 score under two sets of input resolution. To demonstrate the effectiveness of ELAL on tail

   3
       https://github.com/facebookresearch/mae
Table 2
Ablation study on the performance of post-processing under different metadata combinations.
                            code       endemic          country     val acc val F1 test F1
                                                                       80.470   0.758     0.755
                                                           ✓           88.554   0.810     0.796
                             ✓                                         88.613   0.856     0.834
                             ✓              ✓                          89.949   0.873     0.864
                             ✓              ✓              ✓           93.893   0.920     0.815

Table 3
Ablation study on the performance of the long-tailed loss. CE: Cross-entropy loss. ELAL: Effective logit
adjustment loss.
                                   resolution loss         val acc val F1 test F1
                                                 CE         0.858       0.821    0.735
                                      224
                                                 ELAL       0.915       0.892    0.756
                                                 CE         0.889       0.859    0.792
                                      384
                                                 ELAL       0.939       0.920    0.815

Table 4
Ablation study on the performance of the head and tail class. We depict the accuracy of the top
10/50/100/500 from the head/tail classes. CE: Cross-entropy loss. ELAL: Effective logit adjustment loss.
                                     loss       class     10      50      100    500
                                                head 1.00 1.00 0.95 0.94
                                     CE
                                                tail 0.30 0.46 0.56 0.79
                                                head 1.00 1.00 0.94 0.94
                                     ELAL
                                                tail 0.90 0.82 0.88 0.93

Table 5
Performance of the final submissions on public/private benchmarks.
                                                          center crop              multi crop

                        model        resolution         public    private       public    private
                        large             384       0.87134       0.81199       0.87996   0.81997
                        large             432       0.88375       0.82382       0.89173   0.83063
                        huge              392       0.89692       0.83662       0.89449   0.84057
                        ensemble           –        0.90245       0.84409       0.89822   0.84565


class and the potential side effect on head class, we calculate the validation accuracy on the top
10/50/100/500 class from the head and tail classes, respectively (Table 4).

5.3. Results
Based on the strong ViT-L and ViT-H [28], we conduct experiments with an input resolution of
384/392/448 learned on a full training set. We adopt multi-crop [36] as post-processing strategies,
which would crop the given image into four corners and the central crop plus the flipped version
and average the predictions of whole crops. The model ensemble is an averaging operation
over each prediction score after the softmax of the selected models. Our final submissions come
from the ensemble of models w/o and w/ multi-crop, which receives 0.84409 and 0.84565 F1
scores on the private benchmark.

5.4. Analysis
We attempt to run a ViT-H with a 448 resolution, which is capable of reaching a higher accuracy
theoretically, however, due to the resource limitation we only present the result of a 392
resolution. Also, we notice that the effect of post-processing on the private benchmark is not as
significant as on the public benchmark. We suspect that there is a distribution gap between the
train and test set of metadata while the public benchmark is less affected.


6. Conclusion
In this paper, we give our solution to the Snake Recognition Competition (SnakeCLEF 2022) in
FGVC9, which is challenging due to the fine-grained categorization and long-tailed classes. To
deal with the difficulties, we utilize statistic-aware metadata to refile image predictions through
post-processing and propose the effective logit adjustment loss (ELAL) to handle the long-tailed
problem, respectively. Our team achieves the 2nd place on the private benchmark with a 0.84565
top F1 score.


References
 [1] N. Zhang, J. Donahue, R. Girshick, T. Darrell, Part-based r-cnns for fine-grained category
     detection, in: ECCV, 2014.
 [2] Z. Akata, S. Reed, D. Walter, H. Lee, B. Schiele, Evaluation of output embeddings for
     fine-grained image classification, in: CVPR, 2015.
 [3] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, Z. Zhang, The application of two-level attention
     models in deep convolutional neural network for fine-grained image classification, in:
     CVPR, 2015.
 [4] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, L. Wang, Learning to navigate for fine-grained
     classification, in: ECCV, 2018.
 [5] D. Chang, Y. Ding, J. Xie, A. K. Bhunia, X. Li, Z. Ma, M. Wu, J. Guo, Y.-Z. Song, The devil
     is in the channels: Mutual-channel loss for fine-grained image classification, IEEE TIP
     (2020).
 [6] F. Zhang, M. Li, G. Zhai, Y. Liu, Multi-branch and multi-scale attention learning for
     fine-grained visual categorization, in: MMM, 2021.
 [7] A. Behera, Z. Wharton, P. Hewage, A. Bera, Context-aware attentional pooling (cap) for
     fine-grained visual classification, arXiv preprint arXiv:2101.06635 (2021).
 [8] L. Picek, A. M. Durso, M. Hrúz, I. Bolon, Overview of SnakeCLEF 2022: Automated snake
     species identification on a global scale, in: Working Notes of CLEF 2022 - Conference and
     Labs of the Evaluation Forum, 2022.
 [9] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso,
     I. Bolon, et al., Lifeclef 2022 teaser: An evaluation of machine-learning based species
     identification and species distribution prediction, in: European Conference on Information
     Retrieval, Springer, 2022, pp. 390–399.
[10] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso,
     H. Glotin, R. Planqué, W.-P. Vellinga, A. Navine, H. Klinck, T. Denton, I. Eggel, P. Bonnet,
     M. Šulc, M. Hruz, Overview of lifeclef 2022: an evaluation of machine-learning based
     species identification and species distribution prediction, in: International Conference of
     the Cross-Language Evaluation Forum for European Languages, Springer, 2022.
[11] K. Tang, M. Paluri, L. Fei-Fei, R. Fergus, L. Bourdev, Improving image classification with
     location context, in: ICCV, 2015.
[12] H. C. Wittich, M. Seeland, J. Wäldchen, M. Rzanny, P. Mäder, Recommending plant taxa
     for supporting on-site species identification, BMC bioinformatics (2018).
[13] G. Chu, B. Potetz, W. Wang, A. Howard, Y. Song, F. Brucher, T. Leung, H. Adam, Geo-aware
     networks for fine-grained recognition, in: ICCV, 2019.
[14] O. Mac Aodha, E. Cole, P. Perona, Presence-only geographical priors for fine-grained
     image classification, in: ICCV, 2019.
[15] L. Yang, X. Li, R. Song, B. Zhao, J. Tao, S. Zhou, J. Liang, J. Yang, Dynamic mlp for fine-
     grained image classification by leveraging geographical and temporal information, arXiv
     preprint arXiv:2203.03253 (2022).
[16] Q. Diao, Y. Jiang, B. Wen, J. Sun, Z. Yuan, Metaformer: A unified meta framework for
     fine-grained recognition, arXiv preprint arXiv:2203.02751 (2022).
[17] H. Zheng, J. Fu, Z.-J. Zha, J. Luo, Learning deep bilinear transformation for fine-grained
     image representation, arXiv preprint arXiv:1911.03621 (2019).
[18] Y. Gao, X. Han, X. Wang, W. Huang, M. Scott, Channel interaction networks for fine-
     grained image categorization, in: AAAI, 2020.
[19] H. Touvron, A. Sablayrolles, M. Douze, M. Cord, H. Jégou, Grafit: Learning fine-grained
     image representations with coarse labels, in: ICCV, 2021.
[20] B. Zhou, Q. Cui, X.-S. Wei, Z.-M. Chen, Bbn: Bilateral-branch network with cumulative
     learning for long-tailed visual recognition, in: CVPR, 2020.
[21] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority
     over-sampling technique, Journal of artificial intelligence research (2002).
[22] B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, Y. Kalantidis, Decoupling rep-
     resentation and classifier for long-tailed recognition, arXiv preprint arXiv:1910.09217
     (2019).
[23] A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, S. Kumar, Long-tail learning via
     logit adjustment, arXiv preprint arXiv:2007.07314 (2020).
[24] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, S. Belongie, Class-balanced loss based on effective number
     of samples, in: CVPR, 2019.
[25] J. Tan, X. Lu, G. Zhang, C. Yin, Q. Li, Equalization loss v2: A new gradient balance approach
     for long-tailed object detection, in: CVPR, 2021.
[26] B. Li, Y. Yao, J. Tan, G. Zhang, F. Yu, J. Lu, Y. Luo, Equalized focal loss for dense long-tailed
     object detection, arXiv preprint arXiv:2201.02593 (2022).
[27] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision
     learners, in: CVPR, 2021.
[28] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De-
     hghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Trans-
     formers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical
     image database, in: CVPR, 2009.
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
     A. Rabinovich, Going deeper with convolutions, in: CVPR, 2015.
[31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture
     for computer vision, in: CVPR, 2016.
[32] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimiza-
     tion, arXiv preprint arXiv:1710.09412 (2017).
[33] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, Cutmix: Regularization strategy to train
     strong classifiers with localizable features, in: ICCV, 2019.
[34] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, in: AAAI,
     2020.
[35] E. D. Cubuk, B. Zoph, J. Shlens, Q. V. Le, Randaugment: Practical automated data augmen-
     tation with a reduced search space, in: CVPRW, 2020.
[36] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional
     neural networks, in: NeurIPS, 2012.

</pre>