<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Solution for SnakeCLEF 2022 by Tackling Long-tailed Categorization</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lingfeng</forename><surname>Yang</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Nanjing University of Science and Technology</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Megvii Technology</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Xiang</forename><surname>Li</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Nankai University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Renjie</forename><surname>Song</surname></persName>
							<email>songrenjie@megvii.com</email>
							<affiliation key="aff1">
								<orgName type="department">Megvii Technology</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kexin</forename><surname>Zhu</surname></persName>
							<email>zhukexin@megvii.com</email>
							<affiliation key="aff1">
								<orgName type="department">Megvii Technology</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gang</forename><surname>Li</surname></persName>
							<email>gang.li@njust.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="institution">Nanjing University of Science and Technology</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Evaluation Forum</orgName>
								<address>
									<addrLine>September 5-8</addrLine>
									<postCode>2022</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Solution for SnakeCLEF 2022 by Tackling Long-tailed Categorization</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">E3F4DF89B659CEF6E76032559B730CE2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>SnakeCLEF</term>
					<term>Fine-grained image classification</term>
					<term>Masked autoencoder</term>
					<term>Metadata</term>
					<term>Long-tailed distribution</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>SnakeCLEF 2022 is a fine-grained image classification benchmark for snake identification. Recently, the masked autoencoder (MAE) has shown superior performance on fine-grained image classification tasks. We therefore take MAE-pretrained ViT models and fine-tune them on the SnakeCLEF 2022 dataset. Overall, the learning process presents two difficulties: 1) dealing with fine-grained species that are visually similar, and 2) a long-tailed distribution. To address these issues, we propose statistic-aware post-processing that uses the metadata to refine image predictions. We further propose an effective logit adjustment loss (ELAL) to alleviate the classification bias toward the head classes. Notably, we achieve 2nd place on the SnakeCLEF 2022 benchmark with a 0.84565 top F1 score. Codes and models are available at https://github.com/ylingfeng/snakeclef2022_fgvc9.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Fine-grained visual categorization <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref> is a popular task that identifies fine categories within coarse divisions. Recently, there has been an increasing need for fine-grained visual categorization algorithms covering the various species of snakes, with applications in biodiversity, conservation, and global health. The SnakeCLEF 2022 benchmark <ref type="bibr" target="#b7">[8]</ref> addresses this need; it is organized by LifeCLEF <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref> jointly with FGVC9 at CVPR 2022.</p><p>The difficulty of fine-grained snake identification lies in the high intra-class and low inter-class differences in appearance: many species are visually similar to others. Moreover, the species distribution across geographical locations is irregular; some countries (e.g., the US) contain hundreds of species, while others (e.g., the Vatican) host only a few. In addition, the dataset suffers from a severe long-tailed problem in which two-thirds of the categories contain fewer than 100 instances.</p><p>We propose to solve these problems individually. First, for the visually similar samples that confuse image-only predictions, we utilize the metadata <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref> provided with the dataset to form a prior distribution over all species. 
Different from previous multi-modal methods, which embed the metadata into the feature space, we design a parameter-free post-processing structure to refine the predictions. Specifically, we record the number of occurrences of each metadata value for each species as the prior. More details can be found in Sec. 4.1. Secondly, in Sec. 4.2 we propose the effective logit adjustment loss (ELAL), which alleviates the prediction bias that arises when training on long-tailed samples by increasing the optimization weight of the tail classes while reducing that of the head classes.</p><p>Our contributions can be summarized as follows:</p><p>• We introduce a new way to process the metadata by recording statistics for each category, and design a post-processing algorithm to refine the image predictions. • We propose the effective logit adjustment loss (ELAL) to alleviate the prediction bias resulting from the long-tailed dataset. • Based on our algorithm, we achieve 2nd place on the SnakeCLEF 2022 benchmark with a 0.84565 top F1 score.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Fine-grained image classification: To recognize fine-grained categories that are hard to distinguish through visual cues alone, there are three workable solutions: 1) Detect the discriminative regions of an image and pass all parts through the network for joint classification <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. 2) Design a robust feature-extraction architecture that captures the subtle representations of an image <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19]</ref>. 3) Utilize the metadata (e.g., shooting date, latitude, longitude, country, and a brief description of the image) <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>. However, region detectors and specialized feature extractors are heavily engineered and thus not suitable for our task. Meanwhile, existing metadata fusion methods all handle the multi-modal features by embedding them into higher-level semantic representations before interaction. 
Specifically, in SnakeCLEF 2022, the types of metadata are discrete (e.g., country, endemic, and code), unlike the continuous latitude, longitude, or date assumed in previous works.</p><p>To make use of this metadata, we record, for every country value, which species occur in that country, and form the prior matrix over all species.</p><p>Long-tailed distribution: For long-tailed classification, data re-sampling <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref> seeks to change the class sampling probability based on the number of samples to obtain a class-balanced dataset, which includes over-sampling and under-sampling. <ref type="bibr" target="#b21">[22]</ref> develops a two-stage paradigm that re-balances the classifier in the second stage with a frozen backbone. Re-weighting <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b23">24,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b25">26]</ref> assigns the loss weight class-wise to reduce the optimization bias between head and tail classes. The logit adjustment loss <ref type="bibr" target="#b22">[23]</ref> encourages a large relative margin between the logits of rare versus dominant labels. Based on this work, we modify the margin coefficient and propose an effective logit adjustment loss (ELAL) to solve the long-tailed problem efficiently. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Task Description</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Dataset</head><p>The SnakeCLEF 2022 dataset <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref> comprises 318,532 photographs from 187,129 snake observations, covering 1,572 snake species observed in 208 countries. The data come from the online biodiversity platform iNaturalist. The dataset has a heavily long-tailed class distribution (see Fig. <ref type="figure">1</ref>): the most frequent species (Natrix natrix) is represented by 6,472 images, while the least frequent has just 5 samples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Metric</head><p>The evaluation metric for this competition is Mean (Macro) F1-Score. The F1 score, commonly used in information retrieval, measures accuracy using the statistics precision (P) and recall (R).</p><p>The macro F1 score is not biased by class frequencies and is more suitable for the long-tailed class distributions observed in nature. This metric raises a higher requirement for classification accuracy on tailed categories.</p></div>
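The macro F1 metric described above can be sketched in a few lines of Python; the function name and list-based interface are our own illustration, not part of the competition kit:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1, then average with equal
    weight per class, so rare (tail) classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally to the mean, a model that ignores the 5-sample species pays the same macro-F1 penalty as one that fails on Natrix natrix.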
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Metadata-aware Post-processing</head><p>Given the metadata-label mapping, we count, for each metadata value, the number of instances of every category attached to it. We thus obtain the statistic in the form of a metadata-wise category matrix P ∈ R 𝑛×𝑐 , where 𝑛 is the number of distinct values within one type of metadata and 𝑐 is the number of classes. Next, we binarize P into a one-hot form P 𝑜 , the prior statistic, which indicates whether a specific category can appear under a certain metadata value. Finally, P 𝑜 is used to refine the prediction of the visual networks via the Hadamard product. The whole structure is illustrated in Fig. <ref type="figure">2</ref>.</p></div>
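The counting and refinement steps above can be sketched as follows. This is a minimal illustration under our own naming (`build_prior`, `refine`), assuming metadata values and class labels are integer-encoded:

```python
def build_prior(metadata_values, labels, n_values, n_classes):
    """Build the metadata-wise count matrix P (n x c), then binarize it
    into the one-hot prior P_o: P_o[v][c] = 1 iff class c was observed
    at least once with metadata value v in the training set."""
    P = [[0] * n_classes for _ in range(n_values)]
    for v, y in zip(metadata_values, labels):
        P[v][y] += 1
    return [[1.0 if count > 0 else 0.0 for count in row] for row in P]

def refine(pred, prior_row):
    """Refine an image prediction via the Hadamard (element-wise) product
    with the prior row matching the sample's metadata value, zeroing out
    classes never seen under that value."""
    return [p * m for p, m in zip(pred, prior_row)]
```

For example, a species never recorded in the sample's country keeps a zero score regardless of its visual-network confidence.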
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Effective Logit Adjustment Loss</head><p>In this section, we introduce our new effective logit adjustment loss (ELAL), which addresses the performance drop caused by the prediction bias of the long-tailed distribution. We first briefly review the existing loss functions, and then show how ELAL is developed from them. The vanilla softmax cross-entropy can be written as:</p><formula xml:id="formula_0">ℓ(𝑦, 𝑓(𝑥)) = log(1 + Σ_{𝑦′≠𝑦} 𝑒^{𝑓_{𝑦′}(𝑥) − 𝑓_𝑦(𝑥)}),<label>(1)</label></formula><p>where 𝑦 denotes the ground-truth label. The logit adjustment loss <ref type="bibr" target="#b22">[23]</ref> adds a label-dependent offset to each of the logits, modifying Eq. 1 with the shift coefficient 𝑀:</p><formula xml:id="formula_1">ℓ(𝑦, 𝑓(𝑥)) = log(1 + Σ_{𝑦′≠𝑦} 𝑀 · 𝑒^{𝑓_{𝑦′}(𝑥) − 𝑓_𝑦(𝑥)}),<label>(2)</label></formula><p>where 𝑀 = 𝜋_{𝑦′}/𝜋_𝑦, 𝜋_𝑦 = 𝑁_𝑦 / Σ_{𝑦′} 𝑁_{𝑦′} ∈ (0, 1), and 𝑁_𝑦 is the total number of instances in class 𝑦. The class-balanced loss <ref type="bibr" target="#b23">[24]</ref> proposes the effective number as a replacement for the raw label-wise instance count to represent the volume of samples:</p><formula xml:id="formula_2">𝐸_𝑦 = (1 − 𝛽^{𝑁_𝑦}) / (1 − 𝛽).<label>(3)</label></formula><p>Inspired by the effective number, which improves on the raw count, we modify the logit adjustment loss by changing the shift coefficient to 𝑀 = 𝜖_{𝑦′}/𝜖_𝑦, 𝜖_𝑦 = 𝐸_𝑦 / Σ_{𝑦′} 𝐸_{𝑦′} ∈ (0, 1), and thus obtain ELAL. Notably, we set 𝛽 = 1𝑒−6 by default in our experiments.</p></div>
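Under the definitions above, a minimal per-sample sketch of ELAL in plain Python (the function name and list-based interface are ours, not the authors' released code) looks like this; note that with a very small 𝛽 the effective numbers of all classes are nearly equal, so the demonstration below uses a larger 𝛽 to make the head/tail asymmetry visible:

```python
import math

def elal_loss(logits, label, class_counts, beta=1e-6):
    """Effective logit adjustment loss (sketch). The shift coefficient
    M = eps_{y'} / eps_y uses normalized effective numbers
    E_y = (1 - beta**N_y) / (1 - beta) in place of raw class counts."""
    # Effective number per class (Eq. 3), then normalize to eps_y.
    E = [(1 - beta ** n) / (1 - beta) for n in class_counts]
    total = sum(E)
    eps = [e / total for e in E]
    # Eq. 2 with the modified shift coefficient M = eps_{y'} / eps_y.
    s = sum((eps[yp] / eps[label]) * math.exp(logits[yp] - logits[label])
            for yp in range(len(logits)) if yp != label)
    return math.log(1 + s)
```

With equal class counts the coefficient M is 1 and the loss reduces to the vanilla cross-entropy of Eq. 1; when the ground-truth label is a tail class, M grows and the loss pushes the tail logit up harder.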
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>In this section, we first elaborate on our experimental settings; then, ablation studies are conducted to demonstrate the contribution of each component. Finally, we list the top results of our methods and provide a detailed analysis. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Setup</head><p>In this paper, we use ViT <ref type="bibr" target="#b27">[28]</ref> models pretrained with the masked autoencoder (MAE) <ref type="bibr" target="#b26">[27]</ref> on the ImageNet-1K <ref type="bibr" target="#b28">[29]</ref> training set for 800 epochs. The fine-tuning code and checkpoints follow the MAE repository <ref type="foot" target="#foot_0">3</ref> . The ImageNet-1K dataset has 1.3M training images in 1K categories and 50K validation images. Notably, we do not use the larger ImageNet-22K (IN22K) dataset, which contains 14.2M images and 22K classes. Based on the MAE pretrained models, we fine-tune for 50 epochs on the SnakeCLEF 2022 dataset; the default settings are listed in Table <ref type="table" target="#tab_0">1</ref>. We randomly select 1/10 of the training dataset as a validation set for developing our algorithm, and the full set is used to train the models for the final submissions. Specifically, we set the batch size per GPU to 2 to avoid exceeding GPU memory. The effective learning rate is obtained following MAE: lr = base_lr × global_batch_size / 256. We apply random resized cropping, random horizontal flipping <ref type="bibr" target="#b29">[30]</ref>, label-smoothing regularization <ref type="bibr" target="#b30">[31]</ref>, Mixup <ref type="bibr" target="#b31">[32]</ref>, CutMix <ref type="bibr" target="#b32">[33]</ref>, RandomErasing <ref type="bibr" target="#b33">[34]</ref>, and RandAug <ref type="bibr" target="#b34">[35]</ref> as standard data augmentations. Notably, all ablation studies are conducted with ViT-L for fair comparison. The ViT-large and ViT-huge models are trained on eight NVIDIA TITAN Xp GPUs (12 GB) and eight GeForce RTX 3090 GPUs (24 GB), respectively.</p></div>
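The learning-rate scaling rule above can be written as a one-line helper (a sketch with our own function name; gradient accumulation is deliberately left out, since the table reports the global batch over 8 GPUs as 16):

```python
def effective_lr(base_lr, batch_per_gpu, n_gpus):
    """MAE linear lr scaling rule: lr = base_lr * global_batch_size / 256."""
    global_batch = batch_per_gpu * n_gpus
    return base_lr * global_batch / 256
```

With the paper's ViT-L setting (base_lr = 1e-4, batch 2 per GPU over 8 GPUs), the effective learning rate is 1e-4 × 16 / 256 = 6.25e-6.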
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Ablation Study</head><p>First, we compare the performance of post-processing with different sets of metadata. Table <ref type="table" target="#tab_1">2</ref> shows that refining predictions with the "endemic" and "code" metadata performs best. Next, we conduct an ablation on the two losses. Table <ref type="table" target="#tab_2">3</ref> shows that our ELAL achieves a higher F1 score under both input resolutions. To demonstrate the effectiveness of ELAL on the tail classes and its potential side effect on the head classes, we report the validation accuracy on the top 10/50/100/500 classes from the head and the tail, respectively (Table <ref type="table" target="#tab_3">4</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Results</head><p>Based on the strong ViT-L and ViT-H <ref type="bibr" target="#b27">[28]</ref>, we conduct experiments with input resolutions of 384/432/392 on the full training set. We adopt multi-crop <ref type="bibr" target="#b35">[36]</ref> as a test-time strategy: the given image is cropped into the four corners and the center crop, plus their flipped versions, and the predictions over all crops are averaged. The model ensemble averages the prediction scores of the selected models after the softmax. Our final submissions come from the ensembles of models w/o and w/ multi-crop, which receive 0.84409 and 0.84565 F1 scores on the private benchmark, respectively.</p></div>
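The ensembling step described above (averaging post-softmax scores over models, and identically over crops) can be sketched as follows; the helper names are ours:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ensemble(prob_vectors):
    """Average probability vectors element-wise: each model (or crop)
    contributes one post-softmax score vector per image."""
    n = len(prob_vectors)
    return [sum(col) / n for col in zip(*prob_vectors)]
```

Averaging after the softmax, rather than averaging raw logits, keeps each model's contribution on the same probability scale regardless of its logit magnitudes.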
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Analysis</head><p>We attempted to run ViT-H at a 448 resolution, which could theoretically reach higher accuracy; however, due to resource limitations we only present the result at a 392 resolution. We also notice that the effect of post-processing on the private benchmark is not as significant as on the public benchmark. We suspect there is a distribution gap between the training and test metadata, while the public benchmark is less affected.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper, we present our solution to the Snake Recognition Competition (SnakeCLEF 2022) at FGVC9, which is challenging due to fine-grained categorization and long-tailed classes. To deal with these difficulties, we utilize statistic-aware metadata post-processing to refine image predictions, and propose the effective logit adjustment loss (ELAL) to handle the long-tailed problem. Our team achieves 2nd place on the private benchmark with a 0.84565 top F1 score.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :Figure 2 :</head><label>12</label><figDesc>Figure 1: Visualization of the instance number for each class sorted by number in descending order.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Fine-tuning settings on the SnakeCLEF 2022 dataset.</figDesc><table><row><cell>Config</cell><cell>Value</cell></row><row><cell>optimizer</cell><cell>AdamW</cell></row><row><cell>base learning rate</cell><cell>1e-4 (ViT-L), 1e-3 (ViT-H)</cell></row><row><cell>weight decay</cell><cell>0.05</cell></row><row><cell>optimizer momentum</cell><cell>𝛽1, 𝛽2=0.9, 0.999</cell></row><row><cell>layer-wise lr decay</cell><cell>0.75 (ViT-L), 0.8 (ViT-H)</cell></row><row><cell cols="2">global batch size (over 8 GPUs) 16</cell></row><row><cell>batch size per GPU</cell><cell>2</cell></row><row><cell>accumulated iteration</cell><cell>4</cell></row><row><cell>learning rate schedule</cell><cell>cosine decay</cell></row><row><cell>warmup epochs</cell><cell>5</cell></row><row><cell>augmentation</cell><cell>RandAug (9, 0.5)</cell></row><row><cell>label smoothing</cell><cell>0.1</cell></row><row><cell>mixup</cell><cell>0.8</cell></row><row><cell>cutmix</cell><cell>1.0</cell></row><row><cell>random erase</cell><cell>0.25</cell></row><row><cell>drop path</cell><cell>0.2</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Ablation study on the performance of post-processing under different metadata combinations.</figDesc><table><row><cell>code</cell><cell cols="3">endemic country val acc val F1 test F1</cell></row><row><cell></cell><cell></cell><cell></cell><cell>80.470 0.758 0.755</cell></row><row><cell></cell><cell></cell><cell>✓</cell><cell>88.554 0.810 0.796</cell></row><row><cell>✓</cell><cell></cell><cell></cell><cell>88.613 0.856 0.834</cell></row><row><cell>✓</cell><cell>✓</cell><cell></cell><cell>89.949 0.873 0.864</cell></row><row><cell>✓</cell><cell>✓</cell><cell>✓</cell><cell>93.893 0.920 0.815</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Ablation study on the performance of the long-tailed loss. CE: Cross-entropy loss. ELAL: Effective logit adjustment loss.</figDesc><table><row><cell cols="2">resolution loss</cell><cell>val acc val F1 test F1</cell></row><row><cell>224</cell><cell cols="2">CE ELAL 0.915 0.892 0.756 0.858 0.821 0.735</cell></row><row><cell>384</cell><cell cols="2">CE ELAL 0.939 0.920 0.815 0.889 0.859 0.792</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Ablation study on the performance of the head and tail class. We depict the accuracy of the top 10/50/100/500 from the head/tail classes. CE: Cross-entropy loss. ELAL: Effective logit adjustment loss.</figDesc><table><row><cell>loss</cell><cell>class 10</cell><cell>50 100 500</cell></row><row><cell>CE</cell><cell cols="2">head 1.00 1.00 0.95 0.94 tail 0.30 0.46 0.56 0.79</cell></row><row><cell>ELAL</cell><cell cols="2">head 1.00 1.00 0.94 0.94 tail 0.90 0.82 0.88 0.93</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Performance of the final submissions on public/private benchmarks.</figDesc><table><row><cell></cell><cell></cell><cell>center crop</cell><cell>multi crop</cell></row><row><cell>model</cell><cell cols="3">resolution public private public private</cell></row><row><cell>large</cell><cell>384</cell><cell cols="2">0.87134 0.81199 0.87996 0.81997</cell></row><row><cell>large</cell><cell>432</cell><cell cols="2">0.88375 0.82382 0.89173 0.83063</cell></row><row><cell>huge</cell><cell>392</cell><cell cols="2">0.89692 0.83662 0.89449 0.84057</cell></row><row><cell>ensemble</cell><cell>-</cell><cell cols="2">0.90245 0.84409 0.89822 0.84565</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://github.com/facebookresearch/mae</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Part-based r-cnns for fine-grained category detection</title>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Donahue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Evaluation of output embeddings for fine-grained image classification</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Akata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Reed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Walter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schiele</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The application of two-level attention models in deep convolutional neural network for fine-grained image classification</title>
		<author>
			<persName><forename type="first">T</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Learning to navigate for fine-grained classification</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<author>
			<persName><forename type="first">D</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Bhunia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-Z</forename><surname>Song</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The devil is in the channels: Mutual-channel loss for fine-grained image classification</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<title level="m">Multi-branch and multi-scale attention learning for fine-grained visual categorization</title>
				<imprint>
			<publisher>MMM</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Behera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wharton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hewage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bera</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2101.06635</idno>
		<title level="m">Context-aware attentional pooling (cap) for fine-grained visual classification</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Overview of SnakeCLEF 2022: Automated snake species identification on a global scale</title>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Durso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hrúz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bolon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2022 -Conference and Labs of the Evaluation Forum</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Lifeclef 2022 teaser: An evaluation of machine-learning based species identification and species distribution prediction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lorieul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Durso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bolon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Information Retrieval</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="390" to="399" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Overview of LifeCLEF 2022: an evaluation of machine-learning based species identification and species distribution prediction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lorieul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Durso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Navine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Klinck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Eggel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bonnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šulc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hrúz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference of the Cross-Language Evaluation Forum for European Languages</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Improving image classification with location context</title>
		<author>
			<persName><forename type="first">K</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Paluri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bourdev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Recommending plant taxa for supporting on-site species identification</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">C</forename><surname>Wittich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Seeland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wäldchen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rzanny</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mäder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC bioinformatics</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Geo-aware networks for fine-grained recognition</title>
		<author>
			<persName><forename type="first">G</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Potetz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Leung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Adam</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Presence-only geographical priors for fine-grained image classification</title>
		<author>
			<persName><forename type="first">O</forename><surname>Mac Aodha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Perona</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.03253</idno>
		<title level="m">Dynamic MLP for fine-grained image classification by leveraging geographical and temporal information</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">Q</forename><surname>Diao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yuan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.02751</idno>
		<title level="m">MetaFormer: A unified meta framework for fine-grained recognition</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z.-J</forename><surname>Zha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Luo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1911.03621</idno>
		<title level="m">Learning deep bilinear transformation for fine-grained image representation</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Channel interaction networks for fine-grained image categorization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Scott</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>AAAI</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Grafit: Learning fine-grained image representations with coarse labels</title>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Douze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jégou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition</title>
		<author>
			<persName><forename type="first">B</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X.-S</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z.-M</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">SMOTE: synthetic minority over-sampling technique</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">V</forename><surname>Chawla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">W</forename><surname>Bowyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">O</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Kegelmeyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of artificial intelligence research</title>
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gordo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kalantidis</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.09217</idno>
		<title level="m">Decoupling representation and classifier for long-tailed recognition</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Menon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jayasumana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Rawat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Veit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2007.07314</idno>
		<title level="m">Long-tail learning via logit adjustment</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Class-balanced loss based on effective number of samples</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Equalization loss v2: A new gradient balance approach for long-tailed object detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Luo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2201.02593</idno>
		<title level="m">Equalized focal loss for dense long-tailed object detection</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Masked autoencoders are scalable vision learners</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.11929</idno>
		<title level="m">An image is worth 16x16 words: Transformers for image recognition at scale</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">ImageNet: A large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Going deeper with convolutions</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sermanet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Reed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Anguelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rabinovich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Rethinking the inception architecture for computer vision</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wojna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cisse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lopez-Paz</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1710.09412</idno>
		<title level="m">mixup: Beyond empirical risk minimization</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">CutMix: Regularization strategy to train strong classifiers with localizable features</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Oh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Choe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yoo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<title level="m" type="main">Random erasing data augmentation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>AAAI</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<title level="m">RandAugment: Practical automated data augmentation with a reduced search space</title>
		<imprint>
			<publisher>CVPRW</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">ImageNet classification with deep convolutional neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>NeurIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
