Say No to the Poisonous Fungi: An Effective Strategy for Reducing 0-1 Cost in FungiCLEF2024

Bao-Feng Tan1,†, Yang-Yang Li1,†, Peng Wang1,†, Lin Zhao1 and Xiu-Shen Wei2,*

1 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
2 School of Computer Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Nanjing, China

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† Under the supervision of Xiu-Shen Wei.
tanbf@njust.edu.cn (B. Tan); lyylyyi599@njust.edu.cn (Y. Li); wangpeng@njust.edu.cn (P. Wang); linzhao@njust.edu.cn (L. Zhao); weixs.gm@gmail.com (X. Wei)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The FungiCLEF2024 competition endeavors to precisely identify fungi species leveraging both metadata and image analysis. Pivotal to the success of this competition are two crucial evaluation metrics: minimizing the error rate and the 0-1 cost loss resulting from misclassification. To reduce the identification error rate, we introduce a Dynamic MLP framework, drawing inspiration from [1]. This approach effectively integrates image and metadata embeddings through recursive blocks, utilizing matrix multiplication for deep fusion of information. To further address the issue of 0-1 cost, we devise a novel probability-based screening strategy, which initially consolidates poisonous fungi categories into a single class, then employs a marginal expected loss and a threshold parameter α to optimize the recall rate for poisonous species. These approaches significantly reduce the error rate and 0-1 cost associated with misclassification and achieve a score of 0.5548 on the private leaderboard, securing the third-place ranking. The code is available at https://github.com/bftan1949/FungiCLEF2024.

Keywords
Fine-grained image recognition, Open-Set, 0-1 cost loss, Fungi Species Identification

1. Introduction

Fine-grained visual classification, as a core challenge in the field of computer vision and pattern recognition, plays a pivotal role in diverse practical applications [2]. The FungiCLEF2024 [3] challenge, serving as a crucial component of LifeCLEF2024 [4], aims to promote and incentivize in-depth research on fungi identification algorithms, particularly in complex scenarios that integrate image and metadata inputs. The achievement of this goal not only holds immense value for biodiversity conservation, but also plays a crucial role in maintaining human health.

Prior FungiCLEF challenges have achieved significant progress through deep learning models [5, 6, 7, 8, 9, 10, 11]. To further enhance the practical significance of the competition and effectively address the challenges faced by developers, scientists, users, and the community, this year's organizers have introduced additional constraints. Therefore, the challenges faced by this year's competition can be summarized as follows:

• Fine-grained Image Recognition: As a persistent challenge in computer vision, fine-grained image analysis requires participants to conduct deeper research and technological innovations.
• Feature Fusion: Effectively fusing metadata features with intuitive image features is crucial, especially when it comes to fine-grained distinction. Relying on subtle cues from metadata to differentiate closely related categories becomes paramount.
• Open-set Recognition: Open-set recognition directly affects the robustness and security of artificial intelligence systems. The FungiCLEF2024 challenge specifically includes a large amount of open-set data with unknown categories in the test set.
• 0-1 Cost Loss: One of the core issues in FungiCLEF2024, which categorizes fungi into poisonous and non-poisonous, is how to construct a model that minimizes the misclassification of poisonous fungi as non-poisonous, ensuring high reliability and safety in identification results.
• Hardware Constraints: All algorithms will be executed on the HuggingFace platform, subject to strict limitations of 16GB of GPU memory and a two-hour runtime.

The FungiCLEF2024 dataset is based on data collected through the Atlas of Danish Fungi mobile and web applications. All fungi specimen observations had to pass an expert validation process, therefore guaranteeing high-quality labels. The training dataset contains 295,938 images belonging to 1,604 species. The validation dataset contains 60,832 images belonging to 2,713 species, 1,084 known from the training set and 1,629 unknown species. The dataset statistics are listed in Table 1.

Table 1
FungiCLEF2024 dataset statistics for each split.

Subset       Species   Known Species   Unknown Species   All Images   Poisonous Images
Training     1,604     1,604           -                 295,938      12,977
Validation   2,713     1,084           1,629             60,832       2,532

For fine-grained image recognition and feature fusion, this paper employs Dynamic MLP [1] to fully fuse diverse feature information. Additionally, for open-set recognition, an entropy-based approach is utilized, leveraging the model's prediction confidence through entropy to identify open-set images, surpassing previous methods. Furthermore, to minimize the critical 0-1 cost loss caused by misclassifying poisonous fungi as non-poisonous, this paper proposes an easy but quite effective way to mitigate the 0-1 cost by utilizing a marginal expected loss function during training, which significantly reduces the cost loss while maintaining accuracy. Details of the methods will be discussed in Section 3.

The subsequent sections are organized as follows: In Section 2, we will provide a detailed explanation and interpretation of the dataset and evaluation metrics used in the competition. Section 3 will outline the methodology and core concepts adopted in this paper. Section 4 will focus on presenting our experimental results and actual performance in the competition. Finally, in Section 5, we will provide a comprehensive review and summary of the entire content.

2. Related Work and Evaluation Metrics

2.1. Related Work

Fine-grained image classification: To enhance fine-grained image classification, several approaches have been proposed. For instance, [12, 13, 14, 15, 16] detect the discriminative regions of an image to exploit subtle details. SnapMix [17] utilizes the class activation map (CAM) [18] to mitigate label noise in fine-grained data augmentation. Similarly, Attribute Mix [19] focuses on semantically meaningful attribute features from two images to identify the same super-categories. FixRes [20] investigates data augmentation and resolution strategies to boost classification performance. Other studies focus on extracting more valuable features from multi-channel networks [21] or through contrastive learning [22].
Using additional information: Besides visual information, researchers have incorporated additional information to enhance classification performance. Many existing works [23, 24, 25, 26] combine the image features with additional multi-modal features directly through channel-wise concatenation. Multi-modal features, including images, ages, and dates, were first introduced by Tang et al. [26], who concatenated them from an MLP backbone network to make a joint prediction. Subsequently, Minetto et al. [24] introduced metadata to the geo-spatial land classification task. Further, Salem et al. [25] integrated dense overhead imagery with location and date into a general framework by concatenating the outputs of the context network.

Open-set recognition: Discriminative models are one of the most important approaches for open-set recognition [27]. Traditional methods, such as 1-vs-Set machines based on SVM [28], often suffer from limitations stemming from the weak feature extraction ability of those traditional models. In recent years, deep learning-based methods have garnered increasing attention due to their powerful representation abilities. Bendale et al. [29] first proposed replacing the softmax layer in the network with OpenMax, which calibrates the output probability using the Weibull distribution. A similar work [30] replaced the softmax layer with one-vs-rest units. These methods have pioneered a new direction for the research of open-set recognition.

Previous FungiCLEF work: Most contributions to FungiCLEF2023 were centered on modern Convolutional Neural Network (CNN) or transformer-inspired architectures, such as MetaFormer [31], Swin Transformer [32], and Volo [33]. The winning team [34] achieved 79.28% accuracy using MetaFormer. These results were often enhanced by combining predictions from the same observation and through data augmentations applied during both training and testing. Techniques such as Seesaw loss [35], Focal loss [36], Arcface loss [37], and Sub-Center loss [38] have achieved great success in addressing the unbalanced class distribution. Additionally, metadata was combined with image features to classify fungi categories, further improving the overall performance.

2.2. Evaluation Metrics

FungiCLEF2024 has set a total of 3 evaluation metrics, namely Track1, Track2, and Track3, which are introduced below.

Track1: The first metric is the standard classification error, which is the average error in predicting class labels. All categories not present in the training set should be correctly classified as the "unknown" category (i.e., labeled as -1). The specific calculation formula is as follows:

\[
\mathrm{Track1}(y, q(x)) =
\begin{cases}
0 & \text{if } q(x) = y \\
1 & \text{otherwise}
\end{cases}
\tag{1}
\]

Here, $x$ represents the input image, $y$ represents the true label of the input image, and $q$ represents the trained classification model.

Track2: The second metric is the cost loss associated with confusing non-toxic and toxic species. Define $d(\cdot)$ as an indicator function, where $d(y) = 1$ indicates that category $y$ is a toxic category, and $d(y) = 0$ indicates that category $y$ is a non-toxic category. The specific calculation formula for Track2 is as follows:

\[
\mathrm{Track2}(y, q(x)) =
\begin{cases}
0 & \text{if } d(q(x)) = d(y) \\
c_{PSC} & \text{if } d(q(x)) = 0 \ \& \ d(y) = 1 \\
c_{ESC} & \text{if } d(q(x)) = 1 \ \& \ d(y) = 0
\end{cases}
\tag{2}
\]

In this competition, $c_{PSC} = 100$ and $c_{ESC} = 1$.

Track3: The third metric is the sum of Track1 and Track2. The specific formula is as follows:

\[
\mathrm{Track3} = \mathrm{Track1} + \mathrm{Track2}
\tag{3}
\]

The final ranking of the competition is based on the performance of Track3. Regardless of whether it is Track1, Track2, or Track3, the lower the score, the higher the ranking.
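For concreteness, the per-observation costs of Eqs. (1)-(3) can be written as small helpers. This is only a sketch: the set of poisonous species ids is a placeholder, and the leaderboard scores are obtained by averaging these per-image values over all test observations.

```python
C_PSC, C_ESC = 100, 1  # competition costs: poisonous-as-edible vs. edible-as-poisonous

def track1(y_true: int, y_pred: int) -> int:
    """Eq. (1): 0-1 classification error; open-set images must be predicted as -1."""
    return 0 if y_pred == y_true else 1

def track2(y_true: int, y_pred: int, poisonous_ids: set) -> int:
    """Eq. (2): cost of confusing poisonous and edible species."""
    d_true = int(y_true in poisonous_ids)
    d_pred = int(y_pred in poisonous_ids)
    if d_pred == d_true:
        return 0
    return C_PSC if (d_pred == 0 and d_true == 1) else C_ESC

def track3(y_true: int, y_pred: int, poisonous_ids: set) -> int:
    """Eq. (3): the metric used for the final ranking."""
    return track1(y_true, y_pred) + track2(y_true, y_pred, poisonous_ids)
```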
3. Method

In this section, we introduce our method for handling recognition with an open set and present a simple way to significantly decrease Track2.

3.1. Fine-grained Image Recognition with Feature Fusion

Feature fusion. To enhance the image representation and improve the result of image classification, the "Dynamic MLP" proposed in [1] is applied to fully release the potential of the meta information. Denoting the input image feature as $z_i$ and the meta feature as $z_u$, respectively, Dynamic MLP is designed to fuse features through a matrix multiplication operation. Specifically, given the input image $x_i$, we obtain the image feature through the backbone network¹, and the meta (such as substrate and habitat) feature through a well pre-trained CLIP text model [39], which can be described as follows:

\[
z_i = \mathrm{Backbone}(x_i)
\tag{4}
\]
\[
z_u = \mathrm{MLP}(\mathrm{Cat}(\mathrm{Clip}(x_u^{sub}), \mathrm{Clip}(x_u^{hab})))
\tag{5}
\]

where $\mathrm{Backbone}(\cdot)$ denotes the model before the classification head, and $z_i \in \mathbb{R}^n$, with $n$ the output dimension. $x_u^{sub}$ and $x_u^{hab}$ denote the substrate and habitat data, respectively. $\mathrm{Clip}$ denotes a well trained CLIP text model, $\mathrm{Cat}(\cdot)$ denotes the channel-wise concatenation, and $\mathrm{MLP}$ denotes a residual MLP network, following the descriptions in PriorsNet [40]. After the MLP, the meta feature is projected into the same dimension as $z_i$, i.e., $z_u \in \mathbb{R}^n$.

Then, taking the original $z_i$ as $z_i^0$, Dynamic MLP takes $z_i^0$ and $z_u$ as initial inputs, and the enhanced image representation $z_i^N$ is obtained after $N$ recursive blocks. At last, $z_i^N$ is expanded to align its shape with $z_i^0$ by a channel-increasing layer for classifying images. The process of Dynamic MLP can be summarized into three steps:

1. Taking in the image and meta features and reshaping the meta feature from a 1-d vector to a 2-d matrix, which can be formalized as follows:

\[
W = \mathrm{Reshape}(f(z_u))
\tag{6}
\]

where $\mathrm{Reshape}(\cdot)$ denotes the reshape operation, and $f$ denotes a fully connected layer.

2. Obtaining the enhanced image feature $z_i^N$ after $N$ recursions are completed:

\[
z_i^{n+1} = \mathrm{ReLU}(\mathrm{LN}(f(W \,@\, z_i^n))), \quad n = 0, 1, \ldots, N-1
\tag{7}
\]

where $\mathrm{ReLU}(\cdot)$ and $\mathrm{LN}(\cdot)$ denote the ReLU activation function and layer normalization, respectively, and the operator $@$ denotes matrix multiplication.

3. Aligning the dimension of $z_i^N$ and $z_i^0$:

\[
\hat{z}_i^N = \mathrm{Layer}(z_i^N)
\tag{8}
\]

where $\mathrm{Layer}(\cdot)$ denotes a channel-increasing layer.

Fine-grained image classification. After the final enhanced image feature $\hat{z}_i^N$ is obtained, we can use it to classify the fine-grained fungi images:

\[
logits = \mathrm{Head}(\hat{z}_i^N)
\tag{9}
\]

where $\mathrm{Head}$ is the last classification head, which is used for recognizing images.

¹ In CNN backbones, the image feature is acquired after a pooling layer. In ViT-based models, the image feature is the [CLS] token of the last layer.
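A minimal PyTorch sketch of the recursive fusion in Eqs. (6)-(8) is given below. The hidden dimension, the reduction layer applied to the image feature before the recursion (implied by the channel-increasing layer of Eq. (8)), and the sharing of the projection $f$ across blocks are illustrative assumptions rather than the exact configuration of [1]; in our pipeline, z_meta would correspond to the CLIP-based meta embedding of Eq. (5).

```python
import torch
import torch.nn as nn

class DynamicMLPFusion(nn.Module):
    """Sketch of the recursive image-meta fusion of Eqs. (6)-(8); layer sizes are illustrative."""
    def __init__(self, feat_dim=1024, hidden_dim=64, num_blocks=2):
        super().__init__()
        self.reduce = nn.Linear(feat_dim, hidden_dim)                    # project image feature to hidden_dim
        self.to_matrix = nn.Linear(feat_dim, hidden_dim * hidden_dim)    # f(z_u) before Reshape, Eq. (6)
        self.proj = nn.Linear(hidden_dim, hidden_dim)                    # f(.) inside the recursion, Eq. (7)
        self.norm = nn.LayerNorm(hidden_dim)
        self.expand = nn.Linear(hidden_dim, feat_dim)                    # channel-increasing layer, Eq. (8)
        self.num_blocks = num_blocks
        self.hidden_dim = hidden_dim

    def forward(self, z_img: torch.Tensor, z_meta: torch.Tensor) -> torch.Tensor:
        # z_img: [B, feat_dim] backbone feature, z_meta: [B, feat_dim] meta embedding
        W = self.to_matrix(z_meta).view(-1, self.hidden_dim, self.hidden_dim)   # Eq. (6)
        z = self.reduce(z_img)                                                   # z_i^0 in the hidden space
        for _ in range(self.num_blocks):                                         # Eq. (7), N recursions
            z = torch.bmm(W, z.unsqueeze(-1)).squeeze(-1)                        # W @ z_i^n
            z = torch.relu(self.norm(self.proj(z)))
        return self.expand(z)                                                    # enhanced feature, Eq. (8)
```

The fused feature returned here would then be fed to the classification head of Eq. (9).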
3.2. Entropy Based Open-set Identifier

In the Fungi competition, models are asked not only to correctly identify the species in the closed set, but also to recognize the pictures in the open set. As shown in Table 2, existing approaches such as [29, 41] show their superiority on coarse-grained datasets, but are not suitable for large-scale fine-grained datasets. We therefore adopt a simple entropy-based method to identify open-set images, which shows better results than [29, 41].

Table 2
Comparison of different open-set methods on the Fungi dataset. The model and training details of all methods are exactly the same during training, and the results are reported on the validation set.

Method           Top1
OpenMax [29]     61.37
OSRCI [41]       61.58
Entropy (Ours)   62.19

Entropy is defined as follows to measure the quality of the probability distribution:

\[
\mathrm{entropy}(p) = -\sum_{c=1}^{C} p_c \log p_c
\tag{10}
\]

where

\[
p = \mathrm{Softmax}(logits)
\tag{11}
\]

Here, $\mathrm{Softmax}(\cdot)$ denotes the softmax operation, $C$ denotes the number of categories to be classified in the closed set, and $p_c$ denotes the probability of being identified as category $c$. In general, the model is more confident for known categories, corresponding to a lower entropy, whereas for unknown categories the uncertainty is higher and hence the entropy will be higher. Thus, we can effectively distinguish between known and unknown categories through an entropy threshold τ. Once the threshold τ is determined, we can use the following formula to identify the open-set images:

\[
label =
\begin{cases}
-1 & \text{if } \mathrm{entropy}(p) > \tau, \\
\mathrm{Argmax}(p) & \text{if } \mathrm{entropy}(p) \leq \tau.
\end{cases}
\tag{12}
\]

where $\mathrm{Argmax}(\cdot)$ denotes the argmax function. Since the choice of τ determines the effect of the model, we find the best threshold based on the validation set.
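The decision rule of Eqs. (10)-(12) amounts to a few lines of PyTorch. The sketch below assumes batched logits from the closed-set classifier; the default threshold of 1.5 is the value reported in Section 4.1 and would in practice be chosen by sweeping candidates on the validation set.

```python
import torch

def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Eqs. (10)-(11): entropy of the softmax distribution, one value per image."""
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def open_set_labels(logits: torch.Tensor, tau: float = 1.5) -> torch.Tensor:
    """Eq. (12): -1 for suspected open-set images, otherwise the arg-max class."""
    labels = logits.argmax(dim=-1)
    return torch.where(predictive_entropy(logits) > tau,
                       torch.full_like(labels, -1), labels)
```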
3.3. Probability-guided Poisonous Recognizer

The two tasks of the competition are improving the classification accuracy and reducing the cost of classifying poisonous species as non-poisonous, respectively. In fact, the latter task has a greater impact on the final score than the former. In this section, we introduce an easy but quite effective way to reduce the cost of the latter task.

Put poisonous categories together. The Fungi dataset is a long-tail dataset with an uneven distribution; Figure 1 illustrates that the number of categories and the quantity of images for the poisonous species are significantly lower compared to edible ones. This imbalance poses a challenge for models to adequately learn robust embeddings for poisonous species, thereby leading to misclassification where poisonous species may be incorrectly labeled as edible. Such errors incur substantial costs. Therefore, we first put all poisonous categories into a single class, effectively reducing the total number of categories from 1,604 to 1,556 (comprising 1,555 edible species and 1 aggregated poisonous class). This approach facilitates the model's focus on the general features of poisonous species, alleviating the need to discern subtle distinctions. Next, two models are trained separately to classify the mixed categories (1,555 edible categories and a poisonous category) and only the poisonous categories (49 poisonous categories).

[Figure 1: Counts of poisonous vs. edible fungi by class ID. The distribution of the training dataset; the classes are arranged in descending order based on the number of samples in each category.]

Marginal expected loss. The most direct way to reduce the cost is to optimize the cost function itself, but since the calculation of the cost is discrete, we use the marginal expected loss function here:

\[
MEloss = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \sum_{c=1}^{C} p_i^c \cdot cost_{c,\mathrm{gt}(i)}
\tag{13}
\]

where $\mathcal{I}$ represents the training images and $\mathrm{gt}(\cdot)$ denotes the ground-truth label of the image. As defined in Section 2, $d(\cdot)$ indicates poisonous species, where $d(y) = 1$ if the category $y$ is poisonous, and $d(y) = 0$ if $y$ is edible. According to the calculation formula of Track2, we can obtain a specific expression for $cost_{c,\mathrm{gt}(i)}$; however, under Track2, no matter what the ground truth of the picture is, as long as the predicted label satisfies $d(\mathrm{gt}(i)) = d(c)$, the same (zero) loss is obtained, which disrupts the correct gradient descent direction towards the ground truth. Hence, we amend the expression of $cost_{c,\mathrm{gt}(i)}$ to introduce a penalty specifically for instances where $d(\mathrm{gt}(i)) = d(c)$ while $\mathrm{gt}(i) \neq c$. The formula is as follows:

\[
cost_{c,\mathrm{gt}(i)} =
\begin{cases}
0 & \text{if } \mathrm{gt}(i) = c \\
5 & \text{if } d(\mathrm{gt}(i)) = d(c) \ \& \ \mathrm{gt}(i) \neq c \\
10 & \text{if } d(\mathrm{gt}(i)) = 0 \ \& \ d(c) = 1 \\
100 & \text{if } d(\mathrm{gt}(i)) = 1 \ \& \ d(c) = 0
\end{cases}
\tag{14}
\]

The ME loss forces the model to improve its ability to distinguish the poisonous from the edible while improving the recognition accuracy.

Increasing the recall rate of the poisonous. The primary aim of reducing Track2 lies in maximizing the recall rate for the poisonous species. To achieve this, we set a probability-guided threshold α, where an image is considered poisonous as soon as the predicted probability of the poisonous category exceeds α. As previously discussed, let us define the model classifying the mixed categories as $h$ and the model classifying the poisonous categories as $g$; the ultimate decision method is outlined in Algorithm 1.

Algorithm 1 Fungi's main inference algorithm
Input: models $h$ and $g$, thresholds τ, α
 1: for image $x_i$, $i \in \mathcal{I}$ do
 2:   $p_i = h(x_i)$
 3:   if $\mathrm{entropy}(p_i) > \tau$ then
 4:     return -1                      # -1 represents the open-set categories
 5:   else if $p_i^{poi} \leq \alpha$ then   # p_i^{poi}: predicted probability of the aggregated poisonous class
 6:     return $\mathrm{Argmax}(p_i)$
 7:   else
 8:     $p_i = g(x_i)$
 9:     return $\mathrm{Argmax}(p_i)$
10:   end if
11: end for
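The following is a minimal PyTorch sketch of the amended cost matrix of Eq. (14), the ME loss of Eq. (13), and the decision rule of Algorithm 1. The label layout (the aggregated poisonous class as the last of the 1,556 indices of $h$) and the omitted mapping from $g$'s 49-way output back to species ids are illustrative assumptions.

```python
import torch

def build_cost_matrix(is_poisonous: torch.Tensor) -> torch.Tensor:
    """Eq. (14): cost[c, y] is the penalty for predicting class c when the ground truth is y."""
    d = is_poisonous.bool()                 # d(c) = 1 for poisonous classes, 0 for edible ones
    C = d.numel()
    dc, dy = d.view(C, 1), d.view(1, C)     # toxicity of predicted class c / ground-truth class y
    cost = torch.full((C, C), 5.0)          # same toxicity group but wrong species -> 5
    cost[(~dy) & dc] = 10.0                 # edible ground truth predicted as poisonous -> 10
    cost[dy & (~dc)] = 100.0                # poisonous ground truth predicted as edible -> 100
    cost.fill_diagonal_(0.0)                # exactly correct prediction -> 0
    return cost                             # shape [C, C]

def marginal_expected_loss(logits, targets, cost_matrix):
    """Eq. (13): expected cost of the predicted distribution with respect to the ground truth."""
    probs = logits.softmax(dim=-1)          # p_i, shape [B, C]
    per_class_cost = cost_matrix[:, targets].t()   # column gt(i) for each sample, shape [B, C]
    return (probs * per_class_cost).sum(dim=1).mean()

@torch.no_grad()
def classify(x, h, g, tau=1.5, alpha=0.01, poi_idx=1555):
    """Algorithm 1 for one image: h is the 1,556-way mixed classifier, g the 49-way poisonous one."""
    p = h(x).softmax(dim=-1).squeeze(0)
    if -(p * p.clamp_min(1e-12).log()).sum() > tau:   # entropy test of Eq. (12): open-set image
        return -1
    if p[poi_idx] <= alpha:                            # poisonous probability below the threshold alpha
        return int(p.argmax())                         # keep the closed-set (edible) prediction
    q = g(x).softmax(dim=-1).squeeze(0)                # otherwise decide which poisonous species it is
    return int(q.argmax())                             # index in g's label space; species-id mapping omitted
```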
4. Experiments

In this section, we introduce the implementation details and the main results.

4.1. Implementation Details

Basic settings. The thresholds τ and α are set to 1.5 and 0.01, respectively. The proposed method has been developed using the PyTorch framework [42]. Vit-large [43] and Eva02-large [44], implemented via the timm library [45], serve as the models $h$ and $g$, respectively. All the models have been pre-trained on the ImageNet dataset [46] and are conveniently accessible in HuggingFace. Fine-tuning of these models was performed using 8 Nvidia RTX3090 GPUs. The input image size is 336. The initial learning rate was set to $2 \times 10^{-5}$, and the total number of training epochs was set to 15, with the first epoch dedicated to warm-up by employing a learning rate of $2 \times 10^{-7}$. For optimal model training, we employed the AdamW optimizer [47] in conjunction with a cosine learning rate scheduler [48], with the weight decay set to $1 \times 10^{-2}$. Since the Fungi dataset has an unbalanced distribution, many studies have suggested solutions [49, 50, 51, 52]; considering the practical effect, we finally use Seesaw loss [35] and the ME loss mentioned above to optimize the model.

Data Augmentations. We employ a composed sequence of common augmentation techniques to enhance results. During training, we first perform random cropping on the image, where the size of the cropped region is randomly chosen between 50% and 100% of the original image size. Subsequently, the crop is resized using bicubic interpolation and flipped horizontally and vertically, each with a probability of 50%. Additionally, we incorporate hue-saturation and brightness-contrast augmentations to randomly adjust the hue, saturation, value, brightness, and contrast of the input image. Finally, standard normalization is applied to all input images. However, due to limitations in both running time and GPU memory on HuggingFace, test-time augmentations are simplified by first resizing images to 336 and then normalizing them using the same mean and std as during training.
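A possible torchvision counterpart of the augmentation pipeline described above is sketched below; the exact jitter strengths and the normalization statistics (ImageNet values here) are assumptions, as they are not specified in the text.

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)  # assumed statistics

# Training pipeline: random crop covering 50-100% of the image, bicubic resize to 336,
# horizontal/vertical flips, and colour jitter for hue, saturation, brightness and contrast.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(336, scale=(0.5, 1.0), interpolation=InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Simplified test-time pipeline used on HuggingFace: plain resize and normalization only.
test_tf = transforms.Compose([
    transforms.Resize((336, 336), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```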
4.2. Fungi Dataset Experiments

The key experimental results are presented in Table 4. As evident from the table, Dynamic MLP exhibits superior feature fusion ability, effectively reducing the error rate for Vit and Eva. Concurrently, the strategy of consolidating all poisonous categories mitigates the model's need to discern subtle differences between poisonous classes, enabling it to concentrate on macro differences instead. Notably, as observed in the last two rows of the table, Track2 and Track1 occupy opposing ends of a seesaw, which is a natural consequence of the poisonous recognition strategy outlined in this paper. However, from the standpoint of Track3, the advantages of optimizing Track2 outweigh those of Track1, thereby reaffirming the arguments put forth in Section 3.

Table 4
Results of different methods and settings, all of which are reported on the validation set. Track1 represents the error rate of classification (both on the closed set and the open set), Track2 represents the cost caused by misclassification, and Track3 is equal to Track1 plus Track2. For all these metrics, lower is better. DM represents Dynamic MLP, PPT represents putting the poisonous categories together, and cat represents channel-wise concatenation of meta and image features.

Backbone                Feature fusion   Open-set   0-1 cost                      Track1   Track2   Track3
Vit-large               cat              -          -                             0.4623   0.5095   0.9718
Eva-large               cat              -          -                             0.4547   0.5587   1.0134
Vit-large               DM               -          -                             0.4581   0.4854   0.9435
Eva-large               DM               -          -                             0.4438   0.4709   0.9147
Vit-large & Eva-large   DM               -          PPT                           0.4313   0.3234   0.7647
Vit-large & Eva-large   DM               Entropy    PPT                           0.3651   0.4834   0.8485
Vit-large & Eva-large   DM               Entropy    ME-loss & PPT                 0.3809   0.2868   0.6677
Vit-large & Eva-large   DM               Entropy    ME-loss & PPT & threshold α   0.3951   0.1226   0.5177

Table 3 demonstrates the impact of different values of α on Track2. It is evident from the table that as the threshold value decreases, Track2 steadily drops. This is because α directly affects the model's recall rate for poisonous classes; the smaller α is, the higher the recall rate of the model, thus reducing the probability of identifying poisonous classes as non-poisonous.

Table 3
The table shows how different values of α impact Track2. The backbones are Vit-large and Eva02-large, and all the methods mentioned in Section 3 are adopted. The results are reported on the validation set.

α      Track2
0.2    0.2577
0.15   0.2366
0.1    0.2037
0.05   0.1841
0.01   0.1226

Table 5 illustrates the performance of our team in comparison to other competitors on the private leaderboard, where we secured the 3rd position; in Track2 we clearly surpassed the 4th place and even achieved a better score than the 1st place. These outcomes collectively demonstrate the efficacy of the method presented in Section 3.3. Nonetheless, due to the simplicity of the approach we adopted for open-set recognition, we fell short in Track1 compared to other teams, highlighting an area that requires enhancement in our future endeavors.

Table 5
The final scores of different teams on the private leaderboard. Track1 represents the error rate of image recognition (including closed and open sets), Track2 represents the cost loss caused by wrong recognition, Track3 is equal to Track1 plus Track2, and the ranking is based on Track3, the smaller the better.

Rank   Team             Track1   Track2   Track3
1      IES              0.3107   0.0904   0.3621
2      jack-etheredge   0.2436   0.1629   0.4075
3      upupup (Ours)    0.3898   0.0718   0.513
4      chirmy           0.2693   0.4149   0.6667
5      TingTing1999     0.2749   0.4378   0.6934
6      glhr             0.4996   0.6511   1.1526
7      DS@GT            0.3907   1.604    2.0443

5. Conclusion

The core challenges of FungiCLEF2024 are identifying fine-grained fungi images in an open-set environment and minimizing the 0-1 cost of misclassification. To address the first challenge, we use Dynamic MLP, a recursive structure utilizing matrix multiplication, for feature fusion to improve accuracy. To mitigate the 0-1 cost, we propose an easy yet effective approach that first places poisonous fungi categories into a single class and then employs the ME loss and the threshold α to optimize the recall rate for poisonous species. However, open-set fine-grained fungi recognition remains a significant challenge. Our current approach relies solely on entropy for classifying open-set species, which has proven to be overly simplistic and inefficient. Consequently, the open-set problem stands as an enduring challenge that necessitates further investigation and innovation.

References

[1] L. Yang, X. Li, R. Song, B. Zhao, J. Tao, S. Zhou, J. Liang, J. Yang, Dynamic mlp for fine-grained image classification by leveraging geographical and temporal information, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10945–10954.
[2] X.-S. Wei, Y.-Z. Song, O. Mac Aodha, J. Wu, Y. Peng, J. Tang, J. Yang, S. Belongie, Fine-grained image analysis with deep learning: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021) 8927–8948.
[3] L. Picek, M. Sulc, J. Matas, Overview of FungiCLEF 2024: Revisiting fungi species recognition beyond 0-1 cost, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[4] A. Joly, L. Picek, S. Kahl, H. Goëau, V. Espitalier, C. Botella, B. Deneu, D. Marcos, C. Leblanc, T. Larcher, M. Šulc, M. Hrúz, M. Servajean, et al., Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2024.
[5] L. Picek, M. Šulc, J. Matas, J. Heilmann-Clausen, Overview of fungiclef 2022: Fungi recognition as an open set classification problem, Working Notes of CLEF (2022).
[6] J. Yu, H. Chang, K. Lu, G. Xie, L. Zhang, Z. Cai, S. Du, Z. Wei, Z. Liu, F. Gao, et al., Bag of tricks and a strong baseline for FGVC, Working Notes of CLEF (2022).
[7] G. Fan, C. Zining, W. Weiqiu, S. Yinan, S. Fei, Z. Zhicheng, C. Hong, Does closed-set training generalize to open-set recognition?, Working Notes of CLEF (2022).
[8] K. Desingu, A. Bhaskar, M. Palaniappan, E. A. Chodisetty, H. Bharathi, Classification of fungi species: A deep learning based image feature extraction and gradient boosting ensemble approach, Working Notes of CLEF (2022).
[9] S. Wolf, J. Beyerer, Transformer-based fine-grained fungi classification in an open-set scenario, Working Notes of CLEF (2022).
[10] F. Hu, P. Wang, Y. Li, C. Duan, Z. Zhu, Y. Li, X.-S. Wei, A deep learning based solution to fungiclef2023, Working Notes of CLEF (2023) 2051–2059.
[11] F. Hu, P. Wang, Y. Li, C. Duan, Z. Zhu, F. Wang, F. Zhang, Y. Li, X.-S. Wei, Watch out venomous snake species: A solution to snakeclef2023, arXiv preprint arXiv:2307.09748 (2023).
[12] A. Behera, Z. Wharton, P. R. Hewage, A. Bera, Context-aware attentional pooling (cap) for fine-grained visual classification, in: Proceedings of the AAAI conference on artificial intelligence, volume 35, 2021, pp. 929–937.
[13] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, Z. Zhang, The application of two-level attention models in deep convolutional neural network for fine-grained image classification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 842–850.
[14] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, L. Wang, Learning to navigate for fine-grained classification, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 420–435.
[15] N. Zhang, J. Donahue, R. Girshick, T. Darrell, Part-based r-cnns for fine-grained category detection, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, Springer, 2014, pp. 834–849.
[16] X.-S. Wei, C.-W. Xie, J. Wu, C. Shen, Mask-cnn: Localizing parts and selecting descriptors for fine-grained bird species categorization, Pattern Recognition 76 (2018) 704–714.
[17] S. Huang, X. Wang, D. Tao, Snapmix: Semantically proportional mixing for augmenting fine-grained data, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 1628–1636.
[18] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929.
[19] H. Li, X. Zhang, Q. Tian, H. Xiong, Attribute mix: Semantic data augmentation for fine grained recognition, in: 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP), IEEE, 2020, pp. 243–246.
[20] H. Touvron, A. Vedaldi, M. Douze, H. Jégou, Fixing the train-test resolution discrepancy, Advances in neural information processing systems 32 (2019).
[21] D. Chang, Y. Ding, J. Xie, A. K. Bhunia, X. Li, Z. Ma, M. Wu, J. Guo, Y.-Z. Song, The devil is in the channels: Mutual-channel loss for fine-grained image classification, IEEE Transactions on Image Processing 29 (2020) 4683–4695.
[22] Y. Gao, X. Han, X. Wang, W. Huang, M. Scott, Channel interaction networks for fine-grained image categorization, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 10818–10825.
[23] G. Mai, K. Janowicz, B. Yan, R. Zhu, L. Cai, N. Lao, Multi-scale representation learning for spatial feature distributions using grid cells, arXiv preprint arXiv:2003.00824 (2020).
[24] R. Minetto, M. P. Segundo, S. Sarkar, Hydra: An ensemble of convolutional neural networks for geospatial land classification, IEEE Transactions on Geoscience and Remote Sensing 57 (2019) 6530–6541.
[25] T. Salem, S. Workman, N. Jacobs, Learning a dynamic map of visual appearance, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12435–12444.
[26] K. Tang, M. Paluri, L. Fei-Fei, R. Fergus, L. Bourdev, Improving image classification with location context, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1008–1016.
[27] C. Geng, S.-j. Huang, S. Chen, Recent advances in open set recognition: A survey, IEEE transactions on pattern analysis and machine intelligence 43 (2020) 3614–3631.
[28] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, T. E. Boult, Toward open set recognition, IEEE transactions on pattern analysis and machine intelligence 35 (2012) 1757–1772.
[29] A. Bendale, T. E. Boult, Towards open set deep networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1563–1572.
[30] L. Shu, H. Xu, B. Liu, Doc: Deep open classification of text documents, arXiv preprint arXiv:1709.08716 (2017).
[31] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, S. Yan, Metaformer is actually what you need for vision, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10819–10829.
[32] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.
[33] L. Yuan, Q. Hou, Z. Jiang, J. Feng, S. Yan, Volo: Vision outlooker for visual recognition, IEEE transactions on pattern analysis and machine intelligence 45 (2022) 6575–6586.
[34] H. Ren, H. Jiang, W. Luo, M. Meng, T. Zhang, Entropy-guided open-set fine-grained fungi recognition, in: CLEF (Working Notes), 2023, pp. 2122–2136.
[35] J. Wang, W. Zhang, Y. Zang, Y. Cao, J. Pang, T. Gong, K. Chen, Z. Liu, C. C. Loy, D. Lin, Seesaw loss for long-tailed instance segmentation, 2021, pp. 9695–9704.
[36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[37] J. Deng, J. Guo, N. Xue, S. Zafeiriou, Arcface: Additive angular margin loss for deep face recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4690–4699.
[38] J. Deng, J. Guo, T. Liu, M. Gong, S. Zafeiriou, Sub-center arcface: Boosting face recognition by large-scale noisy web faces, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, Springer, 2020, pp. 741–757.
[39] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.
[40] O. Mac Aodha, E. Cole, P. Perona, Presence-only geographical priors for fine-grained image classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9596–9606.
[41] L. Neal, M. Olson, X. Fern, W.-K. Wong, F. Li, Open set learning with counterfactual images, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 613–628.
[42] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035.
[43] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[44] Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, Y. Cao, Eva-02: A visual representation for neon genesis, arXiv preprint arXiv:2303.11331 (2023).
[45] R. Wightman, Pytorch image models, https://github.com/rwightman/pytorch-image-models, 2019.
[46] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, 2009, pp. 248–255.
[47] I. Loshchilov, F. Hutter, Fixing weight decay regularization in adam (2017).
[48] I. Loshchilov, F. Hutter, SGDR: Stochastic gradient descent with warm restarts, 2017.
[49] X.-S. Wei, S.-L. Xu, H. Chen, L. Xiao, Y. Peng, Prototype-based classifier learning for long-tailed visual recognition, Science China Information Sciences 65 (2022) 160105.
[50] Y.-Y. He, J. Wu, X.-S. Wei, Distilling virtual examples for long-tailed recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 235–244.
[51] Y. Zhang, X. Wei, B. Zhou, J. Wu, Bag of tricks for long-tailed visual recognition with deep convolutional neural networks, 2021, pp. 3447–3455.
[52] B. Zhou, Q. Cui, X.-S. Wei, Z.-M. Chen, Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9716–9725. doi:10.1109/CVPR42600.2020.00974.