Do Not Lose to Losses for SnakeCLEF2024

Matěj Sieber¹, Tomáš Železný¹
¹ University of West Bohemia, Faculty of Applied Sciences, Univerzitni 2732/8, 301 00 Pilsen, Czech Republic

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
sieberm@kky.zcu.cz (M. Sieber); zeleznyt@kky.zcu.cz (T. Železný)
https://github.com/sieberm111 (M. Sieber); https://github.com/zeleznyt (T. Železný)
ORCID: 0009-0005-9406-0585 (M. Sieber); 0000-0002-0974-7069 (T. Železný)

Abstract
This paper presents our participation in the SnakeCLEF 2024 challenge, which aims to automate the identification of snake species. We explore several custom loss functions that incorporate the venomousness of snakes. These loss functions are used to train a Swin-v2 tiny model with the same training specification as the baseline solution, so that the impact of each custom loss can be measured accurately. The Swin-v2 tiny model is attractive because of its low computational demand, which opens the possibility of use on handheld devices. Our results show that the best approach for maximising performance on the custom competition metrics is to apply a soft target set according to the venomousness of the snake. The best accuracy is achieved by the model trained with a loss that weights the classes according to the number of their instances.

Keywords
SnakeCLEF, Snake Bite, Computer Vision, Classification, Snake Species Identification, Imbalanced dataset

1. Introduction
An estimated 5.4 million people worldwide are bitten by snakes each year, resulting in 1.8 to 2.7 million cases of envenoming. This leads to approximately 81,410 to 137,880 deaths annually, with three times as many cases of amputations and other permanent disabilities. Venomous snakebites can cause severe health issues, including paralysis that can inhibit breathing, bleeding disorders leading to fatal hemorrhage, irreversible kidney failure, and tissue damage resulting in permanent disability and limb amputation [1]. To address these challenges, developing accurate and lightweight models for identifying snake species could significantly improve health outcomes. LifeCLEF-SnakeCLEF 2024 [2, 3] makes it possible to train such models by providing the training data. In addition to the classic classification metrics, the competition evaluates entries on custom metrics that take the venomousness of snakes into account, which is crucial in a real-world scenario. This work describes our participation in SnakeCLEF 2024 with a focus on minimizing these custom venomousness metrics.

2. Dataset
The SnakeCLEF dataset [4] is a comprehensive collection of snake images used for the classification of snake species. The dataset consists of three parts: training, validation, and a private test set used for competition evaluation. The training and validation sets are publicly available, while the private test set is held back for evaluating the competition entries. The dataset is available in multiple variants, differing in image size, to accommodate various computational capabilities and research needs. As stated in the SnakeCLEF 2023 report [5], all subsets combined amount to roughly 110,000 real snake observations with community-verified species labels, ensuring high-quality and reliable data for training models. The dataset contains 1,784 snake species, and the classes exhibit a long-tailed distribution: a small number of classes have a large number of images, while many classes have only a few. Figure 1 shows an illustrative representation of medically important snake species.

Figure 1: Medically important venomous snakes. Top row: Daboia russelii (Russell's Viper – Asia), Bitis arietans (Puff Adder – Africa), and Crotalus adamanteus (Eastern Diamond-backed Rattlesnake – North America). Bottom row: Bothrops atrox (Common Lancehead – South America), Acanthophis antarcticus (Common Death Adder – Australia), and Vipera ammodytes (Nose-horned Viper – Europe). Courtesy of [5].
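To give a concrete picture of this imbalance, the class distribution can be inspected directly from the training metadata. The snippet below is only a sketch: the metadata file name and the species column name are illustrative assumptions, not the exact SnakeCLEF 2024 schema.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to the actual SnakeCLEF 2024 metadata.
metadata = pd.read_csv("SnakeCLEF2024-TrainMetadata.csv")

# Number of images per species, sorted from most to least frequent.
class_counts = metadata["binomial_name"].value_counts()

print(f"Species: {len(class_counts)}")
print("Most populated classes:", class_counts.head(3).to_dict())
print("Least populated classes:", class_counts.tail(3).to_dict())
# A large head-to-tail ratio confirms the long-tailed distribution described above.
```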
3. Evaluation
The competition evaluates competitors' models using four different metrics. The first two are the macro-averaged F1 score and accuracy, the standard metrics for classification tasks. In the real world, however, misclassifying different snakes does not have the same consequences. In the worst-case scenario, a deadly venomous snake is misclassified as a harmless one, which may result in death. With this in mind, the organisers introduced two further metrics that take into account whether the snake is venomous or not; they are given in Equations (2) and (3). In reality, different venomous snake bites also cannot be treated in the same way, as some snakes are more venomous than others and their bites require different types of serum, which may vary in side effects, price, or local availability. As it would be very expensive to achieve this level of granularity, a generalisation was made in the form of a universal, harmless, free, and always-accessible antivenom.

L(y, \hat{y}) =
\begin{cases}
0 & \text{if } y = \hat{y} \\
1 & \text{if } y \neq \hat{y} \text{ and } p(y) = 0 \text{ and } p(\hat{y}) = 0 \\
2 & \text{if } y \neq \hat{y} \text{ and } p(y) = 0 \text{ and } p(\hat{y}) = 1 \\
2 & \text{if } y \neq \hat{y} \text{ and } p(y) = 1 \text{ and } p(\hat{y}) = 1 \\
5 & \text{if } y \neq \hat{y} \text{ and } p(y) = 1 \text{ and } p(\hat{y}) = 0
\end{cases}    (1)

L = \sum_{i} L(y_i, \hat{y}_i),    (2)

where y is the correct species, \hat{y} is the predicted species, and p(s) = 1 if species s is venomous, otherwise p(s) = 0.

M = \frac{w_1 F_1 + w_2 C_{hh} + w_3 C_{hv} + w_4 C_{vv} + w_5 C_{vh}}{w_1 + w_2 + w_3 + w_4 + w_5},    (3)

where w_1 = 1, w_2 = 1, w_3 = 2, w_4 = 2, and w_5 = 5 are the weights of the individual confusions, C_{vh} is the percentage of venomous species wrongly classified as harmless species, C_{hv} is the percentage of harmless species wrongly classified as venomous species, C_{vv} is the percentage of venomous species wrongly classified as another venomous species, C_{hh} is the percentage of harmless species wrongly classified as another harmless species, and F_1 is the macro-averaged F1 score.
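To make the scoring concrete, the weighted error of Equations (1) and (2) can be expressed as a short function. This is a sketch based only on the definitions above; the venomousness lookup passed as an argument is an assumed input and this is not the official evaluation code.

```python
import numpy as np

def penalty(y_true: int, y_pred: int, venomous: np.ndarray) -> int:
    """Per-observation penalty L(y, y_hat) from Equation (1).

    venomous[s] is 1 if species s is venomous and 0 otherwise.
    """
    if y_true == y_pred:
        return 0
    if venomous[y_true] == 0 and venomous[y_pred] == 0:
        return 1   # harmless confused with another harmless species
    if venomous[y_true] == 0 and venomous[y_pred] == 1:
        return 2   # harmless predicted as venomous
    if venomous[y_true] == 1 and venomous[y_pred] == 1:
        return 2   # venomous confused with another venomous species
    return 5       # venomous predicted as harmless -- the worst case

def total_penalty(y_true, y_pred, venomous) -> int:
    """Dataset-level L metric from Equation (2)."""
    return sum(penalty(t, p, venomous) for t, p in zip(y_true, y_pred))
```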
4. Experiments
Given the practical limitations and the need for widespread accessibility, we aimed to develop a model that could run on handheld devices such as smartphones and tablets. Real-time identification of snake species has the potential to significantly improve the speed and effectiveness of medical response, potentially saving lives and reducing the incidence of serious snakebite complications. This capability is particularly important in remote areas, where snakebites are most common and access to high-performance computing resources is limited. We use the Swin-v2 tiny model [6], which is suitable for this task due to its small size and efficiency. Such lightweight models are also less prone to overfitting, which is particularly important when dealing with diverse and imbalanced datasets. We aim to maximise the performance of the model on the metrics that take into account whether the snake is venomous or not. Primarily, we focus on minimizing the L metric (Equation 2), as it closely matches real-world scenarios in which accurate identification of snake species is paramount.

The main goal of our work is to optimise the loss function. Specifically, we created four different custom losses to meet our objectives. All results are presented in Table 1. To measure the impact of the custom losses reliably, we use the same training parameters as the baseline: RandResizedCrop and RandAugment augmentations, an input resolution of 256x256, a learning rate of 0.01, and the SGD optimizer. The full training pipeline for the baseline solution is available at the BVRA GitHub¹. The code for the proposed methods can be found at our GitHub².

¹ BVRA GitHub: https://github.com/BohemianVRA/FGVC-Competitions/tree/feat/baselineTrainingForSnakeCLEF2024/SnakeCLEF2024
² Authors' GitHub: https://github.com/sieberm111/snakeclef2024

4.1. Dual-head
The aim of this experiment is to improve the performance of the model by incorporating snake venomousness information through a combination of two classification losses. We add a second head consisting of one neuron with a Sigmoid activation function. In addition to the Categorical Cross Entropy loss, we also train the model on the binary classification of venomous/harmless classes using the Binary Cross Entropy loss, resulting in the loss given in Equation (4).

\mathcal{L}_{\text{Dual-head}} = \mathcal{L}_{\text{BCE}} + \mathcal{L}_{\text{CE}}    (4)

4.2. Rare class boost
Given the nature of the data, there are several strategies to address a long-tailed distribution, such as uneven sampling of the training data or assigning greater weight to rarer classes. Our approach uses the weighting strategy, which achieves the best accuracy of all our methods apart from the model ensembles. In this experiment, we compute the loss with SeeSawLoss [7] and multiply it by the rarity of each class. Although SeeSawLoss was originally developed for long-tailed data, our modification of incorporating dynamic class rarity with respect to the whole dataset further improves the results. These results can be directly compared to the baseline solution, which also uses SeeSawLoss.

S_{\text{batch}} = \sum_{i=1}^{B} C_{k_i}, \qquad \beta = \frac{N}{B \cdot S_{\text{batch}}}, \qquad \mathcal{L}_{\text{ClsBoost}} = \beta \cdot \mathcal{L}_{\text{SeeSaw}}(\mathbf{x}, y),    (5)

where C_k is the number of instances of class k, N is the dataset size, and B is the batch size.

4.3. CE + VenomousPenalty
Another way of taking the venomousness of the snake into account is to simply add a venomousness penalty, according to Equation (1), to the classical Categorical Cross Entropy loss. Although this method does not perform notably better than the baseline, it achieves the best results in the F1 metric.

\mathcal{L}_{CE+VP} = \mathcal{L}_{CE} + L,    (6)

where L is the penalty defined in Equation (1).

4.4. Soft target
Instead of the classical approach of setting the Cross Entropy target as a one-hot vector, we explore the use of a soft target. In this method, our goal is to set the negative targets to values corresponding to the venomousness penalties in Equation (1), while ensuring that the sum of the target values remains 1. First, we linearly transform the penalties using Equation (7). This results in a target value of 1 for the least penalized classification, i.e., y = \hat{y}, and a target value of 0 for the most penalized classification, i.e., y \neq \hat{y} with a venomous snake classified as harmless. The values are then normalised and used as a soft target.

\tau = -0.2 \cdot p + 1,    (7)
\text{Targets} = \text{Norm}(\tau),    (8)

where p are the penalties from Equation (1) and \tau are the targets before normalisation.
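As an illustration, the construction of Equations (7) and (8) could be implemented roughly as follows. This is a sketch under our own naming: it assumes integer class labels and a per-class venomousness vector, and the vectorised penalty lookup is ours rather than code taken from the released repository.

```python
import torch

def linear_soft_targets(labels: torch.Tensor, venomous: torch.Tensor) -> torch.Tensor:
    """Soft targets of Equations (7)-(8).

    labels:   (B,) ground-truth class indices
    venomous: (C,) 1 for venomous species, 0 for harmless
    returns:  (B, C) target rows that sum to 1
    """
    num_classes = venomous.numel()
    v_true = venomous[labels].unsqueeze(1)   # (B, 1) venomousness of the true class
    v_pred = venomous.unsqueeze(0)           # (1, C) venomousness of every class

    # Penalties of Equation (1) for every (true class, candidate class) pair.
    p = torch.ones(labels.numel(), num_classes)      # harmless vs. harmless -> 1
    p[(v_true == 0) & (v_pred == 1)] = 2.0           # harmless as venomous
    p[(v_true == 1) & (v_pred == 1)] = 2.0           # venomous as another venomous
    p[(v_true == 1) & (v_pred == 0)] = 5.0           # venomous as harmless
    p[torch.arange(labels.numel()), labels] = 0.0    # correct class

    tau = -0.2 * p + 1.0                             # Equation (7)
    return tau / tau.sum(dim=1, keepdim=True)        # Equation (8): normalise each row

# Such targets can be plugged into a soft-target cross entropy, e.g.
# loss = -(targets * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()
```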
This method results in poor performance, as the positive and negative target values end up close to each other after normalisation. This motivates us to create a soft target method that sets the positive target to a significantly higher value than the negative targets. Our aim is to set the positive target value in the range ⟨0.5; 1.0). We conducted empirical experiments, which resulted in the target given in Equation (9). The best-performing models are denoted SoftT-3 for the temperature parameter t = 3 and SoftT-4 for t = 4.

\text{Target} = \text{Softmax}(-t \cdot \log(p)),    (9)

where p = 0.1 for y = \hat{y}, p = 1 for y \neq \hat{y} and p(y) = 0 and p(\hat{y}) = 0, p = 10 for y \neq \hat{y} and p(y) = 0 and p(\hat{y}) = 1, p = 10 for y \neq \hat{y} and p(y) = 1 and p(\hat{y}) = 1, p = 100 for y \neq \hat{y} and p(y) = 1 and p(\hat{y}) = 0, and t is the temperature parameter.

4.5. Model Ensemble
The competition sets a limit of 60 minutes on the maximum model inference time on the test set. Since our model is able to process the test set within a few minutes, we decided to use the remaining time for additional experiments. We created an ensemble of our models by averaging their logits, and the ensemble performed noticeably better. Since there was still a lot of time left, we also averaged, for each model, the logits of a given image and of its horizontally flipped version, as flipping was not part of the training augmentations. This doubled our inference time; however, it did not bring any improvement, so we do not report it.
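A minimal sketch of the logit-averaging ensemble described above (the function name, model handling, and tensor shapes are our own assumptions; the released code may differ):

```python
import torch

@torch.no_grad()
def ensemble_logits(models, images: torch.Tensor) -> torch.Tensor:
    """Average the logits of several trained classifiers for one batch of images."""
    stacked = torch.stack([model(images) for model in models])  # (M, B, C)
    return stacked.mean(dim=0)

# Usage: predictions = ensemble_logits(models, batch).argmax(dim=1)
#
# The horizontal-flip variant we tried (no measurable gain) simply averages in
# the flipped batch as well:
# flipped = torch.flip(batch, dims=[-1])
# logits = 0.5 * (ensemble_logits(models, batch) + ensemble_logits(models, flipped))
```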
Table 1
Comparison of results on the public dataset for the proposed methods and the baseline solution. CE refers to the same method as the baseline, but trained with the Cross Entropy loss instead of the SeeSaw loss. The ensembles of models are presented separately.

| Method | M metric | L metric | F1 | Accuracy |
|---|---|---|---|---|
| Competition Baseline | 67.01 | 1861 | 13.34 | 39.88 |
| CE | 62.33 | 2195 | 11.69 | 33.19 |
| Dual-head | 67.77 | 1813 | 14.56 | 41.87 |
| ClsBoost | 68.04 | 1788 | 13.87 | 42.75 |
| CE + VenomousPenalty (CE-VP) | 67.10 | 1863 | 14.58 | 39.51 |
| SoftT linear | 64.62 | 2024 | 11.13 | 36.72 |
| SoftT-4 | 65.01 | 2004 | 12.55 | 35.54 |
| SoftT-3 | 68.15 | 1785 | 14.41 | 41.94 |
| Ensemble SoftT-3 + ClsBoost | 69.92 | 1660 | 15.44 | 45.77 |
| Ensemble SoftT-3 + SoftT-4 + CE-VP + ClsBoost | 69.88 | 1668 | 16.12 | 45.11 |

Table 2
Comparison of results on the private dataset. The ensembles of models are presented separately.

| Method | M metric | L metric | F1 | Accuracy |
|---|---|---|---|---|
| Competition Baseline | 64.30 | 5067 | 10.9 | 36.41 |
| Dual-head | 64.94 | 5004 | 12.2 | 37.77 |
| ClsBoost | 65.26 | 4921 | 11.4 | 38.09 |
| CE + VenomousPenalty (CE-VP) | 65.33 | 4860 | 11.75 | 36.1 |
| SoftT linear | 62.08 | 5472 | 10.94 | 34.53 |
| SoftT-4 | 64.06 | 5075 | 10.39 | 35.12 |
| SoftT-3 | 65.37 | 4901 | 13.17 | 37.77 |
| Ensemble SoftT-3 + ClsBoost | 67.00 | 4611 | 13.29 | 41.26 |
| Ensemble SoftT-3 + SoftT-4 + CE-VP + ClsBoost | 67.47 | 4515 | 13.89 | 41.4 |

5. Conclusion
This work presents our participation in the SnakeCLEF 2024 competition. Our approach is based on the compact Swin-v2 tiny model, known for its speed and suitability for running on mobile devices such as smartphones and tablets. Instead of focusing on conventional methods, we decided to experiment exclusively with custom loss functions tailored to our specific scenario. In particular, we focused on the L metric, which is designed to penalise misclassification of snakes based on their venomousness.

The results (see Tables 1 and 2) show that the Dual-head approach improved on the baseline solution, and these improvements were stable on both the public and private datasets. The ClsBoost loss was a viable idea that maintained the best accuracy among the single models on both the public and private datasets. Since this loss does not target the custom metrics but rather aims to reduce the effect of the long-tailed distribution, its high accuracy, although correlated with the other metrics, was not sufficient to reach the best custom-metric scores. The CE-VP loss function proved that incorporating the L metric into the loss function helps; however, there are notable differences between its performance on the public and the private test set. The SoftT loss function achieved the best single-model results, namely on the M and L metrics on the public dataset and on the M and F1 metrics on the private dataset. Since this method uses a hyperparameter chosen by an empirical study, future work could tune this hyperparameter in an ablation study with the aim of finding its best-performing setting.

Since the competition allows a maximum inference time of 60 minutes, and our model requires only a few minutes to infer the entire test set, we decided to create an ensemble of our best models, resulting in better scores. As an extension of our methods, we propose to use location data in the recognition process. By using GPS information, which is available on handheld devices such as smartphones, we can improve the accuracy of species identification by taking into account the geographical distribution of different snake species.

Acknowledgments
Computational resources were provided by the e-INFRA CZ project (ID: 90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.

References
[1] World Health Organization, Snakebite envenoming, 2023. https://www.who.int/news-room/fact-sheets/detail/snakebite-envenoming.
[2] A. Joly, L. Picek, S. Kahl, H. Goëau, V. Espitalier, C. Botella, B. Deneu, D. Marcos, J. Estopinan, C. Leblanc, T. Larcher, M. Šulc, M. Hrúz, M. Servajean, et al., Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2024.
[3] L. Picek, M. Hruz, A. M. Durso, Overview of SnakeCLEF 2024: Revisiting snake species identification in medically important scenarios, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[4] LifeCLEF, SnakeCLEF2024, 2024. https://huggingface.co/spaces/BVRA/SnakeCLEF2024.
[5] L. Picek, M. Šulc, R. Chamidullin, A. Durso, Overview of SnakeCLEF 2023: Snake identification in medically important scenarios, CLEF, 2023.
[6] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, F. Wei, B. Guo, Swin Transformer V2: Scaling up capacity and resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12009–12019.
[7] J. Wang, W. Zhang, Y. Zang, Y. Cao, J. Pang, T. Gong, K. Chen, Z. Liu, C. C. Loy, D. Lin, Seesaw loss for long-tailed instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 9695–9704.