<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matěj Sieber</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Železný</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of West Bohemia, Faculty of Applied Sciences</institution>
          ,
          <addr-line>Univerzitni 2732/8, 301 00 Pilsen</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents our participation in the SnakeCLEF 2024 challenge, which aims to automate the identification of snake species. We explore several custom loss functions that incorporate the venomousness of snakes. These loss functions are used to train the Swin-v2 tiny model with the same training specification as the baseline solution, so that the impact of the custom loss functions can be measured accurately. The Swin-v2 tiny model is attractive because of its low computational demand, which opens the possibility of use on handheld devices. Our results show that the best approach for maximising performance on the custom competition metrics is to apply a soft target set according to the venomousness of the snake. The best accuracy is achieved by the model trained with a loss that weights the different classes according to the number of their instances.</p>
      </abstract>
      <kwd-group>
        <kwd>SnakeCLEF</kwd>
        <kwd>Snake Bite</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Classification</kwd>
        <kwd>Snake Species Identification</kwd>
        <kwd>Imbalanced dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>The SnakeCLEF dataset [4] is a comprehensive collection of snake images used for the classification of
snake species. It consists of three parts: a training set, a validation set, and a private test set used for
competition evaluation. The training and validation sets are publicly available, while the private test
set is held back for evaluating the competition entries. The dataset is available in multiple variants,
differing in image size, to accommodate various computational capabilities and research
needs. As stated in the SnakeCLEF2023 report [5], all subsets combined contain roughly 110,000 real
snake observations with community-verified species labels, which ensures high-quality and reliable training data.
The dataset covers 1,784 snake species, and the classes exhibit a long-tailed distribution:
a small number of classes have a large number of images, while many classes have only a
few images. Figure 1 shows an illustrative representation of medically important snake species.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>
        The competition evaluates competitors’ models using four different metrics. The first two are the
macro-averaged F1 score and accuracy, which are standard metrics for classification tasks. In the
real world, however, misclassifying different snakes does not have the same consequences. In the worst-case
scenario, a deadly venomous snake is misclassified as a harmless one, which may result in death.
With this in mind, the organisers introduced two additional metrics that take into account whether the
snake is venomous or not. These metrics are given in Equations (2) and (3). In practice,
bites of different venomous snakes cannot all be treated in the same way, as some species are more venomous than
others and each bite requires a different type of serum, which may vary in side effects, price or local availability.
As it would be very expensive to evaluate at this level of granularity, the metrics make a simplifying assumption
in the form of a universal, harmless, free and always-accessible antivenom.
      </p>
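      <p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[
P(y, \hat{y}) =
\begin{cases}
0 & \text{if } y = \hat{y}, \\
1 & \text{if } y \neq \hat{y} \text{ and } p(y) = 0 \text{ and } p(\hat{y}) = 0, \\
2 & \text{if } y \neq \hat{y} \text{ and } p(y) = 0 \text{ and } p(\hat{y}) = 1, \\
2 & \text{if } y \neq \hat{y} \text{ and } p(y) = 1 \text{ and } p(\hat{y}) = 1, \\
5 & \text{if } y \neq \hat{y} \text{ and } p(y) = 1 \text{ and } p(\hat{y}) = 0
\end{cases}
]]></tex-math>
        </disp-formula>
      </p>
      <p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math><![CDATA[
L = \sum_{i} P(y_i, \hat{y}_i),
]]></tex-math>
        </disp-formula>
      </p>
      <p>where <inline-formula><tex-math>y</tex-math></inline-formula> is the correct species and <inline-formula><tex-math>\hat{y}</tex-math></inline-formula> is the predicted species, and <inline-formula><tex-math>p(s) = 1</tex-math></inline-formula> if species <inline-formula><tex-math>s</tex-math></inline-formula> is venomous,
otherwise <inline-formula><tex-math>p(s) = 0</tex-math></inline-formula>.</p>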
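      <p>As an illustration, the penalty of Equation (1) and the summed L metric of Equation (2) can be sketched in a few lines of Python. The function names, the species ids and the <monospace>is_venomous</monospace> lookup table are hypothetical and chosen only for this example; the organisers' evaluation code may be structured differently.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def venom_penalty(true_species, pred_species, is_venomous):
    """Penalty of Equation (1) for a single prediction."""
    if true_species == pred_species:
        return 0
    true_v = is_venomous[true_species]
    pred_v = is_venomous[pred_species]
    if not true_v and not pred_v:
        return 1  # harmless species confused with another harmless species
    if not true_v and pred_v:
        return 2  # harmless species predicted as venomous
    if true_v and pred_v:
        return 2  # venomous species confused with another venomous species
    return 5      # venomous species predicted as harmless (worst case)


def total_penalty(y_true, y_pred, is_venomous):
    """L metric of Equation (2): sum of the penalties over all observations."""
    return sum(venom_penalty(t, p, is_venomous) for t, p in zip(y_true, y_pred))


# Hypothetical toy data: species 0 and 1 are harmless, 2 and 3 are venomous.
is_venomous = {0: False, 1: False, 2: True, 3: True}
y_true = [0, 0, 1, 2, 2, 3]
y_pred = [0, 1, 2, 2, 3, 0]
L = total_penalty(y_true, y_pred, is_venomous)
print(L)
```
]]></preformat>
      <p>On this toy input the individual penalties are 0, 1, 2, 0, 2 and 5, so the L metric equals 10.</p>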
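      <p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math><![CDATA[
M = \frac{w_1 F_1 + w_2 C_{h \to h} + w_3 C_{h \to v} + w_4 C_{v \to v} + w_5 C_{v \to h}}{w_1 + w_2 + w_3 + w_4 + w_5},
]]></tex-math>
        </disp-formula>
      </p>
      <p>
        where <inline-formula><tex-math>w_1 = 1</tex-math></inline-formula>, <inline-formula><tex-math>w_2 = 1</tex-math></inline-formula>, <inline-formula><tex-math>w_3 = 2</tex-math></inline-formula>, <inline-formula><tex-math>w_4 = 2</tex-math></inline-formula>, and <inline-formula><tex-math>w_5 = 5</tex-math></inline-formula> are the weights of the individual confusions,
<inline-formula><tex-math>C_{v \to h}</tex-math></inline-formula> is the percentage of venomous species wrongly classified as a harmless species,
<inline-formula><tex-math>C_{h \to v}</tex-math></inline-formula> is the percentage of harmless species wrongly classified as a venomous species,
<inline-formula><tex-math>C_{v \to v}</tex-math></inline-formula> is the percentage of venomous species wrongly classified as another venomous species,
<inline-formula><tex-math>C_{h \to h}</tex-math></inline-formula> is the percentage of harmless species wrongly classified as another harmless species,
and <inline-formula><tex-math>F_1</tex-math></inline-formula> is the macro-averaged F1 score.
      </p>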
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>Given the practical limitations and the need for widespread accessibility, we aimed to develop a model
that can run on handheld devices such as smartphones and tablets. Real-time identification of
snake species has the potential to significantly improve the speed and effectiveness of medical response,
potentially saving lives and reducing the incidence of serious snakebite complications. This capability is
particularly important in remote areas, where snakebite is most common and access to high-performance
computing resources is limited.</p>
      <p>We use the Swin-v2 tiny model [6], which is suitable for this task due to its small size and efficiency.
Such lightweight models are also less prone to overfitting, which is particularly important when dealing
with diverse and unbalanced datasets.</p>
      <p>We aim to maximise the performance of the model on the metrics that take into account whether
the snake is venomous or not. Primarily, we focus on minimising the L metric (Equation (2)), as it
closely matches real-world scenarios in which accurate identification of snake species is paramount. The
main goal of our work is to optimise the loss function. Specifically, we created four different custom
losses to meet our objectives. All results are presented in Table 1.</p>
      <p>To measure the impact of the custom losses effectively, we use the same training parameters as the baseline:
RandResizedCrop and RandAugment augmentations, an input resolution of 256×256, a learning rate of 0.01 and the
SGD optimizer. The full training pipeline of the baseline solution is available at the BVRA GitHub<sup>1</sup>. The code for
the proposed methods can be found at our GitHub<sup>2</sup>.</p>
      <sec id="sec-4-1">
        <title>4.1. Dual-head</title>
        <p>
          The aim of this experiment is to improve the performance of the model by incorporating snake venom
information through a combination of two classification losses. We add a second head consisting of a single
neuron with a sigmoid activation function. In addition to the Categorical Cross Entropy loss, we also
train the model on the binary classification of venomous versus harmless classes using the Binary Cross Entropy
loss, which results in the loss given in Equation (4).
        </p>
        <p>
          <disp-formula id="eq4">
            <label>(4)</label>
            <tex-math><![CDATA[
\mathcal{L}_{\text{Dual-head}} = \mathcal{L}_{\text{BCE}} + \mathcal{L}_{\text{CE}}
]]></tex-math>
          </disp-formula>
        </p>
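        <p>A minimal, framework-free sketch of Equation (4) for a single training sample is shown below. It assumes that the species head outputs raw logits and that the venom head outputs one logit passed through a sigmoid; the function names and the example values are illustrative and are not taken from the released training code, where the built-in losses of a deep learning framework would normally be used.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def cross_entropy(logits, target):
    """Categorical cross entropy of one sample, computed from raw class logits."""
    top = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target]


def sigmoid(x):
    """Logistic function used by the single-neuron venom head."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(prob, label):
    """Binary cross entropy of one sigmoid output; label is 0 or 1."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def dual_head_loss(species_logits, venom_logit, species, venomous):
    """Equation (4): L_BCE on the venom head plus L_CE on the species head."""
    bce = binary_cross_entropy(sigmoid(venom_logit), venomous)
    ce = cross_entropy(species_logits, species)
    return bce + ce


# Hypothetical sample: three species, true species 0, which is venomous.
loss = dual_head_loss([2.0, 0.5, -1.0], venom_logit=0.0, species=0, venomous=1)
print(round(loss, 4))
```
]]></preformat>
        <p>Both terms enter with equal weight, as in Equation (4); a weighting factor between the two heads would be a separate design choice that is not explored here.</p>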
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Rare class boost</title>
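        <p>Given the long-tailed nature of the data, several strategies are available, such as
uneven sampling of the training data or assigning greater weight to rarer classes. Our approach uses the
weighting strategy, which achieves the best accuracy among our models apart from the ensembles. In this experiment,
we compute the loss with SeeSawLoss [7] and multiply it by the rarity of the classes in the batch. Although SeeSawLoss
was originally developed for long-tailed data, our modification, which dynamically incorporates
class rarity computed across the whole dataset, further improves the results. These results can be directly compared
to the Baseline solution, which also uses SeeSawLoss.</p>
        <p>
          <disp-formula id="eq5">
            <label>(5)</label>
            <tex-math><![CDATA[
\begin{aligned}
r_{\text{batch}} &= \sum_{i=1}^{B} r_{c_i}, \\
r_c &= \frac{N}{N_c}, \\
\mathcal{L}_{\text{ClsBoost}} &= r_{\text{batch}} \times \mathcal{L}_{\text{SeeSaw}}(\mathbf{x}, \mathbf{y}),
\end{aligned}
]]></tex-math>
          </disp-formula>
        </p>
        <p>where <inline-formula><tex-math>N_c</tex-math></inline-formula> is the number of instances of class <inline-formula><tex-math>c</tex-math></inline-formula>, <inline-formula><tex-math>c_i</tex-math></inline-formula> is the class of the <inline-formula><tex-math>i</tex-math></inline-formula>-th sample in the batch, <inline-formula><tex-math>N</tex-math></inline-formula> is the dataset size, and <inline-formula><tex-math>B</tex-math></inline-formula> is the batch size.</p>
        <p><sup>1</sup> BVRA GitHub: https://github.com/BohemianVRA/FGVC-Competitions/tree/feat/baselineTrainingForSnakeCLEF2024/SnakeCLEF2024</p>
        <p><sup>2</sup> Authors' GitHub: https://github.com/sieberm111/snakeclef2024</p>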
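        <p>The weighting can be sketched as follows, assuming that the rarity of a class is the dataset size divided by its number of instances and that the batch factor is the sum of the rarities of the samples in the batch. Both readings, the function names, and the placeholder <monospace>base_loss</monospace> standing in for the SeeSaw loss value are assumptions made for illustration only.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def class_rarity(train_labels):
    """Rarity per class, assumed here to be N / N_c over the whole training set."""
    counts = Counter(train_labels)
    n_total = len(train_labels)
    return {c: n_total / n_c for c, n_c in counts.items()}


def batch_boost(batch_labels, rarity):
    """Sum of the rarities of the classes that appear in the batch."""
    return sum(rarity[c] for c in batch_labels)


def cls_boost_loss(base_loss, batch_labels, rarity):
    """Scale a precomputed batch loss (placeholder for SeeSaw) by the batch rarity."""
    return batch_boost(batch_labels, rarity) * base_loss


# Hypothetical long-tailed toy dataset: 6, 3 and 1 images for classes 0, 1 and 2.
train_labels = [0] * 6 + [1] * 3 + [2] * 1
rarity = class_rarity(train_labels)
boost = batch_boost([0, 2], rarity)           # batch with one common and one rare class
loss = cls_boost_loss(0.5, [0, 2], rarity)    # 0.5 is a made-up base loss value
print(rarity, boost, loss)
```
]]></preformat>
        <p>In this toy example the rare class 2 contributes a factor of 10 to the batch weight, whereas the common class 0 contributes only about 1.67, so batches containing rare classes produce larger gradients.</p>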
      </sec>
      <sec id="sec-4-3">
        <title>4.3. CE + VenomousPenalty</title>
        <p>
          Another approach to taking the venom of the snake into account is to simply add the venomous penalty
of Equation (1) to the classical Categorical Cross Entropy loss, as given in Equation (6).
        </p>
        <p>
          <disp-formula id="eq6">
            <label>(6)</label>
            <tex-math><![CDATA[
\mathcal{L}_{\text{CE+VP}} = \mathcal{L}_{\text{CE}} + P,
]]></tex-math>
          </disp-formula>
        </p>
        <p>where <inline-formula><tex-math>P</tex-math></inline-formula> is the penalty defined in Equation (1). Although this method does
not perform notably better than the baseline method, it achieves the best results in the F1 metric.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Soft target</title>
        <p>
          Instead of the classical approach of using a one-hot vector as the Cross Entropy target, we explore using a
soft target. In this method, our goal is to set the negative targets to values that reflect the venomous
penalties in Equation (1), while ensuring that the target values sum to 1. First, we linearly
transform these penalties using Equation (7). This results in a target value of 1 for the least penalised
classification, i.e., <inline-formula><tex-math>y = \hat{y}</tex-math></inline-formula>, and a target value of 0 for the most penalised classification, i.e., <inline-formula><tex-math>y \neq \hat{y}</tex-math></inline-formula> with
a venomous snake classified as harmless. The values are then normalised (Equation (8)) and used as a soft target.
        </p>
        <p>
          <disp-formula id="eq7">
            <label>(7)</label>
            <tex-math><![CDATA[
T = -0.2 \cdot P + 1,
]]></tex-math>
          </disp-formula>
        </p>
        <p>
          <disp-formula id="eq8">
            <label>(8)</label>
            <tex-math><![CDATA[
\text{Targets} = \text{Norm}(T),
]]></tex-math>
          </disp-formula>
        </p>
        <p>
          where <inline-formula><tex-math>P</tex-math></inline-formula> are the penalties in Equation (1), and <inline-formula><tex-math>T</tex-math></inline-formula> are the targets before normalisation.
        </p>
        <p>This method results in poor performance, as the positive and negative target values are close after
normalisation. This motivated us to create a soft target method that sets the positive target to a
significantly higher value than the negative targets. Our aim is to set the positive target value in
the range ⟨0.5; 1.0). We conducted empirical experiments, which resulted in the target given in
Equation (9). The best-performing models are denoted SoftT-3 for temperature parameter <inline-formula><tex-math>t = 3</tex-math></inline-formula> and
SoftT-4 for <inline-formula><tex-math>t = 4</tex-math></inline-formula>.</p>
        <p>
          <disp-formula id="eq9">
            <label>(9)</label>
            <tex-math><![CDATA[
\text{Target} = \text{Softmax}(-t \cdot \log(w)), \qquad
w =
\begin{cases}
0.1 & \text{for } y = \hat{y}, \\
1 & \text{for } y \neq \hat{y} \text{ and } p(y) = 0 \text{ and } p(\hat{y}) = 0, \\
10 & \text{for } y \neq \hat{y} \text{ and } p(y) = 0 \text{ and } p(\hat{y}) = 1, \\
10 & \text{for } y \neq \hat{y} \text{ and } p(y) = 1 \text{ and } p(\hat{y}) = 1, \\
100 & \text{for } y \neq \hat{y} \text{ and } p(y) = 1 \text{ and } p(\hat{y}) = 0,
\end{cases}
]]></tex-math>
          </disp-formula>
        </p>
        <p>where <inline-formula><tex-math>t</tex-math></inline-formula> is the temperature parameter.</p>
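        <p>Both soft-target variants can be sketched as below. The sketch assumes that Norm in Equation (8) divides by the sum of the values, and that Equation (9) is read as a softmax over the negated, temperature-scaled log weights, so that the correct class receives the largest target. These readings, the function names and the four-class toy setup are assumptions for illustration and are not taken from the released code.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math

# Penalties of Equation (1) mapped to the weights w listed for Equation (9).
WEIGHT_BY_PENALTY = {0: 0.1, 1: 1.0, 2: 10.0, 5: 100.0}


def penalty(true_c, pred_c, venomous):
    """Penalty of Equation (1) for predicting pred_c when true_c is correct."""
    if true_c == pred_c:
        return 0
    if venomous[true_c]:
        return 2 if venomous[pred_c] else 5
    return 2 if venomous[pred_c] else 1


def linear_soft_target(true_c, venomous):
    """Equations (7)-(8): T = -0.2 * P + 1, then divide by the sum (assumed Norm)."""
    raw = [-0.2 * penalty(true_c, k, venomous) + 1.0 for k in range(len(venomous))]
    total = sum(raw)
    return [r / total for r in raw]


def softmax_soft_target(true_c, venomous, t):
    """Assumed reading of Equation (9): Softmax(-t * log(w)) over all classes."""
    scores = [-t * math.log(WEIGHT_BY_PENALTY[penalty(true_c, k, venomous)])
              for k in range(len(venomous))]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical toy setup: classes 0-1 are harmless, classes 2-3 are venomous.
venomous = [False, False, True, True]
linear = linear_soft_target(2, venomous)          # true class 2 is venomous
soft_t3 = softmax_soft_target(2, venomous, t=3)   # temperature as in SoftT-3
print(linear)
print(soft_t3)
```
]]></preformat>
        <p>With only four classes the linear target still assigns 0.625 to the correct class; with many classes, the normalisation spreads most of the mass over the negative classes, which is the weakness noted above. Under the assumed reading of Equation (9), the correct class keeps almost all of the mass in this toy case, and venomous-to-harmless confusions receive the smallest targets.</p>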
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Model Ensemble</title>
        <p>The competition limits the model inference time on the test set to 60 minutes. Since
our model processes the test set within a few minutes, we decided to use the remaining time for
additional experiments. We created an ensemble of our models by averaging their logits, and the ensemble
performed noticeably better than the individual models. Since there was still plenty of time left, we also tested,
for each model, averaging the logits of the given image and its horizontally flipped version, as the flip was not part
of the training augmentations. This doubled our inference time. However, it did not bring any improvement,
so we do not report it.</p>
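        <p>Logit averaging itself is simple; a minimal sketch for one image is given below. The function name and the three made-up logit vectors are hypothetical, and in practice each vector would come from one trained model.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def ensemble_predict(per_model_logits):
    """Average class logits over models; return the mean logits and the top class."""
    n_models = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    mean = [sum(logits[k] for logits in per_model_logits) / n_models
            for k in range(n_classes)]
    best = max(range(n_classes), key=mean.__getitem__)
    return mean, best


# Hypothetical logits of three models for one image with three classes.
logits_a = [2.0, 1.0, 0.1]
logits_b = [0.5, 2.5, 0.2]
logits_c = [1.0, 1.2, 0.3]
mean_logits, prediction = ensemble_predict([logits_a, logits_b, logits_c])
print(mean_logits, prediction)
```
]]></preformat>
        <p>Here the first model alone would predict class 0, while the averaged logits favour class 1. The flip experiment corresponds to adding the logits obtained from the horizontally flipped image to the same list before averaging.</p>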
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work presents our participation in the SnakeCLEF2024 competition. Our approach is based on the
compact Swin-v2 tiny model, which is fast and suitable for running on mobile
devices such as smartphones and tablets. Instead of focusing on conventional methods, we decided
to experiment exclusively with custom loss functions tailored to our specific scenario. In particular,
we focused on the L metric, which is designed to penalise misclassification of snakes based on their
venomousness.</p>
      <p>The results (see Tables 1 and 2) show that the Dual-head approach improved on the
baseline solution, and that these improvements were stable on both the public and the private dataset. The
ClsBoost loss was a viable idea that achieved the best accuracy on both the public and private datasets.
However, since this loss function does not target the custom metrics but rather aims to reduce the effect of the
long-tailed distribution, its accuracy, even though it is tied to the other metrics, was not sufficient to achieve the
best custom metric scores. The CE-VP loss function showed that incorporating the L metric into the loss
function helps. However, there are notable differences between its performance on the public and private test
sets. The SoftT loss function achieved the best results, namely in the M and L metrics on the public dataset
and in the M and F1 metrics on the private dataset. Since this method uses a hyperparameter chosen
empirically, future work could tune this hyperparameter
in an ablation study with the aim of finding the best-performing setting of this method.</p>
      <p>Since the competition allows a maximum inference time of 60 minutes, and our model requires only
a few minutes to infer the entire test set, we decided to create an ensemble of our best models, resulting
in better scores.</p>
      <p>As an extension to our methods, we propose to use location data in the recognition process. By using
GPS information, which is available on handheld devices such as smartphones, the
accuracy of species identification could be improved by taking into account the geographical distribution of different snake
species.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the
Ministry of Education, Youth and Sports of the Czech Republic.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>World Health Organization</collab>
          , Snakebite envenoming,
          <year>2023</year>
          . URL: https://www.who.int/news-room/fact-sheets/detail/snakebite-envenoming.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <article-title>Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Durso</surname>
          </string-name>
          ,
          <article-title>Overview of SnakeCLEF 2024: Revisiting snake species identification in medically important scenarios</article-title>
          ,
          <source>in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] LifeCLEF, SnakeCLEF2024,
          <year>2024</year>
          . URL: https://huggingface.co/spaces/BVRA/SnakeCLEF2024.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chamidullin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Durso</surname>
          </string-name>
          ,
          <article-title>Overview of SnakeCLEF 2023: Snake identification in medically important scenarios</article-title>
          ,
          <source>CLEF</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin Transformer V2: Scaling up capacity and resolution</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>12009</fpage>
          -
          <lpage>12019</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Loy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Seesaw loss for long-tailed instance segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>9695</fpage>
          -
          <lpage>9704</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>