A Deep Learning Method for Visual Recognition
of Snake Species
Rail Chamidullin1 , Milan Šulc1 , Jiří Matas1 and Lukáš Picek2
1
    Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague
2
    Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia


Abstract
The paper presents a method for image-based snake species identification. The proposed method is based on deep residual neural networks – ResNeSt, ResNeXt and ResNet – fine-tuned from ImageNet pre-trained checkpoints. We achieve performance improvements by: discarding predictions of species that do not occur in the country of the query; combining predictions from an ensemble of classifiers; and applying mixed precision training, which allows training neural networks with a larger batch size. We experimented with loss functions inspired by the considered metrics: a soft F1 loss and a weighted cross entropy loss. However, the standard cross entropy loss achieved superior results both in accuracy and in F1 measures. The proposed method scored third in the SnakeCLEF 2021 challenge, achieving 91.6% classification accuracy, a Country F1 Score of 0.860, and an F1 Score of 0.830.

Keywords
Snake Species Identification, Fine-grained Classification, Computer Vision, Convolutional Neural Networks, Deep Learning




1. Introduction
The paper describes a method for automatic image-based snake species identification submitted
by the CMP team to the SnakeCLEF 2021 challenge [1], a part of the LifeCLEF 2021 workshop [2].
The problem of identifying snake species from images is difficult because the classification is
fine-grained, some species look very similar, and up to hundreds of different snake species live
in one country.
   Taxonomic knowledge about snakes is crucial in diagnosis and medical response to snakebites.
Accurate identification of the snake species is important for the appropriate treatment of
snakebite victims since specific antivenoms are effective against specific venomous snakes.
Moreover, antivenoms should not be used to treat bites from non-venomous snakes because of
side effects such as allergic reactions [3]. Snakebites are a global health problem that kills or
disables half a million people a year in developing countries [3].
   This paper is structured as follows: Section 2 describes related work focusing on snake species
identification. Section 3 introduces the input data and evaluation methodology of the SnakeCLEF
2021 challenge. Section 4 describes the adopted deep neural network architectures and the

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" chamirai@fel.cvut.cz (R. Chamidullin); sulcmila@fel.cvut.cz (M. Šulc); matas@fel.cvut.cz (J. Matas)
 0000-0003-1728-8939 (R. Chamidullin); 0000-0002-6321-0131 (M. Šulc); 0000-0003-0863-4844 (J. Matas);
0000-0002-6041-9722 (L. Picek)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
optimization procedure. Section 5 covers all experiments, ranging from preliminary experiments
to the final challenge submissions. Finally, the results are summarized in Section 6.


2. Related Work
Before the existence of large-scale image datasets for snake species classification, Abeysinghe
et al. [4] proposed a one-shot learning approach for fine-tuning a Convolutional Neural Network (CNN) for the task of snake species identification. The authors used a small dataset of 84 snake species, with most species having no more than 3 training images. The authors utilized
a Siamese network [5] that ranks similarity between two inputs: The network is trained by
binary cross entropy minimization to estimate the probability of the query image belonging
to the same class as the reference image. At test time the query image is compared against all
annotated reference images of each class.
   The first year of the SnakeCLEF challenge, held in 2020 [6], introduced a dataset with 287,632
images of 783 snake species taken in 145 countries. Only two teams presented recognition
systems for identifying snake species.
   The best scoring team in SnakeCLEF 2020, gokuleloop [7], fine-tuned ResNet-50-V2 [8] from
ImageNet-1K and ImageNet-21K [9] pre-trained checkpoints, the latter leading to better results.
The author applied the following training techniques:
     • Gradient accumulation – a technique that accumulates gradients from several small
        mini-batches before each weight update, allowing a larger effective mini-batch size (see
        the sketch after this list).
     • Mixup augmentation [10] – an augmentation technique that combines random image
        pairs from the training dataset.
     • Group normalization [11] – unlike batch normalization, GN divides the channels into
        groups and computes the mean and variance within each group.
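To make the first technique concrete, the following is a minimal PyTorch sketch of gradient accumulation; the model, data and step counts are illustrative stand-ins, not the author's implementation.

```python
import torch

# Minimal sketch of gradient accumulation: gradients of several small
# mini-batches are summed in .grad before a single optimizer step, emulating
# a larger effective batch size. Model and data below are illustrative.
model = torch.nn.Linear(512, 783)                # stand-in classifier head
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(8, 512), torch.randint(0, 783, (8,))) for _ in range(8)]
accumulation_steps = 4                           # effective batch size: 4 x 8

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accumulation_steps  # keep gradient scale
    loss.backward()                              # accumulates into .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                         # one update per 4 mini-batches
        optimizer.zero_grad()
```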
   The second team in SnakeCLEF 2020, FHDO_BCSG [12], first detected regions where snakes
occur using a Mask R-CNN [13] object detector, and then classified the snake species in the
regions using EfficientNet [14]. The authors adjusted the output probabilities of EfficientNet
based on the geographic location of the image: The softmax values for each image were
multiplied by the species a priori probability for a given geographic location. To clean the
training dataset from noisy samples, the authors utilized an ImageNet-1K pre-trained ResNet-50
network and discarded images not classified as snake or reptile classes.


3. Challenge Description
3.1. Dataset
The training dataset provided by SnakeCLEF 2021 covers 772 snake species and contains
annotated images from three different sources: iNaturalist, HerpMapper and Flickr. Examples
of images are in Figure 1. The majority of images are from iNaturalist and HerpMapper, with
277,025 and 58,351 images, respectively. Their labels are confirmed by human annotators.
The Flickr dataset is the smallest, with 50,630 web-scraped images that contain noisy data.
Figure 1: Image examples from the SnakeCLEF 2021 dataset. The images are resized and center
cropped to 224 × 224. CC-BY-NC images from iNaturalist: ©srhein, ©jance, ©roig10, ©arturobtzz,
©John Clough, ©William Wimley, ©nobiscuits, ©feistygirl75.


In total, 386,006 images with annotations were provided. Training with external data was not
allowed.
   The challenge organizers suggested a subset of 70,208 images, referred to as the mini-subset
in the rest of this paper, made of samples from iNaturalist and HerpMapper. The experiments
described in Section 5 are based on this subset.
   In addition to the images, the dataset contains metadata with information about the country
where the image was taken. In total, the training dataset includes images from 188 countries.
The dataset is fine-grained with a long-tailed class distribution. More than 22,000 images represent
the most frequent species, while the least frequent species have only 10 images. The least
represented species are often found in regions such as Middle and South America, South Africa
and Australia. Table 1 shows the distribution of images in geographical regions. For some
images, information about the geographical location is missing.
   Furthermore, the challenge organizers provided 28,418 images without annotations. Top-1
species predictions for these test images were submitted to the organizers to participate in the challenge.

3.2. Data Preparation
During the data exploration phase, we discovered that the training and validation datasets
contain noisy data from Flickr. The noisy data are non-relevant images showing various other
animal species or objects. We estimate¹ that the percentage of non-relevant images is 10.6 ± 0.1%,
with a 95% confidence interval. We decided to remove all Flickr images and proceeded with the
verified images from iNaturalist and HerpMapper.
    ¹ We used the Student's t-distribution with $n = 20$ samples and $n - 1$ degrees of freedom, where each sample
denotes the percentage of non-relevant images in a set of 100 randomly selected images.
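As an illustration of the footnote's interval estimate, a short SciPy sketch; the sample values are synthetic stand-ins for the 20 rounds of manual annotation.

```python
import numpy as np
from scipy import stats

# Sketch of the interval estimate: n = 20 samples, each the percentage of
# non-relevant images among 100 randomly drawn Flickr images. The values
# below are synthetic stand-ins for the manual annotations.
rng = np.random.default_rng(0)
samples = rng.normal(loc=10.6, scale=0.3, size=20)

mean = samples.mean()
sem = stats.sem(samples)                       # standard error of the mean
lo, hi = stats.t.interval(0.95, df=samples.size - 1, loc=mean, scale=sem)
print(f"{mean:.1f}% non-relevant, 95% CI [{lo:.1f}%, {hi:.1f}%]")
```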
Table 1
Geographical distribution of SnakeCLEF 2021 images.
                                Region             Number of images
                                North America                 258,732
                                Europe                         18,689
                                Middle America                 17,403
                                Asia                           16,518
                                South America                  12,735
                                Africa                          6,017
                                Australia                       4,313
                                Oceania                           538
                                unknown                        51,061


   The challenge organizers suggested a data split with 90% training and 10% validation samples.
However, after removing the Flickr images, it turned out that some species were not represented
in the proposed validation set. Table 2 displays the number of snake classes represented, i.e. classes
with at least one image, in each dataset source. iNaturalist and HerpMapper combined contain 768
classes, all of which are represented in the training set, but only 733 of them appear in the validation
set. We thus created a new dataset split where every class is represented in both the training and
validation splits whenever more than one image of the species is available. If not, the image is
placed in the training set.
   Technically, the last 10% of images, ordered by metadata ID, of every species and country
combination were selected as the validation data. For combinations with fewer than 10 images,
one validation image was selected. We assume the ID ordering is random w.r.t. image content and
properties.
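A sketch of this split logic in pandas, under assumed column names (id, species, country); the actual implementation may differ in details.

```python
import pandas as pd

# Sketch of the split: within every (species, country) group, ordered by
# metadata ID, the last 10% of images go to validation; groups with fewer
# than 10 images contribute one validation image, and singletons stay in
# training. Column names are assumptions for illustration.
def split_group(group: pd.DataFrame) -> pd.DataFrame:
    group = group.sort_values("id")
    if len(group) < 2:                         # a lone image stays in training
        return group.assign(subset="train")
    n_val = max(1, len(group) // 10)           # last 10%, at least one image
    subset = ["train"] * (len(group) - n_val) + ["val"] * n_val
    return group.assign(subset=subset)

df = pd.DataFrame({"id": range(12),
                   "species": ["A"] * 11 + ["B"],
                   "country": ["US"] * 12})
df = df.groupby(["species", "country"], group_keys=False).apply(split_group)
```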


Table 2
Number of species included in SnakeCLEF 2021 dataset sources: iNaturalist, HerpMapper and Flickr.
The last row represents our new dataset split, after removing all Flickr images due to noisy labels and
sampling a new validation set covering as many species as possible.
     Source                                    Training set   Validation set   Number of images
     iNaturalist                                       762              716               277,025
     HerpMapper                                        603              357                58,351
     Flickr                                            730              585                50,630
     iNaturalist + HerpMapper + Flickr                 772              772               386,006
     iNaturalist + HerpMapper                          768              733               335,376
     Mini-subset (introduced in Section 3.1)           768              763                70,208
     iNaturalist + HerpMapper (new split)              768              765               335,376
3.3. Evaluation Metrics
The challenge used two metrics for the final evaluation. The primary metric is the macro
averaged F1 Score across countries ("Country F1 Score"), shown in Equation 4. The secondary
metric is the macro averaged F1 Score ("F1 Score"), shown in Equation 2.
   The F1 Score for each species $s = 1, 2, \ldots, k$ is computed as the harmonic mean of precision $p_s$
and recall $r_s$:
$$F_{1s} = \frac{2 p_s r_s}{p_s + r_s}. \tag{1}$$
   The macro averaged F1 Score is the average of the $F_{1s}$ scores of all species:
$$\mathrm{macro}(F_1) = \frac{1}{k} \sum_{s=1}^{k} F_{1s}. \tag{2}$$

   The Country F1 Score $\mathrm{CF1}_c$ for each country $c = 1, 2, \ldots, m$ is the macro averaged F1 Score
computed only for the species living in country $c$:
$$\mathrm{CF1}_c = \frac{\sum_{s=1}^{k} F_{1s} A_{cs}}{\sum_{s=1}^{k} A_{cs}}, \tag{3}$$
where $A$ is an $m \times k$ matrix with elements
$$A_{cs} = \begin{cases} 1, & \text{if country } c \text{ is a habitat of species } s, \\ 0, & \text{otherwise.} \end{cases}$$

   Similarly, the macro averaged Country F1 Score is obtained by averaging $\mathrm{CF1}_c$ over all countries:
$$\mathrm{macro}(\mathrm{CF1}) = \frac{1}{m} \sum_{c=1}^{m} \mathrm{CF1}_c. \tag{4}$$

The macro averaged Country F1 Score thus increases the importance of species that appear in
more countries.
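For clarity, the following sketch computes both metrics from per-species F1 scores and a country-species incidence matrix; all inputs are illustrative random stand-ins for real predictions and metadata.

```python
import numpy as np

# Sketch of the challenge metrics. `f1` holds the per-species F1 scores from
# Eq. (1) and `A` the (countries x species) incidence matrix; both inputs
# are illustrative.
rng = np.random.default_rng(0)
k, m = 772, 188                                # species, countries
f1 = rng.random(k)                             # per-species F1 scores
A = (rng.random((m, k)) < 0.05).astype(float)  # A[c, s] = 1 iff s lives in c

macro_f1 = f1.mean()                           # Eq. (2)
cf1 = (A @ f1) / A.sum(axis=1)                 # Eq. (3), one score per country
macro_cf1 = cf1.mean()                         # Eq. (4)
print(f"macro(F1) = {macro_f1:.3f}, macro(CF1) = {macro_cf1:.3f}")
```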


4. Methodology
The proposed method is based on state-of-the-art Convolutional Neural Networks (CNNs)
for image classification, described in Subsection 4.1. The following subsections describe the
optimization procedure, the loss functions, the post-processing of the predictions, mixed precision
training, and implementation details.

4.1. Deep Residual Networks
All experiments are based on deep residual neural networks, namely the original ResNet [15],
the ResNeXt [16], and the recent ResNeSt [17]. The ResNet architecture consists of a stack of
residual blocks – building modules with residual connections that combine input and output by
element-wise addition. The ResNeXt additionally includes a split-transform-merge strategy,
where each block performs a set of transformations with the same topology whose outputs are
aggregated by element-wise addition. For example, a single transformation can be a group of
convolutions. The ResNeSt incorporates a channel-wise attention strategy within each split-
transform-merge block: Each transformation consists of split groups over which the network
calculates the channel-wise split attention weights.
   All networks in our experiments were fine-tuned from ImageNet-1K [18] pre-trained
checkpoints. Residual networks typically [15, 16] use an input size of about 224 × 224; the
pre-trained ResNeSt-101 and ResNeSt-200 are available with larger input sizes of 256 × 256 and
320 × 320, respectively.

4.2. Optimization Procedure
We use two optimization algorithms for training CNN models: stochastic gradient descent with
momentum (SGD) and Adam [19]. Our preliminary experiments showed that the Adam optimizer
converges quickly, but its prediction scores are inferior to those of SGD. The application of
the one cycle schedule policy [20] (one cycle) improved the results when applied with the Adam
optimizer, while applying it with SGD did not work well in our preliminary experiments.
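For illustration, a minimal sketch of Adam with the one cycle policy using PyTorch's scheduler; our training used the fastai framework (Section 4.6), so this is an equivalent PyTorch illustration rather than the exact training code, and the model and loop sizes are illustrative.

```python
import torch

# Sketch of Adam with the one cycle learning rate policy [20]: the LR rises
# to max_lr and then anneals over the full run, stepped once per batch.
model = torch.nn.Linear(512, 772)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=25, steps_per_epoch=100)

for epoch in range(25):
    for step in range(100):
        x = torch.randn(8, 512)
        y = torch.randint(0, 772, (8,))
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                       # LR follows the one cycle shape
```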
   The training hyper-parameters, such as learning rate, momentum and weight decay, are listed
in Table 3 and were set the same as in the network pre-training. Batch sizes were adjusted to fit
the network on the graphics processing unit (GPU). The input image size stays the same as in
the pre-trained networks.
   During the training, we select the best checkpoint based on the highest validation Country
F1 Score.

Table 3
The hyper-parameter setting used for training the challenge submissions.
             Network         ResNeSt-101     ResNeSt-200    ResNeXt-101     ResNet-101
             Optimizer           SGD             SGD           Adam           Adam
             LR Scheduler          -               -         one cycle       one cycle
             Learning Rate        0.1             0.1           0.01           0.01
             Weight Decay       0.0001          0.0001          0.01           0.01
             Batch Size           128             64            128            128



4.3. Country-specific Removal of Predictions
For each image, the dataset metadata include the country where the image was taken. Additionally,
the dataset comes with a list of countries and the snake species that live there. We
utilize this information to adjust the model predictions to the country of the query as follows:
The classifier predictions are set to 0 for all species that do not live in the country of the query.
This adjustment is applied only at test time.
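A minimal sketch of this filtering step; the incidence matrix and all sizes are illustrative.

```python
import torch

# Sketch of the country-specific removal at test time: softmax scores of
# species that do not occur in the query's country are set to 0 before the
# arg-max. The incidence matrix `habitat` is an illustrative stand-in.
k, m = 772, 188                                # species, countries
habitat = torch.randint(0, 2, (m, k)).float()  # habitat[c, s] = 1 iff s in c
logits = torch.randn(1, k)                     # classifier output, one image
country = 42                                   # country of the query image

probs = logits.softmax(dim=1) * habitat[country]  # zero out absent species
prediction = probs.argmax(dim=1)               # top-1 species for submission
```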
4.4. Mixed Precision Training
When training large CNN architectures, fitting the model into limited GPU memory is a
bottleneck. We considered the following workarounds: selecting a smaller batch size or applying
mixed precision training [21]. Both approaches have an accuracy trade-off.
   Mixed precision training is a technique that combines single-precision ("FP32", 32-bit) and
half-precision ("FP16", 16-bit) floating point numbers. In order to lower the memory requirements,
the forward and backward pass with the large batch size only use a half-precision version
of the model. Then, the gradient descent update is applied to the single-precision version of the
model. In every training step, the following procedure is applied (a sketch follows the list):
    1. Apply the forward pass, compute the loss and apply backward pass on a model in FP16.
    2. Convert the gradients from FP16 to FP32.
    3. Apply the update on the primary model in FP32.
    4. Create a copy of the primary model in FP16.
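A minimal sketch of this procedure with PyTorch's torch.cuda.amp utilities; our implementation builds on fastai and may differ in details, the GradScaler adds loss scaling against FP16 gradient underflow (which the list above omits), and a CUDA device is assumed.

```python
import torch

# Sketch of mixed precision training: FP32 "primary" weights are kept by the
# optimizer, while the forward and backward passes run in FP16 regions.
model = torch.nn.Linear(512, 772).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 512, device="cuda")
y = torch.randint(0, 772, (64,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                # step 1: forward pass in FP16
    loss = criterion(model(x), y)
scaler.scale(loss).backward()                  # step 1: backward pass in FP16
scaler.step(optimizer)                         # steps 2-3: FP32 weight update
scaler.update()                                # adapt the loss scale factor
```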

4.5. Loss Functions
The baseline loss function for training the classifiers is the standard cross entropy loss:
$$\ell_{\mathrm{ce}} = -\sum_{i=1}^{n} \log y_{i,t_i}, \tag{5}$$
where $t_i$ is the ground truth target and $\mathbf{y}_i$ are the classifier predictions for the $i$-th example,
and $y_{i,t_i}$ is the prediction for the ground truth class of the $i$-th example.
   The following subsections describe loss functions that use the challenge metrics, described in
Section 3.3, as the training objective.

4.5.1. F1 Loss with Soft Assignments
The F1 Score from Equation 2 is not differentiable and thus cannot be utilized as a loss function
for back-propagation. We use an approximation of the F1 Score, referenced as the soft F1 loss in
the rest of this paper, which uses soft assignments that make the function differentiable:

    • the true positives for species $s$ are estimated from the softmax predictions $\mathbf{y}$ and the
      one-hot encoded target vectors $\mathbf{t}$ as $\widehat{\mathrm{TP}}_s = \sum_{i=1}^{n} y_{i,s}\, t_{i,s}$,
    • the false positives for species $s$ are estimated as $\widehat{\mathrm{FP}}_s = \sum_{i=1}^{n} y_{i,s}\, (1 - t_{i,s})$,
    • the false negatives for species $s$ are estimated as $\widehat{\mathrm{FN}}_s = \sum_{i=1}^{n} (1 - y_{i,s})\, t_{i,s}$.

Notice that $\widehat{\mathrm{TP}}$, $\widehat{\mathrm{FP}}$, and $\widehat{\mathrm{FN}}$ are now real valued. The soft F1 Score for species $s$, $\widehat{F}_{1s}$, is obtained
by computing the harmonic mean of precision $\widehat{p}_s$ and recall $\widehat{r}_s$:
$$\widehat{p}_s = \frac{\widehat{\mathrm{TP}}_s}{\widehat{\mathrm{TP}}_s + \widehat{\mathrm{FP}}_s}, \qquad \widehat{r}_s = \frac{\widehat{\mathrm{TP}}_s}{\widehat{\mathrm{TP}}_s + \widehat{\mathrm{FN}}_s}, \qquad \widehat{F}_{1s} = \frac{2 \widehat{p}_s \widehat{r}_s}{\widehat{p}_s + \widehat{r}_s}. \tag{6}$$
   The macro averaged soft F1 Score is obtained by averaging $\widehat{F}_{1s}$ over all species:
$$\mathrm{macro}(\widehat{F}_1) = \frac{1}{k} \sum_{s=1}^{k} \widehat{F}_{1s}. \tag{7}$$
   The final loss function is $\ell_{\widehat{F}_1} = 1 - \mathrm{macro}(\widehat{F}_1)$, so that it ranges from 0 (perfect) to 1 (worst).
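A sketch of the soft F1 loss in PyTorch, following Equations 6 and 7; the tensor shapes and the small eps stabilizer against empty classes in a mini-batch are our assumptions.

```python
import torch

# Sketch of the soft F1 loss: `probs` are softmax outputs of shape (n, k),
# `targets` are integer class labels of shape (n,).
def soft_f1_loss(probs: torch.Tensor, targets: torch.Tensor,
                 eps: float = 1e-8) -> torch.Tensor:
    t = torch.nn.functional.one_hot(targets, probs.shape[1]).float()
    tp = (probs * t).sum(dim=0)                # soft true positives per class
    fp = (probs * (1 - t)).sum(dim=0)          # soft false positives
    fn = ((1 - probs) * t).sum(dim=0)          # soft false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return 1 - f1.mean()                       # Eq. (7): 0 perfect, 1 worst

probs = torch.randn(64, 772).softmax(dim=1)
targets = torch.randint(0, 772, (64,))
loss = soft_f1_loss(probs, targets)
```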

4.5.2. Weighted Cross Entropy
Because the macro averaged Country F1 Score from Equation 4 increases the importance of
species appearing in more countries, we propose a weighted variant of the cross entropy loss
with species weights $w_s$ based on the number of countries in which the species appears:
$$\ell_{\mathrm{wce}} = -\sum_{i=1}^{n} w_{t_i} \log y_{i,t_i}. \tag{8}$$
   The Maximum Likelihood Estimation (MLE) of $w_s$ would simply count the relative frequencies
$f_s$ in the provided species-country incidence list. In order to avoid zero weights, we add Laplace
smoothing:
$$w_s = \frac{f_s + 1}{\sum_{j=1}^{k} (f_j + 1)}. \tag{9}$$
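A sketch of the weight computation and its use in PyTorch's cross entropy loss; the country counts are illustrative, and reduction="sum" is chosen to match the sum over examples in Equation 8.

```python
import torch

# Sketch of the species weights from Eq. (9): relative frequencies from the
# species-country incidence list with Laplace smoothing, passed to PyTorch's
# cross entropy via its `weight` argument.
counts = torch.randint(1, 50, (772,)).float()  # countries per species (illustrative)
f = counts / counts.sum()                      # relative frequencies f_s
weights = (f + 1) / (f + 1).sum()              # Eq. (9), Laplace smoothed
criterion = torch.nn.CrossEntropyLoss(weight=weights, reduction="sum")  # Eq. (8)
```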


4.6. Implementation Details
The proposed method was developed using the PyTorch [22] machine learning framework and
the fastai library [23] built on top of PyTorch. The code is available online². All models
were fine-tuned from ImageNet-1K [18] pre-trained PyTorch Image Models [24] on a single NVIDIA
Tesla V100 with 32 GB of graphics memory.
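For illustration, a model can be created from an ImageNet-1K pre-trained checkpoint as follows; the timm model identifier is an assumption, not necessarily the exact checkpoint name used for the submissions.

```python
import timm
import torch

# Sketch of loading a PyTorch Image Models [24] checkpoint and replacing the
# classification head for the 772 snake species. The model name "resnest101e"
# is an assumed example.
model = timm.create_model("resnest101e", pretrained=True, num_classes=772)
x = torch.randn(1, 3, 256, 256)                # ResNeSt-101 input size
logits = model(x)                              # shape (1, 772)
```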


5. Experiments
5.1. Comparison of Residual Networks
Table 4 shows the classification scores of the residual networks ResNet, ResNeXt and ResNeSt with
50 and 101 layers. All networks are fine-tuned for 30 epochs on images of size 224 × 224,
minimizing the cross entropy loss using SGD with momentum. One ResNeSt-101 version is
fine-tuned on larger images of size 256 × 256 to match the input size of its ImageNet pre-trained
checkpoint. Both ResNeSt versions, ResNeSt-50 and ResNeSt-101, achieve higher scores than the
corresponding ResNet and ResNeXt architectures.
   2
       https://github.com/chamidullinr/snake-species-identification
Table 4
Classification scores of residual networks fine-tuned for 30 epochs on the mini-subset from Section 3.1.
The results are computed on our validation set.
                Architecture     Input Size    Accuracy    F1 Score    Country F1 Score
                ResNet-50           224          44.0%       0.331           0.300
                ResNeXt-50          224          47.2%       0.352           0.333
                ResNeSt-50          224          53.8%       0.447           0.409
                ResNet-101          224          42.4%       0.290           0.273
                ResNeXt-101         224          50.5%       0.428           0.396
                ResNeSt-101         224          56.7%       0.475           0.432
                ResNeSt-101         256          58.8%       0.500           0.455


5.2. Results of Mixed Precision Training
As shown in the previous subsection, ResNeSt-101 with a larger input size achieves the highest
scores among the evaluated residual networks. Since its deeper version, ResNeSt-200, does not fit
into our GPU memory with larger batch sizes, we experiment with the mixed precision training
from Section 4.4.
   Table 5 compares the training time and accuracy of ResNeSt-101 and ResNeSt-200 when
training with and without the mixed precision technique. Note that in our computational
environment, mixed precision runs slower than single precision. The prediction scores after
10 epochs show that mixed precision has little impact on prediction accuracy in setups with
the same architecture and batch size. Increasing the batch size from 32 to 64 has a much larger
impact on the accuracy. Thus, the network trained with a larger batch size and mixed precision
achieves better scores than the single-precision network with a smaller batch size.


Table 5
Classification scores and training times when fine-tuning for 10 epochs with and without the mixed
precision technique. Cells with "×" denote setups for which the network did not fit into the 32GB GPU
memory. The networks are fine-tuned on the mini-subset from Section 3.1 and the results are computed
on our validation set.
   Architecture     BS    Precision type      Accuracy    F1 Score    Country F1 Score    Epoch time
   ResNeSt-101     128         Mixed           48.0%       0.387           0.355            14 min
   ResNeSt-101     128         Single          47.6%       0.385           0.348            10 min
   ResNeSt-200     128         Mixed             ×           ×               ×                ×
   ResNeSt-200     128         Single            ×           ×               ×                ×
   ResNeSt-200     64          Mixed           52.7%       0.424           0.398            40 min
   ResNeSt-200     64          Single            ×           ×               ×                ×
   ResNeSt-200     32          Mixed           46.9%       0.376           0.345            41 min
   ResNeSt-200     32          Single          46.9%       0.371           0.345            28 min
Table 6
Classification scores on ResNeSt-101 with different loss functions. Standard cross entropy achieves
superior results. The networks are fine-tuned on the mini-subset from Section 3.1 and the results are
computed on our validation set.
              Loss Function                     Accuracy    F1 Score    Country F1 Score
              Cross Entropy Loss                 58.8%        0.500           0.455
              Weighted Cross Entropy Loss        48.4%        0.349           0.385
              F1 Loss                             0.2%        0.001           0.000


5.3. Evaluation of Different Loss Functions
The loss functions introduced in Section 4.5, namely the soft F1 loss and the weighted cross
entropy loss, resulted in inferior classification scores compared to the cross entropy loss, see Table 6.
We therefore fine-tune the CNN classifiers with the cross entropy loss and then choose the best
training checkpoint based on the highest validation Country F1 Score.
   One possible explanation for the failure of the soft F1 loss is that the batch size of 64 is
significantly smaller than the total number of classes, 772. This leads to many classes not being
represented in every mini-batch, making the mini-batch approximation of the F1 loss inaccurate.
Figure 2 illustrates this inaccurate approximation on an example, where the loss values are
mostly 0s or 1s.

5.4. Evaluation of Country-specific Removal of Predictions
We measure the prediction scores of ResNeSt-200 with and without the removal of species
predictions based on the country incidence information. Table 7 compares the prediction scores
on our validation set. The improvement is 0.150 in F1 Score and 0.193 in Country F1 Score.




Figure 2: F1 Scores for Acrochordus granulatus across all training iterations in one epoch. The example
illustrates the inaccurate approximation of the F1 loss: the loss rarely takes values other than 0 and 1.
Table 7
Comparing classification scores of ResNeSt-200 with and without the removal of species predictions
based on the country incidence information. The networks are fine-tuned on the mini-subset from
Section 3.1 and the results are computed on our validation set.
         Country-specific removal of predictions      Accuracy   F1 Score    Country F1 Score
         No                                            74.8%        0.483          0.504
         Yes                                           79.0%        0.633          0.697


5.5. Challenge Submissions
We submitted the following five runs to the SnakeCLEF 2021 challenge:
CMP_S1: ResNeSt-200 fine-tuned for 20 epochs on the full dataset with SGD.
CMP_S2: ResNeSt-200 from CMP_S1 fine-tuned for an additional 10 epochs on the full dataset
        with SGD.
CMP_S3: ResNet-101 fine-tuned for 25 epochs on the full dataset with Adam and one cycle.
CMP_S4: ResNeXt-101 fine-tuned for 30 epochs on the mini-subset from Section 3.1 with Adam
        and one cycle.
CMP_S5: An ensemble of all four previous runs, combining the top-1 predictions by a majority
        voting strategy; in case of ties, the predictions of CMP_S1 are preferred (see the sketch
        after this list).
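A minimal sketch of the majority voting with the CMP_S1 tie-break, for a single test image:

```python
from collections import Counter

# Sketch of the CMP_S5 majority voting. The input list holds the top-1 class
# predictions of CMP_S1..CMP_S4, in that order, so scanning the list in order
# breaks ties in favour of CMP_S1.
def majority_vote(predictions: list) -> int:
    counts = Counter(predictions)
    best = max(counts.values())
    for pred in predictions:                   # ordered CMP_S1 .. CMP_S4
        if counts[pred] == best:
            return pred

print(majority_vote([3, 7, 7, 3]))             # tie -> 3 (CMP_S1 preferred)
```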
  Table 8 shows the final challenge scores on the test set. While different in accuracy, the CNN
architectures ResNeSt-200, ResNeXt-101 and ResNet-101 achieve similar results in the primary
challenge metric, the Country F1 Score. The highest scores are achieved by the ensemble.
   We recognize a shortcoming of the ensemble submission (CMP_S5): it is biased towards the
two related ResNeSt-200 submissions, as CMP_S2 is fine-tuned from CMP_S1 and the remaining
networks cannot outvote an agreement of CMP_S1 and CMP_S2.


Table 8
Classification scores of the submitted challenge runs on the SnakeCLEF 2021 challenge test set. The net-
works are fine-tuned either on the full dataset (Full) or on the mini-subset (Mini) from Section 3.1. The
CNN architectures ResNeSt-200, ResNeXt-101 and ResNet-101 achieve similar results in the Country
F1 Score. The highest scores are achieved by the ensemble of all networks.
    Submission     Architecture                Dataset    Accuracy     F1 Score   Country F1 Score
    CMP_S1         ResNeSt-200                 Full         90.6%       0.772           0.839
    CMP_S2         ResNeSt-200                 Full         89.5%       0.779           0.819
    CMP_S3         ResNet-101                  Full         90.7%       0.795           0.837
    CMP_S4         ResNeXt-101                 Mini         77.6%       0.796           0.839
    CMP_S5         Ensemble of CMP_S1-S4       -            91.6%       0.830           0.860
6. Conclusions
The paper presents a deep learning method for image-based snake species identification, a
fine-grained classification problem with a long-tailed class distribution. The method is based on
deep residual neural networks – ResNeSt, ResNeXt and ResNet – fine-tuned from ImageNet
pre-trained checkpoints. We achieve performance improvements by: discarding predictions of
species that do not occur in the country of the query; combining predictions from an ensemble
of classifiers; and applying mixed precision training, which allows training neural networks
with a larger batch size.
   The evaluated soft F1 loss and weighted cross entropy loss produced inferior results compared
to standard cross entropy minimization. Thus, the competition submissions were fine-tuned with
the standard cross entropy loss.
   The proposed method scored third in the SnakeCLEF 2021 challenge, achieving 91.6%
classification accuracy, a Country F1 Score of 0.860, and an F1 Score of 0.830.


Acknowledgments
This research was supported by the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765.
LP was supported by the UWB grant, project No. SGS-2019-027.


References
 [1] L. Picek, A. M. Durso, R. Ruiz De Castañeda, I. Bolon, Overview of SnakeCLEF 2021:
     Automatic Snake Species Identification with Country-Level Focus, in: Working Notes of
     CLEF 2021 - Conference and Labs of the Evaluation Forum, 2021.
 [2] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, R. Ruiz De
     Castañeda, I. Bolon, H. Glotin, R. Planqué, W.-P. Vellinga, A. Durso, P. Bonnet, I. Eggel,
     H. Müller, Overview of LifeCLEF 2021: a System-oriented Evaluation of Automated
     Species Identification and Species Distribution Prediction, in: Proceedings of the Twelfth
     International Conference of the CLEF Association (CLEF 2021), 2021.
 [3] I. Bolon, A. M. Durso, S. Botero Mesa, N. Ray, G. Alcoba, F. Chappuis, R. Ruiz de Castañeda,
     Identifying the snake: First scoping review on practices of communities and healthcare
     providers confronted with snakebite across the world, PLOS ONE 15 (2020). URL:
     https://doi.org/10.1371/journal.pone.0229989. doi:10.1371/journal.pone.0229989.
 [4] C. Abeysinghe, A. Welivita, I. Perera, Snake Image Classification Using Siamese Networks,
     in: Proceedings of the 2019 3rd International Conference on Graphics and Signal Processing,
     2019. URL: https://doi.org/10.1145/3338472.3338476.
 [5] G. Koch, R. Zemel, R. Salakhutdinov, Siamese Neural Networks for One-Shot Image
     Recognition, in: ICML Deep Learning Workshop, 2015.
 [6] L. Picek, I. Bolon, A. M. Durso, R. Ruiz De Castañeda, Overview of the SnakeCLEF 2020:
     Automatic Snake Species Identification Challenge, in: CLEF task overview 2020, CLEF:
     Conference and Labs of the Evaluation Forum, 2020.
 [7] G. K. Moorthy, Impact of Pretrained Networks For Snake Species Classification, in: CLEF
     working notes 2020, CLEF: Conference and Labs of the Evaluation Forum, 2020.
 [8] K. He, X. Zhang, S. Ren, J. Sun, Identity Mappings in Deep Residual Networks, in:
     Computer Vision – ECCV 2016, 2016.
 [9] T. Ridnik, E. Ben-Baruch, A. Noy, L. Zelnik-Manor, ImageNet-21K Pretraining for the
     Masses, 2021. arXiv:2104.10972.
[10] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond Empirical Risk
     Minimization, in: International Conference on Learning Representations, 2018. URL:
     https://openreview.net/forum?id=r1Ddp1-Rb.
[11] Y. Wu, K. He, Group Normalization, in: Computer Vision – ECCV 2018, 2018.
[12] L. Bloch, A. Boketta, C. Keibel, E. Mense, A. Michailutschenko, O. Pelka, J. Rückert,
     L. Willemeit, C. M. Friedrich, Combination of image and location information for snake
     species identification using object detection and EfficientNets, in: CLEF working notes
     2020, CLEF: Conference and Labs of the Evaluation Forum, 2020.
[13] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: 2017 IEEE International
     Conference on Computer Vision (ICCV), 2017. doi:10.1109/ICCV.2017.322.
[14] M. Tan, Q. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,
     in: Proceedings of the 36th International Conference on Machine Learning, 2019. URL:
     http://proceedings.mlr.press/v97/tan19a.html.
[15] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in:
     Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
     2016.
[16] S. Xie, R. Girshick, P. Dollar, Z. Tu, K. He, Aggregated Residual Transformations for Deep
     Neural Networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern
     Recognition (CVPR), 2017.
[17] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha,
     M. Li, A. Smola, ResNeSt: Split-Attention Networks, 2020. arXiv:2004.08955.
[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical
     image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition,
     2009.
[19] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, CoRR abs/1412.6980
     (2015).
[20] L. N. Smith, N. Topin, Super-Convergence: Very Fast Training of Neural Networks Using
     Large Learning Rates, 2018. arXiv:1708.07120.
[21] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston,
     O. Kuchaiev, G. Venkatesh, H. Wu, Mixed Precision Training, in: International Conference
     on Learning Representations, 2018. URL: https://openreview.net/forum?id=r1gs9JgRZ.
[22] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
     N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
     S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An Imperative Style,
     High-Performance Deep Learning Library, in: H. Wallach, H. Larochelle, A. Beygelzimer,
     F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Sys-
     tems 32, Curran Associates, Inc., 2019, pp. 8024–8035. URL: http://papers.neurips.cc/paper/
     9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[23] J. Howard, S. Gugger, Fastai: A Layered API for Deep Learning, Information 11 (2020) 108.
     URL: http://dx.doi.org/10.3390/info11020108. doi:10.3390/info11020108.
[24] R. Wightman, PyTorch Image Models, https://github.com/rwightman/pytorch-image-models,
     2019. doi:10.5281/zenodo.4414861, visited on 2021-06-28.