=Paper=
{{Paper
|id=Vol-2696/paper_194
|storemode=property
|title=Impact of Pretrained Networks For Snake Species Classification
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_194.pdf
|volume=Vol-2696
|authors=Moorthy Gokula Krishnan
|dblpUrl=https://dblp.org/rec/conf/clef/Krishnan20
}}
==Impact of Pretrained Networks For Snake Species Classification==
Moorthy Gokula Krishnan
Eloop Mobility Solutions, India
gokul@eloop.ai

Abstract. A robust snake species classifier could aid in the treatment of snake bites. In this report, the technique of transfer learning is revisited to understand the significance of the underlying pre-trained network and the supervised datasets used for pre-training. In the low-data regime, transfer learning has been instrumental in building reliable image classifiers. Comparisons are made between pre-trained networks trained on datasets of different sizes and numbers of classes. Performance improves significantly when the pre-trained network is trained on a much larger supervised dataset, and using country metadata improves performance considerably further. In the SnakeCLEF2020 challenge, an F1-score of 0.625 was achieved.

Keywords: Snake Species Classification, Computer Vision, Transfer Learning, Convolutional Neural Networks

1 Introduction

Snakebite is the second most deadly neglected tropical disease [1], responsible for a dramatic humanitarian crisis in global health. Snakebite envenoming (SBE) affects as many as 2.7 million people every year [3], most of whom live in some of the world's remote, poorly developed, and politically marginalized tropical communities. With annual mortality of 81,000 to 138,000 and 400,000 surviving victims left with permanent physical and psychological disabilities, SBE is a disease in urgent need of attention. Antivenoms can be life-saving when correctly administered, but this often depends on the correct taxonomic identification (i.e. family, genus, and species) of the biting snake. Yet in nearly 50% of cases globally the snake is never identified [4]. An automated system that suggests an identification to the healthcare provider from a low-quality photo could speed up treatment. The participants of SnakeCLEF2020 [10] were challenged to build an accurate snake species classifier that works under diverse conditions.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

2 Dataset

With the goal of developing biodiversity monitoring systems, the LifeCLEF [6] evaluation campaign benchmarks the yearly progress in the identification of plants and animals. The SnakeCLEF challenge was introduced in 2020 to benchmark progress in building a snake species classifier. In this challenge, 245,185 training images are provided, split into 783 species. As shown in Figure 1, several aspects of snake morphology make this task challenging for computer vision. Evaluation is done using the F1-score, which demands good precision and recall across all species. The trained model is used to infer labels on test images that are hidden from participants, directly on the AICrowd platform (https://www.aicrowd.com/challenges/snake-species-identification-challenge). The dataset is extremely imbalanced, as indicated in Figure 2: the smallest class has 17 images while the largest contains 12,201. Additional geographical metadata (country and continent) is provided for each image. All ablation studies were done locally with the given validation set comprising 14,029 images.

Fig. 1. Snake diversity [2]
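Because the metric is averaged over all 783 species, rare classes count as much as frequent ones. A minimal sketch of a macro-averaged F1 computation, assuming scikit-learn is available; the organizers' exact averaging scheme is not restated here, so this is illustrative only:

```python
from sklearn.metrics import f1_score

# Hypothetical predictions over a handful of images; in the challenge there
# are 783 species, so every rare class weighs as much as a common one.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

# Macro averaging computes F1 per class and takes the unweighted mean, so a
# model cannot score well by fitting only the largest classes.
print(f1_score(y_true, y_pred, average="macro"))
```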
3 Related Work

With the renaissance of deep learning for image classification since 2012 [8], deep convolutional neural networks have become the standard for building state-of-the-art image classifiers that work well under diverse conditions, provided a large supervised dataset is available. In certain domain-specific cases, such a large-scale dataset comprising millions of images may not exist: the images might not be readily available, might be geographically constrained, or might be rare. In such cases, the technique of transfer learning is used: the network is first trained on a different data distribution containing millions of images and later fine-tuned to domain-specific tasks such as snake species classification.

Fig. 2. Data distribution

The dataset used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [12], which comprises 1.4 million images categorized into 1,000 classes (ImageNet-1k), is often used to benchmark the performance of image classifiers, and the availability of pre-trained models motivates the computer vision community to reuse the representations learned from ImageNet-1k. However, ImageNet-1k is a small subset of a much larger dataset containing 14.2 million images categorized into 21,841 classes (ImageNet-21k). Recently, an extensive study [7] examined the impact of these learned representations on the downstream task (fine-tuning to a domain-specific image classifier). Contrary to popular belief, larger models trained on larger datasets do not always perform better on the downstream task; the size of the domain-specific dataset plays a crucial role in determining the training strategy and the size of the model. In the context of the SnakeCLEF2020 challenge, experiments were carried out to understand the differences between models pre-trained on these two datasets (ImageNet-1k and ImageNet-21k).

4 Implementation Details

4.1 Pretrained Classifiers

A vanilla ResNet50-v2 [5] classifier is used for experimentation. Open-source models trained on ImageNet-1k and on ImageNet-21k were used. Both pre-trained classifiers were trained under the same conditions, i.e. with the same hyperparameters, image resolution, and augmentations; only the fully connected (FC) layer differs, depending on the labels specific to each dataset. During fine-tuning, the FC layer is replaced with a randomly initialized domain-specific FC layer. The following strategies were adopted during training:

- Trained for 10,000 steps.
- Batch size of 512 per step.
- Mixup augmentation.
- Staircase-based learning rate scheduler.
- Optimizer: stochastic gradient descent with momentum 0.9.
- Cross-entropy loss.

4.2 Training Techniques

Preprocessing. The given images are of varied sizes. During training, images are first resized to 512x512x3 using bilinear interpolation and a random crop of 456x456x3 is taken; images are also horizontally flipped with a probability of 0.5. During validation and testing, images are only resized to 456x456x3 using bilinear interpolation. In all three phases, images are normalized with a mean of 0.5 and a standard deviation of 0.5. A sketch of this pipeline is given below.
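The preprocessing just described can be expressed compactly. A minimal sketch assuming PyTorch/torchvision; the actual training code (linked in Section 6) may differ:

```python
from torchvision import transforms

# Training: resize to 512x512 (bilinear is torchvision's default), take a
# random 456x456 crop, flip horizontally half the time, then normalize with
# per-channel mean 0.5 and std 0.5.
train_tfms = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.RandomCrop(456),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Validation/testing: deterministic resize straight to 456x456.
eval_tfms = transforms.Compose([
    transforms.Resize((456, 456)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```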
Batch Accumulation. Training can be very inefficient if the mini-batch size is small, due to noisy gradients. To fit a large effective mini-batch into GPU memory, batch accumulation is used: gradients are accumulated over 16 steps without updating the model, and only then is the update applied. Although each mini-batch holds 32 images, the effective mini-batch size is 512.

Learning Rate. The learning rate is the crucial hyperparameter for fine-tuning. After a linear warmup, a staircase-based learning rate scheduler was used, following the hyperrule provided by [7], with a base learning rate lr_b = 0.03:

- Steps 0-500: linear warmup, lr = lr_b * i/500 for i = 0, 1, 2, ..., 500.
- Steps 500-10,000: lr_b decayed by a factor of 10 at steps 3,000, 6,000, and 9,000.

Normalization. Group normalization [14] together with weight standardization [11] was used; the accuracy of group normalization is stable across a wide range of batch sizes. It is worth noting that other common techniques such as weight decay and dropout were not used.

Augmentation. Deep neural networks are prone to undesirable behaviors such as memorization, sensitivity to adversarial examples, and sampling bias. To combat overfitting, mixup augmentation [15] was used. Mixup trains the network on convex combinations of pairs of examples and their labels, controlled by a factor α. A sample training image with α = 0.1 is shown in Figure 3. This encourages the network to discriminate better between classes. A sketch combining accumulation, the learning rate schedule, and mixup is shown after the figure.

Fig. 3. Augmentation Technique
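The three techniques above combine into a single training step. The following PyTorch sketch uses hypothetical names `model`, `optimizer` (SGD with momentum 0.9), and `loader` (yielding 32-image mini-batches) as stand-ins for the actual setup; it illustrates the described recipe rather than reproducing the author's code:

```python
import numpy as np
import torch
import torch.nn.functional as F

BASE_LR, WARMUP, TOTAL, ACCUM, ALPHA = 0.03, 500, 10_000, 16, 0.1

def lr_at(step):
    """Linear warmup for 500 steps, then decay by 10x at 3k/6k/9k steps."""
    if step < WARMUP:
        return BASE_LR * step / WARMUP
    return BASE_LR / 10 ** sum(step >= m for m in (3_000, 6_000, 9_000))

step, micro = 0, 0
while step < TOTAL:
    for images, labels in loader:
        # Mixup: blend each image (and its loss) with a shuffled partner.
        lam = np.random.beta(ALPHA, ALPHA)
        perm = torch.randperm(images.size(0))
        logits = model(lam * images + (1 - lam) * images[perm])
        loss = lam * F.cross_entropy(logits, labels) \
             + (1 - lam) * F.cross_entropy(logits, labels[perm])

        # Accumulate gradients over 16 micro-batches of 32 images, so one
        # optimizer step sees an effective batch of 512.
        (loss / ACCUM).backward()
        micro += 1
        if micro % ACCUM == 0:
            for group in optimizer.param_groups:
                group["lr"] = lr_at(step)
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step >= TOTAL:
                break
```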
5 Experiments and Analysis

Fine-tuning is done on two open-source models of the ResNet50-v2 architecture, pre-trained on ImageNet-1k (Model-A) and ImageNet-21k (Model-B) respectively, under the same training conditions. The results are summarized in Table 1.

Metric           ImageNet-1k   ImageNet-21k   Improvement
Top-1 Accuracy   68.48%        79.57%         +11.09%
Top-5 Accuracy   87.70%        93.58%         +5.88%
F1-score         0.27          0.5813         +0.3113

Table 1. Performance comparison of the pre-trained networks

The network pre-trained on ImageNet-21k significantly outperforms its ImageNet-1k counterpart. In particular, classes with fewer data points are discriminated better, as reflected by the large improvement in the F1-score. Model-A disagrees with Model-B on 1,989 images where Model-B is correct, while Model-B disagrees with Model-A on 433 images where Model-A is correct. Further insight can be gained by analyzing the images with the highest discrepancies, i.e. images for which Model-B is correct while Model-A assigns a very small probability to the correct species. An attribution technique [13] was applied to understand which pixels (features) the model considers important: a single gradient step with respect to the target class was computed for each image, and saliency maps were generated for the top three images on which the models disagree the most. As shown in Figure 4, when pixels are ranked by their gradients, the saliency maps generated from Model-B are much more concentrated on the area of interest, indicating better generalization.

Fig. 4. Generated saliency maps. The first column shows the input images; the second and third columns show the saliency maps from Model-A and Model-B respectively.

6 Metadata Usage

Several snake species are constrained to particular geographical locations. Metadata about where each image was taken is given in the form of country and continent. The distribution of images per country follows a long tail and is concentrated mostly in the United States of America (61.42%); images were taken in 187 countries. Where this information is absent, the image is marked "UNKNOWN". The probability of a species given a country is precomputed from the training dataset distribution as follows:

p(s|c) = t_sc / t_c    (1)

where p(s|c) is the probability of species s given country c, t_sc is the total number of images of species s found in country c, and t_c is the total number of images found in country c. The generated p(s|c) table is provided along with the source code used for training at https://github.com/GokulEpiphany/snakes-round-4-train/tree/master/metadata. It is worth noting that "UNKNOWN" is treated as a country for the purposes of this pre-computation. The final probabilities are then adjusted as follows:

p_s = p_φ(s) * p(s|c)    (2)

where p_s is the probability of the species given the image and the country, and p_φ(s) is the probability of the species inferred from the model. The probabilities are normalized to sum to 1. Using this technique on Model-B, the score improved from 0.5813 to 0.6019. The test dataset on which the final scores were computed follows a data distribution similar to the validation dataset, and the model achieves an F1-score of 0.625 when tested on the AICrowd platform. A sketch of this re-weighting is given below.
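A minimal NumPy sketch of the prior computation (1) and re-weighting (2); the function names `species_given_country` and `reweight` are illustrative and not taken from the released code:

```python
import numpy as np

def species_given_country(species_ids, country_ids, n_species, n_countries):
    """Precompute p(s|c) = t_sc / t_c from the training set.

    `species_ids` and `country_ids` are parallel integer arrays, one entry per
    training image; images without metadata get their own "UNKNOWN" country
    index, treated exactly like any other country. Every country index is
    assumed to occur at least once, so t_c > 0.
    """
    counts = np.zeros((n_countries, n_species))
    np.add.at(counts, (country_ids, species_ids), 1)   # t_sc
    return counts / counts.sum(axis=1, keepdims=True)  # divide each row by t_c

def reweight(model_probs, country_id, prior):
    """Adjust the model's softmax output by the country prior, eq. (2)."""
    adjusted = model_probs * prior[country_id]         # p_φ(s) * p(s|c)
    return adjusted / adjusted.sum()                   # renormalize to sum to 1
```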
7 Conclusion and Future Work

Although the models were trained with the same hyperparameters, Model-B performs better than Model-A. These results signify the importance of having generalized visual representations before fine-tuning on a domain-specific dataset. Label smoothing [9] could further improve performance and handle the noisy images found in the dataset. Bigger models and stronger augmentations such as rotation and jittering could make the model more resilient.

8 Acknowledgements

The compute resources required to perform the experiments were provided by hostkey.com (https://www.hostkey.com/).

References

1. First medical decision support tool for snake identification based on artificial intelligence and remote collaborative expertise. URL: https://www.unige.ch/medecine/isg/en/research/one-health/snapp-first-medical-decisionsupport-tool-for-snake-identification-based-on-artificial-intelligence-and-remote-collaborative-expe/.
2. Image shared on the challenge page. URL: https://crowdai-shared.s3.eu-central-1.amazonaws.com/markdown_editor/f4e927cb3680ceb410ff825a8c0a53c4_picture2_challenge_datasets.png.
3. Snakebite envenoming. URL: https://www.who.int/news-room/fact-sheets/detail/snakebite-envenoming.
4. Isabelle Bolon, Andrew M. Durso, Sara Botero Mesa, Nicolas Ray, Gabriel Alcoba, François Chappuis, and Rafael Ruiz de Castañeda. Identifying the snake: First scoping review on practices of communities and healthcare providers confronted with snakebite across the world. PLOS ONE, 15:1-24, 03 2020. URL: https://doi.org/10.1371/journal.pone.0229989.
5. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision - ECCV 2016, pages 630-645, Cham, 2016. Springer International Publishing.
6. Alexis Joly, Hervé Goëau, Stefan Kahl, Benjamin Deneu, Maximilien Servajean, Elijah Cole, Lukáš Picek, Rafael Ruiz De Castañeda, Isabelle Bolon, Titouan Lorieul, Christophe Botella, Hervé Glotin, Julien Champ, Willem-Pier Vellinga, Fabian-Robert Stöter, Andrew Durso, Pierre Bonnet, Ivan Eggel, and Henning Müller. Overview of LifeCLEF 2020: A system-oriented evaluation of automated species identification and species distribution prediction. In Proceedings of CLEF 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece, 2020.
7. Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (BiT): General visual representation learning, 2019. arXiv:1912.11370.
8. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
9. Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help?, 2019. arXiv:1906.02629.
10. Lukáš Picek, Rafael Ruiz De Castañeda, Andrew M. Durso, Isabelle Bolon, and Sharada Prasanna Mohanty. Overview of the SnakeCLEF 2020: Automatic snake species identification challenge. In CLEF task overview 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece, 2020.
11. Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization, 2019. arXiv:1903.10520.
12. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge, 2014. arXiv:1409.0575.
13. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2013. arXiv:1312.6034.
14. Yuxin Wu and Kaiming He. Group normalization, 2018. arXiv:1803.08494.
15. Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization, 2017. arXiv:1710.09412.