<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>EficientNets and Vision Transformers for Snake Species Identification Using Image and Location Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>FHDO Biomedical Computer Science Group (BCSG)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Louise Bloch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph M. Friedrich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Applied Sciences and Arts Dortmund (FHDO)</institution>
          ,
          <addr-line>Emil-Figge-Straße 42, 44227 Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen</institution>
          ,
          <addr-line>Essen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <fpage>1</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>The automatic classification of snake species based on non-standardized photographs is important to improve the care of patients sufering from snake bites. The SnakeCLEF 2021 challenge, provides a large database containing images and their recording location of 772 snake species to overcome this problem. This paper describes the participation of the FHDO Biomedical Computer Science Group (BCSG) in this challenge. In the experiments, deep learning-based EficientNets and Vision Transformer (ViT) models were trained. In a subsequent step, the prior probabilities of the location information were multiplied with the model predictions. An ensemble of both deep learning models achieved the best results, which was a macro averaging 1-score across countries of 82.88 % for the independent test set.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Snake species identification</kwd>
        <kwd>EficientNets</kwd>
        <kwd>Vision Transformer</kwd>
        <kwd>Image classification</kwd>
        <kwd>Metadata inclusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>can also reduce the number of snakes that were killed out of peoples fear and thus might
improve the protection of harmful snakes [6].</p>
      <p>The aim of the SnakeCLEF challenge is, to deploy data-driven analysis to improve the
identification of snake species based on non-standardized photographs. This paper summarizes
the experiments and results of FHDO BCSG for the SnakeCLEF 2021 challenge. The presented
approach expands the FHDO BCSG submissions [7] for SnakeCLEF 2020.</p>
      <p>The article is structured as follows: Section 2 describes related work in this field of research.
Afterwards, Section 3 gives a summary of the dataset and Section 4 describes the Machine
Learning (ML) workflow and the methods used to implement it. Section 5 shows the results
achieved using this workflow. Finally, the results are summarized and concluded in Section 6
which also mentions limitations and gives an outlook about future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The identification of snake species has been previously investigated using ML. However, the
manual extraction of features to describe snake species is tedious. Taxonomic features have been
previously extracted in a semiautomatic approach [8] to identify six species in 1,299 images.
Additionally, snake species identification is often performed using field-based investigations,
which contain unstructured and non-standardized photographs often of poor image quality.
Thus, most ML models used textural features or deep learning approaches to automatically
extract features from snake images.</p>
      <p>A textural approach [9] extracted Color and Edge Directivity Descriptor (CEDD) [10] features
to diferentiate between 22 Malaysian snake species. The dataset was recorded at the Perlis
Snake Park, Malaysia and contained 349 images. The rarest species in this dataset included
only three images. Five classical ML models were applied for the final classification. The best
classification accuracy of 89.22 % was achieved by the nearest neighbour classifier.</p>
      <p>Recently published studies [11, 12, 13, 14, 15] often used deep learning-based approaches to
distinguish between snake species. Some of the studies were designed as an Object Detection
(OD) task.</p>
      <p>For example, the mean Average Precision (mAP) of diferent deep learning-based OD methods
were compared to each other [11] to distinguish 1,027 images of eleven Australian snake species.
The dataset was extracted from ImageNet [16] and was augmented by a Google Image search3
and the least frequent class contained 60 images. The best mAP was achieved for a Faster
Region-Based Convolutional Neural Network (Faster R-CNN) [17] with a ResNet [18] backbone.</p>
      <p>Another similar approach [12] used Faster R-CNN with diferent detection layers. In this
approach, 250 images of nine species, all occurring on the Galápagos Islands, Ecuador were
distinguished from each other. The dataset was collected using three data sources, two internet
searches performed on the platforms Google and Flickr and an image dataset provided by the
Ecuadorian Institution of Tropical Herping4. The ResNet backbone achieved the best accuracy
of 75 %.
3Google Image Search: https://images.google.com/imghp?hl=de&amp;gl=de&amp;gws_rd=ssl, [Last accessed:
2021-074Tropical Herping: https://www.tropicalherping.com/, [Last accessed: 2021-07-01]</p>
      <p>Further studies performed deep learning-based classification tasks. For example, one
approach [13] compared the accuracies of three diferent classification networks namely VGG16 [ 19],
DenseNet161 [20] and MobileNetV2 [21]. The dataset contained 3,050 images containing 28
species. The GrabCut [22] algorithm was applied as a preprocessing step to extract the snakes
from the image background. After 50 training epochs, an accuracy of 72 % was reached for the
test dataset and the DenseNet161 architecture.</p>
      <p>Another approach [14] trained a deep Siamese network [23] for one-shot learning [24] to
classify between 84 snake species based on the World Health Organization (WHO) venomous
snake database5. The dataset contained 200 images and three to 16 images per class.</p>
      <p>Although there are more than 3,700 snake species worldwide [25], and more than 600 of them
were medically relevant [25], most recently ML approaches were trained on a small number of
snake species.</p>
      <p>
        The SnakeCLEF challenge [
        <xref ref-type="bibr" rid="ref1">1, 25</xref>
        ] provided a large dataset containing images of more than 700
snake species to overcome this problem. Diferent deep learning approaches were successfully
submitted in previous rounds of this challenge. The winning approach [26] of SnakeCLEF 2020
[27] used a ResNet architecture pre-trained on ImageNet-21k and reached a macro-averaging
F1-score of 62.54 %. The FHDO BCSG [7] combined OD and image classification using a Mask
Region-Based Convolutional Neural Network (Mask R-CNN) [28] instance detection framework
and an EficientNet-B4 [ 29] classification model. This method reached a macro-averaging
F1-score of 40.35 %. In post-competition submissions, the score could be optimized to 59.4 %.
      </p>
      <p>This research expands the ML workflow developed from FHDO BCSG [ 7] in SnakeCLEF 2020.
In particular, the workflow was extended to use Vision Transformer (ViT) [ 30] models and an
ensemble of ViTs and EficientNets.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The training dataset of the SnakeCLEF 2021 and AICrowd Snake Species Identification Challenge
round 5 consists of 386,006 photographs of 772 snake species. Those photographs were taken
in 188 countries. The photographs originated from three diferent data sources, two online
biodiversity platforms, namely iNaturalist6 (n=277,025 images; 71.77 %), and Herpmapper7
(n=58,351 images; 15.12 %) and another source containing noisy data downloaded from Flickr8
(n=50,630 images; 13.12 %). The entire dataset was split into a training set containing 347,405
photographs (90.00 %), and a validation set containing 38,601 subjects (10.00 %). The class
distribution of the united training and validation set is visualized in Figure 1. It can be seen that
the dataset was highly imbalanced. Both subsets followed similar class distributions and each
class was represented in both datasets. The test dataset consists of 23,673 images.</p>
      <p>5Venomous snakes distribution and species risk categories:
snakeantivenoms/database/, [Last accessed: 2021-07-01]
6iNaturalist: https://www.inaturalist.org/, [Last accessed: 2021-07-01]
7Herpmapper: https://www.herpmapper.org/, [Last accessed: 2021-07-01]
8Flickr: https://www.flickr .com/, [Last accessed: 2021-07-01]
https://apps.who.int/bloodproducts/</p>
      <p>Snake species</p>
      <sec id="sec-3-1">
        <title>3.1. Metadata</title>
        <p>In addition to the photographs, metadata that provides information about the continent and
country of the image location were available. Most images (n=246,482; 63.85 %) were recorded in
the United States of America. For 50,879 (13.18 %) images, no country information was provided
and 51,061 images (13.23 %), had no continent information. Those images were marked with
the “unknown” flag.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methods</title>
      <p>The ML workflow used to learn the diferences between snake species is visualized in Figure 2.
This ML workflow was implemented in a modular way, thus, during the challenge diferent
combinations of workflow parts and their interactions were investigated. The entire workflow
was implemented using the programming language Python v3.6.9 [31].</p>
      <p>The preprocessing stage starts with optional filtering of the dataset. Afterwards, optional OD
was implemented, which was trained to detect single snakes in the photographs. Due to time
constraints, no models using the OD could be submitted during the challenge. The OD stage
was followed by an image preprocessing stage which produced images of uniform, quadratic
size. Afterwards, data augmentation was used to make the subsequent deep learning models
more robust, for example, against rotation, scaling, and noise. EficientNets and ViTs were
trained to distinguish between the snake species. Finally, optional multiplication of the models’
prediction probabilities and the a priori probability distribution of the snake species given the
location was implemented.</p>
      <p>Image dataset</p>
      <p>Dataset filtering</p>
      <p>Object detection
Image augmentation</p>
      <p>Model training
Classification</p>
      <p>Metadata
multiplication</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset Filtering</title>
        <p>An analysis of the dataset with AntiDupl9 revealed 29 “Image not found” images, which were
the result of download problems.</p>
        <p>Another problem that has been found are out-of-class images appearing in the Flickr dataset.
These images contain no snakes but for example, ice-hockey players, churches, other animals,
persons, and mangas. To identify them for exclusion from the training set, a standard ImageNet
classifier with 1,000 classes and based on a ResNet50 [ 18] architecture has been used. Therefore,
a positive list of 46 snake and reptile classes (e.g., garter_snake, sidewinder, . . . ) that are part
of the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSRVC2012) [32] dataset
has been used. With this classifier, 6,110 out-of-class images (10.47 %) have been identified as
out-of-class images in the Flickr dataset. The excluded images were assigned to 384 species. The
ifltered dataset contained images of all 772 snake species. The least frequent species after the
ifltering were “bolyeria multocarinata” and “echinanthera melanostigma”, both containing one
image. The out-of-class images were manually checked and a large proportion of the identified
images visualize no snakes. Due to reasons of time limitations during the competition, the
out-of-class images were not formally validated. The efects of the reduced dataset have been
tested and compared to the unfiltered dataset.</p>
        <p>9https://github.com/ermig1979/AntiDupl, [Last accessed: 2021-07-01]</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Object Detection</title>
        <p>The aim of using OD as a preprocessing step to improve image classification was, to focus the
subsequently implemented classification models on the object that should be classified. The
idea of using OD for the identification of snake species in photographs before executing the
image classification was inspired by the winning team [ 33] of round 2 of the AICrowd Snake
Species Identification Challenge [ 25]. The OD used in this paper was already implemented
and described in the contribution of the FHDO BCSG in the SnakeCLEF 2020 challenge [7]. A
Mask R-CNN [28] model was trained to identify snakes in non-standardized photographs. The
Mask R-CNN implements instance segmentation, which is a combination of OD and semantic
segmentation. Thus, in each image, bounding boxes are identified and classified for each object
of interest and each pixel of those bounding boxes was segmented into a range of given classes.
Mask R-CNN is an extension of the OD method Faster R-CNN. The Mask R-CNN architecture
consists of two phases, in the first phase, identically to Faster R-CNN, a Region Proposal Network
(RPN) was implemented to identify candidate bounding boxes followed by a non-maximum
suppression [34] to focus on the most promising candidates. The second stage first used a
Region Of Interest (ROI) Align Network on the remaining candidate bounding boxes followed
by a parallel implementation of Fully Connected Networks (FCN) to identify the object class
and the ofset of the bounding boxes as well as a Convolutional Neural Network (CNN) for the
semantic segmentation task.</p>
        <p>In this research, the Mask R-CNN was only used as an OD framework to diferentiate between
snakes and background and did not use the whole instance segmentation pipeline. Due to this
reason, it may be an adequate alternative to use Faster R-CNN instead of the Mask R-CNN. The
comparison of both models in the TensorFlow OD Application Programming Interface (API) [35]
for the Microsoft Common Objects in Context (COCO) dataset [36] shows an increased mAP for
the Mask R-CNN model. ResNet-50 was used as the backbone network and was pre-trained on
the ImageNet-1k dataset. Based on this model, the OD training process was split into two phases.
In the first warm-up phase, the newly added layers were trained for 20 epochs. Afterwards, the
weights of the entire model were fine-tuned for another 30 epochs.</p>
        <p>An adaption10 to Tensorflow 2.1.0 of the implementation of the Mask R-CNN model
implemented by Abdulla11 has been used to implement the Mask R-CNN. The OD process included
no data augmentation. A threshold of 0.3 was used for the minimum detection confidence.
Stochastic Gradient Descent (SDG) was used to optimize the model weights, with a momentum
value of 0.9, a weight decay of 10− 4 and a mini-batch size of 8.</p>
        <p>The annotated snake images were available from the winning solution of round 2 [33] of the
AICrowd Snake Species Identification Challenge. This dataset contained annotated bounding
boxes for 1,426 snake images which was a subset of round 2 [33] of the AICrowd Snake Species
Identification Challenge.</p>
        <p>The risk of using OD as a preprocessing step for species identification was that important
background information was excluded from the images.</p>
        <p>10DifProML Mask R-CNN: https://github .com/DifPro-ML/Mask_RCNN, [Last accessed: 2021-07-01]
11Matterport Mask R-CNN: https://github.com/matterport/Mask_RCNN, [Last accessed: 2021-07-01]</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Image Pre-processing</title>
        <p>Deep learning models expect the input of squared images. However, the ROIs which were
extracted from the OD procedure do not have predefined dimensions. Thus, the ROIs need to
be transformed to the image dimensions of the deep learning model. If all ROI dimensions were
larger than the image dimensions of the deep learning model, the ROIs were resized. Images
that were smaller than the image dimensions of the deep learning model, were not resized. The
aspect ratio of the ROIs was not modified during the image preprocessing. Thus, occurring
borders were padded using a specified color. In this approach, the image borders, which result
from non-squared OD results were padded with the mean color of the truncated image parts, to
ifnd a color that matched the image best [37].</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Data Augmentation</title>
        <p>Data augmentation has been used to increase the training image dataset by adding slightly
modified copies from existing training images to the training dataset. This method helps to
reduce overfitting in deep learning models [ 38]. For the subsequently used classification models,
diferent data augmentation procedures were implemented.</p>
        <p>For all EficientNet models, the data augmentation pipeline includes random cropping of the
images from a size of 430 × 430 pixels to 380 × 380 pixels, random rotation in the range of
± 40∘ , a width shift, height shift, random shearing and zooming each with a factor of 0.2 as
well as a random horizontal flipping. During the data augmentation procedure, missing image
positions were padded with the color of the nearest pixel neighbour. Additionally, the Lanczos
interpolation [39] was used.</p>
        <p>For the ViT models, the augmentation pipeline included random cropping from an image
size of 250 × 250 to 224 × 224, and a random horizontal as well as vertical flip each with a
probability of 0.5.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Classification</title>
        <p>Two diferent model types, as well as an ensemble, were trained to detect snake species.
EficientNets were already used in the SnakeCLEF 2020 challenge from the FHDO BCSG [7]. In
comparison to those models, ViT models were used to compare their results and to examine if
an ensemble of both models can improve the classification results.</p>
        <sec id="sec-4-5-1">
          <title>4.5.1. EficientNets</title>
          <p>EficientNets [ 29] are an optimized CNN based model family. The base architecture of
EficientNets was developed by a CNN architecture search, which was optimized for accuracy
and Floating Point Operations Per Second (FLOPS). The main building blocks of the resulting
EficientNets are Mobile Inverted Bottleneck Convolutional (MBConv) layers. The base model
is successively scaled up using a uniform balance between model depth, model width and
image resolution. The developed model architecture achieved state of the art performances
on the ImageNet classification task while being smaller and faster than many of the compared
models [29].</p>
          <p>In this work, EficientNet-B4 models were trained to diferentiate between snake species. All
images were scaled to an image size of 380 × 380 pixels. The model weights were initialized by
a model which was pre-trained using noisy student [40]. This model was extended by adding a
lfatten layer, a dense layer with 1,000 neurons using a Swish [ 41] activation function and an
output dense layer with 772 neurons using the softmax activation function. The entire model
contained 276,495,588 parameters.</p>
          <p>A warm-up phase of three epochs was used to train the weights of the newly added layers
and the Batch Normalization layers. Afterwards, the weights of all layers were optimized for a
larger number of epochs. All EficientNets were optimized by using an Adam optimizer [ 42]
with the parameters  1 = 0.9,  2 = 0.999 and  = 10− 7. In both phases, a mini-batch size of
11 was used. The warm-up phase used a learning rate of 10− 4 and afterwards, the learning rate
was decreased to 10− 6.</p>
          <p>Figure 1 shows, that the training dataset is imbalanced across snake species. During the
SnakeCLEF 2020 challenge, the FHDO BCSG tested diferent oversampling techniques for snake
identification using EficientNets [ 7]. In this research, the class weight oversampling function
which performed best in the SnakeCLEF 2020 and is described in Equation 1 was implemented.
The oversampling rate increases less for classes with very small frequencies in comparison to a
linear oversampling function.  () denoted the frequency of class .</p>
          <p>1
() = 1 − √︁ max  () + 0.5
 ()
(1)</p>
        </sec>
        <sec id="sec-4-5-2">
          <title>4.5.2. Vision Transformers</title>
          <p>ViTs [30] are an alternative to classical CNN deep learning-based image classifiers. Instead
of using convolutional operations which focus on local parts of the image, ViTs consider
the whole image in parallel. ViT were based on the self-attention principle [43], which was
previously applied to Natural Language Processing (NLP). The model design of ViTs follows
the Transformers which have been introduced in NLP [43]. In ViTs, the input image is first
disassembled into a set number of patches. Additionally, the position of the patch is encoded by
a positional encoder. Multiple transformer encoders, which each mainly consists of Multiheaded
Self-Attention (MSA) layers and a Multi-Layer Perceptron (MLP) were used to encode the image
patches. Another MLP is trained to learn the overall classification from the image encoding.</p>
          <p>Two diferent ViT models were trained in this research. First, a ViT Base model architecture
with an image size of 224 × 224 pixels, a patch size of 16 × 16 pixels which was pre-trained
on the ImageNet21k dataset was used. This model was trained using a mini-batch size of 70, a
learning rate of 10− 5 an Adam optimizer ( 1 = 0.9,  2 = 0.999,  = 10− 8) and cross-entropy
loss. The ViT Base model contained 86,392,324 parameters.</p>
          <p>During the challenge, another ViT model was trained. The architecture of this model was a
ViT Large model with an image size of 224 × 224 pixels and a patch size of 16 × 16 pixels. This
model was also initialized using the weights achieved for the ImageNet21k dataset. During the
training of this model, a mini-batch size of 18 was used and a learning rate of 10− 5. This model
was also trained using an Adam optimizer ( 1 = 0.9,  2 = 0.999,  = 10− 8) and cross-entropy
loss. The ViT Large model contained 304,092,932 parameters.</p>
          <p>The pre-trained ViT models were loaded from the Python library timm v0.4.8 [44] and PyTorch
v1.8.0+cu101 [45] was used to train the model.</p>
          <p>No class weight oversampling was used for all ViT models in our experiments. Preliminary
experimental tests on the training dataset showed decreased classification results using the
class weight function used for the EficientNets with the described ViT models and given
hyperparameters. However, due to reasons of time limitations during the challenge, no complex
hyperparameter optimization was applied.</p>
        </sec>
        <sec id="sec-4-5-3">
          <title>4.5.3. Ensemble Model</title>
          <p>In addition, a simple ensemble model was added in the experiments. This model ensembles the
results of a ViT Large model and an EficientNet-B4 model, by calculating the mean softmax-scaled
probability predictions of both models. It was expected that the ensemble model outperforms
the results of the individual models.</p>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Adding Location Information</title>
        <p>The last part of the ML workflow was to optionally add location information to probability
predictions of the classification models. For this reason, the prior probabilities were estimated
by the relative frequency distribution of the snake species at a location in the training and
validation dataset. Two diferent strategies were applied to combine the location and the image
information. In the first strategy, the prior probabilities of the image location were multiplied
with the prediction probabilities of the image. The second strategy was similar to the first
one, except that the prior probabilities were binarized with a cut-of value of 0. Thus, in the
binarized strategy, prior probabilities with non-zero values were transformed to one, whereas
prior probabilities with a value of zero were kept unchanged. Both strategies primarily use
the country in this step, as it contains more information than the continent. The continent
information was only added for images with unknown country information. For images with
missing country and continent information, the prior probability of the “unknown” class was
used.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The following sections describe the classification results of the ML workflow and diferent
ablation studies executed to the workflow modules for the validation and the test dataset.</p>
      <sec id="sec-5-1">
        <title>5.1. Preliminary Experiments on the Validation Dataset</title>
        <p>Table 1 summarizes the macro-averaging 1-scores of the classification models for the original
(1) and the filtered ( 1−  ) version of the validation dataset. The experiments included
the use of deep learning-based classifiers (introduced in Sec. 4.5), metadata encoding strategies
(introduced in Sec. 4.6), a filtering strategy (introduced in Sec. 4.1) and the used training dataset
(t: only training dataset, t+v: unified training and validation dataset). For each model, the results
of three versions were provided. The results of the base models are named by an “S” followed
ID
by a single number, a “C_” was added as a prefix for all models which used a multiplication of
the location information and a “B_” was added as a prefix for the binary country encoding.</p>
        <p>Model S4 was trained on the training and validation set and thus overestimated the 1-score
for the validation dataset. The idea of this model was, to improve the results on the test set.</p>
        <p>Figure 3 compares the 1-scores of the ablation study executed for diferent classifiers.
The validation results of model S5 and model S6 showed improved performances for the ViT
Large model in comparison to the ViT Base model, although, the ViT Large model was only
trained for 13 epochs, whereas the ViT Base model was trained for 50 epochs. Model S1, which
was an EficientNet-B4 model trained for 123 epochs achieved better classification results for the
validation dataset than the ViT Large model. However, the overall comparison of these models is
quite complex due to diferences in the training pipelines. The best results for both, the original
and the filtered validation dataset, were achieved for the ensemble model S7. Ensemble model
B_S7 which used the binary encoding of the location information achieved an 1-score of
63.20 % and an 1−  -score of 63.76 %. Thus, the ensemble model outperformed the results
of the individual models.</p>
        <p>Figure 4 compares the 1-scores for the diferent metadata inclusion strategies. It can be
seen, that all models showed increased classification results if the location information were
multiplied with the model predictions. The binary encoding of the location information reached
an additional increase of the performances for models S1, S2, S3, S4, S5 and S7.</p>
        <p>Figure 5 compares the 1−  -scores achieved using the dataset filtering strategy. This
comparison did not show a positive efect for the dataset filtering strategy in these experiments
neither for the 1-score nor the 1−  -score. One reason for this efect might be the
smaller number of diferent images for rare classes. Another problem in the experiments was
that none of the EficientNet models saturated for the validation set, thus it was hard to do a
fair comparison between the models trained on the entire and the filtered dataset. Therefore,
the efect of the dataset filtering was also investigated in the post-competition experiments</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Challenge Submissions and Results</title>
        <p>Table 2 summarizes the results of the classification models for the oficial test dataset. Each
team was able to submit up to ten submissions to the organizing team. Thus, some models in
Table 1 were not validated for the test set. The primary metric for the SnakeCLEF 2021 challenge
was the macro averaging 1-score across countries for the independent test set (1). In
addition, this table summarizes the macro-averaging 1-scores (1) and the accuracy for the
test set (). Similar to Table 1, the experiments included the use of deep learning-based
classifiers (introduced in Sec. 4.5), metadata encoding strategies (introduced in Sec. 4.6), a
ifltering strategy (introduced in Sec. 4.1) and the used training dataset (t: only training dataset,
t+v: unified training and validation dataset). Due to time constraints, none of the submissions
used the described OD during the challenge. Thus, the efect of the OD was investigated in the
post-competition experiments described in Section 5.3. The submission IDs correspond to the
IDs of Table 1 with the name of the team “FHDO-BCSG-” as a prefix.</p>
        <p>Similar to the validation results, the comparison of model FHDO-BCSG-B_S6 and model
FHDO-BCSG-B_S5 showed that the ViT Large model architecture outperformed the ViT Base
architecture, although the ViT Base model was trained using a larger number of epochs. The
ViT Large model achieved an 1-score of 77.39 %. Additionally, model FHDO-BCSG-B_S1,
which was the best performing EficientNet-B4 model, reached better results than model
FHDOBCSG-B_S5, which was the ViT Large model. Model FHDO-BCSG-B_S1 reached an 1-score
of 78.33 %. Both models were trained with diferent data augmentation pipelines and learning
parameters, which make it hard to do a fair model comparison. The comparison of models
FHDO-BCSG-B_S1, FHDO-BCSG-B_S5 and FHDO-BCSG-B_S7 showed that the ensemble model,
which was a combination of model S5 and model S1 outperformed the individual models.</p>
        <p>The comparison of models FHDO-BCSG-C_S7 and FHDO-BCSG-B_S7, showed that using
binary location information achieved slightly better results than using the raw prior probabilities
for the ensemble model. The results of Model FHDO-BCSG-C_S7, which multiplied the raw
aThis result was afected by a mistake during the softmax normalization in the initial submission.
Post-competition investigations corrected this score to a value of 72.71 %.
bThis result was afected by a mistake during the softmax normalization in the initial submission.
Post-competition investigations corrected this score to a value of 79.89 %.
cThis result was afected by a mistake during the softmax normalization in the initial submission.</p>
        <p>Post-competition investigations corrected this score to a value of 88.74 %.
prior probabilities of the location to the model predictions achieved worse results in all metrics
in comparison to model FHDO-BCSG-B_S7, which used a binarized version of the location
information. Due to a mistake during the softmax normalization in the initial submission, the
diferences observed by using both location information methods for the ViT Large model
showed stronger diferences. The post-competition resubmission of this model with corrected
softmax normalization led to similar results of both models.</p>
        <p>Model FHDO-BCSG-B_S3, which was trained on the filtered version of the training dataset,
achieved slightly worse 1 and 1-scores but a better 1-score in comparison to model
FHDO-BCSG-B_S2, which was trained on the unfiltered dataset. Model FHDO-BCSG-B_S3 also
reached the overall best -score of 91.17 %.</p>
        <p>In comparison to model FHDO-BCSG-B_S3, model FHDO-BCSG-B_S4 was trained on the
unified training and validation dataset. Due to reasons of time limitations during the
competition, this model was only trained for 25 instead of 113 epochs. This model showed that the
classification results might be improved by training the EficientNet-B4 model on the training
and validation dataset.</p>
        <p>The best model submitted by FHDO BCSG was model FHDO-BCSG-B_S7, which was an
ensemble of model S1 and model S5 and added binary location to the prediction probabilities.
This model reached an 1-score of 78.75 %, an 1 of 82.88 % and an  of 90.42 %.
Model FHDO-BCSG-B_S7 reached fourth place in the SnakeCLEF 2021 challenge.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Post-Competition Experiments</title>
        <p>As previously mentioned, due to reasons of time limitations during the competition, no fair
comparison between a model that used OD and one that did not use OD was performed during
the competition. The same applied to using the dataset filtering strategy. Therefore, some
post-competition experiments were executed leading to the results presented in Table 3 and
Table 4. Additionally, the implementation of the EficientNet models showed some drawbacks.
Mainly, the implementation limited the mini-batch size to a small number, which leads to a high
number of required epochs before the model saturated. Therefore, the EficientNet training
pipeline was reimplemented after the competition deadline. This reimplementation was similar
to the implementation of the ViT training pipeline and thus the pre-trained EficientNet models
were loaded from the Python library timm v0.4.8 [44] and PyTorch v1.8.0+cu101 [45] was used
to train the models.</p>
        <p>Following the implementation described in Section 4.5.1, EficientNet-B4 models were used
for all experiments. It should be mentioned that the reimplemented training pipeline difers
from the pipeline described in Section 4.5.1. All models were pre-trained on the ImageNet
dataset. Additionally, a mini-batch size of 25, a learning rate of 10− 4 an Adam optimizer
( 1 = 0.9,  2 = 0.999,  = 10− 8) and cross-entropy loss were used to train the models. No
additional dense layer was introduced before the classification layer. Thus, the models contain
18,932,812 parameters. Similar to the ViT implementation, no class weights were used to train
the models. A scheduler was implemented which reduced the learning rate by a factor of 0.1 if
the classification accuracies did not improve for five epochs. Early stopping was implemented if
the loss function showed no improvement for more than eight epochs. The maximum number of
epochs chosen was 100. A mixed-precision strategy was implemented to speed up the training
process and increase the mini-batch size.</p>
        <p>Similar to the ViT models, the augmentation pipeline included random cropping from an
image size of 418 × 418 to 380 × 380, and a random horizontal as well as vertical flip each with
a probability of 0.5.</p>
        <p>The experimental results for the validation dataset are summarized in Table 3. In comparison
to the models submitted during the challenge, the results showed that the number of epochs to
train the EficientNet model was reduced from more than 100 to less than 10 epochs using the
reimplemented pipeline. Nevertheless, the achieved results outperformed the challenge results.</p>
        <p>The efect of two pipeline modules was investigated. First, the efect of the dataset filtering
strategy was examined by training two models, model S8 was trained on the entire training
dataset and model S9 used the filtered training set. As can be seen in Figure 6, the comparison
of both models showed improved results for model S9. Additionally, the overall best
1score of 70.92 % was achieved for model B_S9, which used the binary encoding of the location
information. The best 1−  -score was 70.11 % reached by model C_S9. The best model
trained on the entire training dataset was model B_S8, which reached an 1-score of 68.55 %
and an 1−  -score of 68.85 %.
Binary</p>
        <p>The second experiment investigated the efect of the OD module and was visualized in Figure 7
using barplots. Therefore, the 1-scores achieved by model S9 were compared to those reached
for model S10, trained using the OD module described in Section 4.2. It can be noted that the
results of model S10 reached worse results than model S9. Thus, for this model architecture,
the OD module harmed the classification results. It was assumed that this reduction of the
classification results might lead from the preprocessing pipeline used after the OD. For this
reason, model S11 was trained using the OD module but no further image preprocessing. It can
be noted that the results for model S11 outperformed the results of model S10, and reached
similar results as model S9. However, the OD showed no improvement for the validation dataset
in comparison to using the unprocessed images.</p>
        <p>All post-competition models with the binary location encoding were evaluated for the test
set. The achieved results are summarized in Table 4. Model B_S8 was the EficientNet-B4 model
that was trained on the entire training dataset. This model reached an 1-score of 65.70 %.
Model B_S9, which was trained for the filtered dataset, outperformed this model by reaching an
ID
1-score of 78.11 %.</p>
        <p>The comparison between model B_S9, B_S10 and B_S11 investigated the efect of the OD
module. Model B_S10 used both, the OD and image preprocessing pipelines and performed
worse than model B_S9. However, model B_S11 that used only the OD module outperformed
both models with an 1-score of 78.44 %.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Overall, snake species identification is a challenging task, dealing with highly imbalanced class
distributions, high intra-class variance and small inter-class variance.</p>
      <p>The experiments investigated in this research, show that both, EficientNets and ViTs can
be successfully trained to distinguish between snake species. The best results were achieved
by combining both model types using an ensemble model. ViT models which used the large
architecture achieved better classification results than the ViT Base architecture.</p>
      <p>The multiplication of location information with the model predictions can improve the results
of both model types.</p>
      <p>The training dataset was also filtered for images that did not include snakes. This filtering
strategy showed improvements in the post-competition experiments for the test set. The visual
validation of the out-of-class images showed some misclassified images that contained snakes. It
was planned to improve this filtering strategy in future work to prevent those misclassifications
using more recently deep learning models for this step.</p>
      <p>Due to time constraints, only one model was trained on the entire training and validation
dataset. It was expected that a model training on the entire dataset would increase the results
for the test dataset. Additionally, during the competition, most models did not reach a plateau
and thus the optimal number of epochs and the optimal classification results were not achieved
for the submissions. This fact may also lead to worse performances for the filtering strategy
during the competition and could be surmounted in the post-competition experiments. Another
drawback that resulted from time constraints was that no models were evaluated for the OD
method during the challenge. Post-competition experiments overcome this limitation and the
OD showed improved test results when using no image preprocessing after the OD.</p>
      <p>As has already been prepared in the post-competition experiments, future work will address
the implementation of a more uniform implementation pipeline to enable a fair comparison of
EficientNets and ViT models. Additionally, the impact of diferent mini-batch sizes and learning
rates should be investigated more systematically. To improve the classification results, up-scaled
EficientNet and ViT architectures should also be used and the pipeline will be enhanced using
EficientNetv2 classifiers [ 46]. Furthermore, the results of using diferent Transfer-Learning
strategies should be compared to each other.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The work of Louise Bloch was partially funded by a PhD grant from University of Applied
Sciences and Arts Dortmund, Dortmund, Germany.</p>
      <p>The authors thank Lukáš Picek, Department of Cybernetics, University of West Bohemia,
Czechia, for post-competition evaluation.
Species Identification and Species Distribution Prediction, in: Proceedings of the 12th
International Conference of the CLEF Association (CLEF 2021): 21-24.09.2021; Bucharest,
Romania, 2021.
[3] A. Joly, H. Goëau, E. Cole, S. Kahl, L. Picek, H. Glotin, B. Deneu, M. Servajean, T. Lorieul,
W.-P. Vellinga, P. Bonnet, A. M. Durso, R. R. de Castañeda, I. Eggel, H. Müller, LifeCLEF
2021 Teaser: Biodiversity Identification and Prediction Challenges, in: D. Hiemstra, M.-F.
Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Proceedings of the European
Conference on Information Retrieval (ECIR 2021): 28.03-01.04.2021; Online Event, Springer
International Publishing, Cham, 2021, pp. 601–607.
doi:10.1007/978-3-030-722401_70.
[4] J. M. Gutiérrez, J. J. Calvete, A. G. Habib, R. A. Harrison, D. J. Williams, D. A.
Warrell, Snakebite Envenoming, Nature Reviews Disease Primers 3 (2017). doi:10.1038/
nrdp.2017.63.
[5] H. F. Williams, H. J. Layfield, T. Vallance, K. Patel, A. B. Bicknell, S. A. Trim, S.
Vaiyapuri, The Urgent Need to Develop Novel Strategies for the Diagnosis and Treatment of
Snakebites, Toxins 11 (2019). doi:10.3390/toxins11060363.
[6] H.-T. Tseng, L.-K. Huang, C.-C. Hsieh, No More Fear of Every Snake: Applying
ChatbotBased Learning System for Snake Knowledge Promotion Improvement: A Regional Snake
Knowledge Learning System, in: Proceedings of the IEEE 20th International Conference
on Advanced Learning Technologies (ICALT 2020): 06-09.07.2020; Tartu, Estonia, 2020, pp.
72–76. doi:10.1109/ICALT49669.2020.00029.
[7] L. Bloch, A. Boketta, C. Keibel, E. M. an Alex Michailutschenko, O. Pelka, J. Rückert, L.
Willemeit, C. M. Friedrich, Combination of Image and Location Information for Snake Species
Identification using Object Detection and EficientNets, in: Working Notes of the 11th
Conference and Labs of the Evaluation Forum (CLEF 2020): 22-25.09.2020; Thessaloniki,
Greece, 2020. URL: http://ceur-ws.org/Vol-2696/paper_201.pdf.
[8] A. James, D. Kumar, B. Mathews, S. Sugathan, Discriminative Histogram Taxonomy
Features for Snake Species Identification, Human-Centric Computing and Information
Sciences 4 (2014). doi:10.1186/s13673-014-0003-0.
[9] A. Amir, N. A. H. Zahri, N. Yaakob, R. B. Ahmad, Image Classification for Snake Species
Using Machine Learning Techniques, in: S. Phon-Amnuaisuk, T.-W. Au, S. Omar (Eds.),
Proceedings of the Computational Intelligence in Information Systems Conference (CIIS
2016): 18-20.11.2016; Brunei, Brunei Darussalam, Springer International Publishing, Cham,
2017, pp. 52–59. doi:10.1007/978-3-319-48517-1_5.
[10] S. A. Chatzichristofis, Y. S. Boutalis, CEDD: Color and Edge Directivity Descriptor: A
Compact Descriptor for Image Indexing and Retrieval, in: A. Gasteratos, M. Vincze, J. K.
Tsotsos (Eds.), Proceedings of the International Computer Vision Systems (ICVS 2008):
12-15.05.2008; Santorini, Greece, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp.
312–322. doi:10.1007/978-3-540-79547-6_30.
[11] Z. Yang, R. Sinnott, Snake Detection and Classification using Deep Learning, in:
Proceedings of the 54th Hawaii International Conference on System Sciences (HICSS 2021):
05-08.1.2021; Maui, Hawaii, US, 2021. doi:10.24251/hicss.2021.148.
[12] A. Patel, L. Cheung, N. Khatod, I. Matijosaitiene, A. Arteaga, J. W. Gilkey, Revealing
the Unknown: Real-Time Recognition of Galápagos Snake Species Using Deep Learning,
Animals 10 (2020) 806. doi:10.3390/ani10050806.
[13] M. Vasmatkar, I. Zare, P. Kumbla, S. Pimpalkar, A. Sharma, Snake Species
Identiifcation and Recognition, in: Proceedings of the IEEE Bombay Section Signature
Conference (IBSSC 2020): 04-06.12.2020; Mumbai, India, 2020, pp. 1–5. doi:10.1109/
IBSSC51096.2020.9332218.
[14] C. Abeysinghe, A. Welivita, I. Perera, Snake Image Classification Using Siamese Networks,
in: Proceedings of the 3rd International Conference on Graphics and Signal
Processing (ICGSP 2019): 01-03.06.2019; Hong Kong, Hong Kong, Association for Computing
Machinery, New York, NY, USA, 2019, p. 8–12. doi:10.1145/3338472.3338476.
[15] I. S. Abdurrazaq, S. Suyanto, D. Q. Utama, Image-Based Classification of Snake Species
Using Convolutional Neural Network, in: Proceedings of the International Seminar on
Research of Information Technology and Intelligent Systems (ISRITI 2019): 05-06.12.2019;
Yogyakarta, Indonesia, Institute of Electrical and Electronics Engineers (IEEE), 2019, pp.
97–102. doi:10.1109/isriti48646.2019.9034633.
[16] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, ImageNet: A Large-Scale Hierarchical
Image Database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2009): 20-25.06.2009; Miami Beach, Florida, US, Institute of Electrical
and Electronics Engineers (IEEE), 2009, pp. 248–255. doi:10.1109/cvpr.2009.5206848.
[17] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine
Intelligence 39 (2017) 1137–1149. doi:10.1109/tpami.2016.2577031.
[18] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in:
Proceesings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016):
27-30.07.2016; Las Vegas, Nevada, US, Institute of Electrical and Electronics Engineers
(IEEE), 2016, pp. 770–778. doi:10.1109/cvpr.2016.90.
[19] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image
Recognition, in: Y. Bengio, Y. LeCun (Eds.), Conference Track Proceedings of the 3rd
International Conference on Learning Representations (ICLR 2015): 07-09.05.2015; San
Diego, California, US, 2015. URL: http://arxiv.org/abs/1409.1556.
[20] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely Connected
Convolutional Networks, in: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR 2017): 21-26.07.2017, Honolulu, Hawaii, 2017, pp. 2261–2269.
doi:10.1109/CVPR.2017.243.
[21] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: Inverted
Residuals and Linear Bottlenecks, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR 2018): 18-22.06.2018, Salt Lake City, Utah, US, 2018,
pp. 4510–4520. doi:10.1109/CVPR.2018.00474.
[22] S. Hua, P. Shi, GrabCut Color Image Segmentation Based on Region of Interest, in:
Proceedings of the 7th International Congress on Image and Signal Processing (ICISP 2014):
30.06-02.07.2014; Cherburg, France, 2014, pp. 392–396. doi:10.1109/CISP.2014.7003812.
[23] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature Verification Using a
"Siamese" Time Delay Neural Network, in: J. Cowan, G. Tesauro, J. Alspector (Eds.),
Advances in Neural Information Processing Systems (NIPS 1993): Denver, Colorado, US,
volume 6, Morgan-Kaufmann, 1994. URL: https://proceedings.neurips.cc/paper/1993/file/
288cc0f022877bd3df94bc9360b9c5d-Paper .pdf.
[24] G. Koch, R. Zemel, R. Salakhutdinov, Siamese Neural Networks for One-Shot Image
Recognition, in: Proceedings of the Deep Learning workshop of the International Conference
on Machine Learning (ICML 2015): 06-11.06.2015; Lille, France, volume 2, 2015. URL:
https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf.
[25] A. M. Durso, G. K. Moorthy, S. P. Mohanty, I. Bolon, M. Salathé, R. Ruiz de Castañeda,
Supervised Learning Computer Vision Benchmark for Snake Species Identification from
Photographs: Implications for Herpetology and Global Health, Frontiers in Artificial
Intelligence 4 (2021) 17. doi:10.3389/frai.2021.582110.
[26] M. G. Krishnan, Impact of Pretrained Networks for Snake Species Classification, in:
Working Notes of the 11th Conference and Labs of the Evaluation Forum (CLEF 2020):
2225.09.2020; Thessaloniki, Greece, 2020. URL: http://ceur-ws.org/Vol-2696/paper_194.pdf.
[27] L. Picek, R. Ruiz De Castañeda, A. M. Durso, P. M. Sharada, Overview of the SnakeCLEF
2020: Automatic Snake Species Identification Challenge, in: Proceedings of the 11th
Conference and Labs of the Evaluation Forum (CLEF 2020): 22-25.09.2020; Thessaloniki,
Greece, 2020. URL: http://ceur-ws.org/Vol-2696/paper_258.pdf.
[28] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: Proceedings of the IEEE
International Conference on Computer Vision (ICCV 2017): 22-29.10.2017; Venice, Italy,
Institute of Electrical and Electronics Engineers (IEEE), 2017, pp. 2980–2988. doi:10.1109/
iccv.2017.322.
[29] M. Tan, Q. Le, EficientNet: Rethinking Model Scaling for Convolutional Neural Networks,
in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference
on Machine Learning (ICML 2019): 10-15.06.2019; Long Beach, California, US, volume 97,
2019, pp. 6105–6114. URL: http://proceedings.mlr.press/v97/tan19a.html.
[30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M.
Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An Image is Worth
16x16 Words: Transformers for Image Recognition at Scale, in: Proceedings of the 9th
International Conference on Learning Representations (ICLR 2021): 03-07.05.2021; Online
Event, 2021. URL: https://openreview.net/forum?id=YicbFdNTTy.
[31] G. Van Rossum, F. L. Drake Jr, Python Tutorial, Centrum voor Wiskunde en Informatica</p>
      <p>Amsterdam, The Netherlands, 1995.
[32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition
Challenge, International Journal of Computer Vision 115 (2015) 211–252. doi:10.1007/
s11263-015-0816-y.
[33] Moorthy Gokula Krishnan, Diving into Deep Learning — Part 3 — A Deep
Learning Practitioner’s Attempt to Build State of the Art Snake-Species Image Classifier,
2019. URL:
https://medium.com/@Stormblessed/diving-into-deep-learning-part-3-a-deeplearning-practitioners-attempt-to-build-state-of-the-2460292bcfb, [last accessed:
2021-0524].
[34] R. Girshick, F. Iandola, T. Darrell, J. Malik, Deformable Part Models are Convolutional
Neural Networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2015): 07-12.06.2015; Boston, Massachusetts, US, 2015, pp. 437–446.
doi:10.1109/CVPR.2015.7298641.
[35] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna,
Y. Song, S. Guadarrama, K. Murphy, Speed/Accuracy Trade-Ofs for Modern
Convolutional Object Detectors, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR 2017): 21-26.07.2017; Honolulu, Hawaii, US,
2017, pp. 3296–3297. URL: https://openaccess.thecvf.com/content_cvpr_2017/papers/
Huang_SpeedAccuracy_Trade-Ofs_for_CVPR_2017_paper .pdf.
[36] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick,
Microsoft COCO: Common Objects in Context, in: D. Fleet, T. Pajdla, B. Schiele, T.
Tuytelaars (Eds.), Proceedings of the 13th European Conference on Computer Vision (ECCV
2014): 06-12.09.2014; Zurich, Switzerland, Springer International Publishing, Cham, 2014,
pp. 740–755. doi:10.1007/978-3-319-10602-1_48.
[37] S. Koitka, C. M. Friedrich, Optimized Convolutional Neural Network Ensembles for Medical
Subfigure Classification, in: G. J. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl,
L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and
Interaction: Proceedings of the 8th International Conference of the CLEF Association
(CLEF 2017): 11-14.09.2017; Dublin, Ireland, Springer International Publishing, Cham, 2017,
pp. 57–68. doi:10.1007/978-3-319-65813-1_5.
[38] C. Shorten, T. M. Khoshgoftaar, A Survey on Image Data Augmentation for Deep Learning,</p>
      <p>Journal of Big Data 6 (2019). doi:10.1186/s40537-019-0197-0.
[39] C. E. Duchon, Lanczos Filtering in One and Two Dimensions, Journal of Applied
Meteorology 18 (1979) 1016–1022. doi:10.1175/1520-0450(1979)018&lt;1016:lfioat&gt;
2.0.co;2.
[40] Q. Xie, M.-T. Luong, E. Hovy, Q. V. Le, Self-Training with Noisy Student Improves ImageNet
Classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR 2020): 14-19.06.2020; Online Event, 2020, pp. 10687–10698.
doi:10.1109/CVPR42600.2020.01070.
[41] P. Ramachandran, B. Zoph, Q. V. Le, Searching for Activation Functions, in: Proceedings of
the Workshop Track of 6th International Conference on Learning Representations (ICLR
2018): 30.4-03.5.2018; Vancouver, Canada, 2018. URL: https://openreview.net/forum?id=
SkBYYyZRZ.
[42] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: Proceedings of the
3rd International Conference for Learning Representations (ICLR 2014): 14-16.04.2014;
Banf, Canada, 2014. URL: https://arxiv .org/abs/1412.6980.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser,
I. Polosukhin, Attention is All you Need, in: I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural
Information Processing Systems (NIPS 2017): 04-09.12.2017; Long Beach, California, US,
volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/
ifle/3f5ee243547dee91fbd053c1c4a845aa-Paper .pdf.
[44] R. Wightman, PyTorch Image Models, 2019. doi:10.5281/zenodo.4414861.
[45] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An Imperative Style,
High-Performance Deep Learning Library, in: H. Wallach, H. Larochelle, A. Beygelzimer,</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Durso</surname>
          </string-name>
          , R. Ruiz De Castañeda,
          <string-name>
            <surname>I. Bolon</surname>
          </string-name>
          , Overview of SnakeCLEF 2021:
          <article-title>Automatic Snake Species Identification with Country-Level Focus</article-title>
          ,
          <source>in: Working Notes of the 12th International Conference of the CLEF Association (CLEF</source>
          <year>2021</year>
          ):
          <fpage>21</fpage>
          -
          <lpage>24</lpage>
          .
          <fpage>09</fpage>
          .
          <year>2021</year>
          ; Bucharest, Romania,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lorieul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , R. Ruiz De Castañeda, I. Bolon,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dorso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          , I. Eggel,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          , Overview of LifeCLEF 2021:
          <article-title>A System-oriented Evaluation of Automated</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>