=Paper=
{{Paper
|id=Vol-2696/paper_201
|storemode=property
|title=Combination of Image and Location Information for Snake Species Identification using Object Detection and EfficientNets
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_201.pdf
|volume=Vol-2696
|authors=Louise Bloch,Adrian Boketta,Christopher Keibel,Eric Mense,Alex Michailutschenko,Obioma Pelka,Johannes Rückert,Leon Willemeit,Christoph M. Friedrich
|dblpUrl=https://dblp.org/rec/conf/clef/BlochBKMMPRWF20
}}
==Combination of Image and Location Information for Snake Species Identification using Object Detection and EfficientNets==
Combination of image and location information for snake species identification using object detection and EfficientNets

FHDO Biomedical Computer Science Group (BCSG)

Louise Bloch 1,2 [0000-0001-7540-4980], Adrian Boketta 1 [0000-0002-4182-2479], Christopher Keibel 1 [0000-0003-4598-5504], Eric Mense 1 [0000-0003-2748-7958], Alex Michailutschenko 1, Obioma Pelka 1,3 [0000-0001-5156-4429], Johannes Rückert 1 [0000-0002-5038-5899], Leon Willemeit 1, and Christoph M. Friedrich 1,2 [0000-0001-7906-0038]

1 Department of Computer Science, University of Applied Sciences and Arts Dortmund (FHDO), Emil-Figge-Str. 42, 44227 Dortmund, Germany
{louise.bloch, obioma.pelka, johannes.rueckert, christoph.friedrich}@fh-dortmund.de, {adrian.boketta001, keibel, eric.mense001, alex.michailutschenko004, leon.willemeit002}@stud.fh-dortmund.de
2 Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Essen, Germany
3 Department of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Essen, Germany

Abstract. Snake species identification based on images is important to quickly treat patients suffering from snake bites using the correct antivenom. The SnakeCLEF 2020 challenge, which is part of the LifeCLEF research platform, is focused on this task and provides snake images and associated location information. This paper describes the participation of the FHDO Biomedical Computer Science Group (BCSG) in this challenge. The implemented machine learning workflow uses Mask Region-based Convolutional Neural Network (Mask R-CNN) for object detection, various image pre-processing steps, EfficientNets for classification, as well as different methods to fuse image and location information. The best model submitted before the challenge deadline achieved a macro-averaging F1-score of 0.404. After the expiration of this deadline, the results could be improved up to a macro-averaging F1-score of 0.594.

Keywords: snake species identification · object detection · EfficientNets · image classification · metadata inclusion

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

This paper explains the participation of the University of Applied Sciences and Arts Dortmund (FHDO) Biomedical Computer Science Group (BCSG) in the Conference and Labs of the Evaluation Forum (CLEF) 2020 (https://clef2020.clef-initiative.eu/, [last accessed: 2020-07-17]) SnakeCLEF challenge (https://www.imageclef.org/SnakeCLEF2020, [last accessed: 2020-07-17]) for snake species identification [20]. This challenge is part of the LifeCLEF 2020 research platform, which focuses on the automated identification of species [14] and consists of four challenges. The approach implemented in this paper is inspired by an article [9] about the winning entry of round 2 of the AICrowd Snake Species Identification Challenge (https://www.aicrowd.com/challenges/snake-species-identification-challenge, [last accessed: 2020-07-17]).

The identification of snake species is important as an estimated 81,410 to 137,880 people die from snakebites every year [29]. These deaths result from inaccurate knowledge about the species and consequently about the antivenom needed [5]. The high diversity of snake species [27] and their partially similar appearances lead to confusion [5] and make this choice more complicated. It has also been reported that an increasing number of people who were bitten by a snake bring pictures of the snake, for example, taken with a smartphone, or the killed snake itself to the physician [5].
Therefore, the target of the SnakeCLEF challenge is the improved and robust identification of snake species based on photographs [20].

In this article, the experiments and results of FHDO BCSG are presented. Section 2 describes previous work in this field of research, Section 3 introduces the dataset, and the general machine learning workflow is illustrated in Section 4, followed by a description of the achieved results in Section 5. Finally, the results are summarized in Section 6.

2 Related Work

Automated identification of snake species using machine learning is rarely studied, which results from the small datasets of annotated images.

James et al. [13] described a semi-automatic approach in which taxonomical features were extracted from images to discriminate six different species. The dataset contained 1,299 images and the least frequent class included 88 images. Using different feature selection approaches, it was concluded that the bottom-view taxonomical features are less important for species identification than the front- and side-view features.

As the manual extraction of features describing the appearance of a snake is tedious, recent articles used automated feature extraction, for example, texture features [4] or deep learning [2,3,9,18].

Texture features were used in Amir et al. [4] to distinguish between 22 different species. Their dataset contained 349 images and the least frequent snake species included three images. Using classical machine learning methods, the approach achieved a classification accuracy of 87 %.

Patel et al. [18] used deep learning methods to develop a smartphone application that distinguishes images of nine different snake species occurring on the Galápagos Islands in Ecuador. To this end, object detection as well as classification algorithms were used. The training dataset for their implementation was a bundle of three data sources: two internet searches on the platforms Google and Flickr were combined with an image dataset provided by the Ecuadorian institution Tropical Herping (https://www.tropicalherping.com/, [last accessed: 2020-07-17]). In total, 250 images were collected and the least frequent class contained seven images. Different model architectures were tested for object detection and image classification. The model based on Faster Region-based Convolutional Neural Network (Faster R-CNN) [23] with a ResNet [11] backbone achieved the best classification accuracy of 75 %. The authors state that a larger amount of training samples would be important for further investigations in this field.

Abdurrazaq et al. [2] used three different Convolutional Neural Network (CNN) architectures to distinguish five different snake species. They used a dataset containing 415 images. For the least frequent snake species, 72 images were available. The best results were achieved using a medium-sized classification network.

Abeysinghe et al. [3] used a deep Siamese network [6] to classify a relatively small dataset containing 200 images of 84 species based on the World Health Organization (WHO) venomous snake database (https://apps.who.int/bloodproducts/snakeantivenoms/database/, [last accessed: 2020-07-17]). The approach described in their article concentrated on single-shot learning as the dataset included 3 to 16 images per species. The automated classification model performed worse than human classification accuracy.
Pairwise classification results exceeded the class prediction accuracy.

As already mentioned, Gokula Krishnan [9] described the results of round 2 of the AICrowd Snake Species Identification Challenge. The solution which achieved the best results used object detection as a pre-processing step to focus on the image parts containing the snake. On this basis, EfficientNets were applied afterwards for image classification. In round 2, the dataset included 187,720 images assigned to 85 classes.

3 Dataset

The training dataset used in the current SnakeCLEF and AICrowd Snake Species Identification Challenge round 4 consists of 245,185 red-green-blue (RGB) color-space images (models trained on the training dataset are referred to as T1) assigned to 783 different snake species. Additionally, a validation dataset is available, which includes another 14,029 images (models trained on the training and validation dataset are referred to as T2). The class distribution of the snake species is highly unbalanced, as can be seen in the absolute class frequencies depicted in Figure 1.

Fig. 1. Distribution of the snake species in the training and validation dataset (absolute frequencies). The most frequent species is thamnophis-sirtalis (n=12,918); the least frequent species (n=18 each) are agkistrodon-laticinctus, bitis-armata, bothrocophias-microphthalmus, chironius-bicarinatus, geophis-hoffmanni, hebius-miyajimae, lycodon-effraenis, macrovipera-schweizeri, naja-pallida, philothamnus-punctatus, sibynophis-collaris, spalerosophis-dolichospilus, and thamnophis-chrysocephalus.

3.1 Image Dataset

An analysis of the dataset with AntiDupl (https://github.com/ermig1979/AntiDupl, [last accessed: 2020-07-15]) revealed 1,713 duplicate images in the training set. Some of these duplicates are associated with different species, like "Image not found" images that are the result of download problems. Other duplicates are correctly associated with several species as they depict distinct snakes. When the mean squared difference between images in AntiDupl is relaxed to 2 %, another 2,114 duplicates can be found. These are the result of different JPEG compression rates for the same image, resize operations or deletion of copyright information.

Another problem that was found are out-of-class images that have been injected by the organizers. These images contain no snakes but, for example, ice-hockey players, churches, other animals, persons, and mangas. To identify them for exclusion from the training set, a standard ImageNet [8] classifier with 1,000 classes based on a ResNet50 [11] architecture was used together with a positive list of snake and reptile classes that are part of the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [25] dataset. With this classifier, about 4,000 out-of-class images were identified, and the effects of the reduced dataset (abbreviated as D1 hereafter) were tested and compared to the unfiltered dataset. The results of this comparison are summarized in Table 7.

3.2 Metadata

The images are associated with metadata that provides information about the continent and country of the place where the image has been taken. For some snake depictions, this information is not given and only "UNKNOWN" is provided in the metadata. This information could be used for better classification.

It should be noted that the frequency of a snake species in the dataset does not necessarily match the natural occurrence of that species at a location. For example, the most frequent species with German country information in the dataset is pantherophis guttatus, the corn snake, which is not endemic in Germany but is the most popular pet snake there. Accordingly, the dataset reflects that pet snakes can also bite humans.
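As an illustration of how this metadata can be prepared for the fusion step described later in Section 4.5, the following sketch derives per-country relative species frequencies. It is not the authors' code; the file name and the column names country and binomial are assumptions about the metadata format.

```python
# Sketch (not the authors' code): derive a priori species distributions per
# country from the challenge metadata. "train_metadata.csv" and the column
# names "country" and "binomial" are assumed placeholders.
import pandas as pd

meta = pd.read_csv("train_metadata.csv")

# Relative frequency of each species within each country; "UNKNOWN" entries
# form their own group and can later be backed off to continent or global priors.
country_prior = (
    meta.groupby(["country", "binomial"]).size()
        .groupby(level="country")
        .transform(lambda counts: counts / counts.sum())
)

# Example: prior distribution for images taken in Germany.
print(country_prior.loc["Germany"].sort_values(ascending=False).head())
```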
4 Methods

This section describes the workflow used to learn a discrimination between the different snake species. The generalized workflow is depicted in Figure 2. The workflow is modular, and in the course of the challenge it was examined how different implementations of the individual modules affect the classification performance on the test dataset. In this section, the components are described more precisely and different implementations of them are demonstrated. The workflow has been implemented in the programming language Python 3.6.9 [28] and was based on Keras 2.2.4-tf [7] with a Tensorflow 2.1.0 [1] backend. For the inference on the AICrowd submission platform, Tensorflow 2.0.0 was used for reasons of compatibility.

Image pre-processing included an optional object detection stage and a mandatory stage in which rectangular images were transferred to a square shape afterwards. Additionally, the images were augmented, optionally branded using location information, and fed into the deep learning training network. Finally, an optional multiplication of the prediction probabilities with the a priori probability distribution of the snake species occurring at the corresponding location has been implemented.

Fig. 2. Generalized workflow for snake species classification: input image → object detection → image pre-processing → augmentation and optional image branding → model training → optional metadata multiplication → classification.

4.1 Object Detection

The idea of using an object detection stage before executing an image classification was inspired by the winning team [9] of round 2 of the AICrowd Snake Species Identification Challenge. Object detection has been implemented using the Mask R-CNN procedure, first described by He et al. [10]. Mask R-CNN performs instance segmentation as it extracts a bounding box, a class label and a pixel-wise segmentation mask for each object detected in an image. The Mask R-CNN algorithm is organized in two stages. In the first stage, a backbone CNN extracts a feature map from the original image. In this paper, ResNet-50 has been used as a backbone. Afterwards, but also in the first stage, a Region Proposal Network (RPN) is used to identify candidate object regions. So-called anchor boxes are used in this step to predefine bounding boxes. The second stage consists of a Region of Interest (ROI) align network which extracts multiple possible ROI sections. Based on these sections, a fully connected layer network is trained to perform a parallel softmax classification for class identification (snake vs. background in this case) and a regression task to specify bounding boxes. Additionally, a CNN-based mask classifier is employed for pixel-wise segmentation.

In this article, the backbone model weights were initialized with the model weights trained on the ImageNet [8] dataset. The training on the snake dataset has been implemented in two phases: first, all layers except the layers included in the backbone were trained for 20 epochs to warm up the model, and afterwards 30 epochs were performed to train the entire model.
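This two-phase schedule can be sketched as follows. The snippet is a minimal illustration written against the matterport-style API that the adapted implementation (described below) is based on, not the authors' original code; SnakeConfig and the train_dataset/val_dataset objects (prepared mrcnn.utils.Dataset instances holding the snake annotations) are placeholders, and the configuration values anticipate the parameters listed in the next paragraph.

```python
# Sketch of the two-phase Mask R-CNN training schedule described above,
# written against the matterport-style API of the adapted implementation.
from mrcnn.config import Config
from mrcnn import model as modellib

class SnakeConfig(Config):
    NAME = "snake"
    NUM_CLASSES = 1 + 1              # background + snake
    IMAGES_PER_GPU = 8               # batch size of 8
    DETECTION_MIN_CONFIDENCE = 0.3
    LEARNING_MOMENTUM = 0.9          # SGD momentum
    WEIGHT_DECAY = 0.0001

config = SnakeConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")

# Backbone initialized with ImageNet weights.
model.load_weights(model.get_imagenet_weights(), by_name=True)

# Phase 1: warm up everything except the backbone ("heads") for 20 epochs.
model.train(train_dataset, val_dataset, learning_rate=config.LEARNING_RATE,
            epochs=20, layers="heads")

# Phase 2: train the entire model for 30 further epochs
# (the epochs argument is cumulative in this API, hence 20 + 30 = 50).
model.train(train_dataset, val_dataset, learning_rate=config.LEARNING_RATE,
            epochs=50, layers="all")
```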
The implementation of the Mask R-CNN used in this article is an adaption (https://github.com/DiffPro-ML/Mask_RCNN, [last accessed: 2020-06-30]) of the implementation of Abdulla (https://github.com/matterport/MaskRCNN, [last accessed: 2020-06-30]), transferred to Tensorflow 2.1.0. No data augmentation has been used for object detection. The threshold of minimum detection confidence has been set to 0.3. Stochastic gradient descent (SGD) with a momentum of 0.9 was used as the optimizer to train the model. Further parameters include a weight decay of 0.0001 and a batch size of 8.

In order to train the object detection model, the annotated snake images available from the winning solution of round 2 of the AICrowd Snake Species Identification Challenge [9] (O1 in Section 5) were used initially. Later, 400 additional annotations were added to this dataset (O2 in Section 5) to investigate whether the object detection and thus the classification performance can be improved. The object detection results can be found in Table 3.

Since Mask R-CNN is used in this approach only for object detection and not for instance segmentation, it may be an adequate solution to use Faster R-CNN instead of Mask R-CNN. However, the results of the TensorFlow Object Detection application programming interface (API) [12] (https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md, [last accessed: 2020-08-03]), which serves as a guide for choosing an adequate object detection architecture, show an increased mean average precision (mAP) of 39.0 on the Microsoft Common Objects in Context (COCO) dataset [17] for Mask R-CNN object detection in comparison to Faster R-CNN, which achieves a mAP of 38.7. The use of Mask R-CNN object detection also makes it easier to supplement segmentation data prospectively, which was not done during this challenge due to a lack of time.
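Applied to a snake photograph, the trained detector then yields the bounding box that is cropped and handed to the pre-processing stage described next. The following is a hedged sketch against the same matterport-style inference API; the weight file and image path are placeholders and the fallback behaviour is only illustrative.

```python
# Sketch: apply the trained detector and crop the best-scoring snake box
# (placeholders: "mask_rcnn_snake.h5", "example_snake.jpg").
import skimage.io
from mrcnn import model as modellib

class InferenceConfig(SnakeConfig):   # SnakeConfig as in the training sketch above
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1                # detect() is called with one image at a time

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="./logs")
model.load_weights("mask_rcnn_snake.h5", by_name=True)

image = skimage.io.imread("example_snake.jpg")
result = model.detect([image], verbose=0)[0]

if len(result["rois"]) > 0:
    best = result["scores"].argmax()
    y1, x1, y2, x2 = result["rois"][best]
    snake_crop = image[y1:y2, x1:x2]   # image part passed on to pre-processing
else:
    snake_crop = image                 # fall back to the full image if no snake is found
```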
4.2 Image Pre-processing

As most deep learning classification models expect input images of square shape and predefined dimensions, it has been important to transform the mostly rectangular images, or the image parts detected by the object detection, into square shape and to adjust the images to the expected input dimensions of the classification model. There are different possibilities for extracting quadratic from rectangular images. The methods used in this paper are described in this section, and the implemented combinations of these methods are summarized in Table 1. The results of the experiments achieved using the different image pre-processing methods are summarized in Table 4.

Resize. The least complex possibility has been to rescale images without consideration of the aspect ratio. This resulted in highly distorted images, so the texture and the shape of the snake were disturbed, especially for images with strongly different image dimensions. In this paper, two rescaling procedures, one ignoring and one retaining the aspect ratio, were compared to each other. In the latter case, images had to be padded with further information to transfer them to a square shape.

Scaling. Another problem which occurs during pre-processing is upscaling. Upscaling small images leads to poor image quality, and it has been suspected that this could cause difficulties in texture recognition. In this paper, approaches which did and did not use upscaling for image pre-processing were compared to each other. If upscaling was avoided, approaches were needed to pad pixel information for the remaining image sections.

Fill boundaries. As previously mentioned, there were different cases where padding was required to get input images with preset image dimensions. One strategy to solve this issue has been to pad the image with a monochrome color. Koitka and Friedrich [16] recommended padding with a color matching the image instead of using a predefined color (usually black or white). Since black is usually the most frequently occurring color in shady images, this approach used the average color of the original image, or rather of the cropped area, to pad the image.

In combination with object detection, it has been possible to increase the ROI and thus pad the image using background information instead of a monochrome color. In this case, the image section predefined by the object detection workflow has been expanded until a quadratic section is found or one of the dimensions of the original image is smaller than the expected dimension of the square. If this happened, the average color of the image has been used to pad the protruding boundaries. It has been attempted to include background evenly on all sides to center the snake. Sometimes this was not possible, for example, if the snake was located in a corner of the original image. In this case, the ROI has been moved to include background information from the remaining directions, thus the snake has not been centered in the image.

Table 1. Methods used for image pre-processing.

Abbreviation | Resizing | Scaling | Fill boundaries
I1 | No consideration of the aspect ratio | Up-scaling | No padding
I2 | Consideration of the aspect ratio | Up-scaling | Monochrome padding
I3 | Consideration of the aspect ratio | No up-scaling | Monochrome padding
I4 | Consideration of the aspect ratio | No up-scaling | Background padding
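As a simple illustration of the aspect-preserving variants in Table 1, the following sketch resizes an image without up-scaling and pads it to a square with the average color of the image, following the recommendation of Koitka and Friedrich [16] (an I3-style pipeline). It is not the authors' implementation; the target size of 380 pixels and the file name are only examples.

```python
# Sketch of an I3-style pre-processing step: aspect-preserving resize without
# up-scaling, followed by padding with the mean color of the (cropped) image.
import numpy as np
from PIL import Image

def to_square(img: Image.Image, target: int = 380, allow_upscaling: bool = False) -> Image.Image:
    w, h = img.size
    scale = target / max(w, h)
    if not allow_upscaling:
        scale = min(scale, 1.0)                  # never enlarge small images (I3/I4)
    resized = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))

    # Pad with the average color instead of plain black or white.
    mean_color = tuple(int(c) for c in np.array(resized).reshape(-1, 3).mean(axis=0))
    canvas = Image.new("RGB", (target, target), mean_color)

    # Paste centred so that the snake stays in the middle of the padded image.
    offset = ((target - resized.width) // 2, (target - resized.height) // 2)
    canvas.paste(resized, offset)
    return canvas

square = to_square(Image.open("snake_crop.jpg").convert("RGB"), target=380)
```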
4.3 Data Augmentation

Data augmentation has been used to expand the training images and avoid overfitting. In each epoch of the training process, the images were randomly transformed. These transformations included random cropping of approximately 10 % of the image pixels per dimension, a rotation in the range of ±40°, a width shift, a height shift, random shearing and zooming, each with a factor of 0.2, as well as the possibility of horizontal flipping. If pixel positions were generated during this procedure for which no image information was available, those were filled using the value of the nearest available image position. During the challenge, the workflow has been adapted to speed up the image classification procedure; in the later version of the workflow, those pixels were filled with black as a monochrome color.

4.4 Image Classification

EfficientNets. As also used by Gokula Krishnan [9], EfficientNets, first described in Tan and Le [26], were used for classification in this approach. The baseline EfficientNet-B0 architecture is generated using an architecture search that jointly optimizes accuracy on a predefined classification task and Floating Point Operations Per Second (FLOPS) [26]. Based on this baseline model, larger models of the same family are created by scaling the depth, width and resolution of the baseline model uniformly. The different models of this family achieve state-of-the-art classification accuracy on ImageNet [8]. Additionally, the architecture is smaller and faster at inference compared to other existing CNNs [26]. EfficientNets have been successfully adapted to different machine learning problems using transfer learning [26].

Various models of the EfficientNets family, from EfficientNet-B0 up to EfficientNet-B4 (B0 - B4 in Section 5), were used in this competition. The results of using different models of the EfficientNets family can be found in Table 6. The model weights were initialized with a model pre-trained using noisy student [30]. The EfficientNets were extended by a flatten layer, a dense layer with 1,000 neurons and Swish [22] as activation function, and a dense layer with 783 neurons, which corresponds to the number of snake species, with softmax activation. The described model was trained for a few epochs on the snake classification task to warm up the network. In this phase, only the newly added layers and the batch normalization layers have been trained. Afterwards, all layers were trained for a larger number of epochs (N10+50 denotes a warm-up phase of ten epochs followed by 50 epochs used to train the entire model). Different batch sizes were used as further parameters to train the model (32 is encoded as BS32, 64 as BS64 etc., BS64/32 means that a batch size of 64 has been used during the warm-up phase and a batch size of 32 has been used afterwards). The chosen batch size depended on the image size (e.g., an image size of 128×128 is encoded as S128 in Section 5), the classification model and the available graphics processing unit (GPU) memory. The results of models using different image sizes can be found in Table 5. The learning rate (α) was likewise adjusted depending on the batch size (LR1 encodes a learning rate of 10^-4 during the warm-up phase and 10^-5 during fine-tuning, and LR2 encodes a learning rate of 10^-5 during the warm-up phase and 10^-6 during fine-tuning in Section 5). All submissions described in this paper used the Adam optimizer (β1 = 0.9, β2 = 0.999, ε = 10^-7) [15] to minimize the categorical cross-entropy. The implementation of the classification model workflow used the EfficientNet 1.1.0 implementation for Tensorflow Keras 2.2.4 [7].
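A minimal sketch of the described classification model is given below. It assumes the qubvel efficientnet package (version 1.1.0) with its tf.keras interface, uses the LR1 learning rates and the N10+50 epoch scheme as an example, and the train_generator object is a placeholder for the augmented image generator; it is not the authors' original training script.

```python
# Sketch of the described EfficientNet-B4 classifier (S380, LR1, N10+50),
# assuming the qubvel "efficientnet" package; train_generator is a placeholder.
import tensorflow as tf
import efficientnet.tfkeras as efn

base = efn.EfficientNetB4(weights="noisy-student", include_top=False,
                          input_shape=(380, 380, 3))

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation=tf.nn.swish),
    tf.keras.layers.Dense(783, activation="softmax"),   # one output per snake species
])

# Warm-up phase (10 epochs): train only the new head and the batch-normalization layers.
for layer in base.layers:
    layer.trainable = isinstance(layer, tf.keras.layers.BatchNormalization)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, epsilon=1e-7),
              loss="categorical_crossentropy")
model.fit(train_generator, epochs=10)

# Fine-tuning phase (50 epochs): unfreeze everything and lower the learning rate (LR1).
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, epsilon=1e-7),
              loss="categorical_crossentropy")
model.fit(train_generator, epochs=50)
```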
Since the dataset of the challenge has very unbalanced class frequencies, different class weight functions were used in order to implement an oversampling. Equation 1 describes a linear class weight function (W1 in Section 5) and Equation 2 describes a function in which very low frequencies are less strongly oversampled (W2 in Section 5). In both equations, F(c) denotes the frequency of class c, and the maximum is taken over all classes. For comparability reasons, one model has been trained without class weights.

w_1(c) = \frac{\max_{\tilde{c}} F(\tilde{c})}{F(c)}    (1)

w_2(c) = 1 - \frac{1}{\sqrt{\frac{\max_{\tilde{c}} F(\tilde{c})}{F(c)} + 0.5}}    (2)

Polyak Averaging. Polyak averaging, based on the approach of Polyak [21] and Ruppert [24], is a method to combine the learned weights of different epochs during model training in order to obtain a final model with more robust weights. In this paper, it has been tested whether Polyak averaging leads to improved classification results (P1 denotes the described Polyak averaging in Section 5). Therefore, the learned weights of the last five epochs were averaged using the exponential weighting function described in Equation 3, where i has a value of 1 for the last epoch, 2 for the penultimate epoch and 5 for the fifth last epoch.

W_{polyak}(i) = \exp\left(\frac{-i}{2}\right)    (3)

4.5 Addition of Location Information

Optionally, location information was added to some models by multiplying the prediction probabilities of the classification model with the a priori probability of the snake class for the specified location (M1 denotes the multiplication with the location distribution). The a priori probabilities were estimated by the relative frequency distribution of the snake species at the location in the training and validation dataset. Usually, the country information was used in this step; only if this information was missing, the distribution of the continent has been used instead. For some images, both country and continent information were missing. In those cases, the frequency distribution of the entire dataset has been used. The softmax function was applied after this multiplication to normalize the results.

Another variant has been implemented based on the previously described procedure, with the sole exception that the raw prediction probabilities of images with missing country and continent information were not multiplied (abbreviated as M2 in Section 5). As a further variation of this method, all prediction probabilities were multiplied by a binary variant of the frequency distribution, which thus denotes whether a snake was or was not present at a location (M3 in Section 5). The results achieved using the different metadata integration strategies are summarized in Table 8.

During the experiments of the FHDO BCSG, a few alternatives have been investigated. These methods were only tested in small experiments and are not described in this paper for reasons of clarity.

Image Branding. As an alternative to the simple multiplication with the location distributions, an approach has been implemented which adds the location information directly into the classification network. This has been done using a binary image branding technique introduced in Pelka et al. [19], which adds grey (RGB = [102,102,102]) boxes encoding the location information directly to the images. The height of the boxes was set to 8 pixels, while the width b_w depends on the image dimension d and is described in Equation 4.

b_w = \frac{d}{8} - 4    (4)

The first box starts directly at the left border of the image, and after every box, a space of 4 pixels is left. The continent information has been added as binary boxes at the top border of the image, while the country information has been added at the bottom border of the image. Since a distinction has been made between seven continents as well as the "unknown" class, every box at the top of the image represents one continent (abbreviated as M4). A similar approach to encode the country information would result in very small boxes because 189 countries had to be distinguished. Therefore, a binary encoding of the country index has been chosen, so that eight boxes can represent 2^8 = 256 different countries. The used image branding approach is illustrated in Figure 3. Hereafter, the combined branding of continent and country information is abbreviated as M5.

Fig. 3. Branding approach for country and continent information. Lookup tables map each continent (Africa = index 0 up to Unknown = index 7) to a single box position at the top border, and each country (index 0 = Afghanistan up to index 188 = Zimbabwe) to an 8-bit binary box pattern at the bottom border of the image.
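Returning to the multiplication-based fusion described at the beginning of this section, the variant M2 can be sketched as follows. The priors dictionary (mapping a country or continent name to a 783-dimensional relative-frequency vector, cf. Section 3.2) is a placeholder, and the snippet is an illustration rather than the authors' code.

```python
# Sketch of the metadata fusion variant M2: multiply the raw prediction
# probabilities by the a priori species distribution of the image's country
# (falling back to the continent) and leave predictions untouched when the
# location is unknown. "priors" is a placeholder lookup table.
import numpy as np

def fuse_with_location(pred: np.ndarray, country: str, continent: str,
                       priors: dict) -> np.ndarray:
    if country in priors:
        prior = priors[country]
    elif continent in priors:
        prior = priors[continent]
    else:
        return pred                      # M2: unknown locations stay untouched

    fused = pred * prior
    # The first submissions renormalized with a softmax, which softens the
    # distribution; later submissions used maximum normalization (fused / fused.max())
    # instead, as discussed in Section 5.
    return np.exp(fused) / np.exp(fused).sum()
```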
5 Results

In this section, the classification results for the test dataset of the challenge are described. Table 2 summarizes the most relevant successful submissions of the FHDO BCSG for the SnakeCLEF challenge. This table is mainly used to give an overview of the submitted models. In order to get a better insight into the partial results and the effects of the different methods used, partial aspects are considered in individual tables in the further course of this section. It was possible to submit models to the AICrowd Snake Species Identification Challenge after the submission deadline of SnakeCLEF had expired. Therefore, Table 2 also presents a few models which achieve better results than the best submission in the SnakeCLEF challenge. In order to avoid miscommunication, the submissions in Table 2 are listed in chronological order and the deadline of the challenge is highlighted.

The results for the different object detection datasets are summarized in Table 3. This table only presents the parameters which are necessary for this comparison. It should be noted that all other parameters of the compared models are identical, as can be verified in Table 2. This type of presentation is also used in the subsequent tables. The comparison of submissions 68418 and 68450, as reflected in Table 3, shows that the macro-averaging F1-score (abbreviated as F1 hereafter) increased by 0.010 when additional images were annotated, whereas the log loss remains stable. Moreover, the number of images in which no snakes were identified decreases from 141 to 123 in the joined training and validation dataset. Additionally, submission 68678 is a model which was trained using no object detection. This model is not completely comparable to any other model, but submission 68632 differs only in the image pre-processing step. Comparing those two models shows a slightly better performance of the model which used the object detection. As previously mentioned, this comparison is not entirely fair.

Table 4 summarizes the results of models trained with different pre-processing methods. As can be seen, pre-processing influenced the classification results achieved for the test dataset. Remarkable was the good performance of submission 68506, which used image resizing without consideration of the aspect ratio. This model achieved the best F1 of 0.452. It had been expected that this image pre-processing would achieve bad results, as major distortions were possible, so that in some cases humans were not able to recognize snakes in those images. The second-best result was achieved by submission 67962. In this submission, the ROIs detected during object detection have been expanded and thus were padded using background information. The submission reached an F1 of 0.403 and thus outperformed the F1 of submission 68432, which used a monochrome color padding strategy, by 0.034.
Table 2. Classification results achieved for the official test dataset, including macro-averaging F1-score (F1) and log loss. The best results in each section are highlighted in bold.

ID | Object detection | Image pre-processing | Classification model training | Dataset | Metadata | F1 | Log loss
67675 | O1 | I2 | S128 B0 BS64 W2 LR1 N10+50 - | D1 T2 | M1 | 0.338 | 6.652
67696 | O1 | I2 | S128 B2 BS64 W2 LR1 N10+50 - | D1 T2 | M1 | 0.392 | 6.630
67700 | O1 | I2 | S128 B0 BS64 W2 LR1 N10+50 - | - T2 | M1 | 0.352 | 6.651
67727 | O1 | I2 | S128 B2 BS64 W2 LR1 N10+50 - | - T2 | M1 | 0.389 | 6.650
67734 | O1 | I2 | S128 B4 BS64 W2 LR1 N10+50 - | D1 T1 | M1 | 0.403 | 6.650
67882 | O1 | I2 | S128 B2 BS64 W1 LR1 N10+50 - | - T2 | M1 | 0.365 | 6.657
67901 | O1 | I2 | S128 B2 BS64 - LR1 N10+50 - | - T2 | M1 | 0.377 | 6.647
67962 | O1 | I4 | S128 B2 BS64 W2 LR1 N10+50 - | - T2 | M1 | 0.403 | 6.650
68023 | O1 | I2 | S128 B4 BS64 W2 LR1 N10+50 P1 | D1 T1 | M1 | 0.404 | 6.650
--- submission deadline ---
68418 | O1 | I4 | S196 B2 BS64 W2 LR1 N10+50 - | - T2 | M1 | 0.475 | 6.645
68432 | O1 | I3 | S128 B2 BS64 W2 LR1 N10+50 - | - T2 | M1 | 0.369 | 6.650
68450 | O2 | I4 | S196 B2 BS64 W2 LR1 N10+50 - | - T2 | M1 | 0.485 | 6.645
68506 | O1 | I1 | S128 B2 BS64 W2 LR1 N10+50 - | - T2 | M1 | 0.452 | 6.648
68520 | O1 | I2 | S224 B0 BS64 W2 LR1 N10+50 - | - T1 | - | 0.322 | 1.877
68541 | O1 | I2 | S128 B4 BS64 W2 LR1 N10+50 - | - T2 | M1 | 0.426 | 6.648
68574 | O1 | I2 | S224 B0 BS64 W2 LR1 N10+50 - | - T1 | M1 | 0.431 | 1.659
68575 | O1 | I2 | S224 B0 BS64 W2 LR1 N10+50 - | - T1 | M2 | 0.447 | 1.583
68593 | O1 | I2 | S196 B4 BS64 W2 LR1 N10+50 - | D1 T1 | M1 | 0.483 | 6.645
68632 | O1 | I3 | S196 B4 BS64/32 W2 LR1 N10+50 - | - T2 | M1 | 0.366 | 6.646
68655 | O1 | I2 | S224 B0 BS64 W2 LR1 N10+50 - | - T1 | M3 | 0.445 | 1.596
68678 | - | I1 | S196 B4 BS64/32 W2 LR1 N10+50 - | - T2 | M1 | 0.347 | 6.647
69365 | O2 | I4 | S380 B4 BS13 W2 LR2 N10+50 - | - T1 | M1 | 0.460 | 1.379
69750 | O2 | I4 | S380 B4 BS13 W2 LR2 N10+50 - | - T1 | M5 | 0.361 | 1.541
69768 | O2 | I4 | S380 B4 BS13 W2 LR2 N10+50 - | - T1 | M4 | 0.437 | 1.363
69849 | O2 | I4 | S380 B4 BS13 W2 LR2 N10+50 - | - T1 | M4+M1 | 0.459 | 1.355
69888 | O1 | I3 | S380 B4 BS13 W2 LR2 N10+109 - | - T1 | M2 | 0.594 | 1.064

Abbreviations: O1: Object detection dataset from [9], O2: Expanded dataset, I1: No aspect ratio, up-scaling, no padding, I2: Aspect ratio, up-scaling, monochrome padding, I3: Aspect ratio, no up-scaling, monochrome padding, I4: Aspect ratio, no up-scaling, background padding, Sx: Image size of x×x pixels, Bx: EfficientNet-Bx, BSx: Batch size of x for image classification, BSx/y: Batch size of x in the warm-up phase and y during fine-tuning, W1: Linear weights, W2: Nonlinear weights, LR1: Learning rate of 10^-4 in the warm-up phase and 10^-5 during fine-tuning, LR2: 10^-5 in the warm-up phase and 10^-6 during fine-tuning, Nx+y: x training epochs in the warm-up phase and y during fine-tuning, P1: Polyak averaging, D1: Reduced dataset, T1: Training dataset, T2: Training + validation dataset, M1: Multiplication of metadata, M2: Multiplication without unknown cases, M3: Binary multiplication, M4: Continent branding, M5: Continent and country branding

Table 3. Official classification results on the test dataset to compare object detection datasets. The results include F1 and log loss. The best results in each section are highlighted in bold. Presented results represent ablation studies, thus non-mentioned parameters are fixed in each section.

ID | Object detection | Image pre-processing | F1 | Log loss
68418 | Dataset from [9] (O1) | Aspect ratio, no up-scaling, background padding (I4) | 0.475 | 6.645
68450 | Expanded dataset (O2) | Aspect ratio, no up-scaling, background padding (I4) | 0.485 | 6.645
68632 | Dataset from [9] (O1) | Aspect ratio, no up-scaling, monochrome padding (I3) | 0.366 | 6.646
68678 | No object detection (-) | No aspect ratio, up-scaling, without padding (I1) | 0.347 | 6.647
The comparison between submission 68432 and submission 67727 shows a slightly positive effect of upscaling, as the F1 of submission 67727 is 0.020 higher than the F1 of submission 68432. The previously described comparison is based on small images containing 128×128 pixels; for future investigations, it would be interesting how the pre-processing methods affect larger images.

Table 4. Official classification results on the test dataset to compare pre-processing methods. The results include F1 and log loss. The best results are highlighted in bold. Presented results represent ablation studies, thus non-mentioned parameters are fixed.

ID | Pre-processing pipeline | F1 | Log loss
68506 | No aspect ratio, up-scaling, without padding (I1) | 0.452 | 6.648
67727 | Aspect ratio, up-scaling, monochrome padding (I2) | 0.389 | 6.650
68432 | Aspect ratio, no up-scaling, monochrome padding (I3) | 0.369 | 6.650
67962 | Aspect ratio, no up-scaling, background padding (I4) | 0.403 | 6.650

Table 5 summarizes the official classification results achieved using different image sizes as model input. The result of this comparison corresponds to other experiments executed during the challenge and shows that models trained on larger image input sizes achieved better classification results. Increasing the image size from 128×128 to 196×196 boosted the F1 by approximately 0.080. The used image sizes may look striking, because EfficientNet-B0 models are usually trained using images of 224×224 pixels and EfficientNet-B4 models are optimized for an image size of 380×380 pixels. The use of small images in this approach resulted from the fact that some early submissions failed because of memory issues. The problem was fixed after the deadline of the SnakeCLEF challenge had expired. Some of the later submissions used larger image sizes consistent with the original EfficientNets input sizes and thus achieved better results.

Table 5. Official classification results on the test dataset to compare image input sizes. Both models use the EfficientNet-B4 architecture. The results include F1 and log loss. The best results are highlighted in bold. Presented results represent ablation studies, thus non-mentioned parameters are fixed.

ID | Image size | F1 | Log loss
67734 | 128×128 (S128) | 0.403 | 6.650
68593 | 196×196 (S196) | 0.483 | 6.645

Next, the influence of different model architectures on the classification results was investigated. Table 6 presents a comparison of the different model architectures. Concurrently with some experiments not listed here for reasons of clarity, the comparison shows an increased F1 for upscaled models. Submission 67727, which was based on an EfficientNet-B2 architecture, outperformed submission 67700 by an increase of the macro-averaging F1-score of 0.037. Submission 68541, which represents an EfficientNet-B4 architecture, achieved an F1 of 0.426 and thus outperformed the results of submissions 67727 and 67700 by 0.037 and 0.074, respectively. It should be noted that all of the submissions summarized in Table 6 were trained using an image size of 128×128 pixels, which is due to the memory issues already mentioned before.

Some additional experiments were performed comparing different top layer architectures. For lack of time, those were not completely comparable to each other, especially because the number of epochs used for training differed for most of the models. For this reason, these results are not elaborated in this paper.
Table 6. Official classification results on the test dataset to compare different models of the EfficientNets family. The results include F1 and log loss. All models used input images containing 128×128 pixels. The best results are highlighted in bold. Presented results represent ablation studies, thus non-mentioned parameters are fixed.

ID | Model architecture | F1 | Log loss
67700 | EfficientNet-B0 (B0) | 0.352 | 6.651
67727 | EfficientNet-B2 (B2) | 0.389 | 6.650
68541 | EfficientNet-B4 (B4) | 0.426 | 6.648

It has been mentioned in Section 4 that different class weight functions can be used to overcome unbalanced class distributions. The results of submissions 67727, 67882 and 67901 show that the function introduced in Equation 2, which has been used in submission 67727, achieved a macro-averaging F1-score of 0.389 and thus outperformed submission 67882, which used a linear class weight function and achieved an F1 of 0.365, as well as submission 67901, which used no class weights and achieved a macro-averaging F1-score of 0.377. One possible reason for the poor results of the linear weighting could be the large differences in class frequencies, which lead to very large weights for rare classes.

The results of the dataset filtering strategies, which are presented in Table 7, have been inconclusive. For the workflow used in submissions 67675 and 67700, the model trained on the reduced dataset performed worse than the model trained on the complete dataset. The opposite behaviour has been observed for the workflow used in submissions 67696 and 67727, which achieved F1 of 0.392 and 0.389. As the filtering removed images from the training dataset where no snakes were present and no clear benefit is reached using this filtering, one could assume that there might be some images in the test dataset where no snakes are present.

Table 7. Official classification results on the test dataset to compare different dataset filtering strategies. The results include F1 and log loss. The best results in each section are highlighted in bold. Presented results represent ablation studies, thus non-mentioned parameters are fixed in each section.

ID | Strategy for dataset filtering | F1 | Log loss
67675 | Duplicates and plausibility filtering (D1) | 0.338 | 6.652
67700 | No filtering (-) | 0.352 | 6.651
67696 | Duplicates and plausibility filtering (D1) | 0.392 | 6.630
67727 | No filtering (-) | 0.389 | 6.650

As can be noted in Table 2, most of the earlier submissions achieved high log losses of about 6.6, while others had log losses of about 1, with a confusing dependency on the achieved macro-averaging F1-scores. This problem appeared because the prediction results are softened if softmax normalization is used after the multiplication with the location frequencies; it has been fixed by using maximum normalization instead (e.g., submissions 68520, 68574, 68575 and 68655).

The results of adding metadata to improve image classification are described in Table 8. It can be seen that adding metadata to a model by multiplying the model results with the a priori probability of the snake class for the given location leads to an increased F1. Submission 68574, which was trained using the same workflow as submission 68520 except for adding metadata, outperforms this model by an increase in F1 of 0.109. As can be seen from submission 68575, these results can be further improved by a value of 0.016 if the multiplication is only used for available country and continent information.
Submission 68655 exhibited a similar result of 0.445 using binary information about the availability of a species in a country or continent.

The results of the submissions 69365, 69750, 69768 and 69849 show that models which used the image branding presented in Section 4 achieved no benefit in comparison to multiplying the raw predictions with the location information. It can be noted that submission 69750, which combined country and continent branding, achieved a poor F1 of 0.361, whereas the model which used only continent branding (submission 69768) achieved a better F1 of 0.437. One possible reason for this might be the complex positional encoding of the country information, which is hard to learn for a CNN that focuses more on local differences. Because of the limited time, it was not possible to investigate this problem more thoroughly, thus additional investigations should be carried out in future work. The combination of continent branding and multiplication of the a priori probability distribution in submission 69849 achieved results similar to the model which used no branding but the multiplication.

Table 8. Official classification results on the test dataset to compare strategies to fuse image and location data. The results include F1 and log loss. The best results in each section are highlighted in bold. Presented results represent ablation studies, thus non-mentioned parameters are fixed in each section.

ID | Strategy to integrate metadata | F1 | Log loss
68520 | No metadata (-) | 0.322 | 1.877
68574 | Multiplication (M1) | 0.431 | 1.659
68575 | Multiplication without unknown (M2) | 0.447 | 1.583
68655 | Binary multiplication (M3) | 0.445 | 1.596
69365 | Multiplication (M1) | 0.460 | 1.379
69750 | Branding continent and country (M5) | 0.361 | 1.541
69768 | Branding continent (M4) | 0.437 | 1.363
69849 | Branding continent and multiplication (M4+M1) | 0.459 | 1.355

The best model submitted before the SnakeCLEF deadline expired was submission 68023, which was based on an EfficientNet-B4 model architecture and achieved an F1 of 0.404. Due to the previously mentioned memory issues, this model used an unusually small image size of 128×128 pixels. The newly added layers of the described model were trained in a warm-up phase of ten epochs, and another 50 epochs were used to fine-tune the entire model. During the training process, a batch size of 64, a learning rate of 0.0001 and the Adam optimizer have been used. Location information was added using the described multiplication procedure. Polyak averaging with exponential weights has been used to combine the results of the last five training epochs. The Polyak averaging achieved an improvement in F1 of 0.001 in comparison to submission 67734, which used no Polyak averaging.

The best submission after the SnakeCLEF deadline expired was submission 69888, which achieved a macro-averaging F1-score of 0.594 and a log loss of 1.064. The main difference in comparison to the best model before the deadline was that the model was trained using the predefined image dimensions of an EfficientNet-B4, which are 380×380 pixels. Due to the increased image size, a smaller batch size of 13 and a decreased learning rate have been used. Furthermore, 109 instead of 50 training epochs have been applied, and the location distribution was multiplied only for known countries and continents. The model included no Polyak averaging.
6 Conclusions

In conclusion, it can be stated that snake species identification is a challenging task, primarily because of the high diversity of snake species, high intra-class variance, and low inter-class variance. The main improvements in snake species classification presented in this paper are based on increasing the image size, combining location and image information, as well as using upscaled model architectures.

The results presented in this article show improved classification results when an object detection strategy is applied prior to the image classification. However, a plausibility filtering of the training dataset showed no clear improvement. Some differences were detected depending on the pre-processing steps; nevertheless, no clear insights could be gained about which steps are particularly promising for good classification results. The implementation and application of the different pre-processing steps turned out to be relatively time-consuming. Besides, as previously mentioned, there were some memory issues which led to a focus on small image sizes as well as smaller model architectures in the early course of the challenge. Thus, the time needed to optimize the classification parameters more precisely and to try out different optimizers was reduced. It is expected that the results may be further improved by adjusting those parameters.

Due to the use of AICrowd as a submission platform, it has been possible to test a large number of different models. This enabled direct feedback about the performance on the test dataset and thus gave a good estimate of which methods yield the most promising results. In addition, it facilitated the comparison between teams before the deadline expired. As mentioned before, there were some memory issues which were related to the architecture of the test environment. In some cases, debugging was complicated because the logs were not accessible. These concerns were compensated by the very prompt and useful help from the organizers.

7 Acknowledgment

The work of Louise Bloch and Obioma Pelka was partially funded by a PhD grant from the University of Applied Sciences and Arts Dortmund, Germany. The authors want to thank Raphael Brüngel for the constructive proofreading of the manuscript.

References

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp. 265–283 (2016), https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf

2. Abdurrazaq, I.S., Suyanto, S., Utama, D.Q.: Image-Based Classification of Snake Species Using Convolutional Neural Network. In: 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI). pp. 97–102. Institute of Electrical and Electronics Engineers (IEEE) (2019). https://doi.org/10.1109/isriti48646.2019.9034633

3. Abeysinghe, C., Welivita, A., Perera, I.: Snake Image Classification Using Siamese Networks. In: Proceedings of the 2019 3rd International Conference on Graphics and Signal Processing (ICGSP '19). pp. 8–12. Association for Computing Machinery (ACM), New York, NY, USA (2019). https://doi.org/10.1145/3338472.3338476
4. Amir, A., Zahri, N.A.H., Yaakob, N., Ahmad, R.B.: Image Classification for Snake Species Using Machine Learning Techniques. In: Phon-Amnuaisuk, S., Au, T.W., Omar, S. (eds.) Computational Intelligence in Information Systems. pp. 52–59. Springer International Publishing, Cham (2017). https://doi.org/10.1007/978-3-319-48517-1_5

5. Bolon, I., Durso, A.M., Botero Mesa, S., Ray, N., Alcoba, G., Chappuis, F., Ruiz de Castañeda, R.: Identifying the snake: First scoping review on practices of communities and healthcare providers confronted with snakebite across the world. PLOS ONE 15(3), e0229989 (03 2020). https://doi.org/10.1371/journal.pone.0229989

6. Bromley, J., Bentz, J.W., Bottou, L., Guyon, I., Lecun, Y., Moore, C., Säckinger, E., Shah, R.: Signature Verification using a "Siamese" Time Delay Neural Network. International Journal of Pattern Recognition and Artificial Intelligence 07(04), 669–688 (08 1993). https://doi.org/10.1142/s0218001493000339

7. Chollet, F.: Keras (2015), https://keras.io, [last accessed: 2020-07-14]

8. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. Institute of Electrical and Electronics Engineers (IEEE) (2009). https://doi.org/10.1109/cvpr.2009.5206848

9. Gokula Krishnan: Diving into Deep Learning — Part 3 — A Deep learning practitioner's attempt to build state of the art snake-species image classifier (2019), https://medium.com/@Stormblessed/diving-into-deep-learning-part-3-a-deep-learning-practitioners-attempt-to-build-state-of-the-2460292bcfb, [last accessed: 2020-06-10]

10. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 2980–2988. Institute of Electrical and Electronics Engineers (IEEE) (2017). https://doi.org/10.1109/iccv.2017.322

11. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778. Institute of Electrical and Electronics Engineers (IEEE) (2016). https://doi.org/10.1109/cvpr.2016.90

12. Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., Murphy, K.: Speed/accuracy trade-offs for modern convolutional object detectors. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3296–3297 (2017)

13. James, A., Kumar, D., Mathews, B., Sugathan, S.: Discriminative histogram taxonomy features for snake species identification. Human-centric Computing and Information Sciences 4(1) (02 2014). https://doi.org/10.1186/s13673-014-0003-0

14. Joly, A., Goëau, H., Kahl, S., Deneu, B., Servajean, M., Cole, E., Picek, L., Ruiz De Castañeda, R., Bolon, I., Lorieul, T., Botella, C., Glotin, H., Champ, J., Vellinga, W.P., Stöter, F.R., Dorso, A., Bonnet, P., Eggel, I., Müller, H.: Overview of LifeCLEF 2020: a System-oriented Evaluation of Automated Species Identification and Species Distribution Prediction. In: Proceedings of CLEF 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece (2020)

15. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference for Learning Representations (2014), https://arxiv.org/abs/1412.6980

16. Koitka, S., Friedrich, C.M.: Optimized Convolutional Neural Network Ensembles for Medical Subfigure Classification.
In: Jones, G.J., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the 8th International Conference of the CLEF Association, CLEF 2017. pp. 57–68. Springer International Publishing, Cham (09 2017). https://doi.org/10.1007/978-3-319-65813-1_5

17. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 740–755. Springer International Publishing, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

18. Patel, A., Cheung, L., Khatod, N., Matijosaitiene, I., Arteaga, A., Gilkey, J.W.: Revealing the Unknown: Real-Time Recognition of Galápagos Snake Species Using Deep Learning. Animals 10(5), 806 (2020). https://doi.org/10.3390/ani10050806

19. Pelka, O., Nensa, F., Friedrich, C.M.: Variations on Branding with Text Occurrence for Optimized Body Parts Classification. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). pp. 890–894. Institute of Electrical and Electronics Engineers (IEEE) (2019). https://doi.org/10.1109/EMBC.2019.8857478

20. Picek, L., Ruiz De Castañeda, R., Durso, A.M., Bolon, I., Sharada, P.M.: Overview of the SnakeCLEF 2020: Automatic Snake Species Identification Challenge. In: CLEF task overview 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece (2020)

21. Polyak, B.: New method of stochastic approximation type. Automation and Remote Control 51, 937–946 (1990)

22. Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions. Computing Research Repository (CoRR) abs/1710.05941 (2017), http://arxiv.org/abs/1710.05941

23. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6), 1137–1149 (2017). https://doi.org/10.1109/tpami.2016.2577031

24. Ruppert, D.: Efficient Estimations from a Slowly Convergent Robbins-Monro Process. Tech. rep., School of Operations Research and Industrial Engineering, Cornell University, Ithaca, NY (02 1988)

25. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y

26. Tan, M., Le, Q.: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. vol. 97, pp. 6105–6114. Long Beach, California, USA (06 2019), http://proceedings.mlr.press/v97/tan19a.html

27. Uetz, P., Hallermann, J., Hosek, J.: The Reptile Database (2019), http://reptile-database.reptarium.cz, [last accessed: 2020-06-10]

28. Van Rossum, G., Drake, F.L.: Python 3 Reference Manual. CreateSpace, Scotts Valley, CA, 1 edn. (2009)

29. World Health Organization (WHO): Snakebite envenoming - Key Facts (2019), https://www.who.int/news-room/fact-sheets/detail/snakebite-envenoming, [last accessed: 2020-06-10]

30. Xie, Q., Luong, M.T., Hovy, E., Le, Q.V.: Self-training with Noisy Student improves ImageNet classification.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10687–10698 (06 2020)