Coral Reef annotation, localisation and pixel-wise classification using Mask R-CNN and Bag of Tricks

Lukáš Picek1,5, Antonín Říha2, and Aleš Zita3,4

1 Dept. of Cybernetics, Faculty of Applied Sciences, University of West Bohemia
2 Faculty of Information Technology, Czech Technical University
3 The Czech Academy of Sciences, Institute of Information Theory and Automation
4 Faculty of Mathematics and Physics, Charles University
5 PiVa AI

Abstract. This article describes an automatic system for detection, classification and segmentation of individual coral substrates in underwater images. The proposed system achieved the best performance in both tasks of the second edition of the ImageCLEFcoral competition: a mean average precision at Intersection over Union (IoU) greater than 0.5 (mAP@0.5) of 0.582 in the Coral reef image annotation and localisation task, and an mAP@0.5 of 0.678 in the Coral reef image pixel-wise parsing task. The system is based on the Mask R-CNN object detection and instance segmentation framework boosted by advanced training strategies, pseudo-labelling, test-time augmentations, and Accumulated Gradient Normalisation. To support future research, the code has been made available at: https://github.com/picekl/ImageCLEF2020-DrawnUI.

Keywords: Deep Learning, Computer Vision, Instance Segmentation, Convolutional Neural Networks, Machine Learning, Object Detection, Corals, Biodiversity, Conservation

1 Introduction

The ImageCLEFcoral [4] challenge was organized in conjunction with the ImageCLEF 2020 evaluation campaign [12] at the Conference and Labs of the Evaluation Forum (CLEF1). The main goal of this competition was to create an algorithm or system that can automatically detect and annotate a variety of benthic substrate types in image collections taken from multiple coral reefs as part of a coral reef monitoring project with the Marine Technology Research Unit at the University of Essex.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.
1 http://www.clef-initiative.eu/

Fig. 1. Example training images showing the two types of annotations - bounding boxes and segmentation masks. Every colour represents one substrate type, e.g. yellow represents Soft Coral and red represents Hard Coral - Boulder.

1.1 Motivation

Live corals are an important biological group with a massive contribution to ocean ecosystem biodiversity. Corals are a key habitat for thousands of marine species [5] and provide an essential source of nutrition and income for people in developing countries [3,2]. Therefore, automatic monitoring of coral reef condition plays a crucial part in understanding future threats and prioritizing conservation efforts.

1.2 Datasets

This section briefly describes the provided data and their subsets: an annotated dataset that contains 440 images, and a testing dataset with 400 images without annotations. Additionally, we introduce a carefully engineered training/validation split of the annotated dataset used for training purposes.

Annotated dataset - The annotated dataset consists of 440 images containing 12,082 individual coral objects. Each object was annotated by experts with a segmentation mask, a bounding box, and a class label representing one of 13 substrate types.
The dataset is heavily unbalanced (refer to Table 1), with almost 50% of the objects belonging to a single class (Soft Coral) and only about 8% belonging to the eight least frequent classes combined. Moreover, the images vary in colour, many are heavily blurred, and they come from different locations and geographical regions. Furthermore, coral substrates belonging to the same class can be observed with different morphologies, colour variations, or patterns. Finally, some images contain a measurement tape that partially covers objects of interest.

For the evaluation of the network training process, the annotated dataset needed to be divided into two parts: one used for network optimization and the other for validation of network performance. To create these subsets, every tenth image was designated for the validation set; the rest were used for training. As the validation set class distribution did not match the training one, particular images from the validation set needed to be replaced by carefully cherry-picked images from the training set. This resulted in an almost perfect split with similar distributions for both the training and the validation set. This similarity ensured a representative validation process.
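The following is a minimal Python sketch of this split procedure. It assumes a hypothetical `annotations` dictionary mapping image filenames to lists of substrate-class labels; the names and the manual-swap step are illustrative only, not the exact scripts used for the competition.

```python
# Sketch of the training/validation split described above, assuming
# `annotations` maps an image filename to the class labels of its objects.
from collections import Counter


def split_dataset(annotations, val_every=10):
    """Designate every `val_every`-th image for validation, the rest for training."""
    images = sorted(annotations)
    val = images[::val_every]
    train = [img for img in images if img not in set(val)]
    return train, val


def class_histogram(images, annotations):
    """Count annotated objects per substrate class for a set of images."""
    counts = Counter()
    for image in images:
        counts.update(annotations[image])
    return counts


def report_split(train, val, annotations):
    """Print per-class fractions so the two subsets can be compared and
    mismatched images cherry-picked and swapped, as done in the paper."""
    for name, subset in (("train", train), ("val", val)):
        hist = class_histogram(subset, annotations)
        total = sum(hist.values())
        print(name, {cls: round(n / total, 3) for cls, n in hist.most_common()})
```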
Testing dataset - The testing dataset contains 400 images from four different types of location: the same location as in the training set, a similar location, a geographically similar location, and a geographically distinct location.

Table 1. Dataset class distribution including the training and validation split. 396 images were used for training; 44 for validation.

Substrate type             # Bboxes   Fraction [%]   Train. Boxes   Val. Boxes
Soft Coral                    5,663          46.87          5,035          628
Sponge                        1,691          13.99          1,472          219
Hard Coral – Boulder          1,642          13.59          1,513          129
Hard Coral – Branching        1,181          9.774          1,084           97
Hard Coral – Encrusting         946          7.829            831          115
Hard Coral – Mushroom           223          1.845            199           24
Hard Coral – Submassive         198          1.639            162           36
Hard Coral – Foliose            177          1.464            144           33
Sponge – Barrel                 139          1.150            124           15
Algae – Macro or Leaves          92          0.761             81           11
Soft Coral – Gorgonian           90          0.745             70           20
Hard Coral – Table               21          0.175             17            4
Fire Coral – Millepora           19          0.157             15            4

1.3 The System

The proposed object detection and instance segmentation system extends a recent state-of-the-art Convolutional Neural Network (CNN) object detection framework, Mask R-CNN [8], with an additional Bag of Tricks that considerably increased the performance. The TensorFlow Object Detection API2 [11] was used as the deep learning framework for fine-tuning the publicly available checkpoints. All bells and whistles are further described in Section 2. Additionally, approaches that did not contribute positively but could have some potential for future editions of the ImageCLEFcoral competition are discussed.
2 https://github.com/tensorflow/models/blob/master/research/object_detection

2 Methodology

This section describes all approaches and techniques used in the benthic substrate detection, annotation and segmentation tasks. The modern object detection and instance segmentation methods are summarized, followed by a description of the chosen system and its configuration. Furthermore, all the used bells and whistles (Bag of Tricks) are introduced and described.

2.1 Object Detection

Although conventional digital image processing methods are capable of detecting particular local features, modern object detectors based on Deep Convolutional Neural Networks (DCNN) achieve superior performance in object detection and instance segmentation tasks. Several network architectures were pre-selected based on the study published by Huang et al. [11], namely Faster R-CNN [18], SSD [15] and Mask R-CNN [8]. In the initial experiment, these detection frameworks were trained with default or recommended configurations. This experiment revealed the most suitable framework for both ImageCLEFcoral tasks - Mask R-CNN.

2.2 Network parameters

Experiments on the validation set revealed the best optimizer settings for the framework. These settings were shared between all of our experiments, unless stated otherwise. For a detailed description, refer to Table 2.

Table 2. Training and network parameters shared among all experiments.

Parameter            Value             Parameter                  Value
Optimizer            RMSprop           Gradient Clipping          12.5
Momentum             0.9               Input size                 1000 × 1000
Initial and min LR   0.032 - 0.00004   Feature extractor stride   8
LR decay type        Exponential       Pretrained Checkpoints     COCO
LR decay factor      0.975             Num epochs                 50
Batch size           1                 Gradient accumulation      16

2.3 Bag of Tricks

Augmentations - The provided dataset contains 440 images. Considering that 44 were used for validation, 396 images are too few for robust network optimization. To alleviate this issue, multiple data augmentation techniques were utilized. The following methods were included in the final training pipeline:

Colour Distortions - Brightness variations with a max delta of 0.2, contrast and saturation scaled by a random factor in the range 0.8 - 1.25, hue offsets by a random value of up to 0.02, and random RGB to grayscale conversion with 10% probability.

Image Flips - Random horizontal and vertical flips and 90-degree rotations, each with a 50% chance.

Random Jitter - Every bounding box corner can be randomly shifted by up to 2% of the bounding box width and height in the x and y coordinates, respectively.

Cut Out [6] - Random black square patches are added to the image: up to 10 patches, each with a 50% occurrence probability and a side length corresponding to 10% of the image height or width, whichever is smaller.

By utilizing the techniques mentioned above, we increased the model's mAP@0.5 by 0.0392, as measured on the validation set. (An illustrative sketch of the colour and flip operations is given after Table 3.)

Input Resolution - In object detection, especially where small objects occur, the input resolution plays a crucial role. In principle, the higher the resolution, the more objects can be detected. Unfortunately, detection on high-resolution images is limited by GPU memory; hence, it is always a trade-off between performance and hardware requirements.

Backbone - To find the best backbone architecture for the Mask R-CNN framework, we performed an experiment with three different backbone models: ResNet-50 [9], ResNet-101 [9], and Inception-ResNet-V2 [20]. A detailed performance comparison is included in Table 3.

Table 3. Effect of input resolution and backbone architecture on model performance.

Backbone              Input Resolution   mAP@0.5   mAP@0.75
ResNet-50             600 × 600           0.1826     0.0956
ResNet-50             800 × 800           0.2077     0.1017
ResNet-50             1000 × 1000         0.2227     0.1260
ResNet-50             1200 × 1200         0.2380     0.1579
ResNet-101            800 × 800           0.2381     0.1453
Inception-ResNet-V2   800 × 800           0.2362     0.1361
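The sketch below approximates the colour distortions and flips listed above using standard TensorFlow 2 `tf.image` operations. It is illustrative only: in our pipeline these augmentations were configured through the TensorFlow Object Detection API, which also transforms the bounding boxes and masks for the geometric operations and implements the box jitter and Cut Out patches. The sketch assumes a float image in the range [0, 1].

```python
# Illustrative approximation of the colour and flip augmentations from
# Section 2.3 using plain tf.image ops (not the actual TF OD API config).
import tensorflow as tf


def colour_augment(image):
    """Random colour distortions: brightness, contrast, saturation, hue,
    and RGB-to-grayscale conversion with 10% probability."""
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.25)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.25)
    image = tf.image.random_hue(image, max_delta=0.02)
    if tf.random.uniform([]) < 0.1:
        image = tf.image.grayscale_to_rgb(tf.image.rgb_to_grayscale(image))
    return tf.clip_by_value(image, 0.0, 1.0)


def geometric_augment(image):
    """Random horizontal/vertical flips and a 90-degree rotation, 50% each.
    In the real pipeline, boxes and masks are flipped/rotated accordingly."""
    if tf.random.uniform([]) < 0.5:
        image = tf.image.flip_left_right(image)
    if tf.random.uniform([]) < 0.5:
        image = tf.image.flip_up_down(image)
    if tf.random.uniform([]) < 0.5:
        image = tf.image.rot90(image, k=1)
    return image
```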
Pseudo Labels - The performance of DCNNs heavily depends on the size of the training set. To alleviate this issue, we developed a naive pseudo-labelling approach inspired by [1]. In short, an already trained network is used to label the unlabelled testing data with so-called weak labels. Only highly confident detections were used; the rest of the image was blurred out. Even though there is a high chance of overfitting to incorrect pseudo-labels due to confirmation bias, pseudo-labels can significantly improve the performance of the CNN if pseudo-labelled images are added carefully.

Transfer Learning - Transfer learning, e.g., Big Transfer [13], is a fine-tuning technique commonly used in deep learning. Rather than initializing the weights of a neural network randomly, pretrained weights are used. Furthermore, the final model could benefit from weights pretrained on a similar domain. To evaluate the potential of such an approach for this competition, we experimented with fine-tuning publicly available checkpoints, including ImageNet3, iNaturalist3, COCO [14], PlantCLEF2018 [19] and PlantCLEF2019 [17]. The idea was that checkpoints trained on nature-oriented datasets would outperform the non-nature-oriented ones, under the assumption that the coral domain differs significantly from generic domains. Since the differences measured on the validation set were small (see Table 4), we decided to use the COCO pretrained checkpoint, which includes both the backbone and the region proposal network weights.
3 https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf1_detection_zoo.md

Table 4. Transfer Learning experiment - effect of pretrained weights on model performance. For this experiment, the Mask R-CNN with a ResNet-50 backbone and an input size of 800 × 800 was used.

Pretrained weights              mAP@0.5   mAP@0.75
ImageNet (only backbone)         0.1826     0.0956
COCO (all Mask R-CNN weights)    0.2077     0.1017
iNaturalist (only backbone)      0.2091     0.0854
PlantCLEF2018 (only backbone)    0.1991     0.0914
PlantCLEF2019 (only backbone)    0.1895     0.0932

Test Time Augmentations - Test-time augmentation applies transformations to a given image to generate several slightly different variations; the predictions for these variations are then combined to improve the final prediction. Our submissions used simple horizontal and vertical flips of the image, whose combinations produced four sets of detections for each image. These sets were then merged using the voting strategy described by Moshkov et al. [16].

Ensembles - Ensemble methods combine predictions from multiple models to obtain the final output [21] and can be used to improve accuracy in machine learning tasks. In our work, we utilize a simple voting-based method for combining outputs from multiple detection networks [16]. Detections describing one object are grouped together by the size of their overlap region if they belong to the same class. Where the majority of the detectors agree on the class label and position, the group is replaced by the single detection with the highest score.

Accumulated Gradient Normalization - In order to achieve the best performance possible, we aimed to maximize the resolution of the input data; therefore, we decided to train the network on mini-batches of size 1. To overcome the disadvantages that come with such a minimal mini-batch size [7], the Accumulated Gradient Normalization [10] technique was utilized. This approach resulted in a considerable performance gain.
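The following TensorFlow 2 sketch illustrates the general idea of accumulating and normalising gradients over several mini-batches of size 1 before a single optimizer update. It is a minimal illustration of the technique under the assumption of a generic Keras-style model and loss, not the exact TF Object Detection API training loop we used.

```python
# Minimal sketch of gradient accumulation with normalisation: gradients from
# `accum_steps` mini-batches of size 1 are summed, divided by the number of
# accumulated steps, and applied in one optimizer update.
import tensorflow as tf


def train_with_accumulation(model, loss_fn, optimizer, dataset, accum_steps=16):
    accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
    step = 0
    for images, targets in dataset:
        with tf.GradientTape() as tape:
            loss = loss_fn(model(images, training=True), targets)
        grads = tape.gradient(loss, model.trainable_variables)
        # Sum gradients; treat missing gradients as zeros.
        accumulated = [a + (g if g is not None else tf.zeros_like(a))
                       for a, g in zip(accumulated, grads)]
        step += 1
        if step % accum_steps == 0:
            # Normalise by the number of accumulated mini-batches, then apply.
            normalised = [a / float(accum_steps) for a in accumulated]
            optimizer.apply_gradients(zip(normalised, model.trainable_variables))
            accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
```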
3 Submissions

For the evaluation of the participants' submissions, the AIcrowd platform4 was used. Each participating team was allowed to submit up to 10 submission files following specific requirements for each task; we used the allowed maximum for both tasks. Because we utilized a single architecture for both the detection and segmentation tasks, multiple submissions were produced using the same network. Therefore, in the following, annotation and localisation task submissions are denoted by D and pixel-wise parsing task submissions by S. Finally, thresholding was used to discard predictions with low confidence.

Baseline configuration - As a baseline for all our experiments we used Mask R-CNN with a ResNet-50 backbone. For training, we used the parameters and augmentations described in Table 2 and Section 2.3, respectively. The input resolution was 1000 × 1000 pixels.

Submission 1D/1S - Baseline experiment using a confidence threshold that corresponded to the best F1 score on our validation dataset (0.58).
Submission 2D - Submission 1D with a fix for a programming bug that caused a few detections to be generated incorrectly.
Submission 3D - Submission 2D with the confidence threshold set to 0.95.
Submission 4D/2S - Baseline configuration that used pseudo-labels as described in Section 2.3. The confidence threshold was set to 0.95.
Submission 5D/3S - Baseline configuration that utilized test-time augmentations as described in Section 2.3, with a confidence threshold of 0.9.
Submission 6D/4S - Submission 5D/3S with a confidence threshold of 0.999.
Submission 7D/5S - Ensemble of two checkpoints of the baseline configuration model, taken after 40 and 50 epochs. Confidence threshold of 0.9.
Submission 8D/6S - Submission 7D/5S with a confidence threshold of 0.999.
Submission 9D/8S - Submission 7D/5S with test-time augmentations and a confidence threshold of 0.999.
Submission 10D/10S - Submission 7D/5S with a confidence threshold of 0.95.
Submission 7S - Submission 9D/8S with a confidence threshold of 0.9.
Submission 9S - Submission 9D/8S with a modified voting ensemble: a single detection is sufficient, as opposed to majority voting.
4 https://www.aicrowd.com
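The submissions above differ mainly in the confidence threshold and in how detections from several passes (test-time augmentation) or checkpoints (ensembles) are merged. The sketch below is a hedged illustration of this post-processing; the data structures, the IoU-based grouping, and the majority rule are our reconstruction of the voting idea from Moshkov et al. [16], not their reference implementation.

```python
# Sketch of submission post-processing: discard low-confidence predictions
# and merge detections from several sources by majority voting on
# overlapping same-class boxes, keeping the highest-scoring one.


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0


def merge_by_voting(detection_sets, iou_thr=0.5, min_votes=None):
    """Each detection is (box, class_id, score); `detection_sets` holds one
    list per source (augmented pass or ensemble member). Groups of overlapping
    same-class detections supported by at least `min_votes` sources are
    reduced to their highest-scoring member."""
    min_votes = min_votes or (len(detection_sets) // 2 + 1)
    flat = [(det, src) for src, dets in enumerate(detection_sets) for det in dets]
    used, merged = set(), []
    for i, (det, _) in enumerate(flat):
        if i in used:
            continue
        group = [j for j, (other, _) in enumerate(flat)
                 if j not in used and other[1] == det[1]
                 and iou(det[0], other[0]) >= iou_thr]
        sources = {flat[j][1] for j in group}
        used.update(group)
        if len(sources) >= min_votes:
            merged.append(max((flat[j][0] for j in group), key=lambda d: d[2]))
    return merged


def apply_threshold(detections, conf_thr=0.95):
    """Keep only detections whose score passes the chosen confidence threshold."""
    return [d for d in detections if d[2] >= conf_thr]
```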
Fig. 2. Results for all runs submitted in the annotation and localisation task by the competition participants, including mAP@0.0 and mAP@0.5 metrics.

4 Competition Results

The official competition results are shown in Figure 2 for the annotation and localisation task and in Figure 3 for the pixel-wise parsing task. Our system achieved the best performance in both tasks of the second edition of the ImageCLEFcoral competition: an mAP@0.5 of 0.582 in Coral reef image annotation and localisation (Run ID 68143), and an mAP@0.5 of 0.678 in Coral reef image pixel-wise parsing (Run ID 67864). The results of all our submissions are listed in Table 5.

Table 6 illustrates the performance over different subsets of the test dataset. The system performed comparably over the Same Location (SL), Similar Location (SiL) and Geographically Similar Location (GS) subsets. The performance drops significantly for the Geographically Distinct Location (GD) subset, which is probably caused by a lack of diverse training data.

The best scoring submission for the pixel-wise parsing task was a single Mask R-CNN with a ResNet-50 backbone and an input resolution of 1000 × 1000. The system was trained for 50 epochs using the heavy augmentations described in Section 2.3. Additionally, pseudo-labelling (refer to Section 2.3) was used to increase the training dataset size with highly confident detections from the test set. Finally, the predictions were filtered with a confidence threshold of 0.95 to maximize the official mAP metric while still keeping a decent recall.

The best scoring submission for the annotation and localisation task was an ensemble of two checkpoints of the same Mask R-CNN model with a ResNet-50 backbone and an input resolution of 1000 × 1000, one taken after 40 epochs and the other after 50 epochs. The system was trained using heavy augmentations. Furthermore, the predictions were filtered with a confidence threshold of 0.999 to maximize the official mAP metric.

Table 5. Submission scores achieved over the test set. Official competition metrics.

Annotation and localisation task submissions
Submission    1D      2D      3D      4D      5D      6D      7D      8D      9D      10D
mAP@0.5     0.347   0.357   0.439   0.565   0.349   0.530   0.377   0.582   0.517   0.415
mAP@0.0     0.728   0.712   0.774   0.851   0.709   0.825   0.721   0.853   0.814   0.747
Run ID      67857   67858   67862   67863   68093   68094   68138   68143   68145   68146

Pixel-wise parsing task submissions
Submission    1S      2S      3S      4S      5S      6S      7S      8S      9S      10S
mAP@0.5     0.441   0.678   0.434   0.629   0.470   0.664   0.407   0.624   0.617   0.507
mAP@0.0     0.694   0.845   0.689   0.817   0.701   0.842   0.675   0.813   0.807   0.727
Run ID      67856   67864   68092   68095   68137   68139   68140   68142   68144   68147

Table 6. Submission results achieved over the 4 subsets of the testing set: Same Location (SL), Similar Location (SiL), Geographically Similar Location (GS), Geographically Distinct Location (GD).

Annotation and localisation task submissions
Submission       1D      2D      3D      4D      5D      6D      7D      8D      9D      10D
SL mAP@0.5     0.401   0.417   0.489   0.614   0.410   0.566   0.434   0.648   0.547   0.475
SiL mAP@0.5    0.234   0.247   0.322   0.440   0.230   0.431   0.254   0.343   0.438   0.258
GS mAP@0.5     0.470   0.446   0.508   0.562   0.453   0.516   0.516   0.627   0.533   0.527
GD mAP@0.5     0.225   0.230   0.280   0.292   0.231   0.346   0.210   0.329   0.344   0.242
Run ID         67857   67858   67862   67863   68093   68094   68138   68143   68145   68146

Pixel-wise parsing task submissions
Submission       1S      2S      3S      4S      5S      6S      7S      8S      9S      10S
SL mAP@0.5     0.527   0.744   0.513   0.670   0.545   0.742   0.480   0.663   0.656   0.583
SiL mAP@0.5    0.312   0.516   0.309   0.553   0.335   0.448   0.284   0.529   0.546   0.340
GS mAP@0.5     0.476   0.588   0.493   0.537   0.553   0.627   0.493   0.586   0.546   0.573
GD mAP@0.5     0.276   0.403   0.283   0.439   0.266   0.386   0.267   0.446   0.418   0.291
Run ID         67856   67864   68092   68095   68137   68139   68140   68142   68144   68147

5 Conclusion and Discussion

The proposed system, designed for automatic detection and pixel-wise classification of 13 coral substrates, achieved an impressive mAP@0.5 of 0.582 in the localisation task and 0.678 in the instance segmentation task of the ImageCLEFcoral competition.
The system is built around Mask R-CNN, a state-of-the-art instance segmentation framework, extended with additional known as well as some unique techniques, e.g., detection ensembles, test-time data augmentations, accumulated gradient normalization, and pseudo-labelling.

Surprisingly, the results for pixel-wise parsing are considerably better. This is unexpected mainly because the test set is the same for both tasks and our submissions used the same set of detections; therefore, more similar scores were expected. This leads us to believe that the annotations for the two tasks are not the same.

Fig. 3. Results for all runs submitted in the pixel-wise parsing task by the competition participants, including mAP@0.0 and mAP@0.5 metrics.

A more in-depth examination of our submissions revealed a limited generalisation capability with respect to geographical regions and specific locations. This indicates that the network could be over-fitted to the training dataset locations, which have a specific distribution of coral species. The system could achieve better performance with class priors corresponding to the desired location. If location transfer is essential, location generalisation should be a main goal for future challenges.

When comparing the model performance with the top results from the previous edition of this challenge (mAP@0.5 of 0.2427 and 0.0419), our model achieved superior performance. Even though the test datasets are not identical, such a difference shows the increasing trend of machine learning model performance. This increase is probably related to the higher number of training images.

Lastly, due to GPU memory constraints, we were limited to an input image resolution of 1000 × 1000 combined with a ResNet-50 backbone. The conducted experiments showed that an input resolution of 1200 × 1200 and a ResNet-101 backbone would yield better results; therefore, GPUs with more memory would lead to a considerable increase in the system's performance.

Acknowledgements

Lukáš Picek was supported by the Ministry of Education, Youth and Sports of the Czech Republic project No. LO1506, and by the grant of the UWB project No. SGS-2019-027.

References

1. Arazo, E., Ortego, D., Albert, P., O'Connor, N.E., McGuinness, K.: Pseudo-labeling and confirmation bias in deep semi-supervised learning. arXiv preprint arXiv:1908.02983 (2019)
2. Birkeland, C.: Global status of coral reefs: In combination, disturbances and stressors become ratchets. pp. 35–56 (2019)
3. Brander, L.M., Rehdanz, K., Tol, R.S., Van Beukering, P.J.: The economic impact of ocean acidification on coral reefs. Climate Change Economics 3(01), 1250002 (2012)
4. Chamberlain, J., Campello, A., Wright, J.P., Clift, L.G., Clark, A., García Seco de Herrera, A.: Overview of the ImageCLEFcoral 2020 task: Automated coral reef image annotation. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org (2020)
5. Coker, D.J., Wilson, S.K., Pratchett, M.S.: Importance of live coral habitat for reef fishes. Reviews in Fish Biology and Fisheries 24(1), 89–126 (2014)
6. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
7. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
8. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
10. Hermans, J., Spanakis, G., Möckel, R.: Accumulated gradient normalization. arXiv preprint arXiv:1710.02368 (2017)
11. Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., et al.: Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7310–7311 (2017)
12. Ionescu, B., Müller, H., Péteri, R., Abacha, A.B., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Ştefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia retrieval in medical, lifelogging, nature, and internet applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), vol. 12260. LNCS Lecture Notes in Computer Science, Springer, Thessaloniki, Greece (September 22-25 2020)
13. Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., Houlsby, N.: Big transfer (BiT): General visual representation learning (2019)
14. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)
15. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: European Conference on Computer Vision. pp. 21–37. Springer (2016)
16. Moshkov, N., Mathe, B., Kertesz-Farkas, A., Hollandi, R., Horvath, P.: Test-time augmentation for deep learning-based cell segmentation on microscopy images. Scientific Reports 10(1), 1–7 (2020)
17. Picek, L., Sulc, M., Matas, J.: Recognition of the Amazonian flora by Inception networks with test-time class prior estimation. In: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (2019)
18. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 91–99. Curran Associates, Inc. (2015)
19. Sulc, M., Picek, L., Matas, J.: Plant recognition by Inception networks with test-time class prior estimation. In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum (2018)
20. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
21. Zhang, C., Ma, Y.: Ensemble machine learning: methods and applications. Springer (2012)