The effects of colour enhancement and IoU optimisation on object detection and segmentation of coral reef structures

Marina Arendt1,2[0000-0002-4573-2462], Johannes Rückert1[0000-0002-5038-5899], Raphael Brüngel1[0000-0002-6046-4048], Christopher Brumann1[0000-0002-4117-2541], and Christoph M. Friedrich1,3[0000-0001-7906-0038]

1 Department of Computer Science, University of Applied Sciences and Arts Dortmund, Emil-Figge-Str. 42, 44227 Dortmund, Germany
marina.arendt@fh-dortmund.de
2 Kairos GmbH, Bochum, Germany
3 Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Essen, Germany

Abstract. This paper considers approaches used to localise and annotate coral reef structures in underwater images. Besides the actual localisation and annotation, the focus lay on image pre-processing and its evaluation. Underwater images differ from terrestrial images in illumination, acuity and colour, which makes them more blurred and gives them a green or blue cast. To enhance these physical properties, Image Blurriness and Light Absorption (IBLA) with additional Rayleigh optimisation or additional colour reduction was used. Afterwards, Mask R-CNN was used for both competition tasks, involving on-the-fly data augmentation and oversampling to combat the coral class imbalances. Several types of post-processing were applied to the generated boxes and polygons, mostly to account for the evaluation methodologies. For the localisation and annotation task, IBLA and Rayleigh pre-processing improved accuracy, while colour reduction led to overall worse results than the original images, and oversampling led to an even worse mean Average Precision (mAP) and only a slightly better average accuracy. For pixel-wise parsing, IBLA achieved a better mAP score but worse accuracy, and Rayleigh achieved worse results for both mAP and accuracy; colour reduction worked well, and oversampling reduced mAP but strongly improved average accuracy. In conclusion, image pre-processing, in particular IBLA and Rayleigh, improved accuracy for both tasks but only achieved a better mAP on the pixel-wise parsing task. In future work, the results could be improved by using larger images, trying other types of oversampling, and training separate models for different classes and object sizes.

Keywords: underwater colour correction · box optimisation · Mask R-CNN · deep learning · Jaccard index

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

This paper considers the results of the 2020 ImageCLEFcoral challenge, which took place for the second time [3]. The main focus lies on automatic detection and classification of coral reef structures in images. ImageCLEFcoral is a subtask of ImageCLEF [7]. Underwater images differ from terrestrial images in terms of colour, illumination and acuity, which can cause problems in automatic detection. Nevertheless, automatic detection and classification is needed, as manual detection is cost- and time-intensive [15]. To address the issues with underwater images, the approaches focused on image pre-processing that enhances image structure, illumination and colour, in order to test the effect of these steps on detection and segmentation.

After training the models, several types of post-processing were applied to the generated boxes and polygons, mostly to account for the evaluation methodologies.
For this, generated polygons were validated and boxes were shrunk to achieve a better Jaccard index, also known as Intersection over Union (IoU).

The paper is organised as follows: a description of the data set and the pre-processing, showing the main issues with underwater images, is followed by annotation and localisation (object detection, task 1) and pixel-wise parsing (segmentation, task 2). The results and discussion form the main part, and the paper is completed by the conclusion. All scripts used during this project are available in a GitHub repository⁴.

2 Image and annotation data

All details about the task and data can be found in the task overview paper [3]. The provided version 4 of the data set consists of n = 440 images in the development part (training set) and m = 400 images in the test part. All images are provided in the Joint Photographic Experts Group (JPEG) format with a resolution of 4032 × 3024 px. An annotation file for the development part holds k = 12,082 annotations, structured in eight variables:

image id: Data set-wide unique image identifier, equal to the image filename.
substrate: Image identifier-wide unique substrate index, starting at 0.
c class: Name of one of the 13 present coral classes.
confidence: Confidence of the annotation, always 1 for 100 % confidence.
x min: Minimum x-axis bounding box coordinate⁵.
y max: Maximum y-axis bounding box coordinate⁵.
x max: Maximum x-axis bounding box coordinate⁵.
y min: Minimum y-axis bounding box coordinate⁵.

In the following, the results of explorative analyses on the annotation data and the applied image pre-processing methods are presented.

4 https://github.com/saviola777/fhdo-imageclef2020-coral/, accessed 2020-07-07
5 On the image level the coordinate system origin (x min = 0, y min = 0) is located in the upper left corner.

2.1 Explorative data analyses on annotations

Explorative analyses on the annotation data of the training set were conducted using the statistical language R⁶ [12] in version 4.0.1 and the integrated development environment RStudio⁷ [14] in version 1.2.5033. Spatial analysis of substrate bounding boxes was performed using the Simple Features for R (sf) package⁸ [9] in version 0.9-2.

During a first screening, minor inconsistencies were identified. These comprise (i) a single negative coordinate value of x min = −1 present in one row (substrate 10 in 2018 0714 112502 024), and (ii) five dot-sized bounding boxes with x min = x max and y min = y max (substrate 17 in 2018 0714 112534 047; substrate 14 in 2018 0714 112535 042; substrate 1 in 2018 0714 112535 050; substrate 21 in 2018 0729 112613 064; substrate 5 in 2018 0729 112458 039). For the negative coordinate, the respective substrate was checked manually on the respective image, a sign flip was performed, and its bounding box was kept. The dot-sized bounding boxes were removed. Cleansing of the annotation data resulted in a total of k = 12,077 entries, still related to n = 440 images. The presented results are based on the cleansed annotations.
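These cleansing steps can be illustrated with a short Python sketch using pandas; the analyses themselves were carried out in R, and the file name and column labels below are assumptions, as the exact layout of the annotation file is not reproduced here.

```python
import pandas as pd

# Assumed file name and column labels; the actual annotation file may differ.
cols = ["image_id", "substrate", "class", "confidence",
        "x_min", "y_max", "x_max", "y_min"]
annotations = pd.read_csv("imageclef_coral_2020_annotations.csv", names=cols)

# (i) Fix the single negative x_min coordinate by a sign flip and keep the box.
annotations["x_min"] = annotations["x_min"].abs()

# (ii) Remove the five dot-sized boxes that collapse to a single point.
dot_sized = (annotations["x_min"] == annotations["x_max"]) & \
            (annotations["y_min"] == annotations["y_max"])
annotations = annotations[~dot_sized].reset_index(drop=True)

print(len(annotations))  # 12,077 entries expected after cleansing
```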
Table 1: Class frequencies, image presence, and per-image occurrence figures in the training setᵃ. The top-5 occurrences are highlighted.

Class                 | Frequency | Percentages | Images | Maximum
Algae Macro or Leaves |        91 |      0.75 % |     53 |      11
Fire Coral Millepora  |        19 |      0.16 % |      9 |       6
Hard Coral Boulder    |      1640 |     13.58 % |    398 |      18
Hard Coral Branching  |      1181 |      9.78 % |    349 |      16
Hard Coral Encrusting |       945 |      7.82 % |    310 |      21
Hard Coral Foliose    |       177 |      1.47 % |    104 |      11
Hard Coral Mushroom   |       223 |      1.85 % |    140 |       6
Hard Coral Submassive |       198 |      1.64 % |     93 |      12
Hard Coral Table      |        21 |      0.17 % |     21 |       1
Soft Coral            |      5662 |     46.88 % |    425 |      66
Soft Coral Gorgonian  |        90 |      0.74 % |     66 |       5
Sponge                |      1691 |     14.00 % |    342 |      34
Sponge Barrel         |       139 |      1.15 % |    118 |       3

a) The Frequency column describes the number of annotations in the overall data set and the Images column states in how many images this coral type is pictured. Percentages stands for the share of the overall frequency distribution and Maximum is the maximum number of representatives of a coral class in one image.

6 https://www.r-project.org/, accessed 2020-07-09
7 https://rstudio.com/, accessed 2020-07-09
8 https://github.com/r-spatial/sf, accessed 2020-07-09

The annotations comprise 13 substrate classes, listed in Table 1 together with their frequencies, overall percentages, presence in images, and the maximum count with which they occur in a single image. The top two classes with regard to their frequency account for 60.1 % of all annotations, while the top five classes account for 92.1 %. Soft Coral is the most common class and represents 47.2 % of all annotations.

Statistics on substrate bounding box per-image frequency, aspect ratio (x:y) and area (px) are listed in Table 2. The substrate density per image varies strongly. While 2018 0712 073801 116 is the only image with a single substrate, 2018 0712 073920 154 shows the maximum of 96 substrates in a single image. The median number of substrates in an image is 24; only rather few images contain a vast amount. The aspect ratio of substrate bounding boxes also shows a wide span, with a minimum of 0.12 and a maximum of 8.51. Highly elongated bounding boxes are present; however, a median of 1.08 and an interquartile range of 0.30 suggest a moderate elongation in most cases. The areas of substrate bounding boxes likewise span a wide range, with partially extreme low and high values. For a better understanding, square area values are discussed, assuming substrate bounding boxes with an aspect ratio of 1:1. Here, the minimum area is 12.73 px², while a maximum of 3,249.23 px² is also present. The median square area is 241.94 px².

Table 2: Statistics on substrate bounding box features of the training set.

Feature             |  Min. | 1st Qu. | Median |   Mean | 3rd Qu. |     Max. | Std. deviation
Per-image frequency |     1 |      16 |     24 |  27.45 |      36 |       96 |          16.13
Aspect ratio (x:y)  |  0.12 |    0.87 |   1.08 |   1.17 |    1.35 |     8.51 |           0.50
Square area (px²)   | 12.73 |  158.39 | 241.94 | 412.26 |  388.81 | 3,249.23 |         261.33

The spatial analysis of substrate bounding boxes revealed a notable amount of overlaps, where up to five boxes shared an intersecting area. The most common overlap scenarios involve two and three substrate bounding boxes. For two substrate bounding boxes the mean (median) is 14.90 (12) overlaps per image, for three substrate bounding boxes it is 2.46 (3). A maximum of up to 81 overlaps between two and up to 32 between three substrate bounding boxes indicates numerous redundancies of annotated areas for several images. Overlaps between four and five substrate bounding boxes are rare. Full overlaps between substrate bounding boxes have also been found, where the bounding box of one substrate fully covers that of another: 146 full intra-class overlaps and 509 full inter-class overlaps were identified. Especially inter-class overlaps and full overlaps may present a challenging condition for object detection tasks.
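The spatial overlap analysis above was carried out with the sf package in R. A rough Python analogue for counting pairwise and full bounding-box overlaps within one image can be sketched with shapely (which is also used for polygon handling later in this work); the function below and its interface are our own illustration, not the original analysis code.

```python
from itertools import combinations
from shapely.geometry import box

def count_pairwise_overlaps(bboxes):
    """Count overlapping and fully covered pairs among (x_min, y_min, x_max, y_max) boxes."""
    shapes = [box(x0, y0, x1, y1) for (x0, y0, x1, y1) in bboxes]
    overlaps = 0
    full_overlaps = 0
    for a, b in combinations(shapes, 2):
        if a.intersects(b) and not a.touches(b):  # shared area, not just a shared edge
            overlaps += 1
            if a.contains(b) or b.contains(a):    # one box fully covers the other
                full_overlaps += 1
    return overlaps, full_overlaps

# Example: two partially overlapping boxes and one box fully contained in the first.
print(count_pairwise_overlaps([(0, 0, 100, 100), (50, 50, 150, 150), (10, 10, 20, 20)]))  # (2, 1)
```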
2.2 Pre-processing

Underwater images differ from other images in their physical properties. The deeper an image was taken, the darker it gets, and red light is absorbed more strongly than green and blue light. This often results in blurred images with a green or blue cast [10].

The idea was to enhance image quality prior to segmentation and parsing, which should lead to better segmentation and parsing results. The best pre-processing steps were chosen by visual inspection. The pre-processing functions [16] that have been used are described in the following sections. The images were first processed by Image Blurriness and Light Absorption (IBLA), followed by either a transformation towards a Rayleigh distribution or an octree colour reduction. Figure 1 (a) and (b) are examples showing the problems with underwater images described above.

To visualise the colour distribution, normalised histograms were made using the NumPy⁹ package. Figure 2 shows the histograms of the images 2018 0714 112438 016 and 2018 0729 112414 024. Clearly visible is either a high intensity of the green channel or a high intensity of the blue channel, in combination with a very low intensity of the red channel, which is typical for underwater images.

Image Blurriness and Light Absorption (IBLA). Underwater image restoration based on IBLA was conducted on both the training and test set [10,16]. The IBLA transformation is based on four main steps. First, the image blurriness is analysed, and a smoothed and refined blurriness map is generated to optimise the image. Second, the background light pixels are estimated from image blurriness and variance via a quad-tree algorithm. Third, the actual enhancement is performed using depth estimation based on light absorption and blurriness, which results in an optimised depth map. Last, the transmission map is estimated, leading to restoration rather than enhancement [10]. The results are shown in Figure 1c and Figure 1d.

Enhancement based on Rayleigh distribution. The method for image enhancement with the Rayleigh distribution is separated into two main steps [5]. First, the contrast is corrected and second, the colour is corrected. For contrast correction a global histogram stretching is applied, followed by a division into a lower and an upper part at the average point. Both parts are then Rayleigh-stretched to the full gray-scale range from 0 to 255 and recombined. For colour enhancement the image is transformed into the Hue, Saturation, and Value (HSV) colour model. The saturation and value levels are stretched and the image is converted back to a Red, Green, Blue (RGB) image. This led to an enhancement of contrast and details and reduced image artefacts [5,16]. The results are shown in Figure 1e and Figure 1f.

9 https://numpy.org/, accessed 2020-07-10

Fig. 1: Comparison of the original images 2018 0714 112438 016 (a) and 2018 0729 112414 024 (b) from the training set with their three different transformations: (c)/(d) IBLA transformed, (e)/(f) IBLA and Rayleigh transformed, (g)/(h) IBLA transformed and colour reduced. (a) and (b) show the main problems of underwater images: they are blurry and their histograms are strongly shifted towards green or blue. The original images (a) and (b) are under copyright of the organisers [3,7].
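The normalised channel histograms shown in Figure 2 were computed with NumPy. A minimal sketch of this step is given below; the file name is only an example, and OpenCV is used here solely for reading the image (it returns channels in BGR order).

```python
import cv2
import numpy as np

# Example file name (assumed); cv2.imread returns a BGR image as a uint8 array.
img = cv2.imread("2018_0714_112438_016.jpg")

for name, channel in zip(("blue", "green", "red"), cv2.split(img)):
    # Normalised histogram over 256 intensity bins (the values sum to 1).
    hist, _ = np.histogram(channel, bins=256, range=(0, 256))
    hist = hist / hist.sum()
    print(f"{name}: mean intensity {channel.mean() / 255:.3f}, "
          f"most frequent bin {int(hist.argmax())}")
```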
Fig. 2: Normalised histograms (density over intensity) of two training set images, showing the typical unequal colour distribution of underwater images: (a) histogram of 2018 0714 112438 016, (b) histogram of 2018 0729 112414 024.

Colour reduction. The colour reduction was conducted using the octree process, reducing all IBLA-transformed images to a maximum of 256 colours as implemented in the referenced code¹⁰. The octree colour reduction (for instance described in [2, p. 333 sqq.]) results in an image with 256 colours and a harmonised colour distribution [2]. The results are shown in Figure 1g and Figure 1h.

3 Methods

Mask R-CNN [6] is an instance segmentation framework which extends Faster R-CNN [13] with a parallel branch for instance segmentation on regions of interest. The models described in this paper were trained with a Mask R-CNN implementation using TensorFlow and Keras in Python 3¹¹, which was patched to support TensorFlow 2.1¹². All models used weights pre-trained on the MS COCO data set [8]. To speed up on-the-fly pre-processing and avoid padding, all images were resized to 1536 × 1536 beforehand.

10 https://github.com/delimitry/octree_color_quantizer, accessed 2020-07-07
11 Abdulla, W.: Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN, accessed 2020-07-07
12 https://github.com/DiffPro-ML/Mask_RCNN, accessed 2020-07-07

The training was split into two phases: first, only the newly added layers were trained for one epoch. Second, the complete network was trained for the remaining epochs. Then, Polyak averaging [11] was performed on the top five models based on their .632 error [4]. For the submission models, early stopping was used based on the average epoch number of the top five models during cross-validation training.

For on-the-fly data augmentation, the images were randomly rotated (up to ±180° in each direction), flipped (up/down or left/right) with a 33 % chance each, and 0 to 5 of the following operations were applied: blur, sharpen, random crop (up to 20 % on each side), Gaussian noise, and changes to brightness, hue/saturation, and contrast.

To combat the class imbalance in the data set, oversampling was performed, which entails iterative optimisation of the Shannon entropy of the data set by adding images until the number of images is tripled, with constraints on the number of times a single image can appear in the final data set (a simplified sketch of this procedure is given below).

Considering only the five most frequent classes was also evaluated due to the imbalance of the data set and achieved good results in our cross-validation runs.
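A simplified sketch of the entropy-based oversampling described above is given below. It greedily duplicates the image whose addition maximises the Shannon entropy of the class distribution until the data set is tripled; the constraint on how often a single image may be repeated is omitted here, and all names are our own.

```python
import numpy as np

def shannon_entropy(class_counts):
    """Entropy (in bits) of the class distribution given a vector of per-class counts."""
    p = class_counts / class_counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def oversample(image_class_counts, factor=3):
    """Greedily duplicate images to maximise the class-distribution entropy.

    image_class_counts: dict mapping image id -> np.array of per-class annotation counts.
    Returns the list of image ids forming the oversampled data set.
    """
    selected = list(image_class_counts)              # start with every image once
    totals = sum(image_class_counts.values())        # per-class totals of the current set
    target_size = factor * len(image_class_counts)
    while len(selected) < target_size:
        # Pick the image whose addition yields the highest entropy.
        best = max(image_class_counts,
                   key=lambda i: shannon_entropy(totals + image_class_counts[i]))
        selected.append(best)
        totals = totals + image_class_counts[best]
    return selected

# Toy example: three images with per-class counts over two classes.
counts = {"img_a": np.array([5, 0]), "img_b": np.array([0, 1]), "img_c": np.array([2, 1])}
print(oversample(counts))
```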
Table 3 lists the most important parameters used for the different training runs. The training run configurations are listed in Table 4. It includes the submission ID, the run name, which data set was used (original images, IBLA pre-processing, IBLA plus Rayleigh pre-processing, or colour reduced, see Section 2.2), whether on-the-fly data augmentation as described above was applied, whether images of size 1024 × 1024 or 1536 × 1536 were used, whether oversampling was used, as well as the number of epochs.

Table 3: Mask R-CNN parameters used for the training variations. Image size, batch size and anchor scales depended on the image size. The learning rate η was reduced for the second phase and was different for oversampling runs.

Parameter                           | Value
Backbone                            | ResNet101
Image size (larger images)          | 1024 × 1024 (1536 × 1536)
Batch size (larger images)          | 2 (1)
Optimiser                           | Stochastic gradient descent with Nesterov momentum enabled
η first phase (oversampling)        | 0.005 (0.002)
η second phase (oversampling)       | 0.0005 (0.0005)
Learning momentum                   | 0.9
Weight decay                        | 0.0001
Epochs (augmentation, oversampling) | 15 (30, 20)
Minimum detection confidence        | 0.6
Anchor scales                       | 32, 64, 128, 256, 512
Anchor scales (larger images)       | 48, 96, 192, 384, 768

Table 4: Run configurations with submission ID, run name, pre-processing (data set), augmentation, larger images, oversampling, and number of epochs.

ID    | Run name                        | Data set        | Augm. | Larger images | Overs. | Epochs
67914 | Baseline                        | Original        | No    | No            | No     | 15
67919 | 5 classes                       | Original        | No    | No            | No     | 20
68188 | Augmentation                    | Original        | Yes   | No            | No     | 30
68187 | IBLA                            | IBLA            | Yes   | No            | No     | 30
68186 | IBLA + Rayleigh                 | IBLA + Rayleigh | Yes   | No            | No     | 30
68184 | Colour reduction (CR)           | IBLA + CR       | Yes   | No            | No     | 30
68185 | Oversampling                    | IBLA            | Yes   | No            | Yes    | 20
68183 | Larger images                   | IBLA            | Yes   | Yes           | Yes    | 20
68182 | Segmentation                    | IBLA (polygons) | Yes   | Yes           | Yes    | 20
68181 | Segmentation + IoU optimisation | IBLA (polygons) | Yes   | Yes           | Yes    | 20

3.1 Annotation and localisation

The focus of the approach described in this paper was on the annotation and localisation task. Models were trained using the bounding box annotations and optimised against the PASCAL Visual Object Classes (VOC)-style mean Average Precision (mAP) implementation in Mask R-CNN [6]. The different configurations that were analysed (see Table 4) included different data sets based on the pre-processing described in Section 2.2, different levels of on-the-fly data augmentation, two different image sizes, oversampling, as well as training models for varying numbers of epochs.

3.2 Pixel-wise parsing

For this task, similar approaches as for the annotation and localisation task were applied, only using the polygon annotations for training. The colour reduction run was skipped due to its poor results for the first task; instead, a number of different runs with larger images were included to see the effects of more and less training as well as of a lower confidence threshold.

3.3 Post-processing

When the evaluation code for the challenge was published and it turned out that the submitted bounding boxes would be evaluated against the polygons instead of the bounding box annotations, it became clear that training with the polygon annotations would be much more effective. For example, this led to an evaluation F1 score of 0.8 for the bounding box training ground truth, whereas the score was 0.99 for the polygon training ground truth. The loss of 0.01 was due to invalid polygons in the ground truth annotations. The value of 0.8 can be increased to 0.9 simply by reducing the size of the bounding boxes by 7.5 % on each side, as seen in Figure 3. This post-processing step was therefore used for all submissions of the first task.

Additionally, an iterative algorithm was created to approximate the best possible rectangular box for a given polygon according to their IoU. This algorithm is described in the next section. To make use of this algorithm, a model was trained on the polygon annotations and the resulting polygons were used to generate boxes.
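The simple 7.5 % box reduction described above can be written as a small helper function; the sketch below (the function name is ours) operates on a single box, while the iterative IoU optimisation is described and sketched further below.

```python
def shrink_box(x_min, y_min, x_max, y_max, fraction=0.075):
    """Shrink a bounding box by the given fraction on each side.

    Reducing each side by 7.5 % raised the evaluation F1 score of the
    bounding-box training ground truth from roughly 0.8 to 0.9.
    """
    dx = (x_max - x_min) * fraction
    dy = (y_max - y_min) * fraction
    return x_min + dx, y_min + dy, x_max - dx, y_max - dy

# Example: a 400 x 200 px box keeps 85 % of its width and height.
print(shrink_box(100, 100, 500, 300))  # (130.0, 115.0, 470.0, 285.0)
```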
For the second task, the polygons generated from binary masks by OpenCV¹³ [1] were validated with the shapely library¹⁴, which is used in the evaluation script, since about 1 % of the generated polygons were not valid according to the shapely library's definition of a valid polygon and would have been ignored by the evaluation script. A valid polygon may not cross itself and may only touch itself in a single point. To clean up invalid polygons, duplicate points were first removed, then the polygons were split into several separate polygons at the touching / self-crossing points, and the biggest resulting polygon was kept. Separately, the buffer function provided by the shapely library was used to generate a valid polygon. Of these two candidates, the polygon with the least absolute area difference compared to the original, invalid polygon was used.

Fig. 3: IoU values for a detected polygon with its minimum bounding box, the reduced box and the optimised box. The minimum bounding box shows an IoU of 0.534. After reducing its size by 7.5 % in all dimensions, the result increased to 0.593, as the dotted box shows. Applying the iterative optimisation algorithm led to the highest IoU of 0.73.

13 https://opencv.org/, accessed 2020-07-10
14 https://github.com/Toblerity/Shapely, accessed 2020-07-09

Bounding box IoU optimisation. Given a solid polygon P, defined by a contour as a set of points, the corresponding minimal enclosing rectangle can be described by four parameters: R^0 := [x, y, w, h]. Here x and y define the rectangle's top left corner and w and h its width and height, respectively. Calculating the IoU of the polygon with this rectangle showed that it is not necessarily the best value achievable, as can be seen in Figure 3. This became particularly clear with long, thin polygon arms that run parallel to the rectangle edges. In these cases a higher IoU was often obtained if the rectangle was slightly reduced on the corresponding side. Starting from the minimum bounding box, however, an increase in edge length can only lead to an IoU deterioration. Therefore, the polygon's minimum bounding box was chosen as the algorithm's starting point for maximising the IoU. The parameter space was thus defined by the rectangle's parameters.

The optimised objective function was IoU(P, R^k) = |P ∩ R^k| / |P ∪ R^k| for the given polygon P and the current rectangle R^k at optimisation step k. To maximise the objective function, R was iteratively changed in all parameters via translation and scaling, and the best resulting rectangle was used for the next iteration. This process was continued until no further improvement of the objective function was achieved; the rectangle was then accepted as optimised. During the optimisation process, the step sizes for translation, shrinkage and growth are given by t, s and g. In this application all values were set to four. Each optimisation iteration was then performed by calculating:

R^{k+1}_{T←} = R^k + [-t, 0, 0, 0]    R^{k+1}_{T→} = R^k + [t, 0, 0, 0]
R^{k+1}_{T↑} = R^k + [0, -t, 0, 0]    R^{k+1}_{T↓} = R^k + [0, t, 0, 0]      (1)

R^{k+1}_{S←} = R^k + [s, 0, -s, 0]    R^{k+1}_{S→} = R^k + [0, 0, -s, 0]
R^{k+1}_{S↑} = R^k + [0, s, 0, -s]    R^{k+1}_{S↓} = R^k + [0, 0, 0, -s]     (2)

R^{k+1}_{G←} = R^k + [-g, 0, g, 0]    R^{k+1}_{G→} = R^k + [0, 0, g, 0]
R^{k+1}_{G↑} = R^k + [0, -g, 0, g]    R^{k+1}_{G↓} = R^k + [0, 0, 0, g]      (3)

As can be seen in Equation 1, the rectangle movement was performed in every possible direction. It should be noted that the rectangle never exceeds the bounding box dimensions. Equation 2 shows the corresponding contraction of the rectangle, while Equation 3 calculates its enlargement. Once all candidate rectangles R^{k+1} were calculated, the objective function with P was evaluated for each of them, so that the corresponding improvement factor was known. Subsequently, the operation with the most promising improvement factor was performed by setting the corresponding rectangle as the new R^k, which defines the starting point for the next iteration. As soon as no rectangle showed an improvement, R^k was no longer updated and was assumed to be optimised with regard to the objective function. An example of an optimised rectangle is shown in Figure 3.
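The iterative optimisation can be sketched in Python with shapely as follows. This is an illustrative reimplementation under our own naming, not the original code; the constraint that the rectangle must not exceed the minimal bounding box is realised here by clipping the candidates.

```python
from shapely.geometry import Polygon, box

def iou(poly, rect):
    """Intersection over Union of a polygon and a rectangle."""
    return rect.intersection(poly).area / rect.union(poly).area

def optimise_box(polygon, t=4, s=4, g=4):
    """Iteratively translate, shrink and grow the minimal bounding box of a
    polygon to maximise the IoU, following the scheme described above."""
    x_min, y_min, x_max, y_max = polygon.bounds
    x, y, w, h = x_min, y_min, x_max - x_min, y_max - y_min
    best = iou(polygon, box(x, y, x + w, y + h))
    improved = True
    while improved:
        improved = False
        candidates = [
            (x - t, y, w, h), (x + t, y, w, h),          # translation (Eq. 1)
            (x, y - t, w, h), (x, y + t, w, h),
            (x + s, y, w - s, h), (x, y, w - s, h),      # shrinkage (Eq. 2)
            (x, y + s, w, h - s), (x, y, w, h - s),
            (x - g, y, w + g, h), (x, y, w + g, h),      # growth (Eq. 3)
            (x, y - g, w, h + g), (x, y, w, h + g),
        ]
        for cx, cy, cw, ch in candidates:
            # Keep the candidate rectangle inside the original bounding box.
            cx, cy = max(cx, x_min), max(cy, y_min)
            cw, ch = min(cw, x_max - cx), min(ch, y_max - cy)
            if cw <= 0 or ch <= 0:
                continue
            score = iou(polygon, box(cx, cy, cx + cw, cy + ch))
            if score > best:
                best, (x, y, w, h), improved = score, (cx, cy, cw, ch), True
    return (x, y, w, h), best

# Example: an L-shaped polygon whose minimal bounding box has a poor IoU.
poly = Polygon([(0, 0), (100, 0), (100, 20), (20, 20), (20, 100), (0, 100)])
print(optimise_box(poly))
```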
4 Results and Discussion

For the image annotation and localisation task (see Table 5), on-the-fly data augmentation expectedly reduced over-fitting and led to better results for both mAP and average accuracy. Pre-processing using IBLA and Rayleigh led to lower mAP values but increased average accuracy, while colour reduction produced overall worse results compared to the original images. The Rayleigh run had the highest average accuracy of all runs submitted for the first task, followed by the IBLA and augmentation runs, showing that while the models do not detect objects as accurately as some of the competitors, they classify the detected objects very well.

Table 5: 5-fold cross-validation results on the training data set along with the corresponding submission results for the image annotation and localisation task.

Submission ID | Run name                        | F1 score train (sd) | F1 score valid (sd) | mAP 0.5 | Avg. accuracy
67914         | Baseline                        | .688 (.012)         | .425 (.010)         | .391    | .113
67919         | 5 classes                       | .573 (.011)         | .569 (.020)         | .383    | .083
68188         | Augmentation                    | .492 (.005)         | .433 (.011)         | .440    | .149
68187         | IBLA                            | .482 (.005)         | .426 (.012)         | .424    | .155
68186         | IBLA + Rayleigh                 | .474 (.012)         | .420 (.008)         | .410    | .159
68184         | Colour reduction                | .447 (.008)         | .382 (.005)         | .388    | .137
68185         | Oversampling                    | .511 (.010)         | .432 (.011)         | .369    | .117
68183         | Larger images                   | .505 (.012)         | .446 (.011)         | .405    | .107
68182         | Segmentation                    | .564 (.007)         | .470 (.010)         | .422    | .108
68181         | Segmentation + IoU optimisation | .579 (.007)         | .483 (.011)         | .457    | .107

Considering only the 5 most frequent classes achieved worse results than the baseline in both mAP and accuracy, unlike in the cross-validation runs, where it achieved the best F1 score on the validation data. Surprisingly, oversampling led to even worse mAP results than the baseline model, while having only a slightly better average accuracy. In our cross-validation runs, on the other hand, oversampling achieved results on par with the data augmentation run. This may be due to a slightly different class distribution in the test data set, or too many epochs of training for the submission models. Using larger images led to better mAP scores but worse average accuracy. Using polygon annotations for training clearly improved the mAP, but did not improve the average accuracy.

Looking at the per-substrate accuracies, the IoU-optimised run produced the worst results, with seven classes having 0 % accuracy. The run without IoU optimisation and the run with larger images still have six classes without any correct detections, meaning that all runs with larger images have trouble with the less frequent classes. The baseline, for comparison, produced no correct detections for five classes. The oversampling run only has three classes without any correct detections, but three more classes with accuracies below 0.05.
The runs without oversampling performed better: each of them produced correct detections for all but one class, which had 0 % accuracy across all submitted runs of all participants and was represented by fewer than 30 instances in the training data set.

The evaluation code was released only several weeks before the submission deadline for the models, and it produced very different results than the PASCAL VOC-style mAP evaluation recommended in the task's description and implemented in Mask R-CNN. The evaluation strategy was mainly the same for both tasks. The focus of this work would have been much more on the second task and on training instance segmentation models if the evaluation code had been published earlier. It is easier and more effective to generate suitable boxes from polygons compared to guessing suitable boxes based on the bounding box or just using the bounding box itself. In effect, a model which predicts perfect bounding boxes would never be able to exceed an mAP score of about 0.8, because some objects are shaped in a way that produces bounding boxes with a very low IoU against the polygon ground truth. This effect was demonstrated by the bounding box refinement and IoU optimisation, which led to a significant improvement in mAP score.

For the image pixel-wise parsing task (see Table 6), IBLA achieved a better mAP score than the original data set, but a worse average accuracy than the baseline. Rayleigh performed even worse, with an mAP score similar to the baseline but a worse average accuracy.

Table 6: Submission results for the image pixel-wise parsing task.

ID    | Run name                                    | mAP 0.5 | Avg. accuracy
67963 | Baseline                                    | .433    | .134
67964 | Augmentation                                | .449    | .140
67965 | IBLA                                        | .453    | .128
67967 | IBLA + Rayleigh                             | .435    | .120
68192 | Oversampling                                | .424    | .191
67968 | Larger images                               | .469    | .174
68190 | Larger images, colour reduction             | .474    | .158
68191 | Larger images, 10 epochs                    | .416    | .128
67969 | Larger images, 0.4 confidence               | .376    | .196
68189 | Larger images, 60 epochs, more augmentation | .371    | .157

Oversampling once again reduced the mAP score, but strongly improved the average accuracy. Similar to the first task, using larger images increased the mAP score but reduced the average accuracy. Using larger images and the data set with reduced colours further improved the mAP score, reaching the overall best value for the second task among the submitted runs, but produced a much lower average accuracy. Training for more or fewer epochs led to overall worse results. Reducing the confidence threshold led to a very low mAP score but slightly improved the average accuracy, once again reaching the highest average accuracy of all runs submitted for the second task.

Looking at the overall and per-substrate accuracies, the models with larger images interestingly had better accuracies than those with smaller images, unlike in the first task, where it was the other way around. Excluding the run with a 0.4 minimum detection threshold, which achieved the highest overall accuracy at a much lower mAP value, the oversampling run had the best accuracy. The runs without oversampling achieved comparatively worse results, with four to five classes having no correct detections, whereas all other runs had only the one class without correct detections.

Post-processing as described in Section 2.1 did not achieve the expected impact on classification quality. Hence, those results are not shown, as this step was not part of the final implementation.
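The mAP ceiling of about 0.8 for perfect bounding-box predictions discussed earlier in this section stems from objects whose minimal bounding box inherently has a low IoU with the underlying polygon. As a simple worked example (our own, using shapely), a right triangle covers exactly half of its minimal bounding box, so even a perfect box prediction cannot exceed an IoU of 0.5 against the polygon ground truth.

```python
from shapely.geometry import Polygon, box

triangle = Polygon([(0, 0), (100, 0), (0, 100)])   # right triangle, area 5000
bbox = box(*triangle.bounds)                       # its minimal bounding box, area 10000

iou = triangle.intersection(bbox).area / triangle.union(bbox).area
print(iou)  # 0.5 -- a perfect box prediction cannot score higher against this polygon
```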
5 Conclusion

In conclusion, image pre-processing using IBLA and Rayleigh improved accuracy for the localisation and annotation task, while achieving a better mAP on the pixel-wise parsing task. Colour reduction worked well with larger images for the second task in terms of mAP, but fell behind in accuracy. Oversampling was overall not successful, even though it led to better accuracy in the second task; this was not reflected in the cross-validation analysis on the training data. Oversampling was therefore used in the majority of submitted runs, decreasing the performance especially of the runs with larger images.

Larger images performed worse in early runs that were not properly fine-tuned, and hence enlargement was not considered in most of the further analysis. All runs with larger images used oversampling, which most likely hurt the model performance. A stronger focus on larger images would have been useful, since the results are promising at least for the second task. Nevertheless, this work has produced competitive models, especially in terms of classification accuracy, that could be improved in the future, for example by using larger images, trying other types of oversampling, or training separate models for different classes or object sizes.

References

1. Bradski, G.: The OpenCV Library. Dr. Dobb's Journal of Software Tools 25, 120–125 (2000)
2. Burger, W., Burge, M.J.: Digital Image Processing – An Algorithmic Introduction Using Java. Springer, Berlin, Heidelberg, 2nd edn. (2016). https://doi.org/10.1007/978-1-4471-6684-9
3. Chamberlain, J., Campello, A., Wright, J.P., Clift, L.G., Clark, A., García Seco de Herrera, A.: Overview of the ImageCLEFcoral 2020 Task: Automated Coral Reef Image Annotation. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org (2020)
4. Efron, B., Tibshirani, R.: Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association 92(438), 548–560 (1997)
5. Ghani, A.S.A., Isa, N.A.M.: Underwater image quality enhancement through composition of dual-intensity images and Rayleigh-stretching. SpringerPlus 3(1) (2014). https://doi.org/10.1186/2193-1801-3-757
6. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV). IEEE (2017). https://doi.org/10.1109/iccv.2017.322
7. Ionescu, B., Müller, H., Péteri, R., Abacha, A.B., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Ştefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia Retrieval in Lifelogging, Medical, Nature, and Internet Applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), LNCS Lecture Notes in Computer Science, vol. 12260. Springer (2020)
8. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.: Microsoft COCO: Common Objects in Context. In: Computer Vision – ECCV 2014. pp. 740–755. Springer International Publishing (2014). https://doi.org/10.1007/978-3-319-10602-1_48
9. Pebesma, E.: Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10(1), 439–446 (2018). https://doi.org/10.32614/RJ-2018-009
10. Peng, Y.T., Cosman, P.C.: Underwater Image Restoration Based on Image Blurriness and Light Absorption. IEEE Transactions on Image Processing 26(4), 1579–1594 (2017). https://doi.org/10.1109/tip.2017.2663846
11. Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30(4), 838–855 (1992)
12. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2020), https://www.R-project.org/
13. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
14. RStudio Team: RStudio: Integrated Development Environment for R. RStudio, Inc., Boston, MA (2019), https://www.rstudio.com/
15. Srividhya, K., Ramya, M.M.: Object classification in underwater images using adaptive fuzzy neural network. In: 2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD). pp. 142–148. IEEE (2017). https://doi.org/10.1109/fskd.2017.8392973
16. Wang, Y., Song, W., Fortino, G., Qi, L.Z., Zhang, W., Liotta, A.: An Experimental-Based Review of Image Enhancement and Image Restoration Methods for Underwater Imaging. IEEE Access 7, 140233–140251 (2019). https://doi.org/10.1109/access.2019.2932130