Deep Segmentation: Using deep convolutional networks for coral reef pixel-wise parsing

Aljoscha Steffens¹, Antonio Campello¹,², James Ravenscroft¹, Adrian Clark², and Hani Hagras²

¹ Filament AI, United Kingdom
² University of Essex, United Kingdom

Abstract. In this paper, we describe a deep convolutional network based method for segmenting coral reef images into different types of substrate. The method covers data preparation, a model summary, specific techniques for dealing with class imbalance, and downstream post-processing computer vision tasks such as morphological operations and polygon generation from pixel segmentations. We present the results of our method in the ImageCLEFcoral pixel-wise parsing task, evaluated across the different classes of substrate.

1 Introduction

Semantic segmentation models have received significant attention in computer vision due to their applicability in medical imaging, autonomous driving, and full-scene understanding. In this paper, we consider the ImageCLEF 2019 pixel-wise parsing competition [4], [9], which consists of segmenting pictures of coral reefs into 13 different substrates. We are particularly interested in evaluating the applicability of deep convolutional neural networks (DCNNs) to coral images taken under real conditions in the ocean. A model that performs well on the task of automatic coral segmentation could benefit the conservation of reefs by measuring the amounts of different corals, their condition, and other characteristics.

This paper is organised as follows. In section 2 we explore the ImageCLEFcoral dataset [4], how to split it into training and validation sets while keeping the class distributions balanced, and a data augmentation approach. In section 3 we describe DeeplabV3 [6], a DCNN designed for semantic segmentation, along with our pipeline. Our method includes post-processing tasks such as morphological operations and polygon filling. Training, bootstrapping and inference are also described. In section 4 we discuss possible routes to improve the results.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

2 Data

The data provided consisted of 240 images of size 4032 × 3024 × 3, as well as a text file containing the polygons in the images for the following classes/substrates:

Hard Coral - Branching     Soft Coral
Hard Coral - Submassive    Soft Coral - Gorgonian
Hard Coral - Boulder       Sponge
Hard Coral - Encrusting    Sponge - Barrel
Hard Coral - Table         Fire Coral - Millepora
Hard Coral - Foliose       Algae - Macro or Leaves
Hard Coral - Mushroom

Table 1: Substrate names

We mapped the 13 substrates and the background to the integers {0, 1, ..., 13} (with the background corresponding to 0) and created a 4032 × 3024 integer matrix for each image k, defined as

M^k_{ij} = c  if pixel (i, j) of image k corresponds to substrate c,    (1)

with ties broken arbitrarily. The resulting matrix acts as a per-class "mask" for each substrate; an example can be seen in figure 1.

Fig. 1: Substrate mask and image for id 2018_0714_112417_024

The class distribution of the pixels reflects the naturally occurring composition of corals in the photographed area and is thus highly imbalanced, see figure 2. Also, the class distributions of two images can vary strongly, as only a limited selection of coral types is present in each image. Both the overall class imbalance and the differences in inter-image class distributions need to be taken into account; they are discussed in sections 3.2 and 2.1, respectively.
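As an illustration of equation (1), the sketch below rasterises polygon annotations into a single integer mask per image. It is a minimal sketch only: the annotation format (a list of (class id, polygon) pairs) and the use of OpenCV's fillPoly are illustrative assumptions, not the exact format of the ImageCLEFcoral ground-truth file.

```python
import numpy as np
import cv2  # OpenCV

# Hypothetical annotation format: a list of (class_id, polygon) pairs per image,
# where class_id is in {1, ..., 13} and polygon is a list of (x, y) vertices.
def build_mask(annotations, height=3024, width=4032):
    """Rasterise polygon annotations into one integer mask (0 = background)."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for class_id, polygon in annotations:
        pts = np.array(polygon, dtype=np.int32).reshape(-1, 1, 2)
        # Later polygons overwrite earlier ones, i.e. overlapping annotations
        # ("ties" in equation (1)) are resolved by annotation order.
        cv2.fillPoly(mask, [pts], color=int(class_id))
    return mask

# Example with a single dummy triangle labelled as substrate 3:
example_mask = build_mask([(3, [(100, 100), (400, 120), (250, 380)])])
```

The same mask can also be used to compute the per-class pixel counts behind the distributions in figure 2.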
2.1 Data Split

The data was split into a training and a validation set, with 204 images (85%) for training and 36 for validation. For the reasons discussed in section 2.2, we chose to split the data on a per-image basis. This posed a problem given the differences in inter-image class distributions and the relatively small number of images: for a random split it would be highly likely that the overall, training, and validation class distributions would differ strongly, which would affect the evaluation of the model performance.

In order to achieve balanced distributions for the training and validation sets, we created N candidate training/validation splits of constant size with randomly selected images, and compared the resulting training and validation distributions using a cosine distance weighted by the overall class distribution:

dist(v^1, v^2, w) = \frac{\sum_{i=1}^{d} w_i \, v_i^1 \, v_i^2}{\sqrt{\sum_{i=1}^{d} w_i \, (v_i^1)^2} \; \sqrt{\sum_{i=1}^{d} w_i \, (v_i^2)^2}}    (2)

From those N splits, we chose the one that resulted in the lowest distance. Afterwards, we selected one item from each set that, if swapped, resulted in the biggest decrease in the weighted cosine distance. Swaps were performed until no further decrease was possible by swapping individual items. The whole procedure - N random splits, choosing the best split, optimising by swapping - was repeated several times in order to increase the chance of finding a good final split. While this approach does not guarantee an optimal solution, it is fast and achieved satisfactory results for the given task. The three distributions (overall, training, validation) can be seen in figure 2.

2.2 Data Augmentation

Splitting the data at image level was mainly done because of how we prepared the data before feeding it into the neural network. Given the large spatial dimensions of the images, which result in both rich detail and many different coral types per image, it seemed sensible to use random cropping as the main preparation step. For each image loaded into memory during training, 16 random square crops with sizes between 400 × 400 and 1400 × 1400 were taken. The crops were then bi-linearly scaled to 256 × 256 and randomly flipped in the vertical and horizontal directions before being fed into the network. Both the flipping and the random cropping served as data augmentation and ensured that the network was exposed to a variety of relative coral sizes, orientations, and image compositions. This way, the network never saw exactly the same input twice, which reduced overfitting. For the validation data, random crops and re-scaling were performed prior to training and were thus always the same; this was done to make the metrics computed after each epoch comparable.

Fig. 2: Class distributions overall and for the training and validation splits
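As an illustration of the cropping and flipping augmentation of section 2.2, the sketch below samples random square crops, rescales them to 256 × 256 and flips them at random. The helper name and the use of OpenCV for resizing are assumptions made for the sake of a self-contained example, not our exact implementation.

```python
import numpy as np
import cv2

def random_crops(image, mask, n_crops=16, min_size=400, max_size=1400, out_size=256):
    """Sample square crops, rescale them to out_size and flip them at random.

    image: H x W x 3 uint8 array; mask: H x W uint8 array as in equation (1).
    """
    h, w = mask.shape
    crops = []
    for _ in range(n_crops):
        size = np.random.randint(min_size, max_size + 1)
        y = np.random.randint(0, h - size + 1)
        x = np.random.randint(0, w - size + 1)
        # Image crops are scaled bi-linearly; nearest-neighbour keeps the mask
        # values integral (an assumption, the paper only mentions the images).
        img_c = cv2.resize(image[y:y + size, x:x + size], (out_size, out_size),
                           interpolation=cv2.INTER_LINEAR)
        msk_c = cv2.resize(mask[y:y + size, x:x + size], (out_size, out_size),
                           interpolation=cv2.INTER_NEAREST)
        if np.random.rand() < 0.5:  # random horizontal flip
            img_c, msk_c = img_c[:, ::-1].copy(), msk_c[:, ::-1].copy()
        if np.random.rand() < 0.5:  # random vertical flip
            img_c, msk_c = img_c[::-1].copy(), msk_c[::-1].copy()
        crops.append((img_c, msk_c))
    return crops
```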
3 The model

We used DeeplabV3 [6], a deep convolutional neural network that improves over earlier semantic segmentation networks. In particular, DeeplabV3 extends Deeplab [5] and avoids the need for a post-processing machine learning model (such as a conditional random field). Nevertheless, since the challenge is evaluated over polygons, we needed to apply image processing techniques (in particular, morphological transformations and region-filling algorithms) in order to generate the final submission file. Note that these polygon-filling operations have more degrees of freedom than the training pre-processing and therefore add extra parameters to the model; they are described in more detail in section 3.5. A high-level operational diagram of the model and evaluation can be found below; more details are given in the next sections.

Fig. 3: Training and inference flows.

3.1 DeeplabV3

DeeplabV3 is a deep convolutional neural network (DCNN) for semantic image segmentation proposed by Chen et al. [6] in 2017. It is a state-of-the-art network architecture that, with pre-training on the ImageNet [7] and JFT-300M [10] datasets, achieved an mIoU of 86.9% on the PASCAL VOC 2012 test set [6]. The model consists of two parts. The first part is a feature-extracting backbone that is not strictly limited to a given type: in the paper the authors use a ResNet-50 and a ResNet-101, but other architectures can be employed as well. The second part is where the novelty lies, with extensive use of atrous convolutions. An atrous convolution kernel is defined by its normal size, say 3 × 3, and, in addition, an atrous rate that specifies how many zero values are inserted between the individual filter entries along the spatial dimensions. With no zeros inserted, the atrous convolution is the same as a normal convolution; inserting one zero between two neighbouring values of a 3 × 3 convolution gives it the same receptive field as a 5 × 5 convolution, while only employing 9 weights instead of 25. DeeplabV3 uses atrous convolutions to construct feature pyramids. Feature pyramids combine features from different scales into one feature map; this is done by applying atrous convolutions with different rates to the same feature map and concatenating the individual outputs into a new feature map. For our submission we used a PyTorch implementation of DeeplabV3 from [3] with a ResNet-101 backbone [8] and an output stride of 16.

Prior to the polygon-filling post-processing, the model outputs, for every pixel, a probability that the pixel belongs to each class, or more formally:

f^k_{ij}(c) = probability that pixel (i, j) of image k belongs to class c    (3)

3.2 Class imbalance and weighted loss function

As discussed in section 2, the class distribution of the data is highly skewed. This is a problem: if no counter-measures are in place, the model will focus on classifying the frequent classes correctly in order to achieve a lower error. One approach that is often used, and that we used here as well, is to weight the loss function based on the class distribution. We used a pixel-wise cross-entropy loss and weighted the individual components with the following weights:

w(c) = \frac{1}{\log(\alpha + p(c))}    (4)

with p(c) the relative occurrence of class c and α a hyper-parameter that scales the weights (in our submission α = 1.025 yielded good results). This form was chosen because there are orders of magnitude between the individual relative occurrences, and the model would over-emphasise the infrequent classes if we simply used the reciprocal 1/p(c). The final cross-entropy loss is:

\frac{1}{N \cdot \mathrm{height} \cdot \mathrm{width}} \sum_{k=1}^{N} \sum_{i=1}^{\mathrm{height}} \sum_{j=1}^{\mathrm{width}} \sum_{c=1}^{C} -y^k_{ij}(c) \cdot w(c) \cdot \log\!\left( \frac{\exp(f^k_{ij}(c))}{\sum_{c'=1}^{C} \exp(f^k_{ij}(c'))} \right)    (5)

where f^k_{ij}(c) and y^k_{ij}(c) denote the predicted confidence (f) and the ground truth (y) for image k at position (i, j) and class c respectively, and N is the number of images included in the loss.
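The weighting of equations (4) and (5) can be expressed with PyTorch's built-in cross-entropy, as in the minimal sketch below. The variable names are illustrative; note also that PyTorch's default "mean" reduction normalises by the sum of the per-pixel weights rather than by N × height × width, a small departure from equation (5).

```python
import torch
import torch.nn as nn

def make_weighted_loss(p, alpha=1.025):
    """Cross-entropy with per-class weights w(c) = 1 / log(alpha + p(c)).

    p: tensor of length 14 with the relative pixel frequency of every class
    (background plus the 13 substrates), estimated from the training masks.
    """
    weights = 1.0 / torch.log(alpha + p)
    return nn.CrossEntropyLoss(weight=weights)

# Usage with logits of shape (N, 14, 256, 256) and integer targets (N, 256, 256):
# criterion = make_weighted_loss(class_frequencies)
# loss = criterion(logits, targets)
```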
3.3 Training and Bootstrapping

We trained the neural network for 50 epochs with a batch size of 32 (2 images per batch, 16 crops per image) on an Nvidia GeForce GTX 1080 Ti. After training, we used the network to predict the training images and cropped out areas where the network performed particularly badly. The network was then trained on these cropped regions for another 30 epochs.

3.4 Inference

In order to predict a full-sized image at inference time, we used a sliding-window approach. With window sizes of 500 × 500, 1000 × 1000, and 1500 × 1500 and corresponding step sizes of 400, 800, and 1200, we cut each 4032 × 3024 image into 112 partially overlapping sections. Each section was then scaled to 256 × 256 and fed into the neural network. The results were scaled back to their original size and added at their respective positions to a 4032 × 3024 × 14 confidence matrix C. For each position C_{i,j}, the number of votes (i.e. how often a given pixel was predicted) was stored so that the average confidence could be calculated subsequently. The final classification for pixel (i, j) was then given by

c = \arg\max_{k \in \{0, \ldots, 13\}} C_{i,j}(k)    (6)

By using sliding windows with different window sizes we made sure that each pixel was predicted at several different resolutions and thus with different amounts of context.

3.5 Post-processing

After computing the classification mask for a predicted image, we used several basic computer vision algorithms to post-process and transform the data into the required submission format, as sketched in the code after this list:

1. Find connected components.
2. Morphological opening with kernel size (31, 31).
3. Morphological closing with kernel size (31, 31).
4. Flood fill.
5. Polygon approximation using the Douglas-Peucker algorithm [2], with maximum distance to the correct output ε equal to 0.1% of the contour arc length of the connected component.

We used the OpenCV 3.4.2 implementations [1] of the corresponding algorithms.
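The sketch below is a simplified illustration of the post-processing chain of section 3.5, assuming OpenCV and a per-image integer class mask as input. It compresses steps 1-4 into morphological opening/closing followed by external-contour extraction (which also discards interior holes, standing in for the explicit connected-component and flood-fill steps); the submission-file formatting is omitted.

```python
import cv2
import numpy as np

def mask_to_polygons(class_mask, kernel_size=31, eps_fraction=0.001):
    """Approximate the per-class regions of an integer mask by polygons."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    polygons = {}
    for c in range(1, 14):                         # skip background (class 0)
        binary = (class_mask == c).astype(np.uint8)
        if not binary.any():
            continue
        # Morphological opening then closing with a (31, 31) kernel.
        binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
        binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
        # External contours per connected component; the [-2] index keeps
        # compatibility with both the OpenCV 3.x and 4.x return signatures.
        contours = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                    cv2.CHAIN_APPROX_SIMPLE)[-2]
        approx = []
        for cnt in contours:
            # Douglas-Peucker with epsilon = 0.1% of the contour arc length.
            eps = eps_fraction * cv2.arcLength(cnt, True)
            approx.append(cv2.approxPolyDP(cnt, eps, True))
        polygons[c] = approx
    return polygons
```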
4 Final results and further work

The average intersection over union (mIoU) for each class on the test set, as reported in the official ImageCLEF 2019 coral competition, is given in Table 2. Soft corals and hard corals (boulder) performed relatively well in comparison to the other classes, which was expected given their class abundance, as shown in figure 2. Surprisingly, the mushroom class reached an IoU of 21.9% in spite of the small number of samples. A full analysis and visualisation of the results per class is left for future investigation.

There are a number of areas where our pipeline can potentially be improved and investigated in the future:

– Increasing the input size to the model
– Increasing the batch size
– Testing different model backbones
– Tuning hyper-parameters (both for training and for post-processing)
– Investigating the impact of crop sizes
– Investigating the impact of bootstrapping
– Trying different methods to counteract the class imbalance

Substrate                    mIoU (%)
Hard Coral - Branching       9.58
Hard Coral - Submassive      0.0
Hard Coral - Boulder         16.59
Hard Coral - Encrusting      4.46
Hard Coral - Table           0.0
Hard Coral - Foliose         0.65
Hard Coral - Mushroom        21.9
Soft Coral                   13.0
Soft Coral - Gorgonian       1.86
Sponge                       5.73
Sponge - Barrel              8.89
Fire Coral - Millepora       0.0
Algae - Macro or Leaves      0.07

Table 2: mIoU per class

References

1. Open source computer vision library 4.1.0. https://docs.opencv.org/4.1.0/, 2019.
2. OpenCV contour features. https://docs.opencv.org/3.1.0/dd/d49/tutorial_py_contour_features.html, 2019.
3. pytorch-deeplab-xception. https://github.com/jfzhang95/pytorch-deeplab-xception, 2019.
4. J. Chamberlain, A. Campello, J. P. Wright, L. G. Clift, A. Clark, and A. García Seco de Herrera. Overview of ImageCLEFcoral 2019 task. In CLEF2019 Working Notes, volume 2380 of CEUR Workshop Proceedings, 2019.
5. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834-848, 2018.
6. L. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.
7. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
8. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, June 2016.
9. B. Ionescu, H. Müller, R. Péteri, Y. D. Cid, V. Liauchuk, V. Kovalev, D. Klimuk, A. Tarasau, A. B. Abacha, S. A. Hasan, V. Datla, J. Liu, D. Demner-Fushman, D.-T. Dang-Nguyen, L. Piras, M. Riegler, M.-T. Tran, M. Lux, C. Gurrin, O. Pelka, C. M. Friedrich, A. G. S. de Herrera, N. Garcia, E. Kavallieratou, C. R. del Blanco, C. C. Rodríguez, N. Vasillopoulos, K. Karampidis, J. Chamberlain, A. Clark, and A. Campello. ImageCLEF 2019: Multimedia retrieval in medicine, lifelogging, security and nature. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), Lugano, Switzerland, September 9-12 2019. LNCS Lecture Notes in Computer Science, Springer.
10. C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.