A Two-Staged Approach for Localization and Classification of Coral Reef Structures and Compositions

Kirill Bogomasov, Philipp Grawe, and Stefan Conrad
Heinrich Heine University, Universitätsstraße 1, 40225 Düsseldorf, Germany
http://dbs.cs.uni-duesseldorf.de
{bogomasov,grawe,stefan.conrad}@hhu.de

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

Abstract. In this paper we present the approaches that achieved first place in this year's ImageCLEFcoral challenge. The task of the challenge was the localization and classification of corals within images of sea ground: a bounding box has to be extracted for each coral and labeled with its specific substrate type. We applied a state-of-the-art deep learning approach (YOLO) and additionally developed a two-staged approach that uses a grid along with two classifiers, one that classifies the tiles of the grid and one that classifies the resulting boxes. We obtained moderate results with YOLO and found that locating the corals is the most challenging part. Class imbalance and intersecting boxes made the problem even harder.

Keywords: Image Segmentation · Image Classification · Object Localization

1 Introduction

Climate change is one of the major problems of the 21st century. Its impact grows every year and is therefore studied intensively. Since corals are a significant part of the maritime environment, they are affected by climate change in many ways [6]. Corals have their own self-contained ecosystem, developed over many decades, which is why damage to coral reefs can have serious consequences for every maritime organism. Every year the danger of a complete destruction of coral reefs becomes more realistic. To plan protective measures in a well-founded way, a survey of the current stocks is required. For this purpose, images of the sea ground are currently viewed and annotated manually, which is nearly impossible for the whole surface area under consideration. This raises the question of whether an automatic localization and annotation of corals is feasible, and we address this question in this paper. We use this year's ImageCLEFcoral dataset [2] as the base for our research and also participated in the corresponding challenge, which is part of ImageCLEF 2019 [8].

The task can be divided into two logical subtasks, the localization and the classification of objects. This is a widespread research field of computer science with many fields of application. The automotive industry is an obvious and commercially relevant example [3]: driver assistance systems are ubiquitous today, and the recognition of road signs is part of them. First, images of the road are taken by a vehicle camera while driving; then road signs are located and classified in these recordings. The best results in such application scenarios are achieved by artificial neural networks, and YOLO [11] showed some of the best results. This application scenario transfers very well to our task: the images are taken automatically and contain corals at unknown positions.

2 Data

The training set, which we define as dataset (A), contains 240 images with 6670 annotated substrates. In general, 13 substrate types are distinguished.
These are "Hard Coral – Branching, Hard Coral – Submassive, Hard Coral – Boulder, Hard Coral – Encrusting, Hard Coral – Table, Hard Coral – Foliose, Hard Coral – Mushroom, Soft Coral, Soft Coral – Gorgonian, Sponge, Sponge – Barrel, Fire Coral – Millepora and Algae – Macro or Leaves" [2]. For the submitted runs, a test set containing 200 raw images was used, whose correct labels and boxes were not available at the time of writing.

Table 1: Substrate types with their relative frequency in the training set.

  Class label                 Relative frequency
  c_algae_macro_or_leaves     0.0046
  c_fire_coral_millepora      0.0015
  c_hard_coral_boulder        0.1549
  c_hard_coral_branching      0.1280
  c_hard_coral_encrusting     0.0528
  c_hard_coral_foliose        0.0082
  c_hard_coral_mushroom       0.0258
  c_hard_coral_submassive     0.0031
  c_hard_coral_table          0.0009
  c_soft_coral                0.5223
  c_soft_coral_gorgonian      0.0024
  c_sponge                    0.0808
  c_sponge_barrel             0.0145

The substrate types have an unbalanced distribution, as shown in Table 1. Furthermore, the quality as well as the resolution of the images varies. Some of the images contain a white measurement line, which is an obstacle for image processing.

2.1 Investigating the Dataset

When investigating the dataset, the problem of overlapping boxes became apparent: many bounding boxes fully contain or intersect with other boxes. To be more specific, only 2672 of the 6670 bounding boxes neither overlap with nor are contained in a larger one. For this reason we investigated whether the substrate types can be distinguished at all and searched for meaningful features, which we extracted from the annotated bounding boxes. We applied a classical approach using SIFT [9]. In another approach we calculated texture [5], structure [7] and color histograms. We used the calculated features to train a k-Nearest Neighbors classifier, with the neighborhood size k varied over {k ∈ N | 3 ≤ k ≤ 25}; setting k to a higher value would let the frequent classes strongly dominate the neighborhood. With a train/validation split of 80:20 we obtained our best results with a combination of texture, structure and color features. The results showed that the rare substrate types are essentially never found, and we did not succeed in improving on this.

2.2 Augmentation

The amount of given data is remarkably low. Even a pre-trained neural network usually requires a larger data set, which is why we decided to use data augmentation. To generate new data we applied noise and blur [1]. Other augmentation methods did not seem practical, since they would change the bounding boxes. In this way we generated a second dataset (B) and tripled the size of the data set. Within the new dataset, which consists of substrate bounding boxes, we kept the class distribution, since the probability of finding a frequently represented substrate type is significantly higher than that of a rare one, and because balancing the dataset would require cutting down the frequent substrate types, which did not seem appropriate given the low number of annotated corals.

2.3 Sharpening

The images vary in quality and many of them are out of focus or blurry. To counter this and to create an improved dataset (C), we increased the contrast of the entire images and highlighted the details. For this purpose, each pixel value was replaced by the weighted average of its 3 × 3 neighborhood. The following matrix shows the filter:

  -1  -1  -1
  -1  12  -1
  -1  -1  -1
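To illustrate the filter, a minimal sketch of how such a kernel could be applied with OpenCV is given below. The file name is a placeholder and the normalisation by the kernel sum (so that the result is the weighted average described above) is our assumption, not necessarily the exact implementation used to build dataset (C).

    import cv2
    import numpy as np

    # 3x3 sharpening kernel from Section 2.3 (centre weight 12, neighbours -1).
    # Dividing by the kernel sum (4) yields the weighted average of the
    # neighborhood; whether the original run normalised this way is assumed.
    kernel = np.array([[-1, -1, -1],
                       [-1, 12, -1],
                       [-1, -1, -1]], dtype=np.float32) / 4.0

    image = cv2.imread("coral_image.jpg")        # placeholder file name
    sharpened = cv2.filter2D(image, -1, kernel)  # -1 keeps the input bit depth
    cv2.imwrite("coral_image_sharpened.jpg", sharpened)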
3 Approaches

The challenge of the annotation and localization task is to find corals within images of sea ground, define a bounding box for each coral and label it with its specific substrate type. We applied one state-of-the-art deep learning approach and additionally developed our own. Both are presented in the following subsections, with the focus on explaining our own approach.

3.1 YOLO

In contrast to comparable neural networks such as Fast R-CNN [4], which locate and classify objects multiple times for various regions of an image, the YOLO architecture processes the whole input image at once. This is achieved by dividing each image into square cells inside of which bounding boxes are predicted. In our work we scaled the input images to 608 × 608 pixels because of the many corals contained in each image; this is the largest resolution we could test on our GPU and it was the most promising. The classification process is essentially a regression problem that maps image pixel values to bounding boxes and their class probabilities in one pass. Part of the training is the optimization of the predicted class probabilities that define the bounding boxes. In doing so, the calculation of each box considers features of the entire image, so YOLO has the advantage of making fewer background errors than R-CNN, because more context information is taken into account. YOLO also outputs a confidence, calculated as the product of the probability that the box contains an object and its intersection over union (IoU) with the ground truth. In a later step, this is multiplied by the conditional class probability of the object. The resulting confidence describes both how probable the particular class of the box is and how well the predicted bounding box fits this particular object.

Limitations

There are, however, some limitations. On the one hand, each cell of the grid predicts only two boxes, which share the same class label. This is an algorithmic limit on the number of objects with different labels that can be detected when the objects are close to each other. On the other hand, the authors of YOLO mention that errors in small bounding boxes are treated the same way as errors in large ones. Because of that, errors in small boxes have a larger impact on the IoU, which leads to incorrect localizations.
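The following minimal sketch illustrates how the scores described above combine into a class-specific confidence. The box format (x1, y1, x2, y2) and the example values are ours for illustration, not taken from the YOLO implementation.

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter)

    # Box confidence = P(object) * IoU with the ground truth; multiplying by the
    # conditional class probability yields the class-specific confidence.
    p_object, p_class = 0.9, 0.7   # illustrative values
    box_confidence = p_object * iou((10, 10, 50, 50), (15, 12, 48, 55))
    class_confidence = box_confidence * p_class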
3.2 Own Developments

We developed a two-staged approach that first locates and then labels the substrates. Both of these steps make use of machine learning, more precisely of classification algorithms. This leaves room to improve the classification task, e.g. by evaluating different classification algorithms.

Fig. 1: Flow diagram of our approach: raw image → resize image to fit grid → classify tiles → find connected components → delete small areas (< 10 tiles) → extract boxes → classify area → substrate label and box coordinates. Red clouds are input data and green ones output data; the blue boxes are the steps described in the text.

One advantage of this two-staged approach is that the two stages are independent of each other, which makes it possible to combine different algorithms and approaches. The overall procedure is shown in Figure 1.

Locating Substrates

The main idea behind locating substrates is based on the assumption that the coral images consist of coral and non-coral areas and that the non-coral areas look relatively similar for all coral types. This is quite different from images showing objects such as cars or birds. Following this assumption, we segment an image into coral and non-coral areas. First the image is divided into a grid that is fine enough to predict all boxes, and features are extracted from the tiles of this grid. We used square tiles of a fixed size, based on the size of the smallest boxes in the training set, i.e. the integer average of the smallest width and height. Based on the training set we recommend a tile size of 12 × 12, so that the smallest box can be covered completely without background. To ensure that all tiles of an image have the same size, i.e. that the image size is a whole multiple of the tile size, the image is scaled to fit.

Next, we extracted features for every tile of each training set image. For this purpose we used concatenated feature vectors consisting of features describing color, texture and shape. For color, normalized histograms are used, which describe the color characteristics [10]. The textural features of a tile are modeled by Haralick texture features [5], which apply co-occurrence matrices on the gray-scale level. Lastly, the shape is represented by Hu moments [7], which are invariant to translation, rotation and scale. All of these characteristics are useful for the domain of coral images. These features are used to train a binary classifier that decides whether a tile is a coral area or a non-coral area. A tile of a training set image is considered a coral area if more than 50% of its area intersects with a bounding box of the ground truth. To classify the areas of an image to be predicted, this image is also divided into tiles of the same previously defined size and the labels are obtained by feeding the tile features into the learned classifier. We decided to use k-Nearest Neighbors with k = 15 to classify the tiles, because this value performed best on our validation split.

After each tile of the grid has been classified, we obtain a black-and-white image with 12 × 12 pixel tiles. There are multiple strategies to extract boxes from the resulting image. We used a relatively naive approach based on connected-component labeling. Since we observed a large number of single, unconnected tiles, we only kept components consisting of more than ten tiles; this counters the limited performance of our tile classifier. Each remaining component is then enclosed by a bounding box that borders the outermost tiles of the component. Figure 2 shows the different stages of the localization process (b–d) as well as the ground truth (a).

Fig. 2: Visualized process of localizing corals with our approach: (a) ground-truth boxes, (b) inside (white) and outside (black) tiles, (c) refined inside and outside areas, (d) bounding boxes of connected components. The raw picture is taken from the ImageCLEFcoral dataset [2].
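As an illustration of the localization stage, the sketch below assembles the steps described above: tile features (color histogram, Haralick texture, Hu moments), a binary k-NN decision per tile, connected-component labeling, and the minimum component size of more than ten tiles. The concrete libraries (OpenCV, mahotas, scikit-learn, SciPy), the histogram bin counts and the training of the classifier are our assumptions, not the exact implementation; the tile size follows the recommendation of 12 × 12 (later increased to 24 × 24 for runtime reasons, see Section 4).

    import numpy as np
    import cv2
    import mahotas
    from scipy import ndimage
    from sklearn.neighbors import KNeighborsClassifier

    TILE = 12  # tile edge length derived from the smallest training boxes

    def tile_features(tile_bgr):
        """Concatenated color, texture and shape features of one tile."""
        gray = cv2.cvtColor(tile_bgr, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([tile_bgr], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256]).flatten()
        hist /= hist.sum() + 1e-9                               # normalized color histogram
        texture = mahotas.features.haralick(gray).mean(axis=0)  # Haralick texture features
        shape = cv2.HuMoments(cv2.moments(gray)).flatten()      # Hu moments
        return np.concatenate([hist, texture, shape])

    def locate_boxes(image_bgr, clf, min_tiles=10):
        """Return (x1, y1, x2, y2) boxes for connected coral-tile components."""
        rows, cols = image_bgr.shape[0] // TILE, image_bgr.shape[1] // TILE
        image_bgr = cv2.resize(image_bgr, (cols * TILE, rows * TILE))  # fit the grid
        coral = np.zeros((rows, cols), dtype=bool)
        for r in range(rows):
            for c in range(cols):
                tile = image_bgr[r * TILE:(r + 1) * TILE, c * TILE:(c + 1) * TILE]
                coral[r, c] = clf.predict([tile_features(tile)])[0] == 1
        components, n = ndimage.label(coral)
        boxes = []
        for comp in range(1, n + 1):
            rs, cs = np.where(components == comp)
            if rs.size <= min_tiles:            # drop components of at most ten tiles
                continue
            boxes.append((cs.min() * TILE, rs.min() * TILE,
                          (cs.max() + 1) * TILE, (rs.max() + 1) * TILE))
        return boxes

    clf = KNeighborsClassifier(n_neighbors=15)  # to be fitted on labeled tile features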
Labeling Found Boxes

The found bounding boxes were classified with a k-Nearest Neighbors classifier on the features presented in Section 2.1. In addition to this classification approach, we studied whether the boxes can also be classified using a convolutional neural network. For this purpose we subdivided the training data into a training and validation set at a ratio of 80:20, as before. For comparability of the results we scaled the input data to the size of our grid tiles. In consideration of the low amount of image data, we began with a correspondingly small CNN, which we call the baseline. This CNN consists of one convolutional layer with max pooling and rectified linear activation. We use dropout to prevent overfitting; neurons are deactivated with a probability of 20%. Subsequently, the data is passed to a flattening layer, which serves as the connection between the convolutional and the following dense layers. The result first enters a dense layer with ReLU as the activation function and is then passed on to a dense layer with softmax as the activation function. This yields, for each bounding box, a confidence of belonging to each of our 13 classes. Considering that such a simple architecture may not be able to "remember" all relevant features of coral images, we extended the baseline architecture with two additional convolutional hidden layers. Finally, we added an extra deep architecture for comparison: we chose VGG19 [12] and trained it on our data via transfer learning, since it has proven to be a gold standard in recent years. All networks were trained with a batch size of 100 and up to 1000 epochs.
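A minimal sketch of the baseline architecture described above, assuming a Keras implementation with 24 × 24 RGB inputs; the number of filters, the size of the dense layer and the optimizer are not stated in the text and are therefore our assumptions.

    from tensorflow.keras import layers, models

    def baseline_cnn(input_shape=(24, 24, 3), num_classes=13):
        """One convolutional layer, max pooling, 20% dropout, dense + softmax."""
        model = models.Sequential([
            layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
            layers.MaxPooling2D((2, 2)),
            layers.Dropout(0.2),               # deactivate neurons with 20% probability
            layers.Flatten(),                  # connects convolutional and dense part
            layers.Dense(64, activation="relu"),
            layers.Dense(num_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # model = baseline_cnn()
    # model.fit(x_train, y_train, batch_size=100, epochs=1000, validation_split=0.2)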
4 Evaluation and Results

The following section discusses the runs submitted to ImageCLEF 2019. For a better understanding of the results of the approaches, we evaluated localization and labeling separately. The results show that YOLO is considered state-of-the-art for a reason. Besides presenting the results of our submissions, we also discuss the limitations and potentials of our approach.

Table 2 presents the results of our submissions. Our own approach is marked with I, the YOLO-based submissions with II. MAP 0.5 stands for the localized mean average precision of each submitted method with an IoU ≥ 0.5 against the ground truth and R 0.5 for the corresponding recall; MAP 0 represents the image annotation average without any localization. The results for I show that CNNs and k-NN deliver comparable results. Sharpening did not lead to better results, perhaps because it also accentuates noise. YOLO combined with a statistical probability distribution provides the best results, with a precision of 0.243 and a recall of 0.131. Table 2 also shows that we only worked on datasets (A) and (C); we did not use dataset (B) for our submissions, since it did not lead to any kind of improvement.

Table 2: Results of our submitted runs at ImageCLEF 2019, comparing the different approaches. The methods under I use our two-staged approach, those under II use YOLO.

      Approach               Dataset  MAP 0.5  R 0.5  MAP 0
  I   k-NN, k = 13           A        0.003    0.004  0.272
      Statistical labeling   A        0.002    0.003  0.203
      Baseline CNN           A        0.003    0.004  0.228
      Transfer learning      A        0.003    0.004  0.291
      3-layer CNN            A        0.003    0.004  0.205
  II  YOLO + k-NN            A        0.229    0.131  0.500
      YOLO + statistical     A        0.243    0.131  0.488
      YOLO + k-NN            C        0.210    0.122  0.455
      YOLO + statistical     C        0.220    0.122  0.442

All approaches we used have some limitations and therefore leave room for improvement; we describe some of them in the following.

YOLO

The weakness of YOLO is evident on rather small bounding boxes. Predictions on the validation data set showed that small coral substrates are either not found or subsequently labeled incorrectly, which results in the low recall value of 0.131. Corals that are found, however, are mostly labeled as c_soft_coral. Nevertheless, even on larger corals YOLO shows rather moderate results. In a quarter of the images it did not find any boxes at all, which is why we used the boxes found by our approach I to complete those results.

Our Approach

Not only the performance (see Table 2) reveals flaws in our approach, but also some obvious conclusions do. Since we reached an accuracy of 0.534 on labeling boxes, evaluated on an 80:20 split of the training set, we conclude that our approach mainly fails to locate the corals correctly. We also tested SIFT features, which reached an accuracy of 0.4744. One problem of the two-staged approach is the assumption that the features of coral and non-coral tiles are sufficiently distinct. This leaves room for further evaluation and research regarding the choice of features and labels. It might be beneficial to use more than two labels, i.e. more than just coral and non-coral. One candidate is water in the background, since we discovered that background water is often classified as a false positive, i.e. as a coral area. An additional label would, however, require additional annotation. The question arises whether 14 labels (13 substrate classes plus background) could be used; this approach would only need one classifier instead of two. Regarding the tile classification, the use of CNNs to classify the tiles sounds promising because of the high number of tiles.

Another issue with our approach is the size of the tiles. Big tiles prevent small boxes from being found and increase the chance of two corals falling into one tile. From a design perspective, tiles should be as small as possible to make the boxes as precise as possible. But if the tiles are chosen relatively small, not only does the computation time grow, the features also contain less information, which could lead, for example, to shapes no longer being recognized. We encountered the problem of an enormous computation time, which is why we increased the tile size to 24 × 24. We also reduced the training set of tiles by 80%, which does not decrease the performance significantly, as shown in Table 3. A sliding-window approach should also be considered in future work.

Table 3: Tile classification performance using 20% of the training set tiles compared to all tiles.

  Amount of  Precision          Recall             F1 score
  dataset    Non-coral  Coral   Non-coral  Coral   Non-coral  Coral
  1.0        0.6916     0.4950  0.7664     0.4013  0.7271     0.4433
  0.2        0.6907     0.4900  0.7603     0.4034  0.7238     0.4425

Lastly, using connected components to extract boxes from the tile images may not be sophisticated enough. Firstly, even with a perfect labeling of coral and non-coral tiles it would not be able to recognize boxes lying inside other boxes. Secondly, it only considers the two predicted labels as features. The use of density-based clustering, working on more than just the predicted labels, could lead to better results.
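To make the density-based clustering suggestion concrete, the following sketch (our illustration, not part of the submitted runs) clusters the coordinates of tiles predicted as coral with scikit-learn's DBSCAN and derives one box per cluster; isolated tiles end up as noise, similar to dropping small connected components. The tile coordinates and parameters are invented example values.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # (row, col) positions of tiles predicted as coral; the values are illustrative
    coral_tiles = np.array([[3, 4], [3, 5], [4, 4], [4, 5],
                            [10, 12], [10, 13], [11, 12]])

    # eps is measured in tile units; min_samples plays the role of the minimum
    # component size used with connected-component labeling
    clustering = DBSCAN(eps=1.5, min_samples=4).fit(coral_tiles)
    for label in sorted(set(clustering.labels_) - {-1}):   # -1 marks noise tiles
        members = coral_tiles[clustering.labels_ == label]
        (r1, c1), (r2, c2) = members.min(axis=0), members.max(axis=0)
        print("box in tile coordinates:", (c1, r1, c2 + 1, r2 + 1))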
5 Conclusion

Overall, our approaches show moderate results. The idea of using neural networks proved to be promising; however, in retrospect YOLO was not the best choice, since it completely fails to find smaller bounding boxes. The concept of feature engineering, i.e. searching for features or feature constellations that are able to describe and represent the different types of benthic substrate, still seems useful given the small amount of data, but there is a lot of room for improvement. Besides that, there are multiple images in the data set that show the same sea ground and contain for the most part the same corals. This kind of information could be used locally to improve the bounding boxes of the corals, since their position can be tracked.

With regard to labeling coral and non-coral areas (cf. Figure 2b), we conclude that the chosen features do not work properly. We probably need a kind of feedback loop to mark wrongly labeled areas and process the images multiple times. Additionally, we could search our feature set for a more performant subset using boosting. We would also stick to the deep learning approach and try another, perhaps more time-consuming but more precise neural network, such as an R-CNN. Finally, we see the most potential in the combination of deep learning and classic feature engineering.

Another point of potential improvement is the correction and balancing of the data set itself. Currently, seven of the 13 coral type classes have a relative frequency of less than two percent, and six of them even less than one percent. The quality of the pictures varies strongly as well, and some of the images do not even seem to be completely annotated. For future approaches we would recommend publishing a larger and more balanced data set in which each class has roughly the same number of representatives.

To return to the initial question of whether an automatic localization and annotation of corals is feasible, we see good chances for future research.

References

1. Bloice, M.D., Stocker, C., Holzinger, A.: Augmentor: an image augmentation library for machine learning. arXiv preprint arXiv:1708.04680 (2017)
2. Chamberlain, J., Campello, A., Wright, J.P., Clift, L.G., Clark, A., García Seco de Herrera, A.: Overview of the ImageCLEFcoral 2019 task. In: CLEF 2019 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org (2019)
3. Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark (2009)
4. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1440–1448 (2015)
5. Haralick, R.M., Shanmugam, K., et al.: Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics (6), 610–621 (1973)
6. Hoegh-Guldberg, O., Mumby, P.J., Hooten, A.J., Steneck, R.S., Greenfield, P., Gomez, E., Harvell, C.D., Sale, P.F., Edwards, A.J., Caldeira, K., et al.: Coral reefs under rapid climate change and ocean acidification. Science 318(5857), 1737–1742 (2007)
7. Hu, M.K.: Visual pattern recognition by moment invariants. IRE Transactions on Information Theory 8(2), 179–187 (1962)
8. Ionescu, B., Müller, H., Péteri, R., Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran, M.T., Lux, M., Gurrin, C., Cid, Y.D., et al.: ImageCLEF 2019: Multimedia retrieval in lifelogging, medical, nature, and security applications. In: European Conference on Information Retrieval. pp. 301–308. Springer (2019)
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
10. Pass, G., Zabih, R., Miller, J.: Comparing images using color coherence vectors. In: ACM Multimedia. vol. 96, pp. 65–73 (1996)
11. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 779–788 (2016)
12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)