VGG16-based approach for side-scan sonar image analysis

Antoni Jaszcz
Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, Poland

IVUS 2022: 27th International Conference on Information Technology
aj303181@student.polsl.pl (A. Jaszcz)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

Abstract
Side-scan sonar (SSS) images are based on the reflection of a signal from underwater objects. As a result, such data may contain a lot of noise or ambiguous objects to be analyzed. In this paper, we propose a simple system for analyzing such images and classifying the objects in them. For this purpose, a convolutional neural network and transfer learning (VGG-16) were used. The network model was preceded by a process of dividing the sonar image into smaller fragments, in order to avoid omitting objects when reducing the image size. The proposed solution was tested on a dedicated database, which made it possible to evaluate the proposal and reach a high accuracy with the used network. The obtained research results were analyzed and discussed with regard to the possibility of implementing such a model in practice.

Keywords
Side-scan sonar, transfer learning, geospatial data, VGG16, machine learning

1. Introduction

Recent years have brought enormous growth in image analysis through the use of artificial neural networks. In particular, this is due to convolutional neural networks (CNNs), which automatically detect features and perform classification. However, the use of such networks comes with additional requirements. Before classification, each network must undergo a training process on a dedicated database, and for the network to achieve high accuracy scores, it requires an enormous amount of data; hence, neural networks are called data-hungry algorithms. Quite often, however, the amount of data is small and it is difficult to obtain more samples. For this purpose, augmentation [1] or transfer learning [2] techniques can be used.

An interesting case of such images are those gathered underwater [3, 4]. An example is side-scan sonar (SSS), whose images are obtained by visualizing the signal reflected from objects. Such data are exposed to large amounts of noise, so an important element is the construction of neural-network-based systems that increase the quality of these images [5]. In that paper, the authors proposed a solution based on a generative adversarial network: the image was processed by down-sampling and then recreated with an up-sampling approach. A similar idea was presented in [6], where a new type of such network, called S2RGAN, was modeled. However, the detection and classification of objects is also important, since such analysis can bring information about the state of the seafloor. One approach is to analyze smaller parts of the image to find a region of interest and then apply neural networks to those areas [7]. Again, in [8], automatic overlapping and segmentation techniques were developed. Transfer learning based on YOLOv5 was also used to detect objects in SSS images [9]. Similar research was conducted with different CNN models for many types of images [10, 11]. Again, in [12], the authors used a similar approach but classified small images. Segmentation and classification tools can also be used for bottom tracking [13, 14]. Segmentation of the image can be done with a recurrent residual CNN and a self-guidance module [15, 16]. A field-programmable gate array for SSS images based on neural networks was proposed in [17]. SSS data are used not only for finding objects but also to combine their features to create a surface in a 3D projection [18]. Another application is underwater communication for compressed SSS transmission [19].

In this paper, we propose a simple system based on automatically splitting SSS images into smaller parts and using them to train VGG-16. As a result, an automatic system for classifying sonar data is modeled. The main contributions of this paper are:

• a methodology for analyzing SSS images,
• the use of a pre-trained model, VGG-16, to classify SSS data,
• a method of automatically adding a sample to the database if the probability of belonging is high enough, which allows increasing the number of samples in the database.

2. Methodology

In this section, the proposed system for analyzing side-scan sonar images is described (Fig. 1). The incoming sonar data are split into smaller fragments and then processed by a convolutional neural network. If the classification result indicates one of the classes with a high probability (higher than 0.9), the sample is placed in the database and used in the next training.

Figure 1: Visualization of the proposed approach based on the pre-trained convolutional network model

2.1. Image division

At first, the image is processed, because the sample size has to be reduced to that of the first network layer before further processing. A large sonar image, when reduced in size, can distort or simplify certain elements; as a result, this may cause much worse learning outcomes and subsequent classification. To prevent this, we propose a simple algorithm for subdividing the sonar image into smaller fragments (see Alg. 1). The main idea is to first cut the image in half into two samples; the reason is the area along the center of the image that was not visible to the sonar. As a result, two samples are created from one image, presenting the left and right sides of the sonar. Then each sample is divided by the specified height and passed on. Note that if, after the cutting process, a sample is larger than the first layer of the network, it will be reduced to that size.

Algorithm 1: SSS-image division algorithm
Data: sonar image
Result: image samples of a given size
1  while not past the bottom of the image do
2      while not past the right edge of the image do
3          cut and save a sample of the desired shape    // 256x256 pixels
4          move right by the desired amount of pixels    // 100 pixels
5      end
6      move back to the left edge and shift down by the desired amount of pixels    // 100 pixels
7  end

Figure 3: Windowing technique visualised on a small part of an image: (a) part of a larger image, (b) cut-out fragment indicated by the red frame, (c) cut-out fragment indicated by the green frame, (d) cut-out fragment indicated by the blue frame

2.2. CNN

The split image is processed by a convolutional neural network. Its structure consists of three different layer types: convolutional, pooling, and fully connected (dense). The convolutional layer changes the image I with a filter k (a matrix of size p × p). This is done by the convolution operator (here marked as *) defined as:

    k * I_{x,y} = Σ_{i=1}^{p} Σ_{j=1}^{p} k_{i,j} · I_{x+i-1, y+j-1}.    (1)

The values of the filter k are found during the training process. The main task of this filter is to modify the image and extract features from it (the changed image is therefore called a feature map).

The second layer type is known as pooling, and it has one task: reducing the image size. This reduction is based on mathematical functions like max(·). The operation is understood as the selection of one pixel in a grid that satisfies this function. Of course, the grid is moved over the entire image until the last pixel is covered by it. The minimum size of such a grid is 2 × 2.

The third layer type is the fully connected one, which presents a classic column of neurons that receive numerical values and pass them on to the next neurons through connections. Each connection between two neurons has a weight w, and this value is modified in the training process. The mathematical formulation of it is:

    f( Σ_{i=0}^{n-1} w_{i,j} · x_i ),    (2)

where f(·) is an activation function and i, j are the indexes of two neurons in adjacent layers.

The main reason for using a CNN for image recognition is feature extraction. The more consecutive layers the neural network has, the more abstract the features that can be considered. The basic features extracted first by the model are generally lines along the axes of the image, i.e. vertical, horizontal and diagonal lines. Then more advanced characteristics are discovered, like shapes: rectangles, circles, straight lines, etc. The deeper the model goes, the more complex and abstract the newly extracted features become, to the point that we as humans do not even consider them or are unaware of them.

In the proposed methodology, we use the pre-trained model VGG-16 [20], which is presented in Fig. 2. The problem of object recognition can be universalized to some extent in terms of object detection, which is why pre-trained models are commonly used. Such models are trained to extract unique features related to the sought objects. The most popular database for training those models is the ImageNet database, containing over a million manually labeled images from a thousand classes. By incorporating such pre-trained models, training time can be greatly reduced and the focus of the training can be shifted towards detection of specific objects rather than feature extraction. Some of the ImageNet pre-trained models include:

• the VGG family,
• ResNet50,
• Inception V3,
• Xception.

In our experiments, we chose VGG-16 because it is the most commonly used model, which makes comparison with other models in the field easier. Our methodology also incorporates samples assessed with certainty above a certain threshold, which benefited our research greatly.

Figure 2: Visualization of the pre-trained model known as VGG16

3. Experiments

In this section, the database, the CNN configuration, the obtained results and a discussion are presented.

3.1. Database and data preparation

In this paper, side-scan sonar images of a river floor were used.
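These images are divided into training samples according to Alg. 1. A minimal Python sketch of this windowing is given below; the 256-pixel window and 100-pixel stride follow the paper, while the function and variable names are illustrative and the image is assumed to be a NumPy array (e.g. as loaded by OpenCV).

```python
import numpy as np

def split_sonar_image(image: np.ndarray, size: int = 256, stride: int = 100):
    """Divide a sonar image into overlapping square samples (Alg. 1).

    The image is first cut in half into a left and a right sample
    (the central strip is the area not visible to the sonar), then a
    window of `size` x `size` pixels slides over each half, moving by
    `stride` pixels to the right and downwards.
    """
    half = image.shape[1] // 2
    samples = []
    for side in (image[:, :half], image[:, half:]):
        height, width = side.shape[:2]
        for y in range(0, max(height - size, 0) + 1, stride):
            for x in range(0, max(width - size, 0) + 1, stride):
                samples.append(side[y:y + size, x:x + size])
    return samples
```

For a 700 × 1100 image this yields 15 overlapping 256 × 256 samples per half; resizing samples that do not fit the network's first layer (mentioned in Section 2.1) is left out of the sketch.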
The data were gathered between two water channels in north-western Poland. In order for a deep learning model to identify objects on a riverbed, target samples were required. To achieve that, the objects had to be hand-picked and manually classified. To facilitate this process, the images were automatically split into smaller parts of a fixed size of 256 by 256 pixels. It is worth mentioning that those parts could overlap with one another: in this paper, two consecutive samples in the same row were 100 pixels apart, as were the considered rows (each following row was 100 pixels lower). This is performed by the simple image-division procedure shown in Alg. 1. The windowing technique is visualised in Fig. 3.

After obtaining the samples, they were manually classified into 3 categories, depending on what could be seen in the picture. Those categories are:

1. object - anything distinguishable as a larger object lying on the riverbed. That includes ship/vehicle wrecks (Fig. 4), logs and pipes (Fig. 5), etc. In other words, single objects of considerable size that appeared in the river.
2. sand - a plain surface of the riverbed, on which nothing particular can be picked out. An example of such a sample is shown in Fig. 6.
3. rubble - a plain surface of the riverbed with a visible and considerable amount of debris. The goal of adding this class was to make the model able to detect cluttered areas and distinguish them from clear ones.

Figure 4: Possible wreck of a car (object class)
Figure 5: Log or pipe (object class)
Figure 6: Sand (sand class)

In total, there were 665 samples in the final database, 55 of which were classified as objects, 352 as sand and 257 as rubble. Next, 226 samples were randomly chosen and put into the validation group. The class distribution of those objects can be read from Fig. 8, and is as follows:

• 36 object samples,
• 124 sand samples,
• 63 rubble samples.

The remaining 439 samples formed the training set for the neural network.

It is crucial to mention that, due to the nature of side-scan sonar images, some samples turned out to be inadequate, and there was a concern that they would bring nothing but confusion to the model. As a result, it was decided that a sample should be removed from the dataset if it was too confusing (in terms of which class it should belong to), if it contained a considerable amount of the boat's passage area (a thick black line stretching along the edges of a sonar image and crossing its center; an example can be seen in Fig. 7), or if its quality was concerning (the image was greatly distorted).

Figure 7: An example of a sample that was removed

In the experiments, a pre-trained VGG-16 model connected to a dense neural network with input augmentation was used. The structure of the model is as follows:

1. augmentation layer - random horizontal and vertical flips and random rotation,
2. VGG-16 layer,
3. deep neural layers:
   • flatten layer,
   • dense layer, 50 neurons, ReLU activation,
   • dense layer, 20 neurons, ReLU activation,
   • dropout layer (threshold: 0.5),
4. output layer - a dense layer with 3 neurons (one for each output class).

Please note that in this paper we consider multi-class detection, which also applies to the results. Thus, by "true", a correct class assignment is meant: at a given time during assessment, only one class is considered true and the others are false.
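The self-labeling step described in Section 2, where a sample is added back to the database when the network assigns one class a probability above 0.9, can be sketched as follows. This is a minimal NumPy illustration; the 0.9 threshold comes from the paper, while the function and variable names are my own.

```python
import numpy as np

def select_confident_samples(samples: np.ndarray,
                             probabilities: np.ndarray,
                             threshold: float = 0.9):
    """Pick samples whose highest class probability exceeds `threshold`.

    `probabilities` has shape (n_samples, n_classes), e.g. softmax
    outputs of the network. The selected samples, together with their
    predicted labels, can be appended to the training database for the
    next round of training.
    """
    confidence = probabilities.max(axis=1)   # best class score per sample
    labels = probabilities.argmax(axis=1)    # predicted class index
    mask = confidence > threshold
    return samples[mask], labels[mask]
```

With three classes (object, sand, rubble), a sample classified as, say, sand with probability 0.95 would be kept, while an ambiguous 0.4/0.35/0.25 prediction would be discarded.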
This one-vs-rest convention applies to every class; it is visible in Fig. 8, and its impact on the obtained metrics is displayed in Tab. 1.

3.2. Results

In Tab. 1, the calculated metrics (accuracy, precision, specificity, recall and F1-score) are displayed for each class, as well as their mean values. The values were calculated with the following formulas:

• Accuracy:
    α = (TP + TN) / (TP + FP + TN + FN),    (3)
• Precision:
    ψ = TP / (TP + FP),    (4)
• Recall:
    ρ = TP / (TP + FN),    (5)
• Specificity:
    σ = TN / (TN + FP),    (6)
• F1-score:
    1/f1 = 0.5 · (1/ψ + 1/ρ),    (7)

where
TP - true sample predicted as true,
TN - false sample predicted as false,
FP - false sample predicted as true,
FN - true sample predicted as false.

Figure 8: Multi-class confusion matrix

Table 1: Calculated metrics

class     precision  recall  specificity  F1-score  accuracy
object    0.7895     0.8333  0.9579       0.8108    0.9381
sand      0.8425     0.9685  0.7677       0.9011    0.8805
rubble    0.8810     0.5873  0.9693       0.7048    0.8628
average   0.8376     0.7964  0.8983       0.8056    0.8938

As seen in Fig. 8 and Tab. 1, the model achieves decent accuracy when detecting objects. It also performs very well when detecting non-object samples, which is indicated by the high specificity. In terms of identifying sand, the accuracy dropped; however, the recall is high, which suggests that the model correctly assigned most of the sand samples. This came at the cost of precision, which is not great. As can be observed in Fig. 8, 21 of the rubble samples were categorized as sand, which indicates that the model assigns the sand class too aggressively and requires some more balance. As a result, rubble's metrics are all fairly low, accuracy included. All of this implies that rubble detection is the weakest link of the model.

In Fig. 9 and Fig. 10, the trade-off between classifying the train and test groups (both derived from the aforementioned training group, the former containing the samples to learn from and the latter the samples to validate the outcomes) is presented in terms of loss (Fig. 9) and accuracy (Fig. 10) during the subsequent epochs of training.

Figure 9: Trade-off between test and train loss values during the subsequent epochs of training
Figure 10: Trade-off between test and train accuracy values during the subsequent epochs of training

4. Conclusions

The analysis of the bed of any body of water using side-scan sonar requires a great amount of image data to be manually reviewed. To narrow down the dataset that needs to be hand-checked, artificial intelligence can be used to pick out fragments of the images containing sought objects (for example, shipwrecks). For this purpose, the use of a pre-trained convolutional neural network (VGG-16) connected to a dense neural network was presented. In the research, larger images were divided into target samples and augmented during training (by randomly flipping and rotating them). The obtained results indicate that such a model can satisfactorily distinguish objects whose shapes are unconventional for the riverbed. It is also suggested that, with some improvements, the model could be used in more advanced riverbed analysis by detecting objects harder to distinguish from the plain river bottom, such as rocks, water weeds, silt, etc., as well as their percentage in the whole image. The model tested in this paper was aggressive towards classifying samples containing rubble as plain ground; that said, this was not the case with object detection. The reason for that is, as mentioned above, the small and hardly noticeable difference, in computer vision, between several small distorted objects and a plain surface.

Acknowledgements

This work is supported by the Silesian University of Technology by the mentoring project.

References

[1] G. Chandrashekar, A. Raaza, V. Rajendran, D. Ravikumar, Side scan sonar image augmentation for sediment classification using deep learning based transfer learning approach, Materials Today: Proceedings (2021).
[2] D. Połap, Fuzzy consensus with federated learning method in medical systems, IEEE Access 9 (2021) 150383–150392.
[3] W. Kazimierski, G. Zaniewicz, Determination of process noise for underwater target tracking with forward looking sonar, Remote Sensing 13 (2021) 1014.
[4] N. Wawrzyniak, G. Zaniewicz, Detecting small moving underwater objects using scanning sonar in waterside surveillance and complex security solutions, in: 2016 17th International Radar Symposium (IRS), IEEE, 2016, pp. 1–5.
[5] P. Shen, L. Zhang, M. Wang, G. Yin, Deeper super-resolution generative adversarial network with gradient penalty for sonar image enhancement, Multimedia Tools and Applications 80 (2021) 28087–28107.
[6] H. Song, M. Wang, L. Zhang, Y. Li, Z. Jiang, G. Yin, S2rgan: sonar-image super-resolution based on generative adversarial network, The Visual Computer 37 (2021) 2285–2299.
[7] D. Połap, N. Wawrzyniak, M. Włodarczyk-Sielicka, Side-scan sonar analysis using roi analysis and deep neural networks, IEEE Transactions on Geoscience and Remote Sensing (2022).
[8] X. Shang, J. Zhao, H. Zhang, Automatic overlapping area determination and segmentation for multiple side scan sonar images mosaic, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021) 2886–2900.
[9] F. Yu, B. He, K. Li, T. Yan, Y. Shen, Q. Wang, M. Wu, Side-scan sonar images segmentation for auv with recurrent residual convolutional neural network module and self-guidance module, Applied Ocean Research 113 (2021) 102608.
[10] W. Yanchen, Sonar image target detection and recognition based on convolution neural network, Mobile Information Systems 2021 (2021).
[11] D. Połap, M. Woźniak, Meta-heuristic as manager in federated learning approaches for image processing purposes, Applied Soft Computing 113 (2021) 107872.
[12] X. Qin, X. Luo, Z. Wu, J. Shang, Optimizing the sediment classification of small side-scan sonar images based on deep learning, IEEE Access 9 (2021) 29416–29428.
[13] X. Qin, X. Luo, Z. Wu, J. Shang, D. Zhao, Deep learning-based high accuracy bottom tracking on 1-d side-scan sonar data, IEEE Geoscience and Remote Sensing Letters 19 (2021) 1–5.
[14] G. Zheng, H. Zhang, Y. Li, J. Zhao, A universal automatic bottom tracking method of side scan sonar data based on semantic segmentation, Remote Sensing 13 (2021) 1945.
[15] Y. Yu, J. Zhao, Q. Gong, C. Huang, G. Zheng, J. Ma, Real-time underwater maritime object detection in side-scan sonar images based on transformer-yolov5, Remote Sensing 13 (2021) 3555.
[16] N. Wawrzyniak, M. Włodarczyk-Sielicka, A. Stateczny, Msis sonar image segmentation method based on underwater viewshed analysis and high-density seabed model, in: 2017 18th International Radar Symposium (IRS), IEEE, 2017, pp. 1–9.
[17] C. Wang, Y. Jiang, K. Wang, F. Wei, A field-programmable gate array system for sonar image recognition based on convolutional neural network, Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering 235 (2021) 1808–1818.
[18] M. Włodarczyk-Sielicka, I. Bodus-Olkowska, M. Łącka, The process of modelling the elevation surface of a coastal area using the fusion of spatial data from different sensors, Oceanologia (2021).
[19] J. Cui, G. Han, Y. Su, X. Fu, Non-uniform non-orthogonal multicarrier underwater communication for compressed sonar image data transmission, IEEE Transactions on Vehicular Technology 70 (2021) 10133–10145.
[20] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015).