VGG16-based approach for side-scan sonar image analysis

Antoni Jaszcz
Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, Poland

IVUS 2022: 27th International Conference on Information Technology
aj303181@student.polsl.pl (A. Jaszcz)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

Abstract
Side-scan sonar (SSS) images are based on the reflection of a signal from underwater objects. As a result, such data may contain a lot of noise or ambiguous objects to be analyzed. In this paper, we propose a simple system for analyzing such images and classifying the objects in them. For this purpose, a convolutional neural network and transfer learning (VGG-16) were used. The network model was preceded by a process of dividing the sonar image into smaller fragments, in order to avoid omitting objects when reducing the image size. The proposed solution was tested on a dedicated database, which made it possible to evaluate the proposal and reach a high accuracy with the used network. The obtained research results were analyzed and discussed with regard to the possibility of implementing such a model in practice.

Keywords
Side-scan sonar, transfer learning, geospatial data, VGG16, machine learning

1. Introduction

Recent years have brought enormous growth in image analysis through the use of artificial neural networks. In particular, this is due to convolutional neural networks (CNNs), which automatically detect features and perform classification. However, the use of such networks comes with additional requirements. Before classification, each network must undergo a training process on a dedicated database, and for the network to achieve high accuracy scores, it requires an enormous amount of data; hence, neural networks are called data-hungry algorithms. Quite often, however, the amount of data is small and it is difficult to obtain more samples. For this purpose, augmentation [1] or transfer learning [2] techniques can be used.

An interesting case of such images are those gathered underwater [3, 4]. An example is side-scan sonar (SSS), whose images are obtained by visualizing the signal reflected from objects. Such data are exposed to large amounts of noise, so an important element is the construction of neural-network-based systems that increase the quality of these images [5]. In that paper, the authors proposed a solution based on a generative adversarial network: the image was processed by down-sampling and then recreated with an up-sampling approach. A similar idea was presented in [6], where a new type of such network, called S2RGAN, was modeled. However, the detection and classification of objects is also important, since such analysis can bring information about the state of the seafloor. One approach is to analyze smaller parts of the image to find a region of interest and then apply neural networks to those areas [7]. Again, in [8], automatic overlapping and segmentation techniques were developed. Transfer learning based on YOLOv5 was also used to detect objects in SSS images [9]. Similar research was conducted with different CNN models for many types of images [10, 11]. Again, in [12], the authors used a similar approach but classified small images. Segmentation and classification tools can also be used for bottom tracking [13, 14]. Segmentation of the image can be done with a recurrent residual CNN and a self-guidance module [15, 16]. A field-programmable gate array for SSS images based on neural networks was proposed in [17]. SSS data are used not only for finding objects but also to combine their features to create a surface in a 3D projection [18]. Another application is underwater communication for compressed SSS transmission [19].

In this paper, we propose a simple system based on automatically splitting SSS images into smaller parts and using them to train VGG-16. As a result, an automatic system for classifying sonar data is modeled. The main contributions of this paper are:

• a methodology for analyzing SSS images,
• the use of a pre-trained model, VGG-16, to classify SSS data,
• a method of automatically adding a sample to the database if the probability of belonging is high enough, which allows increasing the number of samples in the database.

2. Methodology

In this section, the proposed system for analyzing side-scan sonar images is described (Fig. 1). The incoming sonar data are split into smaller fragments and then processed by a convolutional neural network. If the classification result indicates one of the classes with a high probability (higher than 0.9), the sample is placed in the database and used in the next training.

Figure 1: Visualization of the proposed approach based on the pre-trained convolutional network model

2.1. Image division

At first, the image is processed, because the sample size has to be reduced to that of the first network layer before further processing. A large sonar image, when reduced in size, can distort or simplify certain elements; as a result, this may cause much worse learning outcomes and subsequent classification. To prevent this, we propose a simple algorithm for subdividing the sonar image into smaller fragments (see Alg. 1). The main idea is to first cut the image in half into two samples; the reason is the area along the center of the image that was not visible to the sonar. As a result, two samples are created from one image, presenting the left and right sides of the sonar. Then each sample is divided by the specified height and passed on. Note that if, after the cutting process, a sample is larger than the first layer of the network, it will be reduced to that size.

Algorithm 1: SSS-image division algorithm
Data: sonar image
Result: image samples of a given size
1  while not past the bottom of the image do
2      while not past the right edge of the image do
3          cut and save a sample of the desired shape    // 256x256 pixels
4          move right by the desired amount of pixels    // 100 pixels
5      end
6      move back to the left edge and shift down by the desired amount of pixels    // 100 pixels
7  end

Figure 3: Windowing technique visualised on a small part of an image: (a) part of a larger image, (b) cut-out fragment indicated by the red frame, (c) cut-out fragment indicated by the green frame, (d) cut-out fragment indicated by the blue frame

2.2. CNN

The split image is processed by a convolutional neural network. Its structure consists of three different layer types: convolutional, pooling, and fully connected (dense). The convolutional layer changes the image I with a filter k (a matrix of size p × p). This is done by the convolution operator (here marked as *) defined as:

    k * I_{x,y} = Σ_{i=1}^{p} Σ_{j=1}^{p} k_{i,j} · I_{x+i-1, y+j-1}.    (1)

The values of the filter k are found during the training process. The main task of this filter is to modify the image and extract features from it (the changed image is therefore called a feature map).

The second layer type is known as pooling, and it has one task: reducing the image size. This reduction is based on mathematical functions like max(·). The operation is understood as the selection of one pixel in a grid that satisfies this function. Of course, the grid is moved over the entire image until the last pixel is covered by it. The minimum size of such a grid is 2 × 2.

The third layer type is the fully connected one, which presents a classic column of neurons that receive numerical values and pass them on to the next neurons through connections. Each connection between two neurons has a weight w, and this value is modified in the training process. The mathematical formulation of it is:

    f( Σ_{i=0}^{n-1} w_{i,j} · x_i ),    (2)

where f(·) is an activation function and i, j are the indexes of two neurons in adjacent layers.

The main reason for using a CNN for image recognition is feature extraction. The more consecutive layers the neural network has, the more abstract the features that can be considered. The basic features extracted first by the model are generally lines along the axes of the image, i.e. vertical, horizontal and diagonal lines. Then more advanced characteristics are discovered, like shapes: rectangles, circles, straight lines, etc. The deeper the model goes, the more complex and abstract the newly extracted features become, to the point that we as humans do not even consider them or are unaware of them.

In the proposed methodology, we use the pre-trained model VGG-16 [20], which is presented in Fig. 2. The problem of object recognition can be universalized to some extent in terms of object detection, which is why pre-trained models are commonly used. Such models are trained to extract unique features related to the sought objects. The most popular database for training those models is the ImageNet database, containing over a million manually labeled images from a thousand classes. By incorporating such pre-trained models, training time can be greatly reduced and the focus of the training can be shifted towards detection of specific objects rather than feature extraction. Some of the ImageNet pre-trained models include:

• the VGG family,
• ResNet50,
• Inception V3,
• Xception.

In our experiments, we chose VGG-16 because it is the most commonly used model, which makes comparison with other models in the field easier. Our methodology also incorporates samples assessed with certainty above a certain threshold, which benefited our research greatly.

Figure 2: Visualization of the pre-trained model known as VGG16

3. Experiments

In this section, the database, the CNN configuration, the obtained results and a discussion are presented.

3.1. Database and data preparation

In this paper, side-scan sonar images of a river floor were used.
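These images are divided into training samples according to Alg. 1. A minimal Python sketch of this windowing is given below; the 256-pixel window and 100-pixel stride follow the paper, while the function and variable names are illustrative and the image is assumed to be a NumPy array (e.g. as loaded by OpenCV).

```python
import numpy as np

def split_sonar_image(image: np.ndarray, size: int = 256, stride: int = 100):
    """Divide a sonar image into overlapping square samples (Alg. 1).

    The image is first cut in half into a left and a right sample
    (the central strip is the area not visible to the sonar), then a
    window of `size` x `size` pixels slides over each half, moving by
    `stride` pixels to the right and downwards.
    """
    half = image.shape[1] // 2
    samples = []
    for side in (image[:, :half], image[:, half:]):
        height, width = side.shape[:2]
        for y in range(0, max(height - size, 0) + 1, stride):
            for x in range(0, max(width - size, 0) + 1, stride):
                samples.append(side[y:y + size, x:x + size])
    return samples
```

For a 700 × 1100 image this yields 15 overlapping 256 × 256 samples per half; resizing samples that do not fit the network's first layer (mentioned in Section 2.1) is left out of the sketch.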
The data were gathered between two water channels in north-western Poland. In order for a deep learning model to identify objects on a riverbed, target samples were required. To achieve that, the objects had to be hand-picked and manually classified. To facilitate this process, the images were automatically split into smaller parts of a fixed size of 256 by 256 pixels. It is worth mentioning that those parts could overlap with one another: in this paper, two consecutive samples in the same row were 100 pixels apart, as were the considered rows (each following row was 100 pixels lower). This is performed by the simple image-division procedure shown in Alg. 1. The windowing technique is visualised in Fig. 3.

After obtaining the samples, they were manually classified into 3 categories, depending on what could be seen in the picture. Those categories are:

1. object - anything distinguishable as a larger object lying on the riverbed. That includes ship/vehicle wrecks (Fig. 4), logs and pipes (Fig. 5), etc. In other words, single objects of considerable size that appeared in the river.
2. sand - a plain surface of the riverbed, on which nothing particular can be picked out. An example of such a sample is shown in Fig. 6.
3. rubble - a plain surface of the riverbed with a visible and considerable amount of debris. The goal of adding this class was to make the model able to detect cluttered areas and distinguish them from clear ones.

Figure 4: Possible wreck of a car (object class)
Figure 5: Log or pipe (object class)
Figure 6: Sand (sand class)

In total, there were 665 samples in the final database, 55 of which were classified as objects, 352 as sand and 257 as rubble. Next, 226 samples were randomly chosen and put into the validation group. The class distribution of those objects can be read from Fig. 8, and is as follows:

• 36 object samples,
• 124 sand samples,
• 63 rubble samples.

The remaining 439 samples formed the training set for the neural network.

It is crucial to mention that, due to the nature of side-scan sonar images, some samples turned out to be inadequate, and there was a concern that they would bring nothing but confusion to the model. As a result, it was decided that a sample should be removed from the dataset if it was too confusing (in terms of which class it should belong to), if it contained a considerable amount of the boat's passage area (a thick black line stretching along the edges of a sonar image and crossing its center; an example can be seen in Fig. 7), or if its quality was concerning (the image was greatly distorted).

Figure 7: An example of a sample that was removed

In the experiments, a pre-trained VGG-16 model connected to a dense neural network with input augmentation was used. The structure of the model is as follows:

1. augmentation layer - random horizontal and vertical flips and random rotation,
2. VGG-16 layer,
3. deep neural layers:
   • flatten layer,
   • dense layer, 50 neurons, ReLU activation,
   • dense layer, 20 neurons, ReLU activation,
   • dropout layer (threshold: 0.5),
4. output layer - a dense layer with 3 neurons (one for each output class).

Please note that in this paper we consider multi-class detection, which also applies to the results. Thus, by "true", a correct class assignment is meant: at a given time during assessment, only one class is considered true and the others are false.
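The self-labeling step described in Section 2, where a sample is added back to the database when the network assigns one class a probability above 0.9, can be sketched as follows. This is a minimal NumPy illustration; the 0.9 threshold comes from the paper, while the function and variable names are my own.

```python
import numpy as np

def select_confident_samples(samples: np.ndarray,
                             probabilities: np.ndarray,
                             threshold: float = 0.9):
    """Pick samples whose highest class probability exceeds `threshold`.

    `probabilities` has shape (n_samples, n_classes), e.g. softmax
    outputs of the network. The selected samples, together with their
    predicted labels, can be appended to the training database for the
    next round of training.
    """
    confidence = probabilities.max(axis=1)   # best class score per sample
    labels = probabilities.argmax(axis=1)    # predicted class index
    mask = confidence > threshold
    return samples[mask], labels[mask]
```

With three classes (object, sand, rubble), a sample classified as, say, sand with probability 0.95 would be kept, while an ambiguous 0.4/0.35/0.25 prediction would be discarded.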
This one-vs-rest convention applies to every class; it is visible in Fig. 8, and its impact on the obtained metrics is displayed in Tab. 1.

3.2. Results

In Tab. 1, the calculated metrics (accuracy, precision, specificity, recall and F1-score) are displayed for each class, as well as their mean values. The values were calculated with the following formulas:

• Accuracy:
    α = (TP + TN) / (TP + FP + TN + FN),    (3)
• Precision:
    ψ = TP / (TP + FP),    (4)
• Recall:
    ρ = TP / (TP + FN),    (5)
• Specificity:
    σ = TN / (TN + FP),    (6)
• F1-score:
    1/f1 = 0.5 · (1/ψ + 1/ρ),    (7)

where
TP - true sample predicted as true,
TN - false sample predicted as false,
FP - false sample predicted as true,
FN - true sample predicted as false.

Figure 8: Multi-class confusion matrix

Table 1: Calculated metrics

class     precision  recall  specificity  F1-score  accuracy
object    0.7895     0.8333  0.9579       0.8108    0.9381
sand      0.8425     0.9685  0.7677       0.9011    0.8805
rubble    0.8810     0.5873  0.9693       0.7048    0.8628
average   0.8376     0.7964  0.8983       0.8056    0.8938

As seen in Fig. 8 and Tab. 1, the model achieves decent accuracy when detecting objects. It also performs very well when detecting non-object samples, which is indicated by the high specificity. In terms of identifying sand, the accuracy dropped; however, the recall is high, which suggests that the model correctly assigned most of the sand samples. This came at the cost of precision, which is not great. As can be observed in Fig. 8, 21 of the rubble samples were categorized as sand, which indicates that the model assigns the sand class too aggressively and requires some more balance. As a result, rubble's metrics are all fairly low, accuracy included. All of this implies that rubble detection is the weakest link of the model.

In Fig. 9 and Fig. 10, the trade-off between classifying the train and test groups (both derived from the aforementioned training group, the former containing the samples to learn from and the latter the samples to validate the outcomes) is presented in terms of loss (Fig. 9) and accuracy (Fig. 10) during the subsequent epochs of training.

Figure 9: Trade-off between test and train loss values during the subsequent epochs of training
Figure 10: Trade-off between test and train accuracy values during the subsequent epochs of training

4. Conclusions

The analysis of the bed of any body of water using side-scan sonar requires a great amount of image data to be manually reviewed. To narrow down the dataset that needs to be hand-checked, artificial intelligence can be used to pick out fragments of the images containing sought objects (for example, shipwrecks). For this purpose, the use of a pre-trained convolutional neural network (VGG-16) connected to a dense neural network was presented. In the research, larger images were divided into target samples and augmented during training (by randomly flipping and rotating them). The obtained results indicate that such a model can satisfactorily distinguish objects whose shapes are unconventional for the riverbed. It is also suggested that, with some improvements, the model could be used in more advanced riverbed analysis by detecting objects harder to distinguish from the plain river bottom, such as rocks, water weeds, silt, etc., as well as their percentage in the whole image. The model tested in this paper was aggressive towards classifying samples containing rubble as plain ground; that said, this was not the case with object detection. The reason for that is, as mentioned above, the small and hardly noticeable difference, in computer vision, between several small distorted objects and a plain surface.

Acknowledgements

This work is supported by the Silesian University of Technology by the mentoring project.

References

[1] G. Chandrashekar, A. Raaza, V. Rajendran, D. Ravikumar, Side scan sonar image augmentation for sediment classification using deep learning based transfer learning approach, Materials Today: Proceedings (2021).
[2] D. Połap, Fuzzy consensus with federated learning method in medical systems, IEEE Access 9 (2021) 150383–150392.
[3] W. Kazimierski, G. Zaniewicz, Determination of process noise for underwater target tracking with forward looking sonar, Remote Sensing 13 (2021) 1014.
[4] N. Wawrzyniak, G. Zaniewicz, Detecting small moving underwater objects using scanning sonar in waterside surveillance and complex security solutions, in: 2016 17th International Radar Symposium (IRS), IEEE, 2016, pp. 1–5.
[5] P. Shen, L. Zhang, M. Wang, G. Yin, Deeper super-resolution generative adversarial network with gradient penalty for sonar image enhancement, Multimedia Tools and Applications 80 (2021) 28087–28107.
[6] H. Song, M. Wang, L. Zhang, Y. Li, Z. Jiang, G. Yin, S2rgan: sonar-image super-resolution based on generative adversarial network, The Visual Computer 37 (2021) 2285–2299.
[7] D. Połap, N. Wawrzyniak, M. Włodarczyk-Sielicka, Side-scan sonar analysis using roi analysis and deep neural networks, IEEE Transactions on Geoscience and Remote Sensing (2022).
[8] X. Shang, J. Zhao, H. Zhang, Automatic overlapping area determination and segmentation for multiple side scan sonar images mosaic, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021) 2886–2900.
[9] F. Yu, B. He, K. Li, T. Yan, Y. Shen, Q. Wang, M. Wu, Side-scan sonar images segmentation for auv with recurrent residual convolutional neural network module and self-guidance module, Applied Ocean Research 113 (2021) 102608.
[10] W. Yanchen, Sonar image target detection and recognition based on convolution neural network, Mobile Information Systems 2021 (2021).
[11] D. Połap, M. Woźniak, Meta-heuristic as manager in federated learning approaches for image processing purposes, Applied Soft Computing 113 (2021) 107872.
[12] X. Qin, X. Luo, Z. Wu, J. Shang, Optimizing the sediment classification of small side-scan sonar images based on deep learning, IEEE Access 9 (2021) 29416–29428.
[13] X. Qin, X. Luo, Z. Wu, J. Shang, D. Zhao, Deep learning-based high accuracy bottom tracking on 1-d side-scan sonar data, IEEE Geoscience and Remote Sensing Letters 19 (2021) 1–5.
[14] G. Zheng, H. Zhang, Y. Li, J. Zhao, A universal automatic bottom tracking method of side scan sonar data based on semantic segmentation, Remote Sensing 13 (2021) 1945.
[15] Y. Yu, J. Zhao, Q. Gong, C. Huang, G. Zheng, J. Ma, Real-time underwater maritime object detection in side-scan sonar images based on transformer-yolov5, Remote Sensing 13 (2021) 3555.
[16] N. Wawrzyniak, M. Włodarczyk-Sielicka, A. Stateczny, Msis sonar image segmentation method based on underwater viewshed analysis and high-density seabed model, in: 2017 18th International Radar Symposium (IRS), IEEE, 2017, pp. 1–9.
[17] C. Wang, Y. Jiang, K. Wang, F. Wei, A field-programmable gate array system for sonar image recognition based on convolutional neural network, Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering 235 (2021) 1808–1818.
[18] M. Włodarczyk-Sielicka, I. Bodus-Olkowska, M. Łącka, The process of modelling the elevation surface of a coastal area using the fusion of spatial data from different sensors, Oceanologia (2021).
[19] J. Cui, G. Han, Y. Su, X. Fu, Non-uniform non-orthogonal multicarrier underwater communication for compressed sonar image data transmission, IEEE Transactions on Vehicular Technology 70 (2021) 10133–10145.
[20] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015).