<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic detection of constructions using binary image segmentation algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>E A Dmitriev</string-name>
          <email>dmitrievEgor94@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A A Borodinov</string-name>
          <email>aaborodinov@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A I Maksimov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S A Rychazhkov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Samara National Research University</institution>,
          <addr-line>Moskovskoye shosse, 34, Samara, Russia, 443086</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>264</fpage>
      <lpage>268</lpage>
      <abstract>
        <p>This article presents binary segmentation algorithms for the automatic detection of buildings in aerial images. Experiments were conducted on several deep neural networks to find the most effective model in terms of segmentation accuracy and training time. All experiments used images of the Moscow region obtained from an open database. As a result, an optimal model for the automatic detection of buildings was found.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Automatic detection of objects in Earth remote sensing (RS) images is one of the most
difficult problems in image analysis. An example of a solution to the problem under consideration is [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Currently, one of the most effective approaches is the use of semantic segmentation
algorithms: for each image pixel, the class of the object to which it belongs is determined.
      </p>
      <p>The segmentation of remote sensing images is used in many industries: geoinformatics, map
creation, land-use analysis, etc. At present, many stages of the segmentation process are performed
manually by operators, which leads to high economic costs in terms of time, as well as
inaccuracies in the markup due to the human factor.</p>
      <p>
        Currently, there are many algorithms for image segmentation [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ], but the most effective are
approaches using convolutional neural networks (CNNs) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For almost all computer vision tasks,
convolutional networks provide better results than other algorithms.
      </p>
      <p>
        In recent years, various approaches have been proposed for constructing CNN models that
output a segmentation map of the original image. One of the most effective methods is based on
fully convolutional networks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Unlike the convolutional networks used for
classification, fully convolutional networks contain no multilayer-perceptron classification
subnetwork.
      </p>
      <p>
        The CNN architecture for semantic segmentation can be divided into two parts: the encoder and the
decoder. The encoder produces feature maps that are smaller than the input image, and the decoder
is used to restore the feature maps to the input size. In the original fully convolutional
models, the decoder was a geometric transformation that enlarged the images using various
interpolation methods [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Currently, a common approach is to construct the decoder subnetwork
symmetrically to the encoder subnetwork, with the exception of the pooling layers. Instead of pooling
layers, transposed convolution layers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or unpooling layers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] can be used.
      </p>
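      <p>The interpolation-based decoder described above can be illustrated with a short sketch. The following is a minimal NumPy implementation of bilinear upsampling for a single-channel feature map; the function name and the integer scale factor are illustrative assumptions, not the authors' implementation.</p>
      <preformat>
```python
import numpy as np

def bilinear_upsample(fm, factor):
    """Enlarge a 2-D feature map by an integer factor using bilinear interpolation."""
    h, w = fm.shape
    H, W = h * factor, w * factor
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # map each output pixel back to fractional input coordinates
            y, x = i / factor, j / factor
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            out[i, j] = (fm[y0, x0] * (1 - dy) * (1 - dx)
                         + fm[y0, x1] * (1 - dy) * dx
                         + fm[y1, x0] * dy * (1 - dx)
                         + fm[y1, x1] * dy * dx)
    return out
```
      </preformat>
      <p>In practice the same operation would be applied per channel; frameworks implement it as a single vectorized resize.</p>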
      <p>This paper discusses four convolutional networks with different encoder and decoder
architectures for detecting buildings. Network training time and segmentation accuracy are used
as the criteria of algorithm effectiveness.</p>
      <p>The work is organized as follows. The second section describes the neural network
architectures under consideration. The third section presents the results of experimental studies on
real images of the Moscow region. The final section summarizes the results and outlines future
research directions in the field of semantic segmentation algorithms.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>
        As binary semantic segmentation algorithms, we used the SegNet neural network [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a model with an
encoder from the ResNet-50 network [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and a decoder in the form of a geometric transformation with
bilinear interpolation, U-Net [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and LinkNet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The SegNet network model is a classic encoder-decoder architecture. The SegNet encoder
consists of 13 convolutional layers, which correspond to the first 13 convolutional layers of the
VGG16 network. The decoder architecture is almost symmetrical to the encoder subnetwork, with the
exception of the pooling layers; in their place, unpooling layers are used. The SegNet network model is
shown in Figure 1.</p>
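      <p>The unpooling used in the SegNet decoder can be sketched as follows: the encoder's max-pooling layers record the position of each maximum, and the decoder places the pooled values back at those positions, filling the rest with zeros. A NumPy sketch for 2 × 2 windows on a single channel (function names are assumed for illustration):</p>
      <preformat>
```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling that also records each argmax position, as SegNet's encoder does."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    idx = np.zeros((h // 2, w // 2), dtype=int)  # flat index into x
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            win = x[i:i+2, j:j+2]
            k = int(np.argmax(win))
            pooled[i // 2, j // 2] = win.flat[k]
            idx[i // 2, j // 2] = (i + k // 2) * w + (j + k % 2)
    return pooled, idx

def unpool(pooled, idx, shape):
    """SegNet-style unpooling: put each value back at its recorded position, zeros elsewhere."""
    out = np.zeros(shape)
    out.flat[idx.ravel()] = pooled.ravel()
    return out
```
      </preformat>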
      <p>The paper also considers a convolutional neural network for segmentation with an encoder based
on ResNet-50. A distinctive feature of the ResNet-50 network is the use of residual connections, which
make it possible to mitigate the vanishing gradient problem that arises as the number of neural
network layers increases. The network model is shown in Figure 2.</p>
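      <p>The residual connection adds the block input directly to the block output, so the identity path carries gradients past layers whose own gradient has become small. A minimal sketch (the helper name is illustrative):</p>
      <preformat>
```python
import numpy as np

def residual_block(x, f):
    """Residual connection: y = f(x) + x. Even if f's contribution (and hence its
    gradient) collapses toward zero, the identity path still passes x through."""
    return f(x) + x
```
      </preformat>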
      <p>The next neural network architecture under consideration is U-Net. A distinctive feature of the
U-Net model is the concatenation of feature maps from the lower and upper levels of the network. This
approach is similar to the residual connections in the ResNet-50 network, but in the case of U-Net,
deeper connections are used. The network model is shown in Figure 3.</p>
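      <p>The difference between the two kinds of skip connection fits in two lines: a residual connection sums feature maps element-wise and keeps the channel count, while a U-Net connection concatenates along the channel axis and doubles it. A NumPy sketch with assumed shapes:</p>
      <preformat>
```python
import numpy as np

# hypothetical encoder skip connection and upsampled decoder map of matching spatial size
enc = np.ones((8, 8, 64))
dec = np.zeros((8, 8, 64))

residual = enc + dec                          # ResNet-style: element-wise sum, 64 channels
concat = np.concatenate([enc, dec], axis=-1)  # U-Net-style: channel concatenation, 128 channels
```
      </preformat>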
      <p>LinkNet is an evolution of the U-Net model. The encoder and decoder are divided into several
subblocks. LinkNet requires fewer computational resources than the other considered models due to
the rapid reduction in the size of the feature maps. At the network input, the feature maps are reduced
by pooling and by convolution with a stride of 2, and in the encoder blocks by strided convolution
instead of pooling. In the decoder, transposed convolutional layers are used to restore the size of the
feature maps. The network model is shown in Figure 4.</p>
      <p>Let X(n1, n2, n3) be the input image, where n1 and n2 are the spatial dimensions and n3 is the
number of channels in the input image. Let Y(n1, n2, n3) be the ground-truth segmentation mask, whose
spatial dimensions coincide with those of the input image and whose number of channels equals the
number of classes. Each channel corresponds to a specific class; here the classes are buildings and
background. The values of Y(n1, n2, n3) in each channel are 0 or 1, depending on the pixel class in the
input image. Let O(n1, n2, n3) be the image obtained at the neural network output, whose size and
number of channels coincide with the image markup. Let y and o be the class-probability vectors of
pixels at the same position in the ground-truth and output images. Then the loss function is as follows:</p>
      <p>N3 1
H  y, o    y i  log o(i) . (1)</p>
      <p>i0</p>
      <p>The objective function is the mean of this loss over the neural network's training set. Let X_G be
the set of training images, where G is the number of elements, and let w be the neural network weights.
Then the mean error is as follows:</p>
      <p>1 G1 N11 N2 1
Q(w, X G )     H O i, j ,Y i, j  (2)</p>
      <p>G i0 j0 k0</p>
      <p>
        All models were trained using an adaptive stochastic gradient algorithm [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. During training,
the learning rate was reduced whenever the quality metric on the validation sample
stopped improving.
      </p>
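      <p>The learning-rate schedule just described reduces to a simple rule: if the best validation metric over the last few epochs does not exceed the best metric before them, scale the learning rate down. This sketch is hypothetical; the reduction factor and patience values are assumptions, not the settings used in the paper:</p>
      <preformat>
```python
def reduce_lr_on_plateau(lr, history, factor=0.1, patience=3):
    """Return the new learning rate given the per-epoch validation metric history
    (higher is better): reduce by `factor` if no improvement for `patience` epochs."""
    if len(history) > patience and max(history[-patience:]) <= max(history[:-patience]):
        return lr * factor
    return lr
```
      </preformat>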
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        The experiments considered photographs of settlements of the Moscow region [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. RGB images of size 512 × 512 were fed to the network input. The number of images was 3323,
and the ratio of the number of elements in the training sample to the number in the test sample was
80:20. The classes were buildings and background. An example image and mask are shown in Figure 5.
      </p>
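      <p>The 80:20 split can be reproduced schematically as follows; the shuffling and the seed are assumptions, since the paper does not specify the split procedure:</p>
      <preformat>
```python
import numpy as np

def split_80_20(n, seed=0):
    """Shuffle image indices and split them 80:20 into train and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(0.8 * n)
    return idx[:cut], idx[cut:]
```
      </preformat>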
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this article, various convolutional neural network architectures were investigated for detecting
structures in remote sensing images.</p>
      <p>A series of experiments was conducted, during which the optimal neural network architecture was
identified in terms of training time and segmentation accuracy. Further research will explore the use
of conditional random fields to improve segmentation quality.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by the Russian Foundation for Basic Research (RFBR) № 18-01-00748-а.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Myasnikov</surname>
            <given-names>V V</given-names>
          </string-name>
          <year>2012</year>
          <article-title>Method for detection of vehicles in digital aerial and space remote sensed images</article-title>
          <source>Computer Optics</source>
          <volume>36</volume>
          (
          <issue>3</issue>
          )
          <fpage>429</fpage>
          -
          <lpage>438</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kuznetsov</surname>
            <given-names>A V</given-names>
          </string-name>
          and
          <string-name>
            <surname>Myasnikov</surname>
            <given-names>V V</given-names>
          </string-name>
          <year>2014</year>
          <article-title>A comparison of algorithms for supervised classification using hyperspectral data</article-title>
          <source>Computer Optics</source>
          <volume>38</volume>
          (
          <issue>3</issue>
          )
          <fpage>494</fpage>
          -
          <lpage>502</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Blokhinov</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gorbachev</surname>
            <given-names>V A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rakutin</surname>
            <given-names>Y O</given-names>
          </string-name>
          and
          <string-name>
            <surname>Nikitin</surname>
            <given-names>A D</given-names>
          </string-name>
          <year>2018</year>
          <article-title>A real-time semantic segmentation algorithm for aerial imagery</article-title>
          <source>Computer Optics</source>
          <volume>42</volume>
          (
          <issue>1</issue>
          )
          <fpage>141</fpage>
          -
          <lpage>148</lpage>
          DOI: 10.18287/2412-6179-2018-42-1-141-148
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Cortes</surname>
            <given-names>C</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vapnik</surname>
            <given-names>V</given-names>
          </string-name>
          <year>1995</year>
          <article-title>Support-vector networks</article-title>
          <source>Machine Learning</source>
          <volume>20</volume>
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Long</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shelhamer</surname>
            <given-names>E</given-names>
          </string-name>
          and
          <string-name>
            <surname>Darrell</surname>
            <given-names>T</given-names>
          </string-name>
          <year>2016</year>
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>324</volume>
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Chaurasia</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Culurciello</surname>
            <given-names>E</given-names>
          </string-name>
          <year>2017</year>
          <article-title>LinkNet: Exploiting encoder representations for efficient semantic segmentation</article-title>
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          <volume>362</volume>
          <fpage>234</fpage>
          -
          <lpage>247</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Badrinarayanan</surname>
            <given-names>V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kendall</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cipolla</surname>
            <given-names>R</given-names>
          </string-name>
          <year>2017</year>
          <article-title>SegNet: A deep convolutional encoder-decoder architecture for image segmentation</article-title>
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          353
          <fpage>125</fpage>
          -
          <lpage>145</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>He</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            <given-names>S</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sun</surname>
            <given-names>J</given-names>
          </string-name>
          <year>2016</year>
          <article-title>Deep residual learning for image recognition</article-title>
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          <volume>123</volume>
          <fpage>235</fpage>
          -
          <lpage>247</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Ronneberger</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            <given-names>P</given-names>
          </string-name>
          and
          <string-name>
            <surname>Brox</surname>
            <given-names>T</given-names>
          </string-name>
          <year>2015</year>
          <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>
          <source>Medical Image Computing and Computer-Assisted Intervention - MICCAI</source>
          345
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Russakovsky</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpathy</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            <given-names>A C</given-names>
          </string-name>
          and
          <string-name>
            <surname>Fei-Fei</surname>
            <given-names>L</given-names>
          </string-name>
          <year>2015</year>
          <article-title>ImageNet large scale visual recognition</article-title>
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          <volume>243</volume>
          <fpage>121</fpage>
          -
          <lpage>136</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Golik</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doetsch</surname>
            <given-names>P</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ney</surname>
            <given-names>H</given-names>
          </string-name>
          <year>2013</year>
          <article-title>Cross-entropy vs. squared error training: a theoretical and experimental comparison</article-title>
          <source>Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 1756-1760</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Kingma</surname>
            <given-names>D</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ba</surname>
            <given-names>J</given-names>
          </string-name>
          <year>2014</year>
          <article-title>Adam: A method for stochastic optimization</article-title>
          <source>International Conference on Learning Representations</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <article-title>Regional geographic information system of the Moscow region</article-title>
          URL: https://rgis.mosreg.ru
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>