<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic detection of constructions using binary image segmentation algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>E A Dmitriev</string-name>
          <email>dmitrievEgor94@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A A Borodinov</string-name>
          <email>aaborodinov@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A I Maksimov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S A Rychazhkov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Samara National Research University</institution>,
          <addr-line>Moskovskoye shosse, 34, Samara, Russia, 443086</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>264</fpage>
      <lpage>268</lpage>
      <abstract>
        <p>This article presents binary segmentation algorithms for the automatic detection of buildings in aerial images. Experiments were conducted on several deep neural networks to find the most effective model in terms of segmentation accuracy and training time. All experiments used images of the Moscow region obtained from an open database. As a result, an optimal model for the automatic detection of buildings was found.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Automatic detection of objects in Earth remote sensing (RS) images is one of the most
difficult problems in image analysis. An example of a solution to the problem under consideration is [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Currently, one of the most effective approaches is the use of semantic segmentation
algorithms: for each image pixel, the class of the object to which it belongs is determined.
      </p>
      <p>The segmentation of remote sensing images is used in many industries: geoinformatics, map
creation, land-use analysis, etc. At present, many stages of the segmentation process are performed
manually by operators, which leads to high economic costs in terms of time, as well as
inaccuracies in the markup due to the human factor.</p>
      <p>
        Currently, there are many algorithms for image segmentation [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ], but the most effective are
approaches using convolutional neural networks (CNNs) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For almost all computer vision tasks,
convolutional networks provide better results than other algorithms.
      </p>
      <p>
        In recent years, various approaches have been proposed for constructing CNN models that
output a segmentation map of the original image. One of the most effective methods is based on
fully convolutional networks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Unlike the convolutional networks used for
classification, fully convolutional networks contain no multilayer-perceptron classification
subnetwork.
      </p>
      <p>
        The CNN architecture for semantic segmentation can be divided into two parts: the encoder and the
decoder. The encoder produces feature maps that are smaller than the input image, and the decoder
is used to restore the feature maps to the input size. In the original fully convolutional
models, the decoder was a geometric transformation that enlarged the images using various
interpolation methods [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Currently, a common approach is to construct the decoder subnetwork
symmetrically to the encoder subnetwork, with the exception of the pooling layers. Instead of pooling
layers, transposed convolution layers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or unpooling layers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] can be used.
      </p>
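      <p>The interpolation-based decoder described above can be illustrated with a short sketch. The following is a minimal NumPy implementation of bilinear upsampling for a single-channel feature map; the function name and the integer scale factor are illustrative assumptions, not the authors' implementation.</p>
      <preformat>
```python
import numpy as np

def bilinear_upsample(fm, factor):
    """Enlarge a 2-D feature map by an integer factor using bilinear interpolation."""
    h, w = fm.shape
    H, W = h * factor, w * factor
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # map each output pixel back to fractional input coordinates
            y, x = i / factor, j / factor
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            out[i, j] = (fm[y0, x0] * (1 - dy) * (1 - dx)
                         + fm[y0, x1] * (1 - dy) * dx
                         + fm[y1, x0] * dy * (1 - dx)
                         + fm[y1, x1] * dy * dx)
    return out
```
      </preformat>
      <p>In practice the same operation would be applied per channel; frameworks implement it as a single vectorized resize.</p>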
      <p>This paper discusses four convolutional networks with different encoder and decoder
architectures for detecting buildings. Network training time and segmentation accuracy are used
as the criteria of algorithm effectiveness.</p>
      <p>The work is organized as follows. The second section describes the neural network
architectures under consideration. The third section presents the results of experimental studies on
real images of the Moscow region. The final section summarizes the results and outlines future
research directions in the field of semantic segmentation algorithms.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>
        As binary semantic segmentation algorithms, we used the SegNet neural network [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a model with an
encoder from the ResNet-50 network [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and a decoder in the form of a geometric transformation with
bilinear interpolation, U-Net [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and LinkNet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The SegNet network model is a classic encoder-decoder architecture. The SegNet encoder
consists of 13 convolutional layers, which correspond to the first 13 convolutional layers of the
VGG16 network. The decoder architecture is almost symmetrical to the encoder subnetwork, with the
exception of the pooling layers; in their place, unpooling layers are used. The SegNet network model is
shown in Figure 1.</p>
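      <p>The unpooling used in the SegNet decoder can be sketched as follows: the encoder's max-pooling layers record the position of each maximum, and the decoder places the pooled values back at those positions, filling the rest with zeros. A NumPy sketch for 2 × 2 windows on a single channel (function names are assumed for illustration):</p>
      <preformat>
```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling that also records each argmax position, as SegNet's encoder does."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    idx = np.zeros((h // 2, w // 2), dtype=int)  # flat index into x
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            win = x[i:i+2, j:j+2]
            k = int(np.argmax(win))
            pooled[i // 2, j // 2] = win.flat[k]
            idx[i // 2, j // 2] = (i + k // 2) * w + (j + k % 2)
    return pooled, idx

def unpool(pooled, idx, shape):
    """SegNet-style unpooling: put each value back at its recorded position, zeros elsewhere."""
    out = np.zeros(shape)
    out.flat[idx.ravel()] = pooled.ravel()
    return out
```
      </preformat>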
      <p>The paper also considers a convolutional neural network for segmentation with an encoder based
on ResNet-50. A distinctive feature of the ResNet-50 network is the use of residual connections, which
make it possible to mitigate the vanishing gradient problem that arises as the number of neural
network layers increases. The network model is shown in Figure 2.</p>
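      <p>The residual connection adds the block input directly to the block output, so the identity path carries gradients past layers whose own gradient has become small. A minimal sketch (the helper name is illustrative):</p>
      <preformat>
```python
import numpy as np

def residual_block(x, f):
    """Residual connection: y = f(x) + x. Even if f's contribution (and hence its
    gradient) collapses toward zero, the identity path still passes x through."""
    return f(x) + x
```
      </preformat>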
      <p>The next neural network architecture under consideration is U-Net. A distinctive feature of the
U-Net model is the concatenation of feature maps from the lower and upper levels of the network. This
approach is similar to the residual connections in the ResNet-50 network, but in the case of U-Net,
deeper connections are used. The network model is shown in Figure 3.</p>
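      <p>The difference between the two kinds of skip connection fits in two lines: a residual connection sums feature maps element-wise and keeps the channel count, while a U-Net connection concatenates along the channel axis and doubles it. A NumPy sketch with assumed shapes:</p>
      <preformat>
```python
import numpy as np

# hypothetical encoder skip connection and upsampled decoder map of matching spatial size
enc = np.ones((8, 8, 64))
dec = np.zeros((8, 8, 64))

residual = enc + dec                          # ResNet-style: element-wise sum, 64 channels
concat = np.concatenate([enc, dec], axis=-1)  # U-Net-style: channel concatenation, 128 channels
```
      </preformat>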
      <p>LinkNet is an evolution of the U-Net model. The encoder and decoder are divided into several
subblocks. LinkNet requires fewer computational resources than the other considered models due to
the rapid reduction in the size of the feature maps. At the network input, the feature maps are reduced
by pooling and by convolution with a stride of 2, and in the encoder blocks by strided convolution
instead of pooling. In the decoder, transposed convolutional layers are used to restore the size of the
feature maps. The network model is shown in Figure 4.</p>
      <p>Let X(n1, n2, n3) be the input image, where n1 and n2 are the spatial dimensions and n3 is the
number of channels in the input image. Let Y(n1, n2, n3) be the ground-truth segmentation mask, whose
spatial dimensions coincide with those of the input image and whose number of channels equals the
number of classes. Each channel corresponds to a specific class; here the classes are buildings and
background. The values of Y(n1, n2, n3) in each channel are 0 or 1, depending on the pixel class in the
input image. Let O(n1, n2, n3) be the image obtained at the neural network output, whose size and
number of channels coincide with the image markup. Let y and o be the class-probability vectors of
pixels at the same position in the ground-truth and output images. Then the loss function is as follows:</p>
      <p>N3 1
H  y, o    y i  log o(i) . (1)</p>
      <p>i0</p>
      <p>The objective function is the mean of this loss over the neural network's training set. Let X_G be
the set of training images, where G is the number of elements, and let w be the neural network weights.
Then the mean error is as follows:</p>
      <p>1 G1 N11 N2 1
Q(w, X G )     H O i, j ,Y i, j  (2)</p>
      <p>G i0 j0 k0</p>
      <p>
        All models were trained using an adaptive stochastic gradient algorithm [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. During training,
the learning rate was reduced whenever the quality metric on the validation sample
stopped improving.
      </p>
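      <p>The learning-rate schedule just described reduces to a simple rule: if the best validation metric over the last few epochs does not exceed the best metric before them, scale the learning rate down. This sketch is hypothetical; the reduction factor and patience values are assumptions, not the settings used in the paper:</p>
      <preformat>
```python
def reduce_lr_on_plateau(lr, history, factor=0.1, patience=3):
    """Return the new learning rate given the per-epoch validation metric history
    (higher is better): reduce by `factor` if no improvement for `patience` epochs."""
    if len(history) > patience and max(history[-patience:]) <= max(history[:-patience]):
        return lr * factor
    return lr
```
      </preformat>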
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        The experiments considered photographs of settlements of the Moscow region [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. RGB images of size 512 × 512 were fed to the network input. The number of images was 3323,
and the ratio of the number of elements in the training sample to the number in the test sample was
80:20. The classes were buildings and background. An example image and mask are shown in Figure 5.
      </p>
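      <p>The 80:20 split can be reproduced schematically as follows; the shuffling and the seed are assumptions, since the paper does not specify the split procedure:</p>
      <preformat>
```python
import numpy as np

def split_80_20(n, seed=0):
    """Shuffle image indices and split them 80:20 into train and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(0.8 * n)
    return idx[:cut], idx[cut:]
```
      </preformat>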
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this article, various convolutional neural network architectures were investigated for detecting
structures in remote sensing images.</p>
      <p>A series of experiments was conducted, during which the optimal neural network architecture was
identified in terms of training time and segmentation accuracy. Further research will explore the use
of conditional random fields to improve segmentation quality.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by the Russian Foundation for Basic Research (RFBR) № 18-01-00748-а.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Myasnikov</surname>
            <given-names>V V</given-names>
          </string-name>
          <year>2012</year>
          <article-title>Method for detection of vehicles in digital aerial and space remote sensed images</article-title>
          <source>Computer Optics</source>
          <volume>36</volume>
          (
          <issue>3</issue>
          )
          <fpage>429</fpage>
          -
          <lpage>438</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kuznetsov</surname>
            <given-names>A V</given-names>
          </string-name>
          and
          <string-name>
            <surname>Myasnikov</surname>
            <given-names>V V</given-names>
          </string-name>
          <year>2014</year>
          <article-title>A comparison of algorithms for supervised classification using hyperspectral data</article-title>
          <source>Computer Optics</source>
          <volume>38</volume>
          (
          <issue>3</issue>
          )
          <fpage>494</fpage>
          -
          <lpage>502</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Blokhinov</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gorbachev</surname>
            <given-names>V A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rakutin</surname>
            <given-names>Y O</given-names>
          </string-name>
          and
          <string-name>
            <surname>Nikitin</surname>
            <given-names>A D</given-names>
          </string-name>
          <year>2018</year>
          <article-title>A real-time semantic segmentation algorithm for aerial imagery</article-title>
          <source>Computer Optics</source>
          <volume>42</volume>
          (
          <issue>1</issue>
          )
          <fpage>141</fpage>
          -
          <lpage>148</lpage>
          DOI: 10.18287/2412-6179-2018-42-1-141-148
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Cortes</surname>
            <given-names>C</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vapnik</surname>
            <given-names>V</given-names>
          </string-name>
          <year>1995</year>
          <article-title>Support-vector networks</article-title>
          <source>Machine Learning</source>
          <volume>20</volume>
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Long</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shelhamer</surname>
            <given-names>E</given-names>
          </string-name>
          and
          <string-name>
            <surname>Darrell</surname>
            <given-names>T</given-names>
          </string-name>
          <year>2016</year>
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>324</volume>
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Chaurasia</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Culurciello</surname>
            <given-names>E</given-names>
          </string-name>
          <year>2017</year>
          <article-title>LinkNet: Exploiting encoder representations for efficient semantic segmentation</article-title>
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          <volume>362</volume>
          <fpage>234</fpage>
          -
          <lpage>247</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Badrinarayanan</surname>
            <given-names>V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kendall</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cipolla</surname>
            <given-names>R</given-names>
          </string-name>
          <year>2017</year>
          <article-title>SegNet: A deep convolutional encoder-decoder architecture for image segmentation</article-title>
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          353
          <fpage>125</fpage>
          -
          <lpage>145</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>He</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            <given-names>S</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sun</surname>
            <given-names>J</given-names>
          </string-name>
          <year>2016</year>
          <article-title>Deep residual learning for image recognition</article-title>
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          <volume>123</volume>
          <fpage>235</fpage>
          -
          <lpage>247</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Ronneberger</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            <given-names>P</given-names>
          </string-name>
          and
          <string-name>
            <surname>Brox</surname>
            <given-names>T</given-names>
          </string-name>
          <year>2015</year>
          <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>
          <source>Medical Image Computing and Computer-Assisted Intervention - MICCAI</source>
          345
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Russakovsky</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpathy</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            <given-names>A C</given-names>
          </string-name>
          and
          <string-name>
            <surname>Fei-Fei</surname>
            <given-names>L</given-names>
          </string-name>
          <year>2015</year>
          <article-title>ImageNet large scale visual recognition</article-title>
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          <volume>243</volume>
          <fpage>121</fpage>
          -
          <lpage>136</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Golik</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doetsch</surname>
            <given-names>P</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ney</surname>
            <given-names>H</given-names>
          </string-name>
          <year>2013</year>
          <article-title>Cross-entropy vs. squared error training: a theoretical and experimental comparison</article-title>
          <source>Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 1756-1760</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Kingma</surname>
            <given-names>D</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ba</surname>
            <given-names>J</given-names>
          </string-name>
          <year>2014</year>
          <article-title>Adam: A method for stochastic optimization</article-title>
          <source>International Conference on Learning Representations</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <article-title>Regional geographic information system of the Moscow region</article-title>
          URL: https://rgis.mosreg.ru
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>