Intelligent System for Building Separation on a Semantically Segmented Map
Volodymyr Hnatushenko a, Vadym Zhernovyi b, Iryna Udovyk a and Olga Shevtsova a
a Dnipro University of Technology, Dmytra Yavornytskoho av., 19, Dnipro, 49005, Ukraine
b Oles Honchar Dnipro National University, Gagarina av., 72, Dnipro, 49010, Ukraine

Abstract
Terabytes of very high-resolution satellite imagery are sent to ground stations every day, yet only 5% of this information is used, which creates a demand for automation of image processing routines. Semantic maps are becoming especially popular for a wide range of analysis tasks such as surveillance, vegetation monitoring, change detection, etc. Deep learning offers a very flexible and configurable tool for image processing needs, semantic segmentation included: it makes it possible to extract distinctive features from data and adapt models and algorithms to specific data to achieve the best possible results. In the current work, an algorithm for instance-like segmentation is suggested. The algorithm is applied on top of a modified semantic segmentation neural network in order to work with separate instances of different land objects. Other networks, such as Mask R-CNN, already perform instance segmentation; however, semantic segmentation networks often provide better detection accuracy, and the ability to work with individual detected objects is crucial. Separated instances can be used in various calculations and measurements, such as the sizes of objects, distances between them, etc. In addition to the semantic segmentation neural network, an approach is suggested to approximate such essential physical parameters of land objects as perimeter, area and building density, using the spatial resolution characteristics of the very high-resolution remote sensing imagery that serves as the source of the training dataset in the current work. The results of the suggested methods can be applied in countless areas such as urban planning, built-up analysis, traffic control, etc. The solution is flexible and can be further adjusted for different needs, which is discussed as future research.

Keywords
Remote sensing, image, deep learning, semantic segmentation, masks, measurements.

1. Introduction
Satellite imagery is based on the complex process of converting solar energy reflected from the Earth's surface into electromagnetic pulses, which are recorded digitally. Until a decade ago, access to satellite data was limited, and only the military, large corporations, government agencies and some scientific institutions could obtain such information. Now terabytes of satellite data are available to everyone, and every day we can see what our planet looks like with the help of satellite imagery. Remotely sensed images permit accurate mapping of land cover and can assist the planning and coordination of global change.

IntelITSIS'2021: 2nd International Workshop on Intelligent Information Technologies and Systems of Information Security, March 24–26, 2021, Khmelnytskyi, Ukraine
EMAIL: vvgnat@ukr.net (V. Hnatushenko); vadim.zhernovoy@gmail.com (V. Zhernovyi); udovik.im@gmail.com (I. Udovyk); shevtsova.o.s@nmu.one (O. Shevtsova)
ORCID: 0000-0003-3140-3788 (V. Hnatushenko); 0000-0002-0599-7992 (V. Zhernovyi); 0000-0002-5190-841X (I. Udovyk); 0000-0002-6421-8127 (O. Shevtsova)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Recently, semantic segmentation of land objects has become extremely popular in remote sensing applications and systems. Such segmented maps have many applications in different areas, such as urban planning [1], agriculture [2], and traffic estimation and monitoring both on land and in water [3]. Two categories of approaches are usually considered when solving semantic segmentation problems: deterministic feature-based algorithms such as those described in [2, 4], and stochastic deep learning approaches that rely heavily on deep convolutional neural networks [5-7]. Although hand-crafted feature-based algorithms and other machine learning approaches without neural networks can be successfully applied to certain satellite imagery processing problems, deep learning remains more promising in the long run [8-10]. Advances in feedforward deep learning networks come from alternating convolutional and max-pooling layers [11], topped with several fully or sparsely connected layers, followed by a final classification layer. Training is usually done without any unsupervised pre-training. GPU-based approaches have won many image recognition competitions, including the IJCNN 2011 Traffic Sign Recognition Competition [12], the ISBI 2012 challenge on segmentation of neuronal structures in electron microscopy stacks [13], ImageNet [14] and others. Such supervised deep learning methods also became the first artificial image recognizers to achieve human-comparable performance on some tasks [15].

Deep learning applications to remotely sensed images differ from those to natural images: remotely sensed images usually contain more complicated and diverse patterns. Thanks to its strong ability in feature representation, deep learning has been introduced into environmental remote sensing and applied in many aspects, including land cover mapping, environmental parameter retrieval, data fusion and downscaling, and information construction and prediction; more detailed applications of deep learning in environmental remote sensing are reviewed in [16]. Most deep learning solutions make use of network structures based on convolutional neural networks, and certain network types suit certain challenges better – remote sensing is no exception. One problem that persists in remote sensing imagery processing is that imagery often has to be processed in patches or tiles, since many neural networks accept 3-channel images of a fixed resolution, whereas remote sensing imagery may exceed ten thousand pixels per dimension. The Fully Convolutional Network (FCN) was invented to address this problem [17-19]: with this network type it is possible to generate a segmented map of any size. Another popular approach to semantic segmentation is the encoder-decoder architecture, which generates a semantic map with the same resolution and dimensions as the original image. Later, more complex solutions were developed to advance the generation of semantic maps, namely SegNet [20], DCNN+CRF [21], SS-CNN [22] and others. A good review of neural network applications for remote sensing data is provided in [23].
One of the main reasons to choose one neural network architecture over others is to make use of both the spectral and the spatial information that is often provided with the most widely used satellite imagery. In most cases, the results of deep learning solutions for remote sensing are successfully applied only to the imagery type with which they were implemented and tested. However, there are exceptions where a proper combination of neural network architectures and parameters solved the problem of semantic segmentation for similar imagery from multiple different satellites; these problems and solutions are described in detail in [24]. In the current research paper, a modified Unet-like architecture is suggested for processing very high-resolution multispectral WorldView-3 imagery. WorldView-3 imagery is used to design a dataset for neural network training. Additional layers are designed on top of the neural network architecture to separate instances of detected land objects and to perform a land object density measurement algorithm.

2. Pre-processing
In the current work, WorldView-3 imagery is used as the source of the training and testing datasets. WorldView-3 is a high-resolution satellite sensor operating at an altitude of 617 km. It provides 31 cm panchromatic resolution, 1.24 m multispectral resolution, 3.7 m short-wave infrared (SWIR) resolution and 30 m CAVIS resolution. The satellite has an average revisit time of less than one day and is capable of collecting up to 680,000 km2 per day [25]. To achieve the best possible results when applying deep learning to satellite imagery, it is essential to consider the main characteristics of sensor systems that determine the suitability of the data for solving a problem. Four types of resolution are distinguished:
• spectral;
• spatial;
• radiometric;
• temporal.
Spectral resolution is the ability of a sensor system to register electromagnetic radiation of a specific frequency range; it is determined by the number of satellite channels, i.e. the intervals of wavelengths of the electromagnetic spectrum to which the sensor is sensitive. WorldView-3 provides a wide range of options in terms of spectral resolution, including panchromatic images and multispectral visible and non-visible bands. The non-visible bands are short-wave infrared bands and are not used in the current research.
Spatial resolution is the size of the smallest object on the Earth's surface that is distinguishable in the image – effectively, the physical size of a pixel. Currently, the best commercially available imagery has a spatial resolution of 30 cm, provided by the WorldView-3 satellite. This means that a 30 × 30 cm object appears in the image as a single pixel. Objects such as cars will therefore be noticeable in the picture and their color can be determined (if the picture is in color), but smaller details (registration number, design features that help determine the make and model) will not be readable [26].
These characteristics are the main ones taken into consideration when designing a dataset for deep neural network training. Details of the dataset design are described in [27]. To improve the quality of the dataset further, additional image enhancement techniques can be applied; in the current research, those of [28] are used. An additional attempt was made to apply a shadow detection algorithm [29] to the dataset, but it did not show any significant improvement in post-processing algorithm performance.
The resulting spectral and spatial characteristics, as well as the amount of information in pixels, are provided in Table 1.

Table 1
WorldView-3 Imagery Spatial Characteristics
Type                  Wavebands             Pixel resolution   Num. channels   Size
Grayscale             Panchromatic          0.31 m             1               16924 x 17020
8-band pansharpened   Multispectral         0.31 m             8               16924 x 17020
16-band               Multispectral         1.24 m             8               4255 x 4231
                      Short-wave infrared   7.5 m              8               670 x 688

Radiometric resolution is the number of possible encoded spectral luminance values in the data file for each spectral band, indicated by the number of bits. It is determined by the number of gradations of color values, corresponding to transitions from absolutely "black" to absolutely "white" brightness, and is expressed in bits per pixel. For WorldView-3 this value is 16 bits, which means that spectral luminance values vary from 0 to 65535. For use with the neural network, these values are normalized to the range between 0 and 1 and stored as 16-bit floating-point values (FP16). Among the best practices for training a neural network is normalizing the data to obtain a mean close to 0: normalization generally speeds up learning and leads to faster convergence [30]. Temporal characteristics are not considered, since change detection of the same territories over time is out of scope for the current research.
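As an illustration of the pre-processing described above, below is a minimal Python sketch that tiles a large 16-bit scene into network-sized patches and normalizes them to the 0..1 range in FP16. The function names and the non-overlapping tiling scheme are our assumptions for illustration, not the exact pipeline of the dataset design in [27].

```python
import numpy as np

def normalize_16bit(tile: np.ndarray) -> np.ndarray:
    # Scale 16-bit luminance values (0..65535) to the 0..1 range, stored as FP16
    return (tile.astype(np.float32) / 65535.0).astype(np.float16)

def extract_tiles(scene: np.ndarray, size: int = 512):
    # Split a large (H, W, C) scene into non-overlapping size x size patches;
    # border pixels that do not fill a whole tile are skipped in this sketch
    h, w = scene.shape[:2]
    for row in range(0, h - size + 1, size):
        for col in range(0, w - size + 1, size):
            yield normalize_16bit(scene[row:row + size, col:col + size])
```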
3. Neural network
For years, the Unet architecture has remained a popular choice in many areas of research where semantic segmentation is required. Unet was originally designed for biomedical image segmentation [31]; however, today it is applied successfully in other areas of knowledge, including remote sensing [10, 32-34]. The main focus of this research is post-processing of Unet segmentation results, so additional tuning was done to the neural network used in the current work in order to aid the separation of the whole mask into individual instances and the subsequent measurements. Our neural network consists of:
• an input layer of size 512 x 512;
• 5 encoder blocks (Fig. 1);
• 1 extra convolution block (Fig. 1);
• 5 decoder blocks (Fig. 1);
• an output sigmoid activation layer.
Unlike most Unet-like architecture applications, dropout is not used in the current work: in the conducted experiments, different combinations of dropout layers did not show any improvement for the post-processing algorithm. Training is run on 4255 training and 759 validation samples. Random hue shifts, horizontal flipping and height-width shifting are applied as augmentation for the training dataset, with a hue delta value of 0.1. Such augmentation is not applied to the validation dataset.

Figure 1: Unet backbone blocks

There are metrics specifically developed to adequately measure the performance of deep learning solutions. Custom metrics and loss functions were developed in the current work for better representation of both the neural network performance and the post-processing algorithm performance. Since the goal of the article is to achieve good instance segmentation, it was decided to apply a modified dice coefficient (F1 score)

$$\mathrm{Dice} = \frac{1}{N}\sum_{i=0}^{n} \frac{2\,p(y_i)\,y_i + 1}{p(y_i) + y_i + 1}, \qquad (1)$$

where $p(y_i)$ is a predicted mask and $y_i$ is the ground truth annotated mask available from the training data. It was successfully used in [35] to represent instance segmentation results. The original dice coefficient is modified by adding 1 to the intersection and union parts of the equation to prevent division by zero. Additionally, a dice loss function is used for training:

$$loss_{dice} = 1 - \mathrm{Dice}, \qquad (2)$$

where Dice is defined in (1). The dice loss function measures the overlap between the ground truth and segmented masks. It is very flexible and could be optimized further, which may improve the results even more [36], but such optimization is not investigated in the current work. Unfortunately, for this neural network architecture, using $loss_{dice}$ alone led to overfitting of the model, despite the popular architecture and the relatively big dataset. To overcome this problem, an improvement was implemented for the loss function which had already helped with another problem using a similar neural network model [37]. The solution is to define a more complex loss

$$loss = loss_{dice} + loss_{bce}, \qquad (3)$$

where $loss_{bce}$ is the binary cross-entropy (log loss) function, defined as

$$loss_{bce} = -\frac{1}{N}\sum_{i=0}^{n}\left[\,y_i \log\bigl(p(y_i)\bigr) + (1 - y_i)\log\bigl(1 - p(y_i)\bigr)\right], \qquad (4)$$

where $p(y_i)$ is a predicted mask and $y_i$ is the ground truth annotated mask available from the training data. The Adam function was chosen as the optimizer. Additional standard metrics such as accuracy, precision and recall were also calculated for secondary analysis of the results. The mixed precision technique was applied in order to improve the training speed of the relatively big model (10 million parameters) in the limited training environment; mixed precision training is described in detail in [38].
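To make the loss definitions concrete, below is a minimal TensorFlow/Keras sketch of the metric (1) and the losses (2)-(4). The pixel-wise averaging and reduction details are our assumptions, since the paper does not spell out the implementation.

```python
import tensorflow as tf

def dice_coef(y_true, y_pred):
    # Modified dice (1): the +1 in numerator and denominator prevents
    # division by zero, as described in the text
    return tf.reduce_mean(
        (2.0 * y_true * y_pred + 1.0) / (y_true + y_pred + 1.0))

def dice_loss(y_true, y_pred):
    return 1.0 - dice_coef(y_true, y_pred)  # (2)

def combined_loss(y_true, y_pred):
    # (3): dice loss (2) plus binary cross-entropy (4)
    bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return dice_loss(y_true, y_pred) + tf.reduce_mean(bce)

# A possible compile call with Adam and the secondary metrics mentioned above:
# model.compile(optimizer="adam", loss=combined_loss,
#               metrics=[dice_coef, "accuracy",
#                        tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
```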
Final training results are given in Table 2.

Table 2
Unet training results
Metric      Value
Accuracy    0.9834
Precision   0.9217
Recall      0.7641
F1 Score    0.8355

Such results are in line with those of similar works [10], which have a better recall rate and around the same precision. Preliminary research showed that more training and dataset cleanup are the main contributors to a better recall.

4. Post-processing
The main goal of the current article is the post-processing, which is implemented as top layers over the Unet-like backbone. This post-processing performs multiple sequential steps:
• instance separation;
• semantic labeling;
• measurements of land objects;
• building density.

4.1. Instance separation
The output of Unet consists of pixel-wise grayscale values in the range 0..1 which represent the degree to which a pixel belongs to a certain class. In order to distinguish detected masks properly from the background, a threshold must be applied so that every pixel can be separated by this value. Most research uses the middle value of 0.5; however, for the suggested approach another value performs visibly better. This value is obtained by applying Otsu's method [39] to the segmentation results. Otsu's method is an automatic image thresholding technique that classifies all pixels into two classes – foreground and background – resulting in a binarized image (Fig. 2). According to the Otsu method, the optimal threshold for binarization minimizes the weighted sum of variances within each cluster or, equivalently, maximizes the inter-class variance.

Figure 2: Results of applying Otsu's thresholding algorithm: original image (a), segmented image (b), post-processed image (c)

Another algorithm is implemented to separate instances from the whole mask. This technique involves feature-based analysis which distinguishes arrays of pixels using a centrosymmetric filter structure. Such an approach helps to keep together pixels that belong to one territory but consist of multiple objects of search (Fig. 3).

Figure 3: Post-processing algorithm: the original image (a) is segmented into 3 split areas (b) but classified as a single instance (c)
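A simplified sketch of this step is shown below using scikit-image: Otsu binarization followed by connected-component labelling. The labelling and the morphological closing are a simpler stand-in for the paper's centrosymmetric filtering, which is not reproduced here.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops
from skimage.morphology import closing

def separate_instances(prob_map: np.ndarray):
    # Binarize the 0..1 Unet output with Otsu's automatic threshold
    # instead of the usual fixed value of 0.5
    threshold = threshold_otsu(prob_map)
    binary = prob_map > threshold
    # Morphological closing keeps nearby fragments of one object together
    binary = closing(binary, np.ones((3, 3), dtype=bool))
    # Connected-component labelling splits the mask into instance candidates
    labelled = label(binary)
    # Bounding boxes (min_row, min_col, max_row, max_col) feed the measurements step
    return [region.bbox for region in regionprops(labelled)]
```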
Additional semantic search is performed after separation to obtain the coordinates and dimensions of all objects found. These coordinates and dimensions are used for measurements and density calculations.

4.2. Land object measurements
This part of the post-processing is the simplest in terms of computational complexity. It is based on knowledge of the structure of the satellite imagery: since the side of one pixel is known to correspond to 31 cm, the physical parameters of detected objects can be calculated. First, the minimal fitting rectangle is calculated for each instance. This rectangle is built from the largest vertical and horizontal diameters (Fig. 4).

Figure 4: Maximum vertical and horizontal diameters based on the minimal fitting rectangle

Using the calculated diameters, the physical parameters of land objects are approximated: the perimeter (5) and the area (6):

$$Perimeter = 2\,(v_{max} + h_{max}), \qquad (5)$$

where $h_{max}$ is the maximum horizontal diameter taken from the minimal fitting rectangle and $v_{max}$ is the vertical one;

$$Square = v_{max} \cdot h_{max}, \qquad (6)$$

where $h_{max}$ and $v_{max}$ are the maximum horizontal and vertical diameters correspondingly. Building density is calculated using (7) as the percentage of all pixels identified as land objects out of the total number of image pixels. Since these calculations only approximate the physical sizes of land objects (an AI system cannot be 100% accurate), the formula also includes a correction factor based on the precision and recall values:

$$BD = \frac{precision}{recall} \cdot \frac{\sum_{i=0}^{m} p(y_i)}{\sum_{j=0}^{n} x_j}, \qquad (7)$$

where $m$ is the number of segmented pixels, $n$ is the total number of image pixels, $p(y_i)$ is a predicted pixel and $x_j$ is an image pixel. Precision and recall are the corresponding neural network metrics obtained during the validation stage of training.
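The following sketch turns the bounding boxes from the previous step into the measurements (5)-(7). The 0.31 m ground sample distance comes from Table 1 and the default metric values from Table 2; the function layout itself is our own illustration.

```python
import numpy as np

GSD = 0.31  # ground sample distance, meters per pixel side (Table 1)

def measure_instance(bbox):
    # bbox = (min_row, min_col, max_row, max_col) in pixel coordinates
    min_r, min_c, max_r, max_c = bbox
    v_max = (max_r - min_r) * GSD      # maximum vertical diameter, m
    h_max = (max_c - min_c) * GSD      # maximum horizontal diameter, m
    perimeter = 2 * (v_max + h_max)    # equation (5)
    area = v_max * h_max               # equation (6)
    return perimeter, area

def building_density(binary_mask: np.ndarray, precision=0.9217, recall=0.7641):
    # Equation (7): share of segmented pixels, corrected by the
    # precision/recall ratio obtained during validation (Table 2)
    return (precision / recall) * binary_mask.sum() / binary_mask.size
```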
5. Experiment
Experiments were conducted in multiple ways: for the neural network part, for the post-processing, and for the whole system. The necessity of such conditions is justified by the modularity of the solution and the comparability of each part to similar approaches for pre-processing, neural network processing and post-processing. For the neural network, accuracy, precision, recall and F1-score are calculated and compared to similar Unet-like solutions for remote sensing semantic segmentation. Unet-like architectures are chosen for comparison because we are not suggesting a brand-new neural network architecture to compare against a wider range of architectures – the main focus of the research is the post-processing part. The point of the comparison is to demonstrate that the suggested Unet backbone is no worse than existing similar models while still being optimized for the needs of the post-processing module. The suggested architecture (SA) is compared to the original Unet, HSFA-Unet [9], Refined Unet [10] and Stacked Unets [40, 41]. All mentioned neural networks are run on the custom dataset used for development and testing in the current work [27]. The results of neural network testing are provided in Table 3.

Table 3
Neural network testing results
Neural network   Accuracy   Precision   Recall   F1-score
SA               0.9834     0.9217      0.7641   0.8355
Unet             0.8911     0.9316      0.7923   0.8563
HSFA-Unet        0.9831     0.8832      0.7373   0.8036
Refined Unet     0.7712     0.6909      0.7601   0.7238
Stacked Unets    0.8989     0.8877      0.7878   0.8347

The traditional Unet is slightly better in terms of general performance, but the suggested Unet-like architecture is preferable when the post-processing routine is applied, because of its much higher accuracy compared to the original Unet. Once the optimal neural network architecture was determined, another experiment was conducted to test the main part of the research – the post-processing method for measurements of land objects. The calculations for the building shown in figures 2-4 are given in Table 4.

Table 4
Building calculation results
Metric                            Value
Maximum horizontal diameter, m    105.4
Maximum vertical diameter, m      121.8
Perimeter, m                      454.4
Area, m2                          12837
Density, %                        23

Further experiments showed that calculating the density for the whole scene instead of separate tiles lowers the value by approximately 10%.

6. Results and discussion
In the current research, an end-to-end AI pipeline is suggested, including pre-processing, neural network modeling and post-processing. The pre-processing stage relies heavily on the results of previous work [27], which suggests a complete approach to dataset development for solving remote sensing problems; it is used in the current research with minimal changes, namely additional augmentation and image enhancement techniques applied to improve the performance of the neural network processing and post-processing.

The second part of the solution extends the use of Unet for remote sensing by suggesting another approach to configuring this architecture for solving different problems and challenges, including instance segmentation and land object measurements, for which it was not originally purposed. Custom metrics and loss functions are developed which complement the post-processing and are highly recommended for land measurement tasks such as urban planning. This section provides all the information required to successfully reproduce the described experiment.

The post-processing is the main part of the research; all the previous work on the neural network configuration and custom metrics development was done to complement it. The post-processing workflow solves a very applied task fully automatically – no interaction with other systems or an operator is needed. All suggested mechanisms are flexible and interchangeable. These results can be used to conduct experiments in other areas of research (e.g. healthcare) and with other neural network architectures. The developed approach can be improved further in multiple ways:
• increasing the amount of data and its cleanup;
• optimizing the dice loss function;
• fine-tuning the neural network or replacing it with another one;
• investigating and adding factors to the density calculation mechanism, such as counting vegetation and other objects.
All results are currently applied to only one class of objects – buildings. Other classes of objects are planned to be added to the dataset in the future. Since the source of data remains the same WorldView-3 imagery, the developed approach is expected to demonstrate the same performance for new classes of objects, which may be trees, vehicles, etc.

7. Conclusions
The suggested end-to-end approach has been shown to provide promising results in processing multiple types of very high-resolution satellite imagery data.
The current paper demonstrates good processing quality for WorldView-3 imagery. The obtained results lead to the conclusion that the methods suggested in this research paper are suitable for non-RGB images of very high resolution, such as satellite imagery data. Even though the approach and methods are applied to WorldView-3 imagery in the current work, they are not limited to it and can be used with similar satellite imagery from other satellites such as Landsat or Sentinel. Application of the suggested methods to different remote sensing imagery is possible due to the flexibility that deep learning tools provide, and all aspects of adaptation and optimization for use with different imagery are covered in the previous sections.

Another important aspect of this work is that it demonstrates an application of deep learning tools not to popular open-source remote sensing images (such as Landsat 8, GeoEye-1 or Sentinel-2) but to a commercial one, WorldView-3, which is of better resolution and quality and the least covered by research in comparison to government satellites. Furthermore, the better resolution, quality and informational content of WorldView-3 imagery may affect deep learning solutions in multiple ways – for better and for worse, due to the specifics of neural networks. The latter increases the importance of covering the 'unpopular' commercial satellite imagery with research. Commercial remote sensing imagery is very important for applications in different fields of knowledge due to its usually better technical quality compared to public satellites; in addition, commercial imagery provides more frequent coverage of the land, which may be crucial for change detection and for responding to any sort of humanitarian crisis.

8. References
[1] D.M. Hordiiuk and V.V. Hnatushenko, Neural network and local laplace filter methods applied to very high resolution remote sensing imagery in urban damage detection, 2017 IEEE International Young Scientists Forum on Applied Physics and Engineering (YSF), Lviv, 2017, pp. 363-366, doi: 10.1109/YSF.2017.8126648.
[2] V. Hnatushenko, P. Kogut, M. Uvarov, On Satellite Image Segmentation via Piecewise Constant Approximation of Selective Smoothed Target Mapping, Applied Mathematics and Computation, Vol. 389, 2020, Id 125615, 26 p. https://doi.org/10.1016/j.amc.2020.125615.
[3] D. Hordiiuk, I. Oliinyk, V. Hnatushenko, K. Maksymov, Semantic Segmentation for Ships Detection from Satellite Imagery. 2019 IEEE 39th International Conference on Electronics and Nanotechnology (ELNANO). doi:10.1109/elnano.2019.8783822.
[4] Zhu, Xiao Xiang, et al., Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8-36, Dec. 2017, doi: 10.1109/MGRS.2017.2762307.
[5] D. Mozgovoy, V. Hnatushenko, and V. Vasyliev, Accuracy evaluation of automated object recognition using multispectral aerial images and neural network, Proc. SPIE 10806, Tenth International Conference on Digital Image Processing (ICDIP 2018), 108060H (9 August 2018). https://doi.org/10.1117/12.2502905.
[6] Zhang Liangpei, Lefei Zhang, and Bo Du, Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine 4.2 (2016) 22-40.
[7] Ma Lei, et al., Deep learning in remote sensing applications: A meta-analysis and review. ISPRS Journal of Photogrammetry and Remote Sensing 152 (2019) 166-177.
[8] Yuan, Qiangqiang, et al., Deep learning in environmental remote sensing: Achievements and challenges.
Remote Sensing of Environment 241 (2020) 111716. [9] He Nanjun, Leyuan Fang, and Antonio Plaza, Hybrid first and second order attention Unet for building segmentation in remote sensing images. Science China Information Sciences 63.4 (2020) 1-12. [10] L. Jiao, L. Huo, C. Hu and P. Tang, Refined UNet: UNet-Based Refinement Network for Cloud and Shadow Precise Segmentation. Remote Sens. 2020, 12, 2001. https://doi.org/10.3390/rs12122001. [11] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber, Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona) (2011). [12] Zhang, Jianming, et al, Lightweight deep network for traffic sign classification. Annals of Telecommunications 75.7 (2020) 369-379. [13] D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber, Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe (2012). [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada, (2012). [15] Agostinelli, Forest, Michael R. Anderson, and Honglak Lee, Adaptive multi-column deep neural networks with application to robust image denoising. Advances in Neural Information Processing Systems (2013). [16] Yuan, Qiangqiang, H. Shen, T. Li, Zhi-wei Li, Shuwen Li, Yun Jiang, Hongzhang Xu, W. Tan, Q. Yang, Jiwen Wang, Jianhao Gao and Liangpei Zhang, Deep learning in environmental remote sensing: Achievements and challenges. Remote Sensing of Environment 241 (2020) 111716. [17] Fu, Gang, et al, Classification for high resolution remote sensing imagery using a fully convolutional network. Remote Sensing 9.5 (2017) 498. [18] Maggiori, Emmanuel, et al, Fully convolutional neural networks for remote sensing image classification. 2016 IEEE international geoscience and remote sensing symposium (IGARSS). IEEE, 2016. [19] Sun, Weiwei, and Ruisheng Wang, Fully convolutional networks for semantic segmentation of very high resolution remotely sensed images combined with DSM. IEEE Geoscience and Remote Sensing Letters 15.3 (2018) 474-478. [20] Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla, Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39.12 (2017) 2481-2495. [21] Papandreou, George, et al, Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. Proceedings of the IEEE international conference on computer vision (2015). [22] Zhang, Mengmeng, Wei Li, and Qian Du, Diverse region-based CNN for hyperspectral image classification. IEEE Transactions on Image Processing 27.6 (2018) 2623-2634. [23] M. Y. Saifi, J. Singla, Nikita, Deep Learning based Framework for Semantic Segmentation of Satellite Images. 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC). doi:10.1109/iccmc48092.2020.iccmc-00069 [24] E. Saralioglu, O. Gungor, Semantic segmentation of land cover from high resolution multispectral satellite images by spectral-spatial convolutional neural network. Geocarto International, 1–21, (2020). doi:10.1080/10106049.2020.1734871 [25] Satimagingcorp. WorldView-3 Satellite Sensor | Satellite Imaging Corp. (2016). 
URL: https://www.satimagingcorp.com/satellite-sensors/worldview-3/
[26] Schowengerdt R., Remote sensing: models and methods for image processing, New York: Academic Press, 2007, p. 560.
[27] V. Hnatushenko and V. Zhernovyi, Complex Approach of High-Resolution Multispectral Data Engineering for Deep Neural Network Processing. In: Lytvynenko V., Babichev S., Wójcik W., Vynokurova O., Vyshemyrskaya S., Radetskaya S. (eds) Lecture Notes in Computational Intelligence and Decision Making. ISDMCI 2019. Advances in Intelligent Systems and Computing, (2020) vol 1020. Springer, Cham. https://doi.org/10.1007/978-3-030-26474-1_46.
[28] V.J. Kashtan, V.V. Hnatushenko and Y.I. Shedlovska, Processing technology of multispectral remote sensing images, 2017 IEEE International Young Scientists Forum on Applied Physics and Engineering (YSF), Lviv, 2017, pp. 355-358, doi: 10.1109/YSF.2017.8126647.
[29] Y.I. Shedlovska and V.V. Hnatushenko, Shadow detection and removal using a shadow formation model, 2016 IEEE First International Conference on Data Stream Mining & Processing (DSMP), Lviv, Ukraine, 2016, pp. 187-190, doi: 10.1109/DSMP.2016.7583537.
[30] T. Stöttner (2019, May 16), Why Data should be Normalized before Training a Neural Network. Medium. URL: https://towardsdatascience.com/why-data-should-be-normalized-before-training-a-neural-network-c626b7f66c7d.
[31] O. Ronneberger, P. Fischer, and T. Brox, U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical image computing and computer-assisted intervention. Springer, Cham (2015).
[32] He Nanjun, Leyuan Fang, and Antonio Plaza, Hybrid first and second order attention Unet for building segmentation in remote sensing images. Science China Information Sciences 63.4 (2020) 1-12.
[33] Sun Shuting, et al., L-UNet: An LSTM Network for Remote Sensing Image Change Detection. IEEE Geoscience and Remote Sensing Letters (2020). doi: 10.1109/LGRS.2020.3041530.
[34] Cao Kaili and Xiaoli Zhang, An improved res-unet model for tree species classification using airborne high-resolution images. Remote Sensing 2020; 12(7): 1128. https://doi.org/10.3390/rs12071128.
[35] V. Hnatushenko and V. Zhernovyi, Method of Improving Instance Segmentation for Very High Resolution Remote Sensing Imagery Using Deep Learning. In: Babichev S., Peleshko D., Vynokurova O. (eds). Data Stream Mining & Processing. DSMP 2020. Communications in Computer and Information Science, vol. 1158. Springer, Cham. https://doi.org/10.1007/978-3-030-61656-4_21.
[36] Milletari Fausto, Nassir Navab and Seyed-Ahmad Ahmadi, V-net: Fully convolutional neural networks for volumetric medical image segmentation. 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016.
[37] Carvana Image Masking Challenge | Kaggle. (2015). URL: https://www.kaggle.com/c/carvana-image-masking-challenge/discussion/40199
[38] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia and H. Wu, Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
[39] Nobuyuki Otsu, A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, Jan. 1979, doi: 10.1109/TSMC.1979.4310076.
[40] A. Ghosh, M. Ehrlich, S. Shah, L. Davis, and R. Chellappa, Stacked U-Nets for Ground Material Segmentation in Remote Sensing Imagery. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 252-256. doi:10.1109/cvprw.2018.00047.
[41] X. Yuan, J. Shi, and L.
Gu, A Review of Deep Learning Methods for Semantic Segmentation of Remote Sensing Imagery. Expert Systems with Applications, 2020, 114417. doi:10.1016/j.eswa.2020.114417.