=Paper= {{Paper |id=Vol-2485/paper39 |storemode=property |title=Building Recognition in Air and Satellite Photos |pdfUrl=https://ceur-ws.org/Vol-2485/paper39.pdf |volume=Vol-2485 |authors=Dmitriy Bulatitskiy,Aleksandr Buyval,Mikhail Gavrilenkov }} ==Building Recognition in Air and Satellite Photos== https://ceur-ws.org/Vol-2485/paper39.pdf
                         Building Recognition in Air and Satellite Photos
                                        D.I. Bulatitskiy1, A.K. Buyval2, M.A. Gavrilenkov1
                              bulatizkydi@mail.ru, alexbuyval@gmail.com, gavrilenkov@umlab.ru
                                        1 Bryansk State Technical University, Bryansk, Russia
                                        2 Innopolis University, Innopolis, Russia

    The paper deals with algorithms for building recognition in air and satellite photos. The use of convolutional artificial neural
networks to solve the image segmentation problem is substantiated, and a choice between two artificial neural network architectures
is considered. The development of software implementing building recognition based on convolutional neural networks is described,
including the architecture of the software complex, some features of its construction, and its interaction with the cloud geo-information
platform in which it functions. The application of the developed software to the recognition of buildings in images is presented, and the
results of experiments on building recognition in pictures of various resolutions and with various types of buildings are analysed.
    Keywords: Earth remote sensing, building recognition in photos, convolutional neural networks, semantic image segmentation.

1. Introduction

    At present, the recognition of construction objects in satellite and air photographs, which is part of the operation of many government departments and commercial structures, is often carried out manually. Such processes as cadastral surveys, control over observing the borders of separate and protective zones, control over the use of land as intended, control over the registration of buildings in the state register, and others require considerable cost and labor. Therefore, it is necessary to automate the recognition and classification of objects in satellite and air photographs through the use of information technologies, in particular computer vision and machine learning, which show good results in related fields.
    The source data for the problem to be solved are usually GeoTiff files, which contain both the terrain image and information on the spatial resolution of pixels and the binding of the image to geographical coordinates. As the output, it is necessary to obtain the contours of the detected buildings in vector form in geocoordinates.
    Several phases can be distinguished in the solution of the initial problem:
    1. Getting a bitmap of the terrain from the original GeoTiff file and the building recognition proper, that is, selecting areas in the picture and classifying them as buildings of a particular type.
    2. Detecting polygon boundaries in vector form and converting bit-mapped coordinates into geographic ones based on geodata from the original GeoTiff file.
    3. Post-processing the selected polygons, including the application of rules and heuristics for filtering and classification refinement.
    Each processing phase uses its own set of approaches and technologies, but for the convenience of the end user it is advisable to implement the solution of this problem as a single act.

2. Selection of Methods and Algorithms for Solving the Problem of Building Recognition

    The building recognition problem can be referred to a class of machine vision problems called "semantic segmentation", in which each pixel of the original image must be assigned to one of the predefined semantic classes. If the way of referring pixels to semantic classes corresponds to the human perception of the image, the pixels will be grouped into areas that contain dots of a certain class only, or mainly of this class. Thus, the whole image will be divided into a finite number of segments, each of which is either an object of one of the required classes or its background.
    To solve the segmentation problem, two types of methods can be distinguished: 1) classical and 2) based on artificial neural networks (ANNs).
    Classical methods include K-means clustering, edge detection, watershed transformation, and others. However, classical approaches, as a rule, show good results only on simple images and after careful adjustment of parameters. At the same time, they are extremely unstable to various changes in the image (brightness, contrast, and others). And, probably, the most important drawback is that these methods do not allow determining the class of the found object.
    In their turn, image segmentation methods based on artificial neural networks significantly surpass classical methods in accuracy and stability.
    Analysis of some papers [1-6] and of the results of competitions in image processing and Earth remote sensing (ERS) [7-10] allowed us to conclude that the use of convolutional neural networks for solving the problem of building recognition is reliable, and that the U-Net and DeepLabV3 architectures are the most attractive for further research and experimental testing.
    On the basis of source code libraries provided by the authors of the selected architectures, we tested software for training the corresponding neural network models and for quality control of their work on the basis of the Jaccard index.
    At the first stage of the work, only the results of satellite photography were available; air photography with manned and unmanned aircraft to obtain high-resolution images was still being performed by outsourcers. The tasks of testing the selected ANN architectures were therefore set in the following areas. Firstly, it was necessary to assess the impact of the networks' hyper-parameters on their work. Secondly, we were to check the assumption that marking the shades of buildings can have a beneficial effect on building recognition. Finally, it was necessary to choose one of the architectures for further use in the project.
    For the experiments, a set of data on satellite images was prepared, including more than 600 images with different types of buildings. In total, more than 5,000 buildings of various classes were represented in these images. The images were carefully labeled and divided into a training sample (448 images), a validation sample (110 images), and a test sample (59 images). In each sample, images with large buildings, with private-sector housing, and with no buildings are presented in the same proportions.
    Training of each model took up to several days. Therefore, it was too difficult to perform experiments testing all possible combinations of hyper-parameters and marking variations. Instead, the effect of the hyper-parameters was first studied on the marking with shades, and their best combination was chosen. Then, using the best and worst combinations of hyper-parameters, the ANNs were tested on the marking without shades. The results of this check are presented in Tables 1-2.
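Returning to phase 2 of the pipeline from the problem statement, the conversion of bit-mapped polygon coordinates into geographic ones can be sketched in a few lines. This is a minimal illustration assuming a GDAL-style six-parameter affine geotransform of the kind stored in GeoTiff metadata; the function names and the sample numbers are ours, not taken from the actual software.

```python
# Convert pixel (column, row) coordinates of a detected polygon into
# geographic coordinates using a GDAL-style affine geotransform
# (x0, dx, rx, y0, ry, dy): x0/y0 is the top-left corner, dx/dy the
# pixel size, rx/ry the rotation terms (0 for north-up images).

def pixel_to_geo(col, row, gt):
    """Map one pixel position to geographic coordinates."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y

def polygon_to_geo(polygon, gt):
    """Map a polygon given as (col, row) vertices to geocoordinates."""
    return [pixel_to_geo(c, r, gt) for c, r in polygon]

# Hypothetical north-up image anchored at (37.0, 53.3) with a pixel
# size of 0.0001 degrees; note dy is negative because row numbers
# grow downwards while latitude grows upwards.
gt = (37.0, 0.0001, 0.0, 53.3, 0.0, -0.0001)
print(polygon_to_geo([(0, 0), (100, 0), (100, 100), (0, 100)], gt))
```

The same transform, inverted, is what allows the ground-truth vector marking to be rasterized into training labels.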



Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    The following conclusions were made from the analysis of Tables 1-2.
1. Both architectures showed that taking shades into account only makes the result worse. Therefore, all further marking and other work were performed without taking shades into account.
2. The best values of the Jaccard index were achieved with the DeepLabV3 network: 89.5% vs 77% for U-Net. Therefore, it was decided to conduct further research on the basis of the DeepLabV3 network.
3. When the value of output stride is decreased, a slight improvement of the result is observed (especially noticeable on smaller objects); however, resource consumption grows many times over (GPU computation time and memory, training time). The optimal value of this parameter is 16.
4. Increasing batch_size raises memory consumption during training, but yields a much better result and reduces training time. When training, it is recommended to increase this parameter as much as GPU memory allows.

Table 1. Results of U-Net Testing

   №    Number of layers    features_root    IoU (%)
   Marking with shades
   1           3                  32            65
   2           3                  48            69
   3           3                  64            52
   4           4                  32            70
   5           4                  48            71
   6           4                  64            57
   7           5                  32            66
   8           5                  48            69
   9           5                  64            68
   Marking without shades
   10          3                  64            67
   11          4                  48            77

Table 2. Results of DeepLabV3 Testing

   №    Reduction factor of output_stride    Batch_size    IoU (%)
   Marking with shades
   1                  16                          2           68
   2                  16                          4           71
   3                   8                          2           70
   4                   8                          4           72
   Marking without shades
   5                  16                          2           86
   6                  16                          4           89
   7                   8                          4           89.5

3. Software Development for Building Recognition

    Simultaneously with the experiments on selecting the neural network architecture, software was developed which is intended to function as a part of a cloud geoinformation platform (CGIP). This platform is being created at Innopolis University and should become a comprehensive system for promoting products and services in the field of remote sensing of the Earth. The Building Recognition in Images Service (BRiIS) is one of the internal services of the cloud geoinformation platform; it does not communicate directly with the users, but the results that the user gets directly depend on the quality of its operation. BRiIS is quite a resource-intensive part of CGIP, and its implementation has an exploratory character: scenarios are very likely in which the models and algorithms of the BRiIS core undergo significant changes. These factors have determined the requirements for the organization of BRiIS and CGIP interaction: BRiIS should be as isolated from CGIP as possible; BRiIS should be easy to scale; and there should be means of monitoring BRiIS operation. Taking into account these requirements, the architecture of the BRiIS service shown in Fig. 1 was developed.
    The following data flows can be identified: task information and source files come from the platform to BRiIS, while resulting files and diagnostic messages go from the service to the platform.
    It was decided to organize the first flow, through which tasks are transferred, on the basis of a RabbitMQ queue. The web user interface is on the platform side. The user chooses files for processing and additional recognition options: the building classes and images he is interested in, and others. The platform generates processing tasks and sends them to the queue.

Fig. 1. Service BRiIS architecture

    Service BRiIS watches the task queue and, as messages appear, de-queues and processes them. This way of transferring tasks provides not only their guaranteed delivery but also the scalability of the system. If necessary, several BRiIS instances can be launched watching the same task queue and performing the tasks in parallel.
    The message contains a JSON data structure. It holds the identifier and the type of the task, the path to the source directory, and the path to the target directory where service BRiIS should write the resulting files.
    For debugging purposes, a command-line utility has also been developed that allows sending tasks to the queue for processing one at a time or in batches.
    The second data flow, which provides the feedback from service BRiIS, is organized by sending diagnostic messages by POST over the HTTP protocol in JSON format. Messages are sent when significant task processing events occur: when the task is de-queued, when processing begins, when polygons are formed, and when the task is completed.
    The third data flow is provided by file exchange. For its successful functioning, service BRiIS should have access to the file system of the geographic information platform. During task processing, service BRiIS refers to it for the source files and writes intermediate and final results there. The paths to the source directory and the resulting directory are specified in the message de-queued from the task queue.
    The central part of the service is a task processing module, which is based on the convolutional neural network DeepLabV3. The neural network is surrounded by a processing pipeline for geographical images. First, the image is cut into fragments of the desired size; these are transferred in batches to the neural network, and the resulting segmented fragments are combined into an image of the original size. Then, a vectoring procedure finds the outlines of the buildings, approximates them into polygons with pixel coordinates, performs the initial filtering of noise, and transfers the polygons into geographical coordinates based on the position and scale data of the source geoimages. Finally, the selected polygons are post-processed based on heuristics.
    BRiIS uses a large number of libraries, many of which are large, require related libraries of certain versions, or involve a non-standard installation process. All this greatly complicates the
environment adjustment for BRiIS. In some cases, with certain combinations of operating system versions and installed software, adjustment may not be possible at all. Therefore, we decided to run BRiIS in an isolated Docker container environment.
    In addition to the isolation of applications, using Docker containers makes the service easy to deploy and replicate. The operating environment, all necessary libraries with all their dependencies, as well as the application modules and scripts, are packaged into an image. This image is transferred to other machines, unpacked, and the service container is started.
    The machine learning utilities shown in Fig. 1 are not a direct part of CGIP; they are designed to prepare the neural network models which are then used by BRiIS. The utilities allow generating ground truth labels based on the original GeoTiff files and the ground truth marking made in vector form in GIS systems, then assembling these training sets in the special tf-records format, and finally executing the learning procedure itself.
    Python was chosen as the language for creating BRiIS and related programs. The neural networks are made with the use of the open-source machine learning library TensorFlow from Google. The CUDA framework from Nvidia is used to speed up calculations.

4. Evaluation Criteria

    Taking into account the objectives of developing service BRiIS, there are two main typical scenarios of its application:
    - reconciliation of the building boundaries recorded within a new session of ERS with the registered ones;
    - detection of new buildings within a new session of ERS, not previously recorded.
    In the first case, it is of paramount importance to determine the boundaries of buildings as accurately as possible. For these purposes, the best criteria for calculating the accuracy of the recognition algorithm are those based on the quantitative similarity between the ground truth label and the predicted one in a pixel-by-pixel comparison. In this paper one of the strictest criteria was used, the Jaccard index, which in its finite-set version (at a given resolution, the image is a finite set of pixels) can be written as follows:

    K = n(A ∩ B) / (n(A) + n(B) − n(A ∩ B)) = n(A ∩ B) / n(A ∪ B).

    This measure is also called Intersection over Union (IoU), which reflects the essence of the fraction above.
    For the second case, more suitable are measures based on counting the number of buildings whose polygons in the predicted label sufficiently intersect with the polygons of the ground truth label. In other words, the comparison is made not by pixels but by pieces (buildings). The score is calculated as follows:

    F_score = 2 · Precision · Recall / (Precision + Recall),
    Precision = TP / (TP + FP),
    Recall = TP / (TP + FN),

where Precision is the accuracy of the algorithm, Recall is its completeness, TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. In this paper, a building is considered to be correctly detected if the Jaccard index for it and its label exceeds 50%.
    After finishing the software development of service BRiIS, including the results visualizer modules, it became possible to use not only a bit-mapped metric based on the Jaccard index (IoU) but also the F-score objective measure, which allows assessing the results better in the context of the ultimate goal, building recognition.
    There is certainly a direct connection between the Jaccard index and the F-score: the better the image is segmented, the easier it is to find the correct building boundaries. However, there is a significant difference. The bit-mapped metric is much more tolerant of gaps or, on the contrary, of false recognition of small buildings, as well as of situations where two close buildings merge into one or, conversely, one building of a complex configuration breaks apart.
    To quantify the quality of building recognition based on the F-score and to visualize the results, a separate application was developed in Python using the Tkinter library.

5. Study of Network DeepLabV3 Operation on Images of Various Types

    At the second stage of the work, not only space images were ready, but also air photographs: in the urban area with resolutions of 0.05 and 0.1 m/pixel, and in rural areas with a resolution of 0.1 m/pixel. In addition to red-green-blue (RGB) images, there were also colored infra-red (CIR) satellite images with a resolution of 0.5 m/pixel. In total, the tagged images contained approximately 50,000 buildings. For evaluation, images of different types and with various types of buildings, containing over 7,000 buildings, were set aside (i.e. not used in training).
    Now the task was to test in practice how the combination of training sets affects the final result.
    Intuitively, it was assumed that separate models should be trained for the datasets of air photos (made by unmanned aerial vehicles, UAV, and by manned aircraft, MA) and space photos (made by satellite, SAT), as their scales are too different. Similarly, the RGB data sets differ from the CIR sets, and therefore separate models should also be trained for them. So, the following models were trained: 1) UAV+MA, 2) SAT(RGB), 3) SAT(CIR), and the results were evaluated based on the F-score. The results of the evaluation are shown in Table 3 (the second column). Fig. 2 shows an example of the results of building recognition in a UAV image.
    Then a general model for all RGB images (UAV, MA and SAT together) was trained, with a separate one for SAT(CIR) images. The results of the evaluation of these models are shown in the third column of Table 3. As can be seen from Table 3, the F-score has not changed for UAV+MA images but improved for SAT images. Probably, this improvement of the recognition results for SAT images is due to their training set being too small, so that adding UAV and MA images, even though they differ in scale, benefits learning.
    Since the training set for SAT(CIR) is even smaller than for SAT(RGB), a unified model for all types of available images was then trained. The results of the evaluation of the unified model are shown in the fourth column of Table 3. As can be seen from Table 3, the F-score has not changed for MA images, improved for UAV images, and slightly deteriorated for SAT. However, the overall F-score has slightly improved.
    Another advantage of the unified model is that there is no need to prepare separate datasets for different types of recording. Also, the unified model speeds up the work of the service, since no time is spent on loading different ANN models when recognizing images of different types.
    For evaluating the results and forming columns 2-4 of Table 3, all objects larger than 2x1 m for air photographs and 4x4 m for satellite images were taken into account. If we ignore all buildings smaller than 3x3 m and 7x7 m respectively, the results improve significantly (see the fifth column of Table 3). This proves the assumption that small objects are the most difficult to recognize.
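Both evaluation measures can be sketched in a few lines of Python. This is a simplified illustration in which regions are represented as sets of pixel coordinates for clarity (a real implementation would work on bit masks); all names are ours, not taken from the BRiIS code.

```python
# Jaccard index (IoU) of two regions given as sets of pixel coordinates:
# K = n(A ∩ B) / n(A ∪ B).
def iou(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Object-level F-score: a predicted building counts as a true positive
# if its IoU with an unmatched ground-truth building exceeds the
# threshold (50% in this paper).
def f_score(predicted, ground_truth, threshold=0.5):
    matched = set()
    tp = 0
    for p in predicted:
        for i, g in enumerate(ground_truth):
            if i not in matched and iou(p, g) > threshold:
                matched.add(i)
                tp += 1
                break
    fp = len(predicted) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two 2x2 "buildings"; the prediction finds the first one exactly
# and misses the second, so precision = 1.0 and recall = 0.5.
gt1 = {(0, 0), (0, 1), (1, 0), (1, 1)}
gt2 = {(5, 5), (5, 6), (6, 5), (6, 6)}
pred = [{(0, 0), (0, 1), (1, 0), (1, 1)}]
print(f_score(pred, [gt1, gt2]))
```

The sketch makes the difference between the two metrics visible: a completely missed small building barely changes the pixel-level IoU of the whole image, but costs a full false negative in the object-level F-score.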
Table 3. Results of Network DeepLabV3 Testing (F-score)

  Image              Three separate   Two models    Unified   Unified model
                     models           (RGB/CIR)     model     (3x3, 7x7)
  UAV
  16-1-239-157-В-1   0.879            0.878         0.850     0.923
  16-1-239-157-В-2   0.846            0.871         0.885     0.940
  16-1-239-157-В-3   0.904            0.896         0.913     0.946
  16-1-239-157-В-4   0.813            0.857         0.861     0.922
  16-1-239-157-A-7   0.846            0.826         0.884     0.927
  16-1-239-157-A-9   0.852            0.872         0.887     0.946
  16-1-239-157-A-10  0.867            0.862         0.885     0.899
  16-1-239-157-A-11  0.868            0.915         0.906     0.965
  16-1-239-157-A-13  0.883            0.854         0.888     0.937
  16-1-239-157-A-14  0.858            0.878         0.888     0.928
  16-1-239-157-A-15  0.930            0.853         0.888     0.930
  Konstantinovka     0.867            0.846         0.872     0.893
  Average for UAV    0.868            0.867         0.884     0.930
  MA
  16-33-23-(131-д)   0.760            0.771         0.763     0.829
  mesha_2_12         0.790            0.783         0.808     0.865
  16-33-23-(018-е)   0.777            0.772         0.770     0.798
  16-33-23-(018-b)   0.761            0.781         0.770     0.721
  Average for MA     0.772            0.777         0.778     0.803
  SAT
  Fr4a_RGB           0.705            0.710         0.685     0.764
  Fr7a_RGB           0.612            0.694         0.658     0.715
  Fr7a_CIR           0.605            0.667         0.666     0.728
  Average for SAT    0.641            0.690         0.672     0.740
  Total average      0.812            0.823         0.828     0.872

    The main task of the work was to create a service for building recognition without additional classification by functional profile or other criteria. However, most of the marking at the second stage was performed according to ten classes: Background, Residential building, House, Industrial or commercial building, Administration or educational building, Other non-residential building, Building under construction, Greenhouse, Garages, Foundation of building. A separate model was trained on these data, but its results were much worse due to frequent errors in the classification of the found objects. Tables with the results for ten classes are very bulky, which is why they are not given in this paper.

Fig. 2. Example of recognition in a UAV image by the model trained only on the UAV+MA sets (first is the original image, second is the reference marking, third is the recognition result on top of the reference marking)

6. Conclusion

    Analysis of the papers and of the experimental data obtained when testing the software developed by the authors proves the efficiency of using convolutional neural networks, and in particular the DeepLabV3 architecture, for building recognition in satellite and air photos. The average F-score on the sample of images under study exceeded 80%, which is a very good result, taking into account the fact that the test sample of images contained objects that are difficult to recognize.
    These hard-to-recognize objects include poorly structured clusters of containers and tents in markets, neighborhoods with old low-rise buildings and an abundance of small household buildings standing very close to each other, as well as industrial facilities of complex shape with many link buildings and transporters between buildings. The F-score results are much lower than 80% for all these types of complex constructions. This is quite natural, because even a human using semantic context finds it difficult to determine where the boundary between such objects is. However, finding such areas in the processed images and applying separate models and algorithms to them can give a significant increase in the quality of recognition. It is the development of such combined architectures that is a priority for further research within the framework of the project development.

7. References

[1] Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv:1511.00561v3 [cs.CV] 10 Oct 2016
[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv:1606.00915v2 [cs.CV] 12 May 2017
[3] Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv:1706.05587v3 [cs.CV] 5 Dec 2017
[4] Jonathan Long, Evan Shelhamer, Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. arXiv:1411.4038v2 [cs.CV] 8 Mar 2015
[5] Olaf Ronneberger, Philipp Fischer, Thomas Brox. U-Net:
     Convolutional Networks for Biomedical Image
     Segmentation. arXiv:1505.04597v1 [cs.CV] 18 May 2015
[6] Pierre Sermanet, David Eigen, Xiang Zhang, Michael
     Mathieu, Rob Fergus, Yann LeCun. OverFeat: Integrated
     Recognition,     Localization  and     Detection   using
     Convolutional Networks. arXiv:1312.6229v4 [cs.CV] 24
     Feb 2014
[7] 2015 IEEE GRSS Data Fusion Contest Results
     http://www.grss-ieee.org/community/technical-
     committees/data-fusion/2015-ieee-grss-data-fusion-
     contest-results/
[8] 2016 IEEE GRSS Data Fusion Contest Results
     http://www.grss-ieee.org/community/technical-
     committees/data-fusion/2016-ieee-grss-data-fusion-
     contest-results/
[9] 2017 IEEE GRSS Data Fusion Contest Results
     http://www.grss-ieee.org/community/technical-
     committees/data-fusion/2017-ieee-grss-data-fusion-
     contest-results/
[10] 2018 IEEE GRSS Data Fusion Contest Results
     http://www.grss-ieee.org/community/technical-
     committees/data-fusion/2018-ieee-grss-data-fusion-
     contest-results/