=Paper=
{{Paper
|id=Vol-2485/paper39
|storemode=property
|title=Building Recognition in Air and Satellite Photos
|pdfUrl=https://ceur-ws.org/Vol-2485/paper39.pdf
|volume=Vol-2485
|authors=Dmitriy Bulatitskiy,Aleksandr Buyval,Mikhail Gavrilenkov
}}
==Building Recognition in Air and Satellite Photos==
D.I. Bulatitskiy (1), A.K. Buyval (2), M.A. Gavrilenkov (1)

bulatizkydi@mail.ru, alexbuyval@gmail.com, gavrilenkov@umlab.ru

(1) Bryansk State Technical University, Bryansk, Russia
(2) Innopolis University, Innopolis, Russia

The paper deals with algorithms for building recognition in air and satellite photos. The use of convolutional artificial neural networks for the image segmentation problem is substantiated, and the choice between two neural network architectures is considered. The development of software implementing building recognition based on convolutional neural networks is described: the architecture of the software complex, some features of its construction, and its interaction with the cloud geo-information platform in which it functions. The application of the developed software to building recognition in images is described, and the results of experiments on images of various resolutions and types of buildings are analysed.

Keywords: Earth remote sensing, building recognition in photos, convolutional neural networks, semantic picture segmentation.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

At present, the recognition of construction objects in satellite and air photographs, which is a part of the operation of many government departments and commercial structures, is often carried out manually. Such processes as cadastral surveys, control over observing the borders of separate and protective zones, control over the use of land as intended, control over the entry of buildings into the state register, and others require considerable cost and labor. Therefore, it is necessary to automate the recognition and classification of objects in satellite and air photographs through the use of information technologies, in particular computer vision and machine learning, which show good results in related fields.

The source data for the problem are usually GeoTiff files, which contain both the terrain image and information on the spatial resolution of pixels and the binding of the image to geographical coordinates. As the output, it is necessary to obtain the contours of the detected buildings in vector form in geographic coordinates.

Several phases can be distinguished in the solution of the problem:
1. Getting a bitmap of the terrain from the original GeoTiff file and building recognition proper, that is, selecting areas in the picture and classifying them as buildings of a particular type.
2. Detecting polygon boundaries in vector form and converting bit-mapped coordinates into geographic ones based on geodata from the original GeoTiff file.
3. Post-processing the selected polygons, including the application of rules and heuristics for filtering and classification refinement.

Each processing phase uses its own set of approaches and technologies, but for the convenience of the end user it is advisable to implement the solution of this problem as a single step.

2. Selection of Methods and Algorithms for Solving the Problem of Building Recognition

The building recognition problem belongs to the class of machine vision problems called "semantic segmentation", in which each pixel of the original image must be assigned to one of the predefined semantic classes. If the way pixels are assigned to semantic classes corresponds to the human perception of the image, the pixels will be grouped into areas that contain dots of a certain class only, or mainly of this class. Thus, the whole image will be divided into a finite number of segments, each of which is an object of one of the required classes or its background.

To solve the segmentation problem, two types of methods can be distinguished: 1) classical and 2) based on artificial neural networks (ANNs).

Classical methods include K-means clustering, edge detection, watershed transformation and others. However, classical approaches, as a rule, show good results only on simple images and after careful adjustment of parameters. At the same time, they are extremely sensitive to various changes in the image (brightness, contrast, and others). And, probably, the most important drawback is that these methods do not allow determining the class of the found object.

In their turn, image segmentation methods based on artificial neural networks significantly surpass classical methods in accuracy and stability.

Analysis of some papers [1-6] and of the results of image processing and Earth remote sensing (ERS) competitions [7-10] allowed us to conclude that the use of convolutional neural networks for building recognition is reliable, and that the U-Net and DeepLabV3 architectures are the most attractive for further research and experimental testing.

On the basis of the source code libraries provided by the authors of the selected architectures, we built software for training the corresponding neural network models and for quality control of their work on the basis of the Jaccard index.

At the first stage of the work only satellite photos were available; air photography with manned and unmanned aircraft to obtain high-resolution images was still being performed by outsourcers. The tasks of testing the selected ANN architectures were therefore set in the following areas. Firstly, it was necessary to assess the impact of the hyper-parameters of the networks on their work. Secondly, we were to check the assumption that shade marking of buildings can have a beneficial effect on building recognition. Finally, it was necessary to choose one of the architectures for further use in the project.

For the experiments, a set of satellite images was prepared, including more than 600 images with different types of buildings. In total, more than 5,000 buildings of various classes were represented in these images. The images were carefully labeled and divided into a training sample (448 images), a validation sample (110 images), and a test sample (59 images). In each sample, images with large buildings, with the private sector, and with no buildings are presented in the same proportions.

Training of each model took up to several days. It was therefore too difficult to test all possible combinations of hyper-parameters and marking variations. Instead, the effect of the hyper-parameters was first studied on the marking with shades and their best combination was chosen. Then, using the best and the worst combinations of hyper-parameters, the networks were tested on the marking without shades. The results of this check are presented in Tables 1-2.

Table 1. Results of U-Net Testing

Marking with shades:
| № | Number of layers | features_root | IoU (%) |
|---|---|---|---|
| 1 | 3 | 32 | 65 |
| 2 | 3 | 48 | 69 |
| 3 | 3 | 64 | 52 |
| 4 | 4 | 32 | 70 |
| 5 | 4 | 48 | 71 |
| 6 | 4 | 64 | 57 |
| 7 | 5 | 32 | 66 |
| 8 | 5 | 48 | 69 |
| 9 | 5 | 64 | 68 |

Marking without shades:
| № | Number of layers | features_root | IoU (%) |
|---|---|---|---|
| 10 | 3 | 64 | 67 |
| 11 | 4 | 48 | 77 |

Table 2. Results of DeepLabV3 Testing

Marking with shades:
| № | Batch_size | Reduction factor of output_stride | IoU (%) |
|---|---|---|---|
| 1 | 16 | 2 | 68 |
| 2 | 16 | 4 | 71 |
| 3 | 8 | 2 | 70 |
| 4 | 8 | 4 | 72 |

Marking without shades:
| № | Batch_size | Reduction factor of output_stride | IoU (%) |
|---|---|---|---|
| 5 | 16 | 2 | 86 |
| 6 | 16 | 4 | 89 |
| 7 | 8 | 4 | 89.5 |

The following conclusions were drawn from the analysis of Tables 1-2:
1. Both architectures showed that taking shades into account only makes the result worse. Therefore it was decided to perform the marking and all further work without taking shades into account.
2. The best values of the Jaccard index were achieved with the DeepLabV3 network: 89.5% vs 77% for U-Net. Therefore it was decided to conduct further research on the basis of the DeepLabV3 network.
3. When the value of output_stride is decreased, a slight improvement of the result is observed (especially noticeable on smaller objects); however, resource intensity grows many times over (GPU time and memory consumption, training time). The optimal value of this parameter is 16.
4. Increasing batch_size increases memory consumption during training, but gives a much better result and reduces training time. When training, it is recommended to increase this parameter as much as GPU memory allows.

3. Software Development for Building Recognition

Simultaneously with the experiments on selecting the neural network architecture, software was developed that is intended to function as a part of a cloud geoinformation platform (CGIP). This platform is being created at Innopolis University and should become a comprehensive system for promoting products and services in the field of remote sensing of the Earth. The building recognition in images service (BRiIS) is one of the internal services of the cloud geoinformation platform; it does not communicate directly with the users, but the results the user gets depend directly on the quality of its operation. BRiIS is quite a resource-intensive part of CGIP, and its implementation has an exploratory character: scenarios are very likely in which the models and algorithms of the BRiIS core undergo significant changes.

These factors determined the requirements for the organization of BRiIS and CGIP interaction: BRiIS should be as isolated from CGIP as possible; BRiIS should be easy to scale; there should be means of monitoring BRiIS operation. Taking these requirements into account, the architecture of the BRiIS service shown in Fig. 1 was developed.

Fig. 1. Service BRiIS architecture

The following data flows can be identified: task information and source files come from the platform to BRiIS; resulting files and diagnostic messages come from the service to the platform.

It was decided to organize the first flow, through which tasks are transferred, on the basis of a RabbitMQ queue. The web user interface is on the platform side. The user chooses files for processing and additional recognition options: the building classes and images he is interested in, and others. The platform generates processing tasks and sends them to the queue.

The BRiIS service guards the task queue and, as messages appear, de-queues and processes them. This way of transferring tasks provides not only guaranteed delivery but also scalability of the system: if necessary, several BRiIS instances can be launched guarding the same task queue and performing the tasks in parallel.

The message contains a JSON data structure with the identifier and the type of the task, the path to the source directory, and the path to the target directory where the BRiIS service should write the resulting files.

For debugging purposes, a command-line utility has also been developed that allows sending tasks to the queue for processing one at a time or in batches.

The second data flow, which provides feedback from the BRiIS service, is organized by sending diagnostic messages by POST over HTTP in JSON format. Messages are sent when significant task processing events occur: when the task is de-queued, when processing begins, when polygons are formed, and when processing is completed.

The third data flow is provided by file exchange. For its successful functioning, the BRiIS service should have access to the file system of the geographic information platform. During task processing, BRiIS refers to it for the source files and writes intermediate and final results there. The paths to the source directory and the resulting directory are specified in the message de-queued from the task queue.

The central part of the service is a task processing module, which is based on the convolutional neural network DeepLabV3. The neural network is surrounded by a pipeline for processing geographical images. First, the image is cut into fragments of the desired size, the fragments are transferred in batches to the neural network, and the resulting segmented fragments are combined into an image of the original size. Then a vectoring procedure finds the outlines of buildings, approximates them as polygons with pixel coordinates, performs initial filtering of noise, and transfers the polygons into geographical coordinates based on the position and scale data of the source geoimages. Finally, the selected polygons are post-processed based on heuristics.

BRiIS uses a large number of libraries, many of which are large, require related libraries of certain versions, or involve a non-standard installation process. All this greatly complicates setting up the environment for BRiIS; in some cases, with certain combinations of operating system versions and installed software, setup may not be possible at all. Therefore we decided to run BRiIS in an isolated Docker container environment.

In addition to isolating the application, Docker containers make it easy to deploy and replicate. The operating environment, all necessary libraries with their dependencies, as well as the application modules and scripts, are packaged into an image. This image is transferred to other machines, unpacked, and the service container is started.

The machine learning utilities shown in Fig. 1 are not a direct part of CGIP; they are designed to prepare the neural network models that are then used by BRiIS. The utilities generate ground truth labels from the original GeoTiff files and the ground truth markup made in vector form in GIS systems, then assemble these training sets in the special tf-records format, and finally execute the learning procedure itself.

Python was chosen as the language for creating BRiIS and the related programs. The neural networks are built with Google's open-source machine learning library TensorFlow. Nvidia's CUDA framework is used to speed up the calculations.

4. Evaluation Criteria

Taking into account the objectives of developing the BRiIS service, there are two main typical scenarios of its application:
- reconciliation of the building boundaries recorded within a new ERS session with those already registered;
- detection of new buildings within a new ERS session that were not previously recorded.

In the first case, it is of paramount importance to determine the boundaries of buildings as accurately as possible. For these purposes, the best criteria measure the accuracy of the recognition algorithm through the quantitative similarity between the ground truth label and the predicted one in a pixel-by-pixel comparison. In this paper one of the strictest criteria was used, the Jaccard index, which in its finite-set version (at a given resolution, the image is a finite set of pixels) can be written as follows:

K = n(A ∩ B) / (n(A) + n(B) - n(A ∩ B)) = n(A ∩ B) / n(A ∪ B).

This measure is also called Intersection over Union (IoU), which reflects the essence of the fraction above.

For the second case, more suitable are measures based on counting the number of buildings whose polygons in the predicted label sufficiently intersect the polygons of the ground truth label. In other words, the comparison is made not by pixels but by pieces (buildings). The score is calculated as follows:

F_score = 2 * Precision * Recall / (Precision + Recall),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),

where Precision is the accuracy of the algorithm, Recall is its completeness, TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. In this paper, a building is considered correctly detected if the Jaccard index for it and its label exceeds 50%.

After finishing the software development of the BRiIS service, including the result visualizer modules, it became possible to use not only the bit-mapped metric based on the Jaccard index (IoU) but also the F-score measure, which assesses the results better in the context of the ultimate goal, building recognition.

There is certainly a direct connection between the Jaccard index and the F-score: the better the image is segmented, the easier it is to find the correct building boundaries. However, there is a significant difference: the bit-mapped metric is much more tolerant of gaps or, on the contrary, of false recognition of small buildings, as well as of situations where two close buildings merge into one or, conversely, one building of a complex configuration breaks apart.

To quantify the quality of building recognition based on the F-score and to visualize the results, a separate application was developed in Python using the Tkinter library.

5. Study of Network DeepLabV3 Operation in Images of Various Types

At the second stage of the work, not only space images were available but also air photographs: in urban areas with a resolution of 0.05 and 0.1 m/pixel, in rural areas with a resolution of 0.1 m/pixel. In addition to red-green-blue (RGB) images, there were also colored infra-red (CIR) satellite images with a resolution of 0.5 m/pixel. In total, the tagged images contained approximately 50,000 buildings. For evaluation, images of different types and with various types of buildings, containing over 7,000 buildings, were set aside (i.e. not used in training).

Now the task was to test in practice how the combination of training sets affects the final result.

Intuitively, it had been assumed that separate models should be trained for the datasets of air photos (made by unmanned aerial vehicles, UAV, and by manned aircraft, MA) and of space photos (made by satellite, SAT), as their scales are too different. Similarly, RGB data sets differ from CIR sets, and therefore separate models should also be trained for them. So the following models were trained: 1) UAV+MA, 2) SAT(RGB), 3) SAT(CIR), and the results were evaluated on the basis of the F-score. The results of the evaluation are shown in the second column of Table 3. Fig. 2 shows an example of the results of building recognition in a UAV image.

Then a general model for all RGB images (UAV, MA and SAT together) was trained, with a separate one for SAT(CIR) images. The results of the evaluation of these models are shown in the third column of Table 3. As can be seen from Table 3, the F-score did not change for UAV+MA images, but improved for SAT images. This improvement in the recognition of SAT images is probably due to the SAT training set being too small, so that adding UAV and MA images, even though they differ in scale, benefits learning.

Since the training set for SAT(CIR) is even smaller than for SAT(RGB), a unified model for all types of available images was then trained. The results of the evaluation of the unified model are shown in the fourth column of Table 3. As can be seen, the F-score did not change for MA images, improved for UAV images, and slightly deteriorated for SAT images; however, the overall F-score slightly improved.

Another advantage of the unified model is that there is no need to prepare separate datasets for different types of recording. The unified model also speeds up the work of the service, since no time is spent on loading different ANN models when recognizing images of different types.

For evaluating the results and forming columns 2-4 of Table 3, all objects larger than 2x1 m for air photographs and 4x4 m for satellite images were taken into account. If all buildings smaller than 3x3 m and 7x7 m respectively are ignored, the results improve significantly (see the fifth column of Table 3). This proves the assumption that small objects are the most difficult to recognize.

Table 3. Results of Network DeepLabV3 Testing (F-score)

| Image | Three separate models | Two models (RGB/CIR) | Unified model | Unified model (3x3, 7x7) |
|---|---|---|---|---|
| UAV | | | | |
| 16-1-239-157-В-1 | 0.879 | 0.878 | 0.850 | 0.923 |
| 16-1-239-157-В-2 | 0.846 | 0.871 | 0.885 | 0.940 |
| 16-1-239-157-В-3 | 0.904 | 0.896 | 0.913 | 0.946 |
| 16-1-239-157-В-4 | 0.813 | 0.857 | 0.861 | 0.922 |
| 16-1-239-157-A-7 | 0.846 | 0.826 | 0.884 | 0.927 |
| 16-1-239-157-A-9 | 0.852 | 0.872 | 0.887 | 0.946 |
| 16-1-239-157-A-10 | 0.867 | 0.862 | 0.885 | 0.899 |
| 16-1-239-157-A-11 | 0.868 | 0.915 | 0.906 | 0.965 |
| 16-1-239-157-A-13 | 0.883 | 0.854 | 0.888 | 0.937 |
| 16-1-239-157-A-14 | 0.858 | 0.878 | 0.888 | 0.928 |
| 16-1-239-157-A-15 | 0.930 | 0.853 | 0.888 | 0.930 |
| Konstantinovka | 0.867 | 0.846 | 0.872 | 0.893 |
| Average for UAV | 0.868 | 0.867 | 0.884 | 0.930 |
| MA | | | | |
| 16-33-23-(131-д) | 0.760 | 0.771 | 0.763 | 0.829 |
| mesha_2_12 | 0.790 | 0.783 | 0.808 | 0.865 |
| 16-33-23-(018-е) | 0.777 | 0.772 | 0.770 | 0.798 |
| 16-33-23-(018-b) | 0.761 | 0.781 | 0.770 | 0.721 |
| Average for MA | 0.772 | 0.777 | 0.778 | 0.803 |
| SAT | | | | |
| Fr4a_RGB | 0.705 | 0.710 | 0.685 | 0.764 |
| Fr7a_RGB | 0.612 | 0.694 | 0.658 | 0.715 |
| Fr7a_CIR | 0.605 | 0.667 | 0.666 | 0.728 |
| Average for SAT | 0.641 | 0.690 | 0.672 | 0.740 |
| Total average | 0.812 | 0.823 | 0.828 | 0.872 |

The main task of the work was to create a service for building recognition without additional classification by functional profile or other criteria. However, most of the marking at the second stage was performed according to ten classes: Background, Residential building, House, Industrial or commercial building, Administration or educational building, Other non-residential building, Building under construction, Greenhouse, Garages, Foundation of building. A separate model was trained on these data, but its results were much worse due to frequent errors in the classification of the found objects. Tables with the results for ten classes are very bulky, which is why they are not given in this paper.

Fig. 2. Example of recognition in a UAV image by the model trained only on the UAV+MA sets (first: the original image; second: the reference marking; third: the recognition result on top of the reference marking)

6. Conclusion

Analysis of the papers and of the experimental data obtained when testing the software developed by the authors proves the efficiency of using convolutional neural networks and, in particular, the DeepLabV3 architecture for building recognition in satellite and air photos. The average F-score on the sample of images under study exceeded 80%, which is a very good result, taking into account the fact that the test sample contained objects difficult to recognize.

These hard-to-recognize objects include poorly structured clusters of containers and tents in markets, neighborhoods with old low-rise buildings and an abundance of small household buildings very close to each other, as well as industrial facilities of complex shape with many connecting buildings and transporters between them. The F-score results are much lower than 80% for all these types of complex constructions. This is quite natural, because even a human using semantic context finds it difficult to determine where the boundaries between such objects lie. However, finding such areas in the processed images and applying separate models and algorithms to them can give a significant increase in the quality of recognition. The development of such combined architectures is a priority for further research within the framework of the project.

7. References

[1] Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv:1511.00561v3 [cs.CV], 10 Oct 2016.
[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv:1606.00915v2 [cs.CV], 12 May 2017.
[3] Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv:1706.05587v3 [cs.CV], 5 Dec 2017.
[4] Jonathan Long, Evan Shelhamer, Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. arXiv:1411.4038v2 [cs.CV], 8 Mar 2015.
[5] Olaf Ronneberger, Philipp Fischer, Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597v1 [cs.CV], 18 May 2015.
[6] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks.
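As an aside on the classical methods just mentioned, a minimal K-means clustering of pixel intensities can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code (production work would use scikit-learn or OpenCV), and it demonstrates the drawback noted above: the output labels are arbitrary cluster ids, not semantic classes such as "building".

```python
import numpy as np

def kmeans_segment(gray, k=2, iters=20):
    """Cluster pixel intensities into k groups (Lloyd's algorithm).

    Returns a label map the same shape as `gray`. The labels are
    arbitrary cluster ids, not semantic classes: the method cannot
    say which cluster, if any, corresponds to buildings.
    """
    px = gray.astype(float).ravel()
    # deterministic init: centers spread over the intensity range
    centers = np.linspace(px.min(), px.max(), k)
    for _ in range(iters):
        # assign every pixel to the nearest center, then re-estimate
        labels = np.argmin(np.abs(px[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = px[labels == j].mean()
    return labels.reshape(gray.shape)

# toy image: dark background with one bright "building" block
img = np.zeros((8, 8))
img[2:6, 2:6] = 200.0
seg = kmeans_segment(img, k=2)
```

On such a trivially bimodal image the two clusters coincide with the object and the background, but on a real aerial photo with shadows and varying contrast this breaks down, which is the paper's argument for ANN-based segmentation.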
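Phase 2 above, converting bit-mapped coordinates into geographic ones, reduces to an affine transform over the six geotransform coefficients stored in a GeoTiff. The sketch below uses hypothetical coefficient values; real code would read them from the file with GDAL or rasterio.

```python
def pixel_to_geo(gt, col, row):
    """Apply a GDAL-style geotransform gt = (x0, dx, rx, y0, ry, dy):
    x0, y0 - geo coordinates of the raster's top-left corner,
    dx, dy - pixel size along x and y (dy is usually negative),
    rx, ry - rotation terms (zero for a north-up image)."""
    x0, dx, rx, y0, ry, dy = gt
    x = x0 + col * dx + row * rx
    y = y0 + col * ry + row * dy
    return x, y

# hypothetical north-up image: top-left at (500000, 6000000), 0.5 m/pixel
gt = (500000.0, 0.5, 0.0, 6000000.0, 0.0, -0.5)
x, y = pixel_to_geo(gt, col=100, row=40)
```

Each polygon vertex produced by the recognition step is pushed through this transform to obtain its geocoordinates.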
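A task message of the kind described above might look as follows. The field names here are our guesses (the paper does not list them), and the consumer wiring in `consume_forever` is an unexecuted sketch of how a RabbitMQ client such as pika is typically hooked up.

```python
import json

# hypothetical field names; the paper only says the message carries
# an identifier, a task type, a source path and a target path
REQUIRED_FIELDS = ("task_id", "task_type", "source_dir", "target_dir")

def parse_task(body: bytes) -> dict:
    """Decode and validate one task message taken from the queue."""
    task = json.loads(body)
    missing = [f for f in REQUIRED_FIELDS if f not in task]
    if missing:
        raise ValueError(f"task is missing fields: {missing}")
    return task

# a message as the platform might enqueue it
body = json.dumps({
    "task_id": "42",
    "task_type": "recognize_buildings",
    "source_dir": "/data/in/42",
    "target_dir": "/data/out/42",
}).encode()

task = parse_task(body)

def consume_forever(queue_name: str):
    """Sketch of the consumer wiring (not executed here): the service
    guards the queue and processes messages as they appear."""
    import pika  # local import: only needed when actually consuming
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.basic_consume(queue=queue_name,
                     on_message_callback=lambda c, m, p, b: parse_task(b))
    ch.start_consuming()
```

Because delivery is handled by the broker, launching a second consumer on the same queue is all it takes to process tasks in parallel, which is the scalability property the text describes.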
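The cut-transfer-stitch pipeline described above can be sketched as follows. The tile size and the thresholding stand-in for the network are our own placeholders; BRiIS feeds the real fragments to DeepLabV3.

```python
import numpy as np

def segment_image(img, tile=256, predict=None):
    """Pad the image to a multiple of `tile`, run `predict` on each
    fragment, and stitch the per-tile masks back to the original size."""
    if predict is None:
        # stand-in for the neural network: a fixed intensity threshold
        predict = lambda t: (t > 127).astype(np.uint8)
    h, w = img.shape
    ph, pw = (-h) % tile, (-w) % tile        # padding to full tiles
    padded = np.pad(img, ((0, ph), (0, pw)))
    out = np.zeros_like(padded, dtype=np.uint8)
    for r in range(0, padded.shape[0], tile):
        for c in range(0, padded.shape[1], tile):
            out[r:r + tile, c:c + tile] = predict(padded[r:r + tile, c:c + tile])
    return out[:h, :w]                        # crop the padding away

img = np.zeros((300, 500), dtype=np.uint8)
img[50:120, 60:200] = 255   # one bright "building"
mask = segment_image(img, tile=256)
```

A production version would batch several tiles per network call and usually overlap the tiles to suppress seam artifacts at fragment borders.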
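As a heavily simplified stand-in for the vectoring step, the sketch below extracts connected foreground regions from a binary mask and reports their bounding rectangles in pixel coordinates, filtering one-pixel noise. A real implementation would trace actual outlines and simplify them to polygons, for example with OpenCV's findContours and approxPolyDP.

```python
from collections import deque
import numpy as np

def mask_to_boxes(mask, min_area=4):
    """4-connected components of a binary mask -> list of bounding
    rectangles [(r0, c0, r1, c1), ...]; tiny blobs are dropped as noise."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for r in range(h):
        for c in range(w):
            if mask[r, c] and not seen[r, c]:
                # breadth-first flood fill of one component
                q, cells = deque([(r, c)]), []
                seen[r, c] = True
                while q:
                    y, x = q.popleft()
                    cells.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(cells) >= min_area:   # noise filter
                    ys = [y for y, _ in cells]
                    xs = [x for _, x in cells]
                    boxes.append((min(ys), min(xs), max(ys) + 1, max(xs) + 1))
    return boxes

mask = np.zeros((20, 20), dtype=np.uint8)
mask[2:8, 3:9] = 1   # a "building"
mask[15, 15] = 1     # 1-pixel noise, filtered out
boxes = mask_to_boxes(mask)
```

The resulting pixel-coordinate rectangles (polygons, in the real pipeline) are what the affine geotransform then maps into geographic coordinates.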
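The Jaccard index defined above can be computed directly on two binary masks with NumPy:

```python
import numpy as np

def jaccard(a, b):
    """IoU = n(A ∩ B) / n(A ∪ B) for two boolean masks; by convention
    two empty masks count as a perfect match."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return np.logical_and(a, b).sum() / union

a = np.zeros((10, 10), dtype=bool)
a[0:4, 0:4] = True   # 16 pixels
b = np.zeros((10, 10), dtype=bool)
b[2:6, 0:4] = True   # 16 pixels, 8 of them shared with a
iou = jaccard(a, b)  # 8 / 24
```

The same function serves both uses in the paper: scoring a whole segmented image against the ground truth label, and scoring an individual predicted building against its labeled counterpart.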
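The per-building matching described above (a prediction counts as a true positive when its IoU with a ground-truth building exceeds 50%) can be sketched as follows. The simple `fn = n_truth - tp` bookkeeping assumes one-to-one matches between predictions and ground-truth buildings, which a production evaluator would enforce explicitly.

```python
def f_score(pred_ious, n_truth, thr=0.5):
    """pred_ious: for each predicted building, its best IoU with any
    ground-truth building. Predictions above `thr` are true positives;
    the rest are false positives; unmatched ground-truth buildings
    (assuming one-to-one matches) are false negatives."""
    tp = sum(1 for v in pred_ious if v > thr)
    fp = len(pred_ious) - tp
    fn = n_truth - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0, precision, recall
    return 2 * precision * recall / (precision + recall), precision, recall

# 5 predicted buildings, 3 matched above 0.5; 4 buildings in the ground truth
f, p, r = f_score([0.9, 0.7, 0.6, 0.3, 0.1], n_truth=4)
```

Unlike the pixel-wise IoU, this measure penalizes each missed or spurious small building as a whole object, which is why the paper uses it for the new-building detection scenario.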
arXiv:1312.6229v4 [cs.CV], 24 Feb 2014.
[7] 2015 IEEE GRSS Data Fusion Contest Results. http://www.grss-ieee.org/community/technical-committees/data-fusion/2015-ieee-grss-data-fusion-contest-results/
[8] 2016 IEEE GRSS Data Fusion Contest Results. http://www.grss-ieee.org/community/technical-committees/data-fusion/2016-ieee-grss-data-fusion-contest-results/
[9] 2017 IEEE GRSS Data Fusion Contest Results. http://www.grss-ieee.org/community/technical-committees/data-fusion/2017-ieee-grss-data-fusion-contest-results/
[10] 2018 IEEE GRSS Data Fusion Contest Results. http://www.grss-ieee.org/community/technical-committees/data-fusion/2018-ieee-grss-data-fusion-contest-results/