Automated Area Assessment of Objects Using Deep Learning Approach and Satellite Imagery Data

Kirill Tsyganov, Alexey Kozionov, Jaroslav Bologov, Alexandr Andreev, Oleg Mangutov and Ivan Gorokhov

Deloitte Analytics Institute, ZAO Deloitte & Touche CIS, Moscow, Russia,
{ktsyganov, akozionov, jbologov, aandreev, omangutov, igorokhov}@deloitte.ru

Abstract. We describe a real-life case of applying deep neural networks to the area assessment of different types of objects in a selected geographical region through the analysis of satellite images. The task was to detect, segment and assess the area of buildings and agricultural lands on satellite images. We present our framework for solving the problem and our methods for validating the results. We compare the performance of different convolutional neural networks applied to our case and discuss the best segmentation model found, a convolutional network based on U-net. No training dataset of images and corresponding masks was available for our geographical region, so we constructed our own training set. The paper reports in detail on satellite imagery data preparation, image pre-processing, construction of the training dataset and training of the neural networks.

Keywords: Deep learning · Image segmentation · Object detection · Convolutional Neural Networks · U-net · Satellite imagery

1 Introduction

The paper presents the main technical details of a real-life client case from the experience of Deloitte Analytics Institute (Moscow). The paper does not claim scientific novelty of the applied methods but rather describes our approach to using recent developments in machine learning in an actual industrial case.

Due to the existing country legislation, the client faced a lack of systematic records of agricultural and residential area assessments and other national statistics. The client wanted to perform a structured audit of agricultural lands and residential areas, paired with further monitoring of their development over time. The client asked us to provide a solution for automated area assessment based on the analysis of satellite imagery.

Since the problem required an accurate solution, we decided to use a supervised deep learning approach. We needed a training dataset, a neural network architecture for image segmentation and computational hardware resources to train the network on the training data. Our plan was to experiment with a publicly available dataset [3] and, in case of poor performance on test images of our region, to create our own dataset for the region of interest. For the neural network architectures we took a straightforward CNN [2] and a more complex architecture in which outputs of earlier layers are passed forward and merged into later layers [1]. For evaluating the networks' performance we used the Jaccard index.

2 Satellite imagery data used in the solution

2.1 Data specific restrictions

In order to apply a deep learning approach to image segmentation we needed a training set, i.e. pairs of satellite images and their corresponding masks in which only the objects to be detected are marked. There was no training dataset covering the agricultural lands of our interest, so we had to construct our own. The geographical region of our research has a specific desert environment, and no training dataset of images for buildings segmentation existed for this region. To overcome this lack of labeled data we first tried to use a publicly available aerial imagery training set¹ of another geographical region (fig. 1). However, models trained on this open dataset demonstrated insufficient recognition quality when tested on images of our region of interest. Possible causes of the poor quality are the following:
– because the training and test images come from distinct geographical regions, buildings in the training dataset and buildings in the test images were very different: roof colors and building shapes differed;
– projection angles in the train and test images were different, which affected the size of object shadows;
– the color schemes of the train and test images were significantly distinct.

Fig. 1. Publicly available Massachusetts Buildings Dataset shared by V. Mnih [3]: consists of a training dataset, i.e. satellite images and their corresponding masks with buildings

¹ Massachusetts Buildings Dataset publicly available at http://www.cs.toronto.edu/~vmnih/data/.

2.2 Training dataset construction

After several unsuccessful attempts to use open training datasets for our problem, we concluded that the training dataset should be built from satellite imagery of our region of interest. Since no labeled dataset was available, we constructed such a dataset ourselves. We used satellite images with a resolution of 1 meter per pixel for the training and test sets. This resolution allowed the neural network to detect the border structure of small buildings with an area of approximately 30 square meters.

To construct the training dataset we took several small subregions and manually drew masks of buildings and agricultural lands for them (fig. 2 and fig. 3). In order to improve the generalization ability of our models we included buildings and agricultural lands of all types from different geographical subregions in the training dataset. Forming the training dataset was an iterative process:
1. We trained the model on the training dataset.
2. Then we tested the model on the test dataset.
3. Next we visually examined the model's recognition quality on test images and looked for subregions where the model showed low accuracy.
4. Finally we manually created masks for poorly recognized subregions and added these image–mask pairs to the training dataset.
5. Return to step 1.

Fig. 2. Training dataset for buildings of our region marked by us: satellite images and their corresponding masks

2.3 Patches preparation

For the sake of fast training dataset formation, the images in the initial training dataset were rectangles of different sizes, whereas the input of the neural network must have one predefined size. Therefore, in order to generalize our approach, for every image in the training dataset and its corresponding mask we extracted patches with a sliding window of size 64 × 64 and step 16 (fig. 4).

Fig. 3. Training dataset for agricultural lands of our region marked by us: satellite images and their corresponding masks

Fig. 4. Collecting patches from an image by sliding window

2.4 Image data augmentation

In order to enlarge the training dataset without additional manual labelling of images we used standard image data augmentation techniques, i.e. rotations and reflections of the original images (fig. 5). The augmentation is applied to the square patches, so that the full symmetry group of the square is applied to every patch.

Fig. 5. Training dataset augmentation: the first image is the original satellite image, the following images are generated by augmentation
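To make Sections 2.3 and 2.4 concrete, below is a minimal NumPy sketch of the sliding-window patch extraction and of the square-symmetry augmentation. The patch size (64 × 64), the step (16) and the use of all eight symmetries of the square follow the text above; the function names and the dummy scene are purely illustrative and are not the code used in the project.

import numpy as np

PATCH_SIZE = 64
STRIDE = 16

def extract_patches(image, mask, size=PATCH_SIZE, stride=STRIDE):
    """Collect (image, mask) patches with a sliding window over a rectangular scene."""
    patches = []
    h, w = image.shape[:2]
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patches.append((image[y:y + size, x:x + size],
                            mask[y:y + size, x:x + size]))
    return patches

def augment(patch):
    """Return the 8 variants of a square patch: 4 rotations, each optionally flipped."""
    variants = []
    for k in range(4):
        rotated = np.rot90(patch, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants

# Example: a 256 x 256 RGB scene with its binary building mask gives 13 * 13 = 169
# patches, and augmentation enlarges the set eightfold (1352 image-mask pairs).
scene = np.zeros((256, 256, 3), dtype=np.uint8)
scene_mask = np.zeros((256, 256), dtype=np.uint8)
pairs = extract_patches(scene, scene_mask)
augmented = [(img_v, msk_v)
             for img, msk in pairs
             for img_v, msk_v in zip(augment(img), augment(msk))]
print(len(pairs), len(augmented))  # 169 1352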
3 Evaluation metrics

Object segmentation problems are commonly evaluated with the Jaccard index and visual analysis. We used the following metrics to assess the performance of the models: Jaccard index, area error, precision and recall.

3.1 Jaccard index

The Jaccard index, also known as Intersection over Union (IoU), is a measure of similarity and diversity of two sets. The Jaccard index of two finite sets A and B is the cardinality of their intersection divided by the cardinality of their union:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|), 0 ≤ J(A, B) ≤ 1.  (1)

The Jaccard index penalizes errors of both types more heavily than precision and recall do, since it uses both false positive and false negative statistics (fig. 6).

Fig. 6. Metrics used in the research
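For reference, the metrics listed above can be computed on binary masks as in the following sketch. The Jaccard index and precision/recall follow the standard definitions (eq. 1); the exact definition of the area error used in the project is not spelled out in the text, so the relative-area version below is our own assumption.

import numpy as np

def jaccard_index(pred, true):
    """|A & B| / |A | B| for binary masks (eq. 1); 1.0 when both masks are empty."""
    pred, true = pred.astype(bool), true.astype(bool)
    union = np.logical_or(pred, true).sum()
    return np.logical_and(pred, true).sum() / union if union else 1.0

def precision_recall(pred, true):
    """Pixel-wise precision and recall of the predicted mask."""
    pred, true = pred.astype(bool), true.astype(bool)
    tp = np.logical_and(pred, true).sum()
    fp = np.logical_and(pred, ~true).sum()
    fn = np.logical_and(~pred, true).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def area_error(pred, true, m2_per_pixel=1.0):
    """Relative error of the total detected area (assumed definition), e.g. at 1 m/pixel."""
    true_area = float(true.astype(bool).sum()) * m2_per_pixel
    pred_area = float(pred.astype(bool).sum()) * m2_per_pixel
    return abs(pred_area - true_area) / true_area if true_area else 0.0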
4 Solution

We had two classes of objects to detect and segment: buildings and agricultural lands with growing plants. Based on the conclusion that independent segmentation of each class performs better than multinomial segmentation of multiple classes simultaneously [2], we decided to solve the segmentation problem for each class separately. There was an additional argument for such a separation: since the second class consisted of agricultural lands with growing plants only, we planned to use additional features, such as vegetation indices [7], in order to better distinguish growing from non-growing plants.

4.1 Buildings segmentation network architecture

We examined several convolutional neural network architectures: U-net [1] with different hyperparameter settings and a neural network with mixed convolutional and fully connected layers [2]. The architecture of the best performing CNN is based on U-net. Among other differences, our network has fewer merging layers – 2 merges instead of 3 – because we found that training a CNN with 3 merges is very time consuming but does not give a significant benefit in performance.

Our network (fig. 7) starts with a contracting path of repeated convolution, max pooling and dropout layers and proceeds with an expansive path in which max pooling is replaced with upsampling. The most important and beneficial feature of the network is appending the output of the contracting layers to the input of the corresponding expansive layers. This significantly improves the network's extraction of building border structure. All convolutional layers except the last one use the ReLU activation function, and the output layer uses softmax.

Fig. 7. CNN architecture for buildings segmentation based on U-net
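The following is a minimal Keras sketch of a U-net-like network with the properties described above: a 64 × 64 input, a contracting path of convolution, max pooling and dropout, an expansive path with upsampling, two merges of contracting outputs into the expansive path, ReLU activations and a per-pixel softmax output. The filter counts, dropout rate and layer depths are illustrative assumptions, not the exact hyperparameters of the network in fig. 7.

from tensorflow.keras import layers, models

def build_unet(input_shape=(64, 64, 3), n_classes=2):
    inputs = layers.Input(input_shape)

    # Contracting path: repeated convolution, max pooling and dropout.
    c1 = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    c1 = layers.Conv2D(32, 3, activation="relu", padding="same")(c1)
    p1 = layers.Dropout(0.25)(layers.MaxPooling2D(2)(c1))

    c2 = layers.Conv2D(64, 3, activation="relu", padding="same")(p1)
    c2 = layers.Conv2D(64, 3, activation="relu", padding="same")(c2)
    p2 = layers.Dropout(0.25)(layers.MaxPooling2D(2)(c2))

    # Bottleneck.
    b = layers.Conv2D(128, 3, activation="relu", padding="same")(p2)

    # Expansive path: upsampling instead of pooling, with the contracting
    # outputs concatenated ("merged") to the corresponding expansive inputs.
    u2 = layers.UpSampling2D(2)(b)
    m2 = layers.concatenate([u2, c2])
    c3 = layers.Conv2D(64, 3, activation="relu", padding="same")(m2)

    u1 = layers.UpSampling2D(2)(c3)
    m1 = layers.concatenate([u1, c1])
    c4 = layers.Conv2D(32, 3, activation="relu", padding="same")(m1)

    # Per-pixel class probabilities (building / background).
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(c4)
    return models.Model(inputs, outputs)

model = build_unet()
model.summary()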
4.2 Agricultural lands segmentation

In general, the lands segmentation problem is analogous to buildings segmentation. However, the average farm is much bigger than the average building, so one needs to cut the initial image into considerably larger patches to preserve information about the farm structure and its surroundings. The segmentation problem becomes computationally expensive when the neural network has to process such large image patches.

To overcome these computational difficulties, a different approach was applied to the recognition of circular farms. Its main feature is the combination of two heatmaps, produced by different processing techniques, into the final segmentation map. The first heatmap is produced by applying ellipsoid filters of various sizes to the initial image. The exact filter sizes depend on the image resolution; in our case, 5 × 5 and 50 × 50 filters were applied to 1 meter per pixel images.

An ellipsoid filter may be described as a binary image of a circle inscribed in a square of a certain size, or as a matrix of zeros and ones in which the ones fill a central circle-shaped region. Applying this filter performs an erosion operation. The filter slides over the image (like a kernel in a CNN convolution layer) and the element-wise product of the filter matrix and the image segment is calculated. The minimum of these products is assigned to the anchor point, which is set at the center of the filter. Thus, applying such a filter transforms the initial image similarly to a CNN convolution layer followed by a min-pooling layer. As a result, filtering, like the CNN, also produces a heatmap, shown in fig. 9.

The second heatmap is produced by a random forest classifier trained to predict the class of a pixel (farm / non-farm) from its color.

The idea behind the proposed approach is to use the advantages of two techniques that compensate for each other's flaws. The color segmentation method produces a relatively noisy heatmap, as the color of hills and roads is somewhat similar to the color of farms (especially when the crop has not yet grown). The shape detection method – filtering – produces much less noise, but the detected farm areas are significantly smaller than the actual ones due to information loss during erosion. In the joint heatmap, calculated as the average of the previous two, the noise intensity is lower than in the color segmentation map and the farm boundaries are closer to the actual ones than in the shape detection map. The remaining noise is removed by thresholding and a median filter [5].
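Under our reading of the description above (the minimum is taken over the circular support of the filter, i.e. a grayscale erosion with a disk-shaped structuring element), the shape heatmap and its combination with the color heatmap can be sketched as follows. The filter sizes (5 × 5 and 50 × 50 at 1 meter per pixel) come from the text; the threshold value, the median filter size and the function names are illustrative assumptions.

import numpy as np
from scipy import ndimage

def disk_footprint(size):
    """Binary image of a circle inscribed in a size x size square (the 'ellipsoid' filter)."""
    r = (size - 1) / 2.0
    y, x = np.ogrid[:size, :size]
    return (x - r) ** 2 + (y - r) ** 2 <= r ** 2

def shape_heatmap(image, sizes=(5, 50)):
    """Erode the image with disk filters of several sizes and average the results.
    Each erosion assigns to the anchor pixel the minimum over the circular support,
    which behaves like a convolution followed by min-pooling."""
    eroded = [ndimage.minimum_filter(image.astype(float),
                                     footprint=disk_footprint(s))
              for s in sizes]
    return np.mean(eroded, axis=0)

def joint_heatmap(shape_map, color_map, threshold=0.5, median_size=5):
    """Average the shape and color heatmaps, then threshold and despeckle with a median filter."""
    joint = (shape_map + color_map) / 2.0
    farms = (joint > threshold).astype(np.uint8)
    return ndimage.median_filter(farms, size=median_size)

Here color_map stands for the per-pixel farm probability produced by the random forest color classifier described above.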
4.3 Polygons extraction

Thanks to the final softmax activation, the neural network output provides two probabilistic heatmaps – one with the probabilities of buildings and its inverse. However, to present the recognition results in a geospatial system it is necessary to convert the heatmaps into polygon form. For this task we used thresholding of the heatmaps and the Douglas–Peucker algorithm [6].

5 Experiment results

We obtained a Jaccard index of approximately 0.61 for buildings recognition and 0.65 for circular agricultural farms. The recognition results for buildings and circular agricultural farms are shown in fig. 8 and fig. 9 respectively. In addition to the Jaccard index, we computed the total area accuracy, which reached 94% for the buildings segmentation problem on the validation dataset.

Fig. 8. Process of buildings' recognition on a test satellite image

Fig. 9. a) initial satellite image, b) ellipsoid filters heatmap, c) color segmentation heatmap, d) joint heatmap, e) final heatmap after thresholding and applying a median filter

5.1 Learning the neural network for buildings segmentation

Since we have a binary classification problem (building, background), we used binary cross entropy as the loss function:

H(y, p) = −y log(p) − (1 − y) log(1 − p),  (2)

where y is the ground-truth label of a pixel and p is the predicted probability of the building class. Training of the U-net with input and output patches of size 64 × 64 did not overfit until approximately epoch 85; from epoch 85 the validation loss diverged significantly while the training loss kept decreasing smoothly, which hurt the quality on test data (left plot in fig. 10). The Jaccard indices for different patch sizes (used as the input and output shape of the neural network) behaved similarly starting from epoch 9 (right plot in fig. 10).

Fig. 10. Left: loss (binary cross entropy) dynamics during training of the U-net with input shape 64 × 64; right: comparison of Jaccard indices for different patch sizes during training
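Continuing the architecture sketch from Section 4.1, the training setup of Section 5.1 could look roughly as follows: binary cross entropy as the loss, a soft Jaccard index tracked on a validation set, and early stopping as a guard against the overfitting observed after roughly epoch 85. The batch size, callback settings and dummy data are illustrative assumptions; only the loss choice and the 64 × 64 patch shape come from the text.

import numpy as np
import tensorflow as tf
from tensorflow.keras import callbacks

def jaccard_metric(y_true, y_pred, smooth=1e-6):
    """Soft Jaccard index on the 'building' channel of the softmax output."""
    t, p = y_true[..., 0], y_pred[..., 0]
    inter = tf.reduce_sum(t * p)
    union = tf.reduce_sum(t) + tf.reduce_sum(p) - inter
    return (inter + smooth) / (union + smooth)

model = build_unet()  # the network sketched in Section 4.1
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[jaccard_metric])

# x_train / y_train stand for the augmented 64x64 patches and their two-channel
# (building / background) masks from Section 2; dummy arrays keep the sketch runnable.
x_train = np.zeros((256, 64, 64, 3), dtype=np.float32)
y_train = np.zeros((256, 64, 64, 2), dtype=np.float32)
y_train[..., 1] = 1.0
x_val, y_val = x_train[:32], y_train[:32]

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=32, epochs=120,
          callbacks=[callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                             restore_best_weights=True)])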
6 Discussion

We highlight the following directions in which our solution could be improved:

– Color histogram equalization of satellite images. Since the images in the initial photo bank may have been taken by different satellites, their color histograms can differ significantly. Such variety could harm recognition quality, so the images' color histograms should be equalized before further analysis. We consider contrast limited adaptive histogram equalization (CLAHE) [8] the most appropriate method for color equalization of the images.

– Additional spectral bands. The near-infrared (NIR) and red edge channels could significantly improve the quality of the recognition algorithms, especially for agricultural lands. For example, combining bands of different resolutions from different satellites in one regression model yields a highly accurate assessment of agricultural land condition [9].

– Training dataset formation. Creating a mask for a satellite image is a laborious task. To significantly improve recognition quality it is necessary to have masks for all types of objects of a given class. We suggest extending the training dataset not only by augmenting the images already in hand, but also by including poorly recognized regions.

– Object detection phase. Region proposal networks [10] [11] [12] [4] address the object detection problem. An object detection phase could be used before image segmentation in order to reduce noise from other objects [4].

– Object boundary adjustment with probabilistic graphical models. To improve the localization accuracy of object boundaries, it has been proposed to combine DCNNs with probabilistic graphical models [13]. Since CNNs can predict the rough position of objects but have difficulty delineating their boundaries, the authors refine object boundaries by applying a fully connected conditional random field (CRF) after the final layer of the CNN for accurate boundary recovery. They demonstrated improved performance of this approach on the PASCAL VOC-2012 image segmentation task, so we believe the method could benefit our problem as well.

7 Conclusion

We presented a report on applying a deep learning approach to the real-life problem of assessing the area of objects. We described the whole solution process: collection of satellite imagery of appropriate resolution, creation of a training dataset through manual labelling and data augmentation, training and testing of CNNs, and extraction of building polygons from the CNN's output heatmaps. We obtained sufficient recognition quality (a Jaccard index of 0.61 for buildings) with a CNN based on the U-net architecture [1]. Finally, we proposed next steps for the design of the recognition model and for feature engineering.

References

1. Olaf Ronneberger, Philipp Fischer, and Thomas Brox: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol. 9351, pp. 234–241, 2015.
2. Shunta Saito, Takayoshi Yamashita, and Yoshimitsu Aoki: Multiple Object Extraction from Aerial Imagery with Convolutional Neural Networks. In: Journal of Imaging Science and Technology, Volume 60, Number 1, January 2016, pp. 10402-1–10402-9(9).
3. V. Mnih: Machine Learning for Aerial Image Labeling. Ph.D. thesis, University of Toronto, 2013.
4. Kaiming He, Georgia Gkioxari, Piotr Dollar, Ross Girshick: Mask R-CNN. PAMI, 2017.
5. J. W. Tukey: Non-linear (non-superposable) methods for smoothing data. Int. Conf. Rec. 1974 EASCON, p. 673.
6. David Douglas, Thomas Peucker: Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. The Canadian Cartographer 10(2), 112–122, 1973.
7. Rouse, J.W., Haas, R.H., Scheel, J.A., and Deering, D.W.: Monitoring Vegetation Systems in the Great Plains with ERTS. Proceedings, 3rd Earth Resource Technology Satellite (ERTS) Symposium 1974, vol. 1, pp. 309–313.
8. K. Zuiderveld: Contrast limited adaptive histogram equalization. Graphics Gems IV, San Diego, CA: Academic Press Professional, Inc., 1994.
9. Rasmus Houborg, Matthew F. McCabe: High-Resolution NDVI from Planet's Constellation of Earth Observing Nano-Satellites: A New Data Source for Precision Agriculture. Remote Sens. 2016, 8, 768.
10. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik: Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
11. Ross Girshick: Fast R-CNN. IEEE International Conference on Computer Vision (ICCV), 2015.
12. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Neural Information Processing Systems (NIPS), 2015.
13. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille: Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICLR, 2015.