=Paper=
{{Paper
|id=Vol-3006/03_short_paper
|storemode=property
|title=Neural network texture segmentation of satellite images of woodlands using the U-net model
|pdfUrl=https://ceur-ws.org/Vol-3006/03_short_paper.pdf
|volume=Vol-3006
|authors=Anna E. Alyokhina,Dmitry S. Rusin,Egor V. Dmitriev,Anastasia N. Safonova
}}
==Neural network texture segmentation of satellite images of woodlands using the U-net model==
Anna E. Alyokhina¹, Dmitry S. Rusin¹, Egor V. Dmitriev² and Anastasia N. Safonova¹

¹ Siberian Federal University, Krasnoyarsk, Russia
² Marchuk Institute of Numerical Mathematics of the Russian Academy of Sciences, Moscow, Russia

Abstract

With the advent of space instruments that can obtain panchromatic images of ultra-high spatial resolution (< 1 m), methods of thematic processing of aerospace imagery have tended toward the joint use of textural and spectral features of the objects under study. In this paper we consider the problem of classifying forest canopy structures based on textural analysis of multispectral and panchromatic WorldView-2 images. Traditionally, this problem is solved with a statistical approach based on constructing gray-level co-occurrence distributions and computing statistical moments that have significant regression relationships with the structural parameters of stands. An alternative approach to extracting texture features is based on frequency analysis of images; to date, one of the most promising methods of this kind is wavelet scattering. Compared with the traditionally applied approaches based on the Fourier transform, wavelet analysis identifies not only the characteristic signal frequencies but also the characteristic spatial scales, which is fundamentally important for the textural analysis of spatially inhomogeneous images. This paper uses a more general approach to texture segmentation based on the convolutional neural network U-net. This architecture is a sequence of convolution-pooling layers: at the first stage, the original image is downsampled and its content is captured.
At the second stage, exact localization of the recognized classes is carried out, while the sampling is increased back to the original resolution. The RMSProp optimizer was used to train the network. At the preprocessing stage, the contrast of the fragments is increased using the global contrast normalization algorithm. Numerical experiments using expert information have shown that the proposed method segments the structural classes of the forest canopy with high accuracy.

Keywords: neural network, segmentation, satellite images, U-net.

SDM-2021: All-Russian conference, August 24–27, 2021, Novosibirsk, Russia
a.tolmacheva@solutionfactory.ru (A. E. Alyokhina); yegor@mail.ru (E. V. Dmitriev)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Monitoring of forest areas, namely textural segmentation and forest mapping, is an urgent task. One of the promising ways of tracking areas globally is the use of Earth remote sensing data. Due to the growth and diversity of this information, there is a need to develop and modernize the methods for processing it. In recent years, owing to the growth of computing capacity, artificial intelligence has become one of the fastest developing areas. Thus, the idea of this experiment is to use remote sensing data and neural network methods to solve the problem of texture segmentation. In particular, there are works in the literature that are close to our experiment. In [1], the authors performed texture segmentation using the AGMSSeg-Net neural network with interactive input selected by the user.
Models based on convolutional neural networks, such as the second [2] and third [3] versions of DeepLab, have also been used successfully to create color labels on maps, solving the problem of textural segmentation of forest zones. After analyzing related work, we decided to use the U-net model to perform textural segmentation of woodlands using WorldView-2 panchromatic images. The main contributions of this work are as follows:

1. Preliminary image processing was performed using the global contrast normalization method.
2. The U-net model was trained on the original and on the pre-processed images.
3. Texture segmentation was performed by the final version of the trained model with the best metric values.
4. The errors were compared by the cross-validation method.

2. Materials and methods

2.1. The research area

The research area is located in the Moscow region, in the Bronnitsky forestry, in the immediate vicinity of the geographical plantings of the forester P. I. Dementiev. The stands of the Bronnitsky forestry are 40 years old or older and, in terms of species diversity, cover all the main forest-forming species of Russia. The selected site contains natural and forest-cultural plantings with different species compositions and visible textural differences of the forest canopy. The plot contains part of the territory of permanent larch forest-seed plantations, which have a pronounced regular structure, as well as natural birch and pine (with an admixture of spruce) stands of various densities. Multispectral and panchromatic WorldView-2 images with spatial resolutions of 1.85 m and 0.46 m, respectively, were used as satellite information. The image was taken on June 28, 2011, before the construction of the Novoryazanskoye Highway and the Central Ring Road began. For texture processing, the panchromatic image was used, which after correction has a spatial resolution of 0.5 m (Fig. 1).

2.2. Pre-processing of images

This subsection presents the algorithm for preprocessing a satellite image, which consists of:

1. Converting fragments from the .tiff format to the .png format for further work with the neural network. In this study, image fragments of size 27 × 27 pixels were prepared for training.
2. Increasing the contrast of the fragments using the global contrast normalization algorithm [4]:

X'_{i,j,k} = s \frac{X_{i,j,k} - \bar{X}}{\max\left\{\epsilon,\ \sqrt{\lambda + \frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3}\left(X_{i,j,k} - \bar{X}\right)^{2}}\right\}},  (1)

where X_{i,j,k} is the tensor of the original image, X'_{i,j,k} is the tensor of the normalized image, \bar{X} = \frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3} X_{i,j,k} is the average pixel value of the original image, and \epsilon and \lambda are constants; in our solution \lambda = 10 and \epsilon = 10^{-9}.

Figure 1: A fragment of a satellite image of the test site of the Bronnitsky forestry.

Figure 2 below shows the results of image processing by the global contrast normalization algorithm. A sample of the studied textures was prepared from 3500 transformed fragments, of which 80% were allocated for training and 20% for validation; one image was used for independent verification of the trained U-net model. For each class, 400 training segments, 100 test segments, and 56 segments for test zone A were allocated. Each fragment carries one color annotation label assigning it to a certain class. Most classes differ in the density of green spaces, as well as in the variety of tree species. Two classes are ordinary grass fields.

2.3. The U-net model

In this work, the Xception model [5] was used, since this convolutional network architecture makes it possible to obtain a better result than Inception V3 [6], as shown in [7]. The Xception architecture is a fully convolutional network that is able to work with a small number of training examples for segmentation tasks.
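Returning to the pre-processing step, the global contrast normalization of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration under the constants given above (λ = 10, ε = 10⁻⁹); the scale factor s = 1 and the random test fragment are assumptions, not the authors' code.

```python
import numpy as np

def global_contrast_normalization(X, s=1.0, lam=10.0, eps=1e-9):
    """Apply Eq. (1) to an r x c x 3 image tensor X.

    The result is centered on the mean pixel value and divided by a
    contrast estimate regularized by lam, guarded from below by eps.
    """
    X = X.astype(np.float64)
    X_mean = X.mean()  # average over all 3*r*c entries
    contrast = np.sqrt(lam + ((X - X_mean) ** 2).mean())
    return s * (X - X_mean) / max(eps, contrast)

# Example on a random 27 x 27 RGB fragment (the training size used here)
fragment = np.random.randint(0, 256, size=(27, 27, 3))
normalized = global_contrast_normalization(fragment)
print(normalized.mean())  # centered, so close to 0 up to rounding
```

Note that the mean and the sum of squares are both taken over all 3rc tensor entries, matching the triple sums in Eq. (1).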
The generalized U-net architecture [8] is shown in Figure 3.

Figure 2: The result of global contrast normalization on the example of six texture zones, where the first image is coniferous trees, the second is field 1, the third is field 2, the fourth is mixed forest, the fifth is clustered mixed forest (average density), and the sixth is ordinary larch.

Figure 3: Generalized U-net architecture (figure from https://arxiv.org/abs/1505.04597).

2.4. Metrics

To evaluate the effectiveness of the trained model, we used the mAP and IoU metrics [9]. IoU is a simple evaluation score: any algorithm that provides predicted bounding rectangles as output can be evaluated with it. It compares:

1. the areas reliably marked manually by an expert (the ground truth);
2. the corresponding results of the trained network:

IoU = \frac{Area\ of\ overlap}{Area\ of\ union},  (2)

where the area of overlap is the intersection of the predicted label and the ground-truth label, and the area of union is their union.

Additionally, the F1-score (3) and mAP (4) are calculated to evaluate the performance of the model. The F1-score is calculated from Precision (5) and Recall (6). mAP is the average over all classes of the area under the Precision-Recall curve [10] and takes values in the range from 0 to 1:

F1\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},  (3)

mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i, \qquad AP_i = \sum_{Recall} Precision(Recall),  (4)

Precision = \frac{TP}{TP + FP},  (5)

Recall = \frac{TP}{TP + FN},  (6)

where TP is a true positive result, FP is a false positive result, and FN is a false negative result.

The sparse categorical cross-entropy (SCCE) (7) was used to calculate the model loss:

SCCE = -\sum_{i=1}^{n} x_i \log(\sigma(y_i)),  (7)

where \sigma(y_i) = e^{y_i} / \sum_{j=1}^{n} e^{y_j} is the normalized exponent (softmax).

3. Results

The RMSprop optimizer was used to train the network. The number of epochs was 250.
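The metrics defined in Eqs. (2)–(7) can be sketched as follows. This is a minimal NumPy illustration assuming boolean segmentation masks for IoU, raw counts for Precision/Recall/F1, and a one-hot target for SCCE; it is not the evaluation code used in the experiments.

```python
import numpy as np

def iou(pred, truth):
    """Eq. (2): intersection over union of two boolean masks."""
    overlap = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return overlap / union

def precision_recall_f1(tp, fp, fn):
    """Eqs. (5), (6) and (3) from raw TP/FP/FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def scce(x, y):
    """Eq. (7): sparse categorical cross-entropy for a one-hot
    target vector x and raw class scores y, via softmax sigma(y)."""
    sigma = np.exp(y) / np.exp(y).sum()
    return -np.sum(x * np.log(sigma))

# Toy 2 x 2 masks: 1 overlapping pixel out of 2 in the union
pred = np.array([[1, 1], [0, 0]], dtype=bool)
truth = np.array([[1, 0], [0, 0]], dtype=bool)
print(iou(pred, truth))  # 0.5
```

mAP (Eq. (4)) then averages, over all N classes, the area under each class's Precision-Recall curve built from such counts at varying thresholds.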
The time spent on one epoch is about 5–6 minutes. The training was carried out on the Google Colab platform [11]. The learning process is shown in Figure 4. The quality obtained with and without normalized images was compared: as can be seen, increasing the contrast slightly improves the quality of the model; a more detailed comparison of the results is presented below. At the output of the network, a mask is formed that corresponds to a certain forest structure. An example of the final processing of a test image is shown in Figure 5. The main metrics of this work are presented in Table 1; mAP and F-score are averaged over the small images of the test area.

Figure 4: The value of the loss function (a) and the accuracy (b and c) at each epoch for the test and training processed data.

Figure 5: An example of an output mask based on a photo of the study area, where zone 1 is coniferous trees, zone 2 is field 1, zone 3 is field 2, zone 4 is mixed forest, zone 5 is clustered mixed forest (average density), and zone 6 is ordinary larch.

Figure 6: The confusion matrix of the trained model on the test image, where class 1 is coniferous trees, class 2 is field 1, class 3 is field 2, class 4 is mixed forest, class 5 is clustered mixed forest (average density), and class 6 is ordinary larch.

Table 1: Results of network metrics

  Metric    Processed data                  Raw data
  loss      2.3654e-04                      0.18
  mAP       0.71 (average per segment)      0.63 (average per segment)
  F-score   0.73 (average)                  0.59 (average)

Table 2: Cross-validation results (k = 5)

  No.       Without pre-processing      With pre-processing
            F-score     loss            F-score     loss
  1         74.68       38.34           77.78       24.68
  2         74.03       32.03           82.97       20.79
  3         70.13       33.89           74.53       40.57
  4         69.28       59.12           82.82       21.68
  5         75.16       29.88           83.47       20.37
  Average   72.66       38.65           80.31       25.61

After training, the model was tested on test images.
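The k = 5 sliding control used for Table 2 can be organized as in the sketch below. The `train_and_score` function is a hypothetical placeholder: in the real pipeline it would train the U-net on the training folds and return the F-score and loss on the held-out fold. Only the fold arithmetic over the 3500 fragments is actually shown here.

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def train_and_score(train_idx, test_idx):
    # Hypothetical stand-in: real code would train the U-net on the
    # fragments in train_idx and evaluate (F-score, loss) on test_idx.
    return (len(train_idx), len(test_idx))

def cross_validate(n_samples, k=5):
    """Hold out each fold in turn as the test set, train on the rest."""
    folds = kfold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(train_and_score(train_idx, test_idx))
    return scores

results = cross_validate(3500, k=5)
print(results)  # five folds: 2800 training and 700 held-out fragments each
```

With 3500 fragments and k = 5, every fold holds 700 fragments, so each run trains on 2800 and tests on 700, and the per-fold scores are then averaged as in the last row of Table 2.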
The results of the predicted masks were compared with the test masks using several quality metrics; these results are presented in Table 1. For evaluation on independent data, cross-validation with k = 5 was also carried out; the results of this sliding control are presented in Table 2. As can be seen from the confusion matrix in Figure 6, the third class predominates, and the model sometimes took the 3rd class for the 1st class; this error probably occurred because these classes closely intersect. The erroneous identification of the ForestMixedNormal class as the LarchRegularNormal class can also be seen, which is due to the similar data structure of these classes.

4. Conclusion

Based on the results, we can conclude that the U-net model copes well with processing satellite images of forest areas for segmentation tasks, and the main structure of each forest type is clearly highlighted in the image. For a better result, a larger image dataset is still needed, in which several classes intersect. In the future, it is planned to process higher-resolution images (36 pixels), which will allow using several classes in one image; it is also planned to use classical convolutional neural network architectures with architectural changes to increase efficiency and to compare them with the U-net model in new areas.

Acknowledgments

The research was carried out with the financial support of the RFBR (projects No. 19-01-00215 and No. 20-07-00370).

References

[1] Li K., Hu X., Jiang H., Shu Z., Mi Z. Attention-guided multi-scale segmentation neural network for interactive extraction of region objects from high-resolution satellite imagery // Remote Sensing. 2020. Vol. 12. P. 789. DOI: 10.3390/rs12050789.
[2] Bengana N., Heikkilä J. Improving land cover segmentation across satellites using domain adaptation // arXiv preprint. 2019. arXiv:1912.05000.
[3] Barmpoutis P., Stathaki T., Dimitropoulos K., Grammalidis N. Early fire detection based on aerial 360-degree sensors, deep convolution neural networks and exploitation of fire dynamic textures // Remote Sensing. 2020. Vol. 12. P. 3177. DOI: 10.3390/rs12193177.
[4] Goodfellow I., Bengio Y., Courville A. Deep Learning. 2016. URL: https://www.deeplearningbook.org.
[5] Chollet F. Xception: Deep learning with depthwise separable convolutions // arXiv preprint. 2017. arXiv:1610.02357v3 [cs.CV].
[6] Szegedy C. et al. Rethinking the inception architecture for computer vision // arXiv preprint. 2016.
[7] Canziani A., Paszke A., Culurciello E. An analysis of deep neural network models for practical applications. URL: https://arxiv.org/abs/1605.07678.
[8] Hui J. mAP (mean Average Precision) for object detection. 2018. URL: https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173.
[9] Scikit-learn. Precision-Recall. URL: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html.
[10] Canziani A., Paszke A., Culurciello E. An analysis of deep neural network models for practical applications. URL: https://arxiv.org/abs/1605.07678.
[11] Hinton G. Neural networks for machine learning. Online course. URL: https://www.coursera.org/leture/neural-networks-deep-learning/geoffrey-hinton-interview-dcm5r.
[12] Colab G. Research notebooks. URL: https://colab.research.google.com.