Semantic Segmentation using Deep Learning for Aerial Images

Hector Eduardo Tovanche-Picón 1, Diego Mercado Ravell 2
1 Department of Industrial Engineering and Manufacturing, The Autonomous University of Ciudad Juarez, Cd. Juarez, 32584, Chihuahua, Mexico
2 Center for Research in Mathematics, Quantum Knowledge City, Zacatecas, Mexico

Abstract
In this article, a convolutional neural network model is presented for semantic segmentation of aerial images in urban areas, using the VGG16 architecture as the encoder and U-Net as the decoder. The model was trained and evaluated on the publicly available Semantic Drone dataset, which consists of aerial images acquired at altitudes ranging from 5 to 30 meters. Various data augmentation techniques, such as random elastic deformation and brightness adjustment, were applied to enhance the model's generalization capability. The obtained results show an average accuracy of 81% in segmenting 23 different classes, including people, cars, and dogs. Additionally, an inference speed of up to 50 fps was achieved after optimizing the model on a GPU. Overall, the proposed model has the potential to be employed in drone security applications and decision-making processes in urban areas.

Keywords: semantic segmentation, deep learning, VGG16, UNet, data augmentation

1. Introduction

In recent years, there has been significant interest in the application of deep learning techniques for semantic segmentation of aerial images [1, 2, 3, 4, 5]. One of the most popular techniques is the use of Convolutional Neural Networks (CNNs), which have been successfully applied in various computer vision applications. CNN-based models have demonstrated a remarkable ability to learn relevant features in images and to segment different object classes with high accuracy and efficiency [3, 6, 7]. Additionally, other neural network architectures, such as Fully Convolutional Networks (FCNs) [8], Encoder-Decoder Networks (ENC-DEC) [9, 10, 11, 12], and Self-Attention Networks (SAN) [13], have also shown promising results in semantic image segmentation. However, the application of these techniques to semantic segmentation of aerial images faces challenges such as variability in object appearance and texture, the presence of shadows and reflections, and the lack of labeled data. Despite these challenges, the use of deep learning techniques in semantic segmentation of aerial images remains an active and evolving research area, with numerous opportunities for developing new models and approaches to further enhance accuracy and efficiency in this critical task.

CISETC 2023: International Congress on Education and Technology in Sciences, December 04–06, 2023, Zacatecas, Mexico
hector.tovanche@uacj.mx (H. E. Tovanche-Picón); diego.mercado@cimat.mx (D. M. Ravell)
https://sites.google.com/view/ph-d-diego-mercado (D. M. Ravell)
ORCID: 0000-0001-5073-633X (H. E. Tovanche-Picón); 0000-0002-7416-3190 (D. M. Ravell)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

In the scientific literature, different strategies have been proposed to address the challenges associated with deep learning-based semantic segmentation of aerial images.
For example, new techniques have been developed to generate synthetic data and augment the training dataset [14], which can help improve model generalization. Image preprocessing techniques, such as atmospheric correction [15] and image normalization [16], have also been proposed to enhance the consistency and quality of the input data. Additionally, hybrid approaches combining deep learning techniques with traditional image processing methods have been suggested [17], leveraging the advantages of both approaches to overcome their limitations. Beyond deep learning, there are other strategies for semantic segmentation of aerial images, such as feature-based and graph-based approaches [18]. Feature-based approaches focus on extracting relevant features from images, such as texture and shape, to segment different object classes [19]. A review of the state of the art in deep learning-based semantic segmentation for aerial images reveals a broad research field with numerous promising techniques and approaches that have the potential to significantly improve the accuracy and efficiency of semantic segmentation in aerial images.

This work presents the evaluation of a convolutional neural network applied to the task of semantic segmentation of aerial images using the VGG16 and U-Net architectures, with a publicly available dataset for training and validation. The remainder of the document is structured as follows: Section 2 details the architecture selected for training the semantic segmentation model, the dataset used, and the data augmentation techniques. Section 3 describes the parameters selected for the training stage of the model. Section 4 presents the experimental results of our work, focusing on evaluating the accuracy of the deep learning-based semantic segmentation model for aerial images on a public dataset, class by class, as well as the optimization required to run the model in real time. Finally, Section 5 discusses the conclusions of the work and presents possible future directions for research in this field.

2. Semantic Segmentation in Aerial Images

In this section, we describe in detail an approach to semantic segmentation based on a neural network architecture that combines a VGG16-based encoder [20] with a U-Net-based decoder [9].

The VGG16 architecture [20] is a prominent convolutional neural network that has been highly influential in image classification tasks. Its structure is characterized by a sequence of convolutional and pooling layers, followed by fully connected layers at the top of the network. In the task of semantic segmentation, the VGG16 network plays a fundamental role as an encoder capable of extracting highly relevant features from the input images. This architecture has earned a prominent place in the fields of computer vision and deep learning due to its ability to understand and represent complex features in images, making it valuable in a variety of applications.

The U-Net architecture [9], on the other hand, is an encoder-decoder neural network designed specifically to address challenges in semantic segmentation of images, initially conceived for medical applications. However, its versatility has made it suitable for various other domains, including semantic segmentation of aerial images.
This architecture is distinguished by its dual structure, comprising a downward section that uses convolutional and pooling layers to reduce the spatial resolution of the image, followed by an upward section that uses upsampling and concatenation layers to recover spatial resolution and generate the final segmentation mask. The U-Net architecture has become an essential tool in image processing, enabling precise and detailed segmentation in a wide range of applications, from medical diagnostics to mapping land surfaces from the air.

By combining both architectures into a single network (see Table 1), the inherent strengths of VGG16 as an encoder and U-Net as a decoder are leveraged, resulting in a highly effective approach for semantic segmentation of images. VGG16, with its deep structure of convolutional and pooling layers, excels at extracting visually relevant features from input images, identifying patterns, textures, and key details. These features, acting as high-level knowledge, are essential for semantic segmentation. U-Net, with its specific encoder-decoder design, specializes in the precise reconstruction of segmentation masks: the downward section simplifies the task by reducing spatial resolution, while the upward section recovers fine details and local context. The key to this combination lies in the seamless transition between both architectures, using the features extracted by VGG16 as input for the upward section of U-Net. This approach provides accurate and consistent segmentation by combining rich detail information and local context with high-level features.

Table 1
Simplified combined architecture of VGG16 and U-Net for semantic segmentation.

Stage | Layer | Operation | Parameters
Encoder (VGG16) | Convolutional Layer | 2D Convolution | 64 filters, kernel 3x3, ReLU activation
Encoder (VGG16) | Convolutional Layer | 2D Convolution | 64 filters, kernel 3x3, ReLU activation
Encoder (VGG16) | Pooling Layer | 2D Max Pooling | kernel 2x2
Encoder (VGG16) | Convolutional Layer | 2D Convolution | 128 filters, kernel 3x3, ReLU activation
Encoder (VGG16) | Convolutional Layer | 2D Convolution | 128 filters, kernel 3x3, ReLU activation
Encoder (VGG16) | Pooling Layer | 2D Max Pooling | kernel 2x2
Decoder (U-Net) | Upsampling Layer | 2D Deconvolution | kernel 2x2
Decoder (U-Net) | Concatenation Layer | Concatenation | with the output of the corresponding encoder stage
Decoder (U-Net) | Convolutional Layer | 2D Convolution | 128 filters, kernel 3x3, ReLU activation
Output | Convolutional Layer | 2D Convolution | 1 filter, kernel 1x1, Sigmoid activation

In the proposed architecture, the VGG16-based encoder is employed to extract features from the input aerial images, which are then fed into the U-Net-based decoder to generate the final segmentation mask. Additionally, regularization techniques such as dropout and batch normalization are utilized to enhance generalization and prevent overfitting. A minimal implementation sketch of this encoder-decoder combination is given below.
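The paper does not specify the software framework used for the implementation. As an illustration only, the following minimal TensorFlow/Keras sketch shows how a VGG16-style encoder can be combined with a U-Net-style decoder in the spirit of Table 1; the number of stages, filter counts, dropout rate, and the 24-channel softmax output are assumptions made for the sketch rather than the authors' exact configuration.

```python
# Minimal sketch of a VGG16-style encoder with a U-Net-style decoder (TensorFlow/Keras).
# Depths follow the simplified layout of Table 1; a full model would repeat the pattern
# for all five VGG16 stages and the corresponding decoder stages.
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """Two 3x3 convolutions with ReLU, as in each VGG16 stage."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_vgg16_unet(input_shape=(256, 256, 3), num_classes=24):
    inputs = layers.Input(shape=input_shape)

    # Encoder (VGG16-style): convolutions followed by 2x2 max pooling.
    e1 = conv_block(inputs, 64)
    p1 = layers.MaxPooling2D(2)(e1)
    e2 = conv_block(p1, 128)
    p2 = layers.MaxPooling2D(2)(e2)

    # Bottleneck with batch normalization and dropout for regularization.
    b = conv_block(p2, 256)
    b = layers.BatchNormalization()(b)
    b = layers.Dropout(0.5)(b)

    # Decoder (U-Net-style): 2x2 transposed convolution, concatenation with the
    # corresponding encoder output (skip connection), then convolutions.
    d2 = layers.Conv2DTranspose(128, 2, strides=2, padding="same")(b)
    d2 = layers.Concatenate()([d2, e2])
    d2 = conv_block(d2, 128)
    d1 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(d2)
    d1 = layers.Concatenate()([d1, e1])
    d1 = conv_block(d1, 64)

    # Output: 1x1 convolution. Table 1 lists a single-filter sigmoid output for the
    # binary case; a 24-channel softmax is assumed here for the multi-class masks.
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(d1)
    return Model(inputs, outputs, name="vgg16_unet")

model = build_vgg16_unet()
model.summary()
```

In practice, the encoder could also be instantiated from tf.keras.applications.VGG16 with ImageNet weights, using its intermediate activations as the skip connections; the sketch above only illustrates the overall encoder-decoder wiring.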
2.1. Image Division Based on Altitude for Semantic Segmentation

The categorization of aerial images by altitude is a key approach in semantic segmentation, as the features and objects present in the images vary significantly depending on the altitude at which the image was captured. These images can be classified into three main categories based on their acquisition altitude: low, medium, and high. Low-altitude images are typically captured at heights of less than 30 meters and show fine details of objects such as buildings, vehicles, and pedestrians. Medium-altitude images are obtained at altitudes between 30 and 150 meters, providing a broader view of the photographed area and allowing for a better understanding of the context and distribution of objects in a scene. High-altitude images, in turn, are taken at altitudes above 150 meters and offer an overview of a region, facilitating the understanding of the distribution of objects on a large scale. This categorization by altitude enables semantic segmentation models to focus on the image features that are relevant to the corresponding acquisition altitude, which can significantly improve the accuracy and efficiency of semantic segmentation models, especially when deep learning techniques are employed.

2.2. Dataset

The dataset used in this work is the Semantic Drone Dataset [21], which focuses on the semantic understanding of urban scenes to enhance the safety of autonomous drone flight and landing procedures. The dataset's images depict more than 20 houses from a top-down (bird's-eye) view acquired at an altitude of 5 to 30 meters above the ground. A high-resolution camera was used to capture images of size 6000 × 4000 px (24 Mpx). The dataset includes labels for 24 different classes; Table 2 lists the 24 classes and their assigned RGB values. Figure 1 shows four examples of RGB images and the various represented scenarios, and Figure 2 displays the corresponding masks for the example images, where different colors represent the classes present in the dataset. The training set consists of 400 publicly available images, while the test set comprises 200 private images. This dataset is widely used in research on deep learning-based semantic segmentation for aerial images due to the diversity of objects and urban contexts presented in the images, which poses an interesting challenge for machine learning models.

Table 2
RGB values for the dataset classes [21]

Name | R | G | B
Unlabeled | 0 | 0 | 0
Paved Area | 128 | 64 | 128
Soil | 130 | 76 | 0
Grass | 0 | 102 | 0
Gravel | 112 | 103 | 87
Water | 28 | 42 | 168
Rocks | 48 | 41 | 30
Pool | 0 | 50 | 89
Vegetation | 107 | 142 | 35
Roof | 70 | 70 | 70
Wall | 102 | 102 | 156
Window | 254 | 228 | 12
Door | 254 | 148 | 12
Fence | 190 | 153 | 153
Fence Post | 153 | 153 | 153
Dog | 102 | 51 | 0
Car | 9 | 143 | 150
Bicycle | 119 | 11 | 32
Tree | 51 | 51 | 0
Bare Tree | 190 | 250 | 190
AR Marker | 112 | 150 | 146
Obstacle | 2 | 135 | 115
Conflict | 255 | 0 | 0
Person | 255 | 22 | 96

Figure 1: Examples of RGB images from the dataset [21].
Figure 2: Examples of masks from the dataset [21].

2.3. Data Augmentation

Generating synthetic data or applying data augmentation techniques is a common approach to enhance the ability of semantic segmentation models to generalize and adapt to different scenarios and conditions. Data augmentation involves creating new images from the original ones by applying random transformations such as rotation, scaling, brightness changes, and contrast adjustments, among others.

A common data augmentation technique used in semantic segmentation is elastic deformation-based data augmentation, which involves applying a random elastic deformation to the original image to create a new synthetic image. This is achieved by adding a small fraction of a random elastic displacement field to the original position of each pixel. The random elastic displacement field is generated from a white noise function that is turned into a smooth vector field through the application of a Gaussian filter. The transformation is represented in Equation 1:

x′_ij = x_ij + u_ij + ∂u_ij/∂x_i + ∂u_ij/∂x_j    (1)

where x′_ij is the position of a pixel in the deformed image, x_ij is the original position of the pixel, and u_ij is the displacement of the pixel's position. The formula indicates that the deformed position of a pixel is equal to its original position plus the displacement, along with the contribution of the partial derivatives of the displacement function u in the x_i and x_j directions. This technique is used to simulate deformations that may occur in an aerial image due to factors such as lens distortion and aerial vehicle movement, among others. A brief implementation sketch of this deformation is given below.
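As an illustrative sketch only (the paper does not report the field strength or implementation details), the elastic deformation of Equation 1 can be approximated with NumPy/SciPy by smoothing white-noise displacement fields with a Gaussian filter and warping the image and its mask with the same field. The parameters alpha and sigma below are assumed values.

```python
# Sketch of elastic deformation for data augmentation (NumPy/SciPy).
# Assumes an H x W x C image and an H x W class-index mask; parameters are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, mask, alpha=34.0, sigma=4.0, rng=None):
    """Apply the same random elastic displacement field to an image and its mask.

    alpha scales the displacement magnitude; sigma is the Gaussian smoothing that
    turns white noise into a smooth vector field (cf. Equation 1).
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]

    # White-noise displacement fields, smoothed into a coherent vector field.
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha

    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([(y + dy).ravel(), (x + dx).ravel()])

    def warp(channel, order):
        # order=1: bilinear interpolation; order=0: nearest neighbour.
        return map_coordinates(channel, coords, order=order,
                               mode="reflect").reshape(h, w)

    # Bilinear interpolation for the image, nearest neighbour for the label mask
    # so that class indices are not blended.
    warped_image = np.stack([warp(image[..., c], 1)
                             for c in range(image.shape[-1])], axis=-1)
    warped_mask = warp(mask, 0)
    return warped_image, warped_mask
```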
Another popular data augmentation technique is random cropping, which involves cutting a random portion of the original image and using it as a new image. This technique is particularly useful for creating synthetic images containing partially visible objects, which can help the semantic segmentation model learn to recognize objects in challenging conditions. Equation 2 represents the random cropping function used,

x_i = random(0, w − p);  y_i = random(0, h − q)    (2)

where w and h are the width and height of the original image, respectively, p and q are the width and height of the desired crop, and random() is a function that returns a random number within the specified range.

The data augmentation technique based on brightness adjustment modifies the brightness of the images to enhance the model's ability to generalize and handle different lighting conditions. It consists of adding a constant value to all pixels in the image, which increases or decreases the brightness. The formula used for brightness adjustment is given in Equation 3,

I_b = I_o + v    (3)

where I_o is the original image, I_b is the brightness-adjusted image, and v is the constant value added to each pixel. The value of v can be randomly generated within a specific range to create variations in the image's brightness.

All images used during the training stage undergo these data augmentation techniques to increase the size of the dataset. Figure 3 shows two results of data augmentation, and Figure 4 shows the corresponding masks after being processed with the same techniques.

Figure 3: RGB images after data augmentation techniques: Random Crop, Elastic Deformation, and Brightness Change, respectively.

Combining different data augmentation techniques can significantly increase the diversity and quantity of data available for model training, improving its accuracy and ability to generalize to new scenarios and conditions. However, it is important to note that excessive data augmentation can also result in overfitting the model to the training data, limiting its ability to generalize to new data. A brief sketch of the cropping and brightness transforms is given below.
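A minimal sketch of the random-crop (Equation 2) and brightness-adjustment (Equation 3) transforms is given below; the crop size, brightness range, and 8-bit image assumption are illustrative choices, not values reported in the paper.

```python
# Sketch of random cropping (Equation 2) and brightness adjustment (Equation 3).
# Assumes an H x W x C uint8 image and an H x W class-index mask; parameters are illustrative.
import numpy as np

def random_crop(image, mask, p=256, q=256, rng=None):
    """Cut a random p x q window from the image and the corresponding mask."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    x = rng.integers(0, w - p + 1)   # x_i = random(0, w - p)
    y = rng.integers(0, h - q + 1)   # y_i = random(0, h - q)
    return image[y:y + q, x:x + p], mask[y:y + q, x:x + p]

def adjust_brightness(image, v_range=(-40, 40), rng=None):
    """Add a random constant v to every pixel: I_b = I_o + v (clipped to 8-bit range)."""
    rng = rng or np.random.default_rng()
    v = rng.integers(*v_range)
    return np.clip(image.astype(np.int16) + v, 0, 255).astype(np.uint8)
```

The label mask is cropped with the same window as the image but is never brightness-adjusted, since its values are class indices rather than intensities.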
3. Model Training

The training process of the machine learning-based segmentation model requires certain resources and tools. One of the most crucial elements is hardware, as computational resources are needed to effectively carry out the model training. In other words, a computer with sufficient processing power is required to perform the mathematical operations involved in training the model. Additionally, it is necessary to define the training hyperparameters, which are the variables that control the model training process. These hyperparameters may include batch size, learning rate, and number of epochs, among others. Proper selection of hyperparameters can have a significant impact on the model's performance.

Figure 4: Masks after applying data augmentation techniques.

3.1. Hardware Specifications

To conduct the training of the neural network used in this study, a computer equipped with the Windows 11 operating system, an Intel Core i7 processor, an NVIDIA GeForce RTX 2060 graphics card, and 32 GB of RAM was employed. This type of hardware configuration is commonly used in deep learning tasks due to its high processing capability and available memory.

The RTX 2060 graphics card plays a crucial role in the training process of neural networks thanks to its Tensor Core architecture, which substantially accelerates matrix operations, an essential function in image processing and particularly in applications such as semantic segmentation. The large amount of RAM available in the system allows efficient storage and processing of large datasets, such as the one used in this study; this facilitates the neural network training process and ultimately reduces the time required to obtain a well-performing trained model.

3.2. Training Hyperparameters

For the training of the neural network, several hyperparameters were set that influence the performance and accuracy of the semantic segmentation model for aerial images. First, an input size of 256 × 256 pixels was used for each image; this size was chosen to balance model accuracy with the time and resources required for training. The batch size, i.e., the number of images used in each iteration during training, was set to 8, a choice based on the memory capacity available in the hardware used for training. The number of epochs was fixed at 200, meaning the training dataset was iterated over 200 times to adjust the neural network weights and optimize the model. The initial learning rate, which controls the rate at which the neural network weights are updated during training, was set to 1 × 10⁻⁴; this value was adjusted to ensure an appropriate convergence rate of the model. Lastly, a Dice coefficient-based loss function was used for the stochastic gradient descent optimization process. The Dice coefficient is commonly employed as a metric for evaluating semantic segmentation models, as it measures the overlap between the predicted segmentation mask and the ground-truth segmentation mask. A minimal sketch of this training configuration is given below.
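The following sketch illustrates how the reported hyperparameters (256 × 256 inputs, batch size 8, 200 epochs, learning rate 1 × 10⁻⁴) and a Dice-based loss could be wired together in Keras. The optimizer class, smoothing constant, and data pipeline are assumptions; the paper only states that a Dice coefficient function was used with stochastic gradient descent.

```python
# Sketch of the training configuration with the reported hyperparameters;
# the optimizer, smoothing constant, and data pipeline are assumed for illustration.
import tensorflow as tf

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Soft Dice coefficient, 2|X ∩ Y| / (|X| + |Y|), computed over the batch."""
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def dice_loss(y_true, y_pred):
    return 1.0 - dice_coefficient(y_true, y_pred)

# build_vgg16_unet is the sketch given in Section 2.
model = build_vgg16_unet(input_shape=(256, 256, 3), num_classes=24)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
              loss=dice_loss,
              metrics=[dice_coefficient])

# train_ds and val_ds are assumed tf.data pipelines that yield (image, one-hot mask)
# pairs of size 256 x 256, already batched with batch size 8 and passed through the
# augmentation functions sketched above.
history = model.fit(train_ds, validation_data=val_ds, epochs=200)
```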
4. Results

4.1. Evaluation Metrics

To assess the performance of the trained model, two metrics widely used in the literature are employed: the Dice coefficient and the Jaccard coefficient, also known as Intersection over Union (IoU).

4.1.1. Dice Coefficient

The Dice coefficient is a similarity metric employed in semantic segmentation tasks, such as deep learning-based aerial image segmentation. It is commonly used as a loss function during model training and can also be used to evaluate the quality of segmentation on the test set. The Dice coefficient is calculated as the ratio between twice the area of intersection of the segmentation mask generated by the model and the true mask of the image, and the total area of the two masks combined. Its value ranges from 0 to 1, where 1 indicates perfect segmentation, meaning the model-generated mask precisely matches the real mask of the image, and 0 indicates that the segmentation performed by the model is completely incorrect. The Dice coefficient is calculated with the following formula:

Dice = 2|X ∩ Y| / (|X| + |Y|)    (4)

where X and Y are the segmentation mask generated by the model and the actual segmentation of the image, respectively, ∩ denotes the intersection of two sets, and |X| and |Y| represent the sizes of sets X and Y.

4.1.2. Jaccard Coefficient

The Jaccard coefficient is a commonly used metric to assess the similarity between two sets. In the context of semantic segmentation of aerial images, the Jaccard coefficient can be employed to measure the accuracy of the segmentation obtained by the neural network. It is expressed as the ratio between the intersection of two sets and their union:

J(A, B) = |A ∩ B| / |A ∪ B|    (5)

where A and B are the sets being compared, A ∩ B is their intersection (i.e., the elements common to both sets), and A ∪ B is their union (i.e., all elements appearing in at least one of the sets).

4.2. Evaluation on Test Dataset

To assess the model's performance, a validation was conducted using 100 images from the public dataset that were not part of the training set. Figure 5 presents examples of input images and predicted masks compared to the ground truth.

Figure 5: Input images for validation (left column), ground truth (center column), and predicted mask or inference (right column).

The results shown in Table 3 indicate a relatively high average accuracy in semantic segmentation of aerial images using the proposed neural network. The average Jaccard coefficient across all classes was 0.791, suggesting that the model can correctly identify most objects in the image. Looking at the results by class, segmentation of objects such as pools, dogs, cars, and water reached quite high scores, surpassing 0.9 in each case. On the other hand, objects such as fences, obstacles, and dirt areas obtained lower scores, possibly due to the difficulty of distinguishing these objects from their surroundings.

Table 3
Semantic segmentation results (Jaccard coefficient per class)

Class | Jaccard Coefficient
Unlabeled | 1.000000
Paved area | 0.892092
Dirt | 0.563520
Grass | 0.900506
Gravel | 0.725012
Water | 0.913750
Rocks | 0.699329
Pool | 0.975321
Vegetation | 0.670343
Roof | 0.870463
Wall | 0.582263
Window | 0.695331
Door | 0.931847
Fence | 0.533866
Fence post | 0.809327
Person | 0.606276
Dog | 0.978098
Car | 0.940523
Bicycle | 0.795602
Tree | 0.831066
Bald tree | 0.838161
AR marker | 0.863405
Obstacle | 0.545169
Conflict | 1.000000

For the classes most relevant to security monitoring and traffic management in urban areas, the VGG16 and U-Net-based network achieved Jaccard coefficients of 0.606 for persons, 0.940 for cars, and 0.978 for dogs.

Overall, the obtained results are promising and suggest that deep learning-based semantic segmentation can be a useful tool in applications requiring a detailed understanding of aerial images, such as security surveillance, urban planning, and precision agriculture. However, it is essential to note that the model's accuracy can be influenced by various factors, including image quality, scene complexity, and variability in detected objects.
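For reference, per-class scores such as those in Table 3 can be obtained by evaluating Equations 4 and 5 class by class on a predicted and a ground-truth mask. The following NumPy sketch assumes the masks are arrays of class indices following Table 2; it is an illustration, not the authors' evaluation code.

```python
# Sketch of per-class Dice (Eq. 4) and Jaccard/IoU (Eq. 5) computation from
# predicted and ground-truth class-index masks; class ids follow Table 2.
import numpy as np

def dice_and_jaccard(pred_mask, true_mask, num_classes=24):
    """Return {class_id: (dice, jaccard)} for every class present in either mask."""
    scores = {}
    for c in range(num_classes):
        pred_c = (pred_mask == c)
        true_c = (true_mask == c)
        intersection = np.logical_and(pred_c, true_c).sum()
        union = np.logical_or(pred_c, true_c).sum()
        total = pred_c.sum() + true_c.sum()
        if union == 0:          # class absent in both masks: skip it
            continue
        scores[c] = (2.0 * intersection / total, intersection / union)
    return scores
```

Averaging the per-class Jaccard values of this function over a set of validation images would yield a table in the spirit of Table 3.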
4.3. Inference Time

Inference time is a critical factor in real-time applications such as the monitoring and analysis of aerial images, so the model's processing speed was evaluated using different hardware configurations, including CPU and GPU. When running on the CPU, the model took an average of 0.5 seconds to process each image, i.e., an inference rate of 2 fps, which is too low for practical real-time applications. Running the model on the GPU without optimization increased the inference rate to 20 fps, a significant improvement over the CPU. Finally, after optimizing the model for the GPU with techniques such as operation fusion and precision conversion, an average inference rate of 50 fps was achieved, meaning the model can process 50 images per second and highlighting the importance of optimization in improving model performance. Overall, these results suggest that deep learning-based models for real-time semantic segmentation of aerial images are feasible when suitable hardware and optimization techniques are used.

5. Conclusions

In summary, this article has proposed a robust solution for semantic segmentation in aerial imagery, leveraging a neural network architecture that combines VGG16 and U-Net. The Semantic Drone Dataset served as the cornerstone for training, and the integration of data augmentation techniques further amplified model accuracy. The achieved results are promising, showing a Jaccard coefficient above 0.5 for every class. Noteworthy enhancements in model accuracy were realized through the judicious application of data augmentation and hyperparameter optimization. In terms of inference time, a substantial boost was evident with GPU utilization and model optimization, which translates into a more streamlined application of the model in real-time scenarios. The demonstrated efficacy of deep learning-based semantic segmentation opens avenues for improved security and efficiency in diverse domains such as smart city planning and monitoring, precision agriculture, environmental surveillance, and other drone-centric applications.

Acknowledgments

The authors extend their gratitude to CONACYT (National Council of Science and Technology, Mexico) for their support in facilitating this research.

References

[1] H. Xiu, P. Vinayaraj, K.-S. Kim, R. Nakamura, W. Yan, 3D semantic segmentation for high-resolution aerial survey derived point clouds using deep learning (demonstration), ACM, 2018, pp. 588–591. URL: https://dl.acm.org/doi/10.1145/3274895.3274950. doi:10.1145/3274895.3274950.
[2] M. S. Alam, J. Oluoch, A survey of safe landing zone detection techniques for autonomous unmanned aerial vehicles (UAVs), Expert Systems with Applications 179 (2021) 115091. URL: https://linkinghub.elsevier.com/retrieve/pii/S0957417421005327. doi:10.1016/j.eswa.2021.115091.
[3] J. Kinahan, A. F. Smeaton, Image segmentation to identify safe landing zones for unmanned aerial vehicles (2021). URL: http://arxiv.org/abs/2111.14557.
[4] J. Gonzalez-Trejo, D. Mercado-Ravell, I. Becerra, R. Murrieta-Cid, On the visual-based safe landing of UAVs in populated areas: a crucial aspect for urban deployment, IEEE Robotics and Automation Letters 6 (2021) 7901–7908.
doi:10.1109/LRA.2021.3101861.
[5] J. A. González-Trejo, D. A. Mercado-Ravell, Monitoring social-distance in wide areas during pandemics: a density map and segmentation approach, CoRR (2021). URL: http://arxiv.org/abs/2104.03361v1. arXiv:2104.03361.
[6] A. A. Cabrera-Ponce, L. O. Rojas-Perez, J. A. Carrasco-Ochoa, J. F. Martinez-Trinidad, J. Martinez-Carranza, Gate detection for micro aerial vehicles using a single shot detector, IEEE Latin America Transactions 17 (2019) 2045–2052. URL: https://ieeexplore.ieee.org/document/9011550/. doi:10.1109/TLA.2019.9011550.
[7] R. Girshick, Fast R-CNN (2015) 1440–1448. URL: http://arxiv.org/abs/1504.08083.
[8] M. Fayyaz, M. H. Saffar, M. Sabokrou, M. Fathy, R. Klette, F. Huang, STFCN: Spatio-temporal FCN for semantic video segmentation (2016). URL: http://arxiv.org/abs/1608.05971.
[9] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation (2015). URL: http://arxiv.org/abs/1505.04597.
[10] V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: A deep convolutional encoder-decoder architecture for image segmentation (2015). URL: http://arxiv.org/abs/1511.00561.
[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs (2016). URL: http://arxiv.org/abs/1606.00915.
[12] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN (2017). URL: http://arxiv.org/abs/1703.06870.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need (2017). URL: http://arxiv.org/abs/1706.03762.
[14] E. Okafor, R. Smit, L. Schomaker, M. Wiering, Operational data augmentation in classifying single aerial images of animals, IEEE, 2017, pp. 354–360. URL: http://ieeexplore.ieee.org/document/8001185/. doi:10.1109/INISTA.2017.8001185.
[15] X. Yu, Q. Liu, X. Liu, X. Liu, Y. Wang, A physical-based atmospheric correction algorithm of unmanned aerial vehicles images and its utility analysis, International Journal of Remote Sensing 38 (2017) 3101–3112. URL: https://www.tandfonline.com/doi/full/10.1080/01431161.2016.1230291. doi:10.1080/01431161.2016.1230291.
[16] L. T. Thanh, D. N. H. Thanh, An adaptive local thresholding roads segmentation method for satellite aerial images with normalized HSV and LAB color models, 2020. URL: http://link.springer.com/10.1007/978-981-15-2780-7_92. doi:10.1007/978-981-15-2780-7_92.
[17] Y. Zhang, L. Fu, Y. Li, Y. Zhang, HDFNet: Hierarchical dynamic fusion network for change detection in optical aerial images, Remote Sensing 13 (2021) 1440. URL: https://www.mdpi.com/2072-4292/13/8/1440. doi:10.3390/rs13081440.
[18] Y. Li, R. Chen, Y. Zhang, H. Li, A CNN-GCN framework for multi-label aerial image scene classification, IEEE, 2020, pp. 1353–1356. URL: https://ieeexplore.ieee.org/document/9323487/. doi:10.1109/IGARSS39084.2020.9323487.
[19] R. Ratajczak, C. F. Crispim-Junior, E. Faure, B. Fervers, L. Tougne, Automatic land cover reconstruction from historical aerial images: An evaluation of features extraction and classification algorithms, IEEE Transactions on Image Processing 28 (2019) 3357–3371. URL: https://ieeexplore.ieee.org/document/8630683/. doi:10.1109/TIP.2019.2896492.
[20] S. Liu, W. Deng, Very deep convolutional neural network based image classification using small training sample size, IEEE, 2015, pp. 730–734. URL: http://ieeexplore.ieee.org/document/7486599/. doi:10.1109/ACPR.2015.7486599.
[21] T. U. Graz, Semantic drone dataset, 2023. URL: https://www.tugraz.at/index.php?id=22387.