Identifying Urban Canopy Coverage from Satellite Imagery Using Convolutional Neural Networks⋆

Niamh Donnelly1, Conor Nugent2, and Brian Mac Namee1
1 School of Computer Science, University College Dublin, Ireland
2 Breadboard Labs, Ireland
niamh.donnelly1@ucdconnect.ie

Abstract. The availability of high resolution satellite imagery offers a compelling opportunity for the utilisation of state-of-the-art deep learning techniques in the applications of remote sensing. This research investigates the application of different Convolutional Neural Network (CNN) architectures for pixel-level segmentation of canopy coverage in urban areas. The performance of two established patch-based CNN architectures (LeNet and a pre-trained VGG16) and two encoder-decoder architectures (a simple 4-layer convolutional encoder-decoder and UNet) was compared using two datasets (a large set of images of the German town of Vaihingen and a smaller set of the US city of Denver). Results show that the patch-based methods outperform the encoder-decoder methods. It is also shown that pre-training is only effective with the smaller dataset.

Keywords: Convolutional Neural Network · Remote Sensing · Deep Learning · Canopy Coverage · Google Earth Engine

1 Introduction

Accurate estimation of urban tree canopy coverage is vital to the task of monitoring environmental resources (e.g. soil and air quality, wildlife habitats, levels of CO2 emissions) and for civic planning [17]. Typically, measuring canopy coverage requires human surveyors to manually annotate sample areas of an urban region. The level of coverage in the sample areas is then extrapolated to provide an estimate of the total tree coverage for the full region. This approach, however, is slow, resource intensive, and especially prone to problems of consistency. The recent emergence of affordable, broad-coverage remote sensing through improvements in satellite technology and public availability of high resolution aerial imagery has made automated solutions to estimating canopy coverage feasible. In particular, machine learning approaches based on convolutional neural networks (CNNs) that have been used in remote sensing tasks offer significant potential.

⋆ This work was partially supported under grant 12/RC/2289 from Science Foundation Ireland. The authors thank Paul Hickey from Breadboard Labs, Tanushree Biswas from The Nature Conservancy, and all data labellers.

This paper describes an experimental study in which the effectiveness of different CNN-based architectures for tree canopy coverage identification is evaluated. The evaluation uses both patch-based CNN architectures and encoder-decoder CNN architectures, and compares their performance on two datasets.

The remainder of this paper proceeds as follows. Section 2 surveys the state of the art in using machine learning approaches on remote sensing data. Section 3 describes the evaluation experiment performed, including the architectures used, testing metrics, datasets, and experimental design. The results from the evaluation are presented in Section 4 and discussed in Section 5. Finally, conclusions and directions for future work are presented in Section 6.

2 Related Work

Canopy cover is the percentage cover of tree canopy in a given area, including only trees and shrubs and ignoring any other forms of vegetation [11]. Prior to the accessibility of remote sensing data, measuring canopy was a manual process in which human surveyors identified and measured it from the ground.
Manual surveys were typically carried out in small sample areas of an urban region, with the level of canopy coverage measured in the sample area extrapolated to approximate the coverage in an entire region. More recently, surveyors have manually marked areas of canopy coverage in satellite images [1]. Again, this manual labelling is usually performed on a sample area, with the coverage of the overall region approximated based on this.

Machine learning offers a way to perform the identification and measurement of urban canopy coverage automatically. Not only would this eliminate significant manual effort, but it would also remove the need to extrapolate overall approximations from small sample areas. Stojanova et al. [24] benchmarked different classical machine learning techniques for identifying canopy coverage from satellite imagery and found that random forest models performed well.

CNNs have recently led to step-change performance improvements in many image processing tasks, including tasks based on satellite imagery. For example, it is now common to apply CNN-based approaches to land use classification: Basu et al. [3] describe an approach in which small patches of satellite imagery (sized 28x28 pixels) are classified into six land use classes. In many cases pixel-level land use classification is required, rather than classification of larger patches (see for example Fig. 4). Pixel-level land use segmentation can be achieved using CNNs that classify each pixel in an image individually [16], or using encoder-decoder network architectures that generate a segmentation based on an input image [25]. In the DeepGlobe land use segmentation benchmark competition [6], variations of the UNet [21] and SegNet [2] encoder-decoder network architectures led the field.

There is not extensive research in the literature on canopy coverage identification and measurement using deep learning techniques. Guirado et al. [9] describe a comparative study between the performance of classical machine learning techniques (e.g. SVM, KNN) and a CNN classification approach to determine the presence of shrubs in extracted satellite images (sized 28x28 pixels). Similar patch-based approaches are described in [5,19]. Pixel-level segmentation of satellite images into canopy and non-canopy regions, however, remains largely unexplored.

In patch-based approaches to pixel-level segmentation a model is trained to label the class of the centre pixel in a small image patch, taking into account its surrounding context. This procedure can be efficiently implemented using feed-forward CNN architectures that take a small image patch as input and output the class of the pixel at the centre of the patch. Popular CNN architectures for this type of task include LeNet [15], VGG16 [23], and ResNet [10].
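To make this concrete, the following minimal sketch (an illustration under stated assumptions, not the implementation evaluated in this paper) shows how a trained patch classifier can be used to produce a pixel-level segmentation. It assumes `model` is a Keras-style binary classifier that returns a canopy probability for each 33x33 patch, and uses reflect padding (an arbitrary choice) to give border pixels a full patch of context.

```python
import numpy as np

def segment_by_patches(image, model, patch=33):
    """Label every pixel of `image` by classifying the patch centred on it."""
    r = patch // 2
    # Pad so that border pixels also have a full patch of surrounding context.
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        # Classify one image row of patches at a time to keep memory bounded.
        row = np.stack([padded[y:y + patch, x:x + patch] for x in range(w)])
        probs = model.predict(row, verbose=0)[:, 0]
        mask[y] = (probs > 0.5).astype(np.uint8)
    return mask
```

Note that this scheme requires one forward pass through the network per pixel; the computational implications of this are returned to in Section 5.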
Encoder-decoder techniques produce a complete segmentation of an input image rather than a single class label. The pooling layers of the encoder gradually down-sample the input, before the decoder uses upsampling layers to gradually increase the convolutional layer size back to the scale of the original image. Training encoder-decoders requires that the ground truth of a dataset consists of fully segmented image masks. This requires a more involved labelling approach than patch-based techniques: patch-based methods involve labelling a single pixel and extracting the surrounding patch, whereas the encoder-decoder technique requires detailed segmentation of the entire patch by delineating the borders of all objects in an image. Popular encoder-decoder architectures include UNet [21] and Fast R-CNN [7].

3 Experimental Methods

To explore the use of CNN architectures to automatically identify and measure urban canopy coverage we perform a series of benchmark evaluation experiments. These compare the performance of patch-based (LeNet, VGG16) and encoder-decoder (UNet, and a simple 4-layer network) architectures across two datasets. The remainder of this section describes the datasets, and outlines the experimental procedures for developing and testing the model architectures.3

3.1 Remote Sensing Datasets

This study makes use of a pre-existing dataset developed by the International Society for Photogrammetry and Remote Sensing (ISPRS) which contains images of the city of Vaihingen in Germany. In addition to this, a bespoke dataset was generated for the city of Denver, Colorado in the US based on satellite imagery obtained from Google Earth.

3.1.1 The Vaihingen Dataset This dataset contains imagery of Vaihingen, a densely populated town in Germany. The dataset contains 33 tiles of varying sizes; most contain approximately 2000x2000 pixels at a resolution of 9 cm per pixel. For each tile area, a colour-infrared (CIR) image is provided. The CIR images consist of three spectral bands, corresponding to near infrared, red, and green colour channels (IR-R-G). The IR-R-G channels of the CIR images result in a heightened red hue across images. For each image, a corresponding ground truth image is provided showing segmentation into six land use classes (impervious surfaces, building, low vegetation, tree, car, clutter/background). For the specific task of tree canopy coverage, the Vaihingen ground truth was converted to binary format, retaining the original canopy label and converting all additional classes to non-canopy. Fig. 1 shows an example from the Vaihingen dataset.

3 Access to code to run all experiments is provided at https://github.com/engineevecanopy_coverage_project

Fig. 1: Sample of 33x33 pixel patches extracted from the Vaihingen dataset.

The original Vaihingen image tiles were segmented into small 33x33 pixel image patches using a sliding window approach with a stride of 18 pixels. A selection of sample patches is shown in Fig. 1. 25 of the 33 Vaihingen tiles were randomly selected for use as a training dataset, resulting in 372,884 training patches. For the patch-based approaches each patch had an associated canopy or non-canopy label based on its central pixel. For the encoder-decoder approaches each patch had a corresponding ground truth image in which pixels were segmented as canopy or non-canopy. The ratio of non-canopy to canopy pixels in the training dataset was approximately 7:2.

Two adjoining tiles from the Vaihingen dataset (approximately a 2,500x4,000 pixel area) were used as a test set. In order to obtain a prediction for every pixel in the test tiles, different extraction techniques were used for patch-based and encoder-decoder techniques. For patch-based approaches, a stride of 1 was used when extracting the 33x33 image patches, resulting in a patch for every pixel in the image: 9,665,460 test patches.
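A minimal sketch of this sliding-window extraction is shown below (illustrative, not the released experiment code); it assumes `mask` is the binary ground truth raster, with 1 for canopy and 0 for non-canopy.

```python
import numpy as np

def extract_patches(image, mask, patch=33, stride=18):
    """Cut patch x patch windows from a tile and label each window
    with the canopy/non-canopy class of its centre pixel."""
    patches, labels = [], []
    r = patch // 2
    h, w = image.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(image[y:y + patch, x:x + patch])
            labels.append(mask[y + r, x + r])  # class of the centre pixel
    return np.stack(patches), np.array(labels)

# stride=18 was used to build the training set; at test time stride=1
# yields a patch, and hence a prediction, for every pixel.
```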
For the encoder-decoder architectures, the test tiles were segmented into 33x33 patches without any overlap (using a stride of 32), resulting in 9,438 test images. These images, however, still included over nine million classification opportunities, as each pixel was classified.

3.1.2 The Denver Dataset For this study a bespoke dataset was created using Google Earth Engine (GEE) satellite imagery of the city of Denver, Colorado in the US. A tool was developed in GEE that allowed participants to view imagery of points within the city and apply one of three labels (tree, non-tree, or unsure) to them. A total of 11 people were employed to label approximately 500 data points each (one participant labelled an additional 1,000 points). After all participants completed the labelling process, a second script extracted 33x33 image patches surrounding each labelled point. A total of 6,895 labelled image patches were collected (patches labelled unsure were excluded). Unlike the Vaihingen dataset, the ground truth of the Denver dataset consists of a single label for each patch rather than semantically segmented label-encoded images; the Denver dataset is therefore unsuited to encoder-decoder techniques. Sample image patches from the Denver dataset are shown in Fig. 2.

Fig. 2: Sample of images extracted from the Denver dataset.

Fig. 3: Synthetic images for the Denver dataset generated using SMOTE.

The ratio of non-canopy to canopy patches in the Denver dataset was approximately 7:1. To counteract effects caused by this imbalance the Synthetic Minority Over-sampling Technique (SMOTE) [4] was applied. SMOTE generates synthetic examples of the under-represented class (in this case canopy), resulting in a balanced dataset. Fig. 3 shows examples of synthetic patches generated using SMOTE. The data was randomly split into training and test sets using a ratio of 75:25.

3.2 Experimental Design

Each experiment followed a hold-out test set design (with test sets as described in Section 3.1). Cross-validation was not used due to the significant amount of computation required to train and test models (for example, training a single VGG16 model on the Vaihingen dataset on a server containing an Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz with 72 CPU cores and approximately 500 GB of RAM took 18 hours). Similarly, due to the excessive computation required, grid searches were not used to identify hyper-parameter values; instead, recommendations from the literature were used. As all experiments involve binary classification, precision, recall, and macro-averaged F1-scores [12] are used to measure the performance of all models. The remainder of this section describes the two experiments performed.

Fig. 4: Aerial image and ground truth segmentation (green represents canopy) for one of the Vaihingen test area tiles.

3.2.1 Experiment 1 This experiment used the Vaihingen dataset to compare the performance of the LeNet and VGG16 patch-based CNN architectures and two encoder-decoder architectures: a simple convolutional encoder-decoder and the UNet architecture.

The LeNet model replicated the architecture described in [15], with the exception that the size of the input layer was reduced to 33x33 to match the input size of image patches. The LeNet model used binary cross entropy loss and the Adam optimiser [14] with a learning rate of 0.000001 and a batch size of 100.

The VGG16 model used model weights that were pre-trained on the ImageNet dataset [22]. The model architecture followed that described in [23]. Due to a size restriction for the pre-built VGG16 Keras model, the training image patches were upsampled to a resolution of 48x48 pixels. The VGG16 model used binary cross entropy loss and the stochastic gradient descent optimiser [13] with a learning rate of 0.0001 and a batch size of 128.
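As an illustration of this transfer-learning setup, the sketch below loads ImageNet weights into a VGG16 backbone and attaches a binary classification head. The head (a 256-unit dense layer) and the decision to leave the backbone trainable are assumptions; the paper does not specify these details.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

def vgg16_patch_classifier(input_size=48):
    """Pre-trained VGG16 backbone adapted for binary canopy classification."""
    base = VGG16(weights="imagenet", include_top=False,
                 input_shape=(input_size, input_size, 3))
    x = layers.Flatten()(base.output)
    x = layers.Dense(256, activation="relu")(x)   # assumed head size
    out = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(base.input, out)
    # Hyper-parameters from Section 3.2.1: SGD with learning rate 0.0001.
    model.compile(optimizer=optimizers.SGD(learning_rate=1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```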
For both of the encoder-decoder models, image patches were resized to 32x32, as the up-sampling layers of the encoder-decoder architectures require even-numbered input dimensions.

The simple encoder-decoder model was implemented using the Keras package and had four convolutional layers. It used mean squared error loss and the Adadelta optimiser [26] with a batch size of 120. The architecture of the simple 4-layer encoder-decoder included two convolution layers for the encoder assembly and two for the decoder. All four convolution layers of the network comprised 32 filters with a kernel size of 3x3. Convolutions used a stride of 1 with zero padding and a rectified linear unit activation function. The encoder assembly incorporated two max pooling layers (one after each convolution layer) with a kernel size of 2x2, while the decoder assembly included two upsampling layers (one after each convolution layer), similarly with a kernel size of 2x2. A sketch of this architecture is given below.

The UNet architecture was implemented using the Keras package and followed the specification described in [21]. It used mean squared error loss and the Adam optimiser with a learning rate of 0.0002 and a batch size of 112.
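The following Keras sketch is one plausible reading of the simple 4-layer encoder-decoder. The final 1x1 sigmoid convolution, which maps the decoded features to a per-pixel canopy probability, is an assumption, as the paper does not describe the output layer.

```python
from tensorflow.keras import layers, models

def simple_encoder_decoder(input_shape=(32, 32, 3)):
    """4-layer convolutional encoder-decoder for binary segmentation."""
    inp = layers.Input(shape=input_shape)
    # Encoder: two 3x3 convolutions (32 filters), each followed by 2x2 pooling.
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    # Decoder: two 3x3 convolutions, each followed by 2x2 upsampling.
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling2D(2)(x)
    # Assumed output layer: 1x1 convolution giving per-pixel probabilities.
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adadelta", loss="mse")  # as in Section 3.2.1
    return model
```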
3.2.2 Experiment 2 Experiments using the Denver dataset involved only the patch-based CNNs (the VGG16 and LeNet architectures), as the ground truth required for encoder-decoder networks was not available. The model architectures used for this experiment were the same as those used in Experiment 1.

4 Results

This section presents results from the two experiments; for each, performance metrics, confusion matrices, and sample segmentations are shown.

In Experiment 1, two tiles that contained substantial dispersion of canopy and a variety of other features were used as the test area. Fig. 4 shows one tile and its associated ground truth. The performance of each architecture is shown in Table 1, and associated confusion matrices are in Table 2. The areas of canopy identified by each approach are shown in Fig. 5, which can be compared to the ground truth image in Fig. 4.

Only the patch-based CNN models were used in Experiment 2, which used the Denver dataset. The performance of each of these is shown in Table 3, with associated confusion matrices shown in Table 4. The pre-trained VGG16 architecture produced the highest scores across all metrics.

Table 1: Performance results on the Vaihingen test dataset.

Model                    Precision  Recall  F1-Score
LeNet                    0.9100     0.9017  0.9057
Pre-trained VGG16        0.8844     0.8989  0.8913
4-layer encoder-decoder  0.8716     0.6398  0.6631
UNet                     0.8714     0.6007  0.6078

Table 2: Confusion matrices for the Vaihingen dataset (rows: true, columns: predicted).

(a) LeNet
            Non-canopy  Canopy
Non-canopy  6 689 190   295 981
Canopy      368 562     2 020 403

(b) Pre-trained VGG16
            Non-canopy  Canopy
Non-canopy  6 416 847   459 193
Canopy      319 203     2 039 213

(c) Simple encoder-decoder
            Non-canopy  Canopy
Non-canopy  696 639     1 741 136
Canopy      43 743      7 064 210

(d) UNet
            Non-canopy  Canopy
Non-canopy  420 858     1 939 099
Canopy      22 082      7 085 871

Fig. 5: Segmentation results of the models applied to the test area of the Vaihingen dataset: (a) LeNet, (b) pre-trained VGG16, (c) 4-layer encoder-decoder, (d) UNet.

Table 3: Test results of patch-based models on the Denver dataset.

Model              Precision  Recall  F1-Score
LeNet              0.5823     0.5031  0.4763
Pre-trained VGG16  0.7840     0.7940  0.7889

Table 4: Confusion matrices for the Denver dataset (rows: true, columns: predicted).

(a) LeNet
            Non-canopy  Canopy
Non-canopy  1 535       6
Canopy      180         3

(b) Pre-trained VGG16
            Non-canopy  Canopy
Non-canopy  1 421       86
Canopy      77          140

5 Discussion

It is clear from Table 1 that for the Vaihingen dataset CNNs can accurately detect urban canopy coverage. The maximum F1-score of 0.91, achieved by the LeNet model on the Vaihingen dataset, exceeds that reported in the literature for classical machine learning methods applied to this task, which are typically in the range 0.85 to 0.9 [9]. These results demonstrate that CNNs can be used to accurately perform pixel-level segmentation of urban canopy coverage.

Fig. 6: Examples from the Denver test dataset misclassified as non-canopy by the LeNet and VGG16 models.

Despite performing well on the Vaihingen dataset, the LeNet model did not perform well on the Denver dataset. This is most likely due to the much smaller size of the second dataset (just 9,088 training examples after SMOTE upsampling, compared to 372,884 for the Vaihingen dataset). Improved performance on the Denver dataset was observed using the pre-trained VGG16 model, a common finding for smaller datasets [18]. It is interesting, however, that the pre-trained VGG16 network did not outperform the much simpler LeNet architecture on the Vaihingen dataset. Although the LeNet achieved higher scores in terms of performance metrics, examining the reconstructions of the sample patch in Fig. 7 (described below) shows that the VGG16 is the only model to identify an area of tree covered by shadow. This indicates that it might be possible, with further network tuning and the application of some post-processing techniques, for the VGG16 model to obtain better overall performance.

Development of the Denver dataset required human annotators to apply a label to a point on a Google Earth Engine electronic map. Although an option of 'Unsure' was provided to participants, it is possible some labels suffered from response bias, whereby ambiguous points were recorded as either canopy or non-canopy. Fig. 6 shows a sample of some ambiguous points that were incorrectly classified as non-canopy by the VGG16 model although participants labelled them as canopy. This analysis revealed common sources of error to be points lying on the border of canopy and non-canopy (see Figs. 6a and 6d), within an area of shadow (see Figs. 6c and 6e), or over areas of other vegetation (see Fig. 6b), all of which had been labelled as canopy, arguably incorrectly. It is likely that further cleaning of this dataset to correct these types of errors could further improve performance.

Both patch-based approaches outperformed both of the encoder-decoder approaches. This was largely due to the poor recall of the encoder-decoder models (see Table 1). Further insights into the differences between these methods can be gained by examining Fig. 7, which shows a small 400x400 pixel area of the test tile from Fig. 4 with associated segmentations from the four model types. While the outputs of the patch-based approaches closely mirror the ground truth, the encoder-decoder models produce a more speckled, less defined representation. Further investigation of this disparity indicates that the reduced performance is a result of models misclassifying areas of low vegetation as the canopy class, a common problem for tree-identifying classifiers [20].

Fig. 7: Aerial image (a) and ground truth (b) for a sample patch, and the segmentations generated by (c) LeNet, (d) VGG16, (e) the 4-layer encoder-decoder, and (f) UNet.
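As an aside, the macro-averaged scores in Table 1 follow directly from the confusion matrices in Table 2. The short sketch below (illustrative, not part of the experiment code) reproduces the LeNet row of Table 1 from Table 2a:

```python
import numpy as np

def macro_scores(cm):
    """Macro-averaged precision, recall and F1 from a 2x2 confusion matrix
    (rows: true class, columns: predicted class)."""
    cm = np.asarray(cm, dtype=float)
    precision = np.diag(cm) / cm.sum(axis=0)  # per-class precision
    recall = np.diag(cm) / cm.sum(axis=1)     # per-class recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision.mean(), recall.mean(), f1.mean()

lenet_cm = [[6_689_190, 295_981],   # true non-canopy
            [368_562, 2_020_403]]   # true canopy
print(macro_scores(lenet_cm))  # approx. (0.9100, 0.9017, 0.9057)
```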
It is worth noting, however, that there are large differences in the computational cost when models using the different architectures are used to segment images. The encoder-decoder models predict the segmentation class of all pixels in a 32x32 patch in one pass through the network, so segmenting the 400x400 sample patch in Fig. 7 requires 144 passes through the network (allowing for padding). The patch-based models, however, require one pass through the network per pixel, so a total of 147,456 passes through the network are required to segment the same 400x400 pixel patch (again allowing for padding). This is an important difference when models are applied across large areas: a significant reduction in the computational effort required to segment a region could be achieved by using an accurate encoder-decoder network.

Training encoder-decoder networks, however, requires significantly more labelling effort than training patch-based models. A ground truth image that provides the segmentation class for every pixel is required. In contrast, patch-based methods simply require the label of the centre pixel to represent the class of a patch. The latter type of labelling is much faster (for the Denver dataset, participants produced approximately 500 labelled points per hour).

6 Conclusion

This research addresses the need for accurate classification of canopy coverage in urban areas, which has important applications in large-scale urban planning and environmental monitoring. The research pioneered the use of two distinct CNN approaches, patch-based architectures and encoder-decoder architectures, for pixel-level segmentation. Experiments compared the performance of two variants of each approach on the problem of canopy coverage for a dataset containing aerial images of the German town of Vaihingen. It was found that the patch-based approaches outperformed the encoder-decoder approaches by a significant margin, with a LeNet model demonstrating the best performance. A follow-on study used a smaller dataset, collected using the Google Earth Engine platform exclusively for this study, focused on the city of Denver, Colorado. In this experiment (which used only patch-based approaches) a pre-trained VGG16 model produced substantially higher performance levels than a fully trained LeNet architecture.

These findings provide strong evidence of the high performance potential of CNN architectures for canopy coverage identification and measurement, although they also suggest the need for further research. Due to time and computational constraints, it was not feasible to perform grid searches on any of the models. Model performance could potentially improve if optimal model parameters were identified through these searches. Further improvements could also be obtained through post-processing techniques: dilation and erosion techniques [8] are commonly used to add definition to object boundaries by filling in holes/gaps in images and separating objects that overlap, and often increase the accuracy of final predictions. A minimal sketch of this kind of post-processing is given below.
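In the sketch (which was not part of the reported experiments), closing fills small holes inside canopy regions and opening removes isolated speckle pixels; the 3x3 structuring element is an arbitrary assumption.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_opening

def clean_mask(mask, size=3):
    """Morphological post-processing of a predicted binary canopy mask."""
    structure = np.ones((size, size), dtype=bool)
    cleaned = binary_closing(mask.astype(bool), structure=structure)
    cleaned = binary_opening(cleaned, structure=structure)
    return cleaned.astype(np.uint8)
```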
Lastly, it is commonly known that the performance of CNNs is sensitive to the size of the dataset, so increasing the number of examples in the dataset could also yield further improvement.

References

1. Measuring the tree canopy cover in London: An analysis using aerial imagery. Tech. rep., London (September 2015)
2. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), 2481–2495 (2017)
3. Basu, S., Ganguly, S., Mukhopadhyay, S., DiBiano, R., Karki, M., Nemani, R.: DeepSat: A learning framework for satellite imagery. In: Proc. of SIGSPATIAL '15. pp. 37:1–37:10 (2015)
4. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
5. Chen, X., Xiang, S., Liu, C.L., Pan, C.H.: Vehicle detection in satellite images by hybrid deep convolutional neural networks. IEEE Geoscience and Remote Sensing Letters 11(10), 1797–1801 (2014)
6. Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D., Raskar, R.: DeepGlobe 2018: A challenge to parse the earth through satellite images. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2018)
7. Girshick, R.: Fast R-CNN. In: Proc. of the IEEE International Conference on Computer Vision. pp. 1440–1448 (2015)
8. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice Hall (2004)
9. Guirado, E., Tabik, S., Alcaraz-Segura, D., Cabello, J., Herrera, F.: Deep-learning versus OBIA for scattered shrub detection with Google Earth imagery: Ziziphus lotus as case study. Remote Sensing 9(12), 1220 (2017)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. CVPR 2016. pp. 770–778 (2016)
11. Jennings, S., Brown, N., Sheil, D.: Assessing forest canopies and understorey illumination: canopy closure, canopy cover and other measures. Forestry: An International Journal of Forest Research 72(1), 59–74 (1999)
12. Kelleher, J.D., Mac Namee, B., D'Arcy, A.: Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. The MIT Press (2015)
13. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics 23(3), 462–466 (1952)
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
15. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. of the IEEE 86(11), 2278–2324 (1998)
16. Mnih, V., Hinton, G.E.: Learning to label aerial images from noisy data. In: Proc. of ICML-12. pp. 567–574 (2012)
17. Moskal, L.M., Styers, D.M., Halabisky, M.: Monitoring urban tree cover using object-based image analysis and public domain remotely sensed data. Remote Sensing 3(10), 2243–2262 (2011)
18. Nogueira, K., Penatti, O.A., dos Santos, J.A.: Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recognition 61, 539–556 (2017)
19. Okafor, E., Pawara, P., Karaaba, F., Surinta, O., Codreanu, V., Schomaker, L., Wiering, M.: Comparative study between deep learning and bag of visual words for wild-animal recognition. In: IEEE SSCI 2016. pp. 1–8 (2016)
20. Paisitkriangkrai, S., Sherrah, J., Janney, P., Van-Den Hengel, A.: Effective semantic pixel labelling with convolutional networks and conditional random fields. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2015 IEEE Conference on. pp. 36–43. IEEE (2015)
21. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. CoRR abs/1505.04597 (2015)
22. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015)
23. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
24. Stojanova, D., Panov, P., Gjorgjioski, V., Kobler, A., Džeroski, S.: Estimating vegetation height and canopy cover from remotely sensed data with machine learning. Ecological Informatics 5(4), 256–266 (2010)
25. Volpi, M., Tuia, D.: Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing 55(2), 881–893 (2017)
26. Zeiler, M.D.: ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)