Latent Representations of Terrain in Aerial Image Classification

Pylyp Prystavka 1, Serge Dolgikh 2, Olga Cholyshkina 3 and Oleksandr Kozachuk 1
1 National Aviation University, 1 Lubomyra Huzara Ave, Kyiv, 03058, Ukraine
2 Solana Networks, 301 Moodie Drive, Ottawa, K2H 9C4, Canada
3 Interregional Academy of Personnel Management, 2/16 Frometivska St., Kyiv, 03039, Ukraine

Abstract
Investigation of informative representations of complex data is a rapidly developing field of research in machine learning. In this work we present a process for the production and analysis of informative low-dimensional latent representations of real-world image data with neural network models of unsupervised generative learning. A convolutional autoencoder model based on the VGG-16 architecture was used to produce low-dimensional latent representations of aerial image data, and the characteristics of the distributions of several higher-level classes of terrain types were studied. The analysis of the distributions revealed a landscape of compact concept clusters for most studied terrain types, with good separation between concept regions. The results of this work can be used in developing methods of effective learning with minimal labeled data, based on the emergent concept-sensitive structure in the latent representations.

Keywords
Artificial Intelligence, unsupervised machine learning, clustering, image recognition, image classification

1. Introduction
Classification of complex data such as images presents a significant challenge in areas with a severe deficit of labels. In such problems and applications, unsupervised machine learning methods, such as processing data with models of unsupervised generative self-learning, have been shown to be effective in identifying and selecting informative latent representations that can simplify subsequent classification and significantly reduce the labeling requirement.
The motivation of this work was to apply these methods to the practical task of aerial image recognition, where a specific set of classes combined with a strong deficit of labels makes the application of standard supervised classification methods challenging.

1.1. Related Research
Informative representations obtained with models of unsupervised generative self-learning have been used in a number of applications to identify concepts or classes of interest in the observable data. Artificial neural networks have strong potential in such problems and applications due to their capability of universal approximation [1,2], making them suitable for processing data of virtually any type and complexity, including live image data recorded in aerial surveillance.

Recognition and classification of terrain images to identify objects and structures such as roads, pathways and transport tracks are challenging tasks due to the variety of backgrounds and environments that can be encountered when obtaining data in the real world under different conditions. The task is further complicated by poor formalization of classes, inconsistent between different studies, and a severe deficit of labeled samples [3-5]. Hierarchical representations of observable data were obtained in a completely unsupervised training process with Restricted Boltzmann Machines (RBM) and Deep Belief Networks (DBN) [6,7], offering a noticeable improvement in the quality of subsequent supervised learning.

ICTERI-2021, Vol I: Main Conference, PhD Symposium, Posters and Demonstrations, September 28 – October 2, 2021, Kherson, Ukraine
EMAIL: prystavka@nau.edu.ua (A. 1); sdolgikh@nau.edu.ua (A. 2); greenhelga@gmail.com (A. 3); kozachukk@gmail.com (A. 4)
ORCID: 0000-0002-0360-2459 (A. 1); 0000-0001-5929-8954 (A. 2); 0000-0002-0681-0413 (A. 3)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
Different types, architectures and flavors of generative models have been investigated since, including autoencoder neural networks and Generative Adversarial Networks (GAN) [8,9], to name only a few in a rapidly expanding field, resulting in improved accuracy and versatility of the models with a virtually unlimited range of applications. The relation between learning and statistical thermodynamics was studied in [10], leading to an understanding of a deep connection between learning processes in artificial neural models and the principles of information theory and statistical thermodynamics.

Previous results on unsupervised representations with generative self-learning neural network models include applications of deep autoencoder models of different architectures, such as sparse, variational and convolutional, to create informative representations of image [11,12] and other types of data [13,14]. These results have demonstrated that categorization of data by common higher-level concepts in the latent representations, under certain constraints imposed in training, can be considered a general effect of information processing in such models [15]. An unsupervised structure of this type, which does not require massive amounts of labeled data to identify, can be harnessed for more effective learning in environments with a strong deficit of labels and/or in new and unknown environments where labeled data is scarce.

Given the constraints of the problem and the results discussed earlier, it was hypothesized that applying the methods and models of unsupervised generative learning to this problem may allow informative representations of image data to be obtained and reduce the requirement for labeled data to achieve successful learning.

1.2. Motivation
The motivation of this work, suggested by earlier results [5-8], was to investigate the structure that emerges in unsupervised representations of generative models trained on real-world image data, and to introduce methods for the production, evaluation and analysis of informative latent representations that can be used in the development of effective learning models with reduced requirements for labeled training data. To address the problem of a strong label deficit, methods of creating informative representations with deep neural network models of unsupervised generative learning were applied; such models are capable of learning essential patterns in the observed data in an unsupervised mode, without any labeled data.

The novelty of the proposed approach is, first, a successful application of generative models to real-world data such as aerial images, eventually allowing processing to be performed on board an autonomous vehicle; and secondly, the design and demonstration of methods for the evaluation and measurement of latent representations, including entirely unsupervised ones. The dataset of images recorded in real surveillance of terrain from an aerial vehicle was chosen for two main reasons: first, to demonstrate that the methods developed in this work can be applied to realistic complex data types; and, not least, because the tasks of aerial image classification and interpretation are becoming increasingly common in many practical applications.

2. Methods and Data
This section contains the description of the model, data and methods used to produce and analyze the distributions of the characteristic classes of data in the input image dataset.

2.1. Methodology
The models were deep convolutional autoencoder neural networks [16] based on the VGG architecture, which has achieved good results in supervised learning of image data [17]. The models were implemented in TensorFlow and Keras [18] with a number of common machine learning packages and libraries.
Generative neural network models were used, as described further in this section, to produce compressed latent representations of the aerial terrain observation images in the dataset and to evaluate the distributions of classes, identified by the type of terrain, in the latent space. Methods of geometrical analysis were applied to the samples transformed to the latent representation to identify characteristic parameters of the distributions, such as compactness and separation between the distribution regions of different classes.

2.2. Generative Model Architecture
The architecture of the models used in this work is shown in Figure 1. It can be described as a deep convolutional autoencoder neural network consisting of an encoder model with several convolutional blocks, activation and normalization layers, producing a flattened numerical representation of dimensionality 8 x 8 x 256; and a generator model with up-sampling blocks, with the resulting output layer of the same dimensionality as the input layer. The models were trained in the unsupervised mode, with unlabeled raw image data, to reduce the generative error, i.e. the mean deviation of the input images in the training dataset from their regeneration by the model:

E = mean(|G(S) − S|) → min (1)

where S is the training sample and G(S) is the output of the model on the training sample.

Figure 1: Deep convolutional autoencoder with latent representation of image data

A trained model can produce an encoded representation of an input sample X in the observable space by transforming it to the latent representation space, defined by the activations of the neurons in the latent layer of the model:

r = E(X) (2)

where E(X) is the encoding stage of the model, from the observable input to the latent representation layer (Figure 1).
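A minimal Keras sketch of this encoder–generator structure follows. The filter counts, block depth and activation choices are illustrative assumptions; the paper specifies only the 64 x 64 input size and the 8 x 8 x 256 latent dimensionality:

```python
# Illustrative sketch of a VGG-style convolutional autoencoder.
# Filter counts and block depth are assumptions; the paper fixes only
# the 64 x 64 input and the 8 x 8 x 256 latent dimensionality.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_autoencoder(input_shape=(64, 64, 3)):
    inp = layers.Input(shape=input_shape)
    x = inp
    # Encoder: convolutional blocks with normalization, 64 -> 32 -> 16 -> 8
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(2)(x)
    latent = layers.Flatten(name="latent")(x)          # 8 * 8 * 256 = 16384
    # Generator: up-sampling blocks back to the input dimensionality
    y = layers.Reshape((8, 8, 256))(latent)
    for filters in (256, 128, 64):
        y = layers.UpSampling2D(2)(y)
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    out = layers.Conv2D(input_shape[-1], 3, padding="same",
                        activation="sigmoid")(y)
    autoencoder = Model(inp, out)                      # G(S) in Eq. (1)
    encoder = Model(inp, latent)                       # r = E(X) in Eq. (2)
    # Unsupervised objective: the inputs serve as their own targets
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder
```

Unsupervised training per the objective (1) then amounts to `autoencoder.fit(images, images, ...)`, with `loss="mse"` or `loss="binary_crossentropy"` corresponding to the cost functions used in the study (Section 2.4).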
It should be noted that in a trained generative model the encoding transformation (2) is defined in a completely unsupervised process and does not require for training any samples labeled with known categories of the observable data.

2.3. Data
The dataset of images was obtained in live aerial surveillance of terrain, with preprocessing consisting of scaling to the standard size (64 x 64) and augmentation by rotation. Images in the dataset were classified in a semi-automatic process into classes representing characteristic types of terrain with significant representation in the dataset. In the rest of the study, classes or categories of images are denoted with a symbol, such as "T" for transport tracks, "W" for wooded areas and so on. The detailed composition of the dataset is described in Table 1.

Table 1: Terrain image dataset

Class             Symbol  Size   Description
Transport tracks  T       9327   Tracks on soil, field or dirt road
Narrow dirt road  N       11700  Dirt roads
Built areas       B       1600   Roads in built areas
Wooded areas      W       9766   Roads in wooded areas
Paved roads       H       955    Paved roads, highways
Footpaths         F       0.2    Footpaths, trails
Wide dirt road    D       1411
Other             O       22466  Not roads; general background

2.4. Training
The models were trained in an unsupervised process with minimization of the deviation between the training set of images and their regenerations by the model. The cost functions used for unsupervised training were Mean Squared Error (MSE) and binary cross-entropy (BCE), both showing strong improvement in the process of training (Figure 2).

Figure 2: Training error and trend, unsupervised generative self-learning

The strong reduction of the generative error in the process of unsupervised training indicated that the latent representations created by the models contained sufficient information to regenerate the observable data, because in a feedforward neural network of the type used in this study the output has to be generated entirely from the information contained in the latent representation.

3. Results
In this section we present the results of the measurement and analysis of the distributions of higher-level concepts of the original (observable) dataset in the latent representations produced by generative training of unsupervised autoencoder models, as described in the previous sections.

3.1. Overall Characteristics
The characteristics of the general representative set of samples in the latent representation, without breakdown by higher-level concepts, were as follows. The analysis of principal components [19] produced these results:

First three components: 54.4% of overall variation
First 10 components: 72.7% of overall variation
First 100 components: 98.2% of overall variation

These results indicated the possibility of strong redundancy reduction in the latent representation without significant loss of information. Based on these results, the three principal dimensions with the highest variation were used in the analysis of class distributions, which made it possible to produce direct visualizations of concept distributions in the latent representation.

3.2. Latent Concept Distributions
In this section the results of the measurement and analysis of concept distributions in the latent representation of generative models are presented. The parameters of the distributions, such as the characteristic size, standard deviation and density of the concept distribution regions in the latent representation coordinates, are given relative to the maximum dimension of the overall latent dataset and to the uniform density.
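The principal-component analysis above can be reproduced directly from the encoded latent vectors; a minimal numpy sketch (the SVD-based formulation and function names are ours, not from the paper):

```python
import numpy as np

def explained_variance_ratios(data):
    """Per-component share of overall variation, via SVD of centred data."""
    centred = data - data.mean(axis=0)
    singular = np.linalg.svd(centred, compute_uv=False)
    variance = singular ** 2
    return variance / variance.sum()

def project(data, n_components=3):
    """Coordinates along the n leading principal dimensions (for visualization)."""
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T
```

Here `explained_variance_ratios(latent)[:3].sum()` would give the share of overall variation captured by the first three components (54.4% for the latent dataset in the study), and `project(latent, 3)` yields the 3D coordinates used for direct visualization.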
Table 2: Concept distributions in the latent space (size along the three principal dimensions, STD range, and density relative to uniform; visualizations omitted)

Class                  Size                STD          Density
Narrow dirt roads      0.22, 0.41, 0.28    0.53 – 0.91  39.6
Wide dirt roads        0.28, 0.26, 0.35    0.69 – 0.9   39.2
Tracks                 0.39, 0.37, 0.16    0.73 – 1.18  43.3
Paved roads, highways  0.25, 0.26, 0.05    0.15 – 0.73  307.7
Footpaths, trails      0.36, 0.36, 0.23    1.0 – 1.15   33.5

For most concepts with significant representation in the dataset, a compact and well-defined character of latent concept distributions was observed, with the density of the concept region (i.e. the region of distribution of samples associated with the studied concept in the latent representation space) significantly higher than uniform. These results confirm the earlier observed effect of correlation between unsupervised latent distributions and higher-level concepts with strong representation in the observable dataset [14]. An in-depth analysis of the distributions will be attempted in a future study.

3.3. Unsupervised Categorization
In this section, measurements related to the categorization capacity of the models are presented and discussed. A cross-concept intersection matrix can be defined to indicate the degree of disentanglement of concept regions in the representation, as the ratio of the latent volume of the overlapping region O_a,b between concepts A and B to the volume of the concept region A, H_a:

Y_a,b = V(O_a,b) / V(H_a) (3)

Table 3 shows the highest relative volume of the intersections of concept regions for each class (visualizations omitted).

Table 3: Cross-concept overlap matrix Y, latent representation

Class  Concept intersection Y, maximum
N      0.22
B      0.03
W      0.08
H      0.11
F      0.18

As can be seen from the results in Table 3, good separation of concept regions was observed for most categories of images with significant representation in the dataset, indicating strong categorization achieved by the models in the process of unsupervised generative learning.
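The overlap matrix (3) can be estimated from the latent coordinates of the samples of each class. The sketch below uses axis-aligned bounding boxes as a simple stand-in for the concept-region volumes V(H) and V(O); this is an assumption of ours, as the paper does not specify how the region volumes were computed:

```python
import numpy as np

def bounding_box(points):
    """Axis-aligned bounding box of a concept region in latent coordinates."""
    return points.min(axis=0), points.max(axis=0)

def overlap_ratio(points_a, points_b):
    """Y_{a,b} = V(O_{a,b}) / V(H_a), Eq. (3), with box-volume estimates
    (an assumed simplification of the concept-region volumes)."""
    lo_a, hi_a = bounding_box(points_a)
    lo_b, hi_b = bounding_box(points_b)
    lo, hi = np.maximum(lo_a, lo_b), np.minimum(hi_a, hi_b)
    intersection = np.prod(np.clip(hi - lo, 0.0, None))
    return float(intersection / np.prod(hi_a - lo_a))
```

Note that Y is not symmetric: the intersection volume is divided by the volume of region A, so a small concept region nested inside a large one yields a high Y for the small region only.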
Unsupervised categorization, or decoupling of higher-level concepts in unsupervised latent representations, is an effect observed in a number of experiments with unsupervised self-learning models [6-8]; it is evident as compact and well-separated concept distributions in the latent space. Accordingly, categorized distributions tend to minimize the volume, and consequently maximize the relative density, of latent concept distributions, while minimizing the overlap between the distributions of different concepts.

3.4. Generative Ability
Models that were successful in identifying characteristic patterns or concepts in the observable data as a result of generative self-learning can be expected to be able to regenerate input samples with close resemblance to, and low deviation from, the original input samples. To evaluate the generative ability of the models, experiments were performed with samples of the concepts present in the training dataset, as well as with samples that were not classified into concepts (i.e. general background). Examples of the generative results of the models are shown in Table 4.

Table 4: Generative ability of self-learning models (original and generated samples for the "Field" and "Built area" concepts; images omitted)

A successful generative ability across multiple classes of data indicates that the distributions in the latent representation were correlated with characteristic patterns in the original (i.e. observable) data. It follows from the architecture of feed-forward artificial neural networks that all the information necessary for the generation of the output of the model must be contained in the latent representation layer, and therefore the process of unsupervised generative training was able to produce an informative representation with substantially reduced redundancy of the observable data represented in the training dataset.
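Generative ability can be quantified by applying the generative-error measure (1) per sample; a minimal sketch, where `generate` stands for a trained model's forward pass (e.g. the autoencoder's `predict`):

```python
import numpy as np

def generative_deviation(generate, samples):
    """Mean absolute deviation |G(S) - S| per sample (cf. Eq. (1));
    low values indicate close regeneration of the input.
    `generate` is a placeholder for the trained model G."""
    regenerated = generate(samples)
    # average over all axes except the batch axis
    return np.abs(regenerated - samples).mean(axis=tuple(range(1, samples.ndim)))
```

Samples regenerated with low deviation, as for the "Field" and "Built area" concepts in Table 4, indicate that the corresponding patterns are well captured in the latent representation.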
The generative ability of self-learning models such as those studied in this work can be used for the augmentation of training datasets for models of supervised learning, to improve the accuracy and extend the classification ability to classes of data underrepresented in the datasets. This will be investigated in more detail in another study.

4. Discussion
The results presented in this work are in agreement with a growing number of reports of concept-correlated latent representations emerging in unsupervised generative learning and provide strong additional arguments in support of the general character of this effect. Given the wide range of models and types of data, from lower-complexity models [14] to massive and complex architectures [11,12], in which unsupervised categorization in the latent representations of generative self-learning has been observed, this conclusion appears to be well substantiated.

It was demonstrated that, under certain constraints such as generative accuracy and information compression or redundancy reduction, low-dimensional latent representations of models learning to generate input distributions in a completely unsupervised learning process can produce a distinct structure that is correlated with the principal higher-level concepts in the observable data. The results in Sections 3.2 and 3.3 on the measurement of the latent distributions of the main concept regions appear to support this conclusion.

Identification and description of the unsupervised latent structure that emerges in generative learning can be a valuable instrument in the analysis of general data, in particular data of less known origin, where massive prior knowledge such as the large labeled datasets used in supervised machine learning may not be available. A number of recent results have indicated that similar low-dimensional representations can play an important role in the processing of sensory data by humans [20,21].
The demonstrated success of low-dimensional representations obtained with models of generative self-learning supports the conclusion about the general nature of the observed effect in learning systems of both biological and artificial origin, and suggests an intriguing possibility of a connection to bioinformatics [22], with learning systems able to learn intuitively, incrementally and with minimal prior knowledge of the environment.

5. Conclusions
Based on the results reported in this work, several essential observations can be made on the process of unsupervised generative learning with real-world image data and the conceptual structure in the latent representations of data produced by such models:
1. The models of generative unsupervised learning used in the study were capable of producing well-defined categorized representations correlated with the principal (i.e. strongly represented) higher-level concepts in the training dataset.
2. The observed latent representations showed good categorization and separation of the principal concepts and appear to support the hypothesis [5] of a correlation between the unsupervised representation structure emergent in unsupervised generative learning and higher-level concepts with significant representation in the training data.
3. Methods for measuring the categorization capacity of unsupervised generative models in the latent representations were defined and validated.
4. The models showed good generative capacity for some principal concepts in the training data. Optimization of models for generative ability will be further investigated in a future study.
5. The methods of evaluation of the categorization ability of the models are general and can be applied to different types of data and model architectures.
Overall, the observed latent representations showed good categorization and separation of the principal concepts and appear to support the hypothesis [5] of a correlation between the categorization performance and the architecture of the model.
The methods of evaluation of the latent distributions of data classes demonstrated and verified in this work are of a general nature, not limited to a specific type of data, and can be instrumental in the evaluation of the learning capacity and performance of generative models. The unsupervised latent structure demonstrated in this and other works can be used to enhance the learning ability of models in environments with a strong deficit of labels. These findings can therefore be instrumental in the development of learning models and methods that are capable of acquiring knowledge in a flexible, environment-driven process that is closer to the learning of biological systems.

References
[1] Coates A., Lee H., Ng A.Y., "An analysis of single-layer networks in unsupervised feature learning", in: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, vol. 15, pp. 215–223, 2011.
[2] Hornik K., Stinchcombe M., White H., "Multilayer feedforward networks are universal approximators", Neural Networks, vol. 2 (5), pp. 359–366, 1989.
[3] Marfil R., Molina-Tanco L., Bandera A., Rodriguez J.A., Sandoval F., "Pyramid segmentation algorithms revisited", Pattern Recognition, vol. 39 (8), pp. 1430–1451, 2006.
[4] Chyrkov A., Prystavka P., "Suspicious object search in airborne camera video stream", in: Hu Z., Petoukhov S., Dychka I., He M. (eds), Advances in Computer Science for Engineering and Education, ICCSEEA 2018, Advances in Intelligent Systems and Computing, vol. 754, Springer, Cham, pp. 340–348, 2018.
[5] Prystavka P., Cholyshkina O., Dolgikh S., Karpenko D., "Automated object recognition system based on convolutional autoencoder", in: 10th International Conference on Advanced Computer Information Technologies (ACIT-2020), Deggendorf, Germany, pp. 830–833, 2020.
[6] Fischer A., Igel C., "Training restricted Boltzmann machines: an introduction", Pattern Recognition, vol. 47, pp. 25–39, 2014.
[7] Hinton G.E., Osindero S., Teh Y.W., "A fast learning algorithm for deep belief nets", Neural Computation, vol. 18 (7), pp. 1527–1554, 2006.
[8] Kingma D.P., Welling M., "An introduction to variational autoencoders", Foundations and Trends in Machine Learning, vol. 12 (4), pp. 307–392, 2019.
[9] Creswell A., White T., Dumoulin V., Arulkumaran K., Sengupta B., Bharath A.A., "Generative adversarial networks: an overview", IEEE Signal Processing Magazine, vol. 35 (1), pp. 53–65, 2018.
[10] Ranzato M.A., Boureau Y.-L., Chopra S., LeCun Y., "A unified energy-based framework for unsupervised learning", in: 11th International Conference on Artificial Intelligence and Statistics (AISTATS), San Juan, Puerto Rico, vol. 2, pp. 371–379, 2007.
[11] Le Q.V., Ranzato M.A., Monga R., et al., "Building high-level features using large scale unsupervised learning", arXiv:1112.6209 [cs.LG], 2012.
[12] Higgins I., Matthey L., Glorot X., Pal A., et al., "Early visual concept learning with unsupervised deep learning", arXiv:1606.05579 [cs.LG], 2016.
[13] Shi J., Xu J., Yao Y., Xu B., "Concept learning through deep reinforcement learning with memory-augmented neural networks", Neural Networks, vol. 110, pp. 47–54, 2019.
[14] Dolgikh S., "Categorized representations and general learning", in: Proceedings of the 10th International Conference on Theory and Application of Soft Computing, Computing with Words and Perceptions, vol. 1095, pp. 93–100, 2019.
[15] Tishby N., Pereira F.C., Bialek W., "The information bottleneck method", arXiv:physics/0004057, 2000.
[16] Kavukcuoglu K., Sermanet P., Boureau Y.-L., Gregor K., Mathieu M., LeCun Y., "Learning convolutional feature hierarchies for visual recognition", in: Proceedings of the 23rd International Conference on Neural Information Processing Systems, vol. 1, pp. 1090–1098, Vancouver, Canada, 2010.
[17] Simonyan K., Zisserman A., "Very deep convolutional networks for large-scale image recognition", arXiv:1409.1556 [cs.LG], 2014.
[18] Keras: the Python deep learning library. URL: https://keras.io.
[19] Jolliffe I.T., Principal Component Analysis, Springer Series in Statistics, 2nd ed., Springer, New York, 2002.
[20] Yoshida T., Ohki K., "Natural images are reliably represented by sparse and variable populations of neurons in visual cortex", Nature Communications, vol. 11, p. 872, 2020.
[21] Bao X., Gjorgieva E., Shanahan L.K., Howard J.D., Kahnt T., Gottfried J.A., "Grid-like neural representations support olfactory navigation of a two-dimensional odor space", Neuron, vol. 102 (5), pp. 1066–1075, 2019.
[22] Hassabis D., Kumaran D., Summerfield C., Botvinick M., "Neuroscience-inspired artificial intelligence", Neuron, vol. 95, pp. 245–258, 2017.