Towards Latent Space Exploration for Classifier Improvement

Paulo Fernandes, João Correia and Penousal Machado 1

Abstract. We propose a framework that combines Generative Adversarial Networks and Evolutionary Computation to perform Data Augmentation on small datasets in order to improve the performance of image classifiers trained via supervised learning. In this work, we attest the viability and potential of this framework for real-world problems. The framework is composed of a generator module that uses Generative Adversarial Networks to generate samples from a dataset and employs an Evolutionary Computation approach to evolve sets of images from the latent space. The fitness function is based on the dissimilarity of the subsets generated by the Generative Adversarial Network. A Supervisor module handles the generated samples and chooses which set should be added to the training dataset. To test the framework, we explore the Human Sperm Head Morphology dataset, a bio-medicine multi-class problem whose small number of samples poses a challenge to the different supervised classification approaches. We deploy the framework to create an augmented dataset, train a classifier on it and compute its performance on the test set, comparing it against classifiers trained on the base dataset without the generated samples. Overall, in these preliminary tests, we improve the performance of the classifiers by up to 4% and on average by 1%, showing the viability and potential of our approach.

1 INTRODUCTION

With the evolution of technology and computer capabilities, Machine Learning has seen significant improvements in recent years. It has become much easier to build and apply larger neural networks, such as Deep Neural Networks, to solve real-world problems. It has also become possible to build Deep Generative Models, which produce synthetic data by learning from already existing data.

While the availability of data has accompanied this evolution of technology, there are still many problems that lack enough data for Machine Learning algorithms to be viable solutions. The performance of a Machine Learning algorithm depends not only on the capability of the model but also on the quality of the dataset used to train it, which means that training a model with a bad dataset will most likely lead to poor results. As such, improving the quality of datasets through Data Augmentation may be a way to improve the performance of the algorithm. With this in mind, is it possible to use generative models to enhance the quality of existing datasets and, consequently, the quality of Machine Learning algorithms?

One of the ways to train generative models is by using Generative Adversarial Networks. These frameworks are most often used to produce realistic images that follow the distribution of the training dataset [5]. In general, they work by putting a generator and a discriminator against each other in a min-max game: the discriminator is trained to distinguish images of the original dataset from images created by the generator, while the generator learns from the feedback given by the discriminator on the generated data. In this work, we use Generative Adversarial Networks to generate sets of synthetic images in order to understand whether the addition of these instances to the training set of a classification model is able to improve its performance.

Furthermore, it is also essential to address the generation of new samples. Even though a capable model is important for the generation of better images, there is another variable that impacts the generated instances: the latent space. It is unique to each generative model and hides underlying patterns in itself. Usually, to generate an image, a vector from the generative model's latent space is chosen at random as input, which means that there is no knowledge about the output. Since the generated images depend on the input given to the generator, the exploration of the latent space may reveal ways to control the output through the selection of input vectors by certain criteria. This way, we can also assure the quality of the images that will be added to the dataset and their relevance to the problem at hand. For instance, we might want to ensure that we are not adding redundant samples to the training set. Performing random Data Augmentation therefore has a higher chance of undermining the performance of the algorithm, which means that, ideally, we should prefer a supervised generation of instances. Bearing this in mind, we chose to perform this supervision by exploring the latent space using Evolutionary Computation: using a Genetic Algorithm, we evolve sets of latent vectors that optimize a specific criterion, such as the diversity of the images in the set. This framework for latent space exploration was explored in [4]. Here, we explore the usage of such a framework to generate new samples for the training dataset that may improve the performance of classifiers. As a proof of concept for real-world problems, we instantiate the approach on the Human Sperm Head Morphology dataset (HuSHeM) [8], a multiclass problem categorized as small data, which provides a challenge through its lack of samples.

The remainder of the paper is organized as follows: Section 2 reviews related work; Section 3 explains our approach to the problem and the framework used to solve it; Section 4 describes the experimental setup and analyses and discusses the results; Section 5 draws overall conclusions.

1 CISUC, Department of Informatics Engineering, University of Coimbra, Portugal, email: pcastillo,jncor,machado@dei.uc.pt
Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
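For reference, the min-max game described above corresponds to the value function of Goodfellow et al. [5]; the paper does not restate it, so we reproduce the standard formulation here:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

where $D(x)$ is the discriminator's estimate that the image $x$ is real and $G(z)$ is the image produced by the generator from the latent vector $z$.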
2 RELATED WORK

Generative Adversarial Networks are generative models trained through a face-off between a generator and a discriminator, mostly used to train a generator that can produce realistic images. The generator is given a noise vector to produce new images, usually a high-dimensional vector randomly sampled from a distribution, for example a Gaussian distribution, called the prior. The corresponding high-dimensional space is called the latent space. Some work has already been done to explore the latent space of generative models, and not only with Generative Adversarial Networks, the framework that we use in this work. For instance, latent space exploration has been performed in Kernel Principal Component Analysis models [11], showing navigation through image features and novelty detection, but also in Variational Auto-Encoders, for example by mapping genes into a lower-dimensional space to uncover underlying gene expression features in cases of tumour or cancer [10].

Evolutionary Computation has also been used in several works to evolve images, for instance to evolve master print templates [7] that, like a master key, are able to match multiple fingerprints. In that work, Roy et al. compared four different Evolutionary Algorithms, namely Hill-Climbing, Covariance Matrix Adaptation Evolution Strategy, Differential Evolution and Particle Swarm Optimization, to evolve Synthetic MasterPrints according to the metric they propose, the Modified Marginal Success Rate. The samples were generated from two datasets, Authentec AES3400 and FVC 2002 DB1-A. Beyond these, and with a two-stage workflow similar to the one implemented in this paper (first the unsupervised training of Generative Adversarial Networks, and second the evolution of the latent space), two further works can be mentioned: one implements Interactive Evolutionary Computation [2] for image generation, and another uses Generative Adversarial Networks and latent space evolution to learn and improve Mario levels [9] using Covariance Matrix Adaptation Evolution Strategy. Finally, there is a recent approach to generative models inspired by Generative Adversarial Networks, Generative Latent Optimization [1]. This method replaces the adversarial discriminator with simple reconstruction losses, where the focus is to evolve the latent space so as to match one learnable noise vector to each of the images in the training dataset.
3 FRAMEWORK

In this paper, we propose a framework that combines Generative Adversarial Networks and Evolutionary Computation to perform Data Augmentation on small datasets in order to improve the performance of image classifiers trained via supervised learning.

The framework has 3 fundamental pieces: (i) a classifier, responsible for the classification task, discriminating images into classes; (ii) a generator, responsible for generating new images from latent space vectors; (iii) a supervisor, responsible for managing the generation of images through the exploration of the latent space (as in [4]).

3.1 Classifier

The performance of the classifier measures the performance of this framework. Therefore, we compare the performance of the classifier after the baseline training, using only the original dataset, against its performance after the supervised augmented training, with selective addition of synthetic images to the training set. This way, we are able to assess the quality of our approach and guide the progress of the research.

3.2 Generator

The generator is obtained through a Deep Convolutional Generative Adversarial Network. These networks make use of deep convolutional layers that better explore spatial correlation in images [6], which helps to generate images of better quality. The training is unsupervised, meaning that no information is given to the model to guide the generation of images; training progresses purely through the differentiation between real (original) and fake (generated) images [5]. A generator is trained for each class of the problem to specifically control the images generated for each one: each generator learns to produce images following the distribution of a single class. During training, the generated images are created from random vectors following a Gaussian distribution.
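For illustration purposes, the following is a minimal sketch of this per-class adversarial training. PyTorch is our assumption, since the paper does not name an implementation; the architecture is illustrative (reduced to 32x32 grayscale, whereas the paper works at 132x132), and the optimizer settings follow Table 2:

```python
import torch
import torch.nn as nn

LATENT_DIM = 100  # latent dimension, as in Table 2

# Illustrative DCGAN blocks at 32x32x1; not the authors' off-the-shelf model.
generator = nn.Sequential(
    nn.ConvTranspose2d(LATENT_DIM, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),  # 4x4
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),           # 8x8
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),            # 16x16
    nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Tanh(),                                 # 32x32
)
discriminator = nn.Sequential(
    nn.Conv2d(1, 32, 4, 2, 1), nn.LeakyReLU(0.2),    # 16x16
    nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),   # 8x8
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),  # 4x4
    nn.Conv2d(128, 1, 4, 1, 0), nn.Flatten(),        # one logit per image
)

bce = nn.BCEWithLogitsLoss()  # Binary Cross-Entropy, as in Table 2
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(real_batch):
    """One adversarial update on a batch of real images of a single class."""
    n = real_batch.size(0)
    z = torch.randn(n, LATENT_DIM, 1, 1)  # latent vectors sampled from N(0, 1)
    fake_batch = generator(z)

    # Discriminator step: push real images towards 1 and fakes towards 0.
    opt_d.zero_grad()
    loss_d = bce(discriminator(real_batch), torch.ones(n, 1)) + \
             bce(discriminator(fake_batch.detach()), torch.zeros(n, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: fool the discriminator into outputting 1 for fakes.
    opt_g.zero_grad()
    loss_g = bce(discriminator(fake_batch), torch.ones(n, 1))
    loss_g.backward()
    opt_g.step()
```

One such pair of networks would be trained per class, each seeing only the 40 training images of its class.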
3.3 Supervisor

Lastly, the Supervisor is crucial to the optimization of the training dataset. By adding random images with no criteria, there is no way to ensure that these are relevant to the solution of the problem, and the introduction of flawed and redundant images might end up undermining the performance of the classification algorithm [3]. One way to control the output of the generators is by controlling their latent space: selecting generated images through certain criteria allows for the optimization of the classifier. More specifically, we look for sets of images that are as diverse as possible, so as to minimize redundancy.

The exploration of the latent spaces is performed through Evolutionary Computation, more specifically using a Genetic Algorithm. Each individual in the algorithm represents a set of images, and its genetic code corresponds to the latent vectors of that set of images. The initial population is created through random sampling from the same Gaussian distribution used in the generator training. At each iteration, new populations are created using Tournament Selection, Uniform Crossover and Random Reset Mutation (which also draws from the same Gaussian distribution to obtain the values of the new genes). The fitness function averages the similarities between each image in the set and the centroid image of the set that joins the images from that individual with the images from the original dataset. The similarities between the images and the centroid are calculated using Normalized Cross-Correlation. Since we are searching for diverse datasets, the objective is to minimize this target function. In the end, we should find a set of images that comes closest to the intended objective and better tackles the issues at hand.

4 EXPERIMENTATION

In order to evaluate our approach, we performed several tests. The conditions under which these tests were carried out are presented in this section.

4.1 Dataset

In order to test our hypothesis, we use the Human Sperm Head Morphology dataset (HuSHeM) [8]. In the bio-medicine context, sperm morphology analysis is a key factor in the diagnosis of male infertility. The dataset is divided into 4 classes of sperm head images [Figure 1]: Normal (54 instances), Tapered (53 instances), Pyriform (57 instances) and Amorphous (52 instances), for a total of 216 images. A small dataset like this one is an opportunity to explore Data Augmentation approaches. The dataset has no pre-defined sub-division, so we decided to use 40 instances of each class for training and cross-validation, leaving the remaining images for testing.

Each image has original dimensions of 131x131x3, but in the experiments we work with dimensions of 132x132x1.

Figure 1. Class example images from the original dataset. From left to right, top to bottom: Normal, Tapered, Pyriform and Amorphous.
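A sketch of this split and preprocessing is shown below. The on-disk layout (one folder per class), the file extension and the scaling to [-1, 1] (typical for tanh-output DCGANs) are all our assumptions; the paper does not describe its data pipeline:

```python
import numpy as np
from pathlib import Path
from PIL import Image

CLASSES = ["Normal", "Tapered", "Pyriform", "Amorphous"]
TRAIN_PER_CLASS = 40  # per the paper: 40 instances of each class for training/CV

def load_class(root, name, rng):
    # Hypothetical layout: <root>/<class name>/*.png
    paths = sorted((Path(root) / name).glob("*.png"))
    images = []
    for p in paths:
        img = Image.open(p).convert("L").resize((132, 132))  # 131x131x3 -> 132x132x1
        images.append(np.asarray(img, dtype=np.float32) / 127.5 - 1.0)  # scale to [-1, 1]
    images = np.stack(images)
    rng.shuffle(images)  # shuffle along the sample axis before splitting
    return images[:TRAIN_PER_CLASS], images[TRAIN_PER_CLASS:]  # train, test

rng = np.random.default_rng(seed=0)
train, test = {}, {}
for c in CLASSES:
    train[c], test[c] = load_class("HuSHeM", c, rng)
```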
4.2 Classifier

The classifier module allows the assessment of the experimental results. The model used was an off-the-shelf model that only required training, since the optimization of the model was not one of the objectives of this work. The parameters used for training the classifier are presented in Table 1. The number of epochs chosen ensures that the classifier reaches a plateau on the original dataset, where there is no further gain in performance; this helps to verify the quality of our solution. On another note, the training also included cross-validation for every test, i.e. both for the tests with the original dataset and for those with the augmented datasets.

Table 1. Classifier parameters

Parameter         Setting
optimizer         Adam
beta1             0.5
beta2             0.999
learn rate        0.0002
epochs            250
batch size        32
loss function     Binary Cross-Entropy
cross-validation  Stratified
folds             5
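The stratified 5-fold protocol of Table 1 can be reproduced with scikit-learn. In the sketch below, train_classifier and evaluate are hypothetical stand-ins for the unspecified off-the-shelf model and its scoring:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, train_classifier, evaluate, seed=0):
    """Stratified 5-fold cross-validation, as configured in Table 1."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = train_classifier(X[train_idx], y[train_idx])  # 250 epochs, Adam, BCE
        scores.append(evaluate(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```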
4.3 Generator

The generators allow the creation of new samples to perform Data Augmentation. A generator is obtained through the unsupervised training of a Deep Convolutional Generative Adversarial Network. The model used for the generator, like the classifier, is an off-the-shelf model, to which we only added 2 extra convolutional layers; as for the discriminator, the model used was the same as the classifier. The training is performed on each individual class, which means that, in this particular case, we need 4 different generators to produce samples for each class [Figure 2]. Each generator was trained using the 40 corresponding class samples in the training set. The training parameters were set as shown in Table 2.

Table 2. Deep Convolutional Generative Adversarial Network parameters

Parameter           Setting
latent dimension    100
optimizer           Adam
beta1               0.5
beta2               0.999
learn rate          0.0002
epochs              10000
batch size          32
loss function       Binary Cross-Entropy
noise distribution  N(0,1)

Figure 2. Class examples of random images produced by the generators. On each row, from top to bottom: Normal, Tapered, Pyriform, Amorphous.

4.4 Supervisor

The core step of this work is the supervision of the generation of samples. This is what allows the optimization of the Data Augmentation process and ensures the best possible results. For this, we decided to use Evolutionary Computation, namely a Genetic Algorithm, to explore the latent space of the generators with the intent of finding sets of images that optimize a certain criterion; in this case, specifically, we are looking to maximize the diversity of the dataset. The supervision is performed for a single generator, or single class, which means that in this problem we use 4 supervisors to evolve 4 different sets of images that will be added to the original set. The parameters used in the Genetic Algorithm are defined in Table 3.

Table 3. Genetic Algorithm parameters

Parameter               Setting
Population size         20
Number of generations   500
Genotype length         number of images × latent dimension
Elite size              1
Selection method        tournament
Tournament size         3
Crossover operator      uniform crossover
Crossover rate          0.7
Mutation operator       random reset mutation
Mutation distribution   N(0,1)
Mutation rate per gene  0.02

Each individual represents a set of images that are coded into its genetic code in the form of latent vectors. The fitness of each individual is calculated by averaging the similarities between each image in the set and the centroid image of the set that joins the images from that individual with the images from the original dataset. The similarity is calculated using the Normalized Cross-Correlation metric. The calculations are as follows:

$$T = I \cup O \qquad (1)$$

$$C = \frac{\sum_{t \in T} t}{\operatorname{length}(T)} \qquad (2)$$

$$F = \frac{\sum_{i \in I} \operatorname{NCC}(i, C)}{\operatorname{length}(I)} \qquad (3)$$

$T$ is the set resulting from the concatenation of the images from the individual ($I$) with the images from the original dataset ($O$). $C$ is the centroid of the set $T$. $F$ is the fitness of the individual, calculated by averaging the similarities measured using Normalized Cross-Correlation (NCC). The calculation of the similarity metric is as follows:

$$\operatorname{NCC}(A, B) = \frac{\sum \left(A \circ B\right)}{\sqrt{\sum \left(A \circ A\right) \times \sum \left(B \circ B\right)}} \qquad (4)$$

The operator $\circ$ corresponds to the Hadamard product between two images.

On a last note, since the fitness function measures similarity instead of diversity, the objective of the algorithm is set to minimization. In the end, we should end up with a set of images that is more diverse than one picked at random.
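A minimal NumPy sketch of the supervision step, combining Equations (1)-(4) with the operators and parameters of Table 3, is given below. This is a simplified reading of the method, not the authors' code; in particular, generator is a hypothetical callable standing in for a trained per-class DCGAN (mapping a matrix of latent vectors to flattened images), and original is the matrix of flattened original images of that class:

```python
import numpy as np

rng = np.random.default_rng(0)
POP, GENS, N_IMG, LATENT = 20, 500, 40, 100   # Table 3 settings; 40 images per set

def ncc(a, b):
    """Normalized Cross-Correlation between two flattened images (Equation 4)."""
    return np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b))

def fitness(latents, generator, original):
    """Equations (1)-(3): decode the latent vectors into images, then average
    the NCC of each generated image to the centroid of generated + original."""
    imgs = generator(latents)                  # hypothetical decoder -> (N_IMG, H*W)
    T = np.concatenate([imgs, original])       # (1) T = I union O
    C = T.mean(axis=0)                         # (2) centroid of T
    return np.mean([ncc(i, C) for i in imgs])  # (3) minimised to promote diversity

def tournament(pop, fits, k=3):
    """Tournament selection of size 3; lower (more diverse) fitness wins."""
    idx = rng.choice(len(pop), size=k, replace=False)
    return pop[idx[np.argmin(fits[idx])]]

def evolve(generator, original):
    """One supervisor run: evolve a set of N_IMG latent vectors for one class."""
    pop = rng.standard_normal((POP, N_IMG * LATENT))  # genotypes sampled from N(0,1)
    for _ in range(GENS):
        fits = np.array([fitness(g.reshape(N_IMG, LATENT), generator, original)
                         for g in pop])
        children = [pop[np.argmin(fits)].copy()]      # elite of size 1
        while len(children) < POP:
            a, b = tournament(pop, fits), tournament(pop, fits)
            child = np.where(rng.random(a.shape) < 0.5, a, b) \
                if rng.random() < 0.7 else a.copy()   # uniform crossover, rate 0.7
            reset = rng.random(child.shape) < 0.02    # per-gene mutation rate
            child[reset] = rng.standard_normal(reset.sum())  # random reset from N(0,1)
            children.append(child)
        pop = np.stack(children)
    fits = np.array([fitness(g.reshape(N_IMG, LATENT), generator, original)
                     for g in pop])
    return pop[np.argmin(fits)].reshape(N_IMG, LATENT)  # best (most diverse) set
```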
4.5 Experimental Results

In order to test our framework, we compared the performance of the classifiers before and after performing Data Augmentation. The evaluation of each classifier in the cross-validation was performed on the test dataset, where several metrics were measured, namely Accuracy, Precision, Recall, F1 score, Area Under the Receiver Operating Characteristic Curve (AUROC) and Average Precision. Each test was performed 5 times with different seeds, and in the following results we present the mean across these 5 repetitions. Note that the initialization of the weights is the same between Model-X and Seed-X (e.g. Model-0 and Seed-0); Seed-X differs in the seed used to generate the augmented dataset and, of course, in the existence of augmented instances.

The first tests were performed with the original dataset. Looking at Figure 3, we can observe that at epoch 250 the training has already reached a plateau in terms of accuracy, which means that it would most probably not benefit from further training, since it would tend to overfit. We were also able to establish the performance that our solution should overcome [Table 4].

Figure 3. Average learning curve of the trainings with the original dataset across all repetitions.

The next step was building the sets of images to be added to the original dataset and training with the augmented dataset. For this experiment, we decided to test a dataset composed of 50% original images and 50% synthetic images. As such, for each class we generated and evolved sets of 40 images. The selection of these sets was repeated for every repetition of the test with the classifier, which was performed 5 times with different seeds, similarly to the baseline test.

First, by analysing the line of evolution in Figure 4, we can see that the fitness values of the best individuals do not vary greatly between the first and the last generations: all values, for every class, sit in an interval of 0.01, between 0.99 and 1. This means that the similarity between the images in this dataset, for this metric, is really high, which does not promote a good evolution. However, if we take a look at Figure 5, which puts side by side the best individual of the first generation (top) and the best individual of the last generation (bottom) in the evolution of a set of the class "Amorphous", we can argue that the latter has in fact, from a subjective perspective, more visual diversity than the former.

Figure 4. Average fitness of the best individual during the evolution process across all repetitions.

Figure 5. Best individuals at the end of generations 0 and 500, top and bottom respectively, from the process of evolution of individuals of the class "Amorphous".

The last step is the training of the classifiers with the augmented datasets. By analysing the training curve in Figure 6, we can see that at epoch 250 the training also reaches a plateau accuracy-wise, meaning that further training would not improve the performance.

Figure 6. Average learning curve of the training with the augmented dataset across all repetitions.

Looking at the test results [Table 4], we can see that the performance of the classifiers trained with the augmented dataset was, on average, better for every metric. Although more tests would be necessary to verify the benefit of our solution, this shows that our approach might indeed be a way to improve datasets and, consequently, the performance of classifiers. Each Seed-X represents a classifier that was trained with a subset from the Evolutionary Computation process using a different random generator seed. We can observe that for 4 out of 5 seeds we improve beyond the average performance of the original baseline classifier, and one seed improves up to 4% over the baseline average. One of the seeds hindered the performance of the classifier, but on average we obtain improvements on all the metrics when compared with the models trained with different initialization weights.

Table 4. The classifier test results for each metric. Each model trained on the original dataset is denoted as Model-X and each model trained with the augmented dataset from the Evolutionary process is denoted as Seed-X. Original is the average of the 5 models trained and Augmented is the average of the 5 seeds.

Data       Acc    Prec   Rec    F1     AUROC  Avg-Prec
Original   0.578  0.623  0.580  0.582  0.719  0.487
Model-0    0.593  0.610  0.598  0.592  0.731  0.508
Model-1    0.575  0.648  0.580  0.588  0.719  0.497
Model-2    0.571  0.599  0.569  0.563  0.712  0.469
Model-3    0.571  0.621  0.575  0.581  0.715  0.472
Model-4    0.579  0.641  0.580  0.587  0.718  0.486
Augmented  0.586  0.633  0.590  0.594  0.725  0.496
Seed-0     0.589  0.645  0.587  0.594  0.724  0.498
Seed-1     0.589  0.632  0.596  0.596  0.729  0.495
Seed-2     0.539  0.586  0.534  0.537  0.689  0.446
Seed-3     0.596  0.642  0.604  0.609  0.734  0.508
Seed-4     0.618  0.663  0.629  0.633  0.750  0.536
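For reference, the metrics reported in Table 4 correspond to standard scikit-learn scorers. The sketch below is our assumed reading of that computation; the paper does not state its averaging mode, so macro-averaging over the 4 classes is an assumption:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)
from sklearn.preprocessing import label_binarize

def test_metrics(y_true, y_prob, classes=(0, 1, 2, 3)):
    """Accuracy, Precision, Recall, F1, AUROC and Average Precision on the
    test set; y_prob holds per-class probabilities with shape (n, 4)."""
    y_pred = np.argmax(y_prob, axis=1)
    y_bin = label_binarize(y_true, classes=list(classes))  # one-hot ground truth
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Prec": precision_score(y_true, y_pred, average="macro"),
        "Rec": recall_score(y_true, y_pred, average="macro"),
        "F1": f1_score(y_true, y_pred, average="macro"),
        "AUROC": roc_auc_score(y_bin, y_prob, average="macro"),
        "Avg-Prec": average_precision_score(y_bin, y_prob, average="macro"),
    }
```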
5 CONCLUSION

We explored an approach that uses Generative Adversarial Networks for Data Augmentation to improve the performance of a supervised classifier applied to a real-world problem. The underlying idea is to explore the latent space of the generative model using Evolutionary Computation to generate sets of instances to be added to the training dataset of a supervised classifier. Since arbitrarily adding instances to the dataset could hinder the performance of the classifier being trained, we rely on a Supervisor module that selects the best set based on different criteria.

We instantiate this framework on a real-world application problem, the Human Sperm Head Morphology dataset, as a proof of concept. Due to its small number of instances, we can categorize it as a small data dataset, which presents an opportunity to explore Data Augmentation approaches. We created a baseline classifier with the provided data for comparison with the classifiers created by our framework. We used a Genetic Algorithm to evolve sets of latent space vectors that generate sets of images, and used the Normalized Cross-Correlation similarity metric to calculate the dissimilarity among the sets, assigning the average value as the fitness of each one. Overall, we were able to guide evolution and generate dissimilar subsets. The best subset from the last population of the evolutionary algorithm was used to augment the training dataset of the classifier, which was then trained with the synthetic instances together with a base subset of instances. We used cross-validation to compute the performance metrics. Overall, the results show that we can increase the performance of the classifier; for example, we were able to raise accuracy by 0.8% and the F1 score by 1.2%. Although more tests are needed to verify this conclusion and to improve the quality of the solution, it is a first step and a proof of concept of the potential of such an approach.

Future work may include testing different proportions between original and generated images in the augmented datasets, and even testing training sets composed of generated images only. We may also test different datasets, use different similarity metrics, or improve the supervision algorithm with the inclusion of other techniques. Finally, we may also compare this approach with other Data Augmentation approaches.

REFERENCES

[1] Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks, 2017.
[2] Philip Bontrager, Wending Lin, Julian Togelius, and Sebastian Risi, 'Deep interactive evolution', in Computational Intelligence in Music, Sound, Art and Design, eds., Antonios Liapis, Juan Jesús Romero Cardalda, and Anikó Ekárt, pp. 267–282, Cham, (2018). Springer International Publishing.
[3] João Correia, Evolutionary Computation for Classifier Assessment and Improvement, Ph.D. dissertation, University of Coimbra, 2018.
[4] Paulo Fernandes, João Correia, and Penousal Machado, 'Evolutionary latent space exploration of generative adversarial networks', in EvoApps '20: Proceedings of the 23rd European Conference on the Applications of Evolutionary and Bio-inspired Computation - Evolutionary Machine Learning, to appear, (2020).
[5] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio, 'Generative adversarial nets', in NIPS, pp. 2672–2680, (2014).
[6] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2015.
[7] A. Roy, N. Memon, J. Togelius, and A. Ross, 'Evolutionary methods for generating synthetic masterprint templates: Dictionary attack in fingerprint recognition', in 2018 International Conference on Biometrics (ICB), pp. 39–46, (Feb 2018).
[8] Fariba Shaker. Human sperm head morphology dataset (HuSHeM), 2018.
[9] Vanessa Volz, Jacob Schrum, Jialin Liu, Simon M. Lucas, Adam Smith, and Sebastian Risi. Evolving Mario levels in the latent space of a deep convolutional generative adversarial network, 2018.
[10] Gregory P. Way and Casey S. Greene, 'Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders', pp. 80–91, 2018.
[11] D. Winant, Joachim Schreurs, and J. Suykens, 'Latent space exploration using generative kernel PCA', in Proc. of the 28th Belgian Dutch Conference on Machine Learning (Benelearn 2019). BNAIC/Benelearn, (2019).