Towards Latent Space Exploration for Classifier Improvement

Paulo Fernandes, João Correia and Penousal Machado 1

Abstract. We propose a framework that combines Generative Adversarial Networks and Evolutionary Computation to perform Data Augmentation on small datasets in order to improve the performance of image classifiers trained via supervised learning. In this work, we attest the viability and potential of this framework for real-world problems. The framework is composed of a generator module that uses Generative Adversarial Networks to generate samples from a dataset and employs an Evolutionary Computation approach to evolve sets of images from the latent space. The fitness function is based on the dissimilarity of the subsets generated by the Generative Adversarial Network. A Supervisor module handles the generated samples and chooses which set should be added to the training dataset. To test the framework, we explore the Human Sperm Head Morphology dataset, a bio-medicine multi-class problem whose small number of samples poses a challenge to the different supervised classification approaches. We deploy the framework to create an augmented dataset, train a classifier on it and compute its performance on the test set, comparing it against classifiers trained on the base dataset without the generated samples. Overall, in these preliminary tests, we improve the performance of the classifiers by up to 4% and on average by 1%, showing the viability and potential of our approach.

1 INTRODUCTION

With the evolution of technology and computer capabilities, Machine Learning has seen significant improvements in recent years. It has become much easier to build and apply larger neural networks, such as Deep Neural Networks, to solve real-world problems. It has also become possible to build Deep Generative Models, which produce synthetic data by learning from already existing data.

While the availability of data has accompanied this evolution of technology, there are still many problems that lack enough data for Machine Learning algorithms to be viable solutions. The performance of a Machine Learning algorithm depends not only on the capability of the model but also on the quality of the dataset used to train it, which means that training a model with a bad dataset will most likely lead to poor results. As such, improving the quality of datasets through Data Augmentation may be a way to improve the performance of the algorithm. With this in mind, is it possible to use generative models to enhance the quality of existing datasets and, consequently, the quality of Machine Learning algorithms?

One of the ways to train generative models is by using Generative Adversarial Networks. These frameworks are most often used to produce realistic images that follow the distribution of the training dataset [5]. In general, they work by putting a generator and a discriminator against each other in a min-max game: the discriminator is trained to distinguish images of the original dataset from images created by the generator, while the generator learns from the feedback given by the discriminator on the generated data. In this work, we use Generative Adversarial Networks to generate sets of synthetic images in order to understand whether the addition of these instances to the training set of a classification model is able to improve its performance.

Furthermore, it is also essential to address the generation of new samples. Even though a capable model is important for the generation of better images, there is another variable that impacts the generated instances: the latent space. It is unique to each generative model and hides underlying patterns in itself. Usually, to generate an image, a vector from the generative model's latent space is chosen at random as input, which means that there is no knowledge about the output. Since the generated images depend on the input given to the generator, the exploration of the latent space may reveal ways to control the output through the selection of input vectors by certain criteria. This way, we can also assure the quality of the images that will be added to the dataset and their relevance to the problem at hand. For instance, we might want to ensure that we are not adding redundant samples to the training set. Performing random Data Augmentation therefore has a higher chance of undermining the performance of the algorithm, which means that, ideally, we should prefer a supervised generation of instances. Bearing this in mind, we chose to perform this supervision by exploring the latent space using Evolutionary Computation: using a Genetic Algorithm, we evolve sets of latent vectors that optimize a specific criterion, such as the diversity of the images in the set. This framework for latent space exploration was explored in [4]. Here, we explore the usage of such a framework to generate new samples for the training dataset that may improve the performance of classifiers. As a proof of concept for real-world problems, we instantiate the approach on the Human Sperm Head Morphology dataset (HuSHeM) [8], a multiclass problem categorized as small data, which provides a challenge through its lack of samples.

The remainder of the paper is organized as follows: Section 2 reviews related work; Section 3 explains our approach to the problem and the framework used to solve it; Section 4 describes the experimental setup and analyses and discusses the results; Section 5 draws overall conclusions.

1 CISUC, Department of Informatics Engineering, University of Coimbra, Portugal, email: pcastillo,jncor,machado@dei.uc.pt
Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
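For reference, the min-max game described above corresponds to the value function of Goodfellow et al. [5]; the paper does not restate it, so we reproduce the standard formulation here:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

where $D(x)$ is the discriminator's estimate that the image $x$ is real and $G(z)$ is the image produced by the generator from the latent vector $z$.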
2 RELATED WORK

Generative Adversarial Networks are generative models trained through a face-off between a generator and a discriminator, mostly used to train a generator that can produce realistic images. The generator is given a noise vector to produce new images, usually a high-dimensional vector randomly sampled from a distribution, for example a Gaussian distribution, called the prior. The corresponding high-dimensional space is called the latent space. Some work has already been done to explore the latent space of generative models, and not only with Generative Adversarial Networks, the framework that we use in this work. For instance, latent space exploration has been performed in Kernel Principal Component Analysis models [11], showing navigation through image features and novelty detection, but also in Variational Auto-Encoders, for example by mapping genes into a lower-dimensional space to uncover underlying gene expression features in cases of tumour or cancer [10].

Evolutionary Computation has also been used in several works to evolve images, for instance to evolve master print templates [7] that, like a master key, are able to match multiple fingerprints. In that work, Roy et al. compared four different Evolutionary Algorithms, namely Hill-Climbing, Covariance Matrix Adaptation Evolution Strategy, Differential Evolution and Particle Swarm Optimization, to evolve Synthetic MasterPrints according to the metric they propose, the Modified Marginal Success Rate. The samples were generated from two datasets, Authentec AES3400 and FVC 2002 DB1-A. Beyond these, and with a two-stage workflow similar to the one implemented in this paper (first the unsupervised training of Generative Adversarial Networks, and second the evolution of the latent space), two further works can be mentioned: one implements Interactive Evolutionary Computation [2] for image generation, and another uses Generative Adversarial Networks and latent space evolution to learn and improve Mario levels [9] using Covariance Matrix Adaptation Evolution Strategy. Finally, there is a recent approach to generative models inspired by Generative Adversarial Networks, Generative Latent Optimization [1]. This method replaces the adversarial discriminator with simple reconstruction losses, where the focus is to evolve the latent space so as to match one learnable noise vector to each of the images in the training dataset.
3 FRAMEWORK

In this paper, we propose a framework that combines Generative Adversarial Networks and Evolutionary Computation to perform Data Augmentation on small datasets in order to improve the performance of image classifiers trained via supervised learning.

The framework has 3 fundamental pieces: (i) a classifier, responsible for the classification task, discriminating images into classes; (ii) a generator, responsible for generating new images from latent space vectors; (iii) a supervisor, responsible for managing the generation of images through the exploration of the latent space (as in [4]).

3.1 Classifier

The performance of the classifier measures the performance of this framework. Therefore, we compare the performance of the classifier after the baseline training, using only the original dataset, against its performance after the supervised augmented training, with selective addition of synthetic images to the training set. This way, we are able to assess the quality of our approach and guide the progress of the research.

3.2 Generator

The generator is obtained through a Deep Convolutional Generative Adversarial Network. These networks make use of deep convolutional layers that better explore spatial correlation in images [6], which helps to generate images of better quality. The training is unsupervised, meaning that no information is given to the model to guide the generation of images; training progresses purely through the differentiation between real (original) and fake (generated) images [5]. A generator is trained for each class of the problem to specifically control the images generated for each one: each generator learns to produce images following the distribution of a single class. During training, the generated images are created from random vectors following a Gaussian distribution.
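For illustration purposes, the following is a minimal sketch of this per-class adversarial training. PyTorch is our assumption, since the paper does not name an implementation; the architecture is illustrative (reduced to 32x32 grayscale, whereas the paper works at 132x132), and the optimizer settings follow Table 2:

```python
import torch
import torch.nn as nn

LATENT_DIM = 100  # latent dimension, as in Table 2

# Illustrative DCGAN blocks at 32x32x1; not the authors' off-the-shelf model.
generator = nn.Sequential(
    nn.ConvTranspose2d(LATENT_DIM, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),  # 4x4
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),           # 8x8
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),            # 16x16
    nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Tanh(),                                 # 32x32
)
discriminator = nn.Sequential(
    nn.Conv2d(1, 32, 4, 2, 1), nn.LeakyReLU(0.2),    # 16x16
    nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),   # 8x8
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),  # 4x4
    nn.Conv2d(128, 1, 4, 1, 0), nn.Flatten(),        # one logit per image
)

bce = nn.BCEWithLogitsLoss()  # Binary Cross-Entropy, as in Table 2
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(real_batch):
    """One adversarial update on a batch of real images of a single class."""
    n = real_batch.size(0)
    z = torch.randn(n, LATENT_DIM, 1, 1)  # latent vectors sampled from N(0, 1)
    fake_batch = generator(z)

    # Discriminator step: push real images towards 1 and fakes towards 0.
    opt_d.zero_grad()
    loss_d = bce(discriminator(real_batch), torch.ones(n, 1)) + \
             bce(discriminator(fake_batch.detach()), torch.zeros(n, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: fool the discriminator into outputting 1 for fakes.
    opt_g.zero_grad()
    loss_g = bce(discriminator(fake_batch), torch.ones(n, 1))
    loss_g.backward()
    opt_g.step()
```

One such pair of networks would be trained per class, each seeing only the 40 training images of its class.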
3.3 Supervisor

Lastly, the Supervisor is crucial to the optimization of the training dataset. By adding random images with no criteria, there is no way to ensure that these are relevant to the solution of the problem, and the introduction of flawed and redundant images might end up undermining the performance of the classification algorithm [3]. One way to control the output of the generators is by controlling their latent space: selecting generated images through certain criteria allows for the optimization of the classifier. More specifically, we look for sets of images that are as diverse as possible, so as to minimize redundancy.

The exploration of the latent spaces is performed through Evolutionary Computation, more specifically using a Genetic Algorithm. Each individual in the algorithm represents a set of images, and its genetic code corresponds to the latent vectors of that set of images. The initial population is created through random sampling from the same Gaussian distribution used in the generator training. At each iteration, new populations are created using Tournament Selection, Uniform Crossover and Random Reset Mutation (which also draws from the same Gaussian distribution to obtain the values of the new genes). The fitness function averages the similarities between each image in the set and the centroid image of the set that joins the images from that individual with the images from the original dataset. The similarities between the images and the centroid are calculated using Normalized Cross-Correlation. Since we are searching for diverse datasets, the objective is to minimize this target function. In the end, we should find a set of images that comes closest to the intended objective and better tackles the issues at hand.

4 EXPERIMENTATION

In order to evaluate our approach, we performed several tests. The conditions under which these tests were carried out are presented in this section.

4.1 Dataset

In order to test our hypothesis, we use the Human Sperm Head Morphology dataset (HuSHeM) [8]. In the bio-medicine context, sperm morphology analysis is a key factor in the diagnosis of male infertility. The dataset is divided into 4 classes of sperm head images [Figure 1]: Normal (54 instances), Tapered (53 instances), Pyriform (57 instances) and Amorphous (52 instances), for a total of 216 images. A small dataset like this one is an opportunity to explore Data Augmentation approaches. The dataset has no pre-defined sub-division, so we decided to use 40 instances of each class for training and cross-validation, leaving the remaining images for testing.

Each image has original dimensions of 131x131x3, but in the experiments we work with dimensions of 132x132x1.

Figure 1. Class example images from the original dataset. From left to right, top to bottom: Normal, Tapered, Pyriform and Amorphous.
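A sketch of this split and preprocessing is shown below. The on-disk layout (one folder per class), the file extension and the scaling to [-1, 1] (typical for tanh-output DCGANs) are all our assumptions; the paper does not describe its data pipeline:

```python
import numpy as np
from pathlib import Path
from PIL import Image

CLASSES = ["Normal", "Tapered", "Pyriform", "Amorphous"]
TRAIN_PER_CLASS = 40  # per the paper: 40 instances of each class for training/CV

def load_class(root, name, rng):
    # Hypothetical layout: <root>/<class name>/*.png
    paths = sorted((Path(root) / name).glob("*.png"))
    images = []
    for p in paths:
        img = Image.open(p).convert("L").resize((132, 132))  # 131x131x3 -> 132x132x1
        images.append(np.asarray(img, dtype=np.float32) / 127.5 - 1.0)  # scale to [-1, 1]
    images = np.stack(images)
    rng.shuffle(images)  # shuffle along the sample axis before splitting
    return images[:TRAIN_PER_CLASS], images[TRAIN_PER_CLASS:]  # train, test

rng = np.random.default_rng(seed=0)
train, test = {}, {}
for c in CLASSES:
    train[c], test[c] = load_class("HuSHeM", c, rng)
```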
4.2 Classifier

The classifier module allows the assessment of the experimental results. The model used was an off-the-shelf model that only required training, since the optimization of the model was not one of the objectives of this work. The parameters used for training the classifier are presented in Table 1. The number of epochs chosen ensures that the classifier reaches a plateau on the original dataset, where there is no further gain in performance; this helps to verify the quality of our solution. On another note, the training also included cross-validation for every test, i.e. both for the tests with the original dataset and for those with the augmented datasets.

Table 1. Classifier parameters

Parameter         Setting
optimizer         Adam
beta1             0.5
beta2             0.999
learn rate        0.0002
epochs            250
batch size        32
loss function     Binary Cross-Entropy
cross-validation  Stratified
folds             5
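The stratified 5-fold protocol of Table 1 can be reproduced with scikit-learn. In the sketch below, train_classifier and evaluate are hypothetical stand-ins for the unspecified off-the-shelf model and its scoring:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, train_classifier, evaluate, seed=0):
    """Stratified 5-fold cross-validation, as configured in Table 1."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = train_classifier(X[train_idx], y[train_idx])  # 250 epochs, Adam, BCE
        scores.append(evaluate(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```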
4.3 Generator

The generators allow the creation of new samples to perform Data Augmentation. A generator is obtained through the unsupervised training of a Deep Convolutional Generative Adversarial Network. The model used for the generator, like the classifier, is an off-the-shelf model, to which we only added 2 extra convolutional layers; as for the discriminator, the model used was the same as the classifier. The training is performed on each individual class, which means that, in this particular case, we need 4 different generators to produce samples for each class [Figure 2]. Each generator was trained using the 40 corresponding class samples in the training set. The training parameters were set as shown in Table 2.

Table 2. Deep Convolutional Generative Adversarial Network parameters

Parameter           Setting
latent dimension    100
optimizer           Adam
beta1               0.5
beta2               0.999
learn rate          0.0002
epochs              10000
batch size          32
loss function       Binary Cross-Entropy
noise distribution  N(0,1)

Figure 2. Class examples of random images produced by the generators. On each row, from top to bottom: Normal, Tapered, Pyriform, Amorphous.

4.4 Supervisor

The core step of this work is the supervision of the generation of samples. This is what allows the optimization of the Data Augmentation process and ensures the best possible results. For this, we decided to use Evolutionary Computation, namely a Genetic Algorithm, to explore the latent space of the generators with the intent of finding sets of images that optimize a certain criterion; in this case, specifically, we are looking to maximize the diversity of the dataset. The supervision is performed for a single generator, or single class, which means that in this problem we use 4 supervisors to evolve 4 different sets of images that will be added to the original set. The parameters used in the Genetic Algorithm are defined in Table 3.

Table 3. Genetic Algorithm parameters

Parameter               Setting
Population size         20
Number of generations   500
Genotype length         number of images × latent dimension
Elite size              1
Selection method        tournament
Tournament size         3
Crossover operator      uniform crossover
Crossover rate          0.7
Mutation operator       random reset mutation
Mutation distribution   N(0,1)
Mutation rate per gene  0.02

Each individual represents a set of images that are coded into its genetic code in the form of latent vectors. The fitness of each individual is calculated by averaging the similarities between each image in the set and the centroid image of the set that joins the images from that individual with the images from the original dataset. The similarity is calculated using the Normalized Cross-Correlation metric. The calculations are as follows:

$$T = I \cup O \qquad (1)$$

$$C = \frac{\sum_{t \in T} t}{\operatorname{length}(T)} \qquad (2)$$

$$F = \frac{\sum_{i \in I} \operatorname{NCC}(i, C)}{\operatorname{length}(I)} \qquad (3)$$

$T$ is the set resulting from the concatenation of the images from the individual ($I$) with the images from the original dataset ($O$). $C$ is the centroid of the set $T$. $F$ is the fitness of the individual, calculated by averaging the similarities measured using Normalized Cross-Correlation (NCC). The calculation of the similarity metric is as follows:

$$\operatorname{NCC}(A, B) = \frac{\sum \left(A \circ B\right)}{\sqrt{\sum \left(A \circ A\right) \times \sum \left(B \circ B\right)}} \qquad (4)$$

The operator $\circ$ corresponds to the Hadamard product between two images.

On a last note, since the fitness function measures similarity instead of diversity, the objective of the algorithm is set to minimization. In the end, we should end up with a set of images that is more diverse than one picked at random.
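A minimal NumPy sketch of the supervision step, combining Equations (1)-(4) with the operators and parameters of Table 3, is given below. This is a simplified reading of the method, not the authors' code; in particular, generator is a hypothetical callable standing in for a trained per-class DCGAN (mapping a matrix of latent vectors to flattened images), and original is the matrix of flattened original images of that class:

```python
import numpy as np

rng = np.random.default_rng(0)
POP, GENS, N_IMG, LATENT = 20, 500, 40, 100   # Table 3 settings; 40 images per set

def ncc(a, b):
    """Normalized Cross-Correlation between two flattened images (Equation 4)."""
    return np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b))

def fitness(latents, generator, original):
    """Equations (1)-(3): decode the latent vectors into images, then average
    the NCC of each generated image to the centroid of generated + original."""
    imgs = generator(latents)                  # hypothetical decoder -> (N_IMG, H*W)
    T = np.concatenate([imgs, original])       # (1) T = I union O
    C = T.mean(axis=0)                         # (2) centroid of T
    return np.mean([ncc(i, C) for i in imgs])  # (3) minimised to promote diversity

def tournament(pop, fits, k=3):
    """Tournament selection of size 3; lower (more diverse) fitness wins."""
    idx = rng.choice(len(pop), size=k, replace=False)
    return pop[idx[np.argmin(fits[idx])]]

def evolve(generator, original):
    """One supervisor run: evolve a set of N_IMG latent vectors for one class."""
    pop = rng.standard_normal((POP, N_IMG * LATENT))  # genotypes sampled from N(0,1)
    for _ in range(GENS):
        fits = np.array([fitness(g.reshape(N_IMG, LATENT), generator, original)
                         for g in pop])
        children = [pop[np.argmin(fits)].copy()]      # elite of size 1
        while len(children) < POP:
            a, b = tournament(pop, fits), tournament(pop, fits)
            child = np.where(rng.random(a.shape) < 0.5, a, b) \
                if rng.random() < 0.7 else a.copy()   # uniform crossover, rate 0.7
            reset = rng.random(child.shape) < 0.02    # per-gene mutation rate
            child[reset] = rng.standard_normal(reset.sum())  # random reset from N(0,1)
            children.append(child)
        pop = np.stack(children)
    fits = np.array([fitness(g.reshape(N_IMG, LATENT), generator, original)
                     for g in pop])
    return pop[np.argmin(fits)].reshape(N_IMG, LATENT)  # best (most diverse) set
```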
4.5 Experimental Results

In order to test our framework, we compared the performance of the classifiers before and after performing Data Augmentation. The evaluation of each classifier in the cross-validation was performed on the test dataset, where several metrics were measured, namely Accuracy, Precision, Recall, F1 score, Area Under the Receiver Operating Characteristic Curve (AUROC) and Average Precision. Each test was performed 5 times with different seeds, and in the following results we present the mean across these 5 repetitions. Note that the initialization of the weights is the same between Model-X and Seed-X (e.g. Model-0 and Seed-0); Seed-X differs in the seed used to generate the augmented dataset and, of course, in the existence of augmented instances.

The first tests were performed with the original dataset. Looking at Figure 3, we can observe that at epoch 250 the training has already reached a plateau in terms of accuracy, which means that it would most probably not benefit from further training, since it would tend to overfit. We were also able to establish the performance that our solution should overcome [Table 4].

Figure 3. Average learning curve of the trainings with the original dataset across all repetitions.

The next step was building the sets of images to be added to the original dataset and training with the augmented dataset. For this experiment, we decided to test a dataset composed of 50% original images and 50% synthetic images. As such, for each class we generated and evolved sets of 40 images. The selection of these sets was repeated for every repetition of the test with the classifier, which was performed 5 times with different seeds, similarly to the baseline test.

First, by analysing the line of evolution in Figure 4, we can see that the fitness values of the best individuals do not vary greatly between the first and the last generations: all values, for every class, sit in an interval of 0.01, between 0.99 and 1. This means that the similarity between the images in this dataset, for this metric, is really high, which does not promote a good evolution. However, if we take a look at Figure 5, which puts side by side the best individual of the first generation (top) and the best individual of the last generation (bottom) in the evolution of a set of the class "Amorphous", we can argue that the latter has in fact, from a subjective perspective, more visual diversity than the former.

Figure 4. Average fitness of the best individual during the evolution process across all repetitions.

Figure 5. Best individuals at the end of generations 0 and 500, top and bottom respectively, from the process of evolution of individuals of the class "Amorphous".

The last step is the training of the classifiers with the augmented datasets. By analysing the training curve in Figure 6, we can see that at epoch 250 the training also reaches a plateau accuracy-wise, meaning that further training would not improve the performance.

Figure 6. Average learning curve of the training with the augmented dataset across all repetitions.

Looking at the test results [Table 4], we can see that the performance of the classifiers trained with the augmented dataset was, on average, better for every metric. Although more tests would be necessary to verify the benefit of our solution, this shows that our approach might indeed be a way to improve datasets and, consequently, the performance of classifiers. Each Seed-X represents a classifier that was trained with a subset from the Evolutionary Computation process using a different random generator seed. We can observe that for 4 out of 5 seeds we improve beyond the average performance of the original baseline classifier, and one seed improves up to 4% over the baseline average. One of the seeds hindered the performance of the classifier, but on average we obtain improvements on all the metrics when compared with the models trained with different initialization weights.

Table 4. The classifier test results for each metric. Each model trained on the original dataset is denoted as Model-X and each model trained with the augmented dataset from the Evolutionary process is denoted as Seed-X. Original is the average of the 5 models trained and Augmented is the average of the 5 seeds.

Data       Acc    Prec   Rec    F1     AUROC  Avg-Prec
Original   0.578  0.623  0.580  0.582  0.719  0.487
Model-0    0.593  0.610  0.598  0.592  0.731  0.508
Model-1    0.575  0.648  0.580  0.588  0.719  0.497
Model-2    0.571  0.599  0.569  0.563  0.712  0.469
Model-3    0.571  0.621  0.575  0.581  0.715  0.472
Model-4    0.579  0.641  0.580  0.587  0.718  0.486
Augmented  0.586  0.633  0.590  0.594  0.725  0.496
Seed-0     0.589  0.645  0.587  0.594  0.724  0.498
Seed-1     0.589  0.632  0.596  0.596  0.729  0.495
Seed-2     0.539  0.586  0.534  0.537  0.689  0.446
Seed-3     0.596  0.642  0.604  0.609  0.734  0.508
Seed-4     0.618  0.663  0.629  0.633  0.750  0.536
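For reference, the metrics reported in Table 4 correspond to standard scikit-learn scorers. The sketch below is our assumed reading of that computation; the paper does not state its averaging mode, so macro-averaging over the 4 classes is an assumption:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)
from sklearn.preprocessing import label_binarize

def test_metrics(y_true, y_prob, classes=(0, 1, 2, 3)):
    """Accuracy, Precision, Recall, F1, AUROC and Average Precision on the
    test set; y_prob holds per-class probabilities with shape (n, 4)."""
    y_pred = np.argmax(y_prob, axis=1)
    y_bin = label_binarize(y_true, classes=list(classes))  # one-hot ground truth
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Prec": precision_score(y_true, y_pred, average="macro"),
        "Rec": recall_score(y_true, y_pred, average="macro"),
        "F1": f1_score(y_true, y_pred, average="macro"),
        "AUROC": roc_auc_score(y_bin, y_prob, average="macro"),
        "Avg-Prec": average_precision_score(y_bin, y_prob, average="macro"),
    }
```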
5 CONCLUSION

We explored an approach that uses Generative Adversarial Networks for Data Augmentation to improve the performance of a supervised classifier applied to a real-world problem. The underlying idea is to explore the latent space of the generative model using Evolutionary Computation to generate sets of instances to be added to the training dataset of a supervised classifier. Since arbitrarily adding instances to the dataset could hinder the performance of the classifier being trained, we rely on a Supervisor module that selects the best set based on different criteria.

We instantiate this framework on a real-world application problem, the Human Sperm Head Morphology dataset, as a proof of concept. Due to its small number of instances, we can categorize it as a small data dataset, which presents an opportunity to explore Data Augmentation approaches. We created a baseline classifier with the provided data for comparison with the classifiers created by our framework. We used a Genetic Algorithm to evolve sets of latent space vectors that generate sets of images, and used the Normalized Cross-Correlation similarity metric to calculate the dissimilarity among the sets, assigning the average value as the fitness of each one. Overall, we were able to guide evolution and generate dissimilar subsets. The best subset from the last population of the evolutionary algorithm was used to augment the training dataset of the classifier, which was then trained with the synthetic instances together with a base subset of instances. We used cross-validation to compute the performance metrics. Overall, the results show that we can increase the performance of the classifier; for example, we were able to raise accuracy by 0.8% and the F1 score by 1.2%. Although more tests are needed to verify this conclusion and to improve the quality of the solution, it is a first step and a proof of concept of the potential of such an approach.

Future work may include testing different proportions between original and generated images in the augmented datasets, and even testing training sets composed of generated images only. We may also test different datasets, use different similarity metrics, or improve the supervision algorithm with the inclusion of other techniques. Finally, we may also compare this approach with other Data Augmentation approaches.

REFERENCES

[1] Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks, 2017.
[2] Philip Bontrager, Wending Lin, Julian Togelius, and Sebastian Risi, 'Deep interactive evolution', in Computational Intelligence in Music, Sound, Art and Design, eds., Antonios Liapis, Juan Jesús Romero Cardalda, and Anikó Ekárt, pp. 267–282, Cham, (2018). Springer International Publishing.
[3] João Correia, Evolutionary Computation for Classifier Assessment and Improvement, Ph.D. dissertation, University of Coimbra, 2018.
[4] Paulo Fernandes, João Correia, and Penousal Machado, 'Evolutionary latent space exploration of generative adversarial networks', in EvoApps '20: Proceedings of the 23rd European Conference on the Applications of Evolutionary and Bio-inspired Computation - Evolutionary Machine Learning, to appear, (2020).
[5] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio, 'Generative adversarial nets', in NIPS, pp. 2672–2680, (2014).
[6] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2015.
[7] A. Roy, N. Memon, J. Togelius, and A. Ross, 'Evolutionary methods for generating synthetic masterprint templates: Dictionary attack in fingerprint recognition', in 2018 International Conference on Biometrics (ICB), pp. 39–46, (Feb 2018).
[8] Fariba Shaker. Human sperm head morphology dataset (HuSHeM), 2018.
[9] Vanessa Volz, Jacob Schrum, Jialin Liu, Simon M. Lucas, Adam Smith, and Sebastian Risi. Evolving Mario levels in the latent space of a deep convolutional generative adversarial network, 2018.
[10] Gregory P. Way and Casey S. Greene, 'Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders', pp. 80–91, 2018.
[11] D. Winant, Joachim Schreurs, and J. Suykens, 'Latent space exploration using generative kernel PCA', in Proc. of the 28th Belgian Dutch Conference on Machine Learning (Benelearn 2019). BNAIC/Benelearn, (2019).