<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GAN-based Image Generation Techniques Exploiting Latent Vector Distribution and Edge Loss Methods on Limited Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yun-Gyeong Song</string-name>
          <email>songyg1020@gnu.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gun-Woo Kim</string-name>
          <email>gunwoo.kim@gnu.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of AI Convergence Engineering, Gyeongsang National University</institution>
          ,
          <addr-line>Jinju</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ISE 2023: 2nd International Workshop on Intelligent Software Engineering</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer Science Gyeongsang National University</institution>
          ,
          <addr-line>Jinju</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Generative Adversarial Networks (GANs) have demonstrated remarkable capabilities in image generation, surpassing the performance of previous image generation models. However, GANs require large training datasets to facilitate proper learning. GANs have inherent problems such as the mode collapse problem, where identical images are generated, and the instability problem, where the generator and the discriminator fail to form a successful adversarial relationship. These problems are particularly common when the availability of training data is limited. In this paper, we propose three techniques to address these challenges. Firstly, Common Feature Training (CFT) is introduced to enhance performance by training the Generator to recognize common features, thereby mitigating instability problems. Secondly, Mean Rescaling (MR) is employed to mitigate the mode collapse problem arising from sampling latent vectors with identical means and variances. Thirdly, an edge loss method is implemented, where the edge difference values between real and generated images are added to the GAN loss. This contributes to the classification of shapes, thereby mitigating the mode collapse problem and the instability problem. Comparative experimental results illustrate improvements in the highlighted issues, and the performance enhancement is validated by metrics, namely Fréchet Inception Distance (FID) and Inception Score (IS).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Generative models have demonstrated remarkable performance improvements with the advent
of large datasets and deep neural networks. They are used for a variety of tasks, including inpainting
and image translation tasks [
        <xref ref-type="bibr" rid="ref10 ref9">9,10</xref>
        ]. Generative Adversarial Networks (GANs) are a framework for
estimating generative models through adversarial learning of generators and discriminators. The
discriminator estimates the probability that the input data is real, while the generator generates
fake data that mimics the distribution of real data to deceive the discriminator. Due to these
characteristics, the discriminator is trained to distinguish real data from fake data, and the
generator is trained to produce fake data that closely approximates real data. Training GANs
inherently requires a large dataset, as insufficient data can lead to an instability problem, such as a
lack of adversarial relationship between the generator and discriminator, and a mode collapse
problem, where the generator primarily produces an identical image [
        <xref ref-type="bibr" rid="ref1 ref11">1,11</xref>
        ].
      </p>
      <p>
        Extensive research has been conducted to mitigate the mode collapse problem and instability
problem in traditional GANs by improving the model’s architecture and loss function [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
However, most of these studies rely heavily on the use of large datasets. The basic solution to
the mode collapse and instability problems is to augment the images or add more data
[
        <xref ref-type="bibr" rid="ref12 ref13">12,13</xref>
        ]. Otherwise, without such large datasets, producing diverse and high-quality data becomes
a challenge. In this paper, we propose the following three techniques to address the problems of
instability and mode collapse. Figure 1 shows the learning process and architecture of our
proposed GAN model.
      </p>
      <p>ORCID: 0009-0008-6692-1925 (Y. G. Song); 0000-0001-5643-4797 (G. W. Kim). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>1. Common Feature Training (CFT): We propose Common Feature Training (CFT) to
facilitate the Generator's training and establish an adversarial relationship between the
Generator and Discriminator, addressing the instability problem.</p>
      <p>2. Mean Rescaling (MR): We propose Mean Rescaling (MR), which rescales the mean of
the sampled latent vector z by replacing it with a random mean, to mitigate the mode collapse
problem.</p>
      <p>3. Edge loss: We propose an edge loss to inhibit the fast learning of the Discriminator; the
combined GAN + edge loss objective classifies not only the authenticity of an image but also its
shape, which mitigates the problems of mode collapse and instability.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        GANs have a different approach to generating high-quality images compared to traditional
generative models, and they have demonstrated excellent performance in image generation since
the release of Deep Convolutional Generative Adversarial Networks (DCGAN), which replaced
fully connected layers with convolutional layers [
        <xref ref-type="bibr" rid="ref1 ref3">1,3</xref>
        ]. However, when generating an image with limited data,
the generator and discriminator may not form an adversarial relationship due to lack of data,
which leads to problems of instability and mode collapse.
      </p>
      <p>
        To address the mode collapse problem, Diverse and Limited Data Generative Adversarial
Networks (DeLiGAN) were proposed [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. DeLiGAN models the latent space as a mixture of Gaussians and reparametrizes the
latent vector through random
sampling. This research follows the approach of modifying the latent distribution to obtain
samples in the high-probability region [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, this approach involves training the mean
and variance components, which can be time-consuming, and it does not address the instability
problem.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        Latent vectors are randomly sampled from a normal distribution but have entangled features
because one feature is related to another [16]. When we sample from a random normal distribution,
the learning nature of the discriminator will cause the generator to synthesize an identical image
that will best fool the discriminator [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The mode collapse problem can be mitigated to an extent
by reparametrizing the distribution of the latent vector [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. GANs can also be trained with additional
information to direct the generative model and to slow down the learning of the discriminator [15].
This mitigates the instability problem.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Common Feature Training</title>
        <p>Common Feature Training (CFT) is a technique to enhance the stability of learning. It consists of
reconstructing a latent vector z into a distribution that incorporates the common features of the
dataset, allowing for the learning of these common features. In this technique, we adopt a
hyperparameter called Feature Differences (FD). This hyperparameter reflects the extent to which
common features are considered and is initially set to 1. It is incremented by 1 until no instability
problem occurs beyond epoch 30, at which point we adopt that FD value. This helps the
generator learn to create an adversarial relationship. The equation of the reconstruction process
is as follows:
z′ = z + w · c
(1)
where z is sampled from a normal distribution, w is the weight of the common feature, and c
denotes each common feature. We take an input of a given sampling size from a normal
distribution and run it through a linear layer to obtain a trained w. In this procedure, each
feature's value is learned in such a way that it reduces the loss value, and a vector z′ that carries
the common features is obtained.</p>
        <p>FD is a quantification of the degree to which a feature is included. Choosing a high FD value
can result in less diverse data because the distribution learns detailed features. Conversely,
selecting a low FD value may result in a lack of common features being incorporated. Therefore, it
is important to set the correct FD value. Additionally, if the data does not have common features
between individual data points, the technique may not be very effective, so prior data analysis is
required.</p>
      </sec>
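      <p>As a minimal sketch, the CFT reconstruction of Equation (1), followed by the Mean Rescaling step of Equation (2), can be written as follows. This is an illustrative NumPy version, not the authors' implementation: the common-feature vector c, the weight w, and the batch size are placeholders, and in the paper w is produced by a trained linear layer rather than fixed by hand.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def common_feature_training(z, c, w):
    # Equation (1): z' = z + w * c, reconstructing each latent vector so it
    # carries the dataset's common features. In the paper, w is learned by a
    # linear layer and tuned via the Feature Differences (FD) hyperparameter;
    # here it is a fixed placeholder scalar.
    return z + w * c

def mean_rescaling(z_cft, rng):
    # Equation (2): z'' = z' + mu, with mu drawn uniformly from [-1, 1] per
    # latent vector, so that no two vectors share the same mean.
    mu = rng.uniform(-1.0, 1.0, size=(z_cft.shape[0], 1))
    return z_cft + mu

z = rng.standard_normal((4, 8))   # 4 latent vectors of dimension 8
c = rng.standard_normal(8)        # placeholder common-feature vector
z_cft = common_feature_training(z, c, w=0.5)
z_mr = mean_rescaling(z_cft, rng)
print(z_mr.shape)                 # (4, 8)
```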
      <sec id="sec-3-2">
        <title>3.2. Mean rescaling</title>
        <p>Mean Rescaling (MR) is a technique that involves adding a random mean value, ranging from
-1 to 1, to the latent vector z′ after it has undergone CFT. This technique ensures that each latent
vector has a unique mean, and is devised to mitigate the mode collapse problem occurring when
the initial latent vectors have the same mean and variance. The equation of this technique is as
follows:
z″ = z′ + μ
(2)
where z′ is the latent vector generated by the CFT and μ is the mean value added to z′, an
arbitrary value between -1 and 1.</p>
        <p>MR enables the latent vector z″ to have a mean in a user-specified range. This technique
mitigates the mode collapse problem and overcomes the problem of generating only specific
classes of data, enabling the generation of a variety of data categories.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Edge loss</title>
        <p>Most of the instability problem is caused by the faster training of the Discriminator compared to
the Generator. To address this problem, we incorporate the similarity of the edges between the
generated image and the real image into the loss function. The equation of the loss function is as
follows:
V(D, G) = E_{x∼p_data(x)}[ log D(x) + ∑(Edge(x) − Edge(G(z)))² ] + E_{z∼p_z(z)}[ log(1 − D(G(z))) ]
(3)
where Edge(·) denotes the edge map, so the added squared term measures the edge
difference between the image G(z) generated by the Generator and the real image x, which also
allows the Discriminator to learn the shape of the image. The Generator generates images that
look like the real image through 1 − D(G(z)).</p>
        <p>Since the Discriminator is responsible for distinguishing between real and generated data, it
does not otherwise recognize the shape of the generated data. However, with Equation (3), the
Discriminator takes shape into consideration. This technique not only ensures the reliability of
the data but also allows for a more accurate understanding of the image shape information. This
mitigates the mode collapse problem and mitigates the instability problem by slowing down the
learning speed of the Discriminator to compete with the Generator.</p>
      </sec>
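      <p>The edge-difference term added in Equation (3) can be sketched as follows. The paper does not pin down the edge operator, so the finite-difference gradient magnitude below is an assumption; only the added squared edge term is shown, with the standard GAN terms omitted.</p>

```python
import numpy as np

def edge_map(img):
    # Assumed edge operator: absolute finite differences along each axis
    # (a crude gradient magnitude); the paper's exact operator is unspecified.
    gy = np.abs(np.diff(img, axis=0, append=img[-1:, :]))
    gx = np.abs(np.diff(img, axis=1, append=img[:, -1:]))
    return gx + gy

def edge_loss(real, fake):
    # The added term in Equation (3): sum of squared differences between the
    # edge maps of a real image and a generated image.
    return float(np.sum((edge_map(real) - edge_map(fake)) ** 2))

real = np.zeros((8, 8))
real[2:6, 2:6] = 1.0              # a square "shape" in the real image
fake_same = real.copy()           # generated image with the same shape
fake_blank = np.zeros((8, 8))     # generated image with no shape at all
print(edge_loss(real, fake_same))         # 0.0
print(edge_loss(real, fake_blank) > 0.0)  # True
```

A generated image that reproduces the real image's shape incurs no extra penalty, while a shapeless one does, which is exactly what lets the Discriminator attend to shape.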
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Metrics</title>
        <p>
          The experimental environment in this paper is as follows: We used PyTorch version 1.12.1, the
optimization algorithm is Adam [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the learning rate is 1e-2, and the batch size is 32. Figure 2
shows the training loss graph for each dataset and model, and we can see that our proposed GAN
is more stable in all datasets compared to other models.
        </p>
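        <p>The training setup described above can be reproduced with a configuration along these lines. This is a sketch only: the Generator and Discriminator modules are placeholders, since the paper does not list its layer configuration here.</p>

```python
import torch

# Placeholder modules; the paper's actual Generator/Discriminator
# architectures are not reproduced here -- any nn.Module pair would slot in.
G = torch.nn.Linear(100, 784)
D = torch.nn.Linear(784, 1)

# Settings stated in the paper: Adam optimizer, learning rate 1e-2,
# batch size 32 (PyTorch 1.12.1).
opt_G = torch.optim.Adam(G.parameters(), lr=1e-2)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-2)
batch_size = 32
```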
        <p>
          For the evaluation, we used Fréchet Inception Distance (FID) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and Inception Score (IS) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. FID
is a metric that estimates the similarity between the distributions of real and generated data by
calculating the distance between them, indicating how similar the two sets of data are. The
equation for FID is as follows:
FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2))
where ||μ_r − μ_g||² is the squared distance between the feature-vector means of the true image
distribution and the generated image distribution, Σ_r + Σ_g is the sum of the true and generated
image covariance matrices, and (Σ_r Σ_g)^(1/2) is the matrix square root of the product of the two
covariances. Adding ||μ_r − μ_g||² to the trace term gives the FID. The lower the FID, the more
similar the generated data is to the real data.
        </p>
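        <p>The FID formula can be checked with a small sketch. In practice FID is computed on Inception-v3 features with full covariance matrices (requiring a matrix square root, e.g. scipy.linalg.sqrtm); the version below assumes diagonal covariances, for which the matrix square root reduces to an elementwise square root.</p>

```python
import numpy as np

def fid_diagonal(mu_r, mu_g, var_r, var_g):
    # FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)),
    # specialized to diagonal covariances S = diag(var).
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    trace_term = float(np.sum(var_r + var_g - 2.0 * np.sqrt(var_r * var_g)))
    return mean_term + trace_term

mu = np.zeros(4)
var = np.ones(4)
print(fid_diagonal(mu, mu, var, var))        # 0.0 (identical distributions)
print(fid_diagonal(mu, mu + 1.0, var, var))  # 4.0 (means shifted by 1 in 4 dims)
```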
        <p>IS is another metric used to evaluate the quality and diversity of generated data, predicting the
class of generated data using the Inception Network. The equation for IS is as follows:
IS = exp( E_x [ KL( p(y|x) || p(y) ) ] )
where KL( p(y|x) || p(y) ) represents the Kullback-Leibler divergence between the predicted class
distribution for image x and the marginal class distribution over all generated images. This
quantifies the difference between these two probability distributions. The higher the IS, the better
the performance in terms of quality and variety of the generated data.</p>
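        <p>The IS computation can likewise be sketched directly from the formula; the class probabilities p(y|x), which would come from the Inception network, are supplied by hand here as an assumption.</p>

```python
import numpy as np

def inception_score(p_yx):
    # IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ), where p(y) is the marginal
    # class distribution over the generated images.
    p_y = p_yx.mean(axis=0)
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse predictions score higher than collapsed ones.
diverse = np.array([[0.999, 0.001], [0.001, 0.999]])
collapsed = np.array([[0.999, 0.001], [0.999, 0.001]])
print(inception_score(diverse) > inception_score(collapsed))  # True
print(inception_score(collapsed))                             # 1.0 (KL is zero)
```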
        <p>The above metrics were used to compare the performance of each model and to
demonstrate the performance of the proposed GAN.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Dataset</title>
        <p>In this paper, we conducted experiments using the Emoji Dataset and the CIFAR Dataset. The Emoji
Dataset comprises 402 images in a positive class and 402 in a negative class, totaling 804 images. The
CIFAR-10 dataset consists of 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck), with
6,000 images per class, resulting in a total of 60,000 images. However, for this research, 1,000
images were used.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results and Performance</title>
        <p>In Figure 3, as well as in Table 1, the IS and FID values for each dataset and model are displayed.
Figure 3 reveals that our proposed GAN achieves the best performance, with the lowest FID
values and the highest IS. In the CIFAR-10 dataset of Figure 3, the FID of DCGAN is
overwhelmingly large in Figure 3-(a), so we exclude it and show the FID of DeLiGAN and our
proposed GAN in Figure 3-(b).</p>
        <p>When compared to DCGAN, our proposed GAN exhibits a 62.90% reduction in FID and a 30.77%
increase in IS on the emoji dataset, as well as a 99.57% reduction in FID and a 31.82% increase
in IS on the CIFAR-10 dataset. In comparison to DeLiGAN, our proposed GAN shows a 32.35%
reduction in FID and a 13.33% increase in IS on the emoji dataset, and a 66.67% reduction in FID
and an 11.54% increase in IS on the CIFAR-10 dataset.</p>
        <p>Examining the generated images in Figure 4, it becomes visually evident that our proposed
GAN outperforms others by generating images that are not only clearer but also more diverse.
This underlines that the model, enhanced with the proposed technique and compared against
DCGAN and DeLiGAN in experiments, produces exceptionally high-quality images that bear a
striking resemblance to real images.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we propose Common Feature Training (CFT), Mean Rescaling (MR), and edge loss to
resolve the learning instability problem and the mode collapse problem.</p>
      <p>Common Feature Training (CFT) is intended to train the latent vector z on the overall shape of
the data. This causes the generator to learn common features to compete with the discriminator,
thereby mitigating the instability problem.</p>
      <p>Mean Rescaling (MR) is a technique to mitigate the mode collapse problem caused by latent
vectors z sampled from a distribution with the same mean. It mitigates the mode collapse problem
by sampling latent vectors z with different means.</p>
      <p>Edge loss is a loss function that adds the difference between the edges of real data and the edges
of generated data to the GAN loss; it does not simply classify whether an image is real or generated,
but also learns image shape information to generate diverse images and slows down the
Discriminator's learning speed, thereby mitigating the mode collapse problem and the instability
problem. As a result, Figure 2 shows that the proposed GAN trains reliably compared to other
models, and Figure 3 shows that our proposed GAN produces higher-quality and more diverse
images compared to other models.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This research was supported by Basic Science Research Program through the National Research
Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology
(NRF2021R1G1A1006381).
[15] MehdiMirza, SimonOsindero, "ConditionalGenerativeAdversarialNets", arXiv preprint
arXiv:1411.1784, 2014
[16] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel, "InfoGAN:
Interpretable Representation Learning by Information Maximizing Generative Adversarial
Nets", 30th Conference on Neural Information Processing Systems, 2016, pp 2180-2188</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ian</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , Jean Pouget Abadie, Mehdi Mirza, Bing Xu, David Warde Farley, Sherjil Ozair, Aaron Courville and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>"Generative adversarial nets"</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          , volume
          <volume>27</volume>
          ,
          <year>2014</year>
          , pp
          <fpage>139</fpage>
          -
          <lpage>144</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Xudong</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qing</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Haoran</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <surname>Raymond Y.K. Lau</surname>
            ,
            <given-names>Zhen</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          and Stephen Paul Smolley Stephen,
          <article-title>"Least squares generative adversarial networks"</article-title>
          ,
          <source>Proceedings of the IEEE international conference on computer vision (ICCV)</source>
          ,
          <year>2017</year>
          , pp
          <fpage>2794</fpage>
          -
          <lpage>2802</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Luke Metz and
          <string-name>
            <given-names>Soumith</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
          <article-title>"Unsupervised representation learning with deep convolutional generative adversarial networks"</article-title>
          ,
          <source>in Proc. Int. Conf. Learn. Representations</source>
          ,
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Swaminathan</given-names>
            <surname>Gurumurthy</surname>
          </string-name>
          , Ravi Kiran Sarvadevabhatla and
          <string-name>
            <surname>R.</surname>
          </string-name>
          <article-title>Venkatesh Babu, "DeLiGAN: Generative adversarial networks for diverse and limited data"</article-title>
          ,
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp
          <fpage>166</fpage>
          -
          <lpage>174</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Tommi</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Jaakkola</surname>
            and
            <given-names>Michael I. Jordan</given-names>
          </string-name>
          ,
          <article-title>"Improving the mean field approximation via the use of mixture distributions"</article-title>
          ,
          <source>NATO ASI Series D Behavioural Social Sci.,</source>
          volume
          <volume>89</volume>
          ,
          <year>1998</year>
          , pp
          <fpage>528</fpage>
          -
          <lpage>534</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Heusel</surname>
          </string-name>
          , Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler and
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <article-title>"Gans trained by a two time-scale update rule converge to a local nash equilibrium"</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          , volume
          <volume>30</volume>
          ,
          <year>2017</year>
          , pp
          <fpage>6626</fpage>
          -
          <lpage>6637</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Salimans</surname>
          </string-name>
          , Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford and Xi Chen, “
          <article-title>Improved techniques for training gans”</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          , volume
          <volume>29</volume>
          ,
          <year>2016</year>
          , pp
          <fpage>2226</fpage>
          -
          <lpage>2234</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Diederik</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Kingma</surname>
          </string-name>
          and Jimmy Lei Ba, ”
          <article-title>Adam: A method for stochastic optimization”</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Phillip</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jun-Yan</surname>
            <given-names>Zhu</given-names>
          </string-name>
          , Tinghui Zhou, Alexei A.
          <string-name>
            <surname>Efros</surname>
          </string-name>
          ,
          <article-title>"Image-to-Image Translation with Conditional Adversarial Networks"</article-title>
          ,
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)</source>
          ,
          <year>2017</year>
          , pp
          <fpage>1125</fpage>
          -
          <lpage>1134</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jun-Yan</surname>
            <given-names>Zhu</given-names>
          </string-name>
          , Taesung Park, Phillip Isola,
          <article-title>Alexei A. Efros, "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks"</article-title>
          ,
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          (ICCV),
          <year>2017</year>
          , pp
          <fpage>2223</fpage>
          -
          <lpage>2232</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Martin</surname>
            <given-names>Arjovsky</given-names>
          </string-name>
          , Léon Bottou,
          <article-title>"Towards Principled Methods for Training Generative Adversarial Networks"</article-title>
          ,
          <source>In Proc. ICLR</source>
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Connor</surname>
            <given-names>Shorten</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taghi M. Khoshgoftaar</surname>
          </string-name>
          ,
          <article-title>"A survey on image data augmentation for deep learning"</article-title>
          ,
          <source>Journal of Big Data</source>
          , volume
          <volume>6</volume>
          ,
          <year>2019</year>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Dan</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Anna Khoreva,
          <article-title>"PA-GAN: Improving GAN Training by Progressive Augmentation"</article-title>
          ,
          <source>In Proc. ICLR</source>
          ,
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Tim</surname>
            <given-names>Salimans</given-names>
          </string-name>
          , Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford and Xi Chen,
          <article-title>"Improved techniques for training gans"</article-title>
          ,
          <source>Advances in neural information processing systems 29</source>
          ,
          <year>2016</year>
          , pp
          <fpage>2234</fpage>
          -
          <lpage>2242</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>