<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Modal Generative Adversarial Networks Make Realistic and Diverse but Untrustworthy Predictions When Applied to Ill-posed Problems</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>John S. Hyatt, Michael S. Lee Computational &amp; Information Sciences Directorate, DEVCOM Army Research Laboratory</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Ill-posed problems can have a distribution of possible solutions rather than a unique one, where each solution incorporates significant features not present in the initial input. We investigate whether cycle-consistent generative neural network models based on generative adversarial networks (GANs) and variational autoencoders (VAEs) can properly sample from this distribution, testing on super-resolution of highly downsampled images. We are able to produce diverse and plausible predictions, but, looking deeper, we find that the statistics of the generated distributions are substantially wrong. This is a critical flaw in applications that require any kind of uncertainty quantification. We trace this to the fact that these models cannot easily learn a bijective, invertible map between the latent space and the target distribution. Additionally, we describe a simple method for constraining the distribution of a deterministic encoder's outputs via the Kullback-Leibler divergence without the reparameterization trick used in VAEs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A problem is well-posed if it satisfies three criteria
(Hadamard 1902): (1) the problem has a solution, (2) the
solution is unique, and (3) the solution is a continuous
function of the initial conditions. Many real-world problems of
interest are inherently ill-posed, meaning they violate one or
more of these criteria, and modeling them correctly remains
one of the outstanding challenges in machine learning (ML).
Out-of-distribution inputs (data the model was not trained to
understand) violate the first criterion, while adversarial
examples exist because ML models tend to be very unstable
with regard to small, carefully chosen perturbations
        <xref ref-type="bibr" rid="ref7">(Wiyatno et al. 2019)</xref>
        , effectively violating the third.
      </p>
      <p>Violations of the second criterion can occur for relatively
simple tasks such as classification of ambiguous inputs
(Peterson et al. 2019), but they are ubiquitous in generative
modeling tasks, where the desired output is complex and
high-dimensional. The strongest possible model, given this
type of ill-posed problem, is one that estimates the
(conditional) posterior distribution of possible solutions.</p>
      <p>
        Depending on the use case, it may not be necessary to
model the full posterior; for example, if the objective is
purely aesthetic (Pathak et al. 2016;
        <xref ref-type="bibr" rid="ref12">Yang et al. 2019</xref>
        ).
For safety-critical applications, however, proper risk
management requires quantifying the model’s predictive
uncertainty, as well as the error introduced when the model is used
to approximate the true target distribution. The same
considerations apply if the model feeds into some downstream
analysis or decision-making process. Despite this, the
literature on probabilistic generative neural network (NN) models
rarely contains explicit verification of the learned
distribution’s statistics. Often, what is actually verified is that the
generative model produces realistic outputs or has low
reconstruction error in the data domain. Optimizing realism
incidentally pseudo-optimizes the error in reconstructing data
in a high-dimensional space from a low-dimensional latent
representation, even if the model has not learned to encode
and reconstruct features well. Prediction diversity and
multimodality are usually only discussed qualitatively.
      </p>
      <p>We examine several cycle-consistent architectures
incorporating elements of popular generative NN models, namely
generative adversarial networks (GANs) and variational
autoencoders (VAEs), as well as deterministic encoders. Our
focus on cycle-consistent architectures is motivated by the
fact that GANs and VAEs do not contain a mechanism for
reversing the generative transformation. By examining the
learned maps between latent space (which has a simple prior
sampled during generative inference) and feature space, we
verify that even for simple problems, these architectures
cannot model the true data distribution. This failure appears to
be due to the models being neither invertible nor bijective, a
state of affairs that persists even when the models are
converted to deterministic maps. Our main contributions are:
• A null result, namely that even if generative models
produce diverse and realistic predictions, they do not learn
to bijectively map a latent distribution onto the true
distribution represented by the training data. We use highly
expressive models and follow best practices in network
design and training; at minimum, our results argue that
proper statistical behavior cannot be taken for granted. We
suspect this will not surprise many ML practitioners, but we
believe it is worth testing explicitly.
We have been unable to find any examples in the literature
on GANs and VAEs that explicitly test for these
properties, although some related concepts are well known, like
the fact that gaps exist in a VAE’s coverage in latent space.
• A simple extension of VAEs to deterministic latent
vector sampling. Instead of using the reparameterization trick
to sample from a multivariate normal distribution with
learned mean and variance, we preserve the flow of
information through the encoder-decoder stack and optimize
KL divergence over batches of training examples.
• A proof that sampling from the space of solutions to a
conditional inverse problem only requires pairs of
conditioning information/ground truth examples, even when
the conditioning information has high dimensionality
(such as for super-resolution).</p>
      <p>
        Our work does not come close to exploring all possible
GAN- or VAE-based generative models, and it is
possible that another architecture would learn a bijective map.
We choose BicycleGAN as our starting point, as it is
a state-of-the-art example of such models. Our chosen
dataset, Fashion-MNIST
        <xref ref-type="bibr" rid="ref9">(Xiao, Rasul, and Vollgraf 2017)</xref>
        ,
is also very simple compared to standards like CIFAR-10
(Krizhevsky 2009) or ImageNet (Russakovsky et al. 2014).
This is precisely our argument: if we cannot learn the
statistics of even an “easy” dataset, given a reasonable choice of
high-capacity model, we should assume any of this family
of multi-modal generative models is statistically unreliable
unless proven otherwise.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Generative NNs come in a variety of flavors. GANs and
VAEs are the most studied; together with invertible neural
networks (INNs), they encompass most methods that
generate data from a latent space representation. Other methods
exist that are based on sequential (Parmar et al. 2018) or
Bayesian neural networks (Saatci and Wilson 2017).</p>
      <p>A note on notation: in this paper, we use calligraphic
letters, X , for sets; upper-case letters, X, for random variables;
(bold) lower-case letters, x, for their (vector) values; and p
for their probability densities.</p>
      <p>Fundamentally, GANs, VAEs, and INNs operate
similarly during the forward (generative) process: a generator
G : R^m → R^n maps a vector z ∈ Z ⊂ R^m to a point x ∈ X ⊂ R^n.
Z is the space of points sampled from some simple latent prior
distribution, z ∼ p_Z(z), and X is the complex space represented
by the training data p_X^data(x); for example, a collection of
images belonging to some category. The true underlying
distribution of the data, p_X(x), is unknown.</p>
      <p>For GANs and VAEs, m ≪ n; the latent representation
is compressed, with components of z corresponding to the
presence or absence of major features in X. Under certain
circumstances, simple arithmetic operations on a vector z
can add or subtract semantic features in the corresponding
x (Radford, Metz, and Chintala 2016). For standard INNs,
m = n, so there is no compression. Typically, the latent
prior distribution is taken to be as simple as possible, for
example p_Z(z) = N(0, 1); this does not preclude a
homeomorphic map between Z and X, but may complicate it
(Pérez Rey, Menkovski, and Portegies 2019).</p>
      <p>The details, including the procedure for training G, vary
between the different types of generative model. We briefly
discuss the specific properties of GANs and VAEs that
complicate inverse problem solving below, with comparison to
INNs, which are explicitly invertible but less well-studied.</p>
      <sec id="sec-2-1">
        <title>Generative Adversarial Networks (GANs)</title>
        <p>A basic GAN training algorithm contains two models, a
generator G and a discriminator D (Goodfellow et al.
2014). D is a binary classifier trained to differentiate
between real training data x_real and generated data x_gen =
G(z), while G is trained to generate outputs that appear real
to D. The two models are trained alternately, with the goal
that D should eventually learn to reject any x_gen ∉ X and
push G(z) ∈ X for all z ∈ Z.</p>
        <p>However, there is no guarantee that the distribution
modeled by G, p_X^G(x), is the same as or even close to the true
distribution, p_X(x). Mode collapse, where G(z) outputs the
same realistic x_gen for all z, is only the most extreme example
of this. A diverse, multi-modal p_X^G(x) is clearly better than
the delta function distribution modeled by a mode-collapsed
GAN, but diversity is not the same as representativeness, and
without a theoretical guarantee or explicit testing, G cannot
be a trustworthy model of the true distribution of solutions to
an inverse problem. In fact, GANs do not explicitly attempt
to model probability densities at all; moreover, on its own, a
GAN generator cannot be inverted to map some x into Z.</p>
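        <p>As an illustration of the alternating scheme described above, the
following minimal sketch updates D on real and generated batches, then
updates G to fool D. The models G and D, the optimizer settings, and the
simple cross-entropy losses are illustrative assumptions, not the exact
choices of any particular GAN paper.</p>
        <preformat>
import tensorflow as tf

# Hypothetical generator G and discriminator D (any tf.keras models with
# compatible shapes; D is assumed to output logits).
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)
opt_D = tf.keras.optimizers.Adam(1e-4)
opt_G = tf.keras.optimizers.Adam(1e-4)

def gan_train_step(G, D, x_real, dim_z):
    z = tf.random.normal([tf.shape(x_real)[0], dim_z])  # z ~ p_Z(z)
    # Update D: real examples toward "real", generated toward "fake".
    with tf.GradientTape() as tape:
        x_gen = G(z, training=True)
        s_real, s_fake = D(x_real, training=True), D(x_gen, training=True)
        loss_D = (cross_entropy(tf.ones_like(s_real), s_real) +
                  cross_entropy(tf.zeros_like(s_fake), s_fake))
    opt_D.apply_gradients(zip(tape.gradient(loss_D, D.trainable_variables),
                              D.trainable_variables))
    # Update G: push D(G(z)) toward "real".
    with tf.GradientTape() as tape:
        s_fake = D(G(z, training=True), training=True)
        loss_G = cross_entropy(tf.ones_like(s_fake), s_fake)
    opt_G.apply_gradients(zip(tape.gradient(loss_G, G.trainable_variables),
                              G.trainable_variables))
    return loss_D, loss_G
        </preformat>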
      </sec>
      <sec id="sec-2-2">
        <title>Variational Autoencoders (VAEs)</title>
        <p>A basic VAE also incorporates two models, an encoder E
and a decoder G (Kingma and Welling 2013). E(x)
encodes x into Z, and G maps the latent vector back into
X. During training, E and G are updated simultaneously to
minimize (i) some measure of the distance between x and
G(E(x)), and (ii) the error introduced by approximating
p_Z^E(z), the latent distribution modeled by E, as p_Z(z), the
latent prior. The latter is expressed in terms of the KL
divergence (Kullback and Leibler 1951) from p_Z(z) to p_Z^E(z).
During inference, E is discarded, latent vectors are sampled
from the prior via z ∼ p_Z(z), and samples are generated via
x_gen = G(z).</p>
        <p>Enforcing p_Z^E(z) ≈ p_Z(z) is critical, as inputs to G
are drawn from the former during training, but the latter
during inference. This is typically done by assuming that
p_Z^E(z) = N(μ, σ²), where μ and σ² are the mean and
variance of E(x), respectively, and p_Z(z) = N(0, 1). (Other
latent priors are possible, but the standard normal is most
common.) KL divergence is a simple function of μ and σ in
this case (Kingma and Welling 2013).</p>
        <p>However, μ and σ are statistical measures, meaning that
they are defined in terms of a large number of observations,
and a particular x represents only a single observation. Thus,
rather than learning a deterministic encoding E(x) = z_enc,
E(x) predicts two vectors, μ and σ, that define a point cloud
in latent space. Monte Carlo sampling of that point cloud is
performed via the reparameterization trick, z_enc = μ + σ ⊙ ε,
where ε ∼ N(0, 1). μ and σ are themselves deterministic,
making it easier to take the gradient of E's weights with
respect to its outputs via backpropagation, but this variational
approximation adds irreducible noise to the generative
process and means E cannot invert G. Therefore, while a VAE
can help G map every point in Z to a realistic x, the map is
not bijective, which makes it difficult to verify its statistical
properties or perform analysis in the latent space.</p>
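        <p>For reference, a minimal sketch of the reparameterization step and
the closed-form Gaussian KL term described above; the function names and
the log-variance parameterization are ours, chosen for numerical
convenience rather than taken from any specific implementation.</p>
        <preformat>
import tensorflow as tf

def sample_latent(mu, log_var):
    """Reparameterization trick: z_enc = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, 1)), per example."""
    # 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2) over latent dimensions
    return 0.5 * tf.reduce_sum(tf.exp(log_var) + tf.square(mu) - 1.0 - log_var,
                               axis=-1)

# Usage: mu, log_var = E(x); z = sample_latent(mu, log_var); x_rec = G(z)
        </preformat>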
      </sec>
      <sec id="sec-2-3">
        <title>Invertible Neural Networks (INNs)</title>
        <p>
          INNs
          <xref ref-type="bibr" rid="ref1">(Ardizzone et al. 2018)</xref>
          are composed of a stack of
operations that are invertible by construction, meaning that
the entire network can be inverted cheaply. Thus, INNs learn
a bijective map between Z and X and E is simply G⁻¹. This
means that solving the inverse problem is, in principle, as
easy as inverting an INN that has been trained on the forward
problem; bidirectional training is possible as well. Notably,
these maps also have a tractable Jacobian, which means that
the unknown data distribution can be written explicitly in
terms of the latent prior. These properties address some of
the shortcomings of other types of generative models.
        </p>
        <p>The main disadvantage of INNs seems to be that they
are a recent development, and consequently have not been
refined to the same extent as GANs and VAEs. Also, as a
consequence of guaranteeing bijectivity, Z has the same
dimensionality as X , meaning further processing of the latent
space is necessary to efficiently represent the data and
perform certain types of analysis such as feature extraction,
feature arithmetic, and anomaly detection.</p>
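        <p>The invertibility-by-construction can be illustrated with a single
affine coupling block of the kind used in such networks. This is a
simplified NumPy sketch with an arbitrary toy scale/shift function; real
INNs stack many such blocks with permutations and learned subnetworks.</p>
        <preformat>
import numpy as np

def coupling_forward(x, st_net):
    """Transform half of x conditioned on the other half; trivially invertible."""
    x1, x2 = np.split(x, 2, axis=-1)
    s, t = st_net(x1)            # st_net need not be invertible itself
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2], axis=-1)

def coupling_inverse(y, st_net):
    y1, y2 = np.split(y, 2, axis=-1)
    s, t = st_net(y1)
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=-1)

def st_net(h):
    # Toy scale/shift "network": any function of the untouched half works.
    return np.tanh(h), 0.5 * h

x = np.random.randn(4, 8)
assert np.allclose(coupling_inverse(coupling_forward(x, st_net), st_net), x)
        </preformat>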
      </sec>
      <sec id="sec-2-4">
        <title>Conditional Generative Models</title>
        <p>
          All of these types have been extended to the conditional
case, when the generator outputs are conditioned on some
partial information, such as class label. cGANs (Mirza and
Osindero 2014) and cVAEs (Sohn, Lee, and Yan 2015) have
been heavily studied since they were introduced several
years ago; cINNs (Liu et al.
          <xref ref-type="bibr" rid="ref2">2019; Ardizzone et al. 2019</xref>
          )
are relatively new. Regardless, the basic idea is the same:
G has a second, conditioning input y ∈ Y ⊂ R^l, where Y
is the space of conditioning data, and implicitly represents
the conditional distribution p_{X|Y=y}^G(x), which may or may
not be verifiably close to the true conditional distribution,
p_{X|Y=y}(x). In the case at hand, our training data includes a
set of conditioning information, p_Y^data(y), where each y
corresponds to an x in p_X^data(x). Together, these (x, y) pairs
represent samples from the joint distribution p_{XY}^data(x, y).</p>
      </sec>
      <sec id="sec-2-5">
        <title>Multimodal Models</title>
        <p>
          Conditional GANs have been incredibly successful in
mapping from one complex data space to another, but not in
learning to predict distributions of solutions to ill-posed
problems. The pix2pix framework (Isola et al. 2017)
produced a model G : X′ → X that mapped, deterministically
and in one direction, from a point x′ ∈ X′ to a point x ∈ X.
CycleGAN
          <xref ref-type="bibr" rid="ref13 ref14 ref4">(Zhu et al. 2017a)</xref>
          extended this to include
another model F : X → X′, by including a cycle consistency
loss to encourage F and G to invert one another. Neither was
able to incorporate stochasticity via random sampling of z;
even when z was included as a second input in pix2pix, the
model simply learned to ignore it, although some stochastic
elements could be introduced by including random dropout
in the model. Further, CycleGAN owed its success in part to
the fact that the cycle consistency loss function
simultaneously optimized F and G, leading them to cheat by
encoding hidden information in their predictions
          <xref ref-type="bibr" rid="ref4">(Chu,
Zhmoginov, and Sandler 2017)</xref>
          .
        </p>
        <p>
          BicycleGAN
          <xref ref-type="bibr" rid="ref13 ref14">(Zhu et al. 2017b)</xref>
          attempted to rectify these
shortcomings by combining components from GANs and
VAEs. As it was the inspiration for this work, BicycleGAN
is discussed in more detail below.
        </p>
        <p>Although they are not our focus, we also note that
probabilistic models like Bayesian neural networks are inherently
more suitable for modeling multi-modality, though less so
for learning bijective maps or latent space representations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Principled Two-cycle-consistent Generative</title>
    </sec>
    <sec id="sec-4">
      <title>Models for Inverse Problems</title>
      <p>
        We chose the BicycleGAN framework
        <xref ref-type="bibr" rid="ref13 ref14">(Zhu et al. 2017b)</xref>
        as a starting point. Because several of our design choices
are intended to address specific concerns we have with this
framework, we briefly repeat it here. We then describe our
modified framework and justify it as a principled attempt to
build a two-cycle-consistent generative model, introducing
a new method for constraining the distribution output of a
VAE-like encoder without losing cycle-critical information
during a reparameterization step.
      </p>
      <sec id="sec-4-1">
        <title>BicycleGAN</title>
        <p>BicycleGAN simultaneously trains a conditional generator
G : Y; Z ! X ; a VAE-based encoder E : X ! Z that
outputs two vectors, the mean and log variance of a point
cloud in Z; and a discriminator D to enforce realism in the
outputs of G. BicycleGAN is built around two cycles: a
conditional latent regressor (cLR) to enforce consistency on the
path Z ! X ! Z, and a conditional variational
autoencoder (cVAE) to do the same for the path X ! Z ! X . The
models are trained to jointly optimize multiple loss
functions, described below.</p>
        <p>Standard cGAN loss This is the original conditional
GAN loss function defined in (Mirza and Osindero 2014):
L_GAN(G, D) = E_{x∼p_X^data}[log(D(x))]
            + E_{y∼p_Y^data, z∼p_Z}[log(1 − D(G(y, z)))],   (1)
where E_p[·] is the expected value under a distribution p.
The two terms evaluate the realism of real and generated
data, respectively.</p>
        <p>cVAE-GAN loss Identical to the GAN loss, except that
z is sampled from E(x) via the reparameterization trick,
rather than from p_Z(z):
L_GAN^VAE(G, E, D) = E_{x∼p_X^data}[log(D(x))]
            + E_{(x,y)∼p_XY^data, z∼(μ+σ⊙ε)|E(x)}[log(1 − D(G(y, z)))].   (2)
As μ → 0 and σ → 1, the cVAE-GAN loss term
approaches the standard cGAN loss term.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Latent space reconstruction loss</title>
        <p>The L1 distance between a randomly sampled latent vector and its
reconstruction after passing through both G and E:
L_1^Z(G, E) = E_{y∼p_Y^data, z∼p_Z}[‖z − μ|E(G(y, z))‖₁].   (3)
This is the cLR cyclic consistency term, intended to teach
E to invert G.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Ground truth reconstruction loss</title>
        <p>The L1 distance between a ground truth example and its
reconstruction after passing through both E and G:
L_1^X(G, E) = E_{(x,y)∼p_XY^data, z∼(μ+σ⊙ε)|E(x)}[‖x − G(y, z)‖₁].   (4)
This is the cVAE cyclic consistency term, intended to
teach G to invert E.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>KL Divergence from pZ (z) to pE (z) Attempts to en</title>
      <p>Z
sure that E maps into the simple prior distribution
sampled from during inference:</p>
      <p>LKL(E) = Ex pdXata [DKL( N ( ; 2) E(x) kN (0; 1))]:
(5)
This is necessary to ensure that G always receives latent
vectors belonging to the same distribution. By design, E’s
outputs are always interpreted as point clouds, since they
only go back into the training process via
reparameterization (which is always Gaussian), justifying the above
form of the KL divergence.</p>
      <p>The models are trained according to combinations of these
loss terms, namely
where “ ” represents an updated model after one training
step and the s are weights.</p>
      <p>BicycleGAN is very successful at generating diverse and
realistic outputs G(y, z), but the statistics of the learned
distributions have not been verified. This means ensuring that
the conditional probability distribution implicitly modeled
by G, p_{X|Y=y}^G(x), well approximates the distribution
described by the training data, p_{X|Y=y}^data(x); in the Appendix,
we show that this is possible even if we can only sample
pairs from the joint distribution, p_{XY}^data(x, y). It also means
ensuring that the latent distribution modeled by E, p_Z^E(z),
resembles the prior, p_Z(z). Our tests of BicycleGAN do
produce diverse and realistic reconstructions, but we do not find
that the learned distributions match the ground truth statistics.</p>
      <p>
        The original BicycleGAN has several features that
potentially complicate learning a bijective map via enforcing
two-cycle consistency:
1. G has two inputs, y and z, but E only has one input,
x. This asymmetry means that E cannot invert G, since
E(G(y; z)) has no way to disentangle the separate
contributions of y and z in G(y; z).
2. E is trained using VAE-based methods and has two
outputs, μ and σ, rather than a point in Z. This is a second
reason why E cannot actually invert G(y; z) to recover
z, and therefore cannot be trained to enforce cycle
consistency in the latent space. L1Z actually attempts to
minimize the distance between z and μ|E(G(y, z)).
3. The authors found that simultaneously training G and E
on both cyclic consistency loss terms incentivizes
cheating, similar to what was observed in CycleGAN
        <xref ref-type="bibr" rid="ref4">(Chu,
Zhmoginov, and Sandler 2017)</xref>
        , so E is not trained to
optimize L1Z . However, when we attempt to replicate their
approach we find indications of the same behavior when
simultaneously training G and E to optimize L1X .
We address these concerns primarily by changing E, and by
changing the loss functions used in training—specifically,
which loss functions are used to train which models. Our
modified framework simultaneously trains a conditional
generator G : Y, Z → X and a deterministic, conditional
encoder E : Y, X → Z. Each change is discussed in detail
in the following sections.
      </p>
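      <p>Before detailing the individual losses, a sketch of the two cycles
(cLR and cAE) in our modified framework, with the conditional generator
G(y, z) and the deterministic conditional encoder E(y, x) treated as
opaque callables; this is a schematic of the data flow only, not our
actual training code.</p>
      <preformat>
import tensorflow as tf

def cycle_outputs(G, E, x, y, dim_z):
    """Compute both cycle paths for one batch of (x, y) pairs."""
    z = tf.random.normal([tf.shape(x)[0], dim_z])   # z ~ p_Z(z)
    # cLR path: Z -> X -> Z
    x_gen = G([y, z])
    z_cyc = E([y, x_gen])
    # cAE path: X -> Z -> X
    z_enc = E([y, x])
    x_cyc = G([y, z_enc])
    return x_gen, z_cyc, z_enc, x_cyc

# Cycle consistency losses (Eqs. 14 and 15):
# L1_Z = tf.reduce_mean(tf.abs(z - z_cyc)); L1_X = tf.reduce_mean(tf.abs(x - x_cyc))
      </preformat>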
      <sec id="sec-5-1">
        <title>Adversarial Losses</title>
        <p>
          As with the original BicycleGAN, we use two cGAN-based
loss terms to encourage realism in the outputs of G, one in
which z is sampled from the latent prior and one in which
z is encoded from an (x; y) pair. Rather than a
discriminator, we use a Wasserstein critic C
          <xref ref-type="bibr" rid="ref3">(Arjovsky, Chintala, and
Bottou 2017)</xref>
          , with a gradient penalty loss (Gulrajani et al.
2017). A discriminator is a binary classifier that can only
return values of 0 (generated) or 1 (real), but a Wasserstein
critic scores realism on a continuous scale of more negative
(more likely to be generated) to more positive (more likely
to be real). This provides more useful gradients to G during
training, but is not a fundamental change to the framework.
        </p>
        <p>The two loss terms used to train G are</p>
        <p>L_critic^cLR(G, C) = E_{y∼p_Y^data, z∼p_Z}[1 − C(x_gen)],   (9)
L_critic^cAE(G, E, C) = E_{(x,y)∼p_XY^data}[1 − C(x_cyc)],   (10)
where x_gen = G(y, z) and x_cyc = G(y, E(y, x)). The
explicit “1” indicates that G is being optimized to generate
outputs considered “real” by C.</p>
        <p>The complement to Eqs. 9 and 10 is
L_critic^real(C) = E_{x∼p_X^data}[1 − C(x)].   (11)
Gradient penalty terms ensure that C is 1-Lipschitz:
L_GP^cLR(G, C) = E_{(x,y)∼p_XY^data, z∼p_Z}[(‖∇_x̄ C(x̄_gen)‖₂ − 1)²],   (12)
L_GP^cAE(G, E, C) = E_{(x,y)∼p_XY^data}[(‖∇_x̄ C(x̄_cyc)‖₂ − 1)²],   (13)
where gradients are taken with respect to randomly weighted
averages x̄_gen = u x + (1 − u) x_gen and x̄_cyc = u x + (1 − u) x_cyc,
where u is uniform random noise on the interval [0, 1] and
‖·‖₂ is the L2 norm.</p>
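        <p>A sketch of the gradient penalty computation in Eqs. 12 and 13, for
an unconditioned critic C; the helper name and the assumption of 4-D
image tensors are ours.</p>
        <preformat>
import tensorflow as tf

def gradient_penalty(C, x_real, x_fake):
    """(||grad C(x_bar)||_2 - 1)^2 evaluated at random interpolates x_bar."""
    u = tf.random.uniform([tf.shape(x_real)[0], 1, 1, 1], 0.0, 1.0)
    x_bar = u * x_real + (1.0 - u) * x_fake
    with tf.GradientTape() as tape:
        tape.watch(x_bar)
        scores = C(x_bar)
    grads = tape.gradient(scores, x_bar)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean(tf.square(norms - 1.0))
        </preformat>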
      </sec>
      <sec id="sec-5-2">
        <title>Cycle Consistency Losses</title>
        <p>As with BicycleGAN, we use two cycle consistency loss
terms, one for the cLR path and one for the cAE path:
L_1^Z(G, E) = ‖z − z_cyc‖₁,   (14)
L_1^X(G, E) = ‖x − x_cyc‖₁,   (15)
where z_cyc = E(y, G(y, z)) and x_cyc = G(y, E(y, x)).</p>
      </sec>
      <sec id="sec-5-3">
        <title>Calculating KL Divergence for a Deterministic</title>
      </sec>
      <sec id="sec-5-4">
        <title>Autoencoder</title>
        <p>The two-output, probabilistic design of the encoder in a VAE
enables calculation of the KL divergence from p_Z(z) to
p_Z^E(z), an inherently statistical measure, from a single data
point. However, this comes at the cost of cycle consistency
in latent space, since E can no longer produce a single latent
vector from G(y, z) to compare with the original z. This is
even worse for a cVAE: in order to reconstruct data from
multiple classes, a non-conditioned VAE is forced to
partition latent space into regions corresponding to the classes,
which is in direct tension with the KL divergence loss's drive
to map every x to N(0, 1). The extra information provided
by the conditioning input negates the need to partition Z, but
that in turn means that σ|E(x) → 0 for all x, and therefore that
the reconstruction loss term defined in Eq. 3 will not be able
to learn anything meaningful. Our initial attempts to train a
BicycleGAN had precisely this problem, regardless of the
relative weights assigned to the different loss terms.</p>
        <p>A deterministic autoencoder preserves the flow of
information through the cLR path, but prevents us from
calculating the KL divergence on a per-example basis in the
cAE path. Fortunately, because KL divergence is a
statistical term, it can be calculated over a batch of training data.
We therefore switch to a batch-wise KL divergence loss,
L_KL^cAE(E) = E_{x∼p_X^data}[D_KL(N(μ_enc, σ_enc²) ‖ N(0, 1))],   (16)
where E is now deterministic and μ_enc and σ_enc are the mean
and standard deviation of z_enc = E(y, x),
respectively, calculated over a batch of training data.</p>
        <p>To make the framework symmetrical, we also include a
similar loss term on the cLR path,
L_KL^cLR(E) = E_{y∼p_Y^data, z∼p_Z}[D_KL(N(μ_cyc, σ_cyc²) ‖ N(0, 1))],   (17)
where μ_cyc and σ_cyc are calculated over a batch of z_cyc =
E(y, G(y, z)).</p>
        <p>These loss terms are unusual in that they are calculated
once over the batch, instead of once for each example,
followed by averaging over the batch. Their effectiveness
is strongly dependent on batch size: for a batch of z ∼
N(0, 1), the batch size must exceed 100 for the calculated
KL divergence to drop below 0.01. For small batch sizes,
these terms do not provide much useful information.</p>
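        <p>A minimal sketch of the batch-wise KL divergence term in Eqs. 16 and
17: the mean and variance are taken over the batch dimension of the
deterministic codes and plugged into the same closed-form Gaussian KL
used by VAEs. The helper name and the small variance floor are ours; the
last two lines check the batch-size sensitivity noted above empirically.</p>
        <preformat>
import tensorflow as tf

def batchwise_kl(z_batch, eps=1e-6):
    """D_KL(N(mu_batch, sigma_batch^2) || N(0, 1)), statistics over the batch axis."""
    mu = tf.reduce_mean(z_batch, axis=0)
    var = tf.math.reduce_variance(z_batch, axis=0) + eps
    kl = 0.5 * (var + tf.square(mu) - 1.0 - tf.math.log(var))
    return tf.reduce_mean(kl)   # average over latent dimensions

# Even for z drawn exactly from N(0, 1), small batches give a large sample KL:
for n in [10, 100, 1000]:
    print(n, float(batchwise_kl(tf.random.normal([n, 100]))))
        </preformat>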
        <p>The desired effect of these loss terms is that E(y, x) will
learn to map the information in x that is not contained in
y onto Z, the space defined by the distribution p_Z(z) =
N(0, 1). On their own, L_KL^cLR and L_KL^cAE are not sufficient to
guarantee this. For example, perhaps E could learn to
assign a separate region in Z to each class that, averaged over
a large batch, yields μ = 0, σ = 1. However, cLR cycle
consistency requires that E(y, G(y, z)) → z for all z ∼ p_Z(z).
Therefore, optimizing cLR cycle consistency together with
KL divergence requires that E(y, x) maps (x, y) pairs into
Z independent of y; or, in other words, that E learns
common semantic features present in X but not Y, and maps
those features into Z.</p>
        <p>[Figure 1: Training scheme for the generator G and encoder E, showing
the cLR path (z, y → G → x_gen → E → z_cyc) and the cAE path
(x, y → E → z_enc → G → x_cyc), the critic losses L_critic^cLR and
L_critic^cAE, the reconstruction losses L_1^Z and L_1^X, and the KL
divergence losses L_KL^cLR and L_KL^cAE.]</p>
      </sec>
      <sec id="sec-5-5">
        <title>Full Model</title>
        <p>The models are trained according to</p>
        <p>G∗ = arg min_G [L_critic^cLR + L_critic^cAE + λ_1X L_1^X],   (18)
E∗ = arg min_E [L_KL^cLR + L_KL^cAE + λ_1Z L_1^Z],   (19)
C∗ = arg min_C [L_critic^real − (1/2)(L_critic^cLR + L_critic^cAE) + L_GP^cLR + L_GP^cAE].   (20)
G attempts to generate realistic outputs, regardless of
whether its z input comes from the prior distribution or E,
while also attempting to invert E. E attempts to generate
latent outputs consistent with the latent prior distribution,
regardless of whether its x inputs come from the training data
or G, while also attempting to invert G. C attempts to learn
to differentiate between ground truth and generated x from
both the cLR and cAE paths. The training scheme for G and
E is shown in Fig. 1. The training scheme for C is standard
for a Wasserstein critic except that there are two paths for
generated samples; the factor of 1=2 in Eq. 20 ensures that
real and generated x are weighted equally so C does not just
label everything as fake.</p>
        <p>This formulation is symmetric. G is only trained to
optimize cAE cycle consistency and the adversarial loss, a
measure of realism or the (modeled) likelihood that G(y; : : : ) 2
X . Similarly, E is only trained to optimize cLR cycle
consistency and the KL divergence loss, a measure of the
likelihood that E(y; : : : ) 2 Z. The lack of common loss
functions between G and E keeps them from learning to cheat.</p>
        <p>We note that there is some redundancy between Eqs. 10
and 15, and between Eqs. 14 and 17. If E and G do truly
learn to invert one another, as incentivized by Eqs. 14 and 15,
then Eqs. 10 and 17 will no longer provide useful gradients,
but at worst this might lead to some wasted computations
late in training.</p>
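        <p>A schematic training step reflecting Eqs. 18 and 19, with G and E
deliberately updated on disjoint loss sets (critic update omitted). The
weight values are placeholders rather than the ones we used, and the
sketch reuses the batchwise_kl helper shown earlier.</p>
        <preformat>
import tensorflow as tf

opt_G = tf.keras.optimizers.Adam()
opt_E = tf.keras.optimizers.Adam()

def update_G_and_E(G, E, C, x, y, z, lam_1x=10.0, lam_1z=10.0):
    # G: adversarial terms + cAE cycle consistency (Eq. 18)
    with tf.GradientTape() as tape:
        x_gen = G([y, z])
        x_cyc = G([y, E([y, x])])
        loss_G = (tf.reduce_mean(1.0 - C(x_gen)) +
                  tf.reduce_mean(1.0 - C(x_cyc)) +
                  lam_1x * tf.reduce_mean(tf.abs(x - x_cyc)))
    opt_G.apply_gradients(zip(tape.gradient(loss_G, G.trainable_variables),
                              G.trainable_variables))
    # E: batch-wise KL terms + cLR cycle consistency (Eq. 19)
    with tf.GradientTape() as tape:
        z_cyc = E([y, G([y, z])])
        z_enc = E([y, x])
        loss_E = (batchwise_kl(z_cyc) + batchwise_kl(z_enc) +
                  lam_1z * tf.reduce_mean(tf.abs(z - z_cyc)))
    opt_E.apply_gradients(zip(tape.gradient(loss_E, E.trainable_variables),
                              E.trainable_variables))
    return loss_G, loss_E
        </preformat>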
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Methods</title>
      <p>Our code is publicly available at https://github.com/
USArmyResearchLab/ARL_Representativeness_of_Cyclic_GANs.</p>
      <sec id="sec-6-1">
        <title>Dataset</title>
        <p>
          We attempt to solve a simple super-resolution inverse
problem. Fashion-MNIST
          <xref ref-type="bibr" rid="ref9">(Xiao, Rasul, and Vollgraf 2017)</xref>
          is a
collection of 28 × 28 grayscale images separated into 10
classes of clothing and accessories. Each class has 6,000
training examples and 1,000 test examples. Subjectively,
most examples in each class fall into a small number of
clusters—for example, most images in the “trousers” category
look very similar, while there are several different apparent
“sub-categories” among T-shirts. Roughly 10% or fewer of the
examples in each class vary strongly from the rest of that
class, while the remaining ~90% are very similar. Some
classes (e.g., T-shirt and shirt) have significant overlap.
        </p>
        <p>We downsample the images once, using 2 × 2 average
pooling, to get a set of 14 × 14 images, x′, and then again
to get a corresponding set of 7 × 7 images, y′. We then
upscale y′ by doubling the number of pixels in each dimension
to get very low-resolution 14 × 14 conditioning images y, and
use those to obtain the residuals x = x′ − y. We rescale (x, y)
pairs to be on the interval [−1, 1], with training and test data
rescaled separately.</p>
        <p>The original images in Fashion-MNIST are already
low-resolution, and this much downsampling destroys significant
feature information. Many x can plausibly be obtained from
a given y, according to some distribution p_{X|Y=y}(x).
Although the dataset only includes one (x, y) pair per
example, rather than a distribution p_{X|Y=y}^data(x), we can still
justify supervised training using p_{XY}^data(x, y). This is discussed
in the Appendix.</p>
        <p>
          Conditioning C vs. X → Y Consistency Loss
Depending on the problem, it may be inappropriate to
condition C. For example, in image inpainting, y includes the
mask that defines the region to be filled in, which can be
exploited to identify perceptual discontinuities between the
original and generated portions of the image (Pathak et al.
2016). For our problem, this is not the case.
        </p>
        <p>A conditioning input allows C to ask whether G(y, ·) is
consistent with y. We know the map X → Y in our case; for
a given x to be perfectly consistent with y, a 2 × 2-average-
pool-downsampled x must be 0 everywhere. By applying
this to G(y, ·), we can separate this consistency from C
as a pair of supplemental loss terms,</p>
        <p>L_XY^cLR(G) = λ_XY^cLR ‖downsample(x_gen)‖₁,   (21)
L_XY^cAE(G, E) = λ_XY^cAE ‖downsample(x_cyc)‖₁.   (22)
We do not observe a significant difference in results between
implementing a conditional C(y, x) vs. an unconditioned
C(x) plus these supplemental losses, in terms of visual
quality. The supplemental losses enforce the desired consistency
explicitly, so we use them in our experiments.</p>
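        <p>The consistency check itself is just average pooling of the generated
residual; a small sketch of the supplemental term in Eqs. 21 and 22,
assuming NHWC image tensors (the helper name is ours):</p>
        <preformat>
import tensorflow as tf

def xy_consistency_loss(x_out, weight=1.0):
    """||downsample(x_out)||_1: a generated residual consistent with y
    must average-pool to zero."""
    pooled = tf.nn.avg_pool2d(x_out, ksize=2, strides=2, padding='VALID')
    return weight * tf.reduce_mean(tf.abs(pooled))
        </preformat>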
        <p>We use deep ResNeXt networks with efficient grouped
convolutions and identity shortcuts
          <xref ref-type="bibr" rid="ref10 ref11">(Xie et al. 2016)</xref>
          and O(4 × 10⁶) parameters in each of G, E, and C.
Activations are all LeakyReLU followed by layer normalization,
except for the generator output, which uses tanh. We inject
latent vectors z only at the top layer of the model. We test
dim(z) ∈ {10, 100, 1000}, finding that 100 performs best
overall and therefore use that for most experiments.</p>
        <p>
          We perform training in TensorFlow, using the Adam
optimizer with default parameters. We choose a batch size
of 200, since our batch-wise KL divergence is only useful
for fairly large batch sizes. We train using instance noise
          <xref ref-type="bibr" rid="ref6">(Sønderby et al. 2016)</xref>
          over x, x_gen, and x_cyc, replacing
x[·] ← α x[·] + (1 − α)u, where α anneals from 0 to
1 in increments of 0.01 over the first 100 epochs of training,
and u is uniform random noise on the interval [−1, 1].
        </p>
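        <p>A sketch of the instance-noise schedule as described above (α
annealed from 0 to 1 in steps of 0.01 per epoch over the first 100
epochs, with uniform noise on [−1, 1]); the helper name is ours.</p>
        <preformat>
import tensorflow as tf

def add_instance_noise(x, epoch):
    """Blend x with uniform noise; the noise weight (1 - alpha) anneals to zero."""
    alpha = tf.minimum(1.0, 0.01 * epoch)
    u = tf.random.uniform(tf.shape(x), minval=-1.0, maxval=1.0)
    return alpha * x + (1.0 - alpha) * u
        </preformat>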
        <p>For every update of G and E, we train C continuously
until its validation loss fails to improve for 5 consecutive
batches to ensure it is approximately converged. Training
continues until all models’ validation losses have failed to
improve for 20 consecutive epochs.</p>
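        <p>The critic schedule can be written as a simple inner loop; this is a
sketch in which the single-batch update and the validation loss
computation are placeholder callables, not our actual pipeline.</p>
        <preformat>
def train_critic_until_converged(critic_step, val_loss_fn, patience=5):
    """Keep updating C until its validation loss stops improving for `patience` batches."""
    best = float('inf')
    stale = 0
    while stale != patience:
        critic_step()          # one C update on a training batch
        loss = val_loss_fn()   # critic loss on a held-out batch
        if loss + 1e-8 > best:
            stale += 1
        else:
            best, stale = loss, 0
        </preformat>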
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Results and Discussion</title>
      <p>We test variational vs. deterministic encoders; different
choices of hyperparameters and loss weights; and allowing
vs. disallowing overlap between loss functions used to train
E and G. The outcomes of these tests share two
commonalities: G produces diverse and realistic predictions,
exemplified in Fig. 2, but without cycle consistency or distribution
matching in Z. Thus, their maps are not bijective and do not
map the latent space onto the true data distribution.</p>
      <sec id="sec-7-1">
        <title>Variational Frameworks</title>
        <p>First, we test the BicycleGAN framework (the variational
method) with a Wasserstein critic, trained with Eqs. 6
(modified to include Eqs. 9 and 10), 7, and 20. As noted
previously, this generates realistic images on par with our other
tests, but even strongly weighting L1Z does not produce any
cLR cycle consistency, with (μ, σ) going to (0, 1) very rapidly.
This is true regardless of whether we train E but not G on
L1Z and/or do not train E on L1X .</p>
      </sec>
      <sec id="sec-7-2">
        <title>Deterministic Frameworks</title>
        <p>As mentioned previously, we observe some steganographic
collaboration between models that train G and E with
overlapping loss functions. For example, some information is
hidden in the black background of each xgen as an
imperceptible, low-amplitude signal. Truncating the values of
those pixels to 1 with no other changes produces a sizable
change in kz zcyck1, increasing it from 0:07 to 0:4. This
emphasizes that cycle consistency in the models does not
mean they have learned a meaningful map. We therefore
restrict our experiments to models trained without overlapping
loss functions, using Eqs. 18–20.</p>
        <p>In this scenario we observe no cLR cycle consistency,
as E is unable to extract latent information from xgen. We
see this regardless of the relative weights in Eq. 19, even if
we weight LcKLLR and LcKALE independently. A typical result is
shown in Fig. 3a: the model is penalized by L1Z for wrongly
predicting zcyc, and is also unable to find a path to learn to
predict zcyc correctly. Accordingly, given some z (red line) it
simply predicts z_cyc ≈ 0 for all y (black lines), which results in a
smaller penalty than if it had predicted a nonzero, incorrect
zcyc. This is a failure to optimize both L1Z , because the red
and black lines do not overlap, and LcKLLR, because the
standard deviation of the elements in zcyc is much less than 1.
Only when we set λ_1Z = 0 are the KL-divergence loss terms
able to enforce good statistics.</p>
        <p>This is not because E is simply unable to learn to invert G.
Rather, it appears to be unable to do so quickly and robustly.
Once the model converges, we train E only for 1000 more
epochs, holding G and C constant. This does slightly
improve cycle consistency in z, as shown in Fig. 3b. L1Z takes
“only” several hundred epochs to plateau, so training time is
not the only limiting factor. The small overall improvement
in L1Z (about 4%) belies a noticeable qualitative change, due
to a minority of z-coordinates being very wrong. This still
falls far short of allowing E to truly invert G, and further, it
is not stable, disappearing with another update to G.</p>
        <p>L1X has limited success in optimizing cAE cycle
consistency. This happens to some extent even if we do not
optimize on L1Z , reflecting the fact that optimizing realism
pseudo-optimizes the L1 distance between x and x_cyc. L_1^X
then depends on only a handful of pixels for most images,
which dominate the expected value in Eq. 15. In support of
this, we find that xcyc does not reproduce rare features such
as text, symbols, and some patterns and orientations.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>Our experiments are consistent with the idea that
cycle-consistent GANs are not good vehicles for obtaining
representative maps, largely because they do not learn cycle
consistency well in the first place. They still produce diverse and
realistic outputs, but are not representative, in effect
mapping onto an unknown subset of X .</p>
      <p>Two-cycle consistency is a surrogate for bijectivity. The
fact that it is so difficult to train G and E to invert one
another even for a simple problem such as the one we test
indicates that an explicit guarantee of bijectivity is probably the
best path forward for achieving this, which in turn will allow
inverse problems to be solved rigorously and
probabilistically via simple Monte Carlo sampling. In the absence of
such a guarantee, we believe representativeness in the
generated distribution must be explicitly tested for in generative
models, especially when risk or bias assessment, uncertainty
quantification, and similar considerations are important.</p>
      <p>INNs are inherently bijective and do not require explicit
enforcement of two-cycle consistency. It seems likely that,
even if there is a way to enforce bijectivity in a GAN-like
construct, INNs provide a simpler path to this result.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>The views and conclusions contained in this document are
those of the authors and should not be interpreted as
representing the official policies, either expressed or implied, of
the US Army Combat Capabilities Development Command
(DEVCOM) Army Research Laboratory or the U.S.
Government. The U.S. Government is authorized to reproduce and
distribute reprints for Government purposes notwithstanding
any copyright notation herein. The authors thank Dr. Ting
Wang for valuable discussions and Mr. Matt Ziemann for
careful reading of the manuscript. Computer time was
provided by the DEVCOM ARL DSRC.</p>
    </sec>
    <sec id="sec-10">
      <title>Appendix: Sampling Inverse Problem</title>
    </sec>
    <sec id="sec-11">
      <title>Solutions</title>
      <p>G and E implicitly define distributions p_{X|Y=y}^G(x) and
p_Z^E(z), respectively: for any function f,
∫_{R^m} p_Z(z) f(G(y, z)) dz = ∫_{R^n} p_{X|Y=y}^G(x) f(x) dx,   (A.1)
∫_{R^n} p_{X|Y=y}(x) f(E(y, x)) dx = ∫_{R^m} p_Z^E(z) f(z) dz,   (A.2)
or, equivalently,
E_{z∼p_Z}[f(G(y, z))] = E_{x∼p_{X|Y=y}^G}[f(x)],   (A.3)
E_{x∼p_{X|Y=y}}[f(E(y, x))] = E_{z∼p_Z^E}[f(z)].   (A.4)
If our model performs as desired, the distribution of points
in R^n obtained by sampling z ∼ p_Z(z) and evaluating
G(y, z) should resemble the true conditional distribution
of possible x consistent with a given y, p_{X|Y=y}(x).
Similarly, the distribution of points in R^m obtained by sampling
x ∼ p_{X|Y=y}(x) and evaluating E(y, x) should resemble
the latent prior, p_Z(z). However, while we can sample z
from the latent prior, we cannot sample x from the true
conditional distribution of X, since we usually have only one
pair each of ground truth (y, x) in our training data.</p>
      <p>Fortunately, we can get around this if the learned encoder
distribution, p_Z^E(z), is independent of Y, which is a
reasonable assumption if the reconstruction process can identify
common features applicable to many different conditioners,
and is further supported by the KL divergence loss terms.</p>
      <p>Taking the expectation under p_Y(y) in Eq. A.2 gives
∫∫ p_{X|Y=y}(x) p_Y(y) f(E(y, x)) dx dy = ∫_{R^m} p_Z^E(z) f(z) dz ∫ p_Y(y) dy.   (A.5)
The second integral on the RHS evaluates to one by
definition, and Bayes's theorem lets us rewrite the LHS to get
∫∫ p_{XY}(x, y) f(E(y, x)) dy dx = ∫_{R^m} p_Z^E(z) f(z) dz,   (A.6)
where p_{XY}(x, y) is the true joint probability density
function of X and Y. Unlike p_{X|Y=y}(x), we can sample from
p_{XY}(x, y); the training data, p_{XY}^data(x, y), does just that. We
can then enforce a distribution constraint on p_Z^E(z) as usual.</p>
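      <p>In practice, the constraint implied by Eq. A.6 can be checked (or
enforced) empirically by pushing sampled training pairs through E and
comparing the resulting codes with the prior. A sketch of such a check,
assuming a trained encoder E callable as E(y, x) and arrays x_train,
y_train; the function name and test choice are ours.</p>
      <preformat>
import numpy as np
from scipy import stats

def check_latent_prior(E, x_train, y_train, n=1000, seed=0):
    """Sample (x, y) from the joint training data, encode, and compare with N(0, 1)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_train), size=n, replace=False)
    z = np.asarray(E(y_train[idx], x_train[idx]))   # shape (n, dim_z)
    print('per-dimension mean:', z.mean(axis=0).round(2))
    print('per-dimension std: ', z.std(axis=0).round(2))
    # Per-dimension Kolmogorov-Smirnov test against a standard normal.
    pvals = [stats.kstest(z[:, i], 'norm').pvalue for i in range(z.shape[1])]
    return np.array(pvals)
      </preformat>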
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ardizzone</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kruse</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Wirkert,
          <string-name>
            <given-names>S. J.</given-names>
            ;
            <surname>Rahner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ;
            <surname>Pellegrini</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. W.</surname>
          </string-name>
          ; Klessen,
          <string-name>
            <given-names>R. S.</given-names>
            ;
            <surname>Maier-Hein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ;
            <surname>Rother</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          ; and Köthe,
          <string-name>
            <surname>U.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Analyzing inverse problems with invertible neural networks</article-title>
          .
          <source>CoRR abs/1808</source>
          .04730.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2019.
          <article-title>Guided image generation with conditional invertible neural networks</article-title>
          .
          <source>CoRR abs/1907</source>
          .02392.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Arjovsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chintala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Wasserstein GAN</article-title>
          .
          <source>CoRR abs/1701</source>
          .07875.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhmoginov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and Sandler,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Cyclegan, a master of steganography</article-title>
          .
          <source>CoRR abs/1712</source>
          .02950.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          2014.
          <article-title>Generative adversarial nets</article-title>
          . In
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lawrence</surname>
          </string-name>
          , N. D.; and
          <string-name>
            <surname>Weinberger</surname>
          </string-name>
          , K. Q., eds.,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>27</volume>
          . Curran Associates, Inc.
          <fpage>2672</fpage>
          -
          <lpage>2680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Sønderby</surname>
            ,
            <given-names>C. K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Caballero</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Theis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and Huszár,
          <string-name>
            <surname>F.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Amortised MAP inference for image superresolution</article-title>
          .
          <source>CoRR abs/1610</source>
          .04490.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Wiyatno</surname>
            ,
            <given-names>R. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dia</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>de Berker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Adversarial examples in modern machine learning: A review</article-title>
          .
          <source>CoRR abs/1911</source>
          .05268.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rasul</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and Vollgraf,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms</article-title>
          .
          <source>CoRR abs/1708</source>
          .07747.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Girshick,
          <string-name>
            <given-names>R. B.</given-names>
            ; Dollár, P.;
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ; and
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>Aggregated residual transformations for deep neural networks</article-title>
          .
          <source>CoRR abs/1611</source>
          .05431.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Zhang,
          <string-name>
            <given-names>X.</given-names>
            ;
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            ;
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            ; Xue, J.; and
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <surname>Q.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Deep learning for single image superresolution: A brief review</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>21</volume>
          (
          <issue>12</issue>
          ):
          <fpage>3106</fpage>
          -
          <lpage>3121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          , J.-Y.;
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Isola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Efros</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          <year>2017a</year>
          .
          <article-title>Unpaired image-to-image translation using cycle-consistent adversarial networks</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          , 2223-
          <fpage>2232</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Zhu</surname>
            , J.-Y.; Zhang, R.; Pathak,
            <given-names>D.</given-names>
          </string-name>
          ; Darrell,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Efros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            ;
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            ; and
            <surname>Shechtman</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.</surname>
          </string-name>
          <year>2017b</year>
          .
          <article-title>Toward multimodal image-to-image translation</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <volume>465</volume>
          -
          <fpage>476</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>