<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Modal Generative Adversarial Networks Make Realistic and Diverse but Untrustworthy Predictions When Applied to Ill-posed Problems</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>John S. Hyatt, Michael S. Lee Computational &amp; Information Sciences Directorate, DEVCOM Army Research Laboratory</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Ill-posed problems can have a distribution of possible solutions rather than a unique one, where each solution incorporates significant features not present in the initial input. We investigate whether cycle-consistent generative neural network models based on generative adversarial networks (GANs) and variational autoencoders (VAEs) can properly sample from this distribution, testing on super-resolution of highly downsampled images. We are able to produce diverse and plausible predictions, but, looking deeper, we find that the statistics of the generated distributions are substantially wrong. This is a critical flaw in applications that require any kind of uncertainty quantification. We trace this to the fact that these models cannot easily learn a bijective, invertible map between the latent space and the target distribution. Additionally, we describe a simple method for constraining the distribution of a deterministic encoder's outputs via the Kullback-Leibler divergence without the reparameterization trick used in VAEs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A problem is well-posed if it satisfies three criteria
(Hadamard 1902): (1) the problem has a solution, (2) the
solution is unique, and (3) the solution is a continuous
function of the initial conditions. Many real-world problems of
interest are inherently ill-posed, meaning they violate one or
more of these criteria, and modeling them correctly remains
one of the outstanding challenges in machine learning (ML).
Out-of-distribution inputs (data the model was not trained to
understand) violate the first criterion, while adversarial
examples exist because ML models tend to be very unstable
with regard to small, carefully chosen perturbations
        <xref ref-type="bibr" rid="ref7">(Wiyatno et al. 2019)</xref>
        , effectively violating the third.
      </p>
      <p>Violations of the second criterion can occur for relatively
simple tasks such as classification of ambiguous inputs
(Peterson et al. 2019), but they are ubiquitous in generative
modeling tasks, where the desired output is complex and
high-dimensional. The strongest possible model, given this
type of ill-posed problem, is one that estimates the
(conditional) posterior distribution of possible solutions.</p>
      <p>
        Depending on the use case, it may not be necessary to
model the full posterior; for example, if the objective is
purely aesthetic (Pathak et al. 2016;
        <xref ref-type="bibr" rid="ref12">Yang et al. 2019</xref>
        ).
For safety-critical applications, however, proper risk
management requires quantifying the model’s predictive
uncertainty, as well as the error introduced when the model is used
to approximate the true target distribution. The same
considerations apply if the model feeds into some downstream
analysis or decision-making process. Despite this, the
literature on probabilistic generative neural network (NN) models
rarely contains explicit verification of the learned
distribution’s statistics. Often, what is actually verified is that the
generative model produces realistic outputs or has low
reconstruction error in the data domain. Optimizing realism
incidentally pseudo-optimizes the error in reconstructing data
in a high-dimensional space from a low-dimensional latent
representation, even if the model has not learned to encode
and reconstruct features well. Prediction diversity and
multimodality are usually only discussed qualitatively.
      </p>
      <p>We examine several cycle-consistent architectures
incorporating elements of popular generative NN models, namely
generative adversarial networks (GANs) and variational
autoencoders (VAEs), as well as deterministic encoders. Our
focus on cycle-consistent architectures is motivated by the
fact that GANs and VAEs do not contain a mechanism for
reversing the generative transformation. By examining the
learned maps between latent space (which has a simple prior
sampled during generative inference) and feature space, we
verify that even for simple problems, these architectures
cannot model the true data distribution. This failure appears to
be due to the models being neither invertible nor bijective, a
state of affairs that persists even when the models are
converted to deterministic maps. Our main contributions are:
• A null result, namely that even if generative models
produce diverse and realistic predictions, they do not learn
to bijectively map a latent distribution onto the true
distribution represented by the training data. We use highly
expressive models and follow best practices in network
design and training; at minimum, our results argue that
proper statistical behavior cannot be taken for granted. We
suspect this will not surprise many ML practitioners, but we
believe it is worth testing explicitly.
We have been unable to find any examples in the literature
on GANs and VAEs that explicitly test for these
properties, although some related concepts are well known, like
the fact that gaps exist in a VAE’s coverage in latent space.
• A simple extension of VAEs to deterministic latent
vector sampling. Instead of using the reparameterization trick
to sample from a multivariate normal distribution with
learned mean and variance, we preserve the flow of
information through the encoder-decoder stack and optimize
KL divergence over batches of training examples.
• A proof that sampling from the space of solutions to a
conditional inverse problem only requires pairs of
conditioning information/ground truth examples, even when
the conditioning information has high dimensionality
(such as for super-resolution).</p>
      <p>
        Our work does not come close to exploring all possible
GAN- or VAE-based generative models, and it is
possible that another architecture would learn a bijective map.
We choose BicycleGAN as our starting point, as it is
a state-of-the-art example of such models. Our chosen
dataset, Fashion-MNIST
        <xref ref-type="bibr" rid="ref9">(Xiao, Rasul, and Vollgraf 2017)</xref>
        ,
is also very simple compared to standards like CIFAR-10
(Krizhevsky 2009) or ImageNet (Russakovsky et al. 2014).
This is precisely our argument: if we cannot learn the
statistics of even an “easy” dataset, given a reasonable choice of
high-capacity model, we should assume any of this family
of multi-modal generative models is statistically unreliable
unless proven otherwise.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Generative NNs come in a variety of flavors. GANs and
VAEs are the most studied; together with invertible neural
networks (INNs), they encompass most methods that
generate data from a latent space representation. Other methods
exist that are based on sequential (Parmar et al. 2018) or
Bayesian neural networks (Saatci and Wilson 2017).</p>
      <p>A note on notation: in this paper, we use calligraphic
letters, X , for sets; upper-case letters, X, for random variables;
(bold) lower-case letters, x, for their (vector) values; and p
for their probability densities.</p>
      <p>Fundamentally, GANs, VAEs, and INNs operate
similarly during the forward (generative) process: a generator
G : R^m → R^n maps a vector z ∈ Z ⊂ R^m to a point x ∈ X ⊂ R^n.
Z is the space of points sampled from some simple latent prior
distribution, z ∼ p_Z(z), and X is the complex space represented
by the training data p_X^data(x); for example, a collection of
images belonging to some category. The true underlying
distribution of the data, p_X(x), is unknown.</p>
      <p>For GANs and VAEs, m ≪ n; the latent representation
is compressed, with components of z corresponding to the
presence or absence of major features in X. Under certain
circumstances, simple arithmetic operations on a vector z
can add or subtract semantic features in the corresponding
x (Radford, Metz, and Chintala 2016). For standard INNs,
m = n, so there is no compression. Typically, the latent
prior distribution is taken to be as simple as possible, for
example p_Z(z) = N(0, 1); this does not preclude a
homeomorphic map between Z and X, but may complicate it
(Pérez Rey, Menkovski, and Portegies 2019).</p>
      <p>The details, including the procedure for training G, vary
between the different types of generative model. We briefly
discuss the specific properties of GANs and VAEs that
complicate inverse problem solving below, with comparison to
INNs, which are explicitly invertible but less well-studied.</p>
      <sec id="sec-2-1">
        <title>Generative Adversarial Networks (GANs)</title>
        <p>A basic GAN training algorithm contains two models, a
generator G and a discriminator D (Goodfellow et al.
2014). D is a binary classifier trained to differentiate
between real training data x_real and generated data x_gen =
G(z), while G is trained to generate outputs that appear real
to D. The two models are trained alternately, with the goal
that D should eventually learn to reject any x_gen ∉ X and
push G(z) ∈ X for all z ∈ Z.</p>
        <p>However, there is no guarantee that the distribution
modeled by G, p_X^G(x), is the same as or even close to the true
distribution, p_X(x). Mode collapse, where G(z) outputs the
same realistic x_gen for all z, is only the most extreme example
of this. A diverse, multi-modal p_X^G(x) is clearly better than
the delta function distribution modeled by a mode-collapsed
GAN, but diversity is not the same as representativeness, and
without a theoretical guarantee or explicit testing, G cannot
be a trustworthy model of the true distribution of solutions to
an inverse problem. In fact, GANs do not explicitly attempt
to model probability densities at all; moreover, on its own, a
GAN generator cannot be inverted to map some x into Z.</p>
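        <p>As an illustration of the alternating scheme described above, the
following minimal sketch updates D on real and generated batches, then
updates G to fool D. The models G and D, the optimizer settings, and the
simple cross-entropy losses are illustrative assumptions, not the exact
choices of any particular GAN paper.</p>
        <preformat>
import tensorflow as tf

# Hypothetical generator G and discriminator D (any tf.keras models with
# compatible shapes; D is assumed to output logits).
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)
opt_D = tf.keras.optimizers.Adam(1e-4)
opt_G = tf.keras.optimizers.Adam(1e-4)

def gan_train_step(G, D, x_real, dim_z):
    z = tf.random.normal([tf.shape(x_real)[0], dim_z])  # z ~ p_Z(z)
    # Update D: real examples toward "real", generated toward "fake".
    with tf.GradientTape() as tape:
        x_gen = G(z, training=True)
        s_real, s_fake = D(x_real, training=True), D(x_gen, training=True)
        loss_D = (cross_entropy(tf.ones_like(s_real), s_real) +
                  cross_entropy(tf.zeros_like(s_fake), s_fake))
    opt_D.apply_gradients(zip(tape.gradient(loss_D, D.trainable_variables),
                              D.trainable_variables))
    # Update G: push D(G(z)) toward "real".
    with tf.GradientTape() as tape:
        s_fake = D(G(z, training=True), training=True)
        loss_G = cross_entropy(tf.ones_like(s_fake), s_fake)
    opt_G.apply_gradients(zip(tape.gradient(loss_G, G.trainable_variables),
                              G.trainable_variables))
    return loss_D, loss_G
        </preformat>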
      </sec>
      <sec id="sec-2-2">
        <title>Variational Autoencoders (VAEs)</title>
        <p>A basic VAE also incorporates two models, an encoder E
and a decoder G (Kingma and Welling 2013). E(x)
encodes x into Z, and G maps the latent vector back into
X. During training, E and G are updated simultaneously to
minimize (i) some measure of the distance between x and
G(E(x)), and (ii) the error introduced by approximating
p_Z^E(z), the latent distribution modeled by E, as p_Z(z), the
latent prior. The latter is expressed in terms of the KL
divergence (Kullback and Leibler 1951) from p_Z(z) to p_Z^E(z).
During inference, E is discarded, latent vectors are sampled
from the prior via z ∼ p_Z(z), and samples are generated via
x_gen = G(z).</p>
        <p>Enforcing p_Z^E(z) ≈ p_Z(z) is critical, as inputs to G
are drawn from the former during training, but the latter
during inference. This is typically done by assuming that
p_Z^E(z) = N(μ, σ²), where μ and σ² are the mean and
variance of E(x), respectively, and p_Z(z) = N(0, 1). (Other
latent priors are possible, but the standard normal is most
common.) KL divergence is a simple function of μ and σ in
this case (Kingma and Welling 2013).</p>
        <p>However, μ and σ are statistical measures, meaning that
they are defined in terms of a large number of observations,
and a particular x represents only a single observation. Thus,
rather than learning a deterministic encoding E(x) = z_enc,
E(x) predicts two vectors, μ and σ, that define a point cloud
in latent space. Monte Carlo sampling of that point cloud is
performed via the reparameterization trick, z_enc = μ + σ ⊙ ε,
where ε ∼ N(0, 1). μ and σ are themselves deterministic,
making it easier to take the gradient of E's weights with
respect to its outputs via backpropagation, but this variational
approximation adds irreducible noise to the generative
process and means E cannot invert G. Therefore, while a VAE
can help G map every point in Z to a realistic x, the map is
not bijective, which makes it difficult to verify its statistical
properties or perform analysis in the latent space.</p>
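        <p>For reference, a minimal sketch of the reparameterization step and
the closed-form Gaussian KL term described above; the function names and
the log-variance parameterization are ours, chosen for numerical
convenience rather than taken from any specific implementation.</p>
        <preformat>
import tensorflow as tf

def sample_latent(mu, log_var):
    """Reparameterization trick: z_enc = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, 1)), per example."""
    # 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2) over latent dimensions
    return 0.5 * tf.reduce_sum(tf.exp(log_var) + tf.square(mu) - 1.0 - log_var,
                               axis=-1)

# Usage: mu, log_var = E(x); z = sample_latent(mu, log_var); x_rec = G(z)
        </preformat>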
      </sec>
      <sec id="sec-2-3">
        <title>Invertible Neural Networks (INNs)</title>
        <p>
          INNs
          <xref ref-type="bibr" rid="ref1">(Ardizzone et al. 2018)</xref>
          are composed of a stack of
operations that are invertible by construction, meaning that
the entire network can be inverted cheaply. Thus, INNs learn
a bijective map between Z and X and E is simply G⁻¹. This
means that solving the inverse problem is, in principle, as
easy as inverting an INN that has been trained on the forward
problem; bidirectional training is possible as well. Notably,
these maps also have a tractable Jacobian, which means that
the unknown data distribution can be written explicitly in
terms of the latent prior. These properties address some of
the shortcomings of other types of generative models.
        </p>
        <p>The main disadvantage of INNs seems to be that they
are a recent development, and consequently have not been
refined to the same extent as GANs and VAEs. Also, as a
consequence of guaranteeing bijectivity, Z has the same
dimensionality as X , meaning further processing of the latent
space is necessary to efficiently represent the data and
perform certain types of analysis such as feature extraction,
feature arithmetic, and anomaly detection.</p>
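        <p>The invertibility-by-construction can be illustrated with a single
affine coupling block of the kind used in such networks. This is a
simplified NumPy sketch with an arbitrary toy scale/shift function; real
INNs stack many such blocks with permutations and learned subnetworks.</p>
        <preformat>
import numpy as np

def coupling_forward(x, st_net):
    """Transform half of x conditioned on the other half; trivially invertible."""
    x1, x2 = np.split(x, 2, axis=-1)
    s, t = st_net(x1)            # st_net need not be invertible itself
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2], axis=-1)

def coupling_inverse(y, st_net):
    y1, y2 = np.split(y, 2, axis=-1)
    s, t = st_net(y1)
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=-1)

def st_net(h):
    # Toy scale/shift "network": any function of the untouched half works.
    return np.tanh(h), 0.5 * h

x = np.random.randn(4, 8)
assert np.allclose(coupling_inverse(coupling_forward(x, st_net), st_net), x)
        </preformat>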
      </sec>
      <sec id="sec-2-4">
        <title>Conditional Generative Models</title>
        <p>
          All of these types have been extended to the conditional
case, when the generator outputs are conditioned on some
partial information, such as class label. cGANs (Mirza and
Osindero 2014) and cVAEs (Sohn, Lee, and Yan 2015) have
been heavily studied since they were introduced several
years ago; cINNs (Liu et al.
          <xref ref-type="bibr" rid="ref2">2019; Ardizzone et al. 2019</xref>
          )
are relatively new. Regardless, the basic idea is the same:
G has a second, conditioning input y ∈ Y ⊂ R^l, where Y
is the space of conditioning data, and implicitly represents
the conditional distribution p_{X|Y=y}^G(x), which may or may
not be verifiably close to the true conditional distribution,
p_{X|Y=y}(x). In the case at hand, our training data includes a
set of conditioning information, p_Y^data(y), where each y
corresponds to an x in p_X^data(x). Together, these (x, y) pairs
represent samples from the joint distribution p_{XY}^data(x, y).</p>
      </sec>
      <sec id="sec-2-5">
        <title>Multimodal Models</title>
        <p>
          Conditional GANs have been incredibly successful in
mapping from one complex data space to another, but not in
learning to predict distributions of solutions to ill-posed
problems. The pix2pix framework (Isola et al. 2017)
produced a model G : X′ → X that mapped, deterministically
and in one direction, from a point x′ ∈ X′ to a point x ∈ X.
CycleGAN
          <xref ref-type="bibr" rid="ref13 ref14 ref4">(Zhu et al. 2017a)</xref>
          extended this to include
another model F : X → X′, by including a cycle consistency
loss to encourage F and G to invert one another. Neither was
able to incorporate stochasticity via random sampling of z;
even when z was included as a second input in pix2pix, the
model simply learned to ignore it, although some stochastic
elements could be introduced by including random dropout
in the model. Further, CycleGAN owed its success in part to
the fact that the cycle consistency loss function
simultaneously optimized F and G, leading them to cheat by
encoding hidden information in their predictions
          <xref ref-type="bibr" rid="ref4">(Chu,
Zhmoginov, and Sandler 2017)</xref>
          .
        </p>
        <p>
          BicycleGAN
          <xref ref-type="bibr" rid="ref13 ref14">(Zhu et al. 2017b)</xref>
          attempted to rectify these
shortcomings by combining components from GANs and
VAEs. As it was the inspiration for this work, BicycleGAN
is discussed in more detail below.
        </p>
        <p>Although they are not our focus, we also note that
probabilistic models like Bayesian neural networks are inherently
more suitable for modeling multi-modality, though less so
for learning bijective maps or latent space representations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Principled Two-cycle-consistent Generative</title>
    </sec>
    <sec id="sec-4">
      <title>Models for Inverse Problems</title>
      <p>
        We chose the BicycleGAN framework
        <xref ref-type="bibr" rid="ref13 ref14">(Zhu et al. 2017b)</xref>
        as a starting point. Because several of our design choices
are intended to address specific concerns we have with this
framework, we briefly repeat it here. We then describe our
modified framework and justify it as a principled attempt to
build a two-cycle-consistent generative model, introducing
a new method for constraining the distribution output of a
VAE-like encoder without losing cycle-critical information
during a reparameterization step.
      </p>
      <sec id="sec-4-1">
        <title>BicycleGAN</title>
        <p>BicycleGAN simultaneously trains a conditional generator
G : Y; Z ! X ; a VAE-based encoder E : X ! Z that
outputs two vectors, the mean and log variance of a point
cloud in Z; and a discriminator D to enforce realism in the
outputs of G. BicycleGAN is built around two cycles: a
conditional latent regressor (cLR) to enforce consistency on the
path Z ! X ! Z, and a conditional variational
autoencoder (cVAE) to do the same for the path X ! Z ! X . The
models are trained to jointly optimize multiple loss
functions, described below.</p>
        <p>Standard cGAN loss This is the original conditional
GAN loss function defined in (Mirza and Osindero 2014):
L_GAN(G, D) = E_{x∼p_X^data}[log(D(x))]
            + E_{y∼p_Y^data, z∼p_Z}[log(1 − D(G(y, z)))],   (1)
where E_p[·] is the expected value under a distribution p.
The two terms evaluate the realism of real and generated
data, respectively.</p>
        <p>cVAE-GAN loss Identical to the GAN loss, except that
z is sampled from E(x) via the reparameterization trick,
rather than from p_Z(z):
L_GAN^VAE(G, E, D) = E_{x∼p_X^data}[log(D(x))]
            + E_{(x,y)∼p_XY^data, z∼(μ+σ⊙ε)|E(x)}[log(1 − D(G(y, z)))].   (2)
As μ → 0 and σ → 1, the cVAE-GAN loss term
approaches the standard cGAN loss term.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Latent space reconstruction loss</title>
        <p>The L1 distance between a randomly sampled latent vector and its
reconstruction after passing through both G and E:
L_1^Z(G, E) = E_{y∼p_Y^data, z∼p_Z}[‖z − μ|E(G(y, z))‖₁].   (3)
This is the cLR cyclic consistency term, intended to teach
E to invert G.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Ground truth reconstruction loss</title>
        <p>The L1 distance between a ground truth example and its
reconstruction after passing through both E and G:
L_1^X(G, E) = E_{(x,y)∼p_XY^data, z∼(μ+σ⊙ε)|E(x)}[‖x − G(y, z)‖₁].   (4)
This is the cVAE cyclic consistency term, intended to
teach G to invert E.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>KL Divergence from pZ (z) to pE (z) Attempts to en</title>
      <p>Z
sure that E maps into the simple prior distribution
sampled from during inference:</p>
      <p>LKL(E) = Ex pdXata [DKL( N ( ; 2) E(x) kN (0; 1))]:
(5)
This is necessary to ensure that G always receives latent
vectors belonging to the same distribution. By design, E’s
outputs are always interpreted as point clouds, since they
only go back into the training process via
reparameterization (which is always Gaussian), justifying the above
form of the KL divergence.</p>
      <p>The models are trained according to combinations of these
loss terms, namely
where “ ” represents an updated model after one training
step and the s are weights.</p>
      <p>BicycleGAN is very successful at generating diverse and
realistic outputs G(y, z), but the statistics of the learned
distributions have not been verified. This means ensuring that
the conditional probability distribution implicitly modeled
by G, p_{X|Y=y}^G(x), well approximates the distribution
described by the training data, p_{X|Y=y}^data(x); in the Appendix,
we show that this is possible even if we can only sample
pairs from the joint distribution, p_{XY}^data(x, y). It also means
ensuring that the latent distribution modeled by E, p_Z^E(z),
resembles the prior, p_Z(z). Our tests of BicycleGAN do
produce diverse and realistic reconstructions, but we do not find
that the learned distributions match the ground truth statistics.</p>
      <p>
        The original BicycleGAN has several features that
potentially complicate learning a bijective map via enforcing
two-cycle consistency:
1. G has two inputs, y and z, but E only has one input,
x. This asymmetry means that E cannot invert G, since
E(G(y; z)) has no way to disentangle the separate
contributions of y and z in G(y; z).
2. E is trained using VAE-based methods and has two
outputs, μ and σ, rather than a point in Z. This is a second
reason why E cannot actually invert G(y; z) to recover
z, and therefore cannot be trained to enforce cycle
consistency in the latent space. L1Z actually attempts to
minimize the distance between z and μ|E(G(y, z)).
3. The authors found that simultaneously training G and E
on both cyclic consistency loss terms incentivizes
cheating, similar to what was observed in CycleGAN
        <xref ref-type="bibr" rid="ref4">(Chu,
Zhmoginov, and Sandler 2017)</xref>
        , so E is not trained to
optimize L1Z . However, when we attempt to replicate their
approach we find indications of the same behavior when
simultaneously training G and E to optimize L1X .
We address these concerns primarily by changing E, and by
changing the loss functions used in training—specifically,
which loss functions are used to train which models. Our
modified framework simultaneously trains a conditional
generator G : Y, Z → X and a deterministic, conditional
encoder E : Y, X → Z. Each change is discussed in detail
in the following sections.
      </p>
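      <p>Before detailing the individual losses, a sketch of the two cycles
(cLR and cAE) in our modified framework, with the conditional generator
G(y, z) and the deterministic conditional encoder E(y, x) treated as
opaque callables; this is a schematic of the data flow only, not our
actual training code.</p>
      <preformat>
import tensorflow as tf

def cycle_outputs(G, E, x, y, dim_z):
    """Compute both cycle paths for one batch of (x, y) pairs."""
    z = tf.random.normal([tf.shape(x)[0], dim_z])   # z ~ p_Z(z)
    # cLR path: Z -> X -> Z
    x_gen = G([y, z])
    z_cyc = E([y, x_gen])
    # cAE path: X -> Z -> X
    z_enc = E([y, x])
    x_cyc = G([y, z_enc])
    return x_gen, z_cyc, z_enc, x_cyc

# Cycle consistency losses (Eqs. 14 and 15):
# L1_Z = tf.reduce_mean(tf.abs(z - z_cyc)); L1_X = tf.reduce_mean(tf.abs(x - x_cyc))
      </preformat>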
      <sec id="sec-5-1">
        <title>Adversarial Losses</title>
        <p>
          As with the original BicycleGAN, we use two cGAN-based
loss terms to encourage realism in the outputs of G, one in
which z is sampled from the latent prior and one in which
z is encoded from an (x; y) pair. Rather than a
discriminator, we use a Wasserstein critic C
          <xref ref-type="bibr" rid="ref3">(Arjovsky, Chintala, and
Bottou 2017)</xref>
          , with a gradient penalty loss (Gulrajani et al.
2017). A discriminator is a binary classifier that can only
return values of 0 (generated) or 1 (real), but a Wasserstein
critic scores realism on a continuous scale of more negative
(more likely to be generated) to more positive (more likely
to be real). This provides more useful gradients to G during
training, but is not a fundamental change to the framework.
        </p>
        <p>The two loss terms used to train G are</p>
        <p>L_critic^cLR(G, C) = E_{y∼p_Y^data, z∼p_Z}[1 − C(x_gen)],   (9)
L_critic^cAE(G, E, C) = E_{(x,y)∼p_XY^data}[1 − C(x_cyc)],   (10)
where x_gen = G(y, z) and x_cyc = G(y, E(y, x)). The
explicit “1” indicates that G is being optimized to generate
outputs considered “real” by C.</p>
        <p>The complement to Eqs. 9 and 10 is
L_critic^real(C) = E_{x∼p_X^data}[1 − C(x)].   (11)
Gradient penalty terms ensure that C is 1-Lipschitz:
L_GP^cLR(G, C) = E_{(x,y)∼p_XY^data, z∼p_Z}[(‖∇_x̄ C(x̄_gen)‖₂ − 1)²],   (12)
L_GP^cAE(G, E, C) = E_{(x,y)∼p_XY^data}[(‖∇_x̄ C(x̄_cyc)‖₂ − 1)²],   (13)
where gradients are taken with respect to randomly weighted
averages x̄_gen = u x + (1 − u) x_gen and x̄_cyc = u x + (1 − u) x_cyc,
where u is uniform random noise on the interval [0, 1] and
‖·‖₂ is the L2 norm.</p>
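        <p>A sketch of the gradient penalty computation in Eqs. 12 and 13, for
an unconditioned critic C; the helper name and the assumption of 4-D
image tensors are ours.</p>
        <preformat>
import tensorflow as tf

def gradient_penalty(C, x_real, x_fake):
    """(||grad C(x_bar)||_2 - 1)^2 evaluated at random interpolates x_bar."""
    u = tf.random.uniform([tf.shape(x_real)[0], 1, 1, 1], 0.0, 1.0)
    x_bar = u * x_real + (1.0 - u) * x_fake
    with tf.GradientTape() as tape:
        tape.watch(x_bar)
        scores = C(x_bar)
    grads = tape.gradient(scores, x_bar)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean(tf.square(norms - 1.0))
        </preformat>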
      </sec>
      <sec id="sec-5-2">
        <title>Cycle Consistency Losses</title>
        <p>As with BicycleGAN, we use two cycle consistency loss
terms, one for the cLR path and one for the cAE path:
L_1^Z(G, E) = ‖z − z_cyc‖₁,   (14)
L_1^X(G, E) = ‖x − x_cyc‖₁,   (15)
where z_cyc = E(y, G(y, z)) and x_cyc = G(y, E(y, x)).</p>
      </sec>
      <sec id="sec-5-3">
        <title>Calculating KL Divergence for a Deterministic</title>
      </sec>
      <sec id="sec-5-4">
        <title>Autoencoder</title>
        <p>The two-output, probabilistic design of the encoder in a VAE
enables calculation of the KL divergence from p_Z(z) to
p_Z^E(z), an inherently statistical measure, from a single data
point. However, this comes at the cost of cycle consistency
in latent space, since E can no longer produce a single latent
vector from G(y, z) to compare with the original z. This is
even worse for a cVAE: in order to reconstruct data from
multiple classes, a non-conditioned VAE is forced to
partition latent space into regions corresponding to the classes,
which is in direct tension with the KL divergence loss's drive
to map every x to N(0, 1). The extra information provided
by the conditioning input negates the need to partition Z, but
that in turn means that σ|E(x) → 0 for all x, and therefore that
the reconstruction loss term defined in Eq. 3 will not be able
to learn anything meaningful. Our initial attempts to train a
BicycleGAN had precisely this problem, regardless of the
relative weights assigned to the different loss terms.</p>
        <p>A deterministic autoencoder preserves the flow of
information through the cLR path, but prevents us from
calculating the KL divergence on a per-example basis in the
cAE path. Fortunately, because KL divergence is a
statistical term, it can be calculated over a batch of training data.
We therefore switch to a batch-wise KL divergence loss,
L_KL^cAE(E) = E_{x∼p_X^data}[D_KL(N(μ_enc, σ_enc²) ‖ N(0, 1))],   (16)
where E is now deterministic and μ_enc and σ_enc are the mean
and standard deviation of z_enc = E(y, x),
respectively, calculated over a batch of training data.</p>
        <p>To make the framework symmetrical, we also include a
similar loss term on the cLR path,
L_KL^cLR(E) = E_{y∼p_Y^data, z∼p_Z}[D_KL(N(μ_cyc, σ_cyc²) ‖ N(0, 1))],   (17)
where μ_cyc and σ_cyc are calculated over a batch of z_cyc =
E(y, G(y, z)).</p>
        <p>These loss terms are unusual in that they are calculated
once over the batch, instead of once for each example,
followed by averaging over the batch. Their effectiveness
is strongly dependent on batch size: for a batch of z ∼
N(0, 1), the batch size must exceed 100 for the calculated
KL divergence to drop below 0.01. For small batch sizes,
these terms do not provide much useful information.</p>
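        <p>A minimal sketch of the batch-wise KL divergence term in Eqs. 16 and
17: the mean and variance are taken over the batch dimension of the
deterministic codes and plugged into the same closed-form Gaussian KL
used by VAEs. The helper name and the small variance floor are ours; the
last two lines check the batch-size sensitivity noted above empirically.</p>
        <preformat>
import tensorflow as tf

def batchwise_kl(z_batch, eps=1e-6):
    """D_KL(N(mu_batch, sigma_batch^2) || N(0, 1)), statistics over the batch axis."""
    mu = tf.reduce_mean(z_batch, axis=0)
    var = tf.math.reduce_variance(z_batch, axis=0) + eps
    kl = 0.5 * (var + tf.square(mu) - 1.0 - tf.math.log(var))
    return tf.reduce_mean(kl)   # average over latent dimensions

# Even for z drawn exactly from N(0, 1), small batches give a large sample KL:
for n in [10, 100, 1000]:
    print(n, float(batchwise_kl(tf.random.normal([n, 100]))))
        </preformat>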
        <p>The desired effect of these loss terms is that E(y, x) will
learn to map the information in x that is not contained in
y onto Z, the space defined by the distribution p_Z(z) =
N(0, 1). On their own, L_KL^cLR and L_KL^cAE are not sufficient to
guarantee this. For example, perhaps E could learn to
assign a separate region in Z to each class that, averaged over
a large batch, yields μ = 0, σ = 1. However, cLR cycle
consistency requires that E(y, G(y, z)) → z for all z ∼ p_Z(z).
Therefore, optimizing cLR cycle consistency together with
KL divergence requires that E(y, x) maps (x, y) pairs into
Z independent of y; or, in other words, that E learns
common semantic features present in X but not Y, and maps
those features into Z.</p>
        <p>[Figure 1: Training scheme for the generator G and encoder E, showing
the cLR path (z, y → G → x_gen → E → z_cyc) and the cAE path
(x, y → E → z_enc → G → x_cyc), the critic losses L_critic^cLR and
L_critic^cAE, the reconstruction losses L_1^Z and L_1^X, and the KL
divergence losses L_KL^cLR and L_KL^cAE.]</p>
      </sec>
      <sec id="sec-5-5">
        <title>Full Model</title>
        <p>The models are trained according to</p>
        <p>G∗ = arg min_G [L_critic^cLR + L_critic^cAE + λ_1X L_1^X],   (18)
E∗ = arg min_E [L_KL^cLR + L_KL^cAE + λ_1Z L_1^Z],   (19)
C∗ = arg min_C [L_critic^real − (1/2)(L_critic^cLR + L_critic^cAE) + L_GP^cLR + L_GP^cAE].   (20)
G attempts to generate realistic outputs, regardless of
whether its z input comes from the prior distribution or E,
while also attempting to invert E. E attempts to generate
latent outputs consistent with the latent prior distribution,
regardless of whether its x inputs come from the training data
or G, while also attempting to invert G. C attempts to learn
to differentiate between ground truth and generated x from
both the cLR and cAE paths. The training scheme for G and
E is shown in Fig. 1. The training scheme for C is standard
for a Wasserstein critic except that there are two paths for
generated samples; the factor of 1=2 in Eq. 20 ensures that
real and generated x are weighted equally so C does not just
label everything as fake.</p>
        <p>This formulation is symmetric. G is only trained to
optimize cAE cycle consistency and the adversarial loss, a
measure of realism or the (modeled) likelihood that G(y; : : : ) 2
X . Similarly, E is only trained to optimize cLR cycle
consistency and the KL divergence loss, a measure of the
likelihood that E(y; : : : ) 2 Z. The lack of common loss
functions between G and E keeps them from learning to cheat.</p>
        <p>We note that there is some redundancy between Eqs. 10
and 15, and between Eqs. 14 and 17. If E and G do truly
learn to invert one another, as incentivized by Eqs. 14 and 15,
then Eqs. 10 and 17 will no longer provide useful gradients,
but at worst this might lead to some wasted computations
late in training.</p>
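        <p>A schematic training step reflecting Eqs. 18 and 19, with G and E
deliberately updated on disjoint loss sets (critic update omitted). The
weight values are placeholders rather than the ones we used, and the
sketch reuses the batchwise_kl helper shown earlier.</p>
        <preformat>
import tensorflow as tf

opt_G = tf.keras.optimizers.Adam()
opt_E = tf.keras.optimizers.Adam()

def update_G_and_E(G, E, C, x, y, z, lam_1x=10.0, lam_1z=10.0):
    # G: adversarial terms + cAE cycle consistency (Eq. 18)
    with tf.GradientTape() as tape:
        x_gen = G([y, z])
        x_cyc = G([y, E([y, x])])
        loss_G = (tf.reduce_mean(1.0 - C(x_gen)) +
                  tf.reduce_mean(1.0 - C(x_cyc)) +
                  lam_1x * tf.reduce_mean(tf.abs(x - x_cyc)))
    opt_G.apply_gradients(zip(tape.gradient(loss_G, G.trainable_variables),
                              G.trainable_variables))
    # E: batch-wise KL terms + cLR cycle consistency (Eq. 19)
    with tf.GradientTape() as tape:
        z_cyc = E([y, G([y, z])])
        z_enc = E([y, x])
        loss_E = (batchwise_kl(z_cyc) + batchwise_kl(z_enc) +
                  lam_1z * tf.reduce_mean(tf.abs(z - z_cyc)))
    opt_E.apply_gradients(zip(tape.gradient(loss_E, E.trainable_variables),
                              E.trainable_variables))
    return loss_G, loss_E
        </preformat>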
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Methods</title>
      <p>Our code is publicly available at https://github.com/
USArmyResearchLab/ARL_Representativeness_of_Cyclic_GANs.</p>
      <sec id="sec-6-1">
        <title>Dataset</title>
        <p>
          We attempt to solve a simple super-resolution inverse
problem. Fashion-MNIST
          <xref ref-type="bibr" rid="ref9">(Xiao, Rasul, and Vollgraf 2017)</xref>
          is a
collection of 28 × 28 grayscale images separated into 10
classes of clothing and accessories. Each class has 6,000
training examples and 1,000 test examples. Subjectively,
most examples in each class fall into a small number of
clusters—for example, most images in the “trousers” category
look very similar, while there are several different apparent
“sub-categories” among T-shirts. Roughly 10% or fewer of the
examples in each class vary strongly from the rest of that
class, while the remaining ~90% are very similar. Some
classes (e.g., T-shirt and shirt) have significant overlap.
        </p>
        <p>We downsample the images once, using 2 × 2 average
pooling, to get a set of 14 × 14 images, x′, and then again
to get a corresponding set of 7 × 7 images, y′. We then
upscale y′ by doubling the number of pixels in each dimension
to get very low-resolution 14 × 14 conditioning images y, and
use those to obtain the residuals x = x′ − y. We rescale (x, y)
pairs to be on the interval [−1, 1], with training and test data
rescaled separately.</p>
        <p>The original images in Fashion-MNIST are already
low-resolution, and this much downsampling destroys significant
feature information. Many x can plausibly be obtained from
a given y, according to some distribution p_{X|Y=y}(x).
Although the dataset only includes one (x, y) pair per
example, rather than a distribution p_{X|Y=y}^data(x), we can still
justify supervised training using p_{XY}^data(x, y). This is discussed
in the Appendix.</p>
        <p>
          Conditioning C vs. X → Y Consistency Loss
Depending on the problem, it may be inappropriate to
condition C. For example, in image inpainting, y includes the
mask that defines the region to be filled in, which can be
exploited to identify perceptual discontinuities between the
original and generated portions of the image (Pathak et al.
2016). For our problem, this is not the case.
        </p>
        <p>A conditioning input allows C to ask whether G(y, ·) is
consistent with y. We know the map X → Y in our case; for
a given x to be perfectly consistent with y, a 2 × 2-average-
pool-downsampled x must be 0 everywhere. By applying
this to G(y, ·), we can separate this consistency from C
as a pair of supplemental loss terms,</p>
        <p>L_XY^cLR(G) = λ_XY^cLR ‖downsample(x_gen)‖₁,   (21)
L_XY^cAE(G, E) = λ_XY^cAE ‖downsample(x_cyc)‖₁.   (22)
We do not observe a significant difference in results between
implementing a conditional C(y, x) vs. an unconditioned
C(x) plus these supplemental losses, in terms of visual
quality. The supplemental losses enforce the desired consistency
explicitly, so we use them in our experiments.</p>
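        <p>The consistency check itself is just average pooling of the generated
residual; a small sketch of the supplemental term in Eqs. 21 and 22,
assuming NHWC image tensors (the helper name is ours):</p>
        <preformat>
import tensorflow as tf

def xy_consistency_loss(x_out, weight=1.0):
    """||downsample(x_out)||_1: a generated residual consistent with y
    must average-pool to zero."""
    pooled = tf.nn.avg_pool2d(x_out, ksize=2, strides=2, padding='VALID')
    return weight * tf.reduce_mean(tf.abs(pooled))
        </preformat>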
        <p>We use deep ResNeXt networks with efficient grouped
convolutions and identity shortcuts
          <xref ref-type="bibr" rid="ref10 ref11">(Xie et al. 2016)</xref>
          and O(4 × 10⁶) parameters in each of G, E, and C.
Activations are all LeakyReLU followed by layer normalization,
except for the generator output, which uses tanh. We inject
latent vectors z only at the top layer of the model. We test
dim(z) ∈ {10, 100, 1000}, finding that 100 performs best
overall and therefore use that for most experiments.</p>
        <p>
          We perform training in TensorFlow, using the Adam
optimizer with default parameters. We choose a batch size
of 200, since our batch-wise KL divergence is only useful
for fairly large batch sizes. We train using instance noise
          <xref ref-type="bibr" rid="ref6">(Sønderby et al. 2016)</xref>
          over x, x_gen, and x_cyc, replacing
x[·] ← α x[·] + (1 − α)u, where α anneals from 0 to
1 in increments of 0.01 over the first 100 epochs of training,
and u is uniform random noise on the interval [−1, 1].
        </p>
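        <p>A sketch of the instance-noise schedule as described above (α
annealed from 0 to 1 in steps of 0.01 per epoch over the first 100
epochs, with uniform noise on [−1, 1]); the helper name is ours.</p>
        <preformat>
import tensorflow as tf

def add_instance_noise(x, epoch):
    """Blend x with uniform noise; the noise weight (1 - alpha) anneals to zero."""
    alpha = tf.minimum(1.0, 0.01 * epoch)
    u = tf.random.uniform(tf.shape(x), minval=-1.0, maxval=1.0)
    return alpha * x + (1.0 - alpha) * u
        </preformat>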
        <p>For every update of G and E, we train C continuously
until its validation loss fails to improve for 5 consecutive
batches to ensure it is approximately converged. Training
continues until all models’ validation losses have failed to
improve for 20 consecutive epochs.</p>
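        <p>The critic schedule can be written as a simple inner loop; this is a
sketch in which the single-batch update and the validation loss
computation are placeholder callables, not our actual pipeline.</p>
        <preformat>
def train_critic_until_converged(critic_step, val_loss_fn, patience=5):
    """Keep updating C until its validation loss stops improving for `patience` batches."""
    best = float('inf')
    stale = 0
    while stale != patience:
        critic_step()          # one C update on a training batch
        loss = val_loss_fn()   # critic loss on a held-out batch
        if loss + 1e-8 > best:
            stale += 1
        else:
            best, stale = loss, 0
        </preformat>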
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Results and Discussion</title>
      <p>We test variational vs. deterministic encoders; different
choices of hyperparameters and loss weights; and allowing
vs. disallowing overlap between loss functions used to train
E and G. The outcomes of these tests share two
commonalities: G produces diverse and realistic predictions,
exemplified in Fig. 2, but without cycle consistency or distribution
matching in Z. Thus, their maps are not bijective and do not
map the latent space onto the true data distribution.</p>
      <sec id="sec-7-1">
        <title>Variational Frameworks</title>
        <p>First, we test the BicycleGAN framework (the variational
method) with a Wasserstein critic, trained with Eqs. 6
(modified to include Eqs. 9 and 10), 7, and 20. As noted
previously, this generates realistic images on par with our other
tests, but even strongly weighting L1Z does not produce any
cLR cycle consistency, with (μ, σ) going to (0, 1) very rapidly.
This is true regardless of whether we train E but not G on
L1Z and/or do not train E on L1X .</p>
      </sec>
      <sec id="sec-7-2">
        <title>Deterministic Frameworks</title>
        <p>As mentioned previously, we observe some steganographic
collaboration between models that train G and E with
overlapping loss functions. For example, some information is
hidden in the black background of each xgen as an
imperceptible, low-amplitude signal. Truncating the values of
those pixels to 1 with no other changes produces a sizable
change in kz zcyck1, increasing it from 0:07 to 0:4. This
emphasizes that cycle consistency in the models does not
mean they have learned a meaningful map. We therefore
restrict our experiments to models trained without overlapping
loss functions, using Eqs. 18–20.</p>
        <p>In this scenario we observe no cLR cycle consistency,
as E is unable to extract latent information from xgen. We
see this regardless of the relative weights in Eq. 19, even if
we weight LcKLLR and LcKALE independently. A typical result is
shown in Fig. 3a: the model is penalized by L1Z for wrongly
predicting zcyc, and is also unable to find a path to learn to
predict zcyc correctly. Accordingly, given some z (red line) it
simply predicts z_cyc ≈ 0 for all y (black lines), which results in a
smaller penalty than if it had predicted a nonzero, incorrect
zcyc. This is a failure to optimize both L1Z , because the red
and black lines do not overlap, and LcKLLR, because the
standard deviation of the elements in zcyc is much less than 1.
Only when we set λ_1Z = 0 are the KL-divergence loss terms
able to enforce good statistics.</p>
        <p>This is not because E is simply unable to learn to invert G.
Rather, it appears to be unable to do so quickly and robustly.
Once the model converges, we train E only for 1000 more
epochs, holding G and C constant. This does slightly
improve cycle consistency in z, as shown in Fig. 3b. L1Z takes
“only” several hundred epochs to plateau, so training time is
not the only limiting factor. The small overall improvement
in L1Z (about 4%) belies a noticeable qualitative change, due
to a minority of z-coordinates being very wrong. This still
falls far short of allowing E to truly invert G, and further, it
is not stable, disappearing with another update to G.</p>
        <p>L1X has limited success in optimizing cAE cycle
consistency. This happens to some extent even if we do not
optimize on L1Z , reflecting the fact that optimizing realism
pseudo-optimizes the L1 distance between x and x_cyc. L_1^X
then depends on only a handful of pixels for most images,
which dominate the expected value in Eq. 15. In support of
this, we find that xcyc does not reproduce rare features such
as text, symbols, and some patterns and orientations.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>Our experiments are consistent with the idea that
cycle-consistent GANs are not good vehicles for obtaining
representative maps, largely because they do not learn cycle
consistency well in the first place. They still produce diverse and
realistic outputs, but are not representative, in effect
mapping onto an unknown subset of X .</p>
      <p>Two-cycle consistency is a surrogate for bijectivity. The
fact that it is so difficult to train G and E to invert one
another even for a simple problem such as the one we test
indicates that an explicit guarantee of bijectivity is probably the
best path forward for achieving this, which in turn will allow
inverse problems to be solved rigorously and
probabilistically via simple Monte Carlo sampling. In the absence of
such a guarantee, we believe representativeness in the
generated distribution must be explicitly tested for in generative
models, especially when risk or bias assessment, uncertainty
quantification, and similar considerations are important.</p>
      <p>INNs are inherently bijective and do not require explicit
enforcement of two-cycle consistency. It seems likely that,
even if there is a way to enforce bijectivity in a GAN-like
construct, INNs provide a simpler path to this result.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>The views and conclusions contained in this document are
those of the authors and should not be interpreted as
representing the official policies, either expressed or implied, of
the US Army Combat Capabilities Development Command
(DEVCOM) Army Research Laboratory or the U.S.
Government. The U.S. Government is authorized to reproduce and
distribute reprints for Government purposes notwithstanding
any copyright notation herein. The authors thank Dr. Ting
Wang for valuable discussions and Mr. Matt Ziemann for
careful reading of the manuscript. Computer time was
provided by the DEVCOM ARL DSRC.</p>
    </sec>
    <sec id="sec-10">
      <title>Appendix: Sampling Inverse Problem</title>
    </sec>
    <sec id="sec-11">
      <title>Solutions</title>
      <p>G and E implicitly define distributions p_{X|Y=y}^G(x) and
p_Z^E(z), respectively: for any function f,
∫_{R^m} p_Z(z) f(G(y, z)) dz = ∫_{R^n} p_{X|Y=y}^G(x) f(x) dx,   (A.1)
∫_{R^n} p_{X|Y=y}(x) f(E(y, x)) dx = ∫_{R^m} p_Z^E(z) f(z) dz,   (A.2)
or, equivalently,
E_{z∼p_Z}[f(G(y, z))] = E_{x∼p_{X|Y=y}^G}[f(x)],   (A.3)
E_{x∼p_{X|Y=y}}[f(E(y, x))] = E_{z∼p_Z^E}[f(z)].   (A.4)
If our model performs as desired, the distribution of points
in R^n obtained by sampling z ∼ p_Z(z) and evaluating
G(y, z) should resemble the true conditional distribution
of possible x consistent with a given y, p_{X|Y=y}(x).
Similarly, the distribution of points in R^m obtained by sampling
x ∼ p_{X|Y=y}(x) and evaluating E(y, x) should resemble
the latent prior, p_Z(z). However, while we can sample z
from the latent prior, we cannot sample x from the true
conditional distribution of X, since we usually have only one
pair each of ground truth (y, x) in our training data.</p>
      <p>Fortunately, we can get around this if the learned encoder
distribution, p_Z^E(z), is independent of Y, which is a
reasonable assumption if the reconstruction process can identify
common features applicable to many different conditioners,
and is further supported by the KL divergence loss terms.</p>
      <p>Taking the expectation under p_Y(y) in Eq. A.2 gives
∫∫ p_{X|Y=y}(x) p_Y(y) f(E(y, x)) dx dy = ∫_{R^m} p_Z^E(z) f(z) dz ∫ p_Y(y) dy.   (A.5)
The second integral on the RHS evaluates to one by
definition, and Bayes's theorem lets us rewrite the LHS to get
∫∫ p_{XY}(x, y) f(E(y, x)) dy dx = ∫_{R^m} p_Z^E(z) f(z) dz,   (A.6)
where p_{XY}(x, y) is the true joint probability density
function of X and Y. Unlike p_{X|Y=y}(x), we can sample from
p_{XY}(x, y); the training data, p_{XY}^data(x, y), does just that. We
can then enforce a distribution constraint on p_Z^E(z) as usual.</p>
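      <p>In practice, the constraint implied by Eq. A.6 can be checked (or
enforced) empirically by pushing sampled training pairs through E and
comparing the resulting codes with the prior. A sketch of such a check,
assuming a trained encoder E callable as E(y, x) and arrays x_train,
y_train; the function name and test choice are ours.</p>
      <preformat>
import numpy as np
from scipy import stats

def check_latent_prior(E, x_train, y_train, n=1000, seed=0):
    """Sample (x, y) from the joint training data, encode, and compare with N(0, 1)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_train), size=n, replace=False)
    z = np.asarray(E(y_train[idx], x_train[idx]))   # shape (n, dim_z)
    print('per-dimension mean:', z.mean(axis=0).round(2))
    print('per-dimension std: ', z.std(axis=0).round(2))
    # Per-dimension Kolmogorov-Smirnov test against a standard normal.
    pvals = [stats.kstest(z[:, i], 'norm').pvalue for i in range(z.shape[1])]
    return np.array(pvals)
      </preformat>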
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ardizzone</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kruse</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Wirkert,
          <string-name>
            <given-names>S. J.</given-names>
            ;
            <surname>Rahner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ;
            <surname>Pellegrini</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. W.</surname>
          </string-name>
          ; Klessen,
          <string-name>
            <given-names>R. S.</given-names>
            ;
            <surname>Maier-Hein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ;
            <surname>Rother</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          ; and Köthe,
          <string-name>
            <surname>U.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Analyzing inverse problems with invertible neural networks</article-title>
          .
          <source>CoRR abs/1808</source>
          .04730.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2019.
          <article-title>Guided image generation with conditional invertible neural networks</article-title>
          .
          <source>CoRR abs/1907</source>
          .02392.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Arjovsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chintala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Wasserstein GAN</article-title>
          .
          <source>CoRR abs/1701</source>
          .07875.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhmoginov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and Sandler,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Cyclegan, a master of steganography</article-title>
          .
          <source>CoRR abs/1712</source>
          .02950.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          2014.
          <article-title>Generative adversarial nets</article-title>
          . In
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lawrence</surname>
          </string-name>
          , N. D.; and
          <string-name>
            <surname>Weinberger</surname>
          </string-name>
          , K. Q., eds.,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>27</volume>
          . Curran Associates, Inc.
          <fpage>2672</fpage>
          -
          <lpage>2680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Sønderby</surname>
            ,
            <given-names>C. K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Caballero</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Theis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and Huszár,
          <string-name>
            <surname>F.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Amortised MAP inference for image superresolution</article-title>
          .
          <source>CoRR abs/1610</source>
          .04490.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Wiyatno</surname>
            ,
            <given-names>R. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dia</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>de Berker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Adversarial examples in modern machine learning: A review</article-title>
          .
          <source>CoRR abs/1911</source>
          .05268.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rasul</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and Vollgraf,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms</article-title>
          .
          <source>CoRR abs/1708</source>
          .07747.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Girshick,
          <string-name>
            <given-names>R. B.</given-names>
            ; Dollár, P.;
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ; and
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>Aggregated residual transformations for deep neural networks</article-title>
          .
          <source>CoRR abs/1611</source>
          .05431.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Zhang,
          <string-name>
            <given-names>X.</given-names>
            ;
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            ;
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            ; Xue, J.; and
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <surname>Q.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Deep learning for single image superresolution: A brief review</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>21</volume>
          (
          <issue>12</issue>
          ):
          <fpage>3106</fpage>
          -
          <lpage>3121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          , J.-Y.;
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Isola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Efros</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          <year>2017a</year>
          .
          <article-title>Unpaired image-to-image translation using cycle-consistent adversarial networks</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          , 2223-
          <fpage>2232</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Zhu</surname>
            , J.-Y.; Zhang, R.; Pathak,
            <given-names>D.</given-names>
          </string-name>
          ; Darrell,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Efros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            ;
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            ; and
            <surname>Shechtman</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.</surname>
          </string-name>
          <year>2017b</year>
          .
          <article-title>Toward multimodal image-to-image translation</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <volume>465</volume>
          -
          <fpage>476</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>