<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LatentVis: Investigating and Comparing Variational Auto-Encoders via Their Latent Space</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiao Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junpeng Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, the Ohio State University</institution>
          ,
          <addr-line>2015 Neil Avenue, Columbus, Ohio, 43210</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Visa Research</institution>
          ,
          <addr-line>385 Sherman Avenue, Palo Alto, California, 94306</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>As the result of compression and the source of reconstruction, the latent space of Variational Auto-Encoders (VAEs) captures the essence of the training data and hence plays a fundamental role in data understanding and analysis. Focused on revealing what data features/semantics are encoded and how they are related in the latent space, this paper proposes a visual analytics system, i.e., LatentVis, to interactively study the latent space for better understanding and diagnosing image-based VAEs. Specifically, we train a supervised linear model to relate the machine-learned latents with the human-understandable semantics. With this model, each important data feature is expressed along a unique direction in the latent space (i.e., semantic direction). Comparing the semantic directions of different features allows us to compare the feature similarity encoded in the latent space, and thus to better understand the encoding process of the corresponding VAE. Moreover, LatentVis empowers us to examine and compare latent spaces across various training stages, or different VAE models, which can provide useful insight into model diagnosis.</p>
      </abstract>
      <kwd-group>
<kwd>Deep generative model</kwd>
        <kwd>variational auto-encoder</kwd>
        <kwd>latent space</kwd>
        <kwd>semantics</kwd>
        <kwd>visual analytics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Investigating them could help to understand and di</title>
        <p>agnose the DGMs, and thus shed light on the mystery
With the powerful capability in feature extractions, power of DGMs. However, those latent spaces are
usuDeep Neural Networks (DNNs) have made a series of ally with high-dimensionality and the semantics of
inbreakthroughs across a wide range of applications, e.g., dividual latent dimension is not human-understandable.
image classification [1], object recognition [2], image Recently, we have witnessed many works on
intersegmentation [3], etc. More interestingly, DNNs also preting the latent space of DNNs. Some considered a
demonstrate excellent performance in feature genera- latent space as a high-dimensional manifold and
fotions, which has attracted more research attention [4]. cused on the geometric interpretation of the manifold.
For example, Generative Adversarial Nets (GANs) and For example, [7] showed that geodesic curves on the
Variational Auto-Encoders (VAEs) are able to generate latent space manifold are approximately straight in their
data (including images [5], sounds [6]) that are almost experiments. [8] revealed that a stochastic
Riemanindistinguishable from real data. nian metric in the latent space could produce smoother</p>
        <p>
          The outstanding performance of DNNs comes from interpolations than the conventional Euclidean distance.
their complicated internal model architectures and the With static visualizations of the geometric path in the
long-time model training processes, which, however, latent space, these studies have helped to understand
have gone far beyond humans’ interpretability. As a the abstractive manifold holistically.
result, it is very dificult to explain how Deep Genera- Others explored the semantics of diferent latent
spative Models (DGMs) understand the extracted features ces by focusing on specific tasks. For example, [9, 10]
and further use them to generate new features. The la- analyzed the word embedding and verified the linear
tent spaces of these models, located at the pivot point arithmetic of the semantics in the embedding/latent
between extraction and generation, compress all the space, e.g.,  − + ≈ . Similar linear
extracted features and control what to be generated. arithmetic has also been found in the latent space of
image-based DGMs [4]. These studies expose some
Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, structures of the latent spaces, but are still insuficient
eGmaalwila:yl,iIur.e5l7a6n4d@osu.edu (X. Liu); junpeng.wang.nk@gmail.com (J. to comprehensively reveal their essential semantics.
Wang); The work was done while the author was at The Ohio State This paper targets to diagnose image-based VAEs
University. (J. Wang) by interactively investigating their latent space, and
orcid: 0000-0002-6303-0771 (X. Liu); 0000-0002-1130-9914 (J. hence answers three concrete research questions: (
          <xref ref-type="bibr" rid="ref2 ref3">1</xref>
          )
Wang) © 2020 Copyright for this paper by its authors. Use permitted under Creative what semantics are embedded in the latent space of
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g CCoEmUmoRns WLiceonrsekAsthtriobuptioPnr4o.0cIneteerdnaitniognasl ((CCC EBYU4R.0)-.WS.org) VAEs; (2) how can we transfer the machine-learned
lalearned latent space into a human-understandable specific datasets. Some works in this group also tried
tent space to a human-understandable semantic space
for better interpretation; (3) how to use the latent spaces
a
of VAEs to track and compare VAE models. To the end,
we design and develop a three-module visual analytics
prototype, named LatentVis, for this matter. The Data
module presents an interface to interact with the
experimental dataset and select images with desired
features. The Semantics module identifies and compares
semantic directions of diferent image features,
bridging the machine-encoded latents with
human-understandable semantics. The Comparison module
compares the latent space of (
          <xref ref-type="bibr" rid="ref2 ref3">1</xref>
          ) the same model in two
diferent training stages, (2) the same model from two
separate trainings with randomly initialized network
parameters, and (3) two diferent VAE models. To sum
up, the contributions of this paper are three-fold:
• We present LatentVis, a visual analytics system
that helps to understand and diagnose VAEs by
interactively revealing the encoded semantics of
the latent space.
• Enlightened by the linear arithmetic of features,
we use a linear model to transfer a
machinesemantic space.
• Based on our analysis of the latent space, we
propose a model-agnostic approach to compare
VAEs, across training stages, separate trainings,
or diferent VAE models.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related</title>
    </sec>
    <sec id="sec-3">
      <title>Works</title>
      <sec id="sec-3-1">
        <title>Interpreting Latent Spaces. DNNs can be consid</title>
        <p>ered as functions that transfer data instances from the
input data space to a latent space (
∶</p>
        <sec id="sec-3-1-1">
          <title>A well-trained DNN will preserve the essential infor</title>
          <p>mation of the input data during this transformation.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>However, due to the complexity of DNNs, it is a nontrivial problem to reveal or verify what information is preserved and how it is preserved in the latent space.</title>
          <p>→ 
 ).</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Targeted on this problem, many research eforts have</title>
          <p>been devoted to interpret the latent spaces of DNNs.</p>
          <p>For example, [11] showed how the statistics of data
can be examined in the latent space representation.
[12] interpreted the association between visual
concepts and symbolic annotations captured by  VAE
through parallel coordinates plots. Latent embedding learn- decodes the image from the latent variable (Fig. 1a).
ing methods (GLO, LEO, GLANN [13, 14, 15]) were also
developed for the interpretation and understanding of latent variable  (i.e.,  = 
Specifically, the encoder maps an input image  to a
( ) ∼  ( | )</p>
          <p>), and
x : i n p u t
i m a g e</p>
          <p>Encoder</p>
          <p>Decoder
introduced in DFC-VAE [27], for feature reconstruction.
latent spaces. Moreover, in natural language
processing, the learned embedding of words/ paragraphs also
form a latent space. [9] and [16] interpreted this space
and found that the correlations between words/
paragraphs were well-captured in the space.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Visual Analytics for Deep Learning (VIS4DL).</title>
        <sec id="sec-3-2-1">
          <title>There are two groups of VIS4DL works in general. One</title>
          <p>focuses on a specific model to reveal the internal
working mechanism of the model, such as CNNVis [17],</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>GANViz [18], and ActiVis [19]. These works usually</title>
          <p>design a visualization system to expose the hidden
features and feature connections, for specific DNNs on
to generalize to diferent models on various datasets.
For example, [20] proposed Network Dissection to
quantify the interpretability of latent representations
captured by CNNs (AlexNet, VGG, GoogLeNet, ResNet)
via the alignment between hidden units and
semantic concepts. The other group focuses on using only
the model inputs and outputs to interpret/diagnose the
model, without touching the intermediate model
details (i.e., model-agnostic). For example, [21] proposed
a model-agnostic approach to reveal the dominant
regions of input images in controlling the prediction
results of a classifier. More examples in this group also
include [22, 23, 24]. Our work needs no examination
on the internal working mechanism of VAEs (as
semantics are encoded in the space formed by
activations, rather than individual neurons [25, 26]), and thus
belongs to the second group. Integrating a linear space
transformer into our visual analytics process, we try to
present a human-understandable latent space to
diagnose DGMs.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Variational Auto-Encoder (VAE)</title>
        <p>Variational Auto-Encoder (VAE) [28] aims to reconstruct the input image from a latent representation of the image encoded/learned by itself. It is comprised of two neural networks: an encoder network encodes the image into a latent variable, and a decoder network decodes the image from the latent variable (Fig. 1a). Specifically, the encoder maps an input image x to a latent variable z (i.e., z = E(x) ∼ qφ(z|x)), and the decoder maps a latent variable z to an output image x′ (i.e., x′ = D(z) ∼ pθ(x|z)). The encoder and decoder, defined by trainable parameters φ and θ respectively, are optimized via minimizing the following loss function: L(φ, θ) = −E_qφ(z|x)[log pθ(x|z)] + KL(qφ(z|x) ‖ p(z)). By z = E(x), the latent variable of a specific image from the VAE is readily accessible for further semantic explorations. One common issue of VAEs is that the generated images tend to be blurry, due to the aggregated pixel-wise image distance used in the loss function, i.e., the l2 distance between x and x′.</p>
        <p>Figure 1: (a) the structure of a VAE, where the encoder maps an input image x to a latent variable and the decoder reconstructs the image from it; (b) the feature perceptual loss introduced in DFC-VAE [27], for feature reconstruction.</p>
        <p>Deep Feature Consistent VAE (DFC-VAE) [27] is a variant of the regular VAE. It improves the quality of the reconstructed images by replacing the pixel-wise reconstruction loss with a feature perceptual loss [29]. In DFC-VAE, multiple levels of features are extracted from both the input and reconstructed images by passing them into a pre-trained CNN. Each layer of the CNN extracts certain levels of image features. The features from the input and reconstructed images are then used to measure their perceptual distance (Fig. 1b, the l2 loss between the corresponding feature maps). In this work, we adopted this perceptual loss to improve the reconstruction quality. The VGG19 [30] pre-trained on the ImageNet data is used as our pre-trained CNN.</p>
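        <p>For concreteness, the two objectives above can be written in a few lines of code. The following is a minimal PyTorch-style sketch, not the implementation used in this work: it assumes a Gaussian posterior parameterized by a mean and a log-variance, and the VGG19 layer indices used for the perceptual term are illustrative assumptions.</p>
        <preformat>
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

def vae_loss(x, x_recon, mu, logvar):
    # Regular VAE: pixel-wise l2 reconstruction + KL(q(z|x) || N(0, I)).
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Frozen, pre-trained VGG19 for the feature perceptual loss (DFC-VAE).
_vgg = vgg19(pretrained=True).features.eval()
_FEATURE_LAYERS = {3, 8, 17}  # assumed layer indices (relu1_2, relu2_2, relu3_4)

def perceptual_loss(x, x_recon):
    # l2 distance between corresponding VGG19 feature maps (Fig. 1b).
    loss, h_x, h_r = 0.0, x, x_recon
    for i, layer in enumerate(_vgg):
        h_x, h_r = layer(h_x), layer(h_r)
        if i in _FEATURE_LAYERS:
            loss = loss + F.mse_loss(h_r, h_x, reduction="sum")
    return loss

def dfc_vae_loss(x, x_recon, mu, logvar):
    # DFC-VAE: replace the pixel-wise term with the feature perceptual loss.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return perceptual_loss(x, x_recon) + kl
        </preformat>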
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <sec id="sec-4-1">
        <title>3.1. Fundamental Concepts</title>
        <p>Image Dataset. We focus on a face image dataset, i.e., CelebA [31], to explore the latent space of VAE models in this work. This dataset is constituted of 202599 human face images. Each image has 40 binary attributes (e.g., the image is a male face or not, a face with glasses or not). We crop a 148×148 region of the original 178×218 images and scale it down to 64×64 before processing them with our VAEs. Images with the same feature (i.e., having the same value on a binary attribute) belong to the same feature category.</p>
        <p>We focused on this face image dataset for two reasons. First, this dataset presents rich attributes for the same object (the human face) in the same scale. Compared to other datasets with numerous objects in different scales (e.g., ImageNet), a VAE can more accurately capture the underlying data distribution. Second, the well-labeled attributes in this dataset can help to interpret the semantics encoded in the latent space of VAEs, through which, we derived the semantic space using our linear model (Fig. 2).</p>
        <p>Figure 2: Three spaces: (a) the image space is where the CelebA images reside, each pixel is an independent dimension; (b) the latent space is the VAE learned representation, the VAE encoder and decoder enable the transformation between the image space and latent space; (c) the semantic space is derived from the latent space under the supervision of the 40 binary features of the images (using our linear model).</p>
        <p>Image Semantics. The semantics of face images is the existence and scale of the 40 features in the CelebA dataset. A well-trained VAE can transfer the semantics from the image space to the VAE's latent space (i.e., from Fig. 2a to 2b). However, the transferred semantics in the latent space is not human-understandable. Hence, our goal is to interpret them via a semantic space (Fig. 2c), in which, we can explore if the image semantics have been accurately encoded (Fig. 2i) and how they are encoded (Fig. 2ii).</p>
        <p>Semantic Direction. In Fig. 2b, we use a point to denote the VAE encoded latent variable for the corresponding image in the image space. All latent variables for images of the same category (e.g., "glasses", "mustache") form a cluster in the space, denoted as a blue bubble in Fig. 2b. We identify the direction from one cluster without a particular feature to the cluster with that feature as the semantic direction for the feature. For example, in Fig. 2b, the directions on the red and green lines reflect the semantic directions for "mustache" and "glasses". Moving the latent variable of one image along a semantic direction will change the corresponding feature of the reconstructed image the most. Along this semantic direction, a vector with a certain length is referred to as a Semantic Vector. For CelebA, there are 40 features, and we have 40 unique semantic directions.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Our Contributions</title>
        <p>The Linear Model. Enlightened by the linear arithmetic of features (e.g., woman without glasses + (man with glasses − man without glasses) ≈ woman with glasses) [4], we trained a linear model to quantify the semantic directions, as well as to transform the latent space to a human-understandable semantic space (i.e., from Fig. 2b to 2c). The linear model, y = W ⋅ z + b, is trained using the latent variable of all CelebA images encoded by VAEs (denoted as z) and the 40 binary attributes of the images (denoted as y). z is a vector with many dimensions (the black arrow in Fig. 2b), and y is a vector with 40 dimensions (the green arrow in Fig. 2c). Each row of the weight matrix W is the derivatives of a certain feature to all the latent dimensions, which represents a semantic direction in the latent space. Each column of W represents the contributions of a latent dimension to all the semantics.</p>
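        <p>As a concrete sketch of this step (not the implementation of LatentVis itself), the linear map can be fit with a ridge-regularized least-squares regression on the encoded latents and the 40 binary attributes; the ridge penalty and all variable names below are our assumptions.</p>
        <preformat>
import numpy as np
from sklearn.linear_model import Ridge

def fit_linear_model(Z, Y):
    """Z: (n_images, latent_dim) VAE latents; Y: (n_images, 40) binary attributes."""
    model = Ridge(alpha=1.0).fit(Z, Y)      # y = W z + b, fit jointly for all 40 features
    W, b = model.coef_, model.intercept_    # W: (40, latent_dim); row i = semantic direction i
    return W, b

def to_semantic_space(z, W, b):
    # Map one latent variable into the 40D human-understandable semantic space.
    return W @ z + b
        </preformat>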
        <p>Analytic Tasks and LatentVis. We focus on three analytic tasks to better understand and diagnose VAEs through the lens of their latent space: (T1) navigating the dataset and feature selection; (T2) visualizing and comparing the image semantics/semantic directions in a latent space; and (T3) facilitating model comparisons and diagnosis through comparing their latent spaces. Following these tasks, we propose a visual analytics system, LatentVis (Fig. 3, bottom), which contains three analytical modules corresponding to a hierarchical information flow (Fig. 3, top).</p>
        <p>The Data Module (Fig. 3a) gives an overview of the studied dataset, allowing us to flexibly explore data instances. It is also an interface to select any interested feature category and data instance for further analysis in other modules (please check details in the Appendix).</p>
        <p>The Semantics Module (Fig. 3b) demonstrates what semantics has been captured by the VAE, and how the data features are correlated in the latent space by connecting the image space with the VAE latent space. Its three views follow a hierarchical information flow to detect, cluster, and compare semantics (see details in Sec. 4.1).</p>
        <p>The Comparison Module (Fig. 3c) compares the semantic directions of the latent space from different VAEs to diagnose these models. The diagnosis for the compared VAEs is performed by examining the learning work division between the encoder and decoder (see details in Sec. 4.2).</p>
        <p>Figure 3: the information flow of the three modules (top): view data features, detect semantics, compare different semantics in a VAE, and compare the same semantics in two VAEs; and the LatentVis interface with its Data, Semantics, and Comparison modules (bottom).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments and Results</title>
      <p>Neural Network Structure. We worked with one regular VAE and one DFC-VAE, both with an encoder and a decoder of four convolutional layers. The four-layer encoder compresses the 64×64×3 CelebA images to 32×32×32, 16×16×64, 8×8×128, and 4×4×256. The compression result is then flattened and mapped to a 100D Gaussian distribution, represented by a 100D mean and a 100D standard deviation, through two fully-connected layers. The decoder has a symmetric structure with the encoder, but with a reversed order of the layers to up-sample the 100D latent variables (sampled from the 100D Gaussian). The difference between the VAE and DFC-VAE is whether a pre-trained VGG19 model was used to compute the perceptual loss.</p>
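      <p>As a reference point, the encoder described above can be sketched as follows in PyTorch; the kernel sizes, strides, and activation functions are assumptions, and only the layer output shapes follow the description (the decoder mirrors this structure with transposed convolutions).</p>
      <preformat>
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        # 64x64x3 -> 32x32x32 -> 16x16x64 -> 8x8x128 -> 4x4x256
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Two fully-connected layers parameterize the 100D Gaussian.
        self.fc_mu = nn.Linear(256 * 4 * 4, latent_dim)
        self.fc_logvar = nn.Linear(256 * 4 * 4, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(start_dim=1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return z, mu, logvar
      </preformat>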
      <sec id="sec-5-2">
        <title>We trained the VAE once</title>
        <p>and the DFC-VAE twice with the same batch size and
the Adam optimizer in all trainings. However, notice
the same or be diferent on-purpose to compare the
trained models and investigate the efect of the
hyperparameters, e.g., comparing models trained with
dif</p>
        <p>All three trainings used the 202599 CelebA images
and the batch size is 64. Every 800 batches were
considered as a training stage to collect model statistics,
like loss-values, and all the three trainings were run
lowing these tasks, we propose a visual analytics sys- for 197 stages (i.e., 157600 batches).
semantics has been captured by the VAE, and how the
data features are correlated in the latent space by
connecting the image space with the VAE latent space. Its
three views follow an hierarchical information flow to
detect, cluster, and compare semantics (see details in
Sec. 4.1).
mantic directions of latent space from diferent VAEs
to diagnose these models. The diagnosis for the
compared VAEs is performed by examining the learning
4.1. Detecting and Comparing Semantic</p>
        <p>Directions
We propose Algorithm 1 to detect and compare
semantic directions in the Semantics Module (Fig. 4). First,
we give all CelebA images to the well-trained VAE model
latent variables and their corresponding 40 binary
atdirections encoded in the latent space. To verify the
effectiveness of the semantic directions, we also
generate many random directions in the latent space. Given
any selected image, we visualize the modified version
of the image resulted from changing its latent variable
along the 40 semantic directions and a random
directhe feature “bangs" and “glasses" in Fig. 4a (i.e., change
the length of a semantic vector,  ∈[−10, 10]), we
observed how those two features were added (Fig. 4-a2)
pendix).
sis in other modules (please check details in the Ap- to obtain their latent variables. Then, we use these
The Semantics Module (Fig. 3b) demonstrates what tributes to train a linear model to capture the semantic
The Comparison Module (Fig. 3c) compares the se- tion. For example, by dragging the control point for
work division between the encoder and decoder (see to the selected image (Fig. 4-a1). However, no
obvious changes towards a particular feature were found
in the image when dragging the control point on the
axis representing random directions (Fig. 4-a3).</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.1. Detecting and Comparing Semantic Directions</title>
        <p>We propose Algorithm 1 to detect and compare semantic directions in the Semantics Module (Fig. 4). First, we give all CelebA images to the well-trained VAE model to obtain their latent variables. Then, we use these latent variables and their corresponding 40 binary attributes to train a linear model to capture the semantic directions encoded in the latent space. To verify the effectiveness of the semantic directions, we also generate many random directions in the latent space. Given any selected image, we visualize the modified version of the image resulting from changing its latent variable along the 40 semantic directions and a random direction. For example, by dragging the control point for the features "bangs" and "glasses" in Fig. 4a (i.e., changing the length of a semantic vector, α ∈ [−10, 10]), we observed how those two features were added (Fig. 4-a2) to the selected image (Fig. 4-a1). However, no obvious changes towards a particular feature were found in the image when dragging the control point on the axis representing random directions (Fig. 4-a3).</p>
        <p>Algorithm 1: Detecting and Comparing Semantic Directions.
Require: images X = {x_i} with feature labels Y = {y_i}, a selected image x̂, a selected feature f, a compared feature f′, the length of a semantic vector α, and the VAE model with an encoder E and a decoder D.
1: for x in X do
2:   z ← E(x)
3: end for
4: train a linear model y = W ⋅ z + b
5: semantic directions SD ← W
6: SD_f ← W[f, :]   // a row of W corresponding to feature f
7: randomly initialize SD_r with ‖SD_r‖ = ‖SD_f‖
8: ẑ ← E(x̂)
9: ẑ_f ← ẑ + α SD_f
10: ẑ_r ← ẑ + α SD_r
11: x̂′ ← D(ẑ)
12: x̂_f ← D(ẑ_f)
13: x̂_r ← D(ẑ_r)
14: (x̂′, x̂_f, x̂_r) to verify the semantics of feature f   // Fig. 4a
15: SD_f′ ← W[f′, :]   // another row of W corresponding to feature f′
16: x̂_f′ ← D(ẑ + α SD_f′)
17: (SD_f, SD_f′), (x̂_f, x̂_f′) to compare semantics   // Fig. 4c</p>
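        <p>A compact Python rendering of the core steps (lines 6 to 14) might look like the sketch below; the encoder, decoder, and weight matrix W are assumed to come from the trained VAE and the linear model above, and the random baseline direction is rescaled to the same norm as the semantic direction, as in line 7.</p>
        <preformat>
import numpy as np

def edit_along_direction(encoder, decoder, x_hat, W, f, alpha):
    """Move the latent variable of image x_hat along the semantic direction of feature f."""
    sd_f = W[f]                                          # semantic direction: a row of W
    sd_r = np.random.randn(*sd_f.shape)
    sd_r *= np.linalg.norm(sd_f) / np.linalg.norm(sd_r)  # random direction, same norm
    z_hat = encoder(x_hat)
    return {
        "original": decoder(z_hat),                      # reconstruction of x_hat
        "semantic": decoder(z_hat + alpha * sd_f),       # feature f added/removed (Fig. 4-a2)
        "random":   decoder(z_hat + alpha * sd_r),       # baseline, no clear change (Fig. 4-a3)
    }
        </preformat>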
        <p>Figure 4: (a) detect, (b) cluster, and (c) compare semantic directions at a training stage. The cell color from red over white to blue in the matrix (b) indicates the cosine similarity of two semantic directions from -1 over 0 to 1; (b1-b5) represent five clusters: "weak-woman-ish", "man-ish", "woman-ish", "old-ish", and "weak-man-ish".</p>
        <p>We cluster the 40 feature categories (40 semantics) into five groups using the k-means algorithm based on the cosine similarity between their corresponding semantic directions. Interestingly, the feature categories inside each group present similar semantics. For example, the features "make up", "no beard", and "attractive" are in the same group, which are all "woman-ish" features. With the similar logic, the other four semantic groups are named "man-ish" (e.g., "mustache", "five-o'clock shadow"), "weak-woman-ish" (e.g., "smile", "bangs"), "weak-man-ish" (e.g., "bushy eyebrows", "hat"), and "old-ish" (e.g., "chubby", "bald"). The five clusters can be easily identified from the symmetric pair-wise similarity matrix (Fig. 4b), and we can select any two semantics (i.e., one row and one column) for comparison.</p>
        <p>For example, Fig. 4c shows the negative correlation between the "rosy cheeks" and "male" features, which are from the "woman-ish" (Fig. 4-b3) and "man-ish" (Fig. 4-b2) groups respectively. The negative correlation is indicated by an obtuse angle between the two colored semantic vectors (i.e., SD_f and SD_f′ in Algorithm 1). To visualize these two high-dimensional vectors SD_f and SD_f′ in a 2D plot intuitively, SD_f (corresponding to the selected feature) is always along the horizontal direction, and SD_f′ (corresponding to the compared feature) presents an angle with SD_f, calculated as the angle between them in the original high-dimensional latent space. The length of each colored segment reflects the norm of the corresponding semantic vector.</p>
        <p>Dragging the green/red point in Fig. 4c to the opposite direction (i.e., changing α to a negative value), we can also verify that the opposite direction indeed encodes the opposite feature, e.g., "pale skin" is the opposite of "dark skin". Interestingly, we found several such pairs, showing a similar way of how humans understand these semantics, such as "smile" v.s. "scary face", "bangs" v.s. "high hairlines".</p>
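        <p>The five groups in Fig. 4b can be reproduced along these lines; the sketch below runs k-means on the rows of the pairwise cosine-similarity matrix, which is one plausible reading of "k-means based on the cosine similarity", and the random seed is an assumption.</p>
        <preformat>
import numpy as np
from sklearn.cluster import KMeans

def cluster_semantic_directions(W, n_clusters=5, seed=0):
    """W: (40, latent_dim) semantic directions (rows of the linear model's weight matrix)."""
    U = W / np.linalg.norm(W, axis=1, keepdims=True)
    S = U @ U.T                     # symmetric cosine-similarity matrix (Fig. 4b), in [-1, 1]
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(S)
    return labels, S
        </preformat>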
        <p>From the above explorations and visual evidence, we feel confident to believe the following hypothesis on the semantic structure of the latent space: (1) the latent space tends to encode semantics along unique directions (i.e., semantic directions); (2) smaller angles between semantic directions denote more similar semantics, and opposite semantic directions encode opposite semantics.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.2. Comparing Semantics across VAEs</title>
        <p>The Comparison Module compares two VAE models and facilitates model diagnosis using their latent spaces. The comparison is across different training stages, trainings (with randomly initialized neural network parameters), and VAE models. For each pair of compared models, we run Algorithm 1 to obtain their individual semantic directions and generate the corresponding reconstructed images. Given any selected image, Algorithm 1 also outputs its latent variable, from which, we can regenerate the image with a VAE's decoder. We can use the same VAE's decoder and a different VAE's decoder to reconstruct two images and compare them. From the comparison, we can track the work division between an encoder and a decoder, and also diagnose which of these two networks is more responsible for certain model functions.</p>
        <p>Since we have two encoders and two decoders, there are four possible combinations between them. Each rectangle in Fig. 5a represents one combination. We use purple and orange color to denote model 1 and model 2. The left and right borders' color of a rectangle reflects which model's encoder is in use, whereas the top and bottom borders' color reflects which model's decoder is in use. For example, the top right rectangle in Fig. 5a means the reconstructed image uses encoder 1 (the left and right borders of the rectangle are in purple) and decoder 2 (the top and bottom borders of the rectangle are in orange), i.e., the pair of E1-D2.</p>
        <p>When dragging the horizontal control point (i.e., changing α) in the top of the Comparison Module, these four images will be modified along the semantic directions (SD) for the focused semantics learned from the two models. Fig. 5b shows the mappings, i.e., which image is changing along which semantic direction. For example, Fig. 5i and Fig. 5ii show the reconstructed images when changing the top right image of Fig. 5a along the semantic directions learned from model 1 and model 2, respectively.</p>
        <p>Figure 5: The mapping between images in the Comparison Module. (a) Reconstructed images from different combinations of encoders and decoders of two compared VAEs; (b) reconstructed images when changing along the semantic direction (SD) learned by different VAEs.</p>
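        <p>The four encoder-decoder combinations and the two learned semantic directions behind Fig. 5 can be enumerated as in the following sketch; the model objects and per-model weight matrices are assumed to be available, and this is an illustration of the comparison protocol rather than the system's code.</p>
        <preformat>
def compare_two_vaes(enc1, dec1, enc2, dec2, W1, W2, x, f, alpha):
    """Reconstruct image x with every encoder/decoder pair of two compared VAEs,
    moving each latent along the semantic direction of feature f learned by each model."""
    results = {}
    for e_name, enc in (("E1", enc1), ("E2", enc2)):
        z = enc(x)
        for d_name, dec in (("D1", dec1), ("D2", dec2)):
            for sd_name, W in (("SD1", W1), ("SD2", W2)):
                results[(e_name, d_name, sd_name)] = dec(z + alpha * W[f])
    return results  # 4 E-D pairs x 2 semantic directions = 8 edited reconstructions
        </preformat>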
        <p>Comparing Semantic Directions Across-Time. We take the DFC-VAE in an early training stage and a well-trained stage to perform the comparison. Fig. 6 investigates the DFC-VAE model with parameters from the 3rd (orange) and the 197th (purple) training stage. When moving the latent variable of the image in Fig. 6a along the semantic direction learned in those two stages, all the six reconstructed images generated the "glasses" feature (as shown in Fig. 6b), regardless of the swapping of the encoders and decoders from the two training stages. This observation indicates that different stages of the training encode the semantic direction of the same feature in a consistent way. It also implies another insight on the semantic structure of the latent space, i.e., the semantic directions may have a tolerable range, within which, the learned semantics is evolving over the training process.</p>
        <p>Comparing Semantic Directions Across-Training. Focusing on a well-trained stage, we compared the DFC-VAE from two separate trainings (where the model parameters are randomly initialized in each training). Our goal is to explore whether the semantics is encoded in the same way over the two trainings. We used the same training hyperparameters and trained the same model twice with enough epochs. Fig. 6c investigates the semantic directions of the "glasses" feature from the two trainings. We found the "glasses" feature can only be generated when using the matched encoder-decoder pairs. For example, Fig. 6c reconstructs an image using the decoder from the second training, but the latent variable is moved along the semantic direction learned from the first training. As a result, the "glasses" feature was not generated. Conversely, the "glasses" feature could be generated when moving along the semantic direction learned from the second training, as shown in Fig. 6d. The results shown in Fig. 6e (image with "glasses") and 6f (image without "glasses") further verify this. The observation indicates that different trainings of the same VAE may encode the semantic direction of the same feature differently.</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.3. Diagnosing VAEs via Semantics Comparison</title>
        <p>Learning Process Comparison between Encoders and Decoders. To interpret the learning work division between the encoder and decoder, we explored the DFC-VAE model from a well-trained stage and an early training stage. We swapped the pairing between the two encoders and two decoders to investigate their respective responsibilities, i.e., the well-trained encoder is paired with the early-stage decoder, and the early-stage encoder is paired with the well-trained decoder. By comparing the reconstructed images from them, we discovered that a well-trained encoder is responsible for controlling semantics, while a well-trained decoder is responsible for generating clear images. For example, the image in Fig. 7-a2 reconstructs the image in Fig. 7-a0 using the early-stage encoder (E2) and the well-trained decoder (D1). Although the reconstruction did not catch the features of the original image (e.g., gender, hair style and color), the generated image is clear. On the contrary, the image in Fig. 7-a3 is reconstructed using the well-trained encoder (E1) and the early-stage decoder (D2). The image captures most of the features in the original image, but it is blurry. Similar observations can also be found in Fig. 7b, 7c, and 7d.</p>
        <p>We believe a well-trained encoder can better control semantics because it better captures the correlation between different semantics. In other words, better semantic correlations make the semantic directions more accurate in the latent space generated from a well-trained encoder. For example, Fig. 8 compares the correlations between the "glasses" feature and other features in the 3rd-6th, 8th, 13th, and 197th training stages. It is obvious that the negative correlations of different semantics (i.e., the region in the black dashed lines) were evolving gradually over the training. Comparing the trend of negative to positive correlations between semantics (i.e., red to blue cells), we can see the negative correlations are acquired in later stages.</p>
        <p>Comparing VAE and DFC-VAE. Although the VAE and DFC-VAE shared a similar network structure, the feature perceptual loss used in DFC-VAE dramatically improved the semantics learning. The image features generated from DFC-VAE tend to be less blurry and more recognizable than those from VAE. For example, the left and right image in the five pairs of images in Fig. 9a compare five image features generated from DFC-VAE and VAE respectively from the 3rd training stage. Fig. 9b shows the same comparison but using the parameters of DFC-VAE and VAE from the 197th stage. Comparing Fig. 9a and 9b vertically, i.e., across time, we can see that DFC-VAE enhanced the features in the reconstructed images more than VAE.</p>
        <p>Comparing the semantic correlations learned from DFC-VAE and VAE in those two training stages, we found that both models captured the semantic correlations at a similar pace. For example, the top and bottom rows in Fig. 10 show the semantic correlations from the DFC-VAE and VAE respectively, in stage 3 and 197. As highlighted by the rectangles, the evolutions of the correlations between the five semantics and other semantics are similar in both models, from the 3rd stage (left) to the 197th stage (right).</p>
        <p>Combining our observations from Fig. 9 and Fig. 10, we get a better understanding on how the perceptual loss (from the pre-trained VGG19) was affecting the model, i.e., compared to DFC-VAE, VAE captured the correlations between different semantics but it still could not generate clear features. We suspect that the perceptual loss contributed more to improving the decoder in better reconstructing image features.</p>
        <p>Figure 9: Comparing the reconstructed images along semantic directions (smile, pale skin, mustache, hat, glasses) from DFC-VAE and VAE in (a) the 3rd and (b) the 197th training stage.</p>
        <p>Figure 10: Comparing the semantic correlations learned by DFC-VAE (top row) and VAE (bottom row) between the five interested semantic directions and other semantic directions in the (a) 3rd and (b) 197th stage.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Limitations and Future Work</title>
      <p>LatentVis can be easily adapted to analyze other VAE models, as it is a model-agnostic approach and does not use any model-specific information (e.g., network architectures). The required data are the input images, the reconstructed images, and the learned latent variables at different training stages. The labels for different feature categories are also demanded to train our supervised linear model. One interesting question here is whether finer-granularity labels can further improve the accuracy of the derived semantic directions. For example, the current "glasses" feature includes both "sunglasses" and "normal glasses". Differentiating them as two features may help in more accurately extracting the semantic directions. Additionally, we can also verify the existence of the class hierarchy of features in the latent space. These are interesting research directions for us to explore in the future. However, similar to the current limitation of our work, these future works also heavily depend on the availability of the labeled datasets.</p>
      <p>Moreover, it is also possible to extend LatentVis to VAEs trained on other data types, e.g., texts or audios. Compared to images, those types of data may not be able to be visually interpreted. However, through different visual encodings used in existing works, we believe they can be intuitively presented as well. We plan to investigate more from the literature and spend more efforts in this direction in the future.</p>
      <p>It is worth mentioning that our current explorations in this work are heuristic and based only on one dataset, through which, we hope to shed some light on how the latent space of VAEs captured the semantics of images. More thorough experimental studies on more datasets would be needed to further validate our findings, which is another planned future work for us.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we propose LatentVis, a visual analytics system to interpret and compare the semantics encoded in the latent space of image-based VAEs. The system trains a supervised linear model to bridge the machine-learned latent space with the human-understandable semantic space. From this bridging, we found that data semantics is usually expressed along a fixed direction in the latent space (i.e., semantic direction), and human-interpreted similar/different semantics tend to have smaller/larger angles between their semantic directions. Also, LatentVis can be used to examine and compare VAEs from three different perspectives: (1) different training stages, (2) separate trainings with randomly initialized neural network parameters, and (3) different VAEs. Several interesting points on VAEs are discovered and summarized as follows:</p>
      <p>• Different stages of one training encode the semantic direction of the same feature in a consistent way.</p>
      <p>• Different trainings of the same VAE model may result in the VAE encoding the semantic direction for the same feature in a different way.</p>
      <p>• For a well-trained VAE, its encoder tends to be responsible for controlling semantics, while its decoder tends to be responsible for generating clear images.</p>
      <p>• For the specific dataset we worked on, the perceptual loss of DFC-VAE contributes more to the training of the decoder in better reconstructing image features. Without using the perceptual loss, VAE is still able to accurately capture the semantic correlations.</p>
      <p>These explorations and comparisons demonstrate how the latent spaces can be used to interpret and compare the corresponding VAEs. With the promising results demonstrated in the paper, we are confident in extending LatentVis to other latent variable models or other data types in the future.</p>
      <sec id="sec-6-1">
        <title>The Data Module contains two linked visualization views.</title>
        <p>The first view presents a statistical summary of all
training images. Each bubble in this view represents one
feature category (the color of the bubble corresponds
to different clusters in Fig. 4b), and the distances among
bubbles reflect the Euclidean distances between those
feature categories in the latent space. These distances
are calculated via the Multi-Dimensional Scaling (MDS)
algorithm, whose input is the average latent variable
of images belonging to the same feature category.
Clicking on any bubble in this view will trigger the second
view to display images from the corresponding
category. The second view displays numerous randomly
selected images from the selected image category, so
that users can check the features of those images and
select interested ones for further exploration. To save
the screen space, images are scaled down to 32×32.
Clicking on any image in this view will trigger further
updates in other views.</p>
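        <p>The bubble layout described above can be computed along these lines; a sketch assuming Z holds the encoded latents and Y the 40 binary attributes, with the MDS settings being our assumptions.</p>
        <preformat>
import numpy as np
from sklearn.manifold import MDS

def feature_category_layout(Z, Y, seed=0):
    """Average latent variable per feature category, then MDS projects the 40 centroids to 2D."""
    centroids = np.stack([Z[Y[:, j] == 1].mean(axis=0) for j in range(Y.shape[1])])
    return MDS(n_components=2, random_state=seed).fit_transform(centroids)  # (40, 2) bubble positions
        </preformat>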
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>d e c o d e r 1 d e c o d e r 2 e n c o d e r 1 E 1 E 1</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>D 1 D 2 e n c o d e r 2 E 2</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <issue>D 1</issue>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , Ima-
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>processing systems</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          . [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>computer vision</article-title>
          and pattern recognition,
          <year>2014</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          pp.
          <fpage>2147</fpage>
          -
          <lpage>2154</lpage>
          . [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-net: Con-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          tervention, Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          . [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Metz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          , Unsupervised
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>arXiv:1409</source>
          .
          <fpage>1556</fpage>
          . [31]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          , Deep learn-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>(ICCV)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>