LatentVis: Investigating and Comparing Variational
Auto-Encoders via Their Latent Space
Xiao Liua , Junpeng Wangb
a Department of Computer Science and Engineering, the Ohio State University, 2015 Neil Avenue, Columbus, Ohio, 43210, USA
b Visa Research, 385 Sherman Avenue, Palo Alto, California, 94306, USA



Abstract

As the result of compression and the source of reconstruction, the latent space of Variational Auto-Encoders (VAEs) captures the essence of the training data and hence plays a fundamental role in data understanding and analysis. Focused on revealing what data features/semantics are encoded and how they are related in the latent space, this paper proposes a visual analytics system, LatentVis, to interactively study the latent space for better understanding and diagnosing image-based VAEs. Specifically, we train a supervised linear model to relate the machine-learned latents with human-understandable semantics. With this model, each important data feature is expressed along a unique direction in the latent space (i.e., a semantic direction). Comparing the semantic directions of different features allows us to compare the feature similarity encoded in the latent space, and thus to better understand the encoding process of the corresponding VAE. Moreover, LatentVis empowers us to examine and compare latent spaces across training stages or different VAE models, which can provide useful insights for model diagnosis.

Keywords
Deep generative model, variational auto-encoder, latent space, semantics, visual analytics

Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, Galway, Ireland
email: liu.5764@osu.edu (X. Liu); junpeng.wang.nk@gmail.com (J. Wang)
orcid: 0000-0002-6303-0771 (X. Liu); 0000-0002-1130-9914 (J. Wang)
The work was done while J. Wang was at The Ohio State University.
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

With their powerful capability in feature extraction, Deep Neural Networks (DNNs) have made a series of breakthroughs across a wide range of applications, e.g., image classification [1], object recognition [2], and image segmentation [3]. More interestingly, DNNs also demonstrate excellent performance in feature generation, which has attracted growing research attention [4]. For example, Generative Adversarial Nets (GANs) and Variational Auto-Encoders (VAEs) are able to generate data (including images [5] and sounds [6]) that are almost indistinguishable from real data.

The outstanding performance of DNNs comes from their complicated internal model architectures and long model training processes, which, however, have gone far beyond humans' interpretability. As a result, it is very difficult to explain how Deep Generative Models (DGMs) understand the extracted features and further use them to generate new features. The latent spaces of these models, located at the pivot point between extraction and generation, compress all the extracted features and control what is generated. Investigating them could help to understand and diagnose DGMs, and thus shed light on their mysterious power. However, those latent spaces are usually high-dimensional, and the semantics of individual latent dimensions are not human-understandable.

Recently, we have witnessed many works on interpreting the latent space of DNNs. Some considered a latent space as a high-dimensional manifold and focused on the geometric interpretation of the manifold. For example, [7] showed that geodesic curves on the latent space manifold are approximately straight in their experiments. [8] revealed that a stochastic Riemannian metric in the latent space could produce smoother interpolations than the conventional Euclidean distance. With static visualizations of geometric paths in the latent space, these studies have helped to understand the abstract manifold holistically.

Others explored the semantics of different latent spaces by focusing on specific tasks. For example, [9, 10] analyzed word embeddings and verified the linear arithmetic of semantics in the embedding/latent space, e.g., queen − woman + man ≈ king. Similar linear arithmetic has also been found in the latent space of image-based DGMs [4]. These studies expose some structures of the latent spaces, but are still insufficient to comprehensively reveal their essential semantics.

This paper targets diagnosing image-based VAEs by interactively investigating their latent space, and hence answers three concrete research questions: (1) what semantics are embedded in the latent space of VAEs; (2) how can we transfer the machine-learned latent space into a human-understandable semantic space for better interpretation; and (3) how can we use the latent spaces of VAEs to track and compare VAE models?
To this end, we design and develop a three-module visual analytics prototype, named LatentVis. The Data module presents an interface to interact with the experimental dataset and select images with desired features. The Semantics module identifies and compares the semantic directions of different image features, bridging the machine-encoded latents with human-understandable semantics. The Comparison module compares the latent spaces of (1) the same model in two different training stages, (2) the same model from two separate trainings with randomly initialized network parameters, and (3) two different VAE models. To sum up, the contributions of this paper are three-fold:
• We present LatentVis, a visual analytics system that helps to understand and diagnose VAEs by interactively revealing the encoded semantics of the latent space.

• Enlightened by the linear arithmetic of features, we use a linear model to transfer a machine-learned latent space into a human-understandable semantic space.

• Based on our analysis of the latent space, we propose a model-agnostic approach to compare VAEs across training stages, separate trainings, or different VAE models.

2. Background and Related Works

Interpreting Latent Spaces. DNNs can be considered as functions that transfer data instances from the input data space to a latent space (f : R^m → R^n). A well-trained DNN will preserve the essential information of the input data during this transformation. However, due to the complexity of DNNs, it is a non-trivial problem to reveal or verify what information is preserved and how it is preserved in the latent space. Targeting this problem, many research efforts have been devoted to interpreting the latent spaces of DNNs. For example, [11] showed how the statistics of data can be examined in the latent space representation. [12] interpreted the association between visual concepts and symbolic annotations captured by β-VAE through parallel coordinates plots. Latent embedding learning methods (GLO, LEO, GLANN [13, 14, 15]) were also developed for the interpretation and understanding of latent spaces. Moreover, in natural language processing, the learned embeddings of words/paragraphs also form a latent space. [9] and [16] interpreted this space and found that the correlations between words/paragraphs were well-captured in the space.

Visual Analytics for Deep Learning (VIS4DL). There are two groups of VIS4DL works in general. One focuses on a specific model to reveal its internal working mechanism, such as CNNVis [17], GANViz [18], and ActiVis [19]. These works usually design a visualization system to expose the hidden features and feature connections for specific DNNs on specific datasets. Some works in this group also tried to generalize to different models on various datasets. For example, [20] proposed Network Dissection to quantify the interpretability of latent representations captured by CNNs (AlexNet, VGG, GoogLeNet, ResNet) via the alignment between hidden units and semantic concepts. The other group focuses on using only the model inputs and outputs to interpret/diagnose the model, without touching the intermediate model details (i.e., model-agnostic). For example, [21] proposed a model-agnostic approach to reveal the dominant regions of input images in controlling the prediction results of a classifier. More examples in this group include [22, 23, 24]. Our work needs no examination of the internal working mechanism of VAEs (as semantics are encoded in the space formed by activations, rather than in individual neurons [25, 26]), and thus belongs to the second group. Integrating a linear space transformer into our visual analytics process, we try to present a human-understandable latent space to diagnose DGMs.

Variational Auto-Encoder (VAE) [28] aims to reconstruct the input image from a latent representation of the image encoded/learned by itself. It is comprised of two neural networks: an encoder network encodes the image into a latent variable, and a decoder network decodes the image from the latent variable (Fig. 1a).

Figure 1: (a) The architecture of a VAE; (b) the perceptual loss, introduced in DFC-VAE [27], for feature reconstruction.

Specifically, the encoder maps an input image x to a latent variable z (i.e., z = encoder(x) ∼ q(z|x)), and the decoder maps a latent variable z to an output image x′ (i.e., x′ = decoder(z) ∼ p(x|z)). The encoder and decoder, defined by trainable parameters θ and φ respectively, are optimized by minimizing the following loss function:

l(θ, φ) = −E_{q_θ(z|x)}[log p_φ(x|z)] + KL(q_θ(z|x) ‖ p(z)).

By z = encoder(x), the latent variable of a specific image is readily accessible from the VAE for further semantic explorations. One common issue of VAEs is that the generated images tend to be blurry, due to the aggregated pixel-wise image distance used in the loss function, i.e., the L2 distance between x and x′.
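To make this objective concrete, below is a minimal PyTorch sketch of the loss, assuming the encoder parameterizes the diagonal Gaussian q_θ(z|x) by a mean and a log-variance (the function and variable names are ours, not from the paper):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Negative ELBO: pixel-wise reconstruction error plus the KL term."""
    # E_q[log p(x|z)] approximated by the L2 distance between x and x'
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```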
Deep Feature Consistent VAE (DFC-VAE) [27] is a variant of the regular VAE. It improves the quality of the reconstructed images by replacing the pixel-wise reconstruction loss with a feature perceptual loss [29]. In DFC-VAE, multiple levels of features are extracted from both the input and reconstructed images by passing them through a pre-trained CNN. Each layer of the CNN extracts a certain level of image features. The features from the input and reconstructed images are then used to measure their perceptual distance (Fig. 1b, the L2 loss between the corresponding feature maps). In this work, we adopted this perceptual loss to improve the reconstruction quality. The VGG19 [30] pre-trained on the ImageNet data is used as our pre-trained CNN.
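A sketch of such a perceptual loss is given below, assuming a fixed torchvision VGG19 whose intermediate feature maps are compared with an L2 loss; the specific layer indices are our illustrative choice, not necessarily the ones used in DFC-VAE:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

class PerceptualLoss(torch.nn.Module):
    """L2 distance between VGG19 feature maps of input and reconstruction."""
    def __init__(self, layer_ids=(3, 8, 17)):  # illustrative layer choice
        super().__init__()
        self.features = vgg19(pretrained=True).features.eval()
        for p in self.features.parameters():
            p.requires_grad = False  # the CNN stays fixed; only the VAE trains
        self.layer_ids = set(layer_ids)

    def forward(self, x, x_recon):
        loss, h, h_rec = 0.0, x, x_recon
        for i, layer in enumerate(self.features):
            h, h_rec = layer(h), layer(h_rec)
            if i in self.layer_ids:
                loss = loss + F.mse_loss(h_rec, h, reduction="sum")
            if i >= max(self.layer_ids):  # no need to run deeper layers
                break
        return loss
```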
3. Methodology

3.1. Fundamental Concepts

Image Dataset. We focus on a face image dataset, CelebA [31], to explore the latent space of VAE models in this work. This dataset consists of 202,599 human face images with a resolution of 178×218. Each image has 40 binary attributes (e.g., whether the face is male, whether the face wears glasses). We pre-processed those images by cropping them to 148×148 and scaling them down to 64×64 for our VAEs. Images with the same feature (i.e., the same value on a binary attribute) belong to the same feature category.

Figure 2: Three spaces: (a) the image space is where the CelebA images reside, each pixel being an independent dimension; (b) the latent space is the VAE-learned representation of those images, with the VAE encoder and decoder enabling the transformation between the image space and the latent space; (c) the semantic space is derived from the latent space under the supervision of the 40 binary features of the images (using our linear model).
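This pre-processing can be expressed as a short torchvision pipeline; the center crop is our assumption of how the 148×148 crop is taken:

```python
from torchvision import transforms

# Crop CelebA images from 178x218 to 148x148, then downscale to 64x64.
preprocess = transforms.Compose([
    transforms.CenterCrop(148),   # assumed: crop around the image center
    transforms.Resize(64),        # 148x148 -> 64x64
    transforms.ToTensor(),        # to a CxHxW float tensor in [0, 1]
])
```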
We focused on this face image dataset for two reasons. First, this dataset presents rich attributes for the same object (the human face) at the same scale. Compared to datasets with numerous objects at different scales (e.g., ImageNet), a VAE can more accurately capture the underlying data distribution. Second, the well-labeled attributes in this dataset help to interpret the semantics encoded in the latent space of VAEs, through which we derived the semantic space using our linear model (Fig. 2).

Image Semantics. The semantics of face images are the existence and scale of the 40 features in the CelebA dataset. A well-trained VAE can transfer the semantics from the image space to the VAE's latent space (i.e., from Fig. 2a to 2b). However, the transferred semantics in the latent space are not human-understandable. Hence, our goal is to interpret them via a semantic space (Fig. 2c), in which we can explore whether the image semantics have been accurately encoded (Fig. 2i) and how they are encoded (Fig. 2ii).

Semantic Direction. In Fig. 2b, we use a point to denote the VAE-encoded latent variable of the corresponding image in the image space. All latent variables for images of the same category (e.g., "glasses", "mustache") form a cluster in the space, denoted as a blue bubble in Fig. 2b. We identify the direction from one cluster without a particular feature to the cluster with that feature as the semantic direction for the feature. For example, in Fig. 2b, the directions of the red and green lines reflect the semantic directions for "mustache" and "glasses". Moving the latent variable of one image along a semantic direction will change the corresponding feature of the reconstructed image the most. Along this semantic direction, a vector with a certain length is referred to as a Semantic Vector. For CelebA, there are 40 features, so we have 40 unique semantic directions.

3.2. Our Contributions

The Linear Model. Enlightened by the linear arithmetic of features (e.g., woman face without glasses + (man face with glasses − man face without glasses) ≈ woman face with glasses) [4], we trained a linear model to quantify the semantic directions, as well as to transform the latent space into a human-understandable semantic space (i.e., from Fig. 2b to 2c). The linear model, y = Wᵀ·z + b, is trained using the latent variables of all CelebA images encoded by the VAE (denoted as z) and the 40 binary attributes of the images (denoted as y). z is a vector with many dimensions (the black arrow in Fig. 2b), and y is a vector with 40 dimensions (the green arrow in Fig. 2c). Each row of the weight matrix Wᵀ contains the derivatives of a certain feature with respect to all the latent dimensions, and represents a semantic direction in the latent space. Each column of Wᵀ represents the contributions of a latent dimension to all the semantics.
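A minimal sketch of fitting this linear model with scikit-learn is shown below; the names are ours, and we assume the latent variables and attribute labels have already been collected into arrays:

```python
from sklearn.linear_model import LinearRegression

def fit_semantic_directions(Z, Y):
    """Fit y = W^T z + b over all encoded images.

    Z: (n_images, latent_dim) latent variables from the encoder.
    Y: (n_images, 40) binary attribute labels.
    Returns D: (40, latent_dim); row f is the semantic direction of feature f.
    """
    model = LinearRegression().fit(Z, Y)
    return model.coef_, model.intercept_  # coef_ rows are the rows of W^T
```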
Analytic Tasks and LatentVis. We focus on three analytic tasks to better understand and diagnose VAEs through the lens of their latent space: (T1) navigating the dataset and selecting features; (T2) visualizing and comparing the image semantics/semantic directions in a latent space; and (T3) facilitating model comparison and diagnosis by comparing latent spaces. Following these tasks, we propose a visual analytics system, LatentVis (Fig. 3, bottom), which contains three analytical modules corresponding to a hierarchical information flow (Fig. 3, top).

Figure 3: The framework of the LatentVis system.

The Data Module (Fig. 3a) gives an overview of the studied dataset, allowing us to flexibly explore data instances. It is also an interface to select any feature category and data instance of interest for further analysis in other modules (please check details in the Appendix).

The Semantics Module (Fig. 3b) demonstrates what semantics have been captured by the VAE, and how the data features are correlated in the latent space, by connecting the image space with the VAE latent space. Its three views follow a hierarchical information flow to detect, cluster, and compare semantics (see details in Sec. 4.1).

The Comparison Module (Fig. 3c) compares the semantic directions of the latent spaces from different VAEs to diagnose these models. The diagnosis of the compared VAEs is performed by examining the learning work division between the encoder and decoder (see details in Sec. 4.2).
4. Experiments and Results

Neural Network Structure. We worked with one regular VAE and one DFC-VAE, both with an encoder and a decoder of four convolutional layers. The four-layer encoder compresses the 64×64×3 CelebA images to 32×32×32, 16×16×64, 8×8×128, and 4×4×256. The compression result is then flattened and mapped to a 100D Gaussian distribution, represented by a 100D mean and a 100D standard deviation, through two fully-connected layers. The decoder has a structure symmetric to the encoder, but with the order of the layers reversed to up-sample the 100D latent variables (sampled from the 100D Gaussian).

The difference between the VAE and DFC-VAE is whether a pre-trained VGG19 model was used to compute the perceptual loss. We trained the VAE once and the DFC-VAE twice, with the same batch size and the Adam optimizer in all trainings. Note, however, that the hyperparameters for these trainings could be kept the same or varied on purpose to compare the trained models and investigate the effect of the hyperparameters, e.g., comparing models trained with different learning rates to study their convergence speed.

All three trainings used the 202,599 CelebA images with a batch size of 64. Every 800 batches were considered as a training stage to collect model statistics, such as loss values, and all three trainings were run for 197 stages (i.e., 157,600 batches).
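The encoder half of this architecture can be sketched in PyTorch as follows; the kernel size, stride, and activation are our assumptions (the paper specifies only the feature-map shapes), and we output a log-variance rather than a standard deviation:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """64x64x3 -> 32x32x32 -> 16x16x64 -> 8x8x128 -> 4x4x256 -> 100D Gaussian."""
    def __init__(self, latent_dim=100):
        super().__init__()
        chans = [3, 32, 64, 128, 256]
        layers = []
        for c_in, c_out in zip(chans, chans[1:]):
            # stride-2 convolutions halve the spatial resolution each time
            layers += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.conv = nn.Sequential(*layers)
        self.fc_mean = nn.Linear(4 * 4 * 256, latent_dim)
        self.fc_logvar = nn.Linear(4 * 4 * 256, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(start_dim=1)
        return self.fc_mean(h), self.fc_logvar(h)  # parameters of q(z|x)
```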
4.1. Detecting and Comparing Semantic Directions

We propose Algorithm 1 to detect and compare semantic directions in the Semantics Module (Fig. 4). First, we feed all CelebA images to the well-trained VAE model to obtain their latent variables. Then, we use these latent variables and their corresponding 40 binary attributes to train a linear model to capture the semantic directions encoded in the latent space. To verify the effectiveness of the semantic directions, we also generate random directions in the latent space. Given any selected image, we visualize the modified versions of the image that result from changing its latent variable along the 40 semantic directions and a random direction. For example, by dragging the control points for the features "bangs" and "glasses" in Fig. 4a (i.e., changing the length of a semantic vector, λ ∈ [−10, 10]), we observed how those two features were added (Fig. 4-a2) to the selected image (Fig. 4-a1). However, no obvious change towards a particular feature was found in the image when dragging the control point on the axis representing random directions (Fig. 4-a3).

Figure 4: Semantics Module on the DFC-VAE from the 197th training stage: (a) detect, (b) cluster, and (c) compare semantic directions. The cell color from red over white to blue in the matrix (b) indicates the cosine similarity of two semantic directions from −1 over 0 to 1; (b1-b5) represent five clusters: "weak-woman-ish", "man-ish", "woman-ish", "old-ish", and "weak-man-ish".

Algorithm 1: Detecting and Comparing Semantic Directions
Require: images X = {x_i, i = 1..n}, feature labels Y = {y_i, i = 1..n}
Require: selected image x̂ ∈ X, selected feature f, compared feature f′, the length of a semantic vector λ
Require: the VAE model with an encoder and a decoder
 1: for i in range(n) do
 2:     z_i ← encoder(x_i)
 3: end for
 4: train a linear model y = Wᵀ·z + b
 5: semantic directions D ← Wᵀ
 6: d_f ← D[f, :]        // the row of D corresponding to feature f
 7: randomly initialize d_r with ‖d_r‖ = ‖d_f‖
 8: ẑ ← encoder(x̂)
 9: ẑ_f ← ẑ + λ·d_f
10: ẑ_r ← ẑ + λ·d_r
11: x̂_f ← decoder(ẑ_f)
12: x̂_r ← decoder(ẑ_r)
13: visualize(x̂, x̂_f, x̂_r) to verify the semantics in d_f        // Fig. 4a
14: x_f ← decoder(λ·d_f)
15: d_f′ ← D[f′, :]      // the row of D corresponding to feature f′
16: x_f′ ← decoder(λ·d_f′)
17: visualize (λ·d_f, λ·d_f′) and (x_f, x_f′) to compare semantics  // Fig. 4c

We cluster the 40 feature categories (40 semantics) into five groups using the K-means algorithm, based on the cosine similarity between their corresponding semantic directions. Interestingly, the feature categories inside each group present similar semantics. For example, the features "make up", "no beard", and "attractive" are in the same group, and all are "woman-ish" features. With similar logic, the other four semantic groups are named "man-ish" (e.g., "mustache", "five-o'clock shadow"), "weak-woman-ish" (e.g., "smile", "bangs"), "weak-man-ish" (e.g., "bushy eyebrows", "hat"), and "old-ish" (e.g., "chubby", "bald"). The five clusters can be easily identified from the symmetric pair-wise similarity matrix (Fig. 4b), and we can select any two semantics (i.e., one row and one column) for comparison. For example, Fig. 4c shows the negative correlation between the "rosy cheeks" and "male" features, which are from the "woman-ish" (Fig. 4-b3) and "man-ish" (Fig. 4-b2) groups, respectively. The negative correlation is indicated by an obtuse angle between the two colored semantic vectors (i.e., λ·d_f and λ·d_f′ in Algorithm 1). To visualize these two high-dimensional vectors intuitively in a 2D plot, λ·d_f (corresponding to the selected feature) is always drawn along the horizontal direction, and λ·d_f′ (corresponding to the compared feature) is drawn at an angle to λ·d_f, calculated as the angle between them in the original high-dimensional latent space. The length of each colored segment reflects the norm of the corresponding semantic vector. Dragging the green/red point in Fig. 4c in the opposite direction (i.e., changing λ to a negative value), we can also verify that the opposite direction indeed encodes the opposite feature, e.g., "pale skin" is the opposite of "dark skin". Interestingly, we found several such pairs, showing a similar way to how humans understand these semantics, such as "smile" vs. "scary face", and "bangs" vs. "high hairlines".
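In code, the latent-editing core of Algorithm 1 (lines 6-13) amounts to a few tensor operations; the sketch below assumes the VAE exposes an encoder returning the latent mean and a decoder, and that D holds the rows of Wᵀ from the fitted linear model:

```python
import torch

@torch.no_grad()
def edit_along_direction(encoder_mean, decoder, x_sel, D, f, lam):
    """Move a selected image's latent variable along semantic direction d_f
    and along a random direction of equal norm, then decode both."""
    z = encoder_mean(x_sel)              # latent variable of the selected image
    d_f = D[f]                           # semantic direction of feature f
    d_r = torch.randn_like(d_f)
    d_r *= d_f.norm() / d_r.norm()       # random direction with ||d_r|| = ||d_f||
    x_f = decoder(z + lam * d_f)         # should add/strengthen feature f
    x_r = decoder(z + lam * d_r)         # should show no coherent feature change
    return x_f, x_r
```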
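Likewise, the five-group clustering and the pairwise matrix behind Fig. 4b can be sketched as follows; L2-normalizing the directions first makes K-means' Euclidean distance a monotone function of cosine similarity, approximating the cosine-based grouping described above:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_semantic_directions(D, k=5):
    """Cluster the 40 semantic directions and build the similarity matrix."""
    D_unit = normalize(D)                       # unit-norm rows
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(D_unit)
    sim = D_unit @ D_unit.T                     # 40x40 cosine matrix (Fig. 4b)
    return labels, sim
```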
From the above explorations and visual evidence, we feel confident in the following hypotheses on the semantic structure of the latent space: (1) the latent space tends to encode semantics along unique directions (i.e., semantic directions); (2) smaller angles between semantic directions denote more similar semantics, and opposite semantic directions encode opposite semantics.

4.2. Comparing Semantics across VAEs

The Comparison Module compares two VAE models and facilitates model diagnosis using their latent spaces. The comparison is across different training stages, trainings (with randomly initialized neural network parameters), and VAE models. For each pair of compared models, we run Algorithm 1 to obtain their individual semantic directions and generate the corresponding reconstructed images. Given any selected image, Algorithm 1 also outputs its latent variable, from which we can regenerate the image with a VAE's decoder. We can use the same VAE's decoder and a different VAE's decoder to reconstruct two images and compare them. From the comparison, we can track the work division between an encoder and a decoder, and also diagnose which of these two networks is more responsible for certain model functions.

Since we have two encoders and two decoders, there are four possible combinations between them. Each rectangle in Fig. 5a represents one combination. We use purple and orange to denote model 1 and model 2. The color of a rectangle's left and right borders reflects which model's encoder is in use, whereas the color of its top and bottom borders reflects which model's decoder is in use. For example, the top-right rectangle in Fig. 5a means the reconstructed image uses encoder 1 (the left and right borders of the rectangle are purple) and decoder 2 (the top and bottom borders of the rectangle are orange), i.e., the pair E1-D2.

Figure 5: The mapping between images in the Comparison Module. (a) Reconstructed images from different combinations of the encoders and decoders of two compared VAEs; (b) reconstructed images when changing along the semantic direction (SD) learned by different VAEs.
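Enumerating the four encoder-decoder combinations is straightforward; the sketch below assumes each encoder returns the latent mean, and applies one semantic direction per combination (one plausible assignment; the exact SD-to-combination mapping used by LatentVis is the one illustrated in Fig. 5b):

```python
import itertools
import torch

@torch.no_grad()
def four_combinations(encoders, decoders, x, directions, f, lam):
    """Reconstruct x with every encoder-decoder pair (E1-D1, E1-D2, E2-D1,
    E2-D2), shifting each latent along a semantic direction of feature f.
    `directions[j]` holds the direction matrix learned from model j+1."""
    outs = {}
    for (ei, enc), (di, dec) in itertools.product(
            enumerate(encoders, start=1), enumerate(decoders, start=1)):
        z = enc(x)                            # latent mean from encoder ei
        d = directions[di - 1][f]             # assumed: SD from the decoder's model
        outs[f"E{ei}-D{di}"] = dec(z + lam * d)
    return outs
```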
When dragging the horizontal control point (i.e., changing λ) at the top of the Comparison Module, these four images will be modified along the semantic directions (SD) of the focused semantics learned from the two models. Fig. 5b shows the mappings, i.e., which image changes along which semantic direction. For example, Fig. 5i and Fig. 5ii show the reconstructed images when changing the top-right image of Fig. 5a along the semantic directions learned from model 1 and model 2, respectively.

Comparing Semantic Directions Across Time. We take the DFC-VAE at an early training stage and at a well-trained stage to perform the comparison. Fig. 6 investigates the DFC-VAE model with parameters from the 3rd (orange) and the 197th (purple) training stages.

Figure 6: The Comparison Module on the "glasses" feature. Left: the orange and purple colors represent the DFC-VAE model from the 3rd and the 197th training stages. Right: the orange and purple colors represent two separate trainings of the same DFC-VAE model at the 197th training stage.

When moving the latent variable of the image in Fig. 6a along the semantic direction learned in those two stages, all six reconstructed images generated the "glasses" feature (as shown in Fig. 6b), regardless of the swapping of the encoders and decoders from the two training stages. This observation indicates that different stages of the training encode the semantic direction of the same feature in a consistent way. It also implies another insight into the semantic structure of the latent space, i.e., the semantic directions may have a tolerable range, within which the learned semantics keep evolving over the training process.

Comparing Semantic Directions Across Trainings. Focusing on a well-trained stage, we compared the DFC-VAE from two separate trainings (where the model parameters are randomly initialized in each training). Our goal is to explore whether the semantics are encoded in the same way over the two trainings. We used the same training hyperparameters and trained the same model twice with enough epochs. Fig. 6c investigates the semantic directions of the "glasses" feature from the two trainings. We found the "glasses" feature can only be generated when using the matched encoder-decoder pairs. For example, Fig. 6c reconstructs an image using the decoder from the second training, but the latent variable is moved along the semantic direction learned from the first training. As a result, the "glasses" feature was not generated. Conversely, the "glasses" feature could be generated when moving along the semantic direction learned from the second training, as shown in Fig. 6d. The results shown in Fig. 6e (image with "glasses") and 6f (image without "glasses") further verify this. The observation indicates that different trainings of the same VAE may encode the semantic direction of the same feature differently.

4.3. Diagnosing VAEs via Semantics Comparison
Learning Process Comparison between Encoders and Decoders. To interpret the learning work division between the encoder and decoder, we explored the DFC-VAE model from a well-trained stage and an early training stage. We swapped the pairing between the two encoders and two decoders to investigate their respective responsibilities, i.e., the well-trained encoder is paired with the early-stage decoder, and the early-stage encoder is paired with the well-trained decoder. By comparing the reconstructed images from them, we discovered that a well-trained encoder is responsible for controlling semantics, while a well-trained decoder is responsible for generating clear images. For example, the image in Fig. 7-a2 reconstructs the image in Fig. 7-a0 using the early-stage encoder (E2) and the well-trained decoder (D1). Although the reconstruction did not catch the features of the original image (e.g., gender, hair style and color), the generated image is clear. On the contrary, the image in Fig. 7-a3 is reconstructed using the well-trained encoder (E1) and the early-stage decoder (D2). The image captures most of the features of the original image, but it is blurry. Similar observations can also be found in Fig. 7b, 7c, and 7d.

Figure 7: Images reconstructed from model 1 (DFC-VAE, 197th stage, in purple) and model 2 (DFC-VAE, 3rd stage, in orange) with matched and swapped encoder-decoder pairs. The numbers 1, 2, 3, 4 represent the combinations E1-D1, E2-D1, E1-D2, and E2-D2, respectively.

We believe a well-trained encoder can better control semantics because it better captures the correlations between different semantics. In other words, better semantic correlations make the semantic directions more accurate in the latent space generated by a well-trained encoder. For example, Fig. 8 compares the correlations between the "glasses" feature and the other features in the 3rd-6th, 8th, 13th, and 197th training stages. It is obvious that the negative correlations with different semantics (i.e., the region in the black dashed lines) were evolving gradually over the training. Comparing the trend from negative to positive correlations between semantics (i.e., red to blue cells), we can see that the negative correlations are acquired in later stages.

Figure 8: Comparing the cosine similarity between the "glasses" semantic direction and other semantic directions across seven training stages; red over white to blue denotes values from −1 over 0 to 1.
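Such a Fig. 8-style view only requires the direction matrices saved at each stage; a sketch, assuming one fitted (40, latent_dim) matrix per stage:

```python
import numpy as np

def similarity_over_stages(D_stages, f):
    """Cosine similarity between feature f's semantic direction and all
    other directions, one row per saved training stage (cf. Fig. 8)."""
    rows = []
    for D in D_stages:
        D_unit = D / np.linalg.norm(D, axis=1, keepdims=True)
        rows.append(D_unit @ D_unit[f])   # cosine of every direction with d_f
    return np.stack(rows)                 # shape: (n_stages, 40)
```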
Comparing VAE and DFC-VAE. Although the VAE and DFC-VAE share a similar network structure, the feature perceptual loss used in DFC-VAE dramatically improved the semantics learning. The image features generated by DFC-VAE tend to be less blurry and more recognizable than those from the VAE. For example, the left and right images in the five pairs of images in Fig. 9a compare five image features generated from DFC-VAE and VAE, respectively, at the 3rd training stage. Fig. 9b shows the same comparison but using the parameters of DFC-VAE and VAE from the 197th stage. Comparing Fig. 9a and 9b vertically, i.e., across time, we can see that DFC-VAE enhanced the features in the reconstructed images more than the VAE did.

Figure 9: Comparing the reconstructed images along semantic directions at λ = 4.5 from DFC-VAE (purple borders) and VAE (orange borders) in (a) the 3rd and (b) the 197th stage.

Comparing the semantic correlations learned from DFC-VAE and VAE in those two training stages, we found that both models captured the semantic correlations at a similar pace. For example, the top and bottom rows of the 10 row-pairs in Fig. 10 show the semantic correlations from the DFC-VAE and VAE, respectively, in stages 3 and 197. As highlighted by the rectangles, the evolutions of the correlations between the five semantics and the other semantics are similar in both models, from the 3rd stage (left) to the 197th stage (right).

Figure 10: Comparing the semantic correlations learned by DFC-VAE (top row) and VAE (bottom row) between the five semantic directions of interest and other semantic directions in (a) the 3rd and (b) the 197th stage.

Combining our observations from Fig. 9 and Fig. 10, we get a better understanding of how the perceptual loss (from the pre-trained VGG19) was affecting the model: compared to DFC-VAE, the VAE captured the correlations between different semantics, but it still could not generate clear features. We suspect that the perceptual loss contributed more to improving the decoder in better reconstructing image features.
5. Limitations and Future Work

LatentVis can be easily adapted to analyze other VAE models, as it is a model-agnostic approach and does not use any model-specific information (e.g., network architectures). The required data are the input images, the reconstructed images, and the learned latent variables at different training stages. Labels for the different feature categories are also required to train our supervised linear model. One interesting question here is whether finer-granularity labels can further improve the accuracy of the derived semantic directions. For example, the current "glasses" feature includes both "sunglasses" and "normal glasses". Differentiating them as two features may help to extract the semantic directions more accurately. Additionally, we could also verify the existence of a class hierarchy of features in the latent space. These are interesting research directions for us to explore in the future. However, similar to the current limitation of our work, these future works also heavily depend on the availability of labeled datasets.

Moreover, it is also possible to extend LatentVis to VAEs trained on other data types, e.g., texts or audio. Compared to images, those types of data may not be directly visually interpretable. However, through the different visual encodings used in existing works, we believe they can be intuitively presented as well. We plan to investigate the literature further and spend more effort in this direction in the future.

It is worth mentioning that our current explorations in this work are heuristic and based on only one dataset, through which we hope to shed some light on how the latent space of VAEs captures the semantics of images. More thorough experimental studies on more datasets would be needed to further validate our findings, which is another planned future work for us.

6. Conclusion

In this paper, we propose LatentVis, a visual analytics system to interpret and compare the semantics encoded in the latent space of image-based VAEs. The system trains a supervised linear model to bridge the machine-learned latent space with a human-understandable semantic space. From this bridging, we found that data semantics are usually expressed along a fixed direction in the latent space (i.e., a semantic direction), and that semantics interpreted by humans as similar/different tend to have smaller/larger angles between their semantic directions. Also, LatentVis can be used to examine and compare VAEs from three different perspectives: (1) different training stages, (2) separate trainings with randomly initialized neural network parameters, and (3) different VAEs. Several interesting points about VAEs are discovered and summarized as follows:

• Different stages of one training encode the semantic direction of the same feature in a consistent way.

• Different trainings of the same VAE model may result in the VAE encoding the semantic direction of the same feature in different ways.

• For a well-trained VAE, its encoder tends to be responsible for controlling semantics, while its decoder tends to be responsible for generating clear images.

• For the specific dataset we worked on, the perceptual loss of DFC-VAE contributes more to the training of the decoder in better reconstructing image features. Without the perceptual loss, the VAE is still able to accurately capture the semantic correlations.

These explorations and comparisons demonstrate how latent spaces can be used to interpret and compare the corresponding VAEs. With the promising results demonstrated in this paper, we are confident in extending LatentVis to other latent-variable models or other data types in the future.
through which, we hope to shed some light on how References
the latent space of VAEs captured the semantics of im-
                                                            [1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Ima-
ages. More thorough experimental studies on more
                                                                genet classification with deep convolutional neu-
datasets would be needed to further validate our find-
                                                                ral networks, in: Advances in neural information
ings, which is another planned future work for us.
                                                                processing systems, 2012, pp. 1097–1105.
                                                            [2] D. Erhan, C. Szegedy, A. Toshev, D. Anguelov,
6. Conclusion

In this paper, we propose LatentVis, a visual analytics system to interpret and compare the semantics encoded in the latent space of image-based VAEs. The system trains a supervised linear model to bridge the machine-learned latent space with the human-understandable semantic space. From this bridging, we found that data semantics is usually expressed along a fixed direction in the latent space (i.e., the semantic direction). We also found that:

• Different training runs of the same VAE model may encode the semantic direction of the same feature in different ways.

• For a well-trained VAE, its encoder tends to be responsible for controlling semantics, while its decoder tends to be responsible for generating clear images.

• For the specific dataset we worked on, the perceptual loss of DFC-VAE contributes more to training the decoder to better reconstruct image features; without the perceptual loss, the VAE is still able to accurately capture the semantic correlations.

These explorations and comparisons demonstrate how latent spaces can be used to interpret and compare the corresponding VAEs. With the promising results demonstrated in this paper, we are confident in extending LatentVis to other latent variable models or other data types in the future.

References

[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] D. Erhan, C. Szegedy, A. Toshev, D. Anguelov, Scalable object detection using deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2147–2154.
[3] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[4] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434 (2015).
[5] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability, and variation, arXiv preprint arXiv:1710.10196 (2017).
[6] H. Sak, A. Senior, K. Rao, F. Beaufays, Fast and accurate recurrent neural network acoustic models for speech recognition, arXiv preprint arXiv:1507.06947 (2015).
[7] H. Shao, A. Kumar, P. Thomas Fletcher, The Riemannian geometry of deep generative models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 315–323.
[8] G. Arvanitidis, L. K. Hansen, S. Hauberg, Latent space oddity: On the curvature of deep generative models, arXiv preprint arXiv:1710.11379 (2017).
[9] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[10] S. Liu, P.-T. Bremer, J. J. Thiagarajan, V. Srikumar, B. Wang, Y. Livnat, V. Pascucci, Visual exploration of semantic relationships in neural word embeddings, IEEE Transactions on Visualization and Computer Graphics 24 (2018) 553–562.
[11] L. Kuhnel, T. Fletcher, S. Joshi, S. Sommer, Latent space non-linear statistics, arXiv preprint arXiv:1805.07632 (2018).
[12] J. Wang, W. Zhang, H. Yang, ScanViz: Interpreting the symbol-concept association captured by deep neural networks through visual analytics, in: 2020 IEEE Pacific Visualization Symposium (PacificVis), IEEE, 2020, pp. 51–60.
[13] P. Bojanowski, A. Joulin, D. Lopez-Paz, A. Szlam, Optimizing the latent space of generative networks, arXiv preprint arXiv:1707.05776 (2017).
[14] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, R. Hadsell, Meta-learning with latent embedding optimization, arXiv preprint arXiv:1807.05960 (2018).
[15] Y. Hoshen, J. Malik, Non-adversarial image synthesis with generative latent nearest neighbors, arXiv preprint arXiv:1812.08985 (2018).
[16] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International Conference on Machine Learning, 2014, pp. 1188–1196.
[17] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, S. Liu, Towards better analysis of deep convolutional neural networks, arXiv preprint arXiv:1604.07043 (2016).
[18] J. Wang, L. Gou, H. Yang, H.-W. Shen, GANViz: A visual analytics approach to understand the adversarial game, IEEE Transactions on Visualization and Computer Graphics 24 (2018) 1905–1917.
[19] M. Kahng, P. Y. Andrews, A. Kalro, D. H. Chau, ActiVis: Visual exploration of industry-scale deep neural network models, arXiv preprint arXiv:1704.01942 (2017).
[20] D. Bau, B. Zhou, A. Khosla, A. Oliva, A. Torralba, Network dissection: Quantifying interpretability of deep visual representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6541–6549.
[21] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 1135–1144.
[22] Y. Ming, H. Qu, E. Bertini, RuleMatrix: Visualizing and understanding classifiers with rules, IEEE Transactions on Visualization and Computer Graphics 25 (2019) 342–352.
[23] J. Zhang, Y. Wang, P. Molino, L. Li, D. S. Ebert, Manifold: A model-agnostic framework for interpretation and diagnosis of machine learning models, IEEE Transactions on Visualization and Computer Graphics 25 (2019) 364–373.
[24] J. Wang, L. Gou, W. Zhang, H. Yang, H.-W. Shen, DeepVID: Deep visual interpretation and diagnosis for image classifiers via knowledge distillation, IEEE Transactions on Visualization and Computer Graphics 25 (2019) 2168–2180.
[25] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530 (2016).
[26] E. D. Cubuk, B. Zoph, S. S. Schoenholz, Q. V. Le, Intriguing properties of adversarial examples, arXiv preprint arXiv:1711.02846 (2017).
[27] X. Hou, L. Shen, K. Sun, G. Qiu, Deep feature consistent variational autoencoder, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2017, pp. 1133–1141.
[28] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
[29] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: European Conference on Computer Vision, Springer, 2016, pp. 694–711.
[30] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[31] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of the International Conference on Computer Vision (ICCV), 2015.



A. Appendix
The Data Module contains two linked visualization views. The first view presents a statistical summary of all training images. Each bubble in this view represents one feature category (the bubble's color corresponds to the clusters in Fig. 4b), and the distances among bubbles reflect the Euclidean distances between those feature categories in the latent space. The 2-D bubble positions are computed with the Multi-Dimensional Scaling (MDS) algorithm, whose input is the average latent variable of the images belonging to each feature category, so that the on-screen distances approximate the latent-space distances between categories. Clicking on any bubble in this view triggers the second view to display images from the corresponding category. The second view displays a number of randomly selected images from the selected category, so that users can inspect the features of those images and select ones of interest for further exploration. To save screen space, the images are scaled down to 32×32 pixels. Clicking on any image in this view triggers further updates in the other views.
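For readers who wish to reproduce this layout, below is a minimal sketch of the computation described above. The function name category_layout and the input arrays latents and categories are hypothetical, and scikit-learn's MDS is used as one possible implementation of the algorithm.

    import numpy as np
    from sklearn.manifold import MDS

    def category_layout(latents, categories):
        # latents:    (n_images, latent_dim) latent variables of all images.
        # categories: (n_images,) feature-category id of each image.
        cats = np.unique(categories)
        # Average latent variable of the images belonging to each category.
        centroids = np.stack([latents[categories == c].mean(axis=0)
                              for c in cats])
        # MDS lays the categories out in 2-D so that the pairwise Euclidean
        # distances between the average latents are preserved as well as
        # possible.
        xy = MDS(n_components=2, dissimilarity="euclidean",
                 random_state=0).fit_transform(centroids)
        return dict(zip(cats.tolist(), map(tuple, xy)))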