<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LatentVis: Investigating and Comparing Variational Auto-Encoders via Their Latent Space</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiao Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junpeng Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, the Ohio State University</institution>
          ,
          <addr-line>2015 Neil Avenue, Columbus, Ohio, 43210</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Visa Research</institution>
          ,
          <addr-line>385 Sherman Avenue, Palo Alto, California, 94306</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>As the result of compression and the source of reconstruction, the latent space of Variational Auto-Encoders (VAEs) captures the essence of the training data and hence plays a fundamental role in data understanding and analysis. Focused on revealing what data features/semantics are encoded and how they are related in the latent space, this paper proposes a visual analytics system, i.e., LatentVis, to interactively study the latent space for better understanding and diagnosing image-based VAEs. Specifically, we train a supervised linear model to relate the machine-learned latents with the human-understandable semantics. With this model, each important data feature is expressed along a unique direction in the latent space (i.e., semantic direction). Comparing the semantic directions of different features allows us to compare the feature similarity encoded in the latent space, and thus to better understand the encoding process of the corresponding VAE. Moreover, LatentVis empowers us to examine and compare latent spaces across various training stages, or different VAE models, which can provide useful insight into model diagnosis.</p>
      </abstract>
      <kwd-group>
<kwd>Deep generative model</kwd>
        <kwd>variational auto-encoder</kwd>
        <kwd>latent space</kwd>
        <kwd>semantics</kwd>
        <kwd>visual analytics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Investigating them could help to understand and di</title>
        <p>agnose the DGMs, and thus shed light on the mystery
With the powerful capability in feature extractions, power of DGMs. However, those latent spaces are
usuDeep Neural Networks (DNNs) have made a series of ally with high-dimensionality and the semantics of
inbreakthroughs across a wide range of applications, e.g., dividual latent dimension is not human-understandable.
image classification [1], object recognition [2], image Recently, we have witnessed many works on
intersegmentation [3], etc. More interestingly, DNNs also preting the latent space of DNNs. Some considered a
demonstrate excellent performance in feature genera- latent space as a high-dimensional manifold and
fotions, which has attracted more research attention [4]. cused on the geometric interpretation of the manifold.
For example, Generative Adversarial Nets (GANs) and For example, [7] showed that geodesic curves on the
Variational Auto-Encoders (VAEs) are able to generate latent space manifold are approximately straight in their
data (including images [5], sounds [6]) that are almost experiments. [8] revealed that a stochastic
Riemanindistinguishable from real data. nian metric in the latent space could produce smoother</p>
        <p>
          The outstanding performance of DNNs comes from interpolations than the conventional Euclidean distance.
their complicated internal model architectures and the With static visualizations of the geometric path in the
long-time model training processes, which, however, latent space, these studies have helped to understand
have gone far beyond humans’ interpretability. As a the abstractive manifold holistically.
result, it is very dificult to explain how Deep Genera- Others explored the semantics of diferent latent
spative Models (DGMs) understand the extracted features ces by focusing on specific tasks. For example, [9, 10]
and further use them to generate new features. The la- analyzed the word embedding and verified the linear
tent spaces of these models, located at the pivot point arithmetic of the semantics in the embedding/latent
between extraction and generation, compress all the space, e.g.,  − + ≈ . Similar linear
extracted features and control what to be generated. arithmetic has also been found in the latent space of
image-based DGMs [4]. These studies expose some
Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, structures of the latent spaces, but are still insuficient
eGmaalwila:yl,iIur.e5l7a6n4d@osu.edu (X. Liu); junpeng.wang.nk@gmail.com (J. to comprehensively reveal their essential semantics.
Wang); The work was done while the author was at The Ohio State This paper targets to diagnose image-based VAEs
University. (J. Wang) by interactively investigating their latent space, and
orcid: 0000-0002-6303-0771 (X. Liu); 0000-0002-1130-9914 (J. hence answers three concrete research questions: (
          <xref ref-type="bibr" rid="ref2 ref3">1</xref>
          )
Wang) © 2020 Copyright for this paper by its authors. Use permitted under Creative what semantics are embedded in the latent space of
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g CCoEmUmoRns WLiceonrsekAsthtriobuptioPnr4o.0cIneteerdnaitniognasl ((CCC EBYU4R.0)-.WS.org) VAEs; (2) how can we transfer the machine-learned
lalearned latent space into a human-understandable specific datasets. Some works in this group also tried
tent space to a human-understandable semantic space
for better interpretation; (3) how to use the latent spaces
a
of VAEs to track and compare VAE models. To the end,
we design and develop a three-module visual analytics
prototype, named LatentVis, for this matter. The Data
module presents an interface to interact with the
experimental dataset and select images with desired
features. The Semantics module identifies and compares
semantic directions of diferent image features,
bridging the machine-encoded latents with
human-understandable semantics. The Comparison module
compares the latent space of (
          <xref ref-type="bibr" rid="ref2 ref3">1</xref>
          ) the same model in two
diferent training stages, (2) the same model from two
separate trainings with randomly initialized network
parameters, and (3) two diferent VAE models. To sum
up, the contributions of this paper are three-fold:
• We present LatentVis, a visual analytics system
that helps to understand and diagnose VAEs by
interactively revealing the encoded semantics of
the latent space.
• Enlightened by the linear arithmetic of features,
we use a linear model to transfer a
machinesemantic space.
• Based on our analysis of the latent space, we
propose a model-agnostic approach to compare
VAEs, across training stages, separate trainings,
or diferent VAE models.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related</title>
    </sec>
    <sec id="sec-3">
      <title>Works</title>
      <sec id="sec-3-1">
        <title>Interpreting Latent Spaces. DNNs can be consid</title>
        <p>ered as functions that transfer data instances from the
input data space to a latent space (
∶</p>
        <sec id="sec-3-1-1">
          <title>A well-trained DNN will preserve the essential infor</title>
          <p>mation of the input data during this transformation.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>However, due to the complexity of DNNs, it is a nontrivial problem to reveal or verify what information is preserved and how it is preserved in the latent space.</title>
          <p>→ 
 ).</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Targeted on this problem, many research eforts have</title>
          <p>been devoted to interpret the latent spaces of DNNs.</p>
          <p>For example, [11] showed how the statistics of data
can be examined in the latent space representation.
[12] interpreted the association between visual
concepts and symbolic annotations captured by  VAE
through parallel coordinates plots. Latent embedding learn- decodes the image from the latent variable (Fig. 1a).
ing methods (GLO, LEO, GLANN [13, 14, 15]) were also
developed for the interpretation and understanding of latent variable  (i.e.,  = 
Specifically, the encoder maps an input image  to a
( ) ∼  ( | )</p>
          <p>), and
x : i n p u t
i m a g e</p>
          <p>Encoder</p>
          <p>Decoder
introduced in DFC-VAE [27], for feature reconstruction.
latent spaces. Moreover, in natural language
processing, the learned embedding of words/ paragraphs also
form a latent space. [9] and [16] interpreted this space
and found that the correlations between words/
paragraphs were well-captured in the space.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Visual Analytics for Deep Learning (VIS4DL).</title>
        <sec id="sec-3-2-1">
          <title>There are two groups of VIS4DL works in general. One</title>
          <p>focuses on a specific model to reveal the internal
working mechanism of the model, such as CNNVis [17],</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>GANViz [18], and ActiVis [19]. These works usually</title>
          <p>design a visualization system to expose the hidden
features and feature connections, for specific DNNs on
to generalize to diferent models on various datasets.
For example, [20] proposed Network Dissection to
quantify the interpretability of latent representations
captured by CNNs (AlexNet, VGG, GoogLeNet, ResNet)
via the alignment between hidden units and
semantic concepts. The other group focuses on using only
the model inputs and outputs to interpret/diagnose the
model, without touching the intermediate model
details (i.e., model-agnostic). For example, [21] proposed
a model-agnostic approach to reveal the dominant
regions of input images in controlling the prediction
results of a classifier. More examples in this group also
include [22, 23, 24]. Our work needs no examination
on the internal working mechanism of VAEs (as
semantics are encoded in the space formed by
activations, rather than individual neurons [25, 26]), and thus
belongs to the second group. Integrating a linear space
transformer into our visual analytics process, we try to
present a human-understandable latent space to
diagnose DGMs.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Variational Auto-Encoder (VAE)</title>
        <p>Variational Auto-Encoder (VAE) [28] aims to reconstruct the input image from a latent representation of the image encoded/learned by itself. It is comprised of two neural networks: an encoder network encodes the image into a latent variable, and a decoder network decodes the image from the latent variable (Fig. 1a). Specifically, the encoder maps an input image x to a latent variable z (i.e., z = E(x) ∼ qφ(z|x)), and the decoder maps a latent variable z to an output image x′ (i.e., x′ = D(z) ∼ pθ(x|z)). The encoder and decoder, defined by trainable parameters φ and θ respectively, are optimized via minimizing the following loss function: L(φ, θ) = −E_qφ(z|x)[log pθ(x|z)] + KL(qφ(z|x) ‖ p(z)). By z = E(x), the latent variable of a specific image from the VAE is readily accessible for further semantic explorations. One common issue of VAEs is that the generated images tend to be blurry, due to the aggregated pixel-wise image distance used in the loss function, i.e., the l2 distance between x and x′.</p>
        <p>Figure 1: (a) the structure of a VAE, where the encoder maps an input image x to a latent variable and the decoder reconstructs the image from it; (b) the feature perceptual loss introduced in DFC-VAE [27], for feature reconstruction.</p>
        <p>Deep Feature Consistent VAE (DFC-VAE) [27] is a variant of the regular VAE. It improves the quality of the reconstructed images by replacing the pixel-wise reconstruction loss with a feature perceptual loss [29]. In DFC-VAE, multiple levels of features are extracted from both the input and reconstructed images by passing them into a pre-trained CNN. Each layer of the CNN extracts certain levels of image features. The features from the input and reconstructed images are then used to measure their perceptual distance (Fig. 1b, the l2 loss between the corresponding feature maps). In this work, we adopted this perceptual loss to improve the reconstruction quality. The VGG19 [30] pre-trained on the ImageNet data is used as our pre-trained CNN.</p>
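        <p>For concreteness, the two objectives above can be written in a few lines of code. The following is a minimal PyTorch-style sketch, not the implementation used in this work: it assumes a Gaussian posterior parameterized by a mean and a log-variance, and the VGG19 layer indices used for the perceptual term are illustrative assumptions.</p>
        <preformat>
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

def vae_loss(x, x_recon, mu, logvar):
    # Regular VAE: pixel-wise l2 reconstruction + KL(q(z|x) || N(0, I)).
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Frozen, pre-trained VGG19 for the feature perceptual loss (DFC-VAE).
_vgg = vgg19(pretrained=True).features.eval()
_FEATURE_LAYERS = {3, 8, 17}  # assumed layer indices (relu1_2, relu2_2, relu3_4)

def perceptual_loss(x, x_recon):
    # l2 distance between corresponding VGG19 feature maps (Fig. 1b).
    loss, h_x, h_r = 0.0, x, x_recon
    for i, layer in enumerate(_vgg):
        h_x, h_r = layer(h_x), layer(h_r)
        if i in _FEATURE_LAYERS:
            loss = loss + F.mse_loss(h_r, h_x, reduction="sum")
    return loss

def dfc_vae_loss(x, x_recon, mu, logvar):
    # DFC-VAE: replace the pixel-wise term with the feature perceptual loss.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return perceptual_loss(x, x_recon) + kl
        </preformat>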
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <sec id="sec-4-1">
        <title>3.1. Fundamental Concepts</title>
        <p>Image Dataset. We focus on a face image dataset, i.e., CelebA [31], to explore the latent space of VAE models in this work. This dataset is constituted of 202599 human face images. Each image has 40 binary attributes (e.g., the image is a male face or not, a face with glasses or not). We crop a 148×148 region of the original 178×218 images and scale it down to 64×64 before processing them with our VAEs. Images with the same feature (i.e., having the same value on a binary attribute) belong to the same feature category.</p>
        <p>We focused on this face image dataset for two reasons. First, this dataset presents rich attributes for the same object (the human face) in the same scale. Compared to other datasets with numerous objects in different scales (e.g., ImageNet), a VAE can more accurately capture the underlying data distribution. Second, the well-labeled attributes in this dataset can help to interpret the semantics encoded in the latent space of VAEs, through which, we derived the semantic space using our linear model (Fig. 2).</p>
        <p>Figure 2: Three spaces: (a) the image space is where the CelebA images reside, each pixel is an independent dimension; (b) the latent space is the VAE learned representation, the VAE encoder and decoder enable the transformation between the image space and latent space; (c) the semantic space is derived from the latent space under the supervision of the 40 binary features of the images (using our linear model).</p>
        <p>Image Semantics. The semantics of face images is the existence and scale of the 40 features in the CelebA dataset. A well-trained VAE can transfer the semantics from the image space to the VAE's latent space (i.e., from Fig. 2a to 2b). However, the transferred semantics in the latent space is not human-understandable. Hence, our goal is to interpret them via a semantic space (Fig. 2c), in which, we can explore if the image semantics have been accurately encoded (Fig. 2i) and how they are encoded (Fig. 2ii).</p>
        <p>Semantic Direction. In Fig. 2b, we use a point to denote the VAE encoded latent variable for the corresponding image in the image space. All latent variables for images of the same category (e.g., "glasses", "mustache") form a cluster in the space, denoted as a blue bubble in Fig. 2b. We identify the direction from one cluster without a particular feature to the cluster with that feature as the semantic direction for the feature. For example, in Fig. 2b, the directions on the red and green lines reflect the semantic directions for "mustache" and "glasses". Moving the latent variable of one image along a semantic direction will change the corresponding feature of the reconstructed image the most. Along this semantic direction, a vector with a certain length is referred to as a Semantic Vector. For CelebA, there are 40 features, and we have 40 unique semantic directions.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Our Contributions</title>
        <p>The Linear Model. Enlightened by the linear arithmetic of features (e.g., woman without glasses + (man with glasses − man without glasses) ≈ woman with glasses) [4], we trained a linear model to quantify the semantic directions, as well as to transform the latent space to a human-understandable semantic space (i.e., from Fig. 2b to 2c). The linear model, y = W ⋅ z + b, is trained using the latent variable of all CelebA images encoded by VAEs (denoted as z) and the 40 binary attributes of the images (denoted as y). z is a vector with many dimensions (the black arrow in Fig. 2b), and y is a vector with 40 dimensions (the green arrow in Fig. 2c). Each row of the weight matrix W is the derivatives of a certain feature to all the latent dimensions, which represents a semantic direction in the latent space. Each column of W represents the contributions of a latent dimension to all the semantics.</p>
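        <p>As a concrete sketch of this step (not the implementation of LatentVis itself), the linear map can be fit with a ridge-regularized least-squares regression on the encoded latents and the 40 binary attributes; the ridge penalty and all variable names below are our assumptions.</p>
        <preformat>
import numpy as np
from sklearn.linear_model import Ridge

def fit_linear_model(Z, Y):
    """Z: (n_images, latent_dim) VAE latents; Y: (n_images, 40) binary attributes."""
    model = Ridge(alpha=1.0).fit(Z, Y)      # y = W z + b, fit jointly for all 40 features
    W, b = model.coef_, model.intercept_    # W: (40, latent_dim); row i = semantic direction i
    return W, b

def to_semantic_space(z, W, b):
    # Map one latent variable into the 40D human-understandable semantic space.
    return W @ z + b
        </preformat>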
        <p>Analytic Tasks and LatentVis. We focus on three analytic tasks to better understand and diagnose VAEs through the lens of their latent space: (T1) navigating the dataset and feature selection; (T2) visualizing and comparing the image semantics/semantic directions in a latent space; and (T3) facilitating model comparisons and diagnosis through comparing their latent spaces. Following these tasks, we propose a visual analytics system, LatentVis (Fig. 3, bottom), which contains three analytical modules corresponding to a hierarchical information flow (Fig. 3, top).</p>
        <p>The Data Module (Fig. 3a) gives an overview of the studied dataset, allowing us to flexibly explore data instances. It is also an interface to select any interested feature category and data instance for further analysis in other modules (please check details in the Appendix).</p>
        <p>The Semantics Module (Fig. 3b) demonstrates what semantics has been captured by the VAE, and how the data features are correlated in the latent space by connecting the image space with the VAE latent space. Its three views follow a hierarchical information flow to detect, cluster, and compare semantics (see details in Sec. 4.1).</p>
        <p>The Comparison Module (Fig. 3c) compares the semantic directions of the latent space from different VAEs to diagnose these models. The diagnosis for the compared VAEs is performed by examining the learning work division between the encoder and decoder (see details in Sec. 4.2).</p>
        <p>Figure 3: the information flow of the three modules (top): view data features, detect semantics, compare different semantics in a VAE, and compare the same semantics in two VAEs; and the LatentVis interface with its Data, Semantics, and Comparison modules (bottom).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments and Results</title>
      <p>Neural Network Structure. We worked with one regular VAE and one DFC-VAE, both with an encoder and a decoder of four convolutional layers. The four-layer encoder compresses the 64×64×3 CelebA images to 32×32×32, 16×16×64, 8×8×128, and 4×4×256. The compression result is then flattened and mapped to a 100D Gaussian distribution, represented by a 100D mean and a 100D standard deviation, through two fully-connected layers. The decoder has a symmetric structure with the encoder, but with a reversed order of the layers to up-sample the 100D latent variables (sampled from the 100D Gaussian). The difference between the VAE and DFC-VAE is whether a pre-trained VGG19 model was used to compute the perceptual loss.</p>
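      <p>As a reference point, the encoder described above can be sketched as follows in PyTorch; the kernel sizes, strides, and activation functions are assumptions, and only the layer output shapes follow the description (the decoder mirrors this structure with transposed convolutions).</p>
      <preformat>
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        # 64x64x3 -> 32x32x32 -> 16x16x64 -> 8x8x128 -> 4x4x256
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Two fully-connected layers parameterize the 100D Gaussian.
        self.fc_mu = nn.Linear(256 * 4 * 4, latent_dim)
        self.fc_logvar = nn.Linear(256 * 4 * 4, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(start_dim=1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return z, mu, logvar
      </preformat>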
      <sec id="sec-5-2">
        <title>We trained the VAE once</title>
        <p>and the DFC-VAE twice with the same batch size and
the Adam optimizer in all trainings. However, notice
the same or be diferent on-purpose to compare the
trained models and investigate the efect of the
hyperparameters, e.g., comparing models trained with
dif</p>
        <p>All three trainings used the 202599 CelebA images
and the batch size is 64. Every 800 batches were
considered as a training stage to collect model statistics,
like loss-values, and all the three trainings were run
lowing these tasks, we propose a visual analytics sys- for 197 stages (i.e., 157600 batches).
semantics has been captured by the VAE, and how the
data features are correlated in the latent space by
connecting the image space with the VAE latent space. Its
three views follow an hierarchical information flow to
detect, cluster, and compare semantics (see details in
Sec. 4.1).
mantic directions of latent space from diferent VAEs
to diagnose these models. The diagnosis for the
compared VAEs is performed by examining the learning
4.1. Detecting and Comparing Semantic</p>
        <p>Directions
We propose Algorithm 1 to detect and compare
semantic directions in the Semantics Module (Fig. 4). First,
we give all CelebA images to the well-trained VAE model
latent variables and their corresponding 40 binary
atdirections encoded in the latent space. To verify the
effectiveness of the semantic directions, we also
generate many random directions in the latent space. Given
any selected image, we visualize the modified version
of the image resulted from changing its latent variable
along the 40 semantic directions and a random
directhe feature “bangs" and “glasses" in Fig. 4a (i.e., change
the length of a semantic vector,  ∈[−10, 10]), we
observed how those two features were added (Fig. 4-a2)
pendix).
sis in other modules (please check details in the Ap- to obtain their latent variables. Then, we use these
The Semantics Module (Fig. 3b) demonstrates what tributes to train a linear model to capture the semantic
The Comparison Module (Fig. 3c) compares the se- tion. For example, by dragging the control point for
work division between the encoder and decoder (see to the selected image (Fig. 4-a1). However, no
obvious changes towards a particular feature were found
in the image when dragging the control point on the
axis representing random directions (Fig. 4-a3).</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.1. Detecting and Comparing Semantic Directions</title>
        <p>We propose Algorithm 1 to detect and compare semantic directions in the Semantics Module (Fig. 4). First, we give all CelebA images to the well-trained VAE model to obtain their latent variables. Then, we use these latent variables and their corresponding 40 binary attributes to train a linear model to capture the semantic directions encoded in the latent space. To verify the effectiveness of the semantic directions, we also generate many random directions in the latent space. Given any selected image, we visualize the modified version of the image resulting from changing its latent variable along the 40 semantic directions and a random direction. For example, by dragging the control point for the features "bangs" and "glasses" in Fig. 4a (i.e., changing the length of a semantic vector, α ∈ [−10, 10]), we observed how those two features were added (Fig. 4-a2) to the selected image (Fig. 4-a1). However, no obvious changes towards a particular feature were found in the image when dragging the control point on the axis representing random directions (Fig. 4-a3).</p>
        <p>Algorithm 1: Detecting and Comparing Semantic Directions.
Require: images X = {x_i} with feature labels Y = {y_i}, a selected image x̂, a selected feature f, a compared feature f′, the length of a semantic vector α, and the VAE model with an encoder E and a decoder D.
1: for x in X do
2:   z ← E(x)
3: end for
4: train a linear model y = W ⋅ z + b
5: semantic directions SD ← W
6: SD_f ← W[f, :]   // a row of W corresponding to feature f
7: randomly initialize SD_r with ‖SD_r‖ = ‖SD_f‖
8: ẑ ← E(x̂)
9: ẑ_f ← ẑ + α SD_f
10: ẑ_r ← ẑ + α SD_r
11: x̂′ ← D(ẑ)
12: x̂_f ← D(ẑ_f)
13: x̂_r ← D(ẑ_r)
14: (x̂′, x̂_f, x̂_r) to verify the semantics of feature f   // Fig. 4a
15: SD_f′ ← W[f′, :]   // another row of W corresponding to feature f′
16: x̂_f′ ← D(ẑ + α SD_f′)
17: (SD_f, SD_f′), (x̂_f, x̂_f′) to compare semantics   // Fig. 4c</p>
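        <p>A compact Python rendering of the core steps (lines 6 to 14) might look like the sketch below; the encoder, decoder, and weight matrix W are assumed to come from the trained VAE and the linear model above, and the random baseline direction is rescaled to the same norm as the semantic direction, as in line 7.</p>
        <preformat>
import numpy as np

def edit_along_direction(encoder, decoder, x_hat, W, f, alpha):
    """Move the latent variable of image x_hat along the semantic direction of feature f."""
    sd_f = W[f]                                          # semantic direction: a row of W
    sd_r = np.random.randn(*sd_f.shape)
    sd_r *= np.linalg.norm(sd_f) / np.linalg.norm(sd_r)  # random direction, same norm
    z_hat = encoder(x_hat)
    return {
        "original": decoder(z_hat),                      # reconstruction of x_hat
        "semantic": decoder(z_hat + alpha * sd_f),       # feature f added/removed (Fig. 4-a2)
        "random":   decoder(z_hat + alpha * sd_r),       # baseline, no clear change (Fig. 4-a3)
    }
        </preformat>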
        <p>Figure 4: (a) detect, (b) cluster, and (c) compare semantic directions at a training stage. The cell color from red over white to blue in the matrix (b) indicates the cosine similarity of two semantic directions from -1 over 0 to 1; (b1-b5) represent five clusters: "weak-woman-ish", "man-ish", "woman-ish", "old-ish", and "weak-man-ish".</p>
        <p>We cluster the 40 feature categories (40 semantics) into five groups using the k-means algorithm based on the cosine similarity between their corresponding semantic directions. Interestingly, the feature categories inside each group present similar semantics. For example, the features "make up", "no beard", and "attractive" are in the same group, which are all "woman-ish" features. With the similar logic, the other four semantic groups are named "man-ish" (e.g., "mustache", "five-o'clock shadow"), "weak-woman-ish" (e.g., "smile", "bangs"), "weak-man-ish" (e.g., "bushy eyebrows", "hat"), and "old-ish" (e.g., "chubby", "bald"). The five clusters can be easily identified from the symmetric pair-wise similarity matrix (Fig. 4b), and we can select any two semantics (i.e., one row and one column) for comparison.</p>
        <p>For example, Fig. 4c shows the negative correlation between the "rosy cheeks" and "male" features, which are from the "woman-ish" (Fig. 4-b3) and "man-ish" (Fig. 4-b2) groups respectively. The negative correlation is indicated by an obtuse angle between the two colored semantic vectors (i.e., SD_f and SD_f′ in Algorithm 1). To visualize these two high-dimensional vectors SD_f and SD_f′ in a 2D plot intuitively, SD_f (corresponding to the selected feature) is always along the horizontal direction, and SD_f′ (corresponding to the compared feature) presents an angle with SD_f, calculated as the angle between them in the original high-dimensional latent space. The length of each colored segment reflects the norm of the corresponding semantic vector.</p>
        <p>Dragging the green/red point in Fig. 4c to the opposite direction (i.e., changing α to a negative value), we can also verify that the opposite direction indeed encodes the opposite feature, e.g., "pale skin" is the opposite of "dark skin". Interestingly, we found several such pairs, showing a similar way of how humans understand these semantics, such as "smile" v.s. "scary face", "bangs" v.s. "high hairlines".</p>
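        <p>The five groups in Fig. 4b can be reproduced along these lines; the sketch below runs k-means on the rows of the pairwise cosine-similarity matrix, which is one plausible reading of "k-means based on the cosine similarity", and the random seed is an assumption.</p>
        <preformat>
import numpy as np
from sklearn.cluster import KMeans

def cluster_semantic_directions(W, n_clusters=5, seed=0):
    """W: (40, latent_dim) semantic directions (rows of the linear model's weight matrix)."""
    U = W / np.linalg.norm(W, axis=1, keepdims=True)
    S = U @ U.T                     # symmetric cosine-similarity matrix (Fig. 4b), in [-1, 1]
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(S)
    return labels, S
        </preformat>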
        <p>From the above explorations and visual evidence, we feel confident to believe the following hypothesis on the semantic structure of the latent space: (1) the latent space tends to encode semantics along unique directions (i.e., semantic directions); (2) smaller angles between semantic directions denote more similar semantics, and opposite semantic directions encode opposite semantics.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.2. Comparing Semantics across VAEs</title>
        <p>The Comparison Module compares two VAE models and facilitates model diagnosis using their latent spaces. The comparison is across different training stages, trainings (with randomly initialized neural network parameters), and VAE models. For each pair of compared models, we run Algorithm 1 to obtain their individual semantic directions and generate the corresponding reconstructed images. Given any selected image, Algorithm 1 also outputs its latent variable, from which, we can regenerate the image with a VAE's decoder. We can use the same VAE's decoder and a different VAE's decoder to reconstruct two images and compare them. From the comparison, we can track the work division between an encoder and a decoder, and also diagnose which of these two networks is more responsible for certain model functions.</p>
        <p>Since we have two encoders and two decoders, there are four possible combinations between them. Each rectangle in Fig. 5a represents one combination. We use purple and orange color to denote model 1 and model 2. The left and right borders' color of a rectangle reflects which model's encoder is in use, whereas the top and bottom borders' color reflects which model's decoder is in use. For example, the top right rectangle in Fig. 5a means the reconstructed image uses encoder 1 (the left and right borders of the rectangle are in purple) and decoder 2 (the top and bottom borders of the rectangle are in orange), i.e., the pair of E1-D2.</p>
        <p>When dragging the horizontal control point (i.e., changing α) in the top of the Comparison Module, these four images will be modified along the semantic directions (SD) for the focused semantics learned from the two models. Fig. 5b shows the mappings, i.e., which image is changing along which semantic direction. For example, Fig. 5i and Fig. 5ii show the reconstructed images when changing the top right image of Fig. 5a along the semantic directions learned from model 1 and model 2, respectively.</p>
        <p>Figure 5: The mapping between images in the Comparison Module. (a) Reconstructed images from different combinations of encoders and decoders of two compared VAEs; (b) reconstructed images when changing along the semantic direction (SD) learned by different VAEs.</p>
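        <p>The four encoder-decoder combinations and the two learned semantic directions behind Fig. 5 can be enumerated as in the following sketch; the model objects and per-model weight matrices are assumed to be available, and this is an illustration of the comparison protocol rather than the system's code.</p>
        <preformat>
def compare_two_vaes(enc1, dec1, enc2, dec2, W1, W2, x, f, alpha):
    """Reconstruct image x with every encoder/decoder pair of two compared VAEs,
    moving each latent along the semantic direction of feature f learned by each model."""
    results = {}
    for e_name, enc in (("E1", enc1), ("E2", enc2)):
        z = enc(x)
        for d_name, dec in (("D1", dec1), ("D2", dec2)):
            for sd_name, W in (("SD1", W1), ("SD2", W2)):
                results[(e_name, d_name, sd_name)] = dec(z + alpha * W[f])
    return results  # 4 E-D pairs x 2 semantic directions = 8 edited reconstructions
        </preformat>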
        <p>Comparing Semantic Directions Across-Time. We take the DFC-VAE in an early training stage and a well-trained stage to perform the comparison. Fig. 6 investigates the DFC-VAE model with parameters from the 3rd (orange) and the 197th (purple) training stage. When moving the latent variable of the image in Fig. 6a along the semantic direction learned in those two stages, all the six reconstructed images generated the "glasses" feature (as shown in Fig. 6b), regardless of the swapping of the encoders and decoders from the two training stages. This observation indicates that different stages of the training encode the semantic direction of the same feature in a consistent way. It also implies another insight on the semantic structure of the latent space, i.e., the semantic directions may have a tolerable range, within which, the learned semantics is evolving over the training process.</p>
        <p>Comparing Semantic Directions Across-Training. Focusing on a well-trained stage, we compared the DFC-VAE from two separate trainings (where the model parameters are randomly initialized in each training). Our goal is to explore whether the semantics is encoded in the same way over the two trainings. We used the same training hyperparameters and trained the same model twice with enough epochs. Fig. 6c investigates the semantic directions of the "glasses" feature from the two trainings. We found the "glasses" feature can only be generated when using the matched encoder-decoder pairs. For example, Fig. 6c reconstructs an image using the decoder from the second training, but the latent variable is moved along the semantic direction learned from the first training. As a result, the "glasses" feature was not generated. Conversely, the "glasses" feature could be generated when moving along the semantic direction learned from the second training, as shown in Fig. 6d. The results shown in Fig. 6e (image with "glasses") and 6f (image without "glasses") further verify this. The observation indicates that different trainings of the same VAE may encode the semantic direction of the same feature differently.</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.3. Diagnosing VAEs via Semantics Comparison</title>
        <p>Learning Process Comparison between Encoders and Decoders. To interpret the learning work division between the encoder and decoder, we explored the DFC-VAE model from a well-trained stage and an early training stage. We swapped the pairing between the two encoders and two decoders to investigate their respective responsibilities, i.e., the well-trained encoder is paired with the early-stage decoder, and the early-stage encoder is paired with the well-trained decoder. By comparing the reconstructed images from them, we discovered that a well-trained encoder is responsible for controlling semantics, while a well-trained decoder is responsible for generating clear images. For example, the image in Fig. 7-a2 reconstructs the image in Fig. 7-a0 using the early-stage encoder (E2) and the well-trained decoder (D1). Although the reconstruction did not catch the features of the original image (e.g., gender, hair style and color), the generated image is clear. On the contrary, the image in Fig. 7-a3 is reconstructed using the well-trained encoder (E1) and the early-stage decoder (D2). The image captures most of the features in the original image, but it is blurry. Similar observations can also be found in Fig. 7b, 7c, and 7d.</p>
        <p>We believe a well-trained encoder can better control semantics because it better captures the correlation between different semantics. In other words, better semantic correlations make the semantic directions more accurate in the latent space generated from a well-trained encoder. For example, Fig. 8 compares the correlations between the "glasses" feature and other features in the 3rd-6th, 8th, 13th, and 197th training stages. It is obvious that the negative correlations of different semantics (i.e., the region in the black dashed lines) were evolving gradually over the training. Comparing the trend of negative to positive correlations between semantics (i.e., red to blue cells), we can see the negative correlations are acquired in later stages.</p>
        <p>Comparing VAE and DFC-VAE. Although the VAE and DFC-VAE shared a similar network structure, the feature perceptual loss used in DFC-VAE dramatically improved the semantics learning. The image features generated from DFC-VAE tend to be less blurry and more recognizable than those from VAE. For example, the left and right image in the five pairs of images in Fig. 9a compare five image features generated from DFC-VAE and VAE respectively from the 3rd training stage. Fig. 9b shows the same comparison but using the parameters of DFC-VAE and VAE from the 197th stage. Comparing Fig. 9a and 9b vertically, i.e., across time, we can see that DFC-VAE enhanced the features in the reconstructed images more than VAE.</p>
        <p>Comparing the semantic correlations learned from DFC-VAE and VAE in those two training stages, we found that both models captured the semantic correlations at a similar pace. For example, the top and bottom rows in Fig. 10 show the semantic correlations from the DFC-VAE and VAE respectively, in stage 3 and 197. As highlighted by the rectangles, the evolutions of the correlations between the five semantics and other semantics are similar in both models, from the 3rd stage (left) to the 197th stage (right).</p>
        <p>Combining our observations from Fig. 9 and Fig. 10, we get a better understanding on how the perceptual loss (from the pre-trained VGG19) was affecting the model, i.e., compared to DFC-VAE, VAE captured the correlations between different semantics but it still could not generate clear features. We suspect that the perceptual loss contributed more to improving the decoder in better reconstructing image features.</p>
        <p>Figure 9: Comparing the reconstructed images along semantic directions (smile, pale skin, mustache, hat, glasses) from DFC-VAE and VAE in (a) the 3rd and (b) the 197th training stage.</p>
        <p>Figure 10: Comparing the semantic correlations learned by DFC-VAE (top row) and VAE (bottom row) between the five interested semantic directions and other semantic directions in the (a) 3rd and (b) 197th stage.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Limitations and Future Work</title>
      <p>LatentVis can be easily adapted to analyze other VAE models, as it is a model-agnostic approach and does not use any model-specific information (e.g., network architectures). The required data are the input images, the reconstructed images, and the learned latent variables at different training stages. The labels for different feature categories are also demanded to train our supervised linear model. One interesting question here is whether finer-granularity labels can further improve the accuracy of the derived semantic directions. For example, the current "glasses" feature includes both "sunglasses" and "normal glasses". Differentiating them as two features may help in more accurately extracting the semantic directions. Additionally, we can also verify the existence of the class hierarchy of features in the latent space. These are interesting research directions for us to explore in the future. However, similar to the current limitation of our work, these future works also heavily depend on the availability of the labeled datasets.</p>
      <p>Moreover, it is also possible to extend LatentVis to VAEs trained on other data types, e.g., texts or audios. Compared to images, those types of data may not be able to be visually interpreted. However, through different visual encodings used in existing works, we believe they can be intuitively presented as well. We plan to investigate more from the literature and spend more efforts in this direction in the future.</p>
      <p>It is worth mentioning that our current explorations in this work are heuristic and based only on one dataset, through which, we hope to shed some light on how the latent space of VAEs captured the semantics of images. More thorough experimental studies on more datasets would be needed to further validate our findings, which is another planned future work for us.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we propose LatentVis, a visual analytics system to interpret and compare the semantics encoded in the latent space of image-based VAEs. The system trains a supervised linear model to bridge the machine-learned latent space with the human-understandable semantic space. From this bridging, we found that data semantics is usually expressed along a fixed direction in the latent space (i.e., semantic direction), and human-interpreted similar/different semantics tend to have smaller/larger angles between their semantic directions. Also, LatentVis can be used to examine and compare VAEs from three different perspectives: (1) different training stages, (2) separate trainings with randomly initialized neural network parameters, and (3) different VAEs. Several interesting points on VAEs are discovered and summarized as follows:</p>
      <p>• Different stages of one training encode the semantic direction of the same feature in a consistent way.</p>
      <p>• Different trainings of the same VAE model may result in the VAE encoding the semantic direction for the same feature in a different way.</p>
      <p>• For a well-trained VAE, its encoder tends to be responsible for controlling semantics, while its decoder tends to be responsible for generating clear images.</p>
      <p>• For the specific dataset we worked on, the perceptual loss of DFC-VAE contributes more to the training of the decoder in better reconstructing image features. Without using the perceptual loss, VAE is still able to accurately capture the semantic correlations.</p>
      <p>These explorations and comparisons demonstrate how the latent spaces can be used to interpret and compare the corresponding VAEs. With the promising results demonstrated in the paper, we are confident in extending LatentVis to other latent variable models or other data types in the future.</p>
      <sec id="sec-6-1">
        <title>The Data Module contains two linked visualization views.</title>
        <p>The first view presents a statistical summary of all
training images. Each bubble in this view represents one
feature category (the color of the bubble corresponds
to different clusters in Fig. 4b), and the distances among
bubbles reflect the Euclidean distances between those
feature categories in the latent space. These distances
are calculated via the Multi-Dimensional Scaling (MDS)
algorithm, whose input is the average latent variable
of images belonging to the same feature category.
Clicking on any bubble in this view will trigger the second
view to display images from the corresponding
category. The second view displays numerous randomly
selected images from the selected image category, so
that users can check the features of those images and
select interested ones for further exploration. To save
the screen space, images are scaled down to 32×32.
Clicking on any image in this view will trigger further
updates in other views.</p>
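        <p>The bubble layout described above can be computed along these lines; a sketch assuming Z holds the encoded latents and Y the 40 binary attributes, with the MDS settings being our assumptions.</p>
        <preformat>
import numpy as np
from sklearn.manifold import MDS

def feature_category_layout(Z, Y, seed=0):
    """Average latent variable per feature category, then MDS projects the 40 centroids to 2D."""
    centroids = np.stack([Z[Y[:, j] == 1].mean(axis=0) for j in range(Y.shape[1])])
    return MDS(n_components=2, random_state=seed).fit_transform(centroids)  # (40, 2) bubble positions
        </preformat>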
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>d e c o d e r 1 d e c o d e r 2 e n c o d e r 1 E 1 E 1</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>D 1 D 2 e n c o d e r 2 E 2</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <issue>D 1</issue>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , Ima-
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>processing systems</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          . [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>computer vision</article-title>
          and pattern recognition,
          <year>2014</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          pp.
          <fpage>2147</fpage>
          -
          <lpage>2154</lpage>
          . [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-net: Con-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          tervention, Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          . [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Metz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          , Unsupervised
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>arXiv:1409</source>
          .
          <fpage>1556</fpage>
          . [31]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          , Deep learn-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>(ICCV)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>