=Paper=
{{Paper
|id=Vol-2699/paper04
|storemode=property
|title=LatentVis: Investigating and Comparing Variational Auto-Encoders via Their Latent Space
|pdfUrl=https://ceur-ws.org/Vol-2699/paper04.pdf
|volume=Vol-2699
|authors=Xiao Liu,Junpeng Wang
|dblpUrl=https://dblp.org/rec/conf/cikm/LiuW20
}}
==LatentVis: Investigating and Comparing Variational Auto-Encoders via Their Latent Space==
LatentVis: Investigating and Comparing Variational Auto-Encoders via Their Latent Space

Xiao Liu (a), Junpeng Wang (b)
(a) Department of Computer Science and Engineering, The Ohio State University, 2015 Neil Avenue, Columbus, Ohio, 43210, USA
(b) Visa Research, 385 Sherman Avenue, Palo Alto, California, 94306, USA

Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, Galway, Ireland.
email: liu.5764@osu.edu (X. Liu); junpeng.wang.nk@gmail.com (J. Wang). The work was done while the author was at The Ohio State University (J. Wang).
orcid: 0000-0002-6303-0771 (X. Liu); 0000-0002-1130-9914 (J. Wang)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract

As the result of compression and the source of reconstruction, the latent space of Variational Auto-Encoders (VAEs) captures the essence of the training data and hence plays a fundamental role in data understanding and analysis. Focused on revealing what data features/semantics are encoded and how they are related in the latent space, this paper proposes a visual analytics system, LatentVis, to interactively study the latent space for better understanding and diagnosing image-based VAEs. Specifically, we train a supervised linear model to relate the machine-learned latents with human-understandable semantics. With this model, each important data feature is expressed along a unique direction in the latent space (i.e., a semantic direction). Comparing the semantic directions of different features allows us to compare the feature similarity encoded in the latent space, and thus to better understand the encoding process of the corresponding VAE. Moreover, LatentVis empowers us to examine and compare latent spaces across training stages or across different VAE models, which can provide useful insight into model diagnosis.

Keywords: deep generative model, variational auto-encoder, latent space, semantics, visual analytics

1. Introduction

With their powerful feature-extraction capability, Deep Neural Networks (DNNs) have made a series of breakthroughs across a wide range of applications, e.g., image classification [1], object recognition [2], and image segmentation [3]. More interestingly, DNNs also demonstrate excellent performance in feature generation, which has attracted increasing research attention [4]. For example, Generative Adversarial Nets (GANs) and Variational Auto-Encoders (VAEs) are able to generate data (including images [5] and sounds [6]) that are almost indistinguishable from real data.

The outstanding performance of DNNs comes from their complicated internal model architectures and long training processes, which, however, have gone far beyond humans' interpretability. As a result, it is very difficult to explain how Deep Generative Models (DGMs) understand the extracted features and further use them to generate new features. The latent spaces of these models, located at the pivot point between extraction and generation, compress all the extracted features and control what is to be generated. Investigating them could help to understand and diagnose DGMs, and thus shed light on their mysterious power. However, those latent spaces are usually high-dimensional, and the semantics of individual latent dimensions are not human-understandable.

Recently, we have witnessed many works on interpreting the latent space of DNNs. Some considered a latent space as a high-dimensional manifold and focused on the geometric interpretation of the manifold. For example, [7] showed that geodesic curves on the latent space manifold are approximately straight in their experiments. [8] revealed that a stochastic Riemannian metric in the latent space could produce smoother interpolations than the conventional Euclidean distance. With static visualizations of the geometric path in the latent space, these studies have helped to understand the abstract manifold holistically.

Others explored the semantics of different latent spaces by focusing on specific tasks. For example, [9, 10] analyzed word embeddings and verified the linear arithmetic of semantics in the embedding/latent space, e.g., queen − woman + man ≈ king. Similar linear arithmetic has also been found in the latent space of image-based DGMs [4]. These studies expose some structures of the latent spaces, but are still insufficient to comprehensively reveal their essential semantics.
The la- analyzed the word embedding and verified the linear tent spaces of these models, located at the pivot point arithmetic of the semantics in the embedding/latent between extraction and generation, compress all the space, e.g., ππ’πππβπ€ππππ+πππβππππ. Similar linear extracted features and control what to be generated. arithmetic has also been found in the latent space of image-based DGMs [4]. These studies expose some Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, structures of the latent spaces, but are still insufficient Galway, Ireland email: liu.5764@osu.edu (X. Liu); junpeng.wang.nk@gmail.com (J. to comprehensively reveal their essential semantics. Wang); The work was done while the author was at The Ohio State This paper targets to diagnose image-based VAEs University. (J. Wang) by interactively investigating their latent space, and orcid: 0000-0002-6303-0771 (X. Liu); 0000-0002-1130-9914 (J. hence answers three concrete research questions: (1) Wang) Β© 2020 Copyright for this paper by its authors. Use permitted under Creative what semantics are embedded in the latent space of Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) VAEs; (2) how can we transfer the machine-learned la- mean tent space to a human-understandable semantic space a for better interpretation; (3) how to use the latent spaces Encoder Decoder x: input of VAEs to track and compare VAE models. To the end, image std . z: latent x': output image we design and develop a three-module visual analytics variable prototype, named LatentVis, for this matter. The Data Feature Perceptual Loss: module presents an interface to interact with the ex- adding every difference perimental dataset and select images with desired fea- b A p re-trained C NN the same p re-trained C NN tures. The Semantics module identifies and compares Figure 1: (a) The architecture of VAE; (b) the perceptual loss, semantic directions of different image features, bridg- introduced in DFC-VAE [27], for feature reconstruction. ing the machine-encoded latents with human-under- standable semantics. The Comparison module com- pares the latent space of (1) the same model in two latent spaces. Moreover, in natural language process- different training stages, (2) the same model from two ing, the learned embedding of words/ paragraphs also separate trainings with randomly initialized network form a latent space. [9] and [16] interpreted this space parameters, and (3) two different VAE models. To sum and found that the correlations between words/ para- up, the contributions of this paper are three-fold: graphs were well-captured in the space. β’ We present LatentVis, a visual analytics system Visual Analytics for Deep Learning (VIS4DL). that helps to understand and diagnose VAEs by There are two groups of VIS4DL works in general. One interactively revealing the encoded semantics of focuses on a specific model to reveal the internal work- the latent space. ing mechanism of the model, such as CNNVis [17], GANViz [18], and ActiVis [19]. These works usually β’ Enlightened by the linear arithmetic of features, design a visualization system to expose the hidden fea- we use a linear model to transfer a machine- tures and feature connections, for specific DNNs on learned latent space into a human-understandable specific datasets. Some works in this group also tried semantic space. to generalize to different models on various datasets. 
2. Background and Related Works

Interpreting Latent Spaces. DNNs can be considered as functions that transfer data instances from the input data space to a latent space (f: ℝ^m → ℝ^n). A well-trained DNN will preserve the essential information of the input data during this transformation. However, due to the complexity of DNNs, it is a non-trivial problem to reveal or verify what information is preserved and how it is preserved in the latent space. Targeting this problem, many research efforts have been devoted to interpreting the latent spaces of DNNs. For example, [11] showed how the statistics of data can be examined in the latent space representation. [12] interpreted the association between visual concepts and symbolic annotations captured by β-VAE through parallel coordinates plots. Latent embedding learning methods (GLO, LEO, GLANN [13, 14, 15]) were also developed for the interpretation and understanding of latent spaces. Moreover, in natural language processing, the learned embeddings of words/paragraphs also form a latent space. [9] and [16] interpreted this space and found that the correlations between words/paragraphs were well captured in the space.

Visual Analytics for Deep Learning (VIS4DL). There are two groups of VIS4DL works in general. One focuses on a specific model to reveal its internal working mechanism, such as CNNVis [17], GANViz [18], and ActiVis [19]. These works usually design a visualization system to expose the hidden features and feature connections for specific DNNs on specific datasets. Some works in this group also tried to generalize to different models on various datasets. For example, [20] proposed Network Dissection to quantify the interpretability of latent representations captured by CNNs (AlexNet, VGG, GoogLeNet, ResNet) via the alignment between hidden units and semantic concepts. The other group focuses on using only the model inputs and outputs to interpret/diagnose the model, without touching the intermediate model details (i.e., model-agnostic). For example, [21] proposed a model-agnostic approach to reveal the dominant regions of input images in controlling the prediction results of a classifier. More examples in this group include [22, 23, 24]. Our work needs no examination of the internal working mechanism of VAEs (as semantics are encoded in the space formed by activations, rather than in individual neurons [25, 26]), and thus belongs to the second group. By integrating a linear space transformer into our visual analytics process, we try to present a human-understandable latent space to diagnose DGMs.

Figure 1: (a) The architecture of VAE; (b) the perceptual loss, introduced in DFC-VAE [27], for feature reconstruction.

Variational Auto-Encoder (VAE) [28] aims to reconstruct the input image from a latent representation of the image encoded/learned by itself. It is comprised of two neural networks: an encoder network encodes the image into a latent variable, and a decoder network decodes the image from the latent variable (Fig. 1a). Specifically, the encoder maps an input image x to a latent variable z (i.e., z = Encoder(x) ∼ q(z|x)), and the decoder maps a latent variable z to an output image x′ (i.e., x′ = Decoder(z) ∼ p(x|z)). The encoder and decoder, defined by trainable parameters φ and θ respectively, are optimized by minimizing the following loss function:

l(θ, φ) = −E_{q_φ(z|x)}[log p_θ(x|z)] + KL(q_φ(z|x) ‖ p(z)).

By z = Encoder(x), the latent variable of a specific image is readily accessible from the VAE for further semantic explorations. One common issue of VAEs is that the generated images tend to be blurry, due to the aggregated pixel-wise image distance used in the loss function, i.e., the L2 distance between x and x′.

Deep Feature Consistent VAE (DFC-VAE) [27] is a variant of the regular VAE. It improves the quality of the reconstructed images by replacing the pixel-wise reconstruction loss with a feature perceptual loss [29]. In DFC-VAE, multiple levels of features are extracted from both the input and reconstructed images by passing them into a pre-trained CNN. Each layer of the CNN extracts a certain level of image features. The features from the input and reconstructed images are then used to measure their perceptual distance (Fig. 1b, the L2 loss between the corresponding feature maps). In this work, we adopted this perceptual loss to improve the reconstruction quality. The VGG19 [30] pre-trained on ImageNet is used as our pre-trained CNN.
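To make the two objectives above concrete, the sketch below shows a minimal PyTorch implementation of the pixel-wise VAE loss and a DFC-style perceptual loss computed from VGG19 feature maps. The specific VGG19 layer indices and the absence of per-layer weights are our assumptions for illustration; they are not the exact configuration used in the paper.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def vae_loss(x, x_recon, mu, logvar):
    """Plain VAE loss: pixel-wise L2 reconstruction + KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Perceptual (feature) loss in the spirit of DFC-VAE: compare intermediate
# VGG19 feature maps of the input and the reconstruction instead of raw pixels.
vgg = models.vgg19(weights="DEFAULT").features.eval()  # pretrained=True on older torchvision
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(x, x_recon, layers=(3, 8, 17, 26)):  # layer indices are illustrative
    loss, feat_x, feat_r = 0.0, x, x_recon
    for i, layer in enumerate(vgg):
        feat_x, feat_r = layer(feat_x), layer(feat_r)
        if i in layers:
            loss = loss + F.mse_loss(feat_r, feat_x, reduction="sum")
    return loss
```

In DFC-VAE training, a perceptual term like `perceptual_loss` would replace the pixel-wise `recon` term of `vae_loss`, with the KL term kept unchanged.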
3. Methodology

3.1. Fundamental Concepts

Image Dataset. We focus on a face image dataset, CelebA [31], to explore the latent space of VAE models in this work. This dataset consists of 202,599 human face images. Each image has 40 binary attributes (e.g., whether the image is a male face, or a face with glasses) and a resolution of 178×218. We pre-processed those images by cropping them to 148×148 and scaling them down to 64×64 for our VAEs. Images with the same feature (i.e., the same value on a binary attribute) belong to the same feature category.

We focused on this face image dataset for two reasons. First, this dataset presents rich attributes for the same object (the human face) at the same scale. Compared to other datasets with numerous objects at different scales (e.g., ImageNet), a VAE can more accurately capture the underlying data distribution. Second, the well-labeled attributes in this dataset help to interpret the semantics encoded in the latent space of VAEs, from which we derived the semantic space using our linear model (Fig. 2).

Figure 2: Three spaces: (a) the image space is where the CelebA images reside, and each pixel is an independent dimension; (b) the latent space is the VAE-learned representation of those images, and the VAE encoder and decoder enable the transformation between the image space and the latent space; (c) the semantic space is derived from the latent space under the supervision of the 40 binary features of the images (using our linear model).

Image Semantics. The semantics of face images is the existence and scale of the 40 features in the CelebA dataset. A well-trained VAE can transfer the semantics from the image space to the VAE's latent space (i.e., from Fig. 2a to 2b). However, the transferred semantics in the latent space is not human-understandable. Hence, our goal is to interpret it via a semantic space (Fig. 2c), in which we can explore whether the image semantics have been accurately encoded (Fig. 2i) and how they are encoded (Fig. 2ii).

Semantic Direction. In Fig. 2b, we use a point to denote the VAE-encoded latent variable of the corresponding image in the image space. All latent variables for images of the same category (e.g., "glasses", "mustache") form a cluster in the space, denoted as a blue bubble in Fig. 2b. We identify the direction from the cluster without a particular feature to the cluster with that feature as the semantic direction for the feature. For example, in Fig. 2b, the directions of the red and green lines reflect the semantic directions for "mustache" and "glasses". Moving the latent variable of one image along a semantic direction changes the corresponding feature of the reconstructed image the most. Along this semantic direction, a vector with a certain length is referred to as a semantic vector. For CelebA, there are 40 features, and hence 40 unique semantic directions.
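The geometric intuition behind a semantic direction can be sketched in a few lines. Note that the paper derives its actual directions from the linear model of Sec. 3.2; the cluster-mean estimate below (direction from the "without-feature" cluster to the "with-feature" cluster) and the hypothetical `decoder` handle are only meant to illustrate Fig. 2b and the idea of a semantic vector α·d.

```python
import numpy as np

def semantic_direction(latents, labels):
    """Direction from the cluster without a feature to the cluster with it.

    latents: (n, d) array of VAE latent variables; labels: (n,) binary attribute.
    """
    with_feature = latents[labels == 1].mean(axis=0)
    without_feature = latents[labels == 0].mean(axis=0)
    return with_feature - without_feature

def edit_along_direction(decoder, z, direction, alpha):
    """Move a latent variable by the semantic vector alpha * d and decode it."""
    d = direction / np.linalg.norm(direction)
    return decoder(z + alpha * d)
```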
3.2. Our Contributions

The Linear Model. Enlightened by the linear arithmetic of features (e.g., woman face without glasses + (man face with glasses − man face without glasses) ≈ woman face with glasses) [4], we trained a linear model to quantify the semantic directions, as well as to transform the latent space into a human-understandable semantic space (i.e., from Fig. 2b to 2c). The linear model, Y = WᵀZ + b, is trained using the latent variables of all CelebA images encoded by the VAEs (denoted as Z) and the 40 binary attributes of the images (denoted as Y). Z is a vector with many dimensions (the black arrow in Fig. 2b), and Y is a vector with 40 dimensions (the green arrow in Fig. 2c). Each row of the weight matrix Wᵀ contains the derivatives of a certain feature with respect to all latent dimensions, which represents a semantic direction in the latent space. Each column of Wᵀ represents the contributions of a latent dimension to all the semantics.

Analytic Tasks and LatentVis. We focus on three analytic tasks to better understand and diagnose VAEs through the lens of their latent space: (T1) navigating the dataset and selecting features; (T2) visualizing and comparing the image semantics/semantic directions in a latent space; and (T3) facilitating model comparison and diagnosis by comparing latent spaces. Following these tasks, we propose a visual analytics system, LatentVis (Fig. 3, bottom), which contains three analytical modules corresponding to a hierarchical information flow (Fig. 3, top).

Figure 3: The framework of the LatentVis system.

The Data Module (Fig. 3a) gives an overview of the studied dataset, allowing us to flexibly explore data instances. It is also an interface to select any feature category and data instance of interest for further analysis in other modules (please check details in the Appendix).

The Semantics Module (Fig. 3b) demonstrates what semantics has been captured by the VAE, and how the data features are correlated in the latent space, by connecting the image space with the VAE latent space. Its three views follow a hierarchical information flow to detect, cluster, and compare semantics (see details in Sec. 4.1).

The Comparison Module (Fig. 3c) compares the semantic directions of the latent spaces from different VAEs to diagnose these models. The diagnosis of the compared VAEs is performed by examining the learning work division between the encoder and decoder (see details in Sec. 4.2).
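Before moving on, here is a minimal sketch of the linear model Y = WᵀZ + b described above, assuming the latent matrix Z (n×100) and the binary attribute matrix Y (n×40) have already been computed with the VAE encoder. Treating each attribute as a logistic output and optimizing with Adam is our choice for illustration; the paper only specifies that the model is linear and supervised.

```python
import torch
from torch import nn

def fit_semantic_directions(Z, Y, epochs=200, lr=1e-2):
    """Fit Y ~ W^T Z + b and return the rows of W as semantic directions.

    Z: (n, 100) latent variables from the VAE encoder; Y: (n, 40) binary attributes.
    """
    linear = nn.Linear(Z.shape[1], Y.shape[1])       # weight has shape (40, 100)
    opt = torch.optim.Adam(linear.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()                  # one binary target per attribute
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(linear(Z.float()), Y.float())
        loss.backward()
        opt.step()
    return linear.weight.detach()                     # row k = semantic direction of feature k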
4. Experiments and Results

Neural Network Structure. We worked with one regular VAE and one DFC-VAE, both with an encoder and a decoder of four convolutional layers. The four-layer encoder compresses the 64×64×3 CelebA images to 32×32×32, 16×16×64, 8×8×128, and 4×4×256. The compression result is then flattened and mapped to a 100D Gaussian distribution, represented by a 100D mean and a 100D standard deviation, through two fully-connected layers. The decoder has a structure symmetric to the encoder, but with a reversed order of the layers to up-sample the 100D latent variables (sampled from the 100D Gaussian).

The difference between the VAE and DFC-VAE is whether a pre-trained VGG19 model was used to compute the perceptual loss. We trained the VAE once and the DFC-VAE twice, with the same batch size and the Adam optimizer in all trainings. Note that the hyperparameters for these trainings could be the same or deliberately different, to compare the trained models and investigate the effect of the hyperparameters, e.g., comparing models trained with different learning rates to study their convergence speed. All three trainings used the 202,599 CelebA images and a batch size of 64. Every 800 batches were considered as a training stage to collect model statistics, such as loss values, and all three trainings were run for 197 stages (i.e., 157,600 batches).
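A hedged sketch of the encoder just described: the output shapes (32×32×32 down to 4×4×256) and the two 100D fully-connected heads follow the paper, while kernel sizes, strides, normalization, and activations are our assumptions.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """64x64x3 -> 32x32x32 -> 16x16x64 -> 8x8x128 -> 4x4x256 -> 100D mean / std."""
    def __init__(self, latent_dim=100):
        super().__init__()
        chans = [3, 32, 64, 128, 256]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2)]
        self.conv = nn.Sequential(*layers)
        self.fc_mean = nn.Linear(4 * 4 * 256, latent_dim)
        self.fc_logvar = nn.Linear(4 * 4 * 256, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(start_dim=1)
        mean, logvar = self.fc_mean(h), self.fc_logvar(h)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mean, logvar
```

The decoder would mirror this structure with transposed convolutions, up-sampling the 100D latent variable back to a 64×64×3 image.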
4.1. Detecting and Comparing Semantic Directions

We propose Algorithm 1 to detect and compare semantic directions in the Semantics Module (Fig. 4). First, we feed all CelebA images to the well-trained VAE model to obtain their latent variables. Then, we use these latent variables and their corresponding 40 binary attributes to train a linear model to capture the semantic directions encoded in the latent space. To verify the effectiveness of the semantic directions, we also generate many random directions in the latent space. Given any selected image, we visualize the modified version of the image resulting from changing its latent variable along the 40 semantic directions and a random direction. For example, by dragging the control points for the features "bangs" and "glasses" in Fig. 4a (i.e., changing the length of a semantic vector, α ∈ [−10, 10]), we observed how those two features were added (Fig. 4-a2) to the selected image (Fig. 4-a1). However, no obvious changes towards a particular feature were found in the image when dragging the control point on the axis representing random directions (Fig. 4-a3).

Algorithm 1: Detecting and Comparing Semantic Directions
Require: images X = {x_i, i = 1..n}, feature labels Y = {y_i, i = 1..n}
Require: selected image x̂ ∈ X, selected feature f, compared feature f′, the length of a semantic vector α
Require: the VAE model with an Encoder and a Decoder
1: for i in range(n) do
2:   z_i ← Encoder(x_i)
3: end for
4: train a linear model Y = WᵀZ + b
5: semantic directions D ← Wᵀ
6: d_f ← D[f, :]  // the row of D corresponding to feature f
7: randomly initialize d_r with ‖d_r‖ = ‖d_f‖
8: ẑ ← Encoder(x̂)
9: ẑ_f ← ẑ + α·d_f
10: ẑ_r ← ẑ + α·d_r
11: x̂_f ← Decoder(ẑ_f)
12: x̂_r ← Decoder(ẑ_r)
13: visualize(x̂, x̂_f, x̂_r) to verify the semantics in d_f  // Fig. 4a
14: x_f ← Decoder(α·d_f)
15: d_f′ ← D[f′, :]  // the row of D corresponding to feature f′
16: x_f′ ← Decoder(α·d_f′)
17: visualize(α·d_f, α·d_f′) and (x_f, x_f′) to compare semantics  // Fig. 4c

Figure 4: Semantics Module on DFC-VAE from the 197th training stage: (a) detect, (b) cluster, and (c) compare semantic directions. The cell color from red over white to blue in the matrix (b) indicates the cosine similarity of two semantic directions, from -1 over 0 to 1; (b1-b5) represent five clusters: "weak-woman-ish", "man-ish", "woman-ish", "old-ish", and "weak-man-ish".

We cluster the 40 feature categories (40 semantics) into five groups using the K-means algorithm, based on the cosine similarity between their corresponding semantic directions. Interestingly, the feature categories inside each group present similar semantics. For example, the features "make up", "no beard", and "attractive" are in the same group, which are all "woman-ish" features. With similar logic, the other four semantic groups are named "man-ish" (e.g., "mustache", "five-o'clock shadow"), "weak-woman-ish" (e.g., "smile", "bangs"), "weak-man-ish" (e.g., "bushy eyebrows", "hat"), and "old-ish" (e.g., "chubby", "bald"). The five clusters can be easily identified from the symmetric pair-wise similarity matrix (Fig. 4b), and we can select any two semantics (i.e., one row and one column) for comparison. For example, Fig. 4c shows the negative correlation between the "rosy cheeks" and "male" features, which are from the "woman-ish" (Fig. 4-b3) and "man-ish" (Fig. 4-b2) groups respectively. The negative correlation is indicated by an obtuse angle between the two colored semantic vectors (i.e., α·d_f and α·d_f′ in Algorithm 1). To visualize these two high-dimensional vectors α·d_f and α·d_f′ intuitively in a 2D plot, α·d_f (corresponding to the selected feature) is always drawn along the horizontal direction, and α·d_f′ (corresponding to the compared feature) is drawn at an angle with α·d_f, calculated as the angle between them in the original high-dimensional latent space. The length of each colored segment reflects the norm of the corresponding semantic vector. Dragging the green/red point in Fig. 4c in the opposite direction (i.e., changing α to a negative value), we can also verify that the opposite direction indeed encodes the opposite feature, e.g., "pale skin" is the opposite of "dark skin". Interestingly, we found several such pairs, mirroring the way humans understand these semantics, such as "smile" v.s. "scary face", and "bangs" v.s. "high hairlines".

From the above explorations and visual evidence, we feel confident about the following hypotheses on the semantic structure of the latent space: (1) the latent space tends to encode semantics along unique directions (i.e., semantic directions); (2) smaller angles between semantic directions denote similar semantics, and opposite semantic directions encode opposite semantics.
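The clustering of the 40 semantic directions can be reproduced roughly as follows, assuming the 40×100 direction matrix D from the linear model is available. Running K-means on L2-normalized directions (so that Euclidean distance corresponds to cosine dissimilarity) is our concrete choice; the paper only states that K-means and cosine similarity are used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_semantic_directions(D, n_clusters=5, seed=0):
    """D: (40, 100) matrix whose rows are semantic directions."""
    D_unit = normalize(D)                     # unit length -> cosine geometry
    similarity = D_unit @ D_unit.T            # 40x40 pairwise cosine matrix (Fig. 4b)
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(D_unit)
    return similarity, labels
```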
4.2. Comparing Semantics across VAEs

The Comparison Module compares two VAE models and facilitates model diagnosis using their latent spaces. The comparison is across different training stages, trainings (with randomly initialized neural network parameters), and VAE models. For each pair of compared models, we run Algorithm 1 to obtain their individual semantic directions and generate the corresponding reconstructed images. Given any selected image, Algorithm 1 also outputs its latent variable, from which we can regenerate the image with a VAE's decoder. We can use the same VAE's decoder and a different VAE's decoder to reconstruct two images and compare them. From the comparison, we can track the work division between an encoder and a decoder, and diagnose which of these two networks is more responsible for certain model functions.

Since we have two encoders and two decoders, there are four possible combinations between them. Each rectangle in Fig. 5a represents one combination. We use purple and orange to denote model 1 and model 2. The color of the left and right borders of a rectangle reflects which model's encoder is in use, whereas the color of the top and bottom borders reflects which model's decoder is in use. For example, the top right rectangle in Fig. 5a means the reconstructed image uses encoder 1 (the left and right borders of the rectangle are in purple) and decoder 2 (the top and bottom borders of the rectangle are in orange), i.e., the pair E1-D2.

When dragging the horizontal control point (i.e., changing α) at the top of the Comparison Module, these four images will be modified along the semantic directions (SD) for the focused semantics learned by the two models. Fig. 5b shows the mappings, i.e., which image is changing along which semantic direction. For example, Fig. 5i and Fig. 5ii show the reconstructed images when changing the top right image of Fig. 5a along the semantic directions learned from model 1 and model 2, respectively.

Figure 5: The mapping between images in the Comparison Module. (a) Reconstructed images from different combinations of the encoders and decoders of two compared VAEs; (b) reconstructed images when changing along the semantic direction (SD) learned by different VAEs.
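The four encoder-decoder combinations of Fig. 5, each moved along the semantic direction (SD) learned by either model, can be enumerated directly. The sketch below assumes hypothetical handles `encoders`, `decoders`, and `sds` (one direction per model for the focused feature):

```python
from itertools import product

def comparison_grid(x, encoders, decoders, sds, alpha):
    """Reconstruct x with every encoder/decoder pair, moved along each model's
    semantic direction (SD); returns a dict keyed by (enc_id, dec_id, sd_id)."""
    images = {}
    for (ei, enc), (di, dec) in product(enumerate(encoders, 1), enumerate(decoders, 1)):
        z = enc(x)
        for si, sd in enumerate(sds, 1):
            images[(ei, di, si)] = dec(z + alpha * sd)   # e.g. E1-D2 moved along SD1
    return images
```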
Comparing Semantic Directions Across Time. We take the DFC-VAE in an early training stage and a well-trained stage to perform the comparison. Fig. 6 investigates the DFC-VAE model with parameters from the 3rd (orange) and the 197th (purple) training stages. When moving the latent variable of the image in Fig. 6a along the semantic direction learned in those two stages, all six reconstructed images gained the "glasses" feature (as shown in Fig. 6b), regardless of the swapping of the encoders and decoders from the two training stages. This observation indicates that different stages of the training encode the semantic direction of the same feature in a consistent way. It also implies another insight on the semantic structure of the latent space, i.e., the semantic directions may have a tolerable range, within which the learned semantics evolves over the training process.

Figure 6: The Comparison Module on the "glasses" feature. Left: the orange and purple colors represent the DFC-VAE model from the 3rd and the 197th training stages. Right: the orange and purple colors represent two separate trainings of the same DFC-VAE model, both from the 197th training stage.

Comparing Semantic Directions Across Trainings. Focusing on a well-trained stage, we compared the DFC-VAE from two separate trainings (where the model parameters are randomly initialized in each training). Our goal is to explore whether the semantics is encoded in the same way in the two trainings. We used the same training hyperparameters and trained the same model twice with enough epochs. Fig. 6c investigates the semantic directions of the "glasses" feature from the two trainings. We found the "glasses" feature can only be generated when using matched encoder-decoder pairs. For example, Fig. 6c reconstructs an image using the decoder from the second training, but the latent variable is moved along the semantic direction learned from the first training. As a result, the "glasses" feature was not generated. Conversely, the "glasses" feature could be generated when moving along the semantic direction learned from the second training, as shown in Fig. 6d. The results shown in Fig. 6e (image with "glasses") and 6f (image without "glasses") further verify this. The observation indicates that different trainings of the same VAE may encode the semantic direction of the same feature differently.

4.3. Diagnosing VAEs via Semantics Comparison

Learning Process Comparison between Encoders and Decoders. To interpret the learning work division between the encoder and decoder, we explored the DFC-VAE model from a well-trained stage and an early training stage. We swapped the pairing between the two encoders and two decoders to investigate their respective responsibilities, i.e., the well-trained encoder is paired with the early-stage decoder, and the early-stage encoder is paired with the well-trained decoder. By comparing the reconstructed images from them, we discovered that a well-trained encoder is responsible for controlling semantics, while a well-trained decoder is responsible for generating clear images. For example, the image in Fig. 7-a2 reconstructs the image in Fig. 7-a0 using the early-stage encoder (E2) and the well-trained decoder (D1). Although the reconstruction did not capture the features of the original image (e.g., gender, hair style and color), the generated image is clear. On the contrary, the image in Fig. 7-a3 is reconstructed using the well-trained encoder (E1) and the early-stage decoder (D2). The image captures most of the features in the original image, but it is blurry. Similar observations can also be found in Fig. 7b, 7c, and 7d.

Figure 7: Images reconstructed from model 1 (DFC-VAE, 197th stage, in purple) and model 2 (DFC-VAE, 3rd stage, in orange) with matched and swapped encoder-decoder pairs. The numbers 1, 2, 3, 4 represent the combinations E1-D1, E2-D1, E1-D2, E2-D2, respectively.
We believe a well-trained encoder can better control semantics because it better captures the correlation between different semantics. In other words, better semantic correlations make the semantic directions more accurate in the latent space generated from a well-trained encoder. For example, Fig. 8 compares the correlations between the "glasses" feature and other features in the 3rd–6th, 8th, 13th, and 197th training stages. It is obvious that the negative correlations of different semantics (i.e., the region in the black dashed lines) were evolving gradually over the training. Comparing the trend from negative to positive correlations between semantics (i.e., red to blue cells), we can see that the negative correlations are acquired in later stages.

Figure 8: Comparing the cosine similarity between the "glasses" semantic direction and other semantic directions across seven training stages; red over white to blue denotes values from -1 over 0 to 1.

Comparing VAE and DFC-VAE. Although the VAE and DFC-VAE share a similar network structure, the feature perceptual loss used in DFC-VAE dramatically improved the semantics learning. The image features generated from DFC-VAE tend to be less blurry and more recognizable than those from VAE. For example, the left and right images in the five pairs of images in Fig. 9a compare five image features generated from DFC-VAE and VAE, respectively, at the 3rd training stage. Fig. 9b shows the same comparison but using the parameters of DFC-VAE and VAE from the 197th stage. Comparing Fig. 9a and 9b vertically, i.e., across time, we can see that DFC-VAE enhanced the features in the reconstructed images more than VAE.

Figure 9: Comparing the reconstructed images along semantic directions at α = 4.5 from DFC-VAE (purple borders) and VAE (orange borders) in (a) the 3rd and (b) the 197th stage.

Comparing the semantic correlations learned by DFC-VAE and VAE in those two training stages, we found that both models captured the semantic correlations at a similar pace. For example, the top and bottom rows of the 10 row-pairs in Fig. 10 show the semantic correlations from the DFC-VAE and VAE respectively, in stages 3 and 197. As highlighted by the rectangles, the evolution of the correlations between the five semantics and other semantics is similar in both models, from the 3rd stage (left) to the 197th stage (right).

Figure 10: Comparing the semantic correlations learned by DFC-VAE (top row) and VAE (bottom row) between the five semantic directions of interest and other semantic directions in (a) the 3rd and (b) the 197th stage.

Combining our observations from Fig. 9 and Fig. 10, we get a better understanding of how the perceptual loss (from the pre-trained VGG19) was affecting the model, i.e., compared to DFC-VAE, VAE captured the correlations between different semantics but still could not generate clear features. We suspect that the perceptual loss contributed more to improving the decoder for better reconstructing image features.
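The evolution shown in Fig. 8 amounts to re-fitting the linear model at each saved training stage and comparing one feature's direction against all others. A sketch assuming a list `D_stages` of per-stage 40×100 direction matrices (a name we introduce here for illustration):

```python
import numpy as np

def direction_correlation_over_stages(D_stages, feature_idx):
    """For each training stage, cosine similarity between one feature's semantic
    direction and the directions of all 40 features (one row of Fig. 8 per stage)."""
    rows = []
    for D in D_stages:
        D_unit = D / np.linalg.norm(D, axis=1, keepdims=True)
        rows.append(D_unit @ D_unit[feature_idx])
    return np.stack(rows)          # shape: (num_stages, 40)
```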
These are in- β’ For the specific dataset we worked on, the per- teresting research directions for us to explore in the ceptual loss of DFC-VAE contributes more to the future. However, similar to the current limitation of training of the decoder in better reconstructing our work, these future works also heavily depend on image features. Without using the perceptual the availability of the labeled datasets. loss, VAE is still able to accurately capture the Moreover, it is also possible to extend LatentVis to semantics correlations. VAEs trained on other data types, e.g., texts or audios. Compared to images, those types of data may not be These explorations and comparisons demonstrate how able to be visually interpreted. However, through dif- the latent spaces can be used to interpret and compare ferent visual encodings used in existing works, we be- the corresponding VAEs. With the promising results lieve they can be intuitively presented as well. We plan demonstrated in the paper, we are confident in extend- to investigate more from the literature and spend more ing LatentVis to other latent variable models or other efforts this direction in the future. data types in the future. It is worth mentioning that our current explorations in this work are heuristic and based only on one dataset, through which, we hope to shed some light on how References the latent space of VAEs captured the semantics of im- [1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Ima- ages. More thorough experimental studies on more genet classification with deep convolutional neu- datasets would be needed to further validate our find- ral networks, in: Advances in neural information ings, which is another planned future work for us. processing systems, 2012, pp. 1097β1105. [2] D. Erhan, C. Szegedy, A. Toshev, D. Anguelov, 6. Conclusion Scalable object detection using deep neural net- works, in: Proceedings of the IEEE conference on In this paper, we propose LatentVis, a visual analyt- computer vision and pattern recognition, 2014, ics system to interpret and compare the semantics en- pp. 2147β2154. coded in the latent space of image-based VAEs. The [3] O. Ronneberger, P. Fischer, T. Brox, U-net: Con- system trains a supervised linear model to bridge the volutional networks for biomedical image seg- machine learned latent space with the human under- mentation, in: International Conference on Med- standable semantic space. From this bridging, we found ical image computing and computer-assisted in- that data semantics is usually expressed along a fixed tervention, Springer, 2015, pp. 234β241. direction in the latent space (i.e., semantic direction), [4] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint visual analytics approach to understand the ad- arXiv:1511.06434 (2015). versarial game, IEEE transactions on visualiza- [5] T. Karras, T. Aila, S. Laine, J. Lehtinen, Pro- tion and computer graphics 24 (2018) 1905β1917. gressive growing of gans for improved qual- [19] M. Kahng, P. Y. Andrews, A. Kalro, D. H. Chau, ity, stability, and variation, arXiv preprint ActiVis: Visual Exploration of Industry-Scale arXiv:1710.10196 (2017). Deep Neural Network Models, arXiv e-prints [6] H. Sak, A. Senior, K. Rao, F. Beaufays, Fast (2017) arXiv:1704.01942. arXiv:1704.01942. and accurate recurrent neural network acoustic [20] D. Bau, B. Zhou, A. Khosla, A. Oliva, A. 
[7] H. Shao, A. Kumar, P. Thomas Fletcher, The riemannian geometry of deep generative models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 315–323.
[8] G. Arvanitidis, L. K. Hansen, S. Hauberg, Latent space oddity: on the curvature of deep generative models, arXiv preprint arXiv:1710.11379 (2017).
[9] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[10] S. Liu, P.-T. Bremer, J. J. Thiagarajan, V. Srikumar, B. Wang, Y. Livnat, V. Pascucci, Visual exploration of semantic relationships in neural word embeddings, IEEE transactions on visualization and computer graphics 24 (2018) 553–562.
[11] L. Kuhnel, T. Fletcher, S. Joshi, S. Sommer, Latent space non-linear statistics, arXiv preprint arXiv:1805.07632 (2018).
[12] J. Wang, W. Zhang, H. Yang, Scanviz: Interpreting the symbol-concept association captured by deep neural networks through visual analytics, in: 2020 IEEE Pacific Visualization Symposium (PacificVis), IEEE, 2020, pp. 51–60.
[13] P. Bojanowski, A. Joulin, D. Lopez-Paz, A. Szlam, Optimizing the latent space of generative networks, arXiv preprint arXiv:1707.05776 (2017).
[14] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, R. Hadsell, Meta-learning with latent embedding optimization, arXiv preprint arXiv:1807.05960 (2018).
[15] Y. Hoshen, J. Malik, Non-adversarial image synthesis with generative latent nearest neighbors, arXiv preprint arXiv:1812.08985 (2018).
[16] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International conference on machine learning, 2014, pp. 1188–1196.
[17] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, S. Liu, Towards better analysis of deep convolutional neural networks, arXiv e-prints (2016) arXiv:1604.07043.
[18] J. Wang, L. Gou, H. Yang, H.-W. Shen, Ganviz: A visual analytics approach to understand the adversarial game, IEEE transactions on visualization and computer graphics 24 (2018) 1905–1917.
[19] M. Kahng, P. Y. Andrews, A. Kalro, D. H. Chau, ActiVis: Visual exploration of industry-scale deep neural network models, arXiv e-prints (2017) arXiv:1704.01942.
[20] D. Bau, B. Zhou, A. Khosla, A. Oliva, A. Torralba, Network dissection: Quantifying interpretability of deep visual representations, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6541–6549.
[21] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, 2016, pp. 1135–1144.
[22] Y. Ming, H. Qu, E. Bertini, Rulematrix: Visualizing and understanding classifiers with rules, IEEE transactions on visualization and computer graphics 25 (2019) 342–352.
[23] J. Zhang, Y. Wang, P. Molino, L. Li, D. S. Ebert, Manifold: A model-agnostic framework for interpretation and diagnosis of machine learning models, IEEE transactions on visualization and computer graphics 25 (2019) 364–373.
[24] J. Wang, L. Gou, W. Zhang, H. Yang, H.-W. Shen, Deepvid: Deep visual interpretation and diagnosis for image classifiers via knowledge distillation, IEEE transactions on visualization and computer graphics 25 (2019) 2168–2180.
[25] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530 (2016).
[26] E. D. Cubuk, B. Zoph, S. S. Schoenholz, Q. V. Le, Intriguing properties of adversarial examples, arXiv preprint arXiv:1711.02846 (2017).
[27] X. Hou, L. Shen, K. Sun, G. Qiu, Deep feature consistent variational autoencoder, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2017, pp. 1133–1141.
[28] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114 (2013).
[29] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: European conference on computer vision, Springer, 2016, pp. 694–711.
[30] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv e-prints (2014) arXiv:1409.1556.
[31] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of International Conference on Computer Vision (ICCV), 2015.

A. Appendix

The Data Module contains two linked visualization views. The first view presents a statistical summary of all training images. Each bubble in this view represents one feature category (the color of the bubble corresponds to the clusters in Fig. 4b), and the distances among bubbles reflect the Euclidean distances between those feature categories in the latent space. These distances are calculated via the Multi-Dimensional Scaling (MDS) algorithm, whose input is the average latent variable of the images belonging to the same feature category. Clicking on any bubble in this view triggers the second view to display images from the corresponding category. The second view displays numerous randomly selected images from the selected image category, so that users can check the features of those images and select interesting ones for further exploration. To save screen space, images are scaled down to 32×32. Clicking on any image in this view triggers further updates in other views.
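A sketch of the bubble-layout computation described above, assuming per-image latent variables Z (n×100) and an integer feature-category id per image; the per-category average latents are embedded into 2D with scikit-learn's MDS on their pairwise Euclidean distances:

```python
import numpy as np
from sklearn.manifold import MDS

def bubble_layout(Z, categories):
    """Z: (n, d) latent variables; categories: (n,) integer feature-category ids."""
    cats = np.unique(categories)
    centers = np.stack([Z[categories == c].mean(axis=0) for c in cats])
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(dists)
    return dict(zip(cats.tolist(), coords))   # 2D position per feature category
```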