Intrinsic Analysis of Learned Representations in Encoder-Decoder Architectures

Shashi Durbha1,2, Thomas Goerttler1, Eduardo Vellasques2, Jan Hendrik Stockemer2 and Klaus Obermayer1,3
1 Technische Universität Berlin, Chair of Neural Information Processing, Germany
2 SAP SE, Germany
3 Bernstein Center for Computational Neuroscience Berlin, Germany

Abstract
Encoder-decoder architectures are in widespread use, both in research and in industry. Recently, similarity analysis applied to the representations of neural networks has contributed to a better understanding of such architectures. Previous work has found that for two instances (under different initialization) of the same layer of a given model, the learned representations become more and more dissimilar the farther that layer is from the input layer. Since encoder-decoder architectures are often mirrored (causing a representational bottleneck), we investigate how the representation changes and whether the objective of reconstructing the input influences this. Using representational similarity analysis, we find that corresponding layers of the encoder and decoder are not very similar to each other and are more similar to their neighboring layers. In addition, our experiments show that, in contrast to classification tasks, the representations of the same decoder under different initializations become more and more similar the closer a layer is to the output layer. Our analysis also compares the average distance between layers to the average distance within the current layer's cluster, and examines the impact of varying the latent dimension, as well as of having multiple bottleneck layers, on representational consistency.

Keywords
Neural Networks, Encoder-Decoder Architectures, Representational Similarity Analysis (RSA), Representational Consistency

1. Introduction
One of the first successful cases of training deep architectures involved successively training (and subsequently stacking) denoising autoencoders [1].
However, until very recently, much was hypothesized but little was established about how learning occurs as information propagates through the different layers. Nevertheless, the research community has shed some light on this subject in the last few years. For example, Ansuini et al. [2] applied intrinsic dimension analysis to examine the geometrical properties of the representations learned by a deep neural network. Popal et al. [3] use Representational Similarity Analysis (RSA) to analyze representation learning in CNNs. In RSA, a Representational Dissimilarity Matrix (RDM) is employed in order to characterize a system's inner stimulus representations in terms of pairwise response differences [4]. In a technique called Representational Consistency (RC), Mehrer et al. [5] apply methods until then only used in the field of neuroscience in order to compare different instances of the same neural network under random initialization. More specifically, they employ Pearson correlation analysis to quantify the shared variance between the RDMs of the different instances. One important takeaway from their research is that, by relying on multiple instances of the same architecture, it is possible to derive insights about representation learning in a statistically quantifiable way.

LWDA'22: Lernen, Wissen, Daten, Analysen. October 05-07, 2022, Hildesheim, Germany
thomas.goerttler@tu-berlin.de (T. Goerttler); eduardo.vellasques@sap.com (E. Vellasques); jan.hendrik.stockemer@sap.com (J. H. Stockemer); klaus.obermayer@tu-berlin.de (K. Obermayer)
https://www.thomas.goerttler.de/ (T. Goerttler); https://www.ni.tu-berlin.de/ (K. Obermayer)
ORCID: 0000-0002-1437-0235 (T. Goerttler); 0000-0002-5057-6142 (K. Obermayer)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
In this paper, we apply RSA to the representations of (convolutional) autoencoders and observe their consistency. In contrast to the supervised classifiers examined by Mehrer et al. [5], where consistency decreases with every stacked layer, for the autoencoder the consistency reaches a minimum in the bottleneck layer and increases again in the decoder part. We also investigate the influence of the size of the latent layer: the larger it is, the more consistent the representations stay. Analyzing a single model, we see that the layers of a network are more similar to their neighboring layers than to their corresponding mirror layers.

2. Background
In this section, we describe encoder-decoder architectures and introduce representational similarity techniques.

2.1. Encoder-Decoder Architectures
Encoder-decoder architectures are networks in which an encoder takes a variable-length sequence as input and transforms it into a state with a fixed shape. The decoder then maps the encoded state of fixed shape to a variable-length sequence. The final encoder layer, from which the decoder maps the encoded state, is known as the latent dimension layer; it has the fewest neurons among all the layers of the network. Encoder-decoder architectures are in widespread use in Natural Language Processing (NLP) and image denoising. In NLP, the common approach is to use sequential models such as Recurrent Neural Networks (RNNs) as encoder/decoder.

2.1.1. Autoencoders
Baldi [6] describes autoencoders as 'simple learning circuits which aim to transform inputs into outputs with the least possible amount of distortion'. Goodfellow et al. [7] describe autoencoders as a special type of neural network belonging to the class of unsupervised networks. Such a model is trained using a reconstruction loss. It is common practice to add noise to the input, which leads to a learned representation that is robust to noise.
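The denoising setup described above can be made concrete with a small sketch. The following toy example (a minimal illustration in plain NumPy; the layer sizes, learning rate, and variable names are our own and are not the architectures used in this paper) trains a one-hidden-layer denoising autoencoder by gradient descent: noisy inputs are encoded and decoded, and the reconstruction is compared against the clean input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 200 samples with 16 features, corrupted with Gaussian noise
X = rng.random((200, 16))
X_noisy = X + 0.1 * rng.standard_normal(X.shape)

# Encoder h = f(x) = sigmoid(x W1 + b1); decoder r = g(h) = h W2 + b2
W1 = 0.1 * rng.standard_normal((16, 4)); b1 = np.zeros(4)
W2 = 0.1 * rng.standard_normal((4, 16)); b2 = np.zeros(16)

lr = 0.5
losses = []
for _ in range(500):
    H = sigmoid(X_noisy @ W1 + b1)   # encoder: 4-unit bottleneck
    R = H @ W2 + b2                  # decoder: linear reconstruction
    err = R - X                      # compare against the *clean* input
    losses.append(np.mean(err ** 2))
    # Backpropagation of the mean-squared reconstruction loss
    dR = 2 * err / X.size
    dW2 = H.T @ dR; db2 = dR.sum(0)
    dZ = (dR @ W2.T) * H * (1 - H)   # sigmoid derivative
    dW1 = X_noisy.T @ dZ; db1 = dZ.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(losses[0], losses[-1])  # reconstruction loss decreases during training
```

Even in this toy version, the reconstruction loss decreases although the network only ever sees corrupted inputs, illustrating why the learned representation becomes robust to noise.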
Overall, the network may be viewed as consisting of two parts: (i) an encoder (h = f(x)) and (ii) a decoder (r = g(h)). The representational bottleneck forces the model to discard redundant information, so that it can learn the useful properties of the data.

2.2. Representational Similarity Analysis (RSA) and Representational Consistency (RC)
Representational Similarity Analysis (RSA) was first proposed in the field of neuroscience [4] as a way to compare computational models to brain-activity data. Recently, it has also become a tool for analyzing the representations of deep convolutional networks. Mehrer et al. [5] first proposed using RSA to analyze the behavior of Convolutional Neural Networks (CNNs) under different initializations. More recently, Goerttler and Obermayer [8] employed the same technique to analyze model-agnostic meta-learning. The main building block of RSA is the Representational Dissimilarity Matrix (RDM), a symmetric matrix containing pairwise measurements for elements of two representation vectors (given a set of test stimuli). More formally, for any given pair of representation vectors h_i and h_j, the RDM is defined as

    RDM_{i,j} = δ(h_i, h_j),    (1)

where δ(h_i, h_j) is a dissimilarity function. Kriegeskorte et al. [4] propose using 1 − correlation as the dissimilarity function. Next to RSA, there also exist similar approaches, e.g., CKA [9] or CCA [10], which have also been used to analyze the similarity of deep networks [11, 12]. Given an RDM, it is possible to analyze how well the distribution of representational distances generalizes across network instances. This can be done using Representational Consistency (RC), which is defined as the shared variance between the upper triangles of the instances' RDMs.
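Equation (1) with the 1 − correlation dissimilarity of Kriegeskorte et al. [4] can be sketched in a few lines of NumPy (a toy illustration; the stimulus count and layer width are arbitrary):

```python
import numpy as np

def rdm(responses):
    # Representational Dissimilarity Matrix (Eq. 1) with
    # delta(h_i, h_j) = 1 - Pearson correlation between representation vectors.
    # `responses` holds one representation vector per stimulus (row-wise).
    return 1.0 - np.corrcoef(responses)

rng = np.random.default_rng(1)
layer_responses = rng.random((10, 50))  # 10 stimuli, 50-unit layer (toy data)
M = rdm(layer_responses)
print(M.shape)  # (10, 10)
```

Because the matrix is symmetric with a zero diagonal, only its upper triangle carries information, which is what the consistency analysis operates on.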
More specifically, given a set of instances X = {x_1, ..., x_n} and a layer g_i(·), the RC for layer i (RC_i) is defined as

    Z = ‖X‖ (‖X‖ − 1) / 2,    (2)

    RC_i = ( ∑_{j<k} PCC(g_i(x_j), g_i(x_k)) ) / Z,    (3)

where ‖X‖ is the number of instances in X and PCC is the Pearson correlation coefficient. The RC for a given model is defined as the average RC over all layers of that model. The intuition behind RC is that the closer the value is to 1, the higher the consistency of that particular layer. For example, the RC of the input layer is always 1, because the inputs are the same for all instances of the network. As we progress deeper into the network, we see a continuous drop in consistency until we reach the bottleneck layer.

Multidimensional scaling (MDS) [13] gives a concise 2D projection of representation vectors. Since an RDM contains distances between instances of representation vectors, it can be used to project these instances into a 2D space such that similar instances are placed closer together than dissimilar ones.

3. Experimental setup
The goal of our experiments is to get a better understanding of the representational behavior of autoencoder architectures.
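The RC computation of Eqs. (2) and (3), which is used throughout the experiments, can be sketched as follows (a toy NumPy illustration; function and variable names are our own): each instance's layer responses are turned into an RDM, and the upper triangles of these RDMs are then correlated pairwise.

```python
import numpy as np

def rdm(responses):
    # 1 - Pearson correlation dissimilarity (Eq. 1)
    return 1.0 - np.corrcoef(responses)

def representational_consistency(instance_responses):
    # RC for one layer (Eqs. 2 and 3): average Pearson correlation between
    # the upper-triangle entries of the RDMs of all pairs of instances.
    rdms = [rdm(r) for r in instance_responses]
    iu = np.triu_indices_from(rdms[0], k=1)   # upper triangle, no diagonal
    vecs = [m[iu] for m in rdms]
    n = len(vecs)
    z = n * (n - 1) / 2                       # Eq. (2)
    total = sum(np.corrcoef(vecs[j], vecs[k])[0, 1]
                for j in range(n) for k in range(j + 1, n))
    return total / z                          # Eq. (3)

rng = np.random.default_rng(2)
# Four instances with identical responses -> RC of 1, like the input layer
same = [rng.random((8, 30))] * 4
rc = representational_consistency(same)
print(rc)  # 1.0 (up to floating-point error)
```

Identical instances yield RC = 1, matching the observation above that the input layer always has maximal consistency; diverging RDMs push the value towards 0.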
To that end, we analyze representational similarity and consistency under slight variations of the experimental setup described below.

Table 1
Basic architectures for the experiments. In some experiments we included additional layers as indicated.

(a) Autoencoder for MNIST
Index  Type          Output Shape
1      Input Layer   784
2      Dense1        512
3      Dense2        256
4      Dense3        128
5      Dense4        64
6      Bottleneck1   8
7      Bottleneck2   8
8      Bottleneck3   8
9      Dense8        64
10     Dense9        128
11     Dense10       256
12     Dense11       512
13     Output Layer  784

(b) Convolutional autoencoder for CIFAR10
Index  Type          Output Shape
1      Input Layer   (32, 32, 3)
2      Convolution1  (32, 32, 32)
3      MaxPooling1   (16, 16, 32)
4      Convolution2  (16, 16, 64)
5      MaxPooling2   (8, 8, 64)
6      Convolution3  (8, 8, 128)
7      Flatten       8192
8      Bottleneck1   12
9      Bottleneck2   12
10     Bottleneck3   12
11     Dense         8192
12     Reshape       (8, 8, 128)
13     Convolution4  (8, 8, 64)
14     UpSampling1   (16, 16, 64)
15     Convolution5  (16, 16, 32)
16     UpSampling2   (32, 32, 32)
17     Output Layer  (32, 32, 3)

3.1. Training
Except for the experiment analyzing the change in consistency as the number of epochs increases logarithmically up to 256, all models were trained using early stopping, monitoring the validation loss with a patience of 10.

Additionally, for the experiments on representational consistency: in a Keras-based autoencoder, the default kernel initialization for a dense/convolutional layer is Glorot-uniform. To vary this default, our experiments use Glorot-normal initialization instead and examine the resulting differences.

3.2. Architectures
In our experiments, we use two different architectures, one with only fully connected layers and one that also contains convolutional layers. Table 1a shows the base architecture of the fully connected network, whereas Table 1b depicts the convolutional architecture. In several experiments, we extend the networks as indicated.

3.3.
Datasets
Below is a short description of the datasets used in the experiments.

3.3.1. MNIST (Modified National Institute of Standards and Technology)
The MNIST database [14] consists of 70,000 hand-written grayscale digits of size 28 × 28, split into training and test sets at a ratio of 6:1, together with their labels. It contains ten classes (the digits 0-9) with an equal number of images in each class. It is very widely used in the field of ML and easily accessible through the Keras dataset library; for these reasons, it is one of the default benchmark datasets in the field. When working with the regular autoencoder, the pixel values were binarized using a threshold of 255 and flattened into a vector of size 784. The main motivation for using this dataset was to compare the results between the two architectures, the regular and the convolutional autoencoder. Figure 1a shows a sample image from the dataset.

Figure 1: Sample images from (a) the MNIST dataset [14] and (b) the CIFAR10 dataset [15].

3.3.2. CIFAR10 (Canadian Institute For Advanced Research)
Similar to MNIST, CIFAR10 [16] is a database of 60,000 images in ten classes, with 6,000 images per class. It is split into training and test sets at a ratio of 5:1, together with its labels. Unlike MNIST, the images in CIFAR10 are of size 32 × 32 with three RGB channels. Figure 1b shows sample images from this dataset.

4. Results
In this section, we present our results.

4.1. Representational similarity in autoencoders
In our first experiment (Figure 2a), we show all layers of ten differently initialized instances embedded into two dimensions according to their respective similarities.
The input and output layers are comparatively close to each other, while all the other layers are quite far apart. We can conclude that the points of the same layer move further apart from each other as we go deeper into the network, and start getting closer to each other again as we approach the end. The input representations are obviously all the same. The next layer (Encoder1, E1 in the plot in Figure 2a) has all its points very close to each other, except perhaps one point, which is comparatively far away from all the other points of the same color. Such outliers become more frequent as we go deeper into the layers. In Figure 2b, we observe that, except for the first and last layers, the distance to the successive (next) layer is smaller than the overall cluster average.

Figure 2: (a) Sample MDS plots for 10 models on the MNIST dataset; (b) sample comparison between the cluster average and the next-layer distance.

4.2. Representational consistency in autoencoders

Figure 3: Representational consistency with respect to the number of epochs and the size of the latent space. (a) Comparison over logarithmically increasing epochs for a fully connected autoencoder with 8 neurons in the bottleneck layers; (b) comparison between multiple latent dimensions for an autoencoder architecture on the MNIST dataset.

Similar to Mehrer et al. [5], we also investigate the representational consistency described in Section 2.2. In Figure 3a, we see that the similarity of the representations increases after the bottleneck layer. This differs from the supervised case of Mehrer et al. [5], where the similarity keeps diverging towards the output layer. This is very interesting, as it means that an autoencoder can reconstruct the representation despite having different encodings.
In addition, the plot in Figure 3a compares the consistencies obtained with a varying number of epochs (from 0 to 256, increasing logarithmically). Epoch 0 in the plot corresponds to the consistency calculated without any prior training. In the trained cases, we see that as we keep increasing the number of epochs, the consistencies increase and gradually appear to converge. A key observation from these experiments is that untrained networks behave as expected, i.e., their consistency decreases as we progress through the layers. We also observe that convergence with regard to consistency is very fast: the typical autoencoder behavior appears after a single epoch.

4.2.1. Impact of varying the latent dimension on representational consistency
In this experiment, we observe the representational consistency when changing the dimension of the latent vector while keeping everything else the same. Based on the results shown in Figure 3b, we see a clear correlation between the representational consistency and the number of neurons in the latent dimension layers. As the number of neurons increases, the representational consistency of the bottleneck layer also increases. Increasing the number of neurons in the bottleneck layer from 10 to 40 (though these values are comparatively small) means that the information held in the previous layers has to be compressed less in the case of 40 neurons than in the case of ten.
Some information loss always occurs in the transfer from the previous encoder layer to the bottleneck layer, but it is smaller with 40 neurons than with ten, which correlates with a more consistent reproduction of the input image at the output and hence a better consistency in the final output layer as well.

4.2.2. Representational consistency with multiple bottleneck layers
For architectures with multiple bottleneck layers (3, 5 and 7), the following experiments were conducted, each with ten differently initialized instances: (i) fully connected autoencoder on the MNIST dataset with 8 and 16 neurons in the bottleneck layers, respectively; (ii) convolutional autoencoder on the MNIST dataset with 8 and 16 neurons in the bottleneck layers, respectively; (iii) convolutional autoencoder on the CIFAR10 dataset with 12 and 25 neurons in the bottleneck layers, respectively. The results can be seen in the plots in Figure 4, which show the comparisons for a varying number of bottleneck layers on the MNIST dataset, for a fully connected as well as a convolutional autoencoder, each with 16 neurons in the bottleneck layers. A common latent dimension and dataset were chosen so that the networks are more comparable. In both cases, there is not much separating the consistencies of the networks with five and seven bottleneck layers. However, the network with three bottleneck layers has a much higher consistency at the bottleneck level, as well as overall, in both cases.

Figure 4: Comparison of representational consistencies on the MNIST dataset for different networks with a varying number of bottleneck layers. (a) Fully connected autoencoder with 16 neurons in the bottleneck layers; (b) convolutional autoencoder with 16 neurons in the bottleneck layers.
For the networks with five and seven bottleneck layers, the consistencies flatten out in the region shaded blue in the plots. Although this cannot yet be confirmed with certainty, this trend has been largely consistent regardless of the dataset and the network. An interesting direction for future work would be to replicate these results on much larger and more complicated datasets.

4.2.3. Dependency of consistency on the number of trained model instances
To check how the consistency depends on the number of trained model instances with different random initial weights, two experiments were performed. For the fully connected autoencoder trained on MNIST, as well as the convolutional autoencoder trained on CIFAR-10, we compared the consistency values in the different layers for different numbers of model instances. The results are shown in Figure 5a for the fully connected autoencoder and Figure 5b for the convolutional autoencoder. The fully connected autoencoder, tested with 10, 20, 30 and 40 model instances, shows convergence at 20 instances. Due to technical limitations, the convolutional autoencoder was only tested with 5, 10, 15 and 20 model instances. While the consistency has not converged at these instance numbers, we see the same pattern as for the fully connected autoencoder and assume that using more than 20 model instances would not result in relevant changes.

4.2.4. Effect of an increased number of layers on representational consistency
Figure 6 shows all the representational consistencies in a single plot for a convolutional autoencoder on the MNIST dataset. We see that, regardless of which layer type is duplicated, the overall highest consistency is shown by the network that has no duplicated layers at all.
The basic motivation behind this experiment was to see how adding multiple layers of a certain type, before and after the bottleneck layer, impacted the representational consistencies across multiple instances.

Figure 5: Representational consistency with a varying number of randomly initialized model instances. (a) Fully connected autoencoder on MNIST; (b) convolutional autoencoder on CIFAR-10.

Figure 6: Representational consistencies for all the duplicated layers together. The orange line indicates multiple encoder layers, the green line multiple decoder layers, and the red line multiple bottleneck layers; the purple line shows the network without any duplicated layers. The green shade in the plot marks the multiple encoder layers, the blue shade the multiple bottleneck layers, and the yellow shade the multiple decoder layers.

A small note: when working with a convolutional autoencoder, the bottleneck layer was a dense layer, but the other duplicated layers (C2 and C5) were all convolutional.

4.3. Comparison of encoder and decoder
So far, we have only looked at the representational consistency across ten models with different random initializations. In this section, we look at individual models, in order to understand single models better and to observe whether mirrored layers share a structure. Figure 7 shows the MDS plots for two sample instances obtained on the CIFAR10 dataset with a convolutional autoencoder with 12 neurons in its bottleneck layer. A slight distance is seen between the input and the output. It is important to understand, however, that the motive behind these experiments was not to find which parameters or architectures give the most accurate results. Hence, we chose bottleneck sizes (8 and 16 neurons for the MNIST dataset, and 12 and 25 neurons for the CIFAR10 dataset) that did not give us pixelated outputs, while at the same time not overcompensating for what these experiments attempt to study.

Figure 7: Sample MDS plots for individual instances. These sample plots are taken from the convolutional autoencoder experiments on the CIFAR10 dataset. The max-pooling and upsampling layers have been removed from the plots, as they do not contribute significantly to the general direction of the network in the plot; all convolutional (C) and dense (D) layers were kept, along with the input and output layers. D1 is the only bottleneck layer used in this experiment; the second dense layer, D2, is used to rebuild the neurons for the convolutional layers.

Although the individual instance plots look similar in nature, we see some differences in the directions of the layers, which can be attributed to the different random initializations of these networks.

4.3.1. Analysing the impact on representational consistency when varying the latent dimension
It is important to note that when the input and output layers are included in an instance plot, these two layers will always be closer to or farther from each other depending on the number of neurons chosen for the latent dimension. This, however, can be misleading because of what the autoencoder attempts to do. For this reason, in the plots of Figure 8, the input and output layers were left out, and a simpler comparison was made between all the other layers of the network. As shown in Figure 7, we do not see much difference between different instances of the same latent dimension.
For the current analysis in Figure 8, we see four different instance plots (all four taken from the 3rd instance of each experiment, for a consistent comparison), where the instances have been plotted without the input and output layers, with linearly increasing latent dimensions (10, 20, 30 and 40 neurons). Even in this scenario, we see consistency among the models. As previously discussed, we would normally expect a U-shape between the mirror layers of the model, but we instead see a kind of V-shape forming between them, with the shortest distance between the final encoder layer and the first decoder layer, and the distance increasing all the way to the first encoder layer and the final decoder layer. This distance appears to be consistent across all instances, even with an increasing latent dimension.

Figure 8: Comparing instance plots with varying latent dimension: (a) 10 neurons, (b) 20 neurons, (c) 30 neurons, (d) 40 neurons.

5. Conclusion
In this paper, we demonstrated that for an encoder-decoder neural network, as information is propagated through the encoder layers, the representational consistency of the layers decreases, and the opposite happens for the decoder layers. The representational bottleneck forces the encoder to "prioritize" features, discarding those not relevant for reconstruction. We also demonstrated that, for any given instance, the representations tend to diverge (from the input) as information is propagated through the encoder layers and then converge (towards the input) as information is propagated through the decoder. We further observed that even with multiple bottleneck layers, the consistency is not constant across the bottleneck region: there is a dip even among the bottleneck layers, and in these instances it occurs at the middle bottleneck layer.
In practical terms, these findings validate the common intuition behind encoder-decoder neural networks, namely that the bottleneck provides a form of "lossy compression". For practitioners, this type of analysis can be used as a helpful tool when debugging novel architectures.

References
[1] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096-1103.
[2] A. Ansuini, A. Laio, J. H. Macke, D. Zoccolan, Intrinsic dimension of data representations in deep neural networks, Advances in Neural Information Processing Systems 32 (2019).
[3] H. Popal, Y. Wang, I. R. Olson, A guide to representational similarity analysis for social neuroscience, Social Cognitive and Affective Neuroscience 14 (2019) 1243-1253.
[4] N. Kriegeskorte, M. Mur, P. A. Bandettini, Representational similarity analysis - connecting the branches of systems neuroscience, Frontiers in Systems Neuroscience 2 (2008) 4.
[5] J. Mehrer, C. J. Spoerer, N. Kriegeskorte, T. C. Kietzmann, Individual differences among deep neural network models, Nature Communications 11 (2020) 1-12.
[6] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in: Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, JMLR Workshop and Conference Proceedings, 2012, pp. 37-49.
[7] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, volume 1, MIT Press, Cambridge, 2016.
[8] T. Goerttler, K. Obermayer, Exploring the similarity of representations in model-agnostic meta-learning, arXiv preprint arXiv:2105.05757 (2021).
[9] S. Kornblith, M. Norouzi, H. Lee, G. E. Hinton, Similarity of neural network representations revisited, in: K. Chaudhuri, R.
Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 3519-3529.
[10] A. Morcos, M. Raghu, S. Bengio, Insights on representational similarity in neural networks with canonical correlation, Advances in Neural Information Processing Systems 31 (2018).
[11] S. Kornblith, T. Chen, H. Lee, M. Norouzi, Why do better loss functions lead to less transferable features?, in: M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 2021, pp. 28648-28662. URL: https://proceedings.neurips.cc/paper/2021/hash/f0bf4a2da952528910047c31b6c2e951-Abstract.html.
[12] T. Goerttler, K. Obermayer, Similarity of pre-trained and fine-tuned representations, arXiv preprint arXiv:2207.09225 (2022).
[13] J. O. Ramsay, Some statistical considerations in multidimensional scaling, Psychometrika 34 (1969) 167-182.
[14] Y. LeCun, C. Cortes, MNIST handwritten digit database (2010). URL: http://yann.lecun.com/exdb/mnist/.
[15] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, CoRR abs/1708.07747 (2017). URL: http://arxiv.org/abs/1708.07747.
[16] A. Krizhevsky, Learning multiple layers of features from tiny images, Technical Report, 2009.