<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Intrinsic Analysis of Learned Representations in Encoder-Decoder Architectures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shashi Durbha</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Goerttler</string-name>
          <email>thomas.goerttler@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduardo Vellasques</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Hendrik Stockemer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus Obermayer</string-name>
          <email>klaus.obermayer@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bernstein Center for Computational Neuroscience Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LWDA'22: Lernen</institution>
          ,
          <addr-line>Wissen, Daten, Analysen</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Neural Networks, Encoder-Decoder Architectures, Representational Similarity Analysis (RSA)</institution>
          ,
          <addr-line>Represen-</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Technische Universität Berlin, Chair of Neural Information Processing</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Encoder-decoder architectures are in widespread use, both in research and in industry. Recently, similarity analysis applied to the representations of neural networks has contributed to a better understanding of these architectures. Previous work has found that for two instances (under different initialization) of the same layer of a given model, the learned representations become more and more dissimilar the farther away that layer is from the input layer. Since the encoder-decoder is often mirrored (causing a representational bottleneck), we investigate how the representation changes and whether the objective of reconstructing the input influences this. Using representational similarity analysis, we find that corresponding layers from the encoder and decoder are not very similar to each other and are more similar to their neighboring layers. In addition, our experiments show that, unlike for classification tasks, the representations of the same decoder with different initializations become more and more similar the closer a layer is to the output layer. Our analysis includes comparing the average distances between the layers to the average distance of the current layer clusters, the impact of varying the latent dimension, as well as the effect of having multiple bottleneck layers on the representational consistency.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        One of the first successful cases of training deep architectures involved successively training
(and consequently stacking) denoising autoencoders [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, until very recently, much has
been hypothesized about how learning occurs as information propagates through the different
layers. Nevertheless, the research community has shed some light on this subject in the
last few years. For example, Ansuini et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] applied intrinsic dimension analysis to examine
the geometrical properties of the representations learned by a deep neural network.
      </p>
      <p>
        Popal et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] use Representational Similarity Analysis (RSA) to analyze representation learning in CNNs. In RSA, a Representational Dissimilarity Matrix (RDM) is employed in order to characterize a system’s inner stimulus representations in terms of pairwise response differences [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In a technique called Representational Consistency (RC), Mehrer et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] apply techniques until then only used in the field of neuroscience to compare different instances of the same neural network under random initialization. More specifically, they employ Pearson correlation analysis to quantify the variance between the RDMs of the different instances. One important takeaway from their research is that, by relying on multiple instances of the same architecture, it is possible to derive insights about representation learning in a statistically quantifiable way.
      </p>
      <p>https://www.thomas.goerttler.de/ (T. Goerttler); https://www.ni.tu-berlin.de/ (K. Obermayer)</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        In this paper, we apply RSA to the representations of (convolutional) autoencoders and
observe their consistency. We observe that in contrast to the supervised classifiers examined in
Mehrer et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where consistency decreases with each layer stacked, for the autoencoder, the
consistency reaches a minimum in the bottleneck layer and increases again in the decoder part.
      </p>
      <p>We also investigate the influence of the size of the latent layer: the larger it is, the more consistent the representations stay. Analyzing a single model, we see that the layers of a network are more similar to their neighboring layers than to their corresponding mirror layers.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>In this section, we describe the encoder-decoder architectures and introduce representational
similarity techniques.</p>
      <sec id="sec-2-1">
        <title>2.1. Encoder-Decoder Architectures</title>
        <p>Encoder-decoder architectures are networks in which an encoder takes a variable-length sequence as input and transforms it into a state with a fixed shape. The decoder then maps the encoded state of fixed shape to a variable-length sequence. The final encoder layer from which the decoder maps the encoded state is known as the latent dimension layer, which has the fewest neurons among all the layers of the network.</p>
        <p>Encoder-decoder architectures have widespread use in Natural Language Processing (NLP) and image denoising. In NLP, the common approach is to use sequential models such as Recurrent Neural Networks (RNNs) as the encoder/decoder.</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Autoencoders</title>
          <p>
            Baldi [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] describes autoencoders as ‘simple learning circuits which aim to transform inputs into outputs with the least possible amount of distortion’. Goodfellow et al. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] describe autoencoders as a special type of neural network belonging to the class of unsupervised networks. Such a model is trained using a reconstruction loss. It is common practice to add noise to the input, which leads to a learned representation that is robust to noise. Overall, the network may be viewed as consisting of two parts: (i) an encoder, h = f(x), and (ii) a decoder, r = g(h). The representational bottleneck forces the model to discard redundant information, so it can learn the useful properties of the data.
          </p>
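          <p>The encoder-decoder composition above can be illustrated with a toy NumPy sketch: a minimal linear autoencoder with an 8-unit bottleneck, trained with a reconstruction (mean-squared-error) loss. This is our own illustrative example, not the architecture used in the experiments; all names, sizes, and data are ours.</p>
          <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples of dimension 32 lying near an 8-dimensional subspace.
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 32))

d_in, d_latent = X.shape[1], 8                  # bottleneck: 32 -> 8 -> 32
W_enc = rng.normal(scale=0.1, size=(d_in, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_in))

def forward(X):
    h = X @ W_enc                               # encoder: h = f(x)
    r = h @ W_dec                               # decoder: r = g(h)
    return h, r

_, r = forward(X)
loss_start = np.mean((r - X) ** 2)              # reconstruction loss (MSE)

lr = 1e-3
for _ in range(500):
    h, r = forward(X)
    grad_r = 2.0 * (r - X) / X.shape[0]         # (scaled) gradient w.r.t. r
    W_dec -= lr * h.T @ grad_r                  # update decoder weights
    W_enc -= lr * X.T @ (grad_r @ W_dec.T)      # backpropagate into encoder

_, r = forward(X)
loss_end = np.mean((r - X) ** 2)
print(round(loss_start, 3), round(loss_end, 3))
```
          </preformat>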
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Representational Similarity Analysis (RSA) and Representational Consistency (RC)</title>
        <p>
          Representational Similarity Analysis (RSA) was first proposed in the field of neuroscience [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], as a way to compare computational models to brain-activity data. Recently, it has also become a tool to analyze representations of deep convolutional networks. Mehrer et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] first proposed using RSA to analyze the behavior of Convolutional Neural Networks (CNNs) under different initializations. More recently, Goerttler and Obermayer [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] employed the same technique to analyze model-agnostic meta-learning. The main building block of RSA is the Representational Dissimilarity Matrix (RDM), a symmetric matrix containing pairwise dissimilarity measurements between representation vectors (given a set of test stimuli). More formally, for any given pair of representation vectors h_i and h_j, the RDM is defined as:
        </p>
        <p>
          D_{i,j} = d(h_i, h_j),
where d(h_i, h_j) is a dissimilarity function. Kriegeskorte et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] propose using 1 − correlation as the dissimilarity function. Next to RSA, there also exist similar approaches, e.g., CKA [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] or CCA [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which have also been used to analyze the similarity of deep networks [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ].
        </p>
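        <p>With 1 − correlation as the dissimilarity function, the RDM of a layer can be computed directly from its stimulus responses; the following NumPy sketch (function and variable names are ours) illustrates this on toy data.</p>
        <preformat>
```python
import numpy as np

def rdm(responses):
    """Representational Dissimilarity Matrix with d = 1 - correlation.

    responses: (n_stimuli, n_units) array holding one layer's
    representation vector h_i for each test stimulus i.
    Returns the symmetric matrix D with D[i, j] = 1 - corr(h_i, h_j).
    """
    return 1.0 - np.corrcoef(responses)

# Three stimuli, five units each.
H = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
              [1.0, 2.0, 3.0, 4.0, 5.0],     # identical to stimulus 0
              [5.0, 4.0, 3.0, 2.0, 1.0]])    # reversed, anti-correlated
D = rdm(H)
print(np.round(D, 3))   # D[0, 1] = 0, D[0, 2] = 2
```
        </preformat>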
        <p>Given an RDM, it is possible to analyze how well the distribution of representational distances generalizes across network instances. This can be done using Representational Consistency (RC), which is defined as the shared variance between the upper triangles of the RDMs of the different instances. More specifically, given a set of instances T = {t_1, ..., t_N} and a layer l, the RC for layer l is defined as:</p>
        <p>RC(l) = 2 / (‖T‖ (‖T‖ − 1)) · ∑_{s&lt;t} ρ(RDM_s(l), RDM_t(l)), (1)</p>
        <p>where ‖T‖ is the number of instances in T and ρ is the Pearson correlation coefficient, computed between the upper triangles of the two RDMs. The RC for a given model is defined as the average RC over all the layers in that model.</p>
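        <p>The RC definition above amounts to averaging the Pearson correlation between the RDM upper triangles over all pairs of instances; a minimal NumPy sketch of this computation (function names are ours) follows.</p>
        <preformat>
```python
import numpy as np
from itertools import combinations

def upper_triangle(D):
    """Flatten the strict upper triangle of an RDM."""
    return D[np.triu_indices_from(D, k=1)]

def representational_consistency(rdms):
    """Average pairwise Pearson correlation between the RDMs of one
    layer across differently initialized instances."""
    corrs = [np.corrcoef(upper_triangle(a), upper_triangle(b))[0, 1]
             for a, b in combinations(rdms, 2)]
    return float(np.mean(corrs))

rng = np.random.default_rng(1)
A = rng.random((10, 10))
D = (A + A.T) / 2                 # a symmetric toy RDM
np.fill_diagonal(D, 0.0)

# Identical instances give a perfectly consistent layer.
rc = representational_consistency([D, D.copy(), D.copy()])
print(rc)
```
        </preformat>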
        <p>The intuition behind RC is that the closer the value is to 1, the higher the consistency for that particular layer. For example, the RC of the input layer is always 1 because the inputs are the same for all the different instances of the network. As we progress deeper into the network layers, we see a continuous drop in consistency until we reach the bottleneck layer.</p>
        <p>
          Multidimensional scaling (MDS) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] gives a concise 2D projection of representation vectors.
Since an RDM contains distances between instances of representation vectors, it can be used as
a way to project these instances in a 2D space, such that similar instances are projected closer
than dissimilar instances.
        </p>
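        <p>Such a 2D projection can be obtained from an RDM with classical MDS via double centering; the following from-scratch NumPy sketch (our own illustration, not the implementation used for the plots) recovers a configuration of points from their distance matrix.</p>
        <preformat>
```python
import numpy as np

def classical_mds(D, dims=2):
    """Classical multidimensional scaling of a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:dims]       # largest eigenvalues first
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

# Four points on the corners of a unit square, given only as distances.
s = np.sqrt(2.0)
D = np.array([[0, 1, s, 1],
              [1, 0, 1, s],
              [s, 1, 0, 1],
              [1, s, 1, 0]])
Y = classical_mds(D)
recon = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
print(np.allclose(recon, D))   # the 2D embedding reproduces all distances
```
        </preformat>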
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental setup</title>
      <p>The goal of our experiments is to get a better understanding of the representational behavior in
autoencoder architectures. To that end we analyze representation similarity and consistency
under slight variations of the experimental setup described below:</p>
      <sec id="sec-3-1">
        <title>3.1. Training</title>
        <p>Except for the experiment analysing the change in consistency as the number of epochs increases logarithmically up to 256, all models were trained using early stopping, monitoring the validation loss with a patience of 10.</p>
        <p>Additionally, for the experiments on representational consistency: in a Keras-based autoencoder architecture, the default kernel initialization for a dense/convolutional layer is generally Glorot uniform. To change this default initialization, the experiments in this paper use Glorot normal instead and observe the resulting difference.</p>
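        <p>The early-stopping rule above (monitor the validation loss, stop after 10 epochs without improvement) can be sketched as the following framework-independent loop; the hard-coded loss sequence is our own stand-in for a real training run.</p>
        <preformat>
```python
# Schematic early stopping on the validation loss with a patience of 10.
# The fake val_losses sequence stands in for a real training loop.
val_losses = [1.0, 0.8, 0.7, 0.65] + [0.66] * 20   # improves, then stalls

patience = 10
best = float("inf")
wait = 0
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if best > loss:            # validation loss improved: reset the counter
        best, wait = loss, 0
    else:                      # no improvement this epoch
        wait += 1
        if wait >= patience:   # patience exhausted: stop training
            stopped_at = epoch
            break
print(stopped_at)   # 13: ten non-improving epochs after the best epoch (3)
```
        </preformat>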
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Architectures</title>
        <p>In our experiments, we use two different architectures: one with only fully connected layers and one that also includes convolutional layers. Table 1a shows the base architecture of the fully connected networks, whereas Table 1b depicts the convolutional architecture. In several experiments, we extend the networks as indicated.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Datasets</title>
        <p>Below is a short description of the datasets used in the experiments conducted in the paper.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. MNIST (Modified National Institute of Standards and Technology)</title>
          <p>
            The MNIST database [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] consists of 70,000 hand-written grayscale digits of size (28 × 28), divided into training and testing sets with a ratio of 6:1, along with their labels. It contains ten different classes (digits 0-9) with an equal number of images in each class. It is a very common and widely used database in the field of ML and is easily accessible through the Keras dataset library. For these reasons, it is one of the default benchmark datasets in the field of ML. When working with the regular autoencoder, the pixel values were binarized using a threshold of 255 and flattened to a vector of size 784. The main motivation behind using this dataset was to compare the results between the two architectures, the regular and the convolutional autoencoder. Figure 1a shows a sample image from the dataset.
          </p>
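          <p>The preprocessing described above (binarize, then flatten each 28 × 28 image to a 784-vector) can be sketched as follows; the toy array and the ‘preprocess’ helper are ours, and reading ‘threshold of 255’ as ‘values of 255 and above map to 1’ is our assumption.</p>
          <preformat>
```python
import numpy as np

def preprocess(images, threshold=255):
    """Binarize grayscale images at `threshold` and flatten each one."""
    binary = (images >= threshold).astype(np.float32)
    return binary.reshape(len(images), -1)    # (n, 28 * 28) = (n, 784)

# Toy stand-in for MNIST: two 28 x 28 uint8 images.
imgs = np.zeros((2, 28, 28), dtype=np.uint8)
imgs[0, 10:18, 10:18] = 255                   # an 8 x 8 bright square
X = preprocess(imgs)
print(X.shape, int(X[0].sum()))   # (2, 784) 64
```
          </preformat>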
          <p>
            Figure 1: (a) Sample image from the MNIST dataset [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]; (b) sample CIFAR10 images [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. CIFAR10 (Canadian Institute For Advanced Research)</title>
        <p>
          Similar to the MNIST dataset, CIFAR10 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is a database of 60,000 images in ten different classes, with 6,000 images per class. It is divided into training and testing sets with a ratio of 5:1, along with its labels. Unlike MNIST, the image size for CIFAR10 is (32 × 32) with three RGB channels. Figure 1b shows sample images from this dataset.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we present our results.</p>
      <sec id="sec-4-1">
        <title>4.1. Representational similarity in autoencoders</title>
        <p>In our first experiment (Figure 2a), we show all layers of ten differently initialized instances embedded into two dimensions according to their respective similarities. The input and the output layers are comparatively close to each other, while all the other layers are quite far apart. We can conclude that the points of the same layer get further apart from each other as we move deeper into the layers of the network, and they start getting closer to each other again as we reach the end. The input representations are obviously all the same. The next layer (Encoder1, E1 in the plot in Figure 2a) has all its points very close to each other, except perhaps one point, which is comparatively far away from all the other points of the same color. This becomes more frequent as we keep going deeper into the layers.</p>
        <p>Figure 2: (a) Sample MNIST dataset MDS plots for 10 models; (b) sample comparison between cluster average versus next-layer distance.</p>
        <p>In Figure 2b, we observe that, except for the first and last layers, the distance to the successive (next) layer is smaller than the overall cluster average.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Representational consistency in autoencoders</title>
        <p>Figure 3: (a) Representational consistency for logarithmically increasing numbers of epochs for a fully-connected autoencoder with 8 neurons in the bottleneck layers; (b) comparison between multiple latent dimensions for an autoencoder architecture on the MNIST dataset.</p>
        <p>
          Similar to Mehrer et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we also investigate the representational consistency described in Section 2.2. In Figure 3a, we see that the similarity of the representations increases after the bottleneck layer. This differs from the supervised case of Mehrer et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], where the similarity diverges coming closer to the output layer. This is very interesting, as it means that an autoencoder can reconstruct the representation despite having different encodings. In addition, the plot in Figure 3a shows the comparison of consistencies when working with a varying number of epochs (from 0 to 256, with the epochs increasing logarithmically). Epoch 0, shown in the plot, is when the consistency is calculated without any prior training. In the other, trained cases for this network, we see that as we keep increasing the number of epochs, the consistencies get higher and gradually appear to converge. Two key observations from these experiments were that untrained networks behave as expected, i.e., the consistency decreases as we progress through the layers, and that the convergence with regard to consistency is very fast: we see the typical autoencoder behavior after a single epoch.
        </p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Impact of varying Latent dimension on representational consistency</title>
          <p>In this experiment, we observe the representational consistency when changing the dimension of the latent vector while keeping everything else the same.</p>
          <p>Based on the results of the experiments shown in Figure 3b, we can see a clear correlation
between the representational consistency and the number of neurons we have in the latent
dimension layers. We notice that as the number of neurons increases, the representational
consistency of the bottleneck layer also increases.</p>
          <p>We can interpret this as follows: as the number of neurons in the bottleneck layer increases from 10 to 40 (though these values are still comparatively small), the information held in the previous layers has to be compressed less with 40 neurons than with ten. Some loss is always expected in the transfer of information from the previous encoder layer to the bottleneck layer, but it is smaller with 40 neurons than with ten, which correlates with a more consistent reproduction of the input image at the output and hence a better consistency in the final output layer as well.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Representational consistency on multiple bottleneck layers</title>
          <p>When training architectures with multiple bottleneck layers (3, 5, and 7), the following experiments were conducted, each with ten differently initialized instances: (i) fully-connected autoencoder on the MNIST dataset with 8 and 16 neurons in the bottleneck layers, (ii) convolutional autoencoder on the MNIST dataset with 8 and 16 neurons in the bottleneck layers, and (iii) convolutional autoencoder on the CIFAR10 dataset with 12 and 25 neurons in the bottleneck layers. The results of these experiments can be seen in the plots in Figure 4.</p>
          <p>The above plots show the comparisons when working with a varying number of bottleneck layers on the MNIST dataset for a fully-connected autoencoder as well as a convolutional autoencoder, with 16 neurons in the bottleneck layer. A common latent dimension and dataset are chosen so that the networks are more comparable. We see, in both cases, that there is not much separating the consistencies of the networks with five and seven bottleneck layers. However, the network with three bottleneck layers has a much higher consistency at the bottleneck level as well as overall in both cases. In the case of the networks with five and seven bottleneck layers, we see the consistencies flattening out in the region shaded blue in the plots. And though this cannot yet be confirmed with certainty, this trend has been more or less consistent regardless of the dataset and the network. An interesting direction for future work would be to replicate these results on much larger and more complicated datasets.</p>
          <p>Figure 4: (a) Fully-connected autoencoder with 16 neurons in the bottleneck layers; (b) convolutional autoencoder with 16 neurons in the bottleneck layers.</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Dependency of consistency on the number of trained model instances</title>
          <p>To check the dependence of the consistency on the number of trained model instances with different random initial weights, two experiments were performed. For the fully-connected autoencoder trained on MNIST as well as the convolutional autoencoder trained on CIFAR-10, we compared the consistency values in the different layers for different numbers of model instances. The results are shown in Figure 5a for the fully connected autoencoder and Figure 5b for the convolutional autoencoder. The fully connected autoencoder, tested for 10, 20, 30, and 40 model instances, shows convergence at 20 instances. Due to technical limitations, the convolutional autoencoder was only tested for 5, 10, 15, and 20 model instances. While the consistency has not converged at these instance numbers, we see the same pattern as for the fully connected autoencoder and assume that using more than 20 model instances would not result in relevant changes.</p>
        </sec>
        <sec id="sec-4-2-4">
          <title>4.2.4. Effect of increased layer numbers on the representational consistency</title>
          <p>Figure 6 shows all the representational consistencies in a single plot for the MNIST dataset for a convolutional autoencoder. We can see that, regardless of which layer is multiplied in the other networks, the overall highest consistency was shown by the network in which none of the layers is multiplied.</p>
          <p>The basic motivation behind this experiment was to see how adding multiple layers of a certain type, before and after the bottleneck layer, impacted the representational consistencies across multiple instances. A small note here: when working with a convolutional autoencoder, the bottleneck layer was a dense layer, but the other multiplied layers (C2 and C5) were all convolutional layers.</p>
          <p>Figure 5: (a) Fully connected autoencoder on MNIST; (b) convolutional autoencoder on CIFAR-10.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Comparison of encoder and decoder</title>
        <p>So far, we have only looked at the representational consistency of ten different models with different random initializations. In this section, we want to look at individual models to understand single models better and observe whether mirrored layers share a structure. Figure 7 shows the MDS plots for two samples that were obtained using the CIFAR10 dataset on a convolutional autoencoder with 12 neurons in its bottleneck layer.</p>
        <p>A slight distance is seen between the input and output. However, it is important to understand that the motive behind these experiments was not to find which parameters or architectures give the most accurate results. Hence, we chose to work with bottleneck sizes (8 and 16 neurons for the MNIST dataset and 12 and 25 neurons for the CIFAR10 dataset) that did not give pixelated outputs while, at the same time, not overcompensating for what is being studied in these experiments. Although the individual instance plots look similar in nature, we see some differences in the directions of the layers, which can be attributed to the different random initializations of these networks.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Analysing the impact on representational consistency when varying the latent dimension</title>
          <p>It is important to note that when the input and output layers are included in an instance plot, these two layers will always be close to or far away from each other depending on the number of neurons chosen for the latent dimension. This, however, can be misleading at times because of what the autoencoder attempts to do. For this reason, the plots of Figure 8 leave out the input and output layers, allowing a simpler comparison between all the other layers of the network. As shown in Figure 7, we do not see much difference between different instances of the same latent dimension. For the current analysis in Figure 8, we see four different instance plots (all four taken from the third instance of each experiment for a consistent comparison), where the instances have been plotted without the input and output layers for linearly increasing latent dimensions (i.e., 10, 20, 30, and 40 neurons). Even in such a scenario, we see consistency among the models. As previously discussed, we would normally expect a U-shape between the mirror layers of the model, but we see a sort of a  -shape forming between the mirror layers, with the shortest distance between the final encoder layer and the first decoder layer, and the distance increasing all the way to the first encoder layer and the final decoder layer. This distance appears to be consistent across all the instances, even with an increasing latent dimension.</p>
          <p>Figure 8: Latent dimension with (a) 10, (b) 20, (c) 30, and (d) 40 neurons.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we demonstrated that for an encoder-decoder neural network, as information is propagated through the encoder layers, the representational consistency of the layers decreases, and the opposite happens for the decoder layers. This kind of representational bottleneck forces the encoder to “prioritize” features, discarding features not relevant for reconstruction. We also demonstrated that for any given instance, the representations tend to diverge (from the input) as information is propagated through the encoder layers and then converge (towards the input) as information is propagated through the decoder.</p>
      <p>We also observed that even with multiple bottleneck layers, the consistency does not stay flat across them: there was a dip even among the bottleneck layers, located at the middle bottleneck layer in these instances.</p>
      <p>In practical terms, these findings validate the common intuition behind encoder-decoder neural networks (that the bottleneck provides some sort of “lossy compression”). For practitioners, this type of analysis can be used as a helpful tool when debugging novel architectures.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-A.</given-names>
            <surname>Manzagol</surname>
          </string-name>
          ,
          <article-title>Extracting and composing robust features with denoising autoencoders</article-title>
          ,
          <source>in: Proceedings of the 25th international conference on Machine learning</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1096</fpage>
          -
          <lpage>1103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ansuini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Macke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zoccolan</surname>
          </string-name>
          ,
          <article-title>Intrinsic dimension of data representations in deep neural networks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Popal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. R.</given-names>
            <surname>Olson</surname>
          </string-name>
          ,
          <article-title>A guide to representational similarity analysis for social neuroscience</article-title>
          ,
          <source>Social Cognitive and Affective Neuroscience</source>
          <volume>14</volume>
          (
          <year>2019</year>
          )
          <fpage>1243</fpage>
          -
          <lpage>1253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kriegeskorte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Bandettini</surname>
          </string-name>
          ,
          <article-title>Representational similarity analysis-connecting the branches of systems neuroscience</article-title>
          ,
          <source>Frontiers in Systems Neuroscience</source>
          <volume>2</volume>
          (
          <year>2008</year>
          )
          <fpage>4</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mehrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Spoerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kriegeskorte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Kietzmann</surname>
          </string-name>
          ,
          <article-title>Individual differences among deep neural network models</article-title>
          ,
          <source>Nature Communications</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Baldi</surname>
          </string-name>
          ,
          <article-title>Autoencoders, unsupervised learning, and deep architectures</article-title>
          ,
          <source>in: Proceedings of ICML workshop on unsupervised and transfer learning</source>
          ,
          <source>JMLR Workshop and Conference Proceedings</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <article-title>Deep learning</article-title>
          , volume
          <volume>1</volume>
          , MIT Press, Cambridge,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Goerttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Obermayer</surname>
          </string-name>
          ,
          <article-title>Exploring the similarity of representations in model-agnostic meta-learning</article-title>
          ,
          <source>arXiv preprint arXiv:2105.05757</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Similarity of neural network representations revisited</article-title>
          , in: K. Chaudhuri, R. Salakhutdinov (Eds.),
          <source>Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA</source>
          , volume
          <volume>97</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2019</year>
          , pp.
          <fpage>3519</fpage>
          -
          <lpage>3529</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Morcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raghu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Insights on representational similarity in neural networks with canonical correlation</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <article-title>Why do better loss functions lead to less transferable features?</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beygelzimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>28648</fpage>
          -
          <lpage>28662</lpage>
          . URL: https://proceedings.neurips.cc/paper/2021/hash/f0bf4a2da952528910047c31b6c2e951-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Goerttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Obermayer</surname>
          </string-name>
          ,
          <article-title>Similarity of pre-trained and fine-tuned representations</article-title>
          ,
          <source>arXiv preprint arXiv:2207.09225</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Ramsay</surname>
          </string-name>
          ,
          <article-title>Some statistical considerations in multidimensional scaling</article-title>
          ,
          <source>Psychometrika</source>
          <volume>34</volume>
          (
          <year>1969</year>
          )
          <fpage>167</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <article-title>MNIST handwritten digit database</article-title>
          (
          <year>2010</year>
          ). URL: http://yann.lecun.com/exdb/mnist/.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rasul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms</article-title>
          ,
          <source>CoRR abs/1708.07747</source>
          (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1708.07747. arXiv:1708.07747.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <article-title>Learning multiple layers of features from tiny images</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>