Applications of Koopman Mode Analysis to Neural Networks

Ryan Mohr,1 Maria Fonoberova,1 Iva Manojlović,1 Aleksandr Andrejčuk,1 Zlatko Drmač,2 Yannis Kevrekidis,3 Igor Mezić,1,4
1 AIMdyn, Inc., Santa Barbara, CA 93101
2 University of Zagreb, 10000 Zagreb, Croatia
3 Johns Hopkins University, Baltimore, MD 21218
4 University of California, Santa Barbara, CA 93106
mohrr@aimdyn.com, mfonoberova@aimdyn.com, imanojlovic@aimdyn.com, aleksandr@aimdyn.com, drmac@math.hr, yannisk@jhu.edu, mezici@aimdyn.com

Abstract

We consider the training process of a neural network as a dynamical system acting on the high-dimensional weight space. Each epoch is an application of the map induced by the optimization algorithm and the loss function. Using this induced map, we can apply observables on the weight space and measure their evolution. The evolution of the observables is given by the Koopman operator associated with the induced dynamical system. We use the spectrum and modes of the Koopman operator to analyze the convergence versus non-convergence of the training process. Our methods can help to determine whether (1) a bad initialization of the network weights was used (in particular, how the existence of eigenvalues clustering around 1 determines when to terminate the learning process), allowing a restart before training for too long, and (2) the training time can be sped up. Additionally, we show that incorporating structured loss functions based on negative Sobolev norms can allow for the reconstruction of a multi-scale signal polluted by very large amounts of noise. Using these Sobolev-based loss functions improves robustness and interpretability.

Introduction

The training of neural networks is a topic of much interest due to their widespread use in a variety of application domains, from image recognition and classification to solving ordinary and partial differential equations. Unfortunately, the dimensionality of the problem often prevents rigorous analysis. Viewing neural network training as a dynamical system (Chang et al. 2017; Dietrich, Thiem, and Kevrekidis 2020; Chang et al. 2019) provides a framework for a mathematical approach to training.

Our objective is to introduce the wider community to various results in applying methods from dynamical systems, in particular the operator-theoretic approach, to extract insight both into the choice of the neural network's architecture, such as its depth, for a given problem and into the network's training process.

We consider the training process as a dynamical system acting on the high-dimensional weight space. Each epoch is an application of the map induced by the optimization algorithm and the loss function. Using this induced map, we can apply observables on the weight space and measure their evolution. The evolution of the observables is given by a linear operator, the Koopman operator, associated with the induced dynamical system. We use the spectrum and modes of the Koopman operator to obtain insight into the training process. This viewpoint can help to determine if we have a bad initialization of the network weights, allowing a restart before training for too long. Additionally, we incorporate structured loss functions based on negative Sobolev norms into our network. Using these new types of loss functions can (1) improve the interpretability of the networks and (2) allow for significant noise rejection when training on noisy multi-scale signals.

The rest of the paper is structured as follows. In the next section, we mathematically formulate the training of a neural network as a dynamical system and introduce the Koopman operator viewpoint of analyzing dynamical systems. The following section investigates the convergence of the training process through the lens of the Koopman operator's spectrum. In the final section before the conclusions, we apply loss functions inspired by negative Sobolev norms and show how they can be used for significant noise rejection when trying to reconstruct a signal.
Neural network training as a dynamical system

Let n(x; w), n : X × R^n → R^d, be a neural network, where x ∈ X ⊂ R^m is the input feature vector, w ∈ R^n is the vector of network weights (parameters), and the output of the network is a d-dimensional real vector. Let L_tr(w) be the loss function of the network on the training set as a function of the network parameter weights. The optimization problem to be solved is

w^* = \arg\min_w L_{tr}(w).   (1)

The loss function L_tr and the chosen optimization algorithm (e.g., stochastic gradient descent) induce a nonlinear, discrete-time map on the network weights that updates the weights at each epoch:

T : \mathbb{R}^n \to \mathbb{R}^n, \quad w_{t+1} = T(w_t),   (2)

where w_t are the network weights at the beginning of epoch t ∈ N_0; w_0 represents the initialized network weights prior to training. This induced map T is a discrete dynamical system having the weight space as its state space.

Trying to directly analyze the map T can be quite difficult, as it is implemented as a black box in whatever training framework one is using. Instead, one can track the evolution of observables to gain insight into the dynamical system by studying the spectral properties of an induced linear operator that drives the evolution of the observables. Let F be a function space that is closed under composition with T; that is, if f ∈ F, then f ∘ T ∈ F. An operator U : F → F can be defined via this composition, U f = f ∘ T. This operator, called the Koopman or composition operator, is linear (Mezić 2005), albeit infinite-dimensional, even if T is nonlinear. Even though it is linear, it can still capture the full nonlinear behavior of T. In many cases, the Koopman operator has a spectral decomposition (Mezić 2005; Budišić, Mohr, and Mezić 2012; Mohr and Mezić 2014; Mezić 2019)

U f = \sum_{j=1}^{\infty} c_j \lambda_j \phi_j + \int_{\mathbb{C}} z \, dE(z) f,   (3)

where c_j ∈ C is a coefficient, λ_j ∈ C is an eigenvalue, φ_j ∈ F is an eigenfunction of U, and dE(z) is a projection-valued measure (PVM). The set of eigenvalues forms the point spectrum. The PVM is associated with the continuous spectrum of U; it takes sets in the complex plane and associates projection operators on F with them. We can also consider the case of vector-valued observables f = (f_1, . . . , f_m), where f_i ∈ F. In this case, there is a decomposition analogous to (3), whose only difference is that the scalar-valued coefficients c_j become vector-valued coefficients m_j ∈ C^m. These vector-valued coefficients are called Koopman modes, originally called shape modes (Mezić 2005). Data-driven algorithms, such as the family of Dynamic Mode Decomposition algorithms, e.g. (Schmid and Sesterhenn 2008; Rowley et al. 2009; Jovanović, Schmid, and Nichols 2012, 2014; Hemati, Williams, and Rowley 2014; Williams, Kevrekidis, and Rowley 2015; Drmač, Mezić, and Mohr 2018; Drmač, Mezić, and Mohr 2019), are used to approximate the modes and eigenvalues using a trajectory from a single initial condition. Thus we do not need explicit access to U to analyze the dynamical system. We use the DMD_RRR algorithm from (Drmač, Mezić, and Mohr 2018) in the following work.
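The paper uses the DMD_RRR algorithm of (Drmač, Mezić, and Mohr 2018). As a rough illustration of the general idea (approximating Koopman eigenvalues and modes from a single trajectory of snapshots), the minimal sketch below applies plain exact DMD to a matrix whose columns are the recorded observables, e.g. the flattened weight vectors. The function name, the rank parameter r, and the use of NumPy's SVD are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dmd_eigs_and_modes(snapshots, r=10):
    """Plain (exact) DMD sketch: approximate Koopman eigenvalues and modes
    from a single trajectory of observables.

    snapshots: (n_features, n_epochs) array whose t-th column is the
               observable at epoch t (e.g. the flattened weight vector w_t).
    r:         rank of the truncated SVD used to build the reduced operator.
    """
    X, Y = snapshots[:, :-1], snapshots[:, 1:]      # snapshot pairs (x_t, x_{t+1})
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    r = min(r, int((s > 1e-10 * s[0]).sum()))       # guard against rank deficiency
    U_r, S_inv, V_r = U[:, :r], np.diag(1.0 / s[:r]), Vh[:r, :].conj().T
    A_tilde = U_r.conj().T @ Y @ V_r @ S_inv        # reduced linear map
    eigvals, W = np.linalg.eig(A_tilde)             # approximate Koopman eigenvalues
    modes = Y @ V_r @ S_inv @ W                     # exact DMD modes
    return eigvals, modes

# Usage sketch: stack the flattened network weights recorded after each epoch.
# weights_per_epoch = np.stack([w.ravel() for w in weight_history], axis=1)
# lam, modes = dmd_eigs_and_modes(weights_per_epoch, r=20)
# print(np.abs(lam))   # magnitudes near 1 indicate slowly decaying modes
```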
Insights into neural network training convergence via the Koopman spectrum

The main idea here is that monitoring the spectrum of the Koopman operator during the training process gives us a method to determine when the training process should be terminated so that good performance is obtained on the test set without the network memorizing the training set. Having this indicator allows the network to generalize better.

Given the trajectory of all network weights {w_t}, where t ∈ N is the training epoch, we use two different observables: a delay-embedded version of the (cross-entropy) loss function, L_tr(w_t), and the full-state observable, which returns the network weights w_t at each training epoch t. Inspecting the Koopman spectrum of either of these observables tells us how quickly the system is learning and when a fixed point is appearing. The Koopman mode decomposition (3) can be written as

L_{tr}(w_t) = \sum_k m_k \lambda_k^t \phi_k + e_t,   (4)

where m_k is the Koopman mode normalized to norm 1, λ_k is the associated eigenvalue, φ_k is the reconstruction coefficient, and e_t is the error term at epoch t. If all the eigenvalues not equal to 1 satisfy |λ_k| < 1, then the training is stable (Mauroy and Mezić 2016). In this case, the slowest eigenvalue determines the rate of convergence. The importance of a mode is determined by the absolute value of the reconstruction coefficient, with higher values giving the mode more importance. The further inside the unit circle the eigenvalues corresponding to the important modes are, the faster the training.

The standard MNIST data set was used for testing the methods. The network tested was a convolutional network with two convolutional layers with 16 and 32 kernels, each followed by 2 × 2 max-pooling. Each kernel is 5 × 5. The convolutional layers are followed by a fully-connected layer with 100 neurons, which is then followed by another fully-connected layer with 10 neurons, after which follows softmax classification. The architecture is shown in Figure 1. A cross-entropy loss function was used with a learning rate of η = 1e-3. Different weight initializations (He or Xavier) were tested as well. After each epoch, the cross-entropy loss on the train and test sets was recorded along with the network weights. All networks were trained for 1000 epochs, and KMD analysis was applied to the snapshots for 3 different epoch ranges specified in the figures.

Figure 1: Neural network architecture.
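For concreteness, a sketch of the architecture described above is given below in PyTorch. The class name, the ReLU activations, the absence of padding in the convolutions, and the optimizer in the commented usage are assumptions made for illustration; the paper specifies only the layer sizes, the cross-entropy loss, and the learning rate of 1e-3.

```python
import torch
import torch.nn as nn

class SmallMnistCnn(nn.Module):
    """Sketch of the network described in the text: two 5x5 convolutional layers
    (16 and 32 kernels), each followed by 2x2 max-pooling, then fully-connected
    layers of 100 and 10 neurons with softmax classification."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5),   # 28x28 -> 24x24 (assumes no padding)
            nn.ReLU(),                         # activation choice is an assumption
            nn.MaxPool2d(2),                   # 24x24 -> 12x12
            nn.Conv2d(16, 32, kernel_size=5),  # 12x12 -> 8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 100),
            nn.ReLU(),
            nn.Linear(100, 10),                # softmax is applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Usage sketch: record the loss and flattened weights after each epoch so they
# can be used later as observables for the KMD analysis.
# model = SmallMnistCnn()
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # optimizer is an assumption
# loss_fn = nn.CrossEntropyLoss()                           # cross-entropy, as in the paper
```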
Figure 2 shows the results for the Xavier initialization scheme. The top left figure shows the cross-entropy loss of the network evaluated on the training set at each epoch. The top right figure shows the cross-entropy loss on the test set. The second row shows the KMD spectrum using the cross-entropy loss function on the training set as the observable. The spectrum is computed using the first 40 (left), 100 (middle), and 500 (right) snapshots of the observable. The third row shows the spectrum computed using the weight vectors as the vector-valued observable. The spectrum is computed using the first 50 (left), 100 (middle), and 500 (right) snapshots of the observable. The spectrum was computed using DMD_RRR. Eigenvalues inside the unit circle correspond to decaying modes, whereas eigenvalues outside the unit circle correspond to growing modes. Eigenvalues very close to the unit circle correspond to modes that change more slowly than modes associated with eigenvalues closer to zero.

Figure 2: KMD analysis of network 1 training with the Xavier initialization scheme. (1st row, left) Cross-entropy loss of the network on the training set. (1st row, right) Cross-entropy loss on the test set. (2nd row) KMD eigenvalues using the cross entropy on the training set as the observable. Left: computed after 40 epochs. Middle: 100 epochs. Right: 500 epochs. (3rd row) KMD eigenvalues using the weight vectors during training as the observable. Left: computed after 50 epochs. Middle: 100 epochs. Right: 500 epochs.

Each of the initialization schemes trains very fast, with a large drop in the training error after a few epochs. The He and Xavier initialization schemes seem to over-memorize the training set, which we see as an increase in the cross-entropy loss on the test set. However, the final errors on the test set are still lower than the final error for the random normal scheme on the test set.

Since, for all initialization schemes, the training process is extremely fast, with a large drop in training set cross entropy, the eigenvalues clustered close to 0 in the second row of the figures (training set cross-entropy observable) make sense. As more snapshots are used to compute the spectrum, some important eigenvalues appear close to 1, which indicates that the training is nearing a fixed point. This trend is clear in the final row of the plots, which used the weight vectors as the observable for the KMD computation. After 50 epochs, the important eigenvalues are clustered close to 0 and, as more snapshots are taken, important eigenvalues appear and cluster around 1.
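A possible way to turn this observation into a stopping heuristic is sketched below: after computing a KMD of the weight trajectory, check whether the most important eigenvalues sit within a small tolerance of the unit circle. The tolerance, the use of mode norms as an importance proxy, and the helper `dmd_eigs_and_modes` from the earlier sketch are assumptions for illustration, not the procedure used in the paper.

```python
import numpy as np

def near_fixed_point(eigvals, modes, coeffs=None, radius_tol=0.02, top_k=5):
    """Heuristic stopping check: are the most important eigenvalues clustered
    near the unit circle (|lambda| within radius_tol of 1)?

    eigvals: approximate Koopman eigenvalues from the weight-trajectory KMD.
    modes:   corresponding modes; coeffs, if given, rank their importance
             (the paper ranks modes by reconstruction coefficient magnitude).
    """
    if coeffs is None:
        # crude importance proxy when reconstruction coefficients are unavailable
        coeffs = np.linalg.norm(modes, axis=0)
    order = np.argsort(np.abs(coeffs))[::-1][:top_k]   # top_k most important modes
    lam = eigvals[order]
    return bool(np.all(np.abs(np.abs(lam) - 1.0) < radius_tol))

# Usage sketch, reusing the earlier DMD helper on the recorded weight snapshots:
# lam, modes = dmd_eigs_and_modes(weights_per_epoch[:, :n_snapshots], r=20)
# if near_fixed_point(lam, modes):
#     print("important eigenvalues cluster near 1 -> training is near a fixed point")
```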
Noise rejection using negative Sobolev norms in the loss function

Here, we investigate the performance of different loss functions compared to the standard L2 loss when learning a multiscale signal. The loss functions that we use are inspired by the functional form of negative-index Sobolev norms. For functions f, h : T^d → R, where T^d is the d-dimensional torus (R/Z)^d, the Sobolev norm of order p = 2 and index s < 0 of their difference can be computed via

\|f - h\|_{H^s}^2 = \sum_{k \in \mathbb{Z}^d} \frac{|\hat{f}(k) - \hat{h}(k)|^2}{(1 + (2\pi \|k\|_2)^2)^{-s}},   (5)

where f̂ and ĥ are the Fourier transforms of f and h, respectively. As the norm ‖k‖₂ of the wave vector increases, the contribution of the term |f̂(k) − ĥ(k)|² to the loss diminishes; discrepancies between f and h at small scales are not as important as discrepancies at larger/coarser scales.

We apply the Sobolev loss functions to fully connected neural networks with L layers (ℓ = 0, . . . , L − 1) in two ways. Let h : R^D → R^M be the function to be learned and s : R → R be a monotonically increasing function that defines a scale. Usually we will use the linear function s(ℓ) = ℓ or the exponential function s(ℓ) = 2^ℓ.

Sobolev loss 1. For each hidden layer ℓ, we define an auxiliary output for that layer, denoted by f_ℓ = C_ℓ z_ℓ, where z_ℓ ∈ R^{N_ℓ} are the activations of layer ℓ and C_ℓ ∈ R^{M×N_ℓ} is a matrix. We add a loss function for each auxiliary output having the form

L_\ell(h, f_\ell) = \sum_{m=1}^{M} \sum_{\|k\|_1 = s(\ell)} \frac{|\hat{h}^{(m)}(k) - \hat{f}_\ell^{(m)}(k)|^2}{(1 + (2\pi \|k\|_2)^2)^{1/2}},   (6)

for ℓ ∈ {1, . . . , L} and where ĥ^{(m)} is the Fourier transform of the m-th component function of h = (h^{(1)}, . . . , h^{(M)}). By applying a loss of a different scale, s(ℓ), at each layer, we are enforcing a more interpretable network. Note that for ℓ = 0, the denominator of (6) is 1 and, by Parseval's identity, the expression is equivalent to the L2 norm.

Sobolev loss 2. Here, instead of applying pieces of (5) at each layer, we apply them only at the output of the final layer, f_{L−1} = C_{L−1} z_{L−1}:

L(h, f_{L-1}) = \sum_{m=1}^{M} \sum_{\ell=0}^{L-1} \sum_{\|k\|_1 = s(\ell)} \frac{|\hat{h}^{(m)}(k) - \hat{f}_{L-1}^{(m)}(k)|^2}{(1 + (2\pi \|k\|_2)^2)^{1/2}}.   (7)

We apply the two methods to the multiscale signal h(x) = x + sin(2πx⁴) (therefore M = 1) on the interval [0, 2]. At each point x, we add noise η(x) to the signal that is distributed according to

\eta \sim \epsilon \left( \max_{x \in [0,2]} h(x) - \min_{x \in [0,2]} h(x) \right) \mathcal{N}(0, 1).   (8)

The noisy signal ĥ = h + η is used as the data set. Figure 3 shows the performance of the different loss functions when training on the noisy signal. The left column contains the result for the pure L2 loss function, the middle column is Sobolev loss 1, and the right column is Sobolev loss 2. Each row, top to bottom, corresponds to noise levels ε = 0.05, 0.5, and 1.0, respectively. The network had 9 hidden layers, each containing 20 neurons, and a single-node output layer. Layers were fully connected. As can be seen, the pure L2 loss function reconstructs the clean signal h poorly at every noise level, basically reconstructing the average of the signal. Of the two Sobolev loss functions, the first performs best at reconstructing the clean signal, even in the presence of high amounts of noise. This is likely due to imposing a specific scale for each layer, rather than forcing the network to disentangle the scales at the final layer. For each of the Sobolev losses, the linear scale function s(ℓ) = ℓ was used.
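As a concrete one-dimensional (M = 1) illustration of the Eq. (5)-style weighting, the sketch below compares a prediction and a target in Fourier space and down-weights high-frequency discrepancies. The FFT discretization over a uniformly sampled interval, the function and parameter names, and the use of PyTorch tensors (so the loss is differentiable) are assumptions; the layer-wise restriction to ‖k‖₁ = s(ℓ) used in Sobolev loss 1 and 2 is not included.

```python
import torch

def sobolev_weighted_loss(pred, target, s=-1.0, dx=1.0):
    """Sketch of an Eq. (5)-style loss for 1-D signals sampled on a uniform grid.

    pred, target: 1-D tensors of samples of f and h on the same grid.
    s:            negative Sobolev index; more negative means stronger damping
                  of high-frequency discrepancies.
    dx:           grid spacing, used to form the discrete wave numbers k.
    """
    n = pred.shape[-1]
    diff_hat = torch.fft.rfft(pred - target)             # \hat{f}(k) - \hat{h}(k)
    k = torch.fft.rfftfreq(n, d=dx)                      # discrete wave numbers
    weight = (1.0 + (2.0 * torch.pi * k) ** 2) ** s      # (1 + (2*pi*|k|)^2)^s, s < 0
    return torch.sum(weight * diff_hat.abs() ** 2)

# Usage sketch on the multiscale example h(x) = x + sin(2*pi*x**4) on [0, 2]:
# x = torch.linspace(0.0, 2.0, 512)
# h = x + torch.sin(2 * torch.pi * x ** 4)
# noisy = h + 1.0 * (h.max() - h.min()) * torch.randn_like(h)   # eps = 1.0, as in Eq. (8)
# loss = sobolev_weighted_loss(model_output, noisy, s=-1.0, dx=float(x[1] - x[0]))
```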
Figure 3: Reconstructing a multiscale signal polluted by noise. (Left column) Pure L2 loss function. (Middle column) Sobolev loss 1. (Right column) Sobolev loss 2. Each row, top to bottom, corresponds to noise levels ε = 0.05, 0.5, and 1.0, respectively. The noisy signal ĥ was used as the data set, with the goal of reconstructing the clean signal h. A train/test split of 75/25 was used on {ĥ(x)}.

Figure 4: Reconstructing a multiscale signal polluted by noise (ε = 1 in Eq. (8)). The noisy signal ĥ = h + η was used as the data set, with the goal of reconstructing the clean signal h. A train/test split of 75/25 was used on {ĥ(x)}. Sobolev 1 and 2 are as described above. Sobolev 1 + 2 means that both the Sobolev 1 method and the Sobolev 2 method were applied. It is clear that the best performance is given by the Sobolev 1 method. The same trend continues when using ε = 0.05 and 0.5.

Conclusions

In this paper, we have presented results on using the Koopman operator to analyze aspects of neural networks. In order to do this, we consider the training process as a dynamical system on the weights. In particular, we used the spectrum of the Koopman operator to analyze the training process and determine when to terminate training and restart. We have also introduced structured loss functions based on negative Sobolev norms, which allow for significant noise rejection when training on a noisy multi-scale signal.

Acknowledgements

This work was partially supported under DARPA contract HR0011-18-9-0033 and AFOSR contract FA9550-17-C-0012.

References

Budišić, M.; Mohr, R.; and Mezić, I. 2012. Applied Koopmanism. Chaos 22(4): 047510.

Chang, B.; Chen, M.; Haber, E.; and Chi, E. H. 2019. AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks. arXiv:1902.09689 (stat.ML).

Chang, B.; Meng, L.; Haber, E.; Tung, F.; and Begert, D. 2017. Multi-level Residual Networks from Dynamical Systems View. arXiv:1710.10348 (stat.ML).

Dietrich, F.; Thiem, T. N.; and Kevrekidis, I. G. 2020. On the Koopman Operator of Algorithms. SIAM Journal on Applied Dynamical Systems 19(2): 860–885. doi:10.1137/19M1277059. URL https://doi.org/10.1137/19M1277059.

Drmač, Z.; Mezić, I.; and Mohr, R. 2018. Data Driven Modal Decompositions: Analysis and Enhancements. SIAM Journal on Scientific Computing 40(4): A2253–A2285.

Drmač, Z.; Mezić, I.; and Mohr, R. 2019. Data Driven Koopman Spectral Analysis in Vandermonde–Cauchy Form via the DFT: Numerical Method and Theoretical Insights. SIAM Journal on Scientific Computing 41(5): A3118–A3151. doi:10.1137/18M1227688. URL https://doi.org/10.1137/18M1227688.

Hemati, M. S.; Williams, M. O.; and Rowley, C. W. 2014. Dynamic mode decomposition for large and streaming datasets. Physics of Fluids 26(11): 111701.

Jovanović, M. R.; Schmid, P. J.; and Nichols, J. W. 2012. Low-rank and sparse dynamic mode decomposition. Center for Turbulence Research, Annual Research Briefs 139–152.

Jovanović, M. R.; Schmid, P. J.; and Nichols, J. W. 2014. Sparsity-promoting dynamic mode decomposition. Physics of Fluids 26(2): 024103.

Mauroy, A.; and Mezić, I. 2016. Global Stability Analysis Using the Eigenfunctions of the Koopman Operator. IEEE Transactions on Automatic Control 61(11): 3356–3369.

Mezić, I. 2005. Spectral Properties of Dynamical Systems, Model Reduction and Decompositions. Nonlinear Dynamics 41: 309–325.

Mezić, I. 2019. Spectrum of the Koopman operator, spectral expansions in functional spaces, and state-space geometry. Journal of Nonlinear Science 1–55.

Mohr, R.; and Mezić, I. 2014. Construction of Eigenfunctions for Scalar-type Operators via Laplace Averages with Connections to the Koopman Operator. arXiv.org 1–25.

Rowley, C. W.; Mezić, I.; Bagheri, S.; Schlatter, P.; and Henningson, D. S. 2009. Spectral analysis of nonlinear flows. Journal of Fluid Mechanics 641: 115–127.

Schmid, P. J.; and Sesterhenn, J. 2008. Dynamic Mode Decomposition of Numerical and Experimental Data. In Sixty-First Annual Meeting of the APS Division of Fluid Dynamics. San Antonio, Texas, USA.

Williams, M. O.; Kevrekidis, I. G.; and Rowley, C. W. 2015. A Data-Driven Approximation of the Koopman Operator: Extending Dynamic Mode Decomposition. Journal of Nonlinear Science 25(6): 1307–1346.