Applications of Koopman Mode Analysis to Neural Networks

Ryan Mohr,1 Maria Fonoberova,1 Iva Manojlović,1 Aleksandr Andrejčuk,1 Zlatko Drmač,2 Yannis Kevrekidis,3 Igor Mezić,1,4
1 AIMdyn, Inc., Santa Barbara, CA 93101
2 University of Zagreb, 10000 Zagreb, Croatia
3 Johns Hopkins University, Baltimore, MD 21218
4 University of California, Santa Barbara, CA 93106
mohrr@aimdyn.com, mfonoberova@aimdyn.com, imanojlovic@aimdyn.com, aleksandr@aimdyn.com, drmac@math.hr, yannisk@jhu.edu, mezici@aimdyn.com

Abstract

We consider the training process of a neural network as a dynamical system acting on the high-dimensional weight space. Each epoch is an application of the map induced by the optimization algorithm and the loss function. Using this induced map, we can apply observables on the weight space and measure their evolution. The evolution of the observables is given by the Koopman operator associated with the induced dynamical system. We use the spectrum and modes of the Koopman operator to analyze the convergence versus non-convergence of the training process. Our methods can help to determine whether (1) a bad initialization of the network weights was used (in particular, how the existence of eigenvalues clustering around 1 determines when to terminate the learning process), allowing a restart before training for too long, and (2) the training time can be sped up. Additionally, we show that incorporating structured loss functions based on negative Sobolev norms can allow for the reconstruction of a multi-scale signal polluted by very large amounts of noise. Using these Sobolev-based loss functions improves robustness and interpretability.

Introduction

The training of neural networks is a topic of much interest due to their widespread use in a variety of application domains, from image recognition and classification to solving ordinary and partial differential equations. Unfortunately, the dimensionality of the problem often prevents rigorous analysis. Viewing neural network training as a dynamical system (Chang et al. 2017; Dietrich, Thiem, and Kevrekidis 2020; Chang et al. 2019) provides a framework for a mathematical approach to training.

Our objective is to introduce the wider community to various results in applying methods from dynamical systems, in particular the operator-theoretic approach, to extract insight both into the choice of the neural network's architecture, such as its depth, for a given problem and into the network's training process.

We consider the training process as a dynamical system acting on the high-dimensional weight space. Each epoch is an application of the map induced by the optimization algorithm and the loss function. Using this induced map, we can apply observables on the weight space and measure their evolution. The evolution of the observables is given by a linear operator, the Koopman operator, associated with the induced dynamical system. We use the spectrum and modes of the Koopman operator to obtain insight into the training process. This viewpoint can help to determine if we have a bad initialization of the network weights, allowing a restart before training for too long. Additionally, we incorporate structured loss functions based on negative Sobolev norms into our network. Using these new types of loss functions can (1) improve the interpretability of the networks and (2) allow for significant noise rejection when training on noisy multi-scale signals.

The rest of the paper is structured as follows. In the next section, we mathematically formulate the training of a neural network as a dynamical system and introduce the Koopman operator viewpoint of analyzing dynamical systems. The following section investigates the convergence of the training process through the lens of the Koopman operator's spectrum. In the final section before the conclusions, we apply loss functions inspired by negative Sobolev norms and show how they can be used for significant noise rejection when trying to reconstruct a signal.
Neural network training as a dynamical system

Let n(x; w), n : X × R^n → R^d, be a neural network, where x ∈ X ⊂ R^m is the input feature vector, w ∈ R^n is the vector of network weights (parameters), and the output of the network is a d-dimensional real vector. Let L_tr(w) be the loss function of the network on the training set as a function of the network parameter weights. The optimization problem to be solved is

w^* = \arg\min_w L_{tr}(w).   (1)

The loss function L_tr and the chosen optimization algorithm (e.g., stochastic gradient descent) induce a nonlinear, discrete-time map on the network weights that updates the weights at each epoch:

T : \mathbb{R}^n \to \mathbb{R}^n, \quad w_{t+1} = T(w_t),   (2)

where w_t are the network weights at the beginning of epoch t ∈ N_0; w_0 represents the initialized network weights prior to training. This induced map T is a discrete dynamical system having the weight space as its state space.

Trying to directly analyze the map T can be quite difficult, as it is implemented as a black box in whatever training framework one is using. Instead, one can track the evolution of observables to gain insight into the dynamical system by studying the spectral properties of an induced linear operator that drives the evolution of the observables. Let F be a function space that is closed under composition with T; that is, if f ∈ F, then f ∘ T ∈ F. An operator U : F → F can be defined via this composition, U f = f ∘ T. This operator, called the Koopman or composition operator, is linear (Mezić 2005), albeit infinite-dimensional, even if T is nonlinear. Even though it is linear, it can still capture the full nonlinear behavior of T. In many cases, the Koopman operator has a spectral decomposition (Mezić 2005; Budišić, Mohr, and Mezić 2012; Mohr and Mezić 2014; Mezić 2019)

U f = \sum_{j=1}^{\infty} c_j \lambda_j \phi_j + \int_{\mathbb{C}} z \, dE(z) f,   (3)

where c_j ∈ C is a coefficient, λ_j ∈ C is an eigenvalue, φ_j ∈ F is an eigenfunction of U, and dE(z) is a projection-valued measure (PVM). The set of eigenvalues forms the point spectrum. The PVM is associated with the continuous spectrum of U; it takes sets in the complex plane and associates projection operators on F with them. We can also consider the case of vector-valued observables f = (f_1, . . . , f_m), where f_i ∈ F. In this case, there is a decomposition analogous to (3), whose only difference is that the scalar-valued coefficients c_j become vector-valued coefficients m_j ∈ C^m. These vector-valued coefficients are called Koopman modes, originally called shape modes (Mezić 2005). Data-driven algorithms, such as the family of Dynamic Mode Decomposition algorithms, e.g. (Schmid and Sesterhenn 2008; Rowley et al. 2009; Jovanović, Schmid, and Nichols 2012, 2014; Hemati, Williams, and Rowley 2014; Williams, Kevrekidis, and Rowley 2015; Drmač, Mezić, and Mohr 2018; Drmač, Mezić, and Mohr 2019), are used to approximate the modes and eigenvalues using a trajectory from a single initial condition. Thus we do not need explicit access to U to analyze the dynamical system. We use the DMD_RRR algorithm from (Drmač, Mezić, and Mohr 2018) in the following work.
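The paper uses the DMD_RRR algorithm of (Drmač, Mezić, and Mohr 2018). As a rough illustration of the general idea (approximating Koopman eigenvalues and modes from a single trajectory of snapshots), the minimal sketch below applies plain exact DMD to a matrix whose columns are the recorded observables, e.g. the flattened weight vectors. The function name, the rank parameter r, and the use of NumPy's SVD are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dmd_eigs_and_modes(snapshots, r=10):
    """Plain (exact) DMD sketch: approximate Koopman eigenvalues and modes
    from a single trajectory of observables.

    snapshots: (n_features, n_epochs) array whose t-th column is the
               observable at epoch t (e.g. the flattened weight vector w_t).
    r:         rank of the truncated SVD used to build the reduced operator.
    """
    X, Y = snapshots[:, :-1], snapshots[:, 1:]      # snapshot pairs (x_t, x_{t+1})
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    r = min(r, int((s > 1e-10 * s[0]).sum()))       # guard against rank deficiency
    U_r, S_inv, V_r = U[:, :r], np.diag(1.0 / s[:r]), Vh[:r, :].conj().T
    A_tilde = U_r.conj().T @ Y @ V_r @ S_inv        # reduced linear map
    eigvals, W = np.linalg.eig(A_tilde)             # approximate Koopman eigenvalues
    modes = Y @ V_r @ S_inv @ W                     # exact DMD modes
    return eigvals, modes

# Usage sketch: stack the flattened network weights recorded after each epoch.
# weights_per_epoch = np.stack([w.ravel() for w in weight_history], axis=1)
# lam, modes = dmd_eigs_and_modes(weights_per_epoch, r=20)
# print(np.abs(lam))   # magnitudes near 1 indicate slowly decaying modes
```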
Insights into neural network training convergence via the Koopman spectrum

The main idea here is that monitoring the spectrum of the Koopman operator during the training process gives us a method to determine when the training process should be terminated so that good performance is obtained on the test set without the network memorizing the training set. Having this indicator allows the network to generalize better.

Given the trajectory of all network weights {w_t}, where t ∈ N is the training epoch, we use two different observables: a delay-embedded version of the (cross-entropy) loss function, L_tr(w_t), and the full-state observable, which returns the network weights w_t at each training epoch t. Inspecting the Koopman spectrum of either of these observables tells us how quickly the system is learning and when a fixed point is appearing. The Koopman mode decomposition (3) can be written as

L_{tr}(w_t) = \sum_k m_k \lambda_k^t \phi_k + e_t,   (4)

where m_k is the Koopman mode normalized to norm 1, λ_k is the associated eigenvalue, φ_k is the reconstruction coefficient, and e_t is the error term at epoch t. If all the eigenvalues not equal to 1 satisfy |λ_k| < 1, then the training is stable (Mauroy and Mezić 2016). In this case, the slowest eigenvalue determines the rate of convergence. The importance of a mode is determined by the absolute value of the reconstruction coefficient, with higher values giving the mode more importance. The further inside the unit circle the eigenvalues corresponding to the important modes are, the faster the training.

The standard MNIST data set was used for testing the methods. The network tested was a convolutional network with two convolutional layers with 16 and 32 kernels, each followed by 2 × 2 max-pooling. Each kernel is 5 × 5. The convolutional layers are followed by a fully-connected layer with 100 neurons, which is then followed by another fully-connected layer with 10 neurons, after which follows softmax classification. The architecture is shown in Figure 1. A cross-entropy loss function was used with a learning rate of η = 1e-3. Different weight initializations (He or Xavier) were tested as well. After each epoch, the cross-entropy loss on the train and test sets was recorded along with the network weights. All networks were trained for 1000 epochs, and KMD analysis was applied to the snapshots for 3 different epoch ranges specified in the figures.

Figure 1: Neural network architecture.
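For concreteness, a sketch of the architecture described above is given below in PyTorch. The class name, the ReLU activations, the absence of padding in the convolutions, and the optimizer in the commented usage are assumptions made for illustration; the paper specifies only the layer sizes, the cross-entropy loss, and the learning rate of 1e-3.

```python
import torch
import torch.nn as nn

class SmallMnistCnn(nn.Module):
    """Sketch of the network described in the text: two 5x5 convolutional layers
    (16 and 32 kernels), each followed by 2x2 max-pooling, then fully-connected
    layers of 100 and 10 neurons with softmax classification."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5),   # 28x28 -> 24x24 (assumes no padding)
            nn.ReLU(),                         # activation choice is an assumption
            nn.MaxPool2d(2),                   # 24x24 -> 12x12
            nn.Conv2d(16, 32, kernel_size=5),  # 12x12 -> 8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 100),
            nn.ReLU(),
            nn.Linear(100, 10),                # softmax is applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Usage sketch: record the loss and flattened weights after each epoch so they
# can be used later as observables for the KMD analysis.
# model = SmallMnistCnn()
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # optimizer is an assumption
# loss_fn = nn.CrossEntropyLoss()                           # cross-entropy, as in the paper
```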
Figure 2 shows the results for the Xavier initialization scheme. The top left figure shows the cross-entropy loss of the network evaluated on the training set at each epoch. The top right figure shows the cross-entropy loss on the test set. The second row shows the KMD spectrum using the cross-entropy loss function on the training set as the observable. The spectrum is computed using the first 40 (left), 100 (middle), and 500 (right) snapshots of the observable. The third row shows the spectrum computed using the weight vectors as the vector-valued observable. The spectrum is computed using the first 50 (left), 100 (middle), and 500 (right) snapshots of the observable. The spectrum was computed using DMD_RRR. Eigenvalues inside the unit circle correspond to decaying modes, whereas eigenvalues outside the unit circle correspond to growing modes. Eigenvalues very close to the unit circle correspond to modes that change more slowly than modes associated with eigenvalues closer to zero.

Figure 2: KMD analysis of network 1 training with the Xavier initialization scheme. (1st row, left) Cross-entropy loss of the network on the training set. (1st row, right) Cross-entropy loss on the test set. (2nd row) KMD eigenvalues using the cross entropy on the training set as the observable. Left: computed after 40 epochs. Middle: 100 epochs. Right: 500 epochs. (3rd row) KMD eigenvalues using the weight vectors during training as the observable. Left: computed after 50 epochs. Middle: 100 epochs. Right: 500 epochs.

Each of the initialization schemes trains very fast, with a large drop in the training error after a few epochs. The He and Xavier initialization schemes seem to over-memorize the training set, which we see as an increase in the cross-entropy loss on the test set. However, the final errors on the test set are still lower than the final error for the random normal scheme on the test set.

Since, for all initialization schemes, the training process is extremely fast, with a large drop in training set cross entropy, the eigenvalues clustered close to 0 in the second row of the figures (training set cross-entropy observable) make sense. As more snapshots are used to compute the spectrum, some important eigenvalues appear close to 1, which indicates that the training is nearing a fixed point. This trend is clear in the final row of the plots, which used the weight vectors as the observable for the KMD computation. After 50 epochs, the important eigenvalues are clustered close to 0 and, as more snapshots are taken, important eigenvalues appear and cluster around 1.
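A possible way to turn this observation into a stopping heuristic is sketched below: after computing a KMD of the weight trajectory, check whether the most important eigenvalues sit within a small tolerance of the unit circle. The tolerance, the use of mode norms as an importance proxy, and the helper `dmd_eigs_and_modes` from the earlier sketch are assumptions for illustration, not the procedure used in the paper.

```python
import numpy as np

def near_fixed_point(eigvals, modes, coeffs=None, radius_tol=0.02, top_k=5):
    """Heuristic stopping check: are the most important eigenvalues clustered
    near the unit circle (|lambda| within radius_tol of 1)?

    eigvals: approximate Koopman eigenvalues from the weight-trajectory KMD.
    modes:   corresponding modes; coeffs, if given, rank their importance
             (the paper ranks modes by reconstruction coefficient magnitude).
    """
    if coeffs is None:
        # crude importance proxy when reconstruction coefficients are unavailable
        coeffs = np.linalg.norm(modes, axis=0)
    order = np.argsort(np.abs(coeffs))[::-1][:top_k]   # top_k most important modes
    lam = eigvals[order]
    return bool(np.all(np.abs(np.abs(lam) - 1.0) < radius_tol))

# Usage sketch, reusing the earlier DMD helper on the recorded weight snapshots:
# lam, modes = dmd_eigs_and_modes(weights_per_epoch[:, :n_snapshots], r=20)
# if near_fixed_point(lam, modes):
#     print("important eigenvalues cluster near 1 -> training is near a fixed point")
```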
Noise rejection using negative Sobolev norms in the loss function

Here, we investigate the performance of different loss functions compared to the standard L2 loss when learning a multiscale signal. The loss functions that we use are inspired by the functional form of negative-index Sobolev norms. For functions f, h : T^d → R, where T^d is the d-dimensional torus (R/Z)^d, the Sobolev norm of order p = 2 and index s < 0 of their difference can be computed via

\|f - h\|_{H^s}^2 = \sum_{k \in \mathbb{Z}^d} \frac{|\hat{f}(k) - \hat{h}(k)|^2}{(1 + (2\pi \|k\|_2)^2)^{-s}},   (5)

where f̂ and ĥ are the Fourier transforms of f and h, respectively. As the norm ‖k‖₂ of the wave vector increases, the contribution of the term |f̂(k) − ĥ(k)|² to the loss diminishes; discrepancies between f and h at small scales are not as important as discrepancies at larger/coarser scales.

We apply the Sobolev loss functions to fully connected neural networks with L layers (ℓ = 0, . . . , L − 1) in two ways. Let h : R^D → R^M be the function to be learned and s : R → R be a monotonically increasing function that defines a scale. Usually we will use the linear function s(ℓ) = ℓ or the exponential function s(ℓ) = 2^ℓ.

Sobolev loss 1. For each hidden layer ℓ, we define an auxiliary output for that layer, denoted by f_ℓ = C_ℓ z_ℓ, where z_ℓ ∈ R^{N_ℓ} are the activations of layer ℓ and C_ℓ ∈ R^{M×N_ℓ} is a matrix. We add a loss function for each auxiliary output having the form

L_\ell(h, f_\ell) = \sum_{m=1}^{M} \sum_{\|k\|_1 = s(\ell)} \frac{|\hat{h}^{(m)}(k) - \hat{f}_\ell^{(m)}(k)|^2}{(1 + (2\pi \|k\|_2)^2)^{1/2}},   (6)

for ℓ ∈ {1, . . . , L} and where ĥ^{(m)} is the Fourier transform of the m-th component function of h = (h^{(1)}, . . . , h^{(M)}). By applying a loss of a different scale, s(ℓ), at each layer, we are enforcing a more interpretable network. Note that for ℓ = 0, the denominator of (6) is 1 and, by Parseval's identity, the expression is equivalent to the L2 norm.

Sobolev loss 2. Here, instead of applying pieces of (5) at each layer, we apply them only at the output of the final layer, f_{L−1} = C_{L−1} z_{L−1}:

L(h, f_{L-1}) = \sum_{m=1}^{M} \sum_{\ell=0}^{L-1} \sum_{\|k\|_1 = s(\ell)} \frac{|\hat{h}^{(m)}(k) - \hat{f}_{L-1}^{(m)}(k)|^2}{(1 + (2\pi \|k\|_2)^2)^{1/2}}.   (7)

We apply the two methods to the multiscale signal h(x) = x + sin(2πx⁴) (therefore M = 1) on the interval [0, 2]. At each point x, we add noise η(x) to the signal that is distributed according to

\eta \sim \epsilon \left( \max_{x \in [0,2]} h(x) - \min_{x \in [0,2]} h(x) \right) \mathcal{N}(0, 1).   (8)

The noisy signal ĥ = h + η is used as the data set. Figure 3 shows the performance of the different loss functions when training on the noisy signal. The left column contains the result for the pure L2 loss function, the middle column is Sobolev loss 1, and the right column is Sobolev loss 2. Each row, top to bottom, corresponds to noise levels ε = 0.05, 0.5, and 1.0, respectively. The network had 9 hidden layers, each containing 20 neurons, and a single-node output layer. Layers were fully connected. As can be seen, the pure L2 loss function reconstructs the clean signal h poorly at every noise level, basically reconstructing the average of the signal. Of the two Sobolev loss functions, the first performs best at reconstructing the clean signal, even in the presence of high amounts of noise. This is likely due to imposing a specific scale for each layer, rather than forcing the network to disentangle the scales at the final layer. For each of the Sobolev losses, the linear scale function s(ℓ) = ℓ was used.
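As a concrete one-dimensional (M = 1) illustration of the Eq. (5)-style weighting, the sketch below compares a prediction and a target in Fourier space and down-weights high-frequency discrepancies. The FFT discretization over a uniformly sampled interval, the function and parameter names, and the use of PyTorch tensors (so the loss is differentiable) are assumptions; the layer-wise restriction to ‖k‖₁ = s(ℓ) used in Sobolev loss 1 and 2 is not included.

```python
import torch

def sobolev_weighted_loss(pred, target, s=-1.0, dx=1.0):
    """Sketch of an Eq. (5)-style loss for 1-D signals sampled on a uniform grid.

    pred, target: 1-D tensors of samples of f and h on the same grid.
    s:            negative Sobolev index; more negative means stronger damping
                  of high-frequency discrepancies.
    dx:           grid spacing, used to form the discrete wave numbers k.
    """
    n = pred.shape[-1]
    diff_hat = torch.fft.rfft(pred - target)             # \hat{f}(k) - \hat{h}(k)
    k = torch.fft.rfftfreq(n, d=dx)                      # discrete wave numbers
    weight = (1.0 + (2.0 * torch.pi * k) ** 2) ** s      # (1 + (2*pi*|k|)^2)^s, s < 0
    return torch.sum(weight * diff_hat.abs() ** 2)

# Usage sketch on the multiscale example h(x) = x + sin(2*pi*x**4) on [0, 2]:
# x = torch.linspace(0.0, 2.0, 512)
# h = x + torch.sin(2 * torch.pi * x ** 4)
# noisy = h + 1.0 * (h.max() - h.min()) * torch.randn_like(h)   # eps = 1.0, as in Eq. (8)
# loss = sobolev_weighted_loss(model_output, noisy, s=-1.0, dx=float(x[1] - x[0]))
```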
Figure 3: Reconstructing a multiscale signal polluted by noise. (Left column) Pure L2 loss function. (Middle column) Sobolev loss 1. (Right column) Sobolev loss 2. Each row, top to bottom, corresponds to noise levels ε = 0.05, 0.5, and 1.0, respectively. The noisy signal ĥ was used as the data set, with the goal of reconstructing the clean signal h. A train/test split of 75/25 was used on {ĥ(x)}.

Figure 4: Reconstructing a multiscale signal polluted by noise (ε = 1 in Eq. (8)). The noisy signal ĥ = h + η was used as the data set, with the goal of reconstructing the clean signal h. A train/test split of 75/25 was used on {ĥ(x)}. Sobolev 1 and 2 are as described above. Sobolev 1 + 2 means that both the Sobolev 1 method and the Sobolev 2 method were applied. It is clear that the best performance is given by the Sobolev 1 method. The same trend continues when using ε = 0.05 and 0.5.

Conclusions

In this paper, we have presented results on using the Koopman operator to analyze aspects of neural networks. In order to do this, we consider the training process as a dynamical system on the weights. In particular, we used the spectrum of the Koopman operator to analyze the training process and determine when to terminate training and restart. We have also introduced structured loss functions based on negative Sobolev norms, which allow for significant noise rejection when training on a noisy multi-scale signal.

Acknowledgements

This work was partially supported under DARPA contract HR0011-18-9-0033 and AFOSR contract FA9550-17-C-0012.

References

Budišić, M.; Mohr, R.; and Mezić, I. 2012. Applied Koopmanism. Chaos 22(4): 047510.

Chang, B.; Chen, M.; Haber, E.; and Chi, E. H. 2019. AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks. arXiv:1902.09689 (stat.ML).

Chang, B.; Meng, L.; Haber, E.; Tung, F.; and Begert, D. 2017. Multi-level Residual Networks from Dynamical Systems View. arXiv:1710.10348 (stat.ML).

Dietrich, F.; Thiem, T. N.; and Kevrekidis, I. G. 2020. On the Koopman Operator of Algorithms. SIAM Journal on Applied Dynamical Systems 19(2): 860–885. doi:10.1137/19M1277059. URL https://doi.org/10.1137/19M1277059.

Drmač, Z.; Mezić, I.; and Mohr, R. 2018. Data Driven Modal Decompositions: Analysis and Enhancements. SIAM Journal on Scientific Computing 40(4): A2253–A2285.

Drmač, Z.; Mezić, I.; and Mohr, R. 2019. Data Driven Koopman Spectral Analysis in Vandermonde–Cauchy Form via the DFT: Numerical Method and Theoretical Insights. SIAM Journal on Scientific Computing 41(5): A3118–A3151. doi:10.1137/18M1227688. URL https://doi.org/10.1137/18M1227688.

Hemati, M. S.; Williams, M. O.; and Rowley, C. W. 2014. Dynamic mode decomposition for large and streaming datasets. Physics of Fluids 26(11): 111701.

Jovanović, M. R.; Schmid, P. J.; and Nichols, J. W. 2012. Low-rank and sparse dynamic mode decomposition. Center for Turbulence Research, Annual Research Briefs 139–152.

Jovanović, M. R.; Schmid, P. J.; and Nichols, J. W. 2014. Sparsity-promoting dynamic mode decomposition. Physics of Fluids 26(2): 024103.

Mauroy, A.; and Mezić, I. 2016. Global Stability Analysis Using the Eigenfunctions of the Koopman Operator. IEEE Transactions on Automatic Control 61(11): 3356–3369.

Mezić, I. 2005. Spectral Properties of Dynamical Systems, Model Reduction and Decompositions. Nonlinear Dynamics 41: 309–325.

Mezić, I. 2019. Spectrum of the Koopman operator, spectral expansions in functional spaces, and state-space geometry. Journal of Nonlinear Science 1–55.

Mohr, R.; and Mezić, I. 2014. Construction of Eigenfunctions for Scalar-type Operators via Laplace Averages with Connections to the Koopman Operator. arXiv.org 1–25.

Rowley, C. W.; Mezić, I.; Bagheri, S.; Schlatter, P.; and Henningson, D. S. 2009. Spectral analysis of nonlinear flows. Journal of Fluid Mechanics 641: 115–127.

Schmid, P. J.; and Sesterhenn, J. 2008. Dynamic Mode Decomposition of Numerical and Experimental Data. In Sixty-First Annual Meeting of the APS Division of Fluid Dynamics. San Antonio, Texas, USA.

Williams, M. O.; Kevrekidis, I. G.; and Rowley, C. W. 2015. A Data-Driven Approximation of the Koopman Operator: Extending Dynamic Mode Decomposition. Journal of Nonlinear Science 25(6): 1307–1346.