Deep Learning for Climate Models of the Atlantic Ocean

Anton Nikolaev,1 Ingo Richter,2 Peter Sadowski1*
1 Information and Computer Sciences, University of Hawai‘i at Mānoa
2 Japan Agency for Marine-Earth Science and Technology
* peter.sadowski@hawaii.edu

Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

A deep neural network is trained to predict sea surface temperature variations at two important regions of the Atlantic Ocean, using 800 years of simulated climate dynamics based on first-principles physics models. The model is then tested against 60 years of historical data. Our statistical model learns to approximate the physical laws governing the simulation, providing a significant improvement over simple statistical forecasts, and it is comparable to most state-of-the-art dynamical/conventional forecast models at a fraction of the computational cost.

Introduction

General circulation models (GCMs) describe the time evolution of the atmosphere or ocean using mathematical models of fluids and thermodynamics. These models are good at predicting climate variations in the Pacific Ocean, such as the El Niño–Southern Oscillation (ENSO), but the same models perform poorly in predicting an analogous climate pattern in the Atlantic Ocean. Indeed, one of the most successful approaches to predicting short-term (1-6 month) climate variability in the Atlantic is a simple "damped persistence" model, i.e. the prediction that the seasonal climate anomaly will remain constant, with a regression (damping) toward the mean.

Figure 1: Example heat map of sea surface temperature anomalies for a single month. Physics-based models can forecast variations due to ENSO in the Pacific (center region), but perform poorly in two regions of interest in the Atlantic: the northern tropical Atlantic (NTA; 40°W-10°W, 10°N-20°N) and the eastern equatorial Atlantic (ATL3; 20°W-0°, 3°S-3°N).

Data-driven machine learning methods take a different approach to climate forecasting. Rather than integrating the physics equations forward in time, machine learning attempts to learn emergent patterns from data, sacrificing the interpretability and robustness of first principles in favor of black-box statistical models. When trained on real data, these models can capture deficiencies in the physical models. When trained on simulation data, they can provide a fast approximation to computationally expensive simulations. Deep learning with artificial neural networks, a machine learning approach that is particularly well suited to high-dimensional data, has recently shown promise in modeling a variety of fluid flow processes (Wang et al. 2019; de Bezenac, Pajot, and Gallinari 2019; Ham, Kim, and Luo 2019).

In this work we apply deep learning to the challenging task of predicting sea surface temperature (SST) anomalies in two particular regions of the Atlantic (Figure 1) where GCMs are known to perform relatively poorly: the eastern equatorial Atlantic (ATL3), which is subject to pronounced warm and cold events lasting 3-6 months, and the northern tropical Atlantic (NTA). Deep learning methods require large data sets for training, and we use simulated climate processes from Version 2 of the Canadian Earth System Model (CanESM2). The dynamical core of this climate model is based on the first-principles Navier-Stokes equations for fluid dynamics, with some unresolved processes such as convection and turbulence represented through parameterization schemes. The latter introduce a few free parameters that are tuned to observational data. This tuning, however, only concerns the mean statistics of the model output and does not provide any information that would allow the model to forecast particular climate events. Running this
model forward in time produces simulated climate cycles that exhibit a range of fluctuations under steady radiative forcing. We use this to test the hypothesis that a deep learning model trained on GCM simulations can provide a fast approximation to GCM-based forecasts, and whether such a model performs better than simple persistence forecast models.

Methods

Data

The training data consist of an 800-year time series from CanESM2 simulations, represented as a sequence of one-month time steps. The first 600 years are used for training, years 601-700 are used for early stopping and hyper-parameter tuning, and years 701-800 are used as a clean test set for evaluation. After hyper-parameter optimization, a final model is trained on the first 700 years, with the final 100 years used for early stopping, and we evaluate performance on historical SST anomaly data from the years 1958-2017, pre-processed by subtracting the linear climate change trend line.

The CanESM2 data are represented on a grid of 128 longitudinal steps (ranging from 180°W-180°E) and 22 latitudinal steps (ranging from 30°S-30°N). A mask is applied to cells that do not consist entirely of open ocean. For each unmasked cell we have the sea surface temperature anomaly, the surface wind stress decomposed into longitudinal and latitudinal components τu and τv, and the depth of the 20-degree-Celsius isotherm z20, which essentially measures the upper-ocean heat content. The data are normalized by mean-subtraction and scaling by the standard deviation, with the mean and variance of each feature calculated over all grid cells over the entire data set. For masked cells the values of all input features are filled with zeroes; predictions at these cells do not contribute to the loss.

Deep Learning

The deep learning approach can leverage global information to predict SST at any particular location. However, limiting the information to a local region is advantageous because it helps prevent overfitting. The size of this "receptive field" is optimized during hyper-parameter selection.

In our experiments, the neural network architecture takes in a (128+k)×(22+k)×T×4 tensor, where k is the kernel size and T is the number of months to consider when making predictions. The T×4 input features at each grid cell are concatenated and treated as input channels. The model consists of a sequence of 2D convolutional layers, with skip connections concatenating the input SST values to the penultimate layer (similar to the widely used U-net architectures (Ronneberger, Fischer, and Brox 2015)) and adding them to the linear output layer (as in a ResNet (He et al. 2016)). The objective is the mean squared error (MSE) loss computed over the non-masked grid cells.

Hyper-parameters were optimized using the Bayesian optimization algorithm implemented in the SHERPA black-box optimization framework (Hertel et al. 2018). A total of 400 neural networks were trained, optimizing over the search space shown in Table 1. The best model used the maximum number of hidden layers (twelve) in our hyper-parameter search space. Many of the models overfit the data set, and regularization was important: the best model used a small batch size, a small kernel size, a small number of input timesteps, and a small channel size.

Hyper-parameter           Range                  Best
Input timesteps           1 - 12                 2
Convolution layers        4 - 12                 12
Convolution channels      20 - 160               49
Kernel shape              (3,3), (5,5), (7,7)    (3,3)
Initial learning rate     0.0001 - 0.1           0.0001
Batch size                1, 2, 4                1
Early stopping patience   1 - 10                 10

Table 1: Hyper-parameter search space explored and optimal values selected by SHERPA after training 100 neural networks using Bayesian optimization. Kernel shape and batch size were treated as categorical variables with the choices listed.

We tried four other modifications that did not improve performance on the GCM validation set and so were not used in the final model: (1) using locally connected layers instead of convolutional layers; (2) passing the land-mass mask as an input instead of zero-filling; (3) including the month as an extra input channel; and (4) dropping the z20 input channel.

Results

The model is trained to make predictions for the entire CanESM2 grid, but we focus our analysis on the NTA and ATL3 regions. In order to evaluate the generalization from simulation to observed data, we evaluate performance on both (1) the final 100 years of the CanESM2 simulation and (2) the de-trended historical data. In both test sets and both regions, the NN predictions beat the persistence model for lead times of 1-6 months (Figure 2). There is a significant increase in RMSE when transferring the model from the simulation data it was trained on to the historical data, confirming that the simulations are only an imperfect approximation to the real system, but the NN maintains its performance advantage.

                  NTA                ATL3
                  GCM   Historical   GCM   Historical
Persistence       0.27  0.50         0.35  0.51
Deep learning     0.23  0.41         0.26  0.43

Table 2: RMSE on the GCM simulation and historical test sets, in degrees Celsius, averaged over a six-month lead time.

In the NTA, the NN predictions also beat the damped persistence model on both test sets. Figure 2 breaks down the performance on the 1958-2017 historical data by lead time for predictions made on February 1st of each year, showing that the forecasting ability degrades with longer lead times (i.e. farther into the future). However, the NN is no better than the damped persistence approach on the historical ATL3 data (Figure 3), reflecting the challenge of modeling this region.

Figure 2: Anomaly correlation coefficient (ACC) and RMSE in degrees Celsius of the deep learning approach vs. the persistence model and damped persistence model on the northern tropical Atlantic from 1958-2017. The climate change trend has been removed from the data. All predictions are made from February 1st.

Figure 3: Anomaly correlation coefficient (ACC) and RMSE in degrees Celsius of the deep learning approach vs. the persistence model and damped persistence model on the equatorial Atlantic from 1958-2017. The climate change trend has been removed from the data. All predictions are made from February 1st.

Figure 4 compares the sea surface temperature prediction skill of the NTA model with a range of other approaches. In addition to the persistence forecast, we compare against a linear inverse model (LIM) and GCM-based predictions. Linear inverse modeling is a technique that assumes the evolution of a system can be approximated by a linear operator with white-noise forcing. In practice, the linear operator is typically calculated in principal component space using multivariate regression at a fixed time lag (Penland and Sardeshmukh 1995). LIMs are usually derived from observational data, but here we use a LIM derived from the output of the CanESM2 GCM. The other forecast models are GCM-based, i.e. they use complex atmosphere-ocean models initialized with observations to predict the evolution of the system. The GCM forecast models include SINTEX-F, a prediction model used at the Japan Agency for Marine-Earth Science and Technology (Luo et al. 2005), and 8 models from various forecast centers that participated in the Climate-system Historical Forecast Project (Tompkins et al. 2017); see also Kirtman and Pirani (2009). These GCM forecast models were selected to illustrate the performance of complex prediction systems. The performance of the NN is competitive with these state-of-the-art methods.

Figure 4: Comparison of the deep neural network model (NN CanESM2, blue line) and dynamical models from the CHFP intercomparison for the years 1982-2010 in the NTA region. The performance of the linear inverse model trained on CanESM2 output (LIM CanESM2) is shown in yellow-green. All predictions are initialized on February 1. The metric shown is the anomaly correlation coefficient (ACC), which is calculated as the correlation of predicted and observed deviations (anomalies) from the mean climate.

Conclusion

We demonstrate the use of deep learning for forecasting monthly sea surface temperature variations in the Atlantic Ocean with a lead time of 1-6 months, a problem known to be significantly harder than forecasting ENSO in the Pacific. Training on CanESM2 climate model data and testing on historical data, the deep learning approach performs as well as the best GCM physics models on the northern tropical Atlantic region with much less computation. However, on the equatorial Atlantic, our model does no better than a simple damped persistence model.

In this work we restricted ourselves to training only on GCM simulation data at a fixed grid size, and thus we only expect the model to perform as well as the simulation it was trained on. We expect the NN approach to do better if it is given a chance to learn from historical data, since it could then learn to correct for deficiencies in the GCM. Fine-tuning the model on historical data is an opportunity for future work, although there is a significant danger of overfitting given the limited amount of historical data.

Acknowledgments

The authors would like to thank NVIDIA for a hardware grant to PS, and the University of Hawai‘i Information Technology Services Cyberinfrastructure for technical support and advanced computing resources. The authors acknowledge the WCRP/CLIVAR Working Group on Seasonal to Interannual Prediction (WGSIP) for establishing the Climate-system Historical Forecast Project (CHFP; see Kirtman and Pirani 2009) and the Centro de Investigaciones del Mar y la Atmosfera (CIMA) for providing the model output at http://chfps.cima.fcen.uba.ar/. We also thank the data providers for making the model output available through the CHFP.

References

de Bezenac, E.; Pajot, A.; and Gallinari, P. 2019. Deep learning for physical processes: Incorporating prior scientific knowledge. Journal of Statistical Mechanics: Theory and Experiment 2019(12):124009.

Ham, Y.-G.; Kim, J.-H.; and Luo, J.-J. 2019. Deep learning for multi-year ENSO forecasts. Nature 573(7775):568–572.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Hertel, L.; Collado, J.; Sadowski, P.; and Baldi, P. 2018. Sherpa: Hyperparameter optimization for machine learning models.

Kirtman, B., and Pirani, A. 2009. The state of the art of seasonal prediction: Outcomes and recommendations from the First World Climate Research Program Workshop on Seasonal Prediction. Bulletin of the American Meteorological Society.

Luo, J.-J.; Masson, S.; Behera, S.; Shingu, S.; and Yamagata, T. 2005. Seasonal climate predictability in a coupled OAGCM using a different approach for ensemble forecasts. Journal of Climate 18(21):4474–4497.

Penland, C., and Sardeshmukh, P. D. 1995. The optimal growth of tropical sea surface temperature anomalies. Journal of Climate 8(8):1999–2024.

Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 234–241. Springer.

Tompkins, A. M.; Ortiz De Zárate, M. I.; Saurral, R. I.; Vera, C.; Saulo, C.; Merryfield, W. J.; Sigmond, M.; Lee, W.-S.; Baehr, J.; Braun, A.; et al. 2017. The Climate-system Historical Forecast Project: Providing open access to seasonal forecast ensembles from centers around the globe. Bulletin of the American Meteorological Society 98(11):2293–2301.

Wang, R.; Kashinath, K.; Mustafa, M.; Albert, A.; and Yu, R. 2019. Towards physics-informed deep learning for turbulent flow prediction. arXiv preprint arXiv:1911.08655.
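The normalization and masked-loss scheme described in the Data and Deep Learning sections can be illustrated with a minimal sketch. This is not the paper's code: the function names and the toy 1-D data are ours, but the logic matches the description (statistics computed over unmasked cells, land cells zero-filled on input and excluded from the MSE).

```python
# Illustrative sketch (not the paper's code) of the paper's normalization and
# masked MSE objective. mask[i] is True for open-ocean cells, False for land.

def normalize(values, mask):
    """Mean-subtract and scale by the standard deviation over unmasked cells;
    masked (land) cells are filled with zeroes, as in the paper."""
    ocean = [v for v, m in zip(values, mask) if m]
    mean = sum(ocean) / len(ocean)
    std = (sum((v - mean) ** 2 for v in ocean) / len(ocean)) ** 0.5
    return [(v - mean) / std if m else 0.0 for v, m in zip(values, mask)]

def masked_mse(pred, target, mask):
    """Mean squared error computed over non-masked grid cells only."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs)

sst = [1.0, 2.0, 3.0, 99.0]          # toy row; last cell is land, value ignored
mask = [True, True, True, False]
x = normalize(sst, mask)
print(masked_mse([0.0, 0.0, 0.0, 0.0], x, mask))   # approximately 1.0
```

Because the land cell is excluded from both the statistics and the loss, its (arbitrary) value of 99.0 has no effect on either the normalized field or the error.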
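The "damped persistence" baseline used throughout the paper can be sketched in a few lines. This is a plain-Python illustration under our own naming, not the authors' implementation: the forecast is the current anomaly multiplied by a lag-regression (damping) coefficient estimated from a training series, so the anomaly relaxes toward the climatological mean of zero.

```python
# Illustrative sketch (not the paper's code) of a damped persistence forecast:
# the seasonal anomaly persists, damped toward zero by the lag-tau regression
# coefficient fitted to historical anomalies.

def damping_factor(series, tau):
    """Least-squares coefficient r(tau) regressing x(t+tau) on x(t)."""
    pairs = list(zip(series[:-tau], series[tau:]))
    return sum(a * b for a, b in pairs) / sum(a * a for a, _ in pairs)

def damped_persistence_forecast(anomaly_now, series, tau):
    """Forecast tau steps ahead: damp the current anomaly toward the mean."""
    return damping_factor(series, tau) * anomaly_now

history = [1.0, 0.5, 0.25, 0.125, 0.0625]   # toy anomaly series decaying by 0.5
print(damped_persistence_forecast(0.8, history, 1))   # 0.5 * 0.8 = 0.4
```

Setting the damping factor to 1 recovers the plain persistence forecast, which is the other baseline reported in Tables and Figures 2-4.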
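The anomaly correlation coefficient reported in Figures 2-4 is, per the Figure 4 caption, the correlation of predicted and observed deviations from the mean climate. A minimal sketch (our own plain-Python version, computing a Pearson correlation over paired anomalies):

```python
# Illustrative sketch of the anomaly correlation coefficient (ACC): the
# Pearson correlation between predicted and observed anomalies.

def acc(pred_anoms, obs_anoms):
    mp = sum(pred_anoms) / len(pred_anoms)
    mo = sum(obs_anoms) / len(obs_anoms)
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred_anoms, obs_anoms))
    var_p = sum((p - mp) ** 2 for p in pred_anoms)
    var_o = sum((o - mo) ** 2 for o in obs_anoms)
    return cov / (var_p * var_o) ** 0.5

# A forecast proportional to the observations scores an ACC of (about) 1,
# even though its RMSE is nonzero; this is why the paper reports both metrics.
print(acc([0.1, -0.2, 0.3], [0.2, -0.4, 0.6]))
```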