Deep Learning for Climate Models of the Atlantic Ocean

Anton Nikolaev,1 Ingo Richter,2 Peter Sadowski1*
1 Information and Computer Sciences, University of Hawai‘i at Mānoa
2 Japan Agency for Marine-Earth Science and Technology
* peter.sadowski@hawaii.edu

Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

A deep neural network is trained to predict sea surface temperature variations at two important regions of the Atlantic Ocean, using 800 years of simulated climate dynamics based on first-principles physics models. The model is then tested against 60 years of historical data. Our statistical model learns to approximate the physical laws governing the simulation, providing a significant improvement over simple statistical forecasts, and it is comparable to most state-of-the-art dynamical/conventional forecast models at a fraction of the computational cost.

Introduction

General circulation models (GCMs) describe the time evolution of the atmosphere or ocean using mathematical models of fluids and thermodynamics. These models are good at predicting climate variations in the Pacific Ocean, such as the El Niño–Southern Oscillation (ENSO), but the same models perform poorly in predicting an analogous climate pattern in the Atlantic Ocean. Indeed, one of the most successful approaches to predicting short-term (1-6 month) climate variability in the Atlantic is a simple "damped persistence" model, i.e. the prediction that the seasonal climate anomaly will remain constant, with a regression (damping) toward the mean.

Figure 1: Example heat map of sea surface temperature anomalies for a single month. Physics-based models can forecast variations due to ENSO in the Pacific (center region), but perform poorly in two regions of interest in the Atlantic: the northern tropical Atlantic (NTA; 40°W-10°W, 10°N-20°N) and the eastern equatorial Atlantic (ATL3; 20°W-0°, 3°S-3°N).

Data-driven machine learning methods take a different approach to climate forecasting. Rather than integrating the physics equations forward in time, machine learning attempts to learn emergent patterns from data, sacrificing the interpretability and robustness of first principles in favor of black-box statistical models. When trained on real data, these models can capture deficiencies in the physical models. When trained on simulation data, they can provide a fast approximation to computationally expensive simulations. Deep learning with artificial neural networks, a machine learning approach that is particularly well suited to high-dimensional data, has recently shown promise in modeling a variety of fluid flow processes (Wang et al. 2019; de Bezenac, Pajot, and Gallinari 2019; Ham, Kim, and Luo 2019).

In this work we apply deep learning to the challenging task of predicting sea surface temperature (SST) anomalies in two particular regions of the Atlantic (Figure 1) where GCMs are known to perform relatively poorly: the eastern equatorial Atlantic (ATL3), which is subject to pronounced warm and cold events lasting 3-6 months, and the northern tropical Atlantic (NTA). Deep learning methods require large data sets for training, and we use simulated climate processes from Version 2 of the Canadian Earth System Model (CanESM2). The dynamical core of this climate model is based on the first-principles Navier-Stokes equations for fluid dynamics, with some unresolved processes such as convection and turbulence represented through parameterization schemes. The latter introduce a few free parameters that are tuned to observational data. This tuning, however, only concerns the mean statistics of the model output and does not provide any information that would allow the model to forecast particular climate events. Running this
model forward in time produces simulated climate cycles that exhibit a range of fluctuations under steady radiative forcing. We use this to test the hypothesis that a deep learning model trained on GCM simulations can provide a fast approximation to GCM-based forecasts, and whether such a model performs better than simple persistence forecast models.

Methods

Data

The training data consist of an 800-year time series from CanESM2 simulations, represented as a sequence of one-month time steps. The first 600 years are used for training, years 601-700 are used for early stopping and hyper-parameter tuning, and years 701-800 are used as a clean test set for evaluation. After hyper-parameter optimization, a final model is trained on the first 700 years, with the final 100 years used for early stopping, and we evaluate performance on historical SST anomaly data from the years 1958-2017, pre-processed by subtracting the linear climate change trend line.

The CanESM2 data are represented on a grid of 128 longitudinal steps (ranging from 180°W-180°E) and 22 latitudinal steps (ranging from 30°S-30°N). A mask is applied to cells that do not consist entirely of open ocean. For each unmasked cell we have the sea surface temperature anomaly, the surface wind stress decomposed into longitudinal and latitudinal components τu and τv, and the depth of the 20-degree-Celsius isotherm z20, which essentially measures the upper-ocean heat content. The data are normalized by mean-subtraction and scaling by the standard deviation, with the mean and variance of each feature calculated over all grid cells over the entire data set. For masked cells the values of all input features are filled with zeroes; predictions at these cells do not contribute to the loss.

Deep Learning

The deep learning approach can leverage global information to predict SST at any particular location. However, limiting the information to a local region is advantageous because it helps prevent overfitting. The size of this "receptive field" is optimized during hyper-parameter selection.

In our experiments, the neural network architecture takes in a (128+k)×(22+k)×T×4 tensor, where k is the kernel size and T is the number of months to consider when making predictions. The T×4 input features at each grid cell are concatenated and treated as input channels. The model consists of a sequence of 2D convolutional layers, with skip connections concatenating the input SST values to the penultimate layer (similar to the widely used U-net architectures (Ronneberger, Fischer, and Brox 2015)) and adding them to the linear output layer (as in a ResNet (He et al. 2016)). The objective is the mean squared error (MSE) loss computed over the non-masked grid cells.

Hyper-parameters were optimized using the Bayesian optimization algorithm implemented in the SHERPA black-box optimization framework (Hertel et al. 2018). A total of 400 neural networks were trained, optimizing over the search space shown in Table 1. The best model used the maximum number of hidden layers (twelve) in our hyper-parameter search space. Many of the models overfit the data set, and regularization was important: the best model used a small batch size, a small kernel size, a small number of input timesteps, and a small channel size.

Hyper-parameter           Range                  Best
Input timesteps           1 - 12                 2
Convolution layers        4 - 12                 12
Convolution channels      20 - 160               49
Kernel shape              (3,3), (5,5), (7,7)    (3,3)
Initial learning rate     0.0001 - 0.1           0.0001
Batch size                1, 2, 4                1
Early stopping patience   1 - 10                 10

Table 1: Hyper-parameter search space explored and optimal values selected by SHERPA after training 100 neural networks using Bayesian optimization. Kernel shape and batch size were treated as categorical variables with the choices listed.

We tried four other modifications that did not improve performance on the GCM validation set and so were not used in the final model: (1) using locally connected layers instead of convolutional layers; (2) passing the land-mass mask as an input instead of zero-filling; (3) including the month as an extra input channel; and (4) dropping the z20 input channel.

Results

The model is trained to make predictions for the entire CanESM2 grid, but we focus our analysis on the NTA and ATL3 regions. In order to evaluate the generalization from simulation to observed data, we evaluate performance on both (1) the final 100 years of the CanESM2 simulation and (2) the de-trended historical data. In both test sets and both regions, the NN predictions beat the persistence model for lead times of 1-6 months (Figure 2). There is a significant increase in RMSE when transferring the model from the simulation data it was trained on to the historical data, confirming that the simulations are only an imperfect approximation to the real system, but the NN maintains its performance advantage.

                  NTA                ATL3
                  GCM   Historical   GCM   Historical
Persistence       0.27  0.50         0.35  0.51
Deep learning     0.23  0.41         0.26  0.43

Table 2: RMSE on the GCM simulation and historical test sets, in degrees Celsius, averaged over a six-month lead time.

In the NTA, the NN predictions also beat the damped persistence model on both test sets. Figure 2 breaks down the performance on the 1958-2017 historical data by lead time for predictions made on February 1st of each year, showing that the forecasting ability degrades with longer lead times (i.e. farther into the future). However, the NN is no better than the damped persistence approach on the historical ATL3 data (Figure 3), reflecting the challenge of modeling this region.

Figure 2: Anomaly correlation coefficient (ACC) and RMSE in degrees Celsius of the deep learning approach vs. the persistence model and damped persistence model on the northern tropical Atlantic from 1958-2017. The climate change trend has been removed from the data. All predictions are made from February 1st.

Figure 3: Anomaly correlation coefficient (ACC) and RMSE in degrees Celsius of the deep learning approach vs. the persistence model and damped persistence model on the equatorial Atlantic from 1958-2017. The climate change trend has been removed from the data. All predictions are made from February 1st.

Figure 4 compares the sea surface temperature prediction skill of the NTA model with a range of other approaches. In addition to the persistence forecast, we compare against a linear inverse model (LIM) and GCM-based predictions. Linear inverse modeling is a technique that assumes the evolution of a system can be approximated by a linear operator with white-noise forcing. In practice, the linear operator is typically calculated in principal component space using multivariate regression at a fixed time lag (Penland and Sardeshmukh 1995). LIMs are usually derived from observational data, but here we use a LIM derived from the output of the CanESM2 GCM. The other forecast models are GCM-based, i.e. they use complex atmosphere-ocean models initialized with observations to predict the evolution of the system. The GCM forecast models include SINTEX-F, a prediction model used at the Japan Agency for Marine-Earth Science and Technology (Luo et al. 2005), and 8 models from various forecast centers that participated in the Climate-system Historical Forecast Project (Tompkins et al. 2017); see also Kirtman and Pirani (2009). These GCM forecast models were selected to illustrate the performance of complex prediction systems. The performance of the NN is competitive with these state-of-the-art methods.

Figure 4: Comparison of the deep neural network model (NN CanESM2, blue line) and dynamical models from the CHFP intercomparison for the years 1982-2010 in the NTA region. The performance of the linear inverse model trained on CanESM2 output (LIM CanESM2) is shown in yellow-green. All predictions are initialized on February 1. The metric shown is the anomaly correlation coefficient (ACC), which is calculated as the correlation of predicted and observed deviations (anomalies) from the mean climate.

Conclusion

We demonstrate the use of deep learning for forecasting monthly sea surface temperature variations in the Atlantic Ocean with a lead time of 1-6 months, a problem known to be significantly harder than forecasting ENSO in the Pacific. Training on CanESM2 climate model data and testing on historical data, the deep learning approach performs as well as the best GCM physics models on the northern tropical Atlantic region with much less computation. However, on the equatorial Atlantic, our model does no better than a simple damped persistence model.

In this work we restricted ourselves to training only on GCM simulation data at a fixed grid size, and thus we only expect the model to perform as well as the simulation it was trained on. We expect the NN approach to do better if it is given a chance to learn from historical data, since it could then learn to correct for deficiencies in the GCM. Fine-tuning the model on historical data is an opportunity for future work, although there is a significant danger of overfitting given the limited amount of historical data.

Acknowledgments

The authors would like to thank NVIDIA for a hardware grant to PS, and the University of Hawai‘i Information Technology Services Cyberinfrastructure for technical support and advanced computing resources. The authors acknowledge the WCRP/CLIVAR Working Group on Seasonal to Interannual Prediction (WGSIP) for establishing the Climate-system Historical Forecast Project (CHFP; see Kirtman and Pirani 2009) and the Centro de Investigaciones del Mar y la Atmosfera (CIMA) for providing the model output at http://chfps.cima.fcen.uba.ar/. We also thank the data providers for making the model output available through the CHFP.

References

de Bezenac, E.; Pajot, A.; and Gallinari, P. 2019. Deep learning for physical processes: Incorporating prior scientific knowledge. Journal of Statistical Mechanics: Theory and Experiment 2019(12):124009.

Ham, Y.-G.; Kim, J.-H.; and Luo, J.-J. 2019. Deep learning for multi-year ENSO forecasts. Nature 573(7775):568–572.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Hertel, L.; Collado, J.; Sadowski, P.; and Baldi, P. 2018. Sherpa: Hyperparameter optimization for machine learning models.

Kirtman, B., and Pirani, A. 2009. The state of the art of seasonal prediction: Outcomes and recommendations from the First World Climate Research Program Workshop on Seasonal Prediction. Bulletin of the American Meteorological Society.

Luo, J.-J.; Masson, S.; Behera, S.; Shingu, S.; and Yamagata, T. 2005. Seasonal climate predictability in a coupled OAGCM using a different approach for ensemble forecasts. Journal of Climate 18(21):4474–4497.

Penland, C., and Sardeshmukh, P. D. 1995. The optimal growth of tropical sea surface temperature anomalies. Journal of Climate 8(8):1999–2024.

Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 234–241. Springer.

Tompkins, A. M.; Ortiz De Zárate, M. I.; Saurral, R. I.; Vera, C.; Saulo, C.; Merryfield, W. J.; Sigmond, M.; Lee, W.-S.; Baehr, J.; Braun, A.; et al. 2017. The Climate-system Historical Forecast Project: Providing open access to seasonal forecast ensembles from centers around the globe. Bulletin of the American Meteorological Society 98(11):2293–2301.

Wang, R.; Kashinath, K.; Mustafa, M.; Albert, A.; and Yu, R. 2019. Towards physics-informed deep learning for turbulent flow prediction. arXiv preprint arXiv:1911.08655.
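The normalization and masked-loss scheme described in the Data and Deep Learning sections can be illustrated with a minimal sketch. This is not the paper's code: the function names and the toy 1-D data are ours, but the logic matches the description (statistics computed over unmasked cells, land cells zero-filled on input and excluded from the MSE).

```python
# Illustrative sketch (not the paper's code) of the paper's normalization and
# masked MSE objective. mask[i] is True for open-ocean cells, False for land.

def normalize(values, mask):
    """Mean-subtract and scale by the standard deviation over unmasked cells;
    masked (land) cells are filled with zeroes, as in the paper."""
    ocean = [v for v, m in zip(values, mask) if m]
    mean = sum(ocean) / len(ocean)
    std = (sum((v - mean) ** 2 for v in ocean) / len(ocean)) ** 0.5
    return [(v - mean) / std if m else 0.0 for v, m in zip(values, mask)]

def masked_mse(pred, target, mask):
    """Mean squared error computed over non-masked grid cells only."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs)

sst = [1.0, 2.0, 3.0, 99.0]          # toy row; last cell is land, value ignored
mask = [True, True, True, False]
x = normalize(sst, mask)
print(masked_mse([0.0, 0.0, 0.0, 0.0], x, mask))   # approximately 1.0
```

Because the land cell is excluded from both the statistics and the loss, its (arbitrary) value of 99.0 has no effect on either the normalized field or the error.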
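The "damped persistence" baseline used throughout the paper can be sketched in a few lines. This is a plain-Python illustration under our own naming, not the authors' implementation: the forecast is the current anomaly multiplied by a lag-regression (damping) coefficient estimated from a training series, so the anomaly relaxes toward the climatological mean of zero.

```python
# Illustrative sketch (not the paper's code) of a damped persistence forecast:
# the seasonal anomaly persists, damped toward zero by the lag-tau regression
# coefficient fitted to historical anomalies.

def damping_factor(series, tau):
    """Least-squares coefficient r(tau) regressing x(t+tau) on x(t)."""
    pairs = list(zip(series[:-tau], series[tau:]))
    return sum(a * b for a, b in pairs) / sum(a * a for a, _ in pairs)

def damped_persistence_forecast(anomaly_now, series, tau):
    """Forecast tau steps ahead: damp the current anomaly toward the mean."""
    return damping_factor(series, tau) * anomaly_now

history = [1.0, 0.5, 0.25, 0.125, 0.0625]   # toy anomaly series decaying by 0.5
print(damped_persistence_forecast(0.8, history, 1))   # 0.5 * 0.8 = 0.4
```

Setting the damping factor to 1 recovers the plain persistence forecast, which is the other baseline reported in Tables and Figures 2-4.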
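The anomaly correlation coefficient reported in Figures 2-4 is, per the Figure 4 caption, the correlation of predicted and observed deviations from the mean climate. A minimal sketch (our own plain-Python version, computing a Pearson correlation over paired anomalies):

```python
# Illustrative sketch of the anomaly correlation coefficient (ACC): the
# Pearson correlation between predicted and observed anomalies.

def acc(pred_anoms, obs_anoms):
    mp = sum(pred_anoms) / len(pred_anoms)
    mo = sum(obs_anoms) / len(obs_anoms)
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred_anoms, obs_anoms))
    var_p = sum((p - mp) ** 2 for p in pred_anoms)
    var_o = sum((o - mo) ** 2 for o in obs_anoms)
    return cov / (var_p * var_o) ** 0.5

# A forecast proportional to the observations scores an ACC of (about) 1,
# even though its RMSE is nonzero; this is why the paper reports both metrics.
print(acc([0.1, -0.2, 0.3], [0.2, -0.4, 0.6]))
```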