A Variational U-Net for Weather Forecasting

Pak Hay Kwok¹, Qi Qi²
¹ pak_hay_kwok@hotmail.com
² qiq208@gmail.com

CDCEO 2021: 1st Workshop on Complex Data Challenges in Earth Observation, November 1, 2021, Virtual

Abstract
Not only can discovering patterns and insights from atmospheric data enable more accurate weather predictions, but it may also provide valuable information to help tackle climate change. Weather4cast is an open competition that aims to evaluate machine learning algorithms' capability to predict future atmospheric states. Here, we describe our third-place solution to Weather4cast. We present a novel Variational U-Net that combines a Variational Autoencoder's ability to consider the probabilistic nature of data with a U-Net's ability to recover fine-grained details. This solution is an evolution of our fourth-place solution to Traffic4cast 2020 with many commonalities, suggesting its applicability to vastly different domains, such as weather and traffic. The code for this solution is available at https://github.com/qiq208/weather4cast2021_Stage1

Keywords
IARAI, Traffic4cast, Weather4cast, U-Net, Variational Autoencoder

1. Introduction

Meteorological satellites around the globe are constantly gathering a trove of data about the atmosphere. However, the high-dimensional nature of atmospheric data makes it challenging to analyse, hindering the discovery of valuable insights. With the advent of machine learning, it is believed such methods can help better understand atmospheric data. To evaluate their applicability, Weather4cast [1] by the Institute of Advanced Research in Artificial Intelligence (IARAI) is an open competition that challenges its participants to develop algorithms to predict the future states of the atmosphere over specific regions.

The Weather4cast dataset [2] is obtained from Meteosat geostationary meteorological satellites operated by EUMETSAT for the period from February 2019 to February 2021. The Meteosat images are processed by NWC SAF software into weather products. The weather products of interest are: Cloud Top Temperature and Height (CTTH), Convective Rainfall Rate (CRR), Automatic Satellite Image Interpretation - Tropopause Folding detection (ASII-TF), Cloud Mask (CMA), and Cloud Type (CT). Each of these weather products is recorded in 15-minute intervals and consists of multiple channels. Each channel is an image of shape 256x256 pixels, with each pixel covering an area of about 4x4 km. The regions of interest are illustrated in Figure 1; regions R1-3 correspond to the core challenge, in which training, validation and test data are provided, while regions R4-6 correspond to the transfer learning challenge, in which only the test data are provided. In addition, static information, such as altitude, latitude and longitude, is also given for all regions.

Figure 1: Weather4cast regions

Weather4cast demands an algorithm that can return the atmospheric states over the defined regions for the next 8 hours (32 consecutive 15-minute intervals) given an hour (4 consecutive 15-minute intervals) of data. While only 4 target variables are required, namely temperature (a channel of CTTH), crr_intensity (a channel of CRR), asii_turb_trop_prob (a channel of ASII-TF) and cma (a channel of CMA), any channels of the weather products or static information of the regions can be used as input variables.

This work describes a novel Variational U-Net solution which achieved third place in both the core and transfer learning challenges of Weather4cast. This Variational U-Net can be viewed as a U-Net with a Variational Autoencoder (VAE) style bottleneck, or as a VAE with U-Net style skip connections. The intuition behind this architecture is to combine a VAE's ability to consider the probabilistic nature of data with a U-Net's ability to recover fine-grained details.
2. Related work

Weather4cast can be viewed as a video frame prediction problem, in which the inputs are the first 4 frames of a video and the outputs are the subsequent 32 frames. This format is identical to that of Traffic4cast [3, 4]. Overlooking the difference in domains between Weather4cast and Traffic4cast, the two competitions can be considered the same, hence solutions for Traffic4cast should be somewhat transferable to Weather4cast. A range of algorithms, including U-Nets, LSTMs and Graph Neural Networks, was proposed for Traffic4cast [5, 6], yet various flavours of U-Net dominated the competition in both 2019 and 2020, with all winning teams adopting U-Nets in their final solutions [5, 7]. Thus, it is sensible to consider U-Net-based solutions for Weather4cast.

While the formats of Weather4cast and Traffic4cast are equivalent, the differences in the underlying domains cannot be ignored. Specifically, weather is considered more random than traffic. Multiple scenarios are possible given a set of observations, and this inherent randomness needs particular attention, as it is not compatible with the deterministic nature of a typical U-Net. Segmentation of medical images also suffers from intrinsic ambiguities. To handle these ambiguities, Kohl et al. [8] proposed a Probabilistic U-Net, a combination of a U-Net with a conditional VAE, capable of producing an unlimited number of hypotheses from a set of inputs. Myronenko [9] proposed a different way to combine a U-Net with a VAE, in which a VAE was applied to regularise a shared encoder. His solution was proven successful and won first place in the Multimodal Brain Tumour Segmentation Challenge (BraTS) in 2018.

3. Methods

3.1. Model architecture

Given the similarities between Weather4cast and Traffic4cast, the main structure of the proposed Variational U-Net largely resembles the authors' fourth-place solution to last year's Traffic4cast [5]. The encoder is made up of Dense Blocks connected by 2D Max Pooling. Each Dense Block consists of 4 repeats of 2D Convolution, ELU [10], Group Normalisation [11] and 2D Dropout [12], followed by another 2D Convolution and ELU. Different to the encoder, the decoder consists of repeats of 2D Transposed Convolution, ELU, 2D Convolution, ELU, Group Normalisation and 2D Dropout. The encoder and the decoder are joined by skip connections.

Inspired by the works of Kohl et al. [8] and Myronenko [9], the bottleneck of the Variational U-Net, the part which connects the end of the encoder to the start of the decoder, is replaced with one that is typically found in a VAE. At the end of the encoder, the input is reduced to 2 vectors of size 512, representing the means and standard deviations of the latent variables. With the assumption that the latent variables are Gaussian, a sample is drawn, and the drawn vector is reconstructed into an image which is then passed through the decoder. The architecture of the Variational U-Net is shown in Figure 2, and a minimal sketch of the bottleneck is given below.

Figure 2: Variational U-Net architecture
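The following is a minimal sketch of such a VAE-style bottleneck, assuming a PyTorch implementation; the class name, layer choices and variable names are illustrative and not taken from the competition code, with only the latent dimension of 512 coming from the text above.

```python
import torch
import torch.nn as nn

class VariationalBottleneck(nn.Module):
    """Sketch of a VAE-style bottleneck: encoder features -> latent sample -> decoder features."""

    def __init__(self, channels: int, spatial: int, latent_dim: int = 512):
        super().__init__()
        flat = channels * spatial * spatial
        self.to_mu = nn.Linear(flat, latent_dim)       # means of the latent variables
        self.to_logvar = nn.Linear(flat, latent_dim)   # log-variances (numerically safer than raw sigma)
        self.to_decoder = nn.Linear(latent_dim, flat)  # rebuild a feature map for the decoder
        self.shape = (channels, spatial, spatial)

    def forward(self, x: torch.Tensor):
        flat = x.flatten(1)
        mu = self.to_mu(flat)
        sigma = torch.exp(0.5 * self.to_logvar(flat))
        # Reparameterisation trick: draw from N(mu, sigma^2) while keeping gradients w.r.t. mu and sigma.
        z = mu + sigma * torch.randn_like(sigma)
        out = self.to_decoder(z).view(x.size(0), *self.shape)
        return out, mu, sigma  # mu and sigma feed the KL term of the loss (Section 3.3)
```

Returning mu and sigma alongside the decoded feature map makes it straightforward to add the KL term described in Section 3.3, and sampling z repeatedly yields multiple hypotheses for the same input.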
3.2. Inputs and target variables

Similar to the authors' Traffic4cast solution [5], the temporal dimension of the input tensor is combined with the channel dimension, resulting in 4*8 input channels. Furthermore, since it seems intuitive that weather patterns depend on geographical location, the static features of altitude, latitude and longitude are appended, resulting in an additional 3 input channels. As such, the final number of input channels to the Variational U-Net is 4*8+3=35. The model is designed to predict all 32 output frames in one go, resulting in 32*4=128 output channels. Any missing data is zero-filled. The sketch after the tables below illustrates this channel stacking.

A series of experiments was performed to find the most effective set of input features, and the validation set was used to evaluate the performance of each feature set. The resulting input feature set is listed in Table 1, and those rejected are summarised in Table 2.

Table 1
Summary of input features and target variables

Feature             | Target variable | Description
temperature         | Yes | Combined cloud top and ground temperature
ctth_pres           | No  | Cloud top pressure
crr_intensity       | Yes | Convective rainfall rate intensity in mm/h
crr_accum           | No  | Convective rainfall rate hourly accumulations
asii_turb_trop_prob | Yes | Probability of occurrence of tropopause folding
cma                 | Yes | Cloud mask
ct                  | No  | Cloud type
ctth_tempe mask     | No  | A mask showing pixel locations containing cloud top temperature measurements

Table 2
Summary of input features not used in the final model

Feature                             | Description
ctth_alt                            | Cloud top altitude
Linear interpolation of temperature | Using linear interpolation to fill in missing temperature
Linear interpolation of ctth_pres   | Using linear interpolation to fill in missing ctth_pres
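To make the channel counts above concrete, the short sketch below (hypothetical tensor names, assuming PyTorch) folds the 4 input time steps into the channel dimension, appends the 3 static channels to obtain the 35 input channels, and reshapes the 128 output channels back into 32 frames of 4 target variables.

```python
import torch

# Hypothetical shapes: 4 time steps x 8 weather channels, 3 static channels, 256x256 pixels.
frames = torch.randn(2, 4, 8, 256, 256)   # (batch, time, channel, H, W)
static = torch.randn(2, 3, 256, 256)      # altitude, latitude, longitude

x = frames.flatten(1, 2)                  # (batch, 4*8, H, W): time folded into channels
x = torch.cat([x, static], dim=1)         # (batch, 35, H, W): 4*8 + 3 = 35 input channels

y_hat = torch.randn(2, 128, 256, 256)     # model output: 32 frames * 4 targets = 128 channels
y_hat = y_hat.view(2, 32, 4, 256, 256)    # back to (batch, time, target variable, H, W)
```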
3.3. Loss function

The loss function consists of 2 terms:

L = L_{L2} + 80 \, L_{KL}    (1)

L_{L2} is a modified mean squared error that takes into account missing values and the difference in scale of the 4 target variables:

L_{L2} = \frac{1}{32 \times 4} \sum_{t=1}^{32} \sum_{v \in V} \frac{w_v}{P_{t,v}} \sum_{p=1}^{P_{t,v}} \left( y_{t,v,p} - \hat{y}_{t,v,p} \right)^2    (2)

where V = {temperature, crr_intensity, cma, asii_turb_trop_prob}, P_{t,v} is the total number of non-missing pixels for a given target variable v at a given time t, and w_v is the target variable weighting:

w_v = \begin{cases} 31.610, & v = \text{temperature} \\ 4139.4, & v = \text{crr\_intensity} \\ 5.2191, & v = \text{cma} \\ 142.17, & v = \text{asii\_turb\_trop\_prob} \end{cases}

L_{KL} is the KL divergence between the estimated Gaussian distribution N(\mu, \sigma^2) and the prior distribution N(0, 1):

L_{KL} = \frac{1}{2} \sum_{i=1}^{512} \left( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \right)    (3)

The factor of 80 applied to L_{KL} in Equation 1 was determined empirically to balance the relative importance of the two terms in the loss function.
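Equations 1-3 translate directly into code. The sketch below assumes PyTorch and the (batch, time, variable, H, W) layout of the previous sketch, with a binary mask that is 1 where a pixel is observed; the function names are illustrative, and the reduction additionally averages over the batch.

```python
import torch

# Per-variable weights from Eq. (2): temperature, crr_intensity, cma, asii_turb_trop_prob.
W = torch.tensor([31.610, 4139.4, 5.2191, 142.17])

def weighted_masked_mse(y_hat, y, mask):
    """Eq. (2): tensors of shape (batch, 32, 4, H, W); mask is 1 for observed pixels, 0 otherwise."""
    se = mask * (y - y_hat) ** 2
    # Mean squared error over observed pixels for each (time, variable) pair ...
    per_tv = se.sum(dim=(-1, -2)) / mask.sum(dim=(-1, -2)).clamp(min=1)
    # ... weighted per variable and averaged over the 32*4 terms (and the batch).
    return (per_tv * W.view(1, 1, 4)).mean()

def kl_divergence(mu, sigma):
    """Eq. (3): KL between N(mu, sigma^2) and N(0, 1), summed over the 512 latent variables."""
    return 0.5 * (mu ** 2 + sigma ** 2 - torch.log(sigma ** 2) - 1).sum(dim=1).mean()

def total_loss(y_hat, y, mask, mu, sigma):
    """Eq. (1): weighted masked MSE plus the empirically scaled KL term."""
    return weighted_masked_mse(y_hat, y, mask) + 80.0 * kl_divergence(mu, sigma)
```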
3.4. Optimisation

The Variational U-Net is trained using the Adam optimiser with cyclic cosine annealing as described by Loshchilov and Hutter [13]. The training process is split into cycles, with each cycle consisting of 2 epochs. At the start of each cycle, the learning rate is set to a maximum of 2e-4 and is then reduced following a cosine annealing schedule. Resetting the learning rate at the beginning of each cycle perturbs the model and encourages it to explore different basins of attraction. Training is continued until an additional cycle fails to return a better validation score.

Using a batch size of 12, the final model was first trained for 6 cycles (12 epochs) on the training data, then it was further trained for an additional cycle (2 epochs) on both the training and validation data.
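This cycle schedule corresponds to cosine annealing with warm restarts; one way to approximate it with PyTorch's built-in scheduler is sketched below, where the model, the number of steps per epoch and the minimum learning rate are placeholders rather than values from the competition code.

```python
import torch

model = torch.nn.Conv2d(35, 128, 3, padding=1)            # placeholder for the Variational U-Net
optimiser = torch.optim.Adam(model.parameters(), lr=2e-4)  # maximum learning rate of 2e-4
steps_per_epoch = 1000                                      # placeholder; depends on dataset size and batch size (12)

# Restart the cosine schedule every cycle of 2 epochs, decaying from the maximum learning rate.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimiser, T_0=2 * steps_per_epoch, eta_min=1e-6)       # eta_min is an assumed floor

for step in range(6 * 2 * steps_per_epoch):                 # 6 cycles of 2 epochs each
    # ... forward pass, loss.backward(), optimiser.step(), optimiser.zero_grad() ...
    scheduler.step()                                         # advance the per-step cosine schedule
```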
3.5. Regularisation

From initial experiments, it became apparent that controlling overfitting of the model to the training data was key to success in both the core and transfer learning challenges. Hence, several regularisation strategies were employed. Within the model itself, the move from a traditional U-Net to the Variational U-Net, combined with the introduction of dropout layers throughout the encoder and decoder, aimed to improve the generalisation of the model. To expose the model to as much variation in input as possible, a single model was used for all regions in the competition and trained on all available training data. Furthermore, for the final leaderboard submission, the model was trained for another cycle on all the available validation data.

4. Results

The majority of experimentation on the design of features and model architecture was conducted on single regions to allow for quicker feedback and learning. However, the final model was trained on data from all regions, so there is a risk that some of the decisions made might not be optimal for a model trained on data from all regions. Results from the main experiments can be found in Appendix A.

Final experiments on all three regions were conducted, and models were evaluated on either the test leaderboard or the final leaderboard. It is worth noting that the test leaderboard allowed multiple submissions and was open up to the final week of the competition. In the final week, the final leaderboard was opened and competitors were only allowed three submissions. The results of the submissions can be found in Table 3.

The competition was judged on the final leaderboard scores, and the final model resulted in a third-place finish in both the core and transfer learning challenges. The training history of the final model is shown in Figure 3, highlighting the loss progression during both the normal training phase and the additional cycle of training on the validation data.

Table 3
Summary of leaderboard scores for final models

Model | Validation | Core: Test Leaderboard | Core: Final Leaderboard | Transfer: Test Leaderboard | Transfer: Final Leaderboard
Mean baseline | - | 0.8822 | - | - | -
IARAI U-Net baseline [2] | - | 0.6689 | - | 0.6111 | -
One model per region | - | 0.5095 | - | - | -
Single model | 0.3912 | 0.4977 | 0.5140 | 0.4878 | 0.4711
Single model + linear interpolation of temperature | 0.3887 | - | 0.5218 | - | -
Single model + training on validation data | - | - | 0.5102 | - | 0.4670

Figure 3: Training history of the final model

5. Discussion

Although various U-Net architectures were explored, it was interesting to observe that the final architecture was very similar to the architecture used for Traffic4cast [5]. The only changes were moving from average pooling to max pooling, the addition of dropout layers and the adoption of the VAE-style bottleneck. The authors would be interested in exploring whether these improvements would also read back across to the traffic prediction task.

In terms of feature engineering, the experiments showed that the inclusion of some extra features (e.g. cloud top pressure) improved predictive capability, whereas others (e.g. cloud top altitude) did not. It was found that linearly interpolating temperature improved the validation score; however, this did not read across to the final leaderboard score. The authors still believe that strategies for imputing missing data are an interesting area for further work.

Perhaps most surprising was the benefit gained from training a single model on data from all regions instead of individual models for each region. The model trained on all regions displayed a significant improvement in the test leaderboard score (~2.3%) over the individually trained models. This finding suggests that the model may continue to improve its general predictive ability for any region with the addition of more training data. This hypothesis was further supported by the observation that training on the validation data further improved the final leaderboard score for both the core and transfer learning challenges.

6. Conclusion

Weather4cast provided the opportunity to explore the use of machine learning techniques on the age-old problem of weather forecasting. Furthermore, the similarity of format to Traffic4cast also provided the chance to investigate how transferable machine learning models can be across vastly different domains. After experimenting with various U-Net architectures, the final model was very similar to the authors' Traffic4cast model, the main differences being changes to suppress overfitting, i.e. the move to the Variational U-Net and the inclusion of dropout layers throughout. The authors also found that training a single model on data from all regions outperformed training individual models on each region for both the core and transfer learning challenges. This suggests that the model's predictions for all regions can be improved by training on more data.

References

[1] IARAI, Weather4cast: Multi-sensor weather forecast competition, 2021. URL: https://www.iarai.ac.at/weather4cast.
[2] IARAI, Weather4cast: Multi-sensor weather forecasting competition & benchmark dataset, 2021. URL: https://github.com/iarai/weather4cast.
[3] D. Kreil, M. Kopp, D. Jonietz, M. Neun, A. Gruca, P. Herruzo, H. Martin, A. Soleymani, S. Hochreiter, The surprising efficiency of framing geo-spatial time series forecasting as a video prediction task - insights from the IARAI Traffic4cast competition at NeurIPS 2019, in: NeurIPS 2019 Competition and Demonstration Track, PMLR, 2020, pp. 232-241.
[4] M. Kopp, D. Kreil, M. Neun, D. Jonietz, H. Martin, P. Herruzo, A. Gruca, A. Soleymani, F. Wu, Y. Liu, et al., Traffic4cast at NeurIPS 2020 - yet more on the unreasonable effectiveness of gridded geo-spatial processes, in: NeurIPS 2020 Competition and Demonstration Track, PMLR, 2021, pp. 325-343.
[5] Q. Qi, P. H. Kwok, Traffic4cast 2020 - Graph Ensemble Net and the importance of feature and loss function design for traffic prediction, arXiv preprint arXiv:2012.02115 (2020).
[6] H. Martin, Y. Hong, D. Bucher, C. Rupprecht, R. Buffat, Traffic4cast - traffic map movie forecasting - team MIE-Lab, arXiv preprint arXiv:1910.13824 (2019).
[7] S. Choi, Utilizing UNet for the future traffic map prediction task - Traffic4cast challenge 2020, arXiv preprint arXiv:2012.00125 (2020).
[8] S. A. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. H. Maier-Hein, S. Eslami, D. J. Rezende, O. Ronneberger, A probabilistic U-Net for segmentation of ambiguous images, arXiv preprint arXiv:1806.05034 (2018).
[9] A. Myronenko, 3D MRI brain tumor segmentation using autoencoder regularization, in: International MICCAI Brainlesion Workshop, Springer, 2018, pp. 311-320.
[10] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), arXiv preprint arXiv:1511.07289 (2015).
[11] Y. Wu, K. He, Group normalization, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3-19.
[12] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler, Efficient object localization using convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 648-656.
[13] I. Loshchilov, F. Hutter, SGDR: Stochastic gradient descent with warm restarts, arXiv preprint arXiv:1608.03983 (2016).

A. Experiments on R1

Table A1 details some of the experiments done on R1 to explore which input features should be included in the final model. All these experiments were done using the training and validation data provided. The underlying assumption was that the results from these experiments would read across to the final leaderboard.

Table A1
Summary of experimental results on R1

Experiment              | Base   | 1      | 2      | 3      | 4
ctth_pres               | -      | -      | Yes    | Yes    | Yes
crr_accum               | -      | Yes    | Yes    | Yes    | Yes
ct                      | -      | -      | Yes    | Yes    | Yes
ctth_tempe mask         | -      | -      | -      | Yes    | Yes
ctth_alt                | -      | -      | Yes    | -      | -
Interpolated ctth_tempe | -      | -      | -      | -      | Yes
Epoch                   | 20     | 27     | 32     | 24     | 20
Training score          | 0.2247 | 0.2155 | 0.2091 | 0.2087 | 0.2229
Validation score        | 0.1933 | 0.1935 | 0.1894 | 0.1879 | 0.1889