A Variational U-Net for Weather Forecasting

Pak Hay Kwok¹, Qi Qi²
¹ pak_hay_kwok@hotmail.com
² qiq208@gmail.com

CDCEO 2021: 1st Workshop on Complex Data Challenges in Earth Observation, November 1, 2021, Virtual

Abstract
Not only can discovering patterns and insights from atmospheric data enable more accurate weather predictions, but it may also provide valuable information to help tackle climate change. Weather4cast is an open competition that aims to evaluate machine learning algorithms' capability to predict future atmospheric states. Here, we describe our third-place solution to Weather4cast. We present a novel Variational U-Net that combines a Variational Autoencoder's ability to consider the probabilistic nature of data with a U-Net's ability to recover fine-grained details. This solution is an evolution of our fourth-place solution to Traffic4cast 2020 with many commonalities, suggesting its applicability to vastly different domains, such as weather and traffic. The code for this solution is available at https://github.com/qiq208/weather4cast2021_Stage1

Keywords
IARAI, Traffic4cast, Weather4cast, U-Net, Variational Autoencoder

1. Introduction

Meteorological satellites around the globe are constantly gathering a trove of data about the atmosphere. However, the high-dimensional nature of atmospheric data makes it challenging to analyse, hindering the discovery of valuable insights. With the advent of machine learning, it is believed such methods can help better understand atmospheric data. To evaluate their applicability, Weather4cast [1] by the Institute of Advanced Research in Artificial Intelligence (IARAI) is an open competition that challenges its participants to develop algorithms to predict the future states of the atmosphere over specific regions.

The Weather4cast dataset [2] is obtained from Meteosat geostationary meteorological satellites operated by EUMETSAT for the period from February 2019 to February 2021. The Meteosat images are processed by NWC SAF software into weather products. The weather products of interest are: Cloud Top Temperature and Height (CTTH), Convective Rainfall Rate (CRR), Automatic Satellite Image Interpretation - Tropopause Folding detection (ASII-TF), Cloud Mask (CMA), and Cloud Type (CT). Each of these weather products is recorded in 15-minute intervals and consists of multiple channels. Each channel is an image of shape 256x256 pixels, with each pixel covering an area of about 4x4 km. The regions of interest are illustrated in Figure 1; regions R1-3 correspond to the core challenge, in which training, validation and test data are provided, while regions R4-6 correspond to the transfer learning challenge, in which only the test data are provided. In addition, static information, such as altitude, latitude and longitude, is also given for all regions.

Figure 1: Weather4cast regions

Weather4cast demands an algorithm that can return the atmospheric states over the defined regions for the next 8 hours (32 consecutive 15-minute intervals) given an hour (4 consecutive 15-minute intervals) of data. While only 4 target variables are required, namely temperature (a channel of CTTH), crr_intensity (a channel of CRR), asii_turb_trop_prob (a channel of ASII-TF) and cma (a channel of CMA), any channels of the weather products or static information of the regions can be used as input variables.

This work describes a novel Variational U-Net solution which achieved third place in both the core and transfer learning challenges of Weather4cast. This Variational U-Net can be viewed as a U-Net with a Variational Autoencoder (VAE) style bottleneck, or as a VAE with U-Net style skip connections. The intuition behind this architecture is to combine a VAE's ability to consider the probabilistic nature of data with a U-Net's ability to recover fine-grained details.
2. Related work

Weather4cast can be viewed as a video frame prediction problem, in which the inputs are the first 4 frames of a video and the outputs are the subsequent 32 frames. This format is identical to that of Traffic4cast [3, 4]. Overlooking the difference in domains between Weather4cast and Traffic4cast, the two competitions can be considered the same, hence solutions for Traffic4cast should be somewhat transferable to Weather4cast. A range of algorithms, including U-Nets, LSTMs and Graph Neural Networks, was proposed for Traffic4cast [5, 6], yet various flavours of U-Net dominated the competition in both 2019 and 2020, with all winning teams adopting U-Nets in their final solutions [5, 7]. Thus, it is sensible to consider U-Net-based solutions for Weather4cast.

While the formats of Weather4cast and Traffic4cast are equivalent, the differences in the underlying domains cannot be ignored. Specifically, weather is considered more random than traffic. Multiple scenarios are possible given a set of observations, and this inherent randomness needs particular attention, as it is not compatible with the deterministic nature of a typical U-Net. Segmentation of medical images also suffers from intrinsic ambiguities. To handle these ambiguities, Kohl et al. [8] proposed a Probabilistic U-Net, a combination of a U-Net with a conditional VAE, capable of producing an unlimited number of hypotheses from a set of inputs. Myronenko [9] proposed a different way to combine a U-Net with a VAE, in which a VAE was applied to regularise a shared encoder. His solution was proven successful and won first place in the Multimodal Brain Tumour Segmentation Challenge (BraTS) in 2018.

3. Methods

3.1. Model architecture

Given the similarities between Weather4cast and Traffic4cast, the main structure of the proposed Variational U-Net largely resembles the authors' fourth-place solution to last year's Traffic4cast [5]. The encoder is made up of Dense Blocks connected by 2D Max Pooling. Each Dense Block consists of 4 repeats of 2D Convolution, ELU [10], Group Normalisation [11] and 2D Dropout [12], followed by another 2D Convolution and ELU. Different to the encoder, the decoder consists of repeats of 2D Transposed Convolution, ELU, 2D Convolution, ELU, Group Normalisation and 2D Dropout. The encoder and the decoder are joined by skip connections.

Inspired by the works of Kohl et al. [8] and Myronenko [9], the bottleneck of the Variational U-Net, the part which connects the end of the encoder to the start of the decoder, is replaced with one that is typically found in a VAE. At the end of the encoder, the input is reduced to 2 vectors of size 512, representing the means and standard deviations of the latent variables. With the assumption that the latent variables are Gaussian, a sample is drawn, and the drawn vector is reconstructed into an image which is then passed through the decoder. The architecture of the Variational U-Net is shown in Figure 2, and a minimal sketch of the bottleneck is given below.

Figure 2: Variational U-Net architecture
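The following is a minimal sketch of such a VAE-style bottleneck, assuming a PyTorch implementation; the class name, layer choices and variable names are illustrative and not taken from the competition code, with only the latent dimension of 512 coming from the text above.

```python
import torch
import torch.nn as nn

class VariationalBottleneck(nn.Module):
    """Sketch of a VAE-style bottleneck: encoder features -> latent sample -> decoder features."""

    def __init__(self, channels: int, spatial: int, latent_dim: int = 512):
        super().__init__()
        flat = channels * spatial * spatial
        self.to_mu = nn.Linear(flat, latent_dim)       # means of the latent variables
        self.to_logvar = nn.Linear(flat, latent_dim)   # log-variances (numerically safer than raw sigma)
        self.to_decoder = nn.Linear(latent_dim, flat)  # rebuild a feature map for the decoder
        self.shape = (channels, spatial, spatial)

    def forward(self, x: torch.Tensor):
        flat = x.flatten(1)
        mu = self.to_mu(flat)
        sigma = torch.exp(0.5 * self.to_logvar(flat))
        # Reparameterisation trick: draw from N(mu, sigma^2) while keeping gradients w.r.t. mu and sigma.
        z = mu + sigma * torch.randn_like(sigma)
        out = self.to_decoder(z).view(x.size(0), *self.shape)
        return out, mu, sigma  # mu and sigma feed the KL term of the loss (Section 3.3)
```

Returning mu and sigma alongside the decoded feature map makes it straightforward to add the KL term described in Section 3.3, and sampling z repeatedly yields multiple hypotheses for the same input.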
3.2. Inputs and target variables

Similar to the authors' Traffic4cast solution [5], the temporal dimension of the input tensor is combined with the channel dimension, resulting in 4*8 input channels. Furthermore, since it seems intuitive that weather patterns depend on geographical location, the static features of altitude, latitude and longitude are appended, resulting in an additional 3 input channels. As such, the final number of input channels to the Variational U-Net is 4*8+3=35. The model is designed to predict all 32 output frames in one go, resulting in 32*4=128 output channels. Any missing data is zero-filled. The sketch after the tables below illustrates this channel stacking.

A series of experiments was performed to find the most effective set of input features, and the validation set was used to evaluate the performance of each feature set. The resulting input feature set is listed in Table 1, and those rejected are summarised in Table 2.

Table 1
Summary of input features and target variables

Feature             | Target variable | Description
temperature         | Yes | Combined cloud top and ground temperature
ctth_pres           | No  | Cloud top pressure
crr_intensity       | Yes | Convective rainfall rate intensity in mm/h
crr_accum           | No  | Convective rainfall rate hourly accumulations
asii_turb_trop_prob | Yes | Probability of occurrence of tropopause folding
cma                 | Yes | Cloud mask
ct                  | No  | Cloud type
ctth_tempe mask     | No  | A mask showing pixel locations containing cloud top temperature measurements

Table 2
Summary of input features not used in the final model

Feature                             | Description
ctth_alt                            | Cloud top altitude
Linear interpolation of temperature | Using linear interpolation to fill in missing temperature
Linear interpolation of ctth_pres   | Using linear interpolation to fill in missing ctth_pres
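To make the channel counts above concrete, the short sketch below (hypothetical tensor names, assuming PyTorch) folds the 4 input time steps into the channel dimension, appends the 3 static channels to obtain the 35 input channels, and reshapes the 128 output channels back into 32 frames of 4 target variables.

```python
import torch

# Hypothetical shapes: 4 time steps x 8 weather channels, 3 static channels, 256x256 pixels.
frames = torch.randn(2, 4, 8, 256, 256)   # (batch, time, channel, H, W)
static = torch.randn(2, 3, 256, 256)      # altitude, latitude, longitude

x = frames.flatten(1, 2)                  # (batch, 4*8, H, W): time folded into channels
x = torch.cat([x, static], dim=1)         # (batch, 35, H, W): 4*8 + 3 = 35 input channels

y_hat = torch.randn(2, 128, 256, 256)     # model output: 32 frames * 4 targets = 128 channels
y_hat = y_hat.view(2, 32, 4, 256, 256)    # back to (batch, time, target variable, H, W)
```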
3.3. Loss function

The loss function consists of 2 terms:

L = L_{L2} + 80 \, L_{KL}    (1)

L_{L2} is a modified mean squared error that takes into account missing values and the difference in scale of the 4 target variables:

L_{L2} = \frac{1}{32 \times 4} \sum_{t=1}^{32} \sum_{v \in V} \frac{w_v}{P_{t,v}} \sum_{p=1}^{P_{t,v}} \left( y_{t,v,p} - \hat{y}_{t,v,p} \right)^2    (2)

where V = {temperature, crr_intensity, cma, asii_turb_trop_prob}, P_{t,v} is the total number of non-missing pixels for a given target variable v at a given time t, and w_v is the target variable weighting:

w_v = \begin{cases} 31.610, & v = \text{temperature} \\ 4139.4, & v = \text{crr\_intensity} \\ 5.2191, & v = \text{cma} \\ 142.17, & v = \text{asii\_turb\_trop\_prob} \end{cases}

L_{KL} is the KL divergence between the estimated Gaussian distribution N(\mu, \sigma^2) and the prior distribution N(0, 1):

L_{KL} = \frac{1}{2} \sum_{i=1}^{512} \left( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \right)    (3)

The factor of 80 applied to L_{KL} in Equation 1 was determined empirically to balance the relative importance of the two terms in the loss function.
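Equations 1-3 translate directly into code. The sketch below assumes PyTorch and the (batch, time, variable, H, W) layout of the previous sketch, with a binary mask that is 1 where a pixel is observed; the function names are illustrative, and the reduction additionally averages over the batch.

```python
import torch

# Per-variable weights from Eq. (2): temperature, crr_intensity, cma, asii_turb_trop_prob.
W = torch.tensor([31.610, 4139.4, 5.2191, 142.17])

def weighted_masked_mse(y_hat, y, mask):
    """Eq. (2): tensors of shape (batch, 32, 4, H, W); mask is 1 for observed pixels, 0 otherwise."""
    se = mask * (y - y_hat) ** 2
    # Mean squared error over observed pixels for each (time, variable) pair ...
    per_tv = se.sum(dim=(-1, -2)) / mask.sum(dim=(-1, -2)).clamp(min=1)
    # ... weighted per variable and averaged over the 32*4 terms (and the batch).
    return (per_tv * W.view(1, 1, 4)).mean()

def kl_divergence(mu, sigma):
    """Eq. (3): KL between N(mu, sigma^2) and N(0, 1), summed over the 512 latent variables."""
    return 0.5 * (mu ** 2 + sigma ** 2 - torch.log(sigma ** 2) - 1).sum(dim=1).mean()

def total_loss(y_hat, y, mask, mu, sigma):
    """Eq. (1): weighted masked MSE plus the empirically scaled KL term."""
    return weighted_masked_mse(y_hat, y, mask) + 80.0 * kl_divergence(mu, sigma)
```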
3.4. Optimisation

The Variational U-Net is trained using the Adam optimiser with cyclic cosine annealing as described by Loshchilov and Hutter [13]. The training process is split into cycles, with each cycle consisting of 2 epochs. At the start of each cycle, the learning rate is set to a maximum of 2e-4 and is then reduced following a cosine annealing schedule. Resetting the learning rate at the beginning of each cycle perturbs the model and encourages it to explore different basins of attraction. Training is continued until an additional cycle fails to return a better validation score.

Using a batch size of 12, the final model was first trained for 6 cycles (12 epochs) on the training data, then it was further trained for an additional cycle (2 epochs) on both the training and validation data.
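This cycle schedule corresponds to cosine annealing with warm restarts; one way to approximate it with PyTorch's built-in scheduler is sketched below, where the model, the number of steps per epoch and the minimum learning rate are placeholders rather than values from the competition code.

```python
import torch

model = torch.nn.Conv2d(35, 128, 3, padding=1)            # placeholder for the Variational U-Net
optimiser = torch.optim.Adam(model.parameters(), lr=2e-4)  # maximum learning rate of 2e-4
steps_per_epoch = 1000                                      # placeholder; depends on dataset size and batch size (12)

# Restart the cosine schedule every cycle of 2 epochs, decaying from the maximum learning rate.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimiser, T_0=2 * steps_per_epoch, eta_min=1e-6)       # eta_min is an assumed floor

for step in range(6 * 2 * steps_per_epoch):                 # 6 cycles of 2 epochs each
    # ... forward pass, loss.backward(), optimiser.step(), optimiser.zero_grad() ...
    scheduler.step()                                         # advance the per-step cosine schedule
```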
3.5. Regularisation

From initial experiments, it became apparent that controlling overfitting of the model to the training data was key to success in both the core and transfer learning challenges. Hence, several regularisation strategies were employed. Within the model itself, the move from a traditional U-Net to the Variational U-Net, combined with the introduction of dropout layers throughout the encoder and decoder, aimed to improve the generalisation of the model. To expose the model to as much variation in input as possible, a single model was used for all regions in the competition and trained on all available training data. Furthermore, for the final leaderboard submission, the model was trained for another cycle on all the available validation data.

4. Results

The majority of experimentation on the design of features and model architecture was conducted on single regions to allow for quicker feedback and learning. However, the final model was trained on data from all regions, so there is a risk that some of the decisions made might not be optimal for a model trained on data from all regions. Results from the main experiments can be found in Appendix A.

Final experiments on all three regions were conducted, and models were evaluated on either the test leaderboard or the final leaderboard. It is worth noting that the test leaderboard allowed multiple submissions and was open up to the final week of the competition. In the final week, the final leaderboard was opened and competitors were only allowed three submissions. The results of the submissions can be found in Table 3.

The competition was judged on the final leaderboard scores, and the final model resulted in a third-place finish in both the core and transfer learning challenges. The training history of the final model is shown in Figure 3, highlighting the loss progression during both the normal training phase and the additional cycle of training on the validation data.

Table 3
Summary of leaderboard scores for final models

Model | Validation | Core: Test Leaderboard | Core: Final Leaderboard | Transfer: Test Leaderboard | Transfer: Final Leaderboard
Mean baseline | - | 0.8822 | - | - | -
IARAI U-Net baseline [2] | - | 0.6689 | - | 0.6111 | -
One model per region | - | 0.5095 | - | - | -
Single model | 0.3912 | 0.4977 | 0.5140 | 0.4878 | 0.4711
Single model + linear interpolation of temperature | 0.3887 | - | 0.5218 | - | -
Single model + training on validation data | - | - | 0.5102 | - | 0.4670

Figure 3: Training history of the final model

5. Discussion

Although various U-Net architectures were explored, it was interesting to observe that the final architecture was very similar to the architecture used for Traffic4cast [5]. The only changes were moving from average pooling to max pooling, the addition of dropout layers and the adoption of the VAE-style bottleneck. The authors would be interested in exploring whether these improvements would also read back across to the traffic prediction task.

In terms of feature engineering, the experiments showed that the inclusion of some extra features (e.g. cloud top pressure) improved predictive capability, whereas others (e.g. cloud top altitude) did not. It was found that linearly interpolating temperature improved the validation score; however, this did not read across to the final leaderboard score. The authors still believe that strategies for imputing missing data are an interesting area for further work.

Perhaps most surprising was the benefit gained from training a single model on data from all regions instead of individual models for each region. The model trained on all regions displayed a significant improvement in the test leaderboard score (~2.3%) over the individually trained models. This finding suggests that the model may continue to improve its general predictive ability for any region with the addition of more training data. This hypothesis was further supported by the observation that training on the validation data further improved the final leaderboard score for both the core and transfer learning challenges.

6. Conclusion

Weather4cast provided the opportunity to explore the use of machine learning techniques on the age-old problem of weather forecasting. Furthermore, the similarity of format to Traffic4cast also provided the chance to investigate how transferable machine learning models can be across vastly different domains. After experimenting with various U-Net architectures, the final model was very similar to the authors' Traffic4cast model, the main differences being changes to suppress overfitting, i.e. the move to the Variational U-Net and the inclusion of dropout layers throughout. The authors also found that training a single model on data from all regions outperformed training individual models on each region for both the core and transfer learning challenges. This suggests that the model's predictions for all regions can be improved by training on more data.

References

[1] IARAI, Weather4cast: Multi-sensor weather forecast competition, 2021. URL: https://www.iarai.ac.at/weather4cast.
[2] IARAI, Weather4cast: Multi-sensor weather forecasting competition & benchmark dataset, 2021. URL: https://github.com/iarai/weather4cast.
[3] D. Kreil, M. Kopp, D. Jonietz, M. Neun, A. Gruca, P. Herruzo, H. Martin, A. Soleymani, S. Hochreiter, The surprising efficiency of framing geo-spatial time series forecasting as a video prediction task - insights from the IARAI Traffic4cast competition at NeurIPS 2019, in: NeurIPS 2019 Competition and Demonstration Track, PMLR, 2020, pp. 232-241.
[4] M. Kopp, D. Kreil, M. Neun, D. Jonietz, H. Martin, P. Herruzo, A. Gruca, A. Soleymani, F. Wu, Y. Liu, et al., Traffic4cast at NeurIPS 2020 - yet more on the unreasonable effectiveness of gridded geo-spatial processes, in: NeurIPS 2020 Competition and Demonstration Track, PMLR, 2021, pp. 325-343.
[5] Q. Qi, P. H. Kwok, Traffic4cast 2020 - Graph Ensemble Net and the importance of feature and loss function design for traffic prediction, arXiv preprint arXiv:2012.02115 (2020).
[6] H. Martin, Y. Hong, D. Bucher, C. Rupprecht, R. Buffat, Traffic4cast - traffic map movie forecasting - team MIE-Lab, arXiv preprint arXiv:1910.13824 (2019).
[7] S. Choi, Utilizing UNet for the future traffic map prediction task - Traffic4cast challenge 2020, arXiv preprint arXiv:2012.00125 (2020).
[8] S. A. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. H. Maier-Hein, S. Eslami, D. J. Rezende, O. Ronneberger, A probabilistic U-Net for segmentation of ambiguous images, arXiv preprint arXiv:1806.05034 (2018).
[9] A. Myronenko, 3D MRI brain tumor segmentation using autoencoder regularization, in: International MICCAI Brainlesion Workshop, Springer, 2018, pp. 311-320.
[10] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), arXiv preprint arXiv:1511.07289 (2015).
[11] Y. Wu, K. He, Group normalization, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3-19.
[12] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler, Efficient object localization using convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 648-656.
[13] I. Loshchilov, F. Hutter, SGDR: Stochastic gradient descent with warm restarts, arXiv preprint arXiv:1608.03983 (2016).

A. Experiments on R1

Table A1 details some of the experiments done on R1 to explore which input features should be included in the final model. All these experiments were done using the training and validation data provided. The underlying assumption was that the results from these experiments would read across to the final leaderboard.

Table A1
Summary of experimental results on R1

Experiment              | Base   | 1      | 2      | 3      | 4
ctth_pres               | -      | -      | Yes    | Yes    | Yes
crr_accum               | -      | Yes    | Yes    | Yes    | Yes
ct                      | -      | -      | Yes    | Yes    | Yes
ctth_tempe mask         | -      | -      | -      | Yes    | Yes
ctth_alt                | -      | -      | Yes    | -      | -
Interpolated ctth_tempe | -      | -      | -      | -      | Yes
Epoch                   | 20     | 27     | 32     | 24     | 20
Training score          | 0.2247 | 0.2155 | 0.2091 | 0.2087 | 0.2229
Validation score        | 0.1933 | 0.1935 | 0.1894 | 0.1879 | 0.1889