Lightweight Deep Learning for Weather Prediction and Forecasting in Africa

Kinyua Gikunda¹,²,*,†, Nicolas Jouandeau²
¹ Dedan Kimathi University of Technology, Nyeri, Kenya
² Université Paris 8, Vincennes - Saint-Denis

Abstract
Weather forecasting in Africa is hampered by sparse meteorological data and limited computational resources. This paper addresses these challenges by proposing lightweight deep learning (DL) for weather prediction and forecasting. We integrate active learning and transfer learning methods to enhance model training efficiency and accuracy. By focusing on the informativeness and representativeness of training samples, our approach significantly reduces the need for extensive and costly labeling. After training on a source dataset, model skills are transferred to target datasets, allowing for effective weather variable predictions with minimal data. Extensive experiments on three weather datasets demonstrate that our hybrid Transfer Active Learning method achieves classification accuracy comparable to existing methods while using only 20% of the training samples. This study highlights the potential of advanced DL techniques to improve weather forecasting in Africa, despite the constraints of data scarcity and limited computational infrastructure.

Keywords
Weather Forecasting, Deep Learning, Transfer Learning, Active Learning

Proceedings of the DAAfrica'2024 Workshop, November 23, 2024, Bejaia, Algeria
* Corresponding author.
† These authors contributed equally.
Email: patrick.gikunda@dkut.ac.ke (K. Gikunda); n@up8.edu (N. Jouandeau)
Web: https://csit.dkut.ac.ke/departments/it/dr-kinyua-gikunda/ (K. Gikunda); https://n.up8.site/ (N. Jouandeau)
ORCID: 0000-0001-7962-2168 (K. Gikunda); 0000-0001-6902-4324 (N. Jouandeau)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Weather forecasting is a critical component in managing and adapting to environmental changes, particularly in Africa [1]. The continent faces unique challenges due to its vast geographical diversity and limited availability of meteorological data. Many regions in Africa have sparse weather station networks, resulting in uneven and incomplete datasets [2]. Additionally, the computational resources required for advanced weather prediction models are often scarce, further complicating accurate forecasting efforts. These challenges necessitate innovative approaches that can leverage available data and computational resources efficiently. Deep learning (DL) combined with strategies like active learning and transfer learning offers promising solutions to enhance weather prediction and forecasting accuracy in Africa. By utilizing lightweight DL models, it is possible to produce useful weather forecasts even in data-scarce and resource-constrained environments, ultimately aiding better decision-making and resource management across the continent.

2. Deep Learning for Weather Prediction

The non-linear behavior of meteorological data poses significant challenges for weather prediction, even with state-of-the-art numerical models [3]. This complexity has led researchers to explore emerging Artificial Intelligence (AI) approaches, which have demonstrated impressive performance in various fields [4]. Traditional parametric models, such as linear models, struggle with meteorological data due to their limited expressive power and inability to stack linear operations into more abstract representations [5]. Non-parametric learners like Gaussian kernels offer flexibility but are hindered by their reliance on local generalization and the exponential growth of input dimensionality.

Deep Learning (DL) methods address these challenges by stacking multiple feature-learning layers to form deep representations, enhancing both computational and statistical efficiency. Recent advancements have improved the representation of inputs with fewer parameters, allowing for effective feature learning using both labeled and unlabeled data. Transfer Learning (TL), a process within DL, leverages learned features to apply knowledge from one domain to another related domain, improving learning efficiency and effectiveness. This makes DL particularly suitable for complex and dynamic fields like weather prediction.

Deep learning methods, especially convolutional neural network (CNN)-based time series classifiers, have proven highly effective for extracting temporal and spatial features from spatio-temporal weather data [6]. These methods offer faster and more accurate predictions and can handle large, complex datasets from weather satellites and IoT devices [7]. Unlike traditional models, DL models do not require extensive feature engineering, making them more adaptable and practical for weather forecasting applications.

The flexibility and robustness of DL approaches make them well suited to the complexities of weather data, which often exhibit non-linear and chaotic behavior. DL models, leveraging distributed and sparse representations, can capture intricate data structures that traditional parametric and non-parametric models struggle to represent effectively. This capability is crucial for processing high-dimensional meteorological datasets, where capturing subtle patterns and correlations can significantly enhance prediction accuracy.

DL's superior feature-learning capabilities allow for better representation and understanding of weather patterns, leading to improved prediction accuracy and reliability [8]. These techniques reduce the need for manual data preprocessing and feature extraction, streamlining the forecasting process. Moreover, DL methods excel at learning from vast amounts of data, continually improving predictive performance as more data becomes available. Their scalability ensures that forecasting systems remain efficient and effective even as data volumes grow, making DL particularly beneficial for weather forecasting.
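To ground the discussion, the sketch below shows a minimal lightweight CNN-based time series classifier of the kind referred to above, written in PyTorch. The architecture, layer sizes, and variable names are illustrative assumptions rather than the exact model used in our experiments.

```python
import torch
import torch.nn as nn

class LightweightTSC(nn.Module):
    """Small 1D-CNN time series classifier: two convolutional
    feature-learning layers, global pooling over time, and a
    linear classification head."""
    def __init__(self, n_channels: int, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global average pooling over time
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_channels, T) -- a batch of (multivariate) series
        z = self.features(x).squeeze(-1)  # (batch, 64)
        return self.classifier(z)         # unnormalized class scores

# Illustrative shapes: 8 weather variables per time step, 5 target
# classes, series of length T = 96 (e.g. hourly readings over 4 days).
model = LightweightTSC(n_channels=8, n_classes=5)
logits = model(torch.randn(16, 8, 96))    # -> (16, 5)
```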
3. Transfer Learning and Active Learning

To address the challenge of sparse training data in time series datasets, the proposed model incorporates two primary DL techniques: Transfer Learning (TL) and Active Learning (AL). TL allows the model to leverage pre-existing knowledge from a related source task and apply it to the target task; by re-using model skills, it enhances the model's ability to generalize and perform well even with limited data. AL dynamically queries and selects the most informative samples to add to the training set. It uses labeled data to provide critical information about class labels and boundaries, while unlabeled data helps in understanding the underlying data distribution. This iterative process improves the efficiency of learning by focusing on the most useful data points.

Before delving into the specifics of these techniques, it is essential to define the Time Series Classification (TSC) problem.

Definition 1. A univariate time series $U_t = [x_1, x_2, \ldots, x_T]$ is an ordered set of real values. The length of $U_t$ equals the number of observable time-points $T$.

Definition 2. A multivariate time series $M_t = [U_t^1, U_t^2, \ldots, U_t^n]$ consists of $n$ observations per time-point, with $U_t^i \in \mathbb{R}^T$.

Definition 3. A dataset $D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)\}$ is a collection of pairs $(X_i, Y_i)$, where $X_i$ is either a $U_t$ or an $M_t$ and $Y_i$ is its corresponding label. For a dataset containing $K$ classes, the label vector $Y_i$ has length $K$, with element $j \in [1, K]$ equal to 1 if the class of $X_i$ is $j$ and 0 otherwise.

We can define Time Series Classification (TSC) as the task of mapping time-based inputs to a probability distribution over a set of class labels. The core convolution operation used to extract features from a series can be written as:

$$C_t = f(w \ast U_{t-l/2 \,:\, t+l/2} + b) \quad \forall t \in [1, T] \tag{1}$$

where $C$ denotes the result of convolving a univariate time series $U_t$ of length $T$ with a filter $w$ of length $l$, $b$ is a bias parameter, and $f$ is a non-linear function. Applying several filters to a time series yields a multivariate time series whose dimensionality equals the number of filters used. Because ConvNets reuse the same filter values $w$ and $b$ at every time stamp $t \in [1, T]$, this weight sharing enables the model to learn feature detectors that are invariant across the time dimension.
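To make equation (1) concrete, the following NumPy sketch slides a shared filter $(w, b)$ across a toy univariate series; applying two filters produces a bivariate output, illustrating both weight sharing and the dimensionality claim above. The filter values and the input series are illustrative assumptions.

```python
import numpy as np

def conv1d_univariate(u, w, b, f=np.tanh):
    """Apply equation (1): slide one shared filter (w, b) over a
    univariate series u and pass each result through a non-linearity f.
    Edge padding keeps one output per time step t in [1, T]."""
    l = len(w)
    u_padded = np.pad(u, l // 2, mode="edge")
    # The same w and b are reused at every t: this is weight sharing.
    return np.array([f(np.dot(w, u_padded[t:t + l]) + b)
                     for t in range(len(u))])

u = np.sin(np.linspace(0, 6 * np.pi, 50))    # toy univariate series
filters = [np.array([1.0, 0.0, -1.0]),       # edge-like detector
           np.array([0.25, 0.5, 0.25])]      # smoothing detector

# Applying several filters yields a multivariate series whose
# dimensionality equals the number of filters.
C = np.stack([conv1d_univariate(u, w, b=0.0) for w in filters])
print(C.shape)                               # -> (2, 50)
```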
4. Deep Transfer Active Learning

During target training, the model's parameters are initialized with weights from a previous task, represented as $\Theta \leftarrow \vartheta_\theta$. After initializing the weights, a forward pass through the model is performed using the function $f(\theta, x_i)$, which computes the output for an input $x_i$: a vector of estimated probabilities of $x_i$ belonging to each class. The prediction loss is then computed using a cost function, such as the negative log-likelihood. Using gradient descent, the weights are updated in a backward pass that propagates the error. This iterative process of forward pass followed by backpropagation updates the model's parameters to minimize the loss on the training data. During testing, the model is evaluated on unseen data: a forward pass is performed on the new input, and the predicted class is the one with the highest probability. For this, categorical cross-entropy is applied as the loss function, denoted as:

$$L(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log(\hat{y}_i) \tag{2}$$

where $y_i$ is the true (one-hot) label and $\hat{y}_i$ is the predicted probability for class $i$. This loss function measures the performance of the classification model by comparing the predicted probabilities with the actual labels.
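The following PyTorch sketch illustrates one way to realize this transfer step and training loop: weights learned on the source task initialize the target model ($\Theta \leftarrow \vartheta_\theta$), the classification head is replaced for the target classes, and categorical cross-entropy (equation (2)) is minimized by gradient descent. The checkpoint path, toy target data, and hyper-parameters are hypothetical placeholders; `LightweightTSC` is the earlier sketch.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Initialize from source-task weights (Θ ← ϑ_θ); the checkpoint path
# 'source_model.pt' is a hypothetical placeholder.
model = LightweightTSC(n_channels=8, n_classes=5)
model.load_state_dict(torch.load("source_model.pt"))

# Replace the classification head for a target task with K = 3 classes.
model.classifier = nn.Linear(64, 3)

# Toy labelled target set standing in for the scarce real data.
target_loader = DataLoader(
    TensorDataset(torch.randn(64, 8, 96), torch.randint(0, 3, (64,))),
    batch_size=16, shuffle=True)

criterion = nn.CrossEntropyLoss()  # categorical cross-entropy, eq. (2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(5):                  # fine-tune on the target task
    for x, y in target_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)   # forward pass + prediction loss
        loss.backward()                 # backpropagate the error
        optimizer.step()                # gradient-descent weight update
```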
AL is used to select the instances a model is most uncertain about, improving learning efficiency. In uncertainty sampling, the model aims to identify and learn from the most informative data points. Three primary metrics used to define uncertainty are least confidence, sample margin, and entropy. To take the entire output distribution into account, entropy is used here as the uncertainty score:

$$f_u(x_i) = -\sum_{k=1}^{K} P(y_k \mid x_i) \log P(y_k \mid x_i) \tag{3}$$

where $P(y_k \mid x_i)$ is the posterior probability of instance $x_i$ belonging to class $k$. For binary classification, the most uncertain instances are those with nearly equal probabilities for both classes.

Besides uncertainty, considering the distribution of instances can enhance AL performance. Instance diversity helps in selecting the most representative samples, improving query performance and avoiding outliers. The correlation measure assesses the pairwise similarities of instances: the informativeness of an instance is determined by its average similarity to its neighbors. For an instance $x_i$ in the unlabeled set $D_U$, the correlation measure $f_c$ is defined as:

$$f_c(x_i) = \frac{1}{|D_U|} \sum_{x_j \in D_U \setminus \{x_i\}} \mathrm{sim}(x_i, x_j) \tag{4}$$

where $\mathrm{sim}(x_i, x_j)$ is the pairwise similarity of two instances. The value of $f_c(x_i)$ represents the density of $x_i$ in the unlabeled set: higher values indicate that an instance is closely related to others, while lower values suggest outliers, which should be avoided for labeling.

To select samples that are both informative and representative, a heuristic combination of the correlation and uncertainty measures is employed. The most effective instance to label is:

$$\hat{x} = \arg\max_{x_i \in D_U} \big( f_u(x_i) \cdot f_c(x_i) \big) \tag{5}$$

This approach ensures that the selected samples are both uncertain and representative, enhancing the learning process.
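A minimal NumPy sketch of the query rule in equations (3)-(5) follows: entropy scores each unlabeled instance, average pairwise similarity estimates its density, and their product ranks candidates for labeling. Cosine similarity is an assumed choice for the pairwise measure $\mathrm{sim}$, which is left unspecified above; the data are toy values.

```python
import numpy as np

def entropy_uncertainty(probs):
    """Eq. (3): predictive entropy per unlabeled instance.
    probs: (n_unlabeled, n_classes) posterior probabilities."""
    eps = 1e-12                          # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def density(features):
    """Eq. (4): average similarity of each instance to the rest of the
    unlabeled pool (cosine similarity is an assumed choice of sim)."""
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, 0.0)           # exclude self-similarity
    return sim.sum(axis=1) / len(features)

def select_query(probs, features, k=1):
    """Eq. (5): pick the k instances maximizing f_u(x) * f_c(x)."""
    score = entropy_uncertainty(probs) * density(features)
    return np.argsort(score)[-k:][::-1]  # indices, best first

# Toy pool of 100 unlabeled instances with 5-class posteriors.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=100)
features = rng.normal(size=(100, 16))
print(select_query(probs, features, k=5))  # indices to label next
```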
5. Results

Three datasets were used in the experiments: a) RAUS¹, containing daily weather observations from various Australian weather stations over a period of 10 years; b) KenCentralMet, privately acquired daily weather observations from the Kenya Meteorological Department², covering Central Kenya over three years (2012-2014); and c) MeteoNet³, a meteorological dataset developed and made available by the French national meteorological service. For each dataset, less than 20% of the labeled samples were used as the initial training set.

¹ https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package
² https://meteo.go.ke/
³ https://www.kaggle.com/datasets/katerpillar/meteonet

We compare the proposed Deep Transfer Active Learning (DTAL) method, as detailed in the previous section, against: i) random selection of the data samples to query; ii) QUIRE, a method inspired by margin-based active learning from the minimax viewpoint, with emphasis on selecting unlabeled instances that are both informative and representative [9]; iii) DFAL, which selects the unlabeled samples with the smallest adversarial perturbation, since the distance between a sample and its nearest adversarial example approximates its distance to the decision boundary [10]; and iv) Core-Set, a non-uncertainty-based AL method [11].

           |      RAUS      | KenCentralMet  |    MeteoNet
           |  ℙ    ℝ    𝔸   |  ℙ    ℝ    𝔸   |  ℙ    ℝ    𝔸
  Random   |  81   80   79  |  64   67   62  |  89   85   91
  DTAL     |  80   85   85  |  68   64   67  |  91   90   93
  QUIRE    |  89   84   81  |  67   68   67  |  87   88   86
  DFAL     |  83   82   80  |  60   62   64  |  91   88   93
  Core-Set |  79   83   84  |  65   65   68  |  90   91   91

Table 1: Experimental results with Precision ℙ, Recall ℝ and Accuracy 𝔸.

Table 1 shows that DTAL slightly outperforms the other methods overall (except QUIRE, which performs better on RAUS), especially in terms of precision and recall, demonstrating the effectiveness of the proposed hybrid strategy in selecting the most valuable training samples from the distribution. Performance varies with the dataset, however, highlighting the importance of dataset characteristics to the efficacy of active learning methods, and demonstrating that equivalent results can be obtained with fewer samples.

6. Conclusion

This paper demonstrates the efficacy of lightweight deep learning, integrating active and transfer learning, for weather prediction in Africa. Our hybrid Transfer Active Learning method achieves competitive forecasting accuracy with minimal data, using only a small portion (about 20%) of the training samples required by existing methods. Despite the challenges of data scarcity and limited computational resources, our approach shows promise in providing reliable weather forecasts essential for effective decision-making and resource management in Africa. Future work will focus on refining these techniques and validating their practical benefits in real-world applications.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools.

References

[1] P. J. Cooper, J. Dimes, K. Rao, B. Shapiro, B. Shiferaw, S. Twomlow, Coping better with current climatic variability in the rain-fed farming systems of sub-saharan africa: an essential first step in adapting to future climate change?, Agriculture, Ecosystems & Environment 126 (2008) 24-35.
[2] M. Radeny, A. Desalegn, D. Mubiru, F. Kyazze, H. Mahoo, J. Recha, P. Kimeli, D. Solomon, Indigenous knowledge for seasonal weather and climate forecasting across east africa, Climatic Change 156 (2019) 509-526.
[3] L. Benavides Cesar, R. Amaro e Silva, M. Á. Manso Callejo, C.-I. Cira, Review on spatio-temporal solar forecasting methods driven by in situ measurements or their combination with satellite and numerical weather prediction (nwp) estimates, Energies 15 (2022) 4341.
[4] M. Das, S. K. Ghosh, Data-driven approaches for meteorological time series prediction: a comparative study of the state-of-the-art computational intelligence techniques, Pattern Recognition Letters 105 (2018) 155-164.
[5] N. Cohen, O. Sharir, A. Shashua, On the expressive power of deep learning: a tensor analysis, in: Conference on Learning Theory, PMLR, 2016, pp. 698-728.
[6] J. F. Torres, D. Hadjout, A. Sebaa, F. Martínez-Álvarez, A. Troncoso, Deep learning for time series forecasting: a survey, Big Data 9 (2021) 3-21.
[7] L. Chen, B. Han, X. Wang, J. Zhao, W. Yang, Z. Yang, Machine learning methods in weather and climate applications: a survey, Applied Sciences 13 (2023) 12019.
[8] G. Huang, Y. Wang, Y.-G. Ham, B. Mu, W. Tao, C. Xie, Toward a learnable climate model in the artificial intelligence era, Advances in Atmospheric Sciences (2024) 1-8.
[9] S.-J. Huang, R. Jin, Z.-H. Zhou, Active learning by querying informative and representative examples, Advances in Neural Information Processing Systems 23 (2010).
[10] M. Ducoffe, F. Precioso, Adversarial active learning for deep networks: a margin based approach, arXiv preprint arXiv:1802.09841 (2018).
[11] O. Sener, S. Savarese, Active learning for convolutional neural networks: a core-set approach, arXiv preprint arXiv:1708.00489 (2017).