Lightweight Deep Learning for Weather Prediction and Forecasting in Africa

Kinyua Gikunda¹,²,*,†, Nicolas Jouandeau²
¹ Dedan Kimathi University of Technology, Nyeri, Kenya
² Université Paris 8, Vincennes - Saint-Denis

Abstract
Weather forecasting in Africa is hampered by sparse meteorological data and limited computational resources. This paper addresses these challenges by proposing lightweight deep learning (DL) for weather prediction and forecasting. We integrate active learning and transfer learning methods to enhance model training efficiency and accuracy. By focusing on the informativeness and representativeness of training samples, our approach significantly reduces the need for extensive and costly labeling. After training on a source dataset, model skills are transferred to target datasets, allowing for effective weather variable predictions with minimal data. Extensive experiments on three weather datasets demonstrate that our hybrid Transfer Active Learning method achieves classification accuracy comparable to existing methods while using only 20% of the training samples. This study highlights the potential of advanced DL techniques to improve weather forecasting in Africa, despite the constraints of data scarcity and limited computational infrastructure.

Keywords
Weather Forecasting, Deep Learning, Transfer Learning, Active Learning

Proceedings of the DAAfrica'2024 Workshop, November 23, 2024, Bejaia, Algeria
* Corresponding author.
† These authors contributed equally.
Email: patrick.gikunda@dkut.ac.ke (K. Gikunda); n@up8.edu (N. Jouandeau)
Web: https://csit.dkut.ac.ke/departments/it/dr-kinyua-gikunda/ (K. Gikunda); https://n.up8.site/ (N. Jouandeau)
ORCID: 0000-0001-7962-2168 (K. Gikunda); 0000-0001-6902-4324 (N. Jouandeau)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Weather forecasting is a critical component in managing and adapting to environmental changes, particularly in Africa [1]. The continent faces unique challenges due to its vast geographical diversity and limited availability of meteorological data. Many regions in Africa have sparse weather station networks, resulting in uneven and incomplete datasets [2]. Additionally, the computational resources required for advanced weather prediction models are often scarce, further complicating accurate forecasting efforts. These challenges necessitate innovative approaches that can leverage available data and computational resources efficiently. Deep learning (DL) combined with strategies like active learning and transfer learning offers promising solutions to enhance weather prediction and forecasting accuracy in Africa. By utilizing lightweight DL models, it is possible to produce useful weather forecasts even in data-scarce and resource-constrained environments, ultimately aiding better decision-making and resource management across the continent.

2. Deep Learning for Weather Prediction

The non-linear behavior of meteorological data poses significant challenges for weather prediction, even with state-of-the-art numerical models [3]. This complexity has led researchers to explore emerging Artificial Intelligence (AI) approaches, which have demonstrated impressive performance in various fields [4]. Traditional parametric models, such as linear models, struggle with meteorological data due to their limited expressive power and inability to stack linear operations into more abstract representations [5]. Non-parametric learners like Gaussian kernels offer flexibility but are hindered by their reliance on local generalization and the exponential growth of input dimensionality.

Deep Learning (DL) methods address these challenges by stacking multiple feature-learning layers to form deep representations, enhancing both computational and statistical efficiency. Recent advancements have improved the representation of inputs with fewer parameters, allowing for effective feature learning using both labeled and unlabeled data. Transfer Learning (TL), a process within DL, leverages learned features to apply knowledge from one domain to another related domain, improving learning efficiency and effectiveness. This makes DL particularly suitable for complex and dynamic fields like weather prediction.

Deep learning methods, especially convolutional neural network (CNN)-based time series classifiers, have proven highly effective for extracting temporal and spatial features from spatio-temporal weather data [6]. These methods offer faster and more accurate predictions and can handle large, complex datasets from weather satellites and IoT devices [7]. Unlike traditional models, DL models do not require extensive feature engineering, making them more adaptable and practical for weather forecasting applications.

The flexibility and robustness of DL approaches make them well suited to the complexities of weather data, which often exhibit non-linear and chaotic behavior. DL models, leveraging distributed and sparse representations, can capture intricate data structures that traditional parametric and non-parametric models struggle to represent effectively. This capability is crucial for processing high-dimensional meteorological datasets, where capturing subtle patterns and correlations can significantly enhance prediction accuracy.

DL's superior feature-learning capabilities allow for better representation and understanding of weather patterns, leading to improved prediction accuracy and reliability [8]. These techniques reduce the need for manual data preprocessing and feature extraction, streamlining the forecasting process. Moreover, DL methods excel at learning from vast amounts of data, continually improving predictive performance as more data becomes available. Their scalability ensures that forecasting systems remain efficient and effective even as data volumes grow, making DL particularly beneficial for weather forecasting.
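To ground the discussion, the sketch below shows a minimal lightweight CNN-based time series classifier of the kind referred to above, written in PyTorch. The architecture, layer sizes, and variable names are illustrative assumptions rather than the exact model used in our experiments.

```python
import torch
import torch.nn as nn

class LightweightTSC(nn.Module):
    """Small 1D-CNN time series classifier: two convolutional
    feature-learning layers, global pooling over time, and a
    linear classification head."""
    def __init__(self, n_channels: int, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global average pooling over time
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_channels, T) -- a batch of (multivariate) series
        z = self.features(x).squeeze(-1)  # (batch, 64)
        return self.classifier(z)         # unnormalized class scores

# Illustrative shapes: 8 weather variables per time step, 5 target
# classes, series of length T = 96 (e.g. hourly readings over 4 days).
model = LightweightTSC(n_channels=8, n_classes=5)
logits = model(torch.randn(16, 8, 96))    # -> (16, 5)
```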
3. Transfer Learning and Active Learning

To address the challenge of sparse training data in time series datasets, the proposed model incorporates two primary DL techniques: Transfer Learning (TL) and Active Learning (AL). TL allows the model to leverage pre-existing knowledge from a related source task and apply it to the target task; by re-using model skills, it enhances the model's ability to generalize and perform well even with limited data. AL dynamically queries and selects the most informative samples to add to the training set. It uses labeled data to provide critical information about class labels and boundaries, while unlabeled data helps in understanding the underlying data distribution. This iterative process improves the efficiency of learning by focusing on the most useful data points.

Before delving into the specifics of these techniques, it is essential to define the Time Series Classification (TSC) problem.

Definition 1. A univariate time series $U_t = [x_1, x_2, \ldots, x_T]$ is an ordered set of real values. The length of $U_t$ equals the number of observable time-points $T$.

Definition 2. A multivariate time series $M_t = [U_t^1, U_t^2, \ldots, U_t^n]$ consists of $n$ observations per time-point, with $U_t^i \in \mathbb{R}^T$.

Definition 3. A dataset $D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)\}$ is a collection of pairs $(X_i, Y_i)$, where $X_i$ is either a $U_t$ or an $M_t$ and $Y_i$ is its corresponding label. For a dataset containing $K$ classes, the label vector $Y_i$ has length $K$, with element $j \in [1, K]$ equal to 1 if the class of $X_i$ is $j$ and 0 otherwise.

We can define Time Series Classification (TSC) as the task of mapping time-based inputs to a probability distribution over a set of class labels. The core convolution operation used to extract features from a series can be written as:

$$C_t = f(w \ast U_{t-l/2 \,:\, t+l/2} + b) \quad \forall t \in [1, T] \tag{1}$$

where $C$ denotes the result of convolving a univariate time series $U_t$ of length $T$ with a filter $w$ of length $l$, $b$ is a bias parameter, and $f$ is a non-linear function. Applying several filters to a time series yields a multivariate time series whose dimensionality equals the number of filters used. Because ConvNets reuse the same filter values $w$ and $b$ at every time stamp $t \in [1, T]$, this weight sharing enables the model to learn feature detectors that are invariant across the time dimension.
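To make equation (1) concrete, the following NumPy sketch slides a shared filter $(w, b)$ across a toy univariate series; applying two filters produces a bivariate output, illustrating both weight sharing and the dimensionality claim above. The filter values and the input series are illustrative assumptions.

```python
import numpy as np

def conv1d_univariate(u, w, b, f=np.tanh):
    """Apply equation (1): slide one shared filter (w, b) over a
    univariate series u and pass each result through a non-linearity f.
    Edge padding keeps one output per time step t in [1, T]."""
    l = len(w)
    u_padded = np.pad(u, l // 2, mode="edge")
    # The same w and b are reused at every t: this is weight sharing.
    return np.array([f(np.dot(w, u_padded[t:t + l]) + b)
                     for t in range(len(u))])

u = np.sin(np.linspace(0, 6 * np.pi, 50))    # toy univariate series
filters = [np.array([1.0, 0.0, -1.0]),       # edge-like detector
           np.array([0.25, 0.5, 0.25])]      # smoothing detector

# Applying several filters yields a multivariate series whose
# dimensionality equals the number of filters.
C = np.stack([conv1d_univariate(u, w, b=0.0) for w in filters])
print(C.shape)                               # -> (2, 50)
```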
4. Deep Transfer Active Learning

During target training, the model's parameters are initialized with weights from a previous task, represented as $\Theta \leftarrow \vartheta_\theta$. After initializing the weights, a forward pass through the model is performed using the function $f(\theta, x_i)$, which computes the output for an input $x_i$: a vector of estimated probabilities of $x_i$ belonging to each class. The prediction loss is then computed using a cost function, such as the negative log-likelihood. Using gradient descent, the weights are updated in a backward pass that propagates the error. This iterative process of forward pass followed by backpropagation updates the model's parameters to minimize the loss on the training data. During testing, the model is evaluated on unseen data: a forward pass is performed on the new input, and the predicted class is the one with the highest probability. For this, categorical cross-entropy is applied as the loss function, denoted as:

$$L(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log(\hat{y}_i) \tag{2}$$

where $y_i$ is the true (one-hot) label and $\hat{y}_i$ is the predicted probability for class $i$. This loss function measures the performance of the classification model by comparing the predicted probabilities with the actual labels.
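The following PyTorch sketch illustrates one way to realize this transfer step and training loop: weights learned on the source task initialize the target model ($\Theta \leftarrow \vartheta_\theta$), the classification head is replaced for the target classes, and categorical cross-entropy (equation (2)) is minimized by gradient descent. The checkpoint path, toy target data, and hyper-parameters are hypothetical placeholders; `LightweightTSC` is the earlier sketch.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Initialize from source-task weights (Θ ← ϑ_θ); the checkpoint path
# 'source_model.pt' is a hypothetical placeholder.
model = LightweightTSC(n_channels=8, n_classes=5)
model.load_state_dict(torch.load("source_model.pt"))

# Replace the classification head for a target task with K = 3 classes.
model.classifier = nn.Linear(64, 3)

# Toy labelled target set standing in for the scarce real data.
target_loader = DataLoader(
    TensorDataset(torch.randn(64, 8, 96), torch.randint(0, 3, (64,))),
    batch_size=16, shuffle=True)

criterion = nn.CrossEntropyLoss()  # categorical cross-entropy, eq. (2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(5):                  # fine-tune on the target task
    for x, y in target_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)   # forward pass + prediction loss
        loss.backward()                 # backpropagate the error
        optimizer.step()                # gradient-descent weight update
```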
AL is used to select the instances a model is most uncertain about, improving learning efficiency. In uncertainty sampling, the model aims to identify and learn from the most informative data points. Three primary metrics used to define uncertainty are least confidence, sample margin, and entropy. To take the entire output distribution into account, entropy is used here as the uncertainty score:

$$f_u(x_i) = -\sum_{k=1}^{K} P(y_k \mid x_i) \log P(y_k \mid x_i) \tag{3}$$

where $P(y_k \mid x_i)$ is the posterior probability of instance $x_i$ belonging to class $k$. For binary classification, the most uncertain instances are those with nearly equal probabilities for both classes.

Besides uncertainty, considering the distribution of instances can enhance AL performance. Instance diversity helps in selecting the most representative samples, improving query performance and avoiding outliers. The correlation measure assesses the pairwise similarities of instances: the informativeness of an instance is determined by its average similarity to its neighbors. For an instance $x_i$ in the unlabeled set $D_U$, the correlation measure $f_c$ is defined as:

$$f_c(x_i) = \frac{1}{|D_U|} \sum_{x_j \in D_U \setminus \{x_i\}} \mathrm{sim}(x_i, x_j) \tag{4}$$

where $\mathrm{sim}(x_i, x_j)$ is the pairwise similarity of two instances. The value of $f_c(x_i)$ represents the density of $x_i$ in the unlabeled set: higher values indicate that an instance is closely related to others, while lower values suggest outliers, which should be avoided for labeling.

To select samples that are both informative and representative, a heuristic combination of the correlation and uncertainty measures is employed. The most effective instance to label is:

$$\hat{x} = \arg\max_{x_i \in D_U} \big( f_u(x_i) \cdot f_c(x_i) \big) \tag{5}$$

This approach ensures that the selected samples are both uncertain and representative, enhancing the learning process.
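A minimal NumPy sketch of the query rule in equations (3)-(5) follows: entropy scores each unlabeled instance, average pairwise similarity estimates its density, and their product ranks candidates for labeling. Cosine similarity is an assumed choice for the pairwise measure $\mathrm{sim}$, which is left unspecified above; the data are toy values.

```python
import numpy as np

def entropy_uncertainty(probs):
    """Eq. (3): predictive entropy per unlabeled instance.
    probs: (n_unlabeled, n_classes) posterior probabilities."""
    eps = 1e-12                          # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def density(features):
    """Eq. (4): average similarity of each instance to the rest of the
    unlabeled pool (cosine similarity is an assumed choice of sim)."""
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, 0.0)           # exclude self-similarity
    return sim.sum(axis=1) / len(features)

def select_query(probs, features, k=1):
    """Eq. (5): pick the k instances maximizing f_u(x) * f_c(x)."""
    score = entropy_uncertainty(probs) * density(features)
    return np.argsort(score)[-k:][::-1]  # indices, best first

# Toy pool of 100 unlabeled instances with 5-class posteriors.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=100)
features = rng.normal(size=(100, 16))
print(select_query(probs, features, k=5))  # indices to label next
```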
5. Results

Three datasets were used in the experiments: a) RAUS¹, containing daily weather observations from various Australian weather stations over a period of 10 years; b) KenCentralMet, privately acquired daily weather observations from the Kenya Meteorological Department², covering Central Kenya over three years (2012-2014); and c) MeteoNet³, a meteorological dataset developed and made available by the French national meteorological service. For each dataset, less than 20% of the labeled samples were used as the initial training set.

¹ https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package
² https://meteo.go.ke/
³ https://www.kaggle.com/datasets/katerpillar/meteonet

We compare the proposed Deep Transfer Active Learning (DTAL) method, as detailed in the previous section, against: i) random selection of the data samples to query; ii) QUIRE, a method inspired by margin-based active learning from the minimax viewpoint, with emphasis on selecting unlabeled instances that are both informative and representative [9]; iii) DFAL, which selects the unlabeled samples with the smallest adversarial perturbation, since the distance between a sample and its nearest adversarial example approximates its distance to the decision boundary [10]; and iv) Core-Set, a non-uncertainty-based AL method [11].

           |      RAUS      | KenCentralMet  |    MeteoNet
           |  ℙ    ℝ    𝔸   |  ℙ    ℝ    𝔸   |  ℙ    ℝ    𝔸
  Random   |  81   80   79  |  64   67   62  |  89   85   91
  DTAL     |  80   85   85  |  68   64   67  |  91   90   93
  QUIRE    |  89   84   81  |  67   68   67  |  87   88   86
  DFAL     |  83   82   80  |  60   62   64  |  91   88   93
  Core-Set |  79   83   84  |  65   65   68  |  90   91   91

Table 1: Experimental results with Precision ℙ, Recall ℝ and Accuracy 𝔸.

Table 1 shows that DTAL slightly outperforms the other methods overall (except QUIRE, which performs better on RAUS), especially in terms of precision and recall, demonstrating the effectiveness of the proposed hybrid strategy in selecting the most valuable training samples from the distribution. Performance varies with the dataset, however, highlighting the importance of dataset characteristics to the efficacy of active learning methods, and demonstrating that equivalent results can be obtained with fewer samples.

6. Conclusion

This paper demonstrates the efficacy of lightweight deep learning, integrating active and transfer learning, for weather prediction in Africa. Our hybrid Transfer Active Learning method achieves competitive forecasting accuracy with minimal data, using only a small portion (about 20%) of the training samples required by existing methods. Despite the challenges of data scarcity and limited computational resources, our approach shows promise in providing reliable weather forecasts essential for effective decision-making and resource management in Africa. Future work will focus on refining these techniques and validating their practical benefits in real-world applications.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools.

References

[1] P. J. Cooper, J. Dimes, K. Rao, B. Shapiro, B. Shiferaw, S. Twomlow, Coping better with current climatic variability in the rain-fed farming systems of sub-saharan africa: an essential first step in adapting to future climate change?, Agriculture, Ecosystems & Environment 126 (2008) 24-35.
[2] M. Radeny, A. Desalegn, D. Mubiru, F. Kyazze, H. Mahoo, J. Recha, P. Kimeli, D. Solomon, Indigenous knowledge for seasonal weather and climate forecasting across east africa, Climatic Change 156 (2019) 509-526.
[3] L. Benavides Cesar, R. Amaro e Silva, M. Á. Manso Callejo, C.-I. Cira, Review on spatio-temporal solar forecasting methods driven by in situ measurements or their combination with satellite and numerical weather prediction (nwp) estimates, Energies 15 (2022) 4341.
[4] M. Das, S. K. Ghosh, Data-driven approaches for meteorological time series prediction: a comparative study of the state-of-the-art computational intelligence techniques, Pattern Recognition Letters 105 (2018) 155-164.
[5] N. Cohen, O. Sharir, A. Shashua, On the expressive power of deep learning: a tensor analysis, in: Conference on Learning Theory, PMLR, 2016, pp. 698-728.
[6] J. F. Torres, D. Hadjout, A. Sebaa, F. Martínez-Álvarez, A. Troncoso, Deep learning for time series forecasting: a survey, Big Data 9 (2021) 3-21.
[7] L. Chen, B. Han, X. Wang, J. Zhao, W. Yang, Z. Yang, Machine learning methods in weather and climate applications: a survey, Applied Sciences 13 (2023) 12019.
[8] G. Huang, Y. Wang, Y.-G. Ham, B. Mu, W. Tao, C. Xie, Toward a learnable climate model in the artificial intelligence era, Advances in Atmospheric Sciences (2024) 1-8.
[9] S.-J. Huang, R. Jin, Z.-H. Zhou, Active learning by querying informative and representative examples, Advances in Neural Information Processing Systems 23 (2010).
[10] M. Ducoffe, F. Precioso, Adversarial active learning for deep networks: a margin based approach, arXiv preprint arXiv:1802.09841 (2018).
[11] O. Sener, S. Savarese, Active learning for convolutional neural networks: a core-set approach, arXiv preprint arXiv:1708.00489 (2017).