Finding relevant multivariate models for multi-plant photovoltaic energy forecasting

Youssef Hmamouche*, Piotr Przymus†, Lotfi Lakhal* and Alain Casali*
LIF - CNRS UMR 7279, Aix Marseille University, Marseille, France
* firstname.lastname@lif.univ-amu.fr, † piotr@przymus.org

Abstract. Forecasting photovoltaic energy power is useful for optimizing and controlling the system. It aims to predict the power production based on internal and external variables. This problem is very similar to the multiple time series forecasting problem. When multiple predictor variables are available, not all of them contribute equally to the prediction. The goal is, given a set of predictors, to find the subset(s) leading to the most accurate forecast. In this work, we present a feature selection and model matching framework: for a given variable, we try to find the optimal combination of a forecasting model with the most relevant features. We use a variety of causality-based selection approaches and dimension reduction techniques. The experiments are conducted on real data, and the results advocate the usefulness of the proposed approach.

Keywords: Time Series; Prediction; Data Mining; Ensemble Selection.

1 Introduction

Time series forecasting is an important tool aiming to predict the evolution of time series over time based on their existing history. It has many applications, for example in finance, neuroscience, and industrial optimization, and is considered an essential part of business intelligence systems. It delivers crucial information that can improve decision making processes by anticipating system behavior, e.g., energy consumption or production. Forecasting photovoltaic (PV) energy production has gained attention with the growing interest in using PV as a source of renewable energy. Forecasting the production of such systems has a direct impact on trading and controlling the used energy. In general, PV energy can be measured as time series variables that change according to the system state and external conditions, such as temperature and weather conditions.

The simplest approach would be to use a univariate forecasting model for the power generation time series. Several models can be used in this context, for example the auto-regressive models, e.g., AR or ARIMA [1]. However, this option has a drawback: it does not include crucial information provided by other variables. In this case, it is worth exploiting this extra information using multivariate models. One approach would be to use all available variables, but this (i) incorporates some irrelevant variables, and thus decreases the forecast accuracy [2], and (ii) uses too much memory. Such a situation can be improved by extracting only the most relevant variables, which raises some interesting challenges for multivariate time series forecasting.

The organization of the paper is as follows. In the next section, we present and discuss some works related to the addressed problem. In Section 3, we detail the proposed method. In Section 4, we describe the forecasting process and the methodology used to perform the experiments. In Section 5, we show and discuss the results. In the last section, we summarize our approach.
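As a point of reference for the univariate baseline mentioned above, the sketch below fits an ARIMA model with statsmodels. It is a minimal illustration only: the file name, column name, and model order (2, 1, 1) are assumptions, not part of our setup.

```python
# Minimal univariate baseline sketch. Assumptions: the hourly production of one
# plant is stored in a CSV file (file and column names are hypothetical), and
# the ARIMA order (2, 1, 1) is illustrative, not tuned.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

production = pd.read_csv("plant_id1.csv", index_col=0, parse_dates=True)["power"]

model = ARIMA(production, order=(2, 1, 1))   # AR(2), one differencing, MA(1)
fitted = model.fit()
print(fitted.forecast(steps=24))             # predict the next 24 hours
```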
2 Related Work

In the literature, many approaches were proposed to handle the problem of forecasting PV energy production. In [3], the paper deals with multi-plant PV energy production forecasting. A comparison between artificial neural networks, regression trees, and spatio-temporal auto-correlation based methods was performed. The authors show that regression trees provide better results than artificial neural networks (ANNs). In [4], ANNs are used to forecast PV energy production, taking advantage of their ability to learn changes. To improve the forecasts, multiple predictor variables that may influence the energy production were used, based on internal and external factors. The same problem was investigated in [5], where a hybrid approach was used by adding basic physical constraints of the PV plant to the input of an ANN. The results show an improvement of prediction accuracy compared to the model without those constraints. More works on photovoltaic power forecasting approaches can be found in [6].

We argue that the problem of PV energy forecasting can be modelled as multivariate time series prediction. In the following, we reformulate this problem and discuss the main approaches used to address it. Consider a set of predictor time series $X = [x_1, \dots, x_k]$ and a target variable $y$, with $n$ observations. There are multiple strategies to predict $y$ using $X$. One way consists in using models that exploit the preceding values of $y$ and $X$, e.g., the vector auto-regressive models [7]. In this work, we focus on prediction models that predict $y$ at time $t$ based on the values of the variables of $X$ at the same time $t$. Therefore, the general model can be expressed as follows: $y(t) = f(X_1(t), \dots, X_k(t)) + \epsilon(t)$.

Linear models suppose that $y$ can be expressed as a linear combination of $X$, i.e., $y(t) = \beta_0 + \sum_{i=1}^{k} \beta_i X_i(t) + \epsilon(t)$, where $\epsilon(t)$ is the error term and $\beta = [\beta_0, \beta_1, \dots, \beta_k]$ is the parameter vector of the model. The estimation of these parameters can be performed via different methods. The most common one is the least squares technique, which consists in minimizing the sum of squared errors; the resolution is performed based on straightforward derivation.

Shrinkage methods aim to minimize the impact of irrelevant variables by setting their coefficients close to zero. These techniques are practical when the number of predictors is large and the classical resolution is not possible due to matrix operation constraints. For instance, the Ridge regression method proposed in [8] minimizes the term $\sum_{t=1}^{n} \left( y(t) - \beta_0 - \sum_{i=1}^{k} \beta_i X_i(t) \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$, where $\lambda \sum_{j=1}^{k} \beta_j^2$ is the shrinkage penalty. This mechanism results in shrinking the estimated coefficients towards zero. The Least Absolute Shrinkage and Selection Operator (Lasso) method is similar to Ridge regression, but it uses $\lambda \sum_{j=1}^{k} |\beta_j|$ as the shrinkage penalty term, in order to force the coefficients of unimportant variables to be exactly zero.

ANNs generally use a non-linear function (a network of nodes, where each node passes the signal on using a weight and possibly an activation function). They are characterized by the ability to model dynamic dependencies between variables and to learn from the previous information passed through the network. By considering the training step as a supervised problem, the main algorithms used to calibrate the coefficients of the network are based on the back-propagation of errors using, for instance, gradient descent or stochastic gradient descent algorithms [9], [10].
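For illustration, below is a minimal scikit-learn sketch of the model families discussed above: ordinary least squares, Ridge and Lasso shrinkage, and a small MLP. The synthetic data and hyper-parameters are assumptions for the example only, not those used in our experiments.

```python
# Minimal sketch of the model families discussed above, using scikit-learn.
# Assumptions: the synthetic data and hyper-parameters are illustrative only;
# X plays the role of the predictors X_1..X_k at time t, y the target.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                   # 5 predictor variables
beta = np.array([1.0, 0.5, 0.0, 0.0, 2.0])      # two irrelevant predictors
y = X @ beta + rng.normal(scale=0.1, size=200)  # linear target plus noise

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)              # shrinks coefficients towards 0
lasso = Lasso(alpha=0.1).fit(X, y)              # sets some coefficients to 0
mlp = MLPRegressor(hidden_layer_sizes=(10,), solver="sgd",
                   max_iter=5000, random_state=0).fit(X, y)

print("ols:  ", np.round(ols.coef_, 2))
print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))       # irrelevant coefficients -> 0
print("mlp R^2:", round(mlp.score(X, y), 3))
```

Running the sketch shows the behavior described in the text: Lasso zeroes out the two irrelevant predictors, while Ridge only pushes their coefficients towards zero.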
To handle the problem of selecting the most important predictors in a multivariate prediction model, different approaches based on dimension reduction and feature selection techniques were proposed in the literature. In [11], a comparison of five dimensionality reduction and feature selection methods (t-test and correlation based methods (ranking techniques), step-wise regression, principal component analysis, and factor analysis) is performed as a pre-processing step to improve the forecast accuracy. In [12], the authors combine multiple dimension reduction methods, based on Principal Component Analysis (PCA), Genetic Algorithms (GA), and decision trees (CART), to improve on the multivariate prediction models that use all existing variables. In [13], a feature selection algorithm based on causality is proposed for stock prediction modeling; to avoid the main shortcoming of correlation, i.e., that it cannot distinguish direct influences from indirect ones, the authors select variables based on causality. This method was compared with PCA, decision trees, and Lasso. In [14], an overview of methods that use principal component approaches for regression is given, and a sufficient dimension reduction method for regression with many predictors is proposed.

3 The Proposed Feature Selection Method

In this section, we present our proposed method. Let us consider a target variable $y$ and a set of predictors $P$. The goal is to extract the relevant variables from $P$, i.e., a subset of $P$, based on the notion of causality, that will be used in a model to forecast $y$. Our approach consists of three steps. First, we compute the graph of causalities; then we reduce it by eliminating dependencies using a simple transitive reduction technique; finally, we rank the remaining variables with respect to their causality on the target variable.

To compute causality, we use two measures: (i) Granger causality [15] and (ii) Transfer entropy [16]. They are characterized by the property of modeling non-symmetric relationships between variables: they detect which variable has a direct impact on the other one. Let us consider two univariate time series $x_t$ and $y_t$. Granger causality assumes that $x_t$ causes $y_t$ if it contains helpful information to predict $y_t$. The associated test estimates causality using the Vector Auto-Regressive model: two models are computed, one using just the values of the target variable, and the second using both the target and the predictor variables; the difference between those two models is then evaluated using an F-test. Transfer entropy follows a similar idea, evaluating the behavior of the target variable using itself and the predictor variable. Let us underline that Granger causality is based on a prediction model while Transfer entropy is based on information theory. It has been shown in [17] that they are equivalent only for variables following a normal distribution.

The goal of the proposed method is simple: extracting variables by ranking them according to their causality. However, selecting them directly based on such a non-symmetric measure leads to the problem of dependencies between variables. In other words, it is possible to select a set of variables in which each one causes another, or which are even duplicated (they could contain the same information used to predict the target). Hence, a diversification can improve the selection task. In this case, applying the transitive reduction algorithm seems natural as a pre-processing step.
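Since both the reduction and the ranking operate on a causality graph, the sketch below shows one way such a graph could be built from pairwise Granger tests with statsmodels. The dict-of-dicts representation, the fixed lag order, and the use of 1 - p-value as the causality score are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch: building a pairwise Granger-causality graph.
# Assumptions (illustrative, not the paper's exact procedure): the graph is a
# dict {source: {target: score}}, the lag order is fixed to 2, and the score
# is 1 - p-value of the SSR F-test (higher = stronger evidence source -> target).
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def causality_graph(data: pd.DataFrame, maxlag: int = 2) -> dict:
    graph: dict = {}
    for src in data.columns:
        for dst in data.columns:
            if src == dst:
                continue
            # grangercausalitytests checks whether the 2nd column of the
            # passed frame Granger-causes the 1st one (it prints a summary).
            res = grangercausalitytests(data[[dst, src]], maxlag=maxlag)
            p_value = res[maxlag][0]["ssr_ftest"][1]
            graph.setdefault(src, {})[dst] = 1.0 - p_value
    return graph
```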
We summarize our method in Algorithm 1. A short version is provided, where we suppose that the causality graph is an input of the algorithm. The following notations are adopted: x → y expresses the fact that x causes y, and causality(x → y) is the value of this causality.

Algorithm 1: Transitive Reduction on Causality Graph (TRCG)
Input: The causality graph G, the target variable y, the reduction size k
Output: S: set of predictor variables of y
   /* Eliminating dependencies with regard to the target variable */
1: for all nodes ts1 ∈ G.nodes \ {y} do
2:   for all nodes ts2 ∈ G.nodes \ {ts1, y} do
3:     if ts1 → ts2, ts2 → y and ts1 → y then
4:       Remove the edge between ts1 and y
   /* Selecting the top k variables (nodes of G) that cause y */
5: P = {ts ∈ G.nodes, ts → y}
6: Ps = P.sort(key=lambda x: causality(x → y))
7: S = top_k(Ps)
8: return S

4 Methodology

The data sets used in the experiments are hourly multiple time series (from hour 2 to hour 20 each day), representing 3 PV plants and spanning a period of 12 months (year 2012). The goal is to predict 3 months of the production variable, from January to March 2013 (where the values of the target variables are not known), based on internal factors (temperature and irradiance) and external factors (cloudcover, dewpoint, humidity, pressure, temperature, windbearing, windspeed). The data are organized in a way to predict each hour separately, i.e., for each plant, we have 19 target variables to predict.

Fig. 1: The forecasting process used: feature selection (on causality graphs or via dimension reduction) and model matching on the training data, then prediction of all target variables in the testing step.

The methodology adopted is based on model selection. First, a benchmark experiment is performed on the training data (year 2012) using cross-validation with 8 experiments, predicting 3 months in each experiment. We execute all the models on the subsets generated by all the methods. Then we select for each target variable a pair {method, model} that will be used in the testing step. In the reduction step, we use two existing methods, the Random Walk with Restart on Granger causality graphs (GRWR) and on Transfer entropy graphs (TRWR) [18], and the PCA method. Two versions of the method proposed in Algorithm 1, TTRCG and GTRCG, use either Transfer entropy or Granger causality as the causality measure. The forecasting models used can be classified into four main types (a sketch of the matching step follows this list):

– Regression models: Linear Regression, RANSAC Regressor (RR), Orthogonal Matching Pursuit (OMP), Theil-Sen Regressor (TSR), Huber Regressor (HB).
– Regression models with shrinkage: Ridge, Bayesian Ridge, SVM, Lasso.
– Decision trees: Decision Tree Regressor (DTR), Gradient Boosting Regressor (GBR).
– ANNs: a simple multilayer perceptron neural network (MLP), using one hidden layer and a stochastic gradient descent algorithm to update the parameters of the network.
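A minimal sketch of the model matching step described above: for each target variable, every {selection method, model} pair is scored by cross-validated RMSE and the best pair is kept. The `selectors` argument, the subset of models shown, and the plain k-fold splitting are hypothetical placeholders, not our exact benchmark.

```python
# Minimal model-matching sketch: pick the best {selection method, model} pair
# by cross-validated RMSE. `selectors` maps names (e.g., hypothetical stand-ins
# for GRWR/TRWR/GTRCG/TTRCG/PCA) to functions returning a feature subset;
# only three of the paper's models are shown, and data loading is assumed.
import numpy as np
from itertools import product
from sklearn.linear_model import HuberRegressor, Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

MODELS = {
    "HB": HuberRegressor(),
    "Lasso": Lasso(alpha=0.1),
    "GBR": GradientBoostingRegressor(),
}

def match(selectors: dict, X: np.ndarray, y: np.ndarray, cv: int = 8):
    """Return the (selector, model) pair minimizing cross-validated RMSE."""
    best, best_rmse = None, np.inf
    for (s_name, select), (m_name, model) in product(selectors.items(),
                                                     MODELS.items()):
        Xs = select(X, y)                     # selector returns a feature subset
        scores = cross_val_score(model, Xs, y, cv=cv,
                                 scoring="neg_root_mean_squared_error")
        rmse = -scores.mean()
        if rmse < best_rmse:
            best, best_rmse = (s_name, m_name), rmse
    return best, best_rmse
```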
5 Results and Discussion

In this section, we present the obtained results and provide a discussion. As described in the previous section, we used 3 heuristics (PCA, RWR and TRCG) in the training step. In the testing step, we also used a brute-force feature selection approach that computes all the possible subsets for a small number of the fastest prediction models. This allowed us to improve a few of the models that were previously pre-selected using the heuristic approaches. We obtained RMSE = 0.177 on 10% of the testing data and 0.253 on all the testing data. In the following, we present the results of the ensemble selection approach obtained in the training step, i.e., with the heuristic methods. We focus on the results with heuristic methods, as they can be applied to large-scale data sets.

Table 1: Results of the model selection step for all PV plants

Hours | id1 Method | id1 Model | id2 Method | id2 Model | id3 Method | id3 Model
  2   | GTRCG      | GBR       | GTRCG      | MLP       | GRWR       | GBR
  3   | TRWR       | GBR       | TRWR       | MLP       | GRWR       | GBR
  4   | GRWR       | GBR       | GRWR       | MLP       | TTRCG      | GBR
  5   | TRWR       | Lasso     | TRWR       | HB        | TTRCG      | Lasso
  6   | TRWR       | HB        | TTRCG      | HB        | GRWR       | HB
  7   | GRWR       | MLP       | GRWR       | MLP       | TTRCG      | MLP
  8   | TRWR       | TSR       | GRWR       | TSR       | TRWR       | HB
  9   | TRWR       | GBR       | TTRCG      | HB        | TRWR       | Lasso
 10   | GTRCG      | GBR       | GRWR       | OMP       | TRWR       | Lasso
 11   | GTRCG      | MLP       | GTRCG      | MLP       | TRWR       | TSR
 12   | TRWR       | TSR       | GRWR       | OMP       | TRWR       | TSR
 13   | TRWR       | HB        | TTRCG      | OMP       | TRWR       | HB
 14   | TRWR       | HB        | TRWR       | TSR       | TRWR       | HB
 15   | TRWR       | HB        | TRWR       | HB        | TTRCG      | HB
 16   | GRWR       | HB        | TTRCG      | HB        | TRWR       | HB
 17   | TRWR       | HB        | TRWR       | TSR       | GTRCG      | GBR
 18   | TTRCG      | GBR       | TTRCG      | MLP       | TTRCG      | MLP
 19   | TRWR       | GBR       | TRWR       | MLP       | GTRCG      | GBR
 20   | GRWR       | GBR       | GRWR       | MLP       | GTRCG      | GBR

Table 1 shows that causality-based feature selection methods outperform the PCA-based approaches. In particular, the approaches based on RWR on the causality graphs [18] and the newly proposed algorithm are the most competitive. The general picture, however, is that no single model gets the best results in all cases.

In Figure 2, the forecast accuracy based on RMSE shows that plants id1 and id2 are quite similar, both in terms of characteristics and selected models (Table 1). In the same figure, relative RMSEs are shown in the last three plots. We remark that there are some hours at the beginning and the end of the day (from 2 to 5 and from 18 to 20) where the energy production is weak and very hard to predict. Unfortunately, this decreases the global forecast accuracy. As a side remark, the prediction models selected for these hours are the MLP and Gradient Boosting models for all the plants, which suggests that when the energy production is low, the data are prone to outliers and missing values.

Fig. 2: Forecast accuracy analysis using RMSE (top: RMSE per hour for plants id1, id2 and id3; bottom: relative RMSE per hour).

6 Conclusion

In this paper, we investigated the multi-plant PV energy forecasting task. We presented a feature selection and model matching framework: for a given variable, we use heuristics to find the optimal combination of a forecasting model with the most relevant features. Our matching approach is a two-step process: (i) we use an algorithm that picks an optimal subset of features (or combines the features), and (ii) we evaluate the selection on various prediction models, such as regression, decision tree, or artificial neural network models. Finally, we select the models that perform best. The second contribution is a new feature selection algorithm, which uses the transitive reduction algorithm on the graph of causalities. The results show the utility of using different feature selection methods and prediction models. However, the forecast accuracy analysis using relative RMSE shows some difficulties in giving good predictions at certain hours of the day, especially when the energy production is low, which decreases the global performance.
References

1. Box, G.: Box and Jenkins: Time Series Analysis, Forecasting and Control. In: A Very British Affair. Palgrave Advanced Texts in Econometrics. Palgrave Macmillan UK (2013) 161–215
2. Stock, J.H., Watson, M.W.: Chapter 10 Forecasting with Many Predictors. In: Elliott, G., Granger, C.W.J., Timmermann, A., eds.: Handbook of Economic Forecasting. Volume 1. Elsevier (2006) 515–554
3. Ceci, M., Corizzo, R., Fumarola, F., Malerba, D., Rashkovska, A.: Predictive Modeling of PV Energy Production: How to Set Up the Learning Task for a Better Prediction? IEEE Transactions on Industrial Informatics 13(3) (June 2017) 956–966
4. Dumitru, C.D., Gligor, A., Enachescu, C.: Solar Photovoltaic Energy Production Forecast Using Neural Networks. Procedia Technology 22 (January 2016) 808–815
5. Gandelli, A., Grimaccia, F., Leva, S., Mussetta, M., Ogliari, E.: Hybrid model analysis and validation for PV energy production forecasting. In: 2014 International Joint Conference on Neural Networks (IJCNN). (July 2014) 1957–1962
6. Antonanzas, J., Osorio, N., Escobar, R., Urraca, R., Martinez-de Pison, F.J., Antonanzas-Torres, F.: Review of photovoltaic power forecasting. Solar Energy 136 (October 2016) 78–111
7. Johansen, S.: Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models. Econometrica 59(6) (1991) 1551–1580
8. Hoerl, A.E., Kennard, R.W.: Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12(1) (1970) 55–67
9. Zhang, T.: Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning. ICML '04, New York, NY, USA, ACM (2004) 116
10. Bottou, L.: Stochastic Gradient Descent Tricks. In: Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg (2012) 421–436
11. Tsai, C.F.: Feature selection in bankruptcy prediction. Knowledge-Based Systems 22(2) (March 2009) 120–127
12. Tsai, C.F., Hsiao, Y.C.: Combining multiple feature selection methods for stock prediction: Union, intersection, and multi-intersection approaches. Decision Support Systems 50(1) (December 2010) 258–269
13. Zhang, X., Hu, Y., Xie, K., Wang, S., Ngai, E.W.T., Liu, M.: A causal feature selection algorithm for stock prediction modeling. Neurocomputing 142 (October 2014) 48–59
14. Adragni, K.P., Cook, R.D.: Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 367(1906) (November 2009) 4385–4405
15. Granger, C.W.J.: Testing for causality. Journal of Economic Dynamics and Control 2 (January 1980) 329–352
16. Schreiber, T.: Measuring Information Transfer. Physical Review Letters 85(2) (July 2000) 461–464
17. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equivalent for Gaussian variables. Physical Review Letters 103(23) (December 2009)
18. Przymus, P., Hmamouche, Y., Casali, A., Lakhal, L.: Improving multivariate time series forecasting with random walks with restarts on causality graphs. In: ICDM Workshops 2017