Finding relevant multivariate models for multi-plant photovoltaic energy forecasting

Youssef Hmamouche*, Piotr Przymus†, Lotfi Lakhal* and Alain Casali*
LIF - CNRS UMR 7279, Aix Marseille University, Marseille, France
* firstname.lastname@lif.univ-amu.fr, † piotr@przymus.org

Abstract. Forecasting photovoltaic energy power is useful for optimizing and controlling the system. It aims to predict the power production based on internal and external variables. This problem is very similar to the multiple time series forecasting problem. When multiple predictor variables are available, not all of them contribute equally to the prediction. The goal is, given a set of predictors, to find the subset(s) leading to the most accurate forecast. In this work, we present a feature selection and model matching framework: for a given variable, we try to find the optimal combination of a forecasting model with the most relevant features. We use a variety of causality-based selection approaches and dimension reduction techniques. The experiments are conducted on real data, and the results advocate the usefulness of the proposed approach.

Keywords: Time Series; Prediction; Data Mining; Ensemble Selection.

1 Introduction

Time series forecasting is an important tool aiming to predict the evolution of time series over time based on their existing history. It has many applications, for example in finance, neuroscience, and industrial optimization, and is considered an essential part of business intelligence systems. It delivers crucial information that can improve decision making processes by anticipating system behavior, e.g., energy consumption or production. Forecasting photovoltaic (PV) energy production has gained attention with the growing interest in using PV as a source of renewable energy. Forecasting the production of such systems has a direct impact on trading and controlling the used energy. In general, PV energy can be measured as time series variables that change according to the system state and external conditions, such as temperature and weather conditions.

The simplest approach would be to use a univariate forecasting model for the power generation time series. Several models can be used in this context, for example the auto-regressive models, e.g., AR or ARIMA [1]. However, this option has a drawback: it does not include crucial information provided by other variables. In this case, it is worth exploiting this extra information using multivariate models. One approach would be to use all available variables, but this (i) incorporates some irrelevant variables, and thus decreases the forecast accuracy [2], and (ii) uses too much memory. Such a situation can be improved by extracting only the most relevant variables, which raises some interesting challenges for multivariate time series forecasting.

The organization of the paper is as follows. In the next section, we present and discuss some works related to the addressed problem. In Section 3, we detail the proposed method. In Section 4, we describe the forecasting process and the methodology used to perform the experiments. In Section 5, we show and discuss the results. In the last section, we summarize our approach.
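As a point of reference for the univariate baseline mentioned above, the sketch below fits an ARIMA model with statsmodels. It is a minimal illustration only: the file name, column name, and model order (2, 1, 1) are assumptions, not part of our setup.

```python
# Minimal univariate baseline sketch. Assumptions: the hourly production of one
# plant is stored in a CSV file (file and column names are hypothetical), and
# the ARIMA order (2, 1, 1) is illustrative, not tuned.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

production = pd.read_csv("plant_id1.csv", index_col=0, parse_dates=True)["power"]

model = ARIMA(production, order=(2, 1, 1))   # AR(2), one differencing, MA(1)
fitted = model.fit()
print(fitted.forecast(steps=24))             # predict the next 24 hours
```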
2 Related Work

In the literature, many approaches were proposed to handle the problem of forecasting PV energy production. In [3], the paper deals with multi-plant PV energy production forecasting. A comparison between artificial neural networks, regression trees, and spatio-temporal auto-correlation based methods was performed. The authors show that regression trees provide better results than artificial neural networks (ANNs). In [4], ANNs are used to forecast PV energy production, taking advantage of their ability to learn changes. To improve the forecasts, multiple predictor variables that may influence the energy production were used, based on internal and external factors. The same problem was investigated in [5], where a hybrid approach was used by adding basic physical constraints of the PV plant to the input of an ANN. The results show an improvement of prediction accuracy compared to the model without those constraints. More works on photovoltaic power forecasting approaches can be found in [6].

We argue that the problem of PV energy forecasting can be modelled as multivariate time series prediction. In the following, we reformulate this problem and discuss the main approaches used to address it. Consider a set of predictor time series $X = [x_1, \dots, x_k]$ and a target variable $y$, with $n$ observations. There are multiple strategies to predict $y$ using $X$. One way consists in using models that exploit the preceding values of $y$ and $X$, e.g., the vector auto-regressive models [7]. In this work, we focus on prediction models that predict $y$ at time $t$ based on the values of the variables of $X$ at the same time $t$. Therefore, the general model can be expressed as follows: $y(t) = f(X_1(t), \dots, X_k(t)) + \epsilon(t)$.

Linear models suppose that $y$ can be expressed as a linear combination of $X$, i.e., $y(t) = \beta_0 + \sum_{i=1}^{k} \beta_i X_i(t) + \epsilon(t)$, where $\epsilon(t)$ is the error term and $\beta = [\beta_0, \beta_1, \dots, \beta_k]$ is the parameter vector of the model. The estimation of these parameters can be performed via different methods. The most common one is the least squares technique, which consists in minimizing the sum of squared errors; the resolution is performed based on straightforward derivation.

Shrinkage methods aim to minimize the impact of irrelevant variables by setting their coefficients close to zero. These techniques are practical when the number of predictors is large and the classical resolution is not possible due to matrix operation constraints. For instance, the Ridge regression method proposed in [8] minimizes the term $\sum_{t=1}^{n} \left( y(t) - \beta_0 - \sum_{i=1}^{k} \beta_i X_i(t) \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$, where $\lambda \sum_{j=1}^{k} \beta_j^2$ is the shrinkage penalty. This mechanism results in shrinking the estimated coefficients towards zero. The Least Absolute Shrinkage and Selection Operator (Lasso) method is similar to Ridge regression, but it uses $\lambda \sum_{j=1}^{k} |\beta_j|$ as the shrinkage penalty term, in order to force the coefficients of unimportant variables to be exactly zero.

ANNs generally use a non-linear function (a network of nodes, where each node passes the signal on using a weight and possibly an activation function). They are characterized by the ability to model dynamic dependencies between variables and to learn from the previous information passed through the network. By considering the training step as a supervised problem, the main algorithms used to calibrate the coefficients of the network are based on the back-propagation of errors using, for instance, gradient descent or stochastic gradient descent algorithms [9], [10].
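For illustration, below is a minimal scikit-learn sketch of the model families discussed above: ordinary least squares, Ridge and Lasso shrinkage, and a small MLP. The synthetic data and hyper-parameters are assumptions for the example only, not those used in our experiments.

```python
# Minimal sketch of the model families discussed above, using scikit-learn.
# Assumptions: the synthetic data and hyper-parameters are illustrative only;
# X plays the role of the predictors X_1..X_k at time t, y the target.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                   # 5 predictor variables
beta = np.array([1.0, 0.5, 0.0, 0.0, 2.0])      # two irrelevant predictors
y = X @ beta + rng.normal(scale=0.1, size=200)  # linear target plus noise

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)              # shrinks coefficients towards 0
lasso = Lasso(alpha=0.1).fit(X, y)              # sets some coefficients to 0
mlp = MLPRegressor(hidden_layer_sizes=(10,), solver="sgd",
                   max_iter=5000, random_state=0).fit(X, y)

print("ols:  ", np.round(ols.coef_, 2))
print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))       # irrelevant coefficients -> 0
print("mlp R^2:", round(mlp.score(X, y), 3))
```

Running the sketch shows the behavior described in the text: Lasso zeroes out the two irrelevant predictors, while Ridge only pushes their coefficients towards zero.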
To handle the problem of selecting the most important predictors in a multivariate prediction model, different approaches based on dimension reduction and feature selection techniques were proposed in the literature. In [11], a comparison of five dimensionality reduction and feature selection methods (t-test and correlation based methods (ranking techniques), step-wise regression, principal component analysis, and factor analysis) is performed as a pre-processing step to improve the forecast accuracy. In [12], the authors combine multiple dimension reduction methods, based on Principal Component Analysis (PCA), Genetic Algorithms (GA), and decision trees (CART), to improve on the multivariate prediction models that use all existing variables. In [13], a feature selection algorithm based on causality is proposed for stock prediction modeling; to avoid the main shortcoming of correlation, i.e., that it cannot distinguish direct influences from indirect ones, the authors select variables based on causality. This method was compared with PCA, decision trees, and Lasso. In [14], an overview of methods that use principal component approaches for regression is given, and a sufficient dimension reduction method for regression with many predictors is proposed.

3 The Proposed Feature Selection Method

In this section, we present our proposed method. Let us consider a target variable $y$ and a set of predictors $P$. The goal is to extract the relevant variables from $P$, i.e., a subset of $P$, based on the notion of causality, that will be used in a model to forecast $y$. Our approach consists of three steps. First, we compute the graph of causalities; then we reduce it by eliminating dependencies using a simple transitive reduction technique; finally, we rank the remaining variables with respect to their causality on the target variable.

To compute causality, we use two measures: (i) Granger causality [15] and (ii) Transfer entropy [16]. They are characterized by the property of modeling non-symmetric relationships between variables: they detect which variable has a direct impact on the other one. Let us consider two univariate time series $x_t$ and $y_t$. Granger causality assumes that $x_t$ causes $y_t$ if it contains helpful information to predict $y_t$. The associated test estimates causality using the Vector Auto-Regressive model: two models are computed, one using just the values of the target variable, and the second using both the target and the predictor variables; the difference between those two models is then evaluated using an F-test. Transfer entropy follows a similar idea, evaluating the behavior of the target variable using itself and the predictor variable. Let us underline that Granger causality is based on a prediction model while Transfer entropy is based on information theory. It has been shown in [17] that they are equivalent only for variables following a normal distribution.

The goal of the proposed method is simple: extracting variables by ranking them according to their causality. However, selecting them directly based on such a non-symmetric measure leads to the problem of dependencies between variables. In other words, it is possible to select a set of variables in which each one causes another, or which are even duplicated (they could contain the same information used to predict the target). Hence, a diversification can improve the selection task. In this case, applying the transitive reduction algorithm seems natural as a pre-processing step.
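Since both the reduction and the ranking operate on a causality graph, the sketch below shows one way such a graph could be built from pairwise Granger tests with statsmodels. The dict-of-dicts representation, the fixed lag order, and the use of 1 - p-value as the causality score are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch: building a pairwise Granger-causality graph.
# Assumptions (illustrative, not the paper's exact procedure): the graph is a
# dict {source: {target: score}}, the lag order is fixed to 2, and the score
# is 1 - p-value of the SSR F-test (higher = stronger evidence source -> target).
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def causality_graph(data: pd.DataFrame, maxlag: int = 2) -> dict:
    graph: dict = {}
    for src in data.columns:
        for dst in data.columns:
            if src == dst:
                continue
            # grangercausalitytests checks whether the 2nd column of the
            # passed frame Granger-causes the 1st one (it prints a summary).
            res = grangercausalitytests(data[[dst, src]], maxlag=maxlag)
            p_value = res[maxlag][0]["ssr_ftest"][1]
            graph.setdefault(src, {})[dst] = 1.0 - p_value
    return graph
```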
We summarize our method in Algorithm 1. A short version is provided, where we suppose that the causality graph is an input of the algorithm. The following notations are adopted: x → y expresses the fact that x causes y, and causality(x → y) is the value of this causality.

Algorithm 1: Transitive Reduction on Causality Graph (TRCG)
Input: The causality graph G, the target variable y, the reduction size k
Output: S: set of predictor variables of y
   /* Eliminating dependencies with regard to the target variable */
1: for all nodes ts1 ∈ G.nodes \ {y} do
2:   for all nodes ts2 ∈ G.nodes \ {ts1, y} do
3:     if ts1 → ts2, ts2 → y and ts1 → y then
4:       Remove the edge between ts1 and y
   /* Selecting the top k variables (nodes of G) that cause y */
5: P = {ts ∈ G.nodes, ts → y}
6: Ps = P.sort(key=lambda x: causality(x → y))
7: S = top_k(Ps)
8: return S

4 Methodology

The data sets used in the experiments are hourly multiple time series (from hour 2 to hour 20 each day), representing 3 PV plants and spanning a period of 12 months (year 2012). The goal is to predict 3 months of the production variable, from January to March 2013 (where the values of the target variables are not known), based on internal factors (temperature and irradiance) and external factors (cloudcover, dewpoint, humidity, pressure, temperature, windbearing, windspeed). The data are organized in a way to predict each hour separately, i.e., for each plant, we have 19 target variables to predict.

Fig. 1: The forecasting process used: feature selection (on causality graphs or via dimension reduction) and model matching on the training data, then prediction of all target variables in the testing step.

The methodology adopted is based on model selection. First, a benchmark experiment is performed on the training data (year 2012) using cross-validation with 8 experiments, predicting 3 months in each experiment. We execute all the models on the subsets generated by all the methods. Then we select for each target variable a pair {method, model} that will be used in the testing step. In the reduction step, we use two existing methods, the Random Walk with Restart on Granger causality graphs (GRWR) and on Transfer entropy graphs (TRWR) [18], and the PCA method. Two versions of the method proposed in Algorithm 1, TTRCG and GTRCG, use either Transfer entropy or Granger causality as the causality measure. The forecasting models used can be classified into four main types (a sketch of the matching step follows this list):

– Regression models: Linear Regression, RANSAC Regressor (RR), Orthogonal Matching Pursuit (OMP), Theil-Sen Regressor (TSR), Huber Regressor (HB).
– Regression models with shrinkage: Ridge, Bayesian Ridge, SVM, Lasso.
– Decision trees: Decision Tree Regressor (DTR), Gradient Boosting Regressor (GBR).
– ANNs: a simple multilayer perceptron neural network (MLP), using one hidden layer and a stochastic gradient descent algorithm to update the parameters of the network.
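A minimal sketch of the model matching step described above: for each target variable, every {selection method, model} pair is scored by cross-validated RMSE and the best pair is kept. The `selectors` argument, the subset of models shown, and the plain k-fold splitting are hypothetical placeholders, not our exact benchmark.

```python
# Minimal model-matching sketch: pick the best {selection method, model} pair
# by cross-validated RMSE. `selectors` maps names (e.g., hypothetical stand-ins
# for GRWR/TRWR/GTRCG/TTRCG/PCA) to functions returning a feature subset;
# only three of the paper's models are shown, and data loading is assumed.
import numpy as np
from itertools import product
from sklearn.linear_model import HuberRegressor, Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

MODELS = {
    "HB": HuberRegressor(),
    "Lasso": Lasso(alpha=0.1),
    "GBR": GradientBoostingRegressor(),
}

def match(selectors: dict, X: np.ndarray, y: np.ndarray, cv: int = 8):
    """Return the (selector, model) pair minimizing cross-validated RMSE."""
    best, best_rmse = None, np.inf
    for (s_name, select), (m_name, model) in product(selectors.items(),
                                                     MODELS.items()):
        Xs = select(X, y)                     # selector returns a feature subset
        scores = cross_val_score(model, Xs, y, cv=cv,
                                 scoring="neg_root_mean_squared_error")
        rmse = -scores.mean()
        if rmse < best_rmse:
            best, best_rmse = (s_name, m_name), rmse
    return best, best_rmse
```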
5 Results and Discussion

In this section, we present the obtained results and provide a discussion. As described in the previous section, we used 3 heuristics (PCA, RWR and TRCG) in the training step. In the testing step, we also used a brute-force feature selection approach that computes all the possible subsets for a small number of the fastest prediction models. This allowed us to improve a few of the models that were previously pre-selected using the heuristic approaches. We obtained RMSE = 0.177 on 10% of the testing data and 0.253 on all the testing data. In the following, we present the results of the ensemble selection approach obtained in the training step, i.e., with the heuristic methods. We focus on the results with heuristic methods, as they can be applied to large-scale data sets.

Table 1: Results of the model selection step for all PV plants

Hours | id1 Method | id1 Model | id2 Method | id2 Model | id3 Method | id3 Model
  2   | GTRCG      | GBR       | GTRCG      | MLP       | GRWR       | GBR
  3   | TRWR       | GBR       | TRWR       | MLP       | GRWR       | GBR
  4   | GRWR       | GBR       | GRWR       | MLP       | TTRCG      | GBR
  5   | TRWR       | Lasso     | TRWR       | HB        | TTRCG      | Lasso
  6   | TRWR       | HB        | TTRCG      | HB        | GRWR       | HB
  7   | GRWR       | MLP       | GRWR       | MLP       | TTRCG      | MLP
  8   | TRWR       | TSR       | GRWR       | TSR       | TRWR       | HB
  9   | TRWR       | GBR       | TTRCG      | HB        | TRWR       | Lasso
 10   | GTRCG      | GBR       | GRWR       | OMP       | TRWR       | Lasso
 11   | GTRCG      | MLP       | GTRCG      | MLP       | TRWR       | TSR
 12   | TRWR       | TSR       | GRWR       | OMP       | TRWR       | TSR
 13   | TRWR       | HB        | TTRCG      | OMP       | TRWR       | HB
 14   | TRWR       | HB        | TRWR       | TSR       | TRWR       | HB
 15   | TRWR       | HB        | TRWR       | HB        | TTRCG      | HB
 16   | GRWR       | HB        | TTRCG      | HB        | TRWR       | HB
 17   | TRWR       | HB        | TRWR       | TSR       | GTRCG      | GBR
 18   | TTRCG      | GBR       | TTRCG      | MLP       | TTRCG      | MLP
 19   | TRWR       | GBR       | TRWR       | MLP       | GTRCG      | GBR
 20   | GRWR       | GBR       | GRWR       | MLP       | GTRCG      | GBR

Table 1 shows that causality-based feature selection methods outperform the PCA-based approaches. In particular, the approaches based on RWR on the causality graphs [18] and the newly proposed algorithm are the most competitive. The general picture, however, is that no single model gets the best results in all cases.

In Figure 2, the forecast accuracy based on RMSE shows that plants id1 and id2 are quite similar, both in terms of characteristics and selected models (Table 1). In the same figure, relative RMSEs are shown in the last three plots. We remark that there are some hours at the beginning and the end of the day (from 2 to 5 and from 18 to 20) where the energy production is weak and very hard to predict. Unfortunately, this decreases the global forecast accuracy. As a side remark, the prediction models selected for these hours are the MLP and Gradient Boosting models for all the plants, which suggests that when the energy production is low, the data are prone to outliers and missing values.

Fig. 2: Forecast accuracy analysis using RMSE (top: RMSE per hour for plants id1, id2 and id3; bottom: relative RMSE per hour).

6 Conclusion

In this paper, we investigated the multi-plant PV energy forecasting task. We presented a feature selection and model matching framework: for a given variable, we use heuristics to find the optimal combination of a forecasting model with the most relevant features. Our matching approach is a two-step process: (i) we use an algorithm that picks an optimal subset of features (or combines the features), and (ii) we evaluate the selection on various prediction models, such as regression, decision tree, or artificial neural network models. Finally, we select the models that perform best. The second contribution is a new feature selection algorithm, which uses the transitive reduction algorithm on the graph of causalities. The results show the utility of using different feature selection methods and prediction models. However, the forecast accuracy analysis using relative RMSE shows some difficulties in giving good predictions at certain hours of the day, especially when the energy production is low, which decreases the global performance.
References

1. Box, G.: Box and Jenkins: Time Series Analysis, Forecasting and Control. In: A Very British Affair. Palgrave Advanced Texts in Econometrics. Palgrave Macmillan UK (2013) 161–215
2. Stock, J.H., Watson, M.W.: Chapter 10 Forecasting with Many Predictors. In: Elliott, G., Granger, C.W.J., Timmermann, A., eds.: Handbook of Economic Forecasting. Volume 1. Elsevier (2006) 515–554
3. Ceci, M., Corizzo, R., Fumarola, F., Malerba, D., Rashkovska, A.: Predictive Modeling of PV Energy Production: How to Set Up the Learning Task for a Better Prediction? IEEE Transactions on Industrial Informatics 13(3) (June 2017) 956–966
4. Dumitru, C.D., Gligor, A., Enachescu, C.: Solar Photovoltaic Energy Production Forecast Using Neural Networks. Procedia Technology 22 (January 2016) 808–815
5. Gandelli, A., Grimaccia, F., Leva, S., Mussetta, M., Ogliari, E.: Hybrid model analysis and validation for PV energy production forecasting. In: 2014 International Joint Conference on Neural Networks (IJCNN). (July 2014) 1957–1962
6. Antonanzas, J., Osorio, N., Escobar, R., Urraca, R., Martinez-de Pison, F.J., Antonanzas-Torres, F.: Review of photovoltaic power forecasting. Solar Energy 136 (October 2016) 78–111
7. Johansen, S.: Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models. Econometrica 59(6) (1991) 1551–1580
8. Hoerl, A.E., Kennard, R.W.: Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12(1) (1970) 55–67
9. Zhang, T.: Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning. ICML '04, New York, NY, USA, ACM (2004) 116
10. Bottou, L.: Stochastic Gradient Descent Tricks. In: Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg (2012) 421–436
11. Tsai, C.F.: Feature selection in bankruptcy prediction. Knowledge-Based Systems 22(2) (March 2009) 120–127
12. Tsai, C.F., Hsiao, Y.C.: Combining multiple feature selection methods for stock prediction: Union, intersection, and multi-intersection approaches. Decision Support Systems 50(1) (December 2010) 258–269
13. Zhang, X., Hu, Y., Xie, K., Wang, S., Ngai, E.W.T., Liu, M.: A causal feature selection algorithm for stock prediction modeling. Neurocomputing 142 (October 2014) 48–59
14. Adragni, K.P., Cook, R.D.: Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 367(1906) (November 2009) 4385–4405
15. Granger, C.W.J.: Testing for causality. Journal of Economic Dynamics and Control 2 (January 1980) 329–352
16. Schreiber, T.: Measuring Information Transfer. Physical Review Letters 85(2) (July 2000) 461–464
17. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equivalent for Gaussian variables. Physical Review Letters 103(23) (December 2009)
18. Przymus, P., Hmamouche, Y., Casali, A., Lakhal, L.: Improving multivariate time series forecasting with random walks with restarts on causality graphs. In: ICDM Workshops 2017