<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Plant Photovoltaic Energy Forecasting Challenge with Regression Tree Ensembles and Hourly Average Forecasts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kathrin Bujna</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Wistuba</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In this paper, we present the winning solution to the ECML PKDD Discovery Challenge 2017 on Multi-Plant Photovoltaic (PV) Energy Forecasting. The goal of the challenge is to utilize the historic data of three different PV plants in Italy regarding meteorological conditions and production in order to forecast their energy production. A major problem is that the data contains many missing values for most sensors, especially for the period of the year for which predictions shall be made in the subsequent year. We investigate two approaches: a regression tree ensemble whose hyperparameters were tuned via Bayesian optimization, and a simple rule that just predicts the hourly averages that were observed during the previous year. The latter approach is the winning solution of the challenge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Affiliations</title>
      <p>1 Paderborn University, 2 IBM Research Ireland</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>Due to the urgent need to reduce pollution emissions, clean and renewable energy
sources have become a strategic sector for the European Union and internationally. In
particular, photovoltaic (PV) power plants, which are already widely distributed
in Europe, are becoming a major source of renewable energy. The main challenges
faced by the renewable energy market are grid integration, load balancing, and
energy trading. In order to face these challenges, we need reliable tools for the
prediction of the energy production.</p>
      <p>Challenge. In the ECML PKDD 2017 Multi-Plant Photovoltaic Energy
Forecasting Challenge, we are given multiple time series regarding meteorological
conditions and production collected by sensors on three closely located PV plants
in Italy. Given the data for the period from January to December 2012, the goal
is to forecast the energy production of each plant for the period from January
to March 2013.</p>
      <p>
        Related Work. Ceci et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] assess several approaches for PV power forecasting. In contrast
to the given challenge, the authors consider a larger number of features, for
instance, the weather forecast provided by numerical weather prediction systems
and the geographic coordinates of the plants. Moreover, they consider the data
collected from 18 power plants, instead of the data from just 3 plants.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Task</title>
      <p>
        For detailed information on the given data, we refer to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Time Series. The given data set consists of time series data regarding weather
conditions and production collected by sensors on three closely located PV plants
in Italy. The elements of these time series are hourly aggregations obtained as
the average of all the measures available during a specific hour at a specific day
of the year. That is, for each plant, day, and sensor, we are given a time series
of 19 values representing hourly aggregated observations (plants are active from
2am to 8pm).</p>
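      <p>Assuming each raw reading carries its hour of the day, the hourly aggregation described above can be sketched as follows (the function name and the (hour, value) record layout are hypothetical):</p>

```python
from collections import defaultdict

def hourly_aggregate(measurements):
    """Average all raw (hour, value) readings of one sensor on one day
    into one value per hour; hours outside the plants' active window
    (2am to 8pm) are dropped."""
    buckets = defaultdict(list)
    for hour, value in measurements:
        if 2 <= hour <= 20:  # plants are active from 2am to 8pm
            buckets[hour].append(value)
    return {h: sum(vals) / len(vals) for h, vals in sorted(buckets.items())}
```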
      <p>The data was gathered locally, from sensors available on plants and from
the closest meteorological stations. There are three time series that describe the
plant: irradiance, temperature, and the power production. There are seven time
series that describe the weather: temperature, cloudcover, dewpoint, humidity,
pressure, windbearing, and windspeed.</p>
      <p>Train and Test. The given training data spans a period of 12 months (the year
2012) and includes the target time series, i.e. the power production observed for
each plant. The testing data consists of 3 months (January to March 2013) for
which the target time series is not provided. According to the challenge website,
min-max scaling (between 0 and 1) of the power variables needs to be performed
on the training and testing data before applying the prediction model.</p>
      <p>Additional Information. According to the challenge website, the data can
contain a reasonable number of missing values and outliers. Missing values are indicated by zero
measurements. Besides that, the PV plants have a maximal power rate of 1 000
kW/h.</p>
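      <p>A minimal sketch of the required min-max scaling, assuming the maximal power rate of 1 000 kW/h is used as the upper bound (an assumption; the challenge website only states that scaling between 0 and 1 must be performed):</p>

```python
def min_max_scale(values, lo=0.0, hi=1000.0):
    """Scale power values into [0, 1] via min-max scaling."""
    return [(v - lo) / (hi - lo) for v in values]
```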
    </sec>
    <sec id="sec-4">
      <title>Exploration</title>
      <p>Some of the power measurements are much higher than the maximal power rate
of 1 000 kW/h (e.g. 885 124 kW/h). Therefore, in the following figures, we clean
the power measurements by setting all power values larger than 1 000 kW/h to
exactly 1 000 kW/h.</p>
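      <p>The cleaning step used for the figures amounts to a simple clipping rule (the function name is hypothetical):</p>

```python
MAX_POWER = 1000.0  # maximal power rate in kW/h stated by the challenge

def clip_power(values):
    """Cap implausible power readings at the plants' maximal power rate."""
    return [min(v, MAX_POWER) for v in values]
```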
      <sec id="sec-4-1">
        <title>Single Time Series vs. Power Production Time Series</title>
        <p>In Fig. 1 we consider the impact of the single variables with respect to the power
production and, additionally, the average monthly and hourly power production.
The results confirm our expectation: First and foremost, we expect that the
power production is higher during summer, during the middle of a day, and
while the plant irradiance is higher. We expect that, if the weather is sunny,
then the plant irradiance is higher. We expect sunny weather if the weather
temperature is higher, the pressure is rather high, there are few clouds, and the
humidity is low.</p>
        <p>Most notably, there is a strong correlation between the plant irradiance and
the power production. The correlation of the power production with the weather
temperature, pressure, cloud cover, and humidity is less striking. Besides that,
in Fig. 2, we see that these variables also correlate with the plant irradiance.</p>
        <p>The dewpoint, the wind speed, and the wind bearing do not seem to be correlated
with the power production. One might suspect that the
informativeness of the wind speed and bearing depends on the location. However,
except for the wind bearing measured at the plant with index 1 (see Fig. 1),
these location-dependent figures are similar to the averages (over all plants)
depicted in Fig. 1.</p>
        <p>Last but not least, there is apparently a significant number of missing values,
which have been set to zero by default (see Sec. 2).</p>
      </sec>
      <sec id="sec-4-2">
        <title>Daily Measurements of the Plant Sensors</title>
        <p>We considered the plant irradiance and the plant temperature to be the most
promising variables. Fig. 3 depicts the daily average values of these variables during
the year 2012.</p>
        <p>For plant 1, a lot of the plant irradiance and plant temperature measurements
are probably missing between day 30 and day 58. During this interval, both
measurements are constantly 0. Besides that, we observe that the
plant power of plant 1 drops significantly during the very same time. That is, in
comparison to the other plants, we would expect roughly 150 kW/h, but observe
about 100 kW/h.</p>
        <p>Regarding plant 2, there seem to be even more missing values in the first
90 days of the year 2012. During the first 90 days, the plant irradiance and
plant temperature measurements are constantly 0. Besides that, the plant power
measurements behave as expected. Still, there seems to be a slight drop of the
power production between day 25 and 50.</p>
        <p>Finally, consider the measurements regarding plant 3. Here, there seem to be
no missing measurements. However, we observe a strange behavior of the plant
irradiance measurements: Approximately between day 50 and day 150, the plant
irradiance increases much faster compared to the irradiance measurements for
plants 1 and 2. At day 150, the measurements decrease to values
which are similar to those measured for plants 1 and 2. Again,
there seems to be a slight drop of the power production between day 25 and 50.</p>
        <p>To sum up, the time series of all three plants exhibit an unusual behavior
in the first 90 days of the year 2012. Unfortunately, the test data consists of
measurements for the first 90 days of the subsequent year. On a side note, the
test data does not seem to contain missing plant irradiance or plant temperature
measurements.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Data Preprocessing</title>
      <p>From Sec. 3, we know that the data set most probably contains a lot of missing
values, which are indicated by zero measurements, as well as outliers.</p>
      <p>Missing Data. Recall that missing values are indicated by zero measurements.
However, a zero measurement does not necessarily indicate a missing value.
Therefore, we did not mark all zero measurements as missing: In the plant
irradiance time series, we marked a zero measurement as missing if, at the same
time point, the plant temperature was also zero (i.e. we marked 2577 of 7921 zero
measurements as missing) (cf. the exploration of the data regarding plant 2 in Sec. 3.2).
After that, we marked all zero measurements in the plant temperature and weather
pressure time series as missing.</p>
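      <p>A sketch of the missing-value marking rule for the plant irradiance time series, assuming missing values are represented as NaN (function name and list-based layout are hypothetical):</p>

```python
import math

def mark_missing_irradiance(irradiance, plant_temp):
    """Mark a zero irradiance reading as missing (NaN) only when the
    plant temperature at the same time point is also zero."""
    return [math.nan if irr == 0.0 and temp == 0.0 else irr
            for irr, temp in zip(irradiance, plant_temp)]
```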
      <p>Outlier Removal. First, we removed unusual values. In the given training data,
the minimum plant temperature is −3 °C, while in the test data we observe plant
temperature measurements of down to −202 °C. This is why we decided to remove
(i.e. mark as missing) all plant temperature measurements of less than −3 °C from
the test data.</p>
      <p>
        Second, we used the outlier removal proposed by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. That is, given a time series (v_t)_{t=1,…,T} (i.e. the plant irradiance time series), we consider the value v_t
at time t an outlier if v_t ∉ [v̄ − 4σ_v, v̄ + 4σ_v], where v̄ = (1/T) ∑_{t=1}^{T} v_t denotes the sample
mean and σ_v² = 1/(T−1) ∑_{t=1}^{T} (v_t − v̄)² denotes the (unbiased) sample variance. To
be precise, we conducted this outlier removal separately for the time series given
by the training data and by the test data, respectively.
      </p>
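      <p>A sketch of this outlier-removal rule, using the sample mean and the unbiased sample standard deviation and marking outliers as None (the function name is hypothetical):</p>

```python
import statistics

def remove_outliers(series, k=4.0):
    """Replace values outside [mean - k*std, mean + k*std] with None."""
    mean = statistics.fmean(series)
    std = statistics.stdev(series)  # unbiased (n - 1) estimator
    return [v if mean - k * std <= v <= mean + k * std else None
            for v in series]
```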
      <p>Normalization of Power Measurements. After our outlier removal there were
still some power measurements larger than the maximal power value of 1 000
kW/h. We simply replaced these measurements with the maximal power value
(i.e. 1 000 kW/h).</p>
      <p>Additional Date Features. We created one training and one test data set that contain
the plant ids, the plant sensor features, the weather features, and, additionally, three
date features. The date features indicate the hour of the day (by values 1, …, 19),
the day of the year (by values 1, …, 366), and the month (by values 1, …, 12).</p>
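      <p>A sketch of the date-feature extraction; the mapping of 2am to hour index 1 (and 8pm to 19) is an assumption consistent with the plants' active window, not something the challenge documents explicitly:</p>

```python
from datetime import datetime

def date_features(ts):
    """Return (hour-of-day index 1..19, day of year 1..366, month 1..12)."""
    return ts.hour - 1, ts.timetuple().tm_yday, ts.month
```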
    </sec>
    <sec id="sec-6">
      <title>Prediction</title>
      <p>In the following, we first describe two approaches: a regression tree ensemble and
an evaluation of hourly averages. The latter turned out to be the winning
solution of the challenge.</p>
      <sec id="sec-6-1">
        <title>Training Data Splits</title>
        <p>In order to estimate the performance of our approaches, we split the training data
into two splits: The first split T1–23 contains all data from the training data set
concerning the first 23 days of each month in 2012. The second split T24+ contains
the remaining data, i.e. all training data concerning the 24th day through the
last day of each month in 2012. In the following, we train our models using the
larger split T1–23, and evaluate the resulting models with respect to T24+.</p>
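        <p>The split can be sketched as follows, assuming each training row carries a day-of-month field (the record layout is hypothetical):</p>

```python
def split_by_day_of_month(rows):
    """Split training rows into T1-23 (days 1-23 of each month) and
    T24+ (day 24 through the last day of each month)."""
    t1_23 = [r for r in rows if r["day_of_month"] <= 23]
    t24_plus = [r for r in rows if r["day_of_month"] >= 24]
    return t1_23, t24_plus
```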
        <p>Our motivation for these splits is twofold: First of all, we want to ensure that
there is enough information about each month contained in each split. Second,
we want the splits to correspond to rather independent time series. In other
words, we want the number of measurements v_t for which the subsequent
measurement v_{t+1} is contained in a different split than v_t to be as small as possible.
This property would be violated if, for example, we chose the assignment
of a measurement to one of the splits uniformly at random.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Approaches: Regression Tree Ensembles and Hourly Averages</title>
        <p>
          Our first approach is to use gradient boosted regression trees (GBRT) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which
can handle heterogeneous data with missing values. To tune the hyperparameters
of the GBRT technique, we apply a black-box optimization method [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. That is,
we apply Bayesian optimization to find the hyperparameters which minimize,
with respect to the validation split T24+, the RMSE of gradient boosted regression trees
that have been trained with these hyperparameters on the training split T1–23.
As the resulting regression tree ensemble predicted rather strange time
series (regarding the testing data), we also tried the following simple baseline.
        </p>
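        <p>A hedged sketch of the tuning loop; here scikit-learn's GradientBoostingRegressor stands in for XGBoost [2] and a small grid search stands in for Bayesian optimization [3], so this only illustrates the train-on-T1–23, validate-on-T24+ structure:</p>

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def tune_gbrt(X_train, y_train, X_val, y_val):
    """Pick the GBRT hyperparameters minimizing RMSE on the validation split."""
    best_rmse, best_model = float("inf"), None
    for lr in (0.05, 0.1):
        for depth in (2, 3):
            model = GradientBoostingRegressor(
                learning_rate=lr, max_depth=depth,
                n_estimators=50, random_state=0).fit(X_train, y_train)
            rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
            if rmse < best_rmse:
                best_rmse, best_model = rmse, model
    return best_model, best_rmse
```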
        <p>Our second approach is to predict hourly average values: For each plant p,
hour of the day h, and month m, we computed the average power
production of plant p during the month m at this specific hour h of the day.
For instance, our prediction of the power production of plant 1 for February 1st
between 1pm and 2pm is the average power production of plant 1 between 1pm
and 2pm in February 2012.</p>
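        <p>The hourly-average predictor amounts to a simple lookup table over (plant, month, hour) keys (the record layout is hypothetical):</p>

```python
from collections import defaultdict

def fit_hourly_averages(rows):
    """Map each (plant, month, hour) key to the average power
    production observed for that key in the training year."""
    acc = defaultdict(lambda: [0.0, 0])
    for r in rows:
        key = (r["plant"], r["month"], r["hour"])
        acc[key][0] += r["power"]
        acc[key][1] += 1
    return {k: s / n for k, (s, n) in acc.items()}

def predict(model, plant, month, hour):
    """Predict an hour in 2013 by looking up the average for the same
    plant, month, and hour of the day in 2012."""
    return model[(plant, month, hour)]
```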
      </sec>
      <sec id="sec-6-3">
        <title>Evaluation</title>
        <p>First, let us consider the performance of our models with respect to the validation
split T24+: Our resulting regression tree ensemble yields a root mean squared
error (RMSE) of 0.0685. In comparison, our hourly average approach performs
poorly and only achieves an RMSE of 0.1297. However, our models perform
differently with respect to the testing data.</p>
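        <p>For reference, the evaluation metric can be computed as:</p>

```python
def rmse(y_true, y_pred):
    """Root mean squared error, the challenge's evaluation metric."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
            / len(y_true)) ** 0.5
```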
        <p>Regarding the testing data, our simple hourly average approach turned out to
be the winning approach. Tab. 1 depicts the temporary and the final leaderboard,
which have been published on the challenge website. Our simple hourly average
approach already achieved the second best score on the temporary leaderboard,
which contains 10% of the testing data. Regarding the complete testing data,
the hourly average approach outperformed all the other competing solutions.</p>
        <p>As already explained, the given training data contained many missing values,
especially during the time of the year for which we have to predict the power
production in the subsequent year (see Sec. 3.2). We expect that, given more
reasonable data, a machine learning approach will outperform our simple hourly
average approach.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Ceci</surname>, <given-names>M.</given-names></string-name>
          ,
          <string-name><surname>Corizzo</surname>, <given-names>R.</given-names></string-name>
          ,
          <string-name><surname>Fumarola</surname>, <given-names>F.</given-names></string-name>
          ,
          <string-name><surname>Malerba</surname>, <given-names>D.</given-names></string-name>
          ,
          <string-name><surname>Rashkovska</surname>, <given-names>A.</given-names></string-name>
          :
          <article-title>Predictive modeling of PV energy production: How to set up the learning task for a better prediction?</article-title>
          <source>IEEE Transactions on Industrial Informatics</source>
          <volume>13</volume>
          (
          <issue>3</issue>
          ),
          <fpage>956</fpage>
          –
          <lpage>966</lpage>
          (
          <year>June 2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Chen</surname>, <given-names>T.</given-names></string-name>
          ,
          <string-name><surname>Guestrin</surname>, <given-names>C.</given-names></string-name>
          :
          <article-title>XGBoost: A scalable tree boosting system</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <fpage>785</fpage>
          –
          <lpage>794</lpage>
          . KDD '16, ACM, New York, NY, USA (
          <year>2016</year>
          ), http://doi.acm.org/10.1145/2939672.2939785
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Snoek</surname>, <given-names>J.</given-names></string-name>
          ,
          <string-name><surname>Larochelle</surname>, <given-names>H.</given-names></string-name>
          ,
          <string-name><surname>Adams</surname>, <given-names>R.P.</given-names></string-name>
          :
          <article-title>Practical Bayesian optimization of machine learning algorithms</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States</source>
          . pp.
          <fpage>2960</fpage>
          –
          <lpage>2968</lpage>
          (
          <year>2012</year>
          ), http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>