<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hindcasting with Multistations Using Analog Ensembles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandre Chesneau</string-name>
          <email>alexandre.chesneau@etu.enseeiht.fr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Balsa</string-name>
          <email>balsa@ipb.pt</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Veiga Rodrigues</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabel Lopes</string-name>
          <email>isalopes@ipb.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Applied Management Research Unit (UNIAG), Instituto Politecnico de Braganca</institution>
          ,
          <addr-line>Campus de Santa Apolonia, 5300-253 Braganca</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Centro ALGORITMI, Escola de Engenharia - Universidade do Minho, Campus Azurem</institution>
          ,
          <addr-line>4800-058 Guimarães</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Research Centre in Digitalization and Intelligent Robotics (CeDRI), Instituto Politecnico de Braganca</institution>
          ,
          <addr-line>Campus de Santa Apolonia, 5300-253 Braganca</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universite de Toulouse - Institut National Polytechnique de Toulouse</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Vestas Wind Systems A/S - Design Center Porto -</institution>
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <fpage>215</fpage>
      <lpage>229</lpage>
      <abstract>
        <p>A hindcast with multiple stations was performed with various Analog Ensembles (AnEn) algorithms. The different strategies were analyzed and benchmarked in order to improve the prediction. The underlying problem consists in making weather predictions for a location where no data is available, using meteorological time series from nearby stations. Various methods are explored, from the basic one, originally described by Monache and co-workers, to methods using cosine similarity, normalization, and K-means clustering. The best results were obtained with the K-means metric, which yielded a quadratic error between 3% and 30% lower than the Monache metric. Increasing the predictors to two stations improved the performance of the hindcast, reducing the error by up to 16%, depending on the correlation between the predictor stations.</p>
      </abstract>
      <kwd-group>
        <kwd>Analog Ensembles logical data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Weather prediction using Analog Ensembles (AnEn) is not a recent idea. It
was described as early as 1969, by Lorenz [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], who nevertheless concluded that
such a method would not work. Later work demonstrated the usefulness
of this approach in a much more limited scope, in various fields ranging from
meteorology to flood studies, thanks in particular to the decisive contributions of
Van Den Dool [
        <xref ref-type="bibr" rid="ref10 ref9">9,10</xref>
        ].
      </p>
      <p>
        In the field of meteorology, a major contribution to the use of the Analog
Ensemble method was made by Monache et al. (2011) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The method was later refined [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
and applied to a variety of operational situations [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5,6,7</xref>
        ], which demonstrated its accuracy
and usefulness.
      </p>
      <p>The Analog Ensemble method is in fact a post-processing procedure, used
to improve the accuracy of a meteorological model. The idea is simple. A
model produces forecasts, and a record of the forecasts it issued at past dates
(the historical forecasts) is also available. To improve a given forecast,
it is compared with these historical forecasts.
The historical forecasts closest to the current one are kept, and the
meteorological values actually observed at the corresponding dates are used to correct the forecast value.
This is where the method gets its name: past forecasts close to the current
forecast are called analogs, and together they form an ensemble.</p>
      <p>The aim of this paper is to compare the performance of various methods
for determining these analogs, to establish which of them is the most
accurate, and thereby to find ways to improve the AnEn method described
in the literature. To this end, the methods were applied to a hindcasting
problem, where the time series of a meteorological variable at a location was
reconstructed using data from other weather stations. Results obtained using
the data from only one predictor station were compared with those obtained using
two stations as predictors.</p>
      <p>Section 2 of this document contains the methodology with the
mathematical formulation, definitions for error quantification, and a description of the
data used in this study. The results are presented in Section 3 with subsequent
analysis. Section 4 contains the conclusion and final remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>In this section, the data used for the tests are presented, alongside the various
methods compared and the tools used to assess the performance of each model.</p>
      <sec id="sec-2-1">
        <title>Analog Ensembles Overview</title>
        <p>
          The Analog Ensemble (AnEn) method is illustrated in Fig. 1, where the objective
is to predict a time-dependent variable at a location from multiple datasets. These
datasets are composed of observations, available only for a limited period of
time named the training period, and a historical dataset available for both the training
period and the period to be predicted. Usually, the historical dataset is a
time series from a Numerical Weather Prediction (NWP) model used to forecast
[
          <xref ref-type="bibr" rid="ref13 ref6">13,6</xref>
          ] or hindcast [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] meteorological data. In this work, real measurements from
meteorological stations were used as the historical dataset.
        </p>
        <p>
          The AnEn procedure is implemented in three steps for each prediction time.
In step 1, the value corresponding to the prediction time is taken from the
historical dataset, and this dataset is scanned for analogs matching that value.
Analogs are past occurrences deemed close enough to the current prediction,
classified as such according to an analog metric. Step 2 consists of matching
these analogs with the corresponding real observations at the target station.
Step 3 consists of correcting the current prediction with the past values matched
to the analogs. The period where historical data from both predictions and
observations are available is called the training period. The larger this period
is, the better the AnEn method performs [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The other period is the prediction
period, or reconstruction period in the case of hindcasting. The Analog Ensemble
method is very simple, but an accurate similarity metric is crucial for the
success of the forecast.
        </p>
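        <p>The three steps above can be sketched for a single variable and a single predictor station as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names, the list-based data layout and the plain Euclidean distance between windows are all hypothetical choices made for the example.</p>
        <preformat>
```python
import math

def window(series, t, k):
    """Return the 2k+1 samples of `series` centred on index t."""
    return series[t - k:t + k + 1]

def hindcast_value(predictor, target_train, t, train_times, k, n_analogs):
    """Estimate the target at time t from one predictor series (illustrative only)."""
    current = window(predictor, t, k)
    # Step 1: rank the training times by the distance between predictor windows.
    ranked = sorted(train_times,
                    key=lambda s: math.dist(current, window(predictor, s, k)))
    analogs = ranked[:n_analogs]
    # Step 2: match each analog with the observation at the target station.
    matched = [target_train[s] for s in analogs]
    # Step 3: build the prediction from the matched observations (simple mean).
    return sum(matched) / len(matched)

# Toy data: the predictor is known over 12 samples, the target only over the first 9.
predictor = [1.0, 2.0, 3.0] * 4
target_train = [10.0, 20.0, 30.0] * 3
estimate = hindcast_value(predictor, target_train, t=10,
                          train_times=range(1, 8), k=1, n_analogs=3)
print(estimate)
```
        </preformat>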
      </sec>
      <sec id="sec-2-2">
        <title>Testing Database</title>
        <p>
          Testing was done using the data from meteorological stations located on the
coast of the state of Virginia, USA. These stations were used because their
observations are freely available from the United States' National Data Buoy
Center [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The location of the stations is shown in Fig. 2.
        </p>
        <p>The data span the years 2012 to 2018. The data from the years 2012
to 2016 were kept as a historical database (training period), and the experiment
aimed to reconstruct the data from 2017 to 2018 at one station (reconstruction
period). The station data are time-integrated into six-minute samples,
meaning that each meteorological variable is observed ten times per hour.
The stations observe six different meteorological variables: pressure (PRES), air
temperature (ATMP), water temperature (WTMP), wind speed (WSPD), gust
speed (GST) and wind direction (WDIR). In many cases the time series are
incomplete, with data missing over periods of varying length.</p>
        <p>The idea was therefore to hindcast: at one station where only the
meteorological data from 2012 to 2016 (training period) was known, the program had
to reconstruct the data from 2017 to 2018 (prediction period), using the other
stations, where the full range from 2012 to 2018 was known, as predictors. In terms
of the AnEn procedure illustrated in Fig. 1, the data from 2012-2016 at the
target station is the "Observed dataset", whilst the data at all other stations
between 2012 and 2018 is the "Historical Dataset" (comprising multiple time series),
where the "Training" and "Prediction" periods are delimited by 2012-2016
and 2017-2018, respectively.</p>
        <p>The advantage of such a setup is that it becomes easy to evaluate the model
accuracy, because the estimates obtained with the AnEn method can be compared
with the real values. There is one important thing to note, however: because
the volume of data collected by the stations every six minutes over 7 years is huge, it was
time-consuming to sequentially process all the records in the historical dataset.
Instead, the problem was simplified to predicting the weather between 10 am and
noon, using analogs of the weather between 10 am and noon. This greatly reduces
computing time, while still giving data from different years and different seasons,
and thus very different weather patterns.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Determination of the Analogs</title>
        <p>The determination of the analogs is an important step of the AnEn method. In
the present work various methods, or metrics, have been used to compute the
similarity between forecasts.</p>
        <p>
          The first metric used is the metric originally established for the AnEn method
by Monache [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which will be referred to as Monache from now on. It is based
on the Euclidean distance. More precisely, it computes the difference between
the values of atmospheric variables in the two forecasts, over a window of time.
The formula used is the following:
        </p>
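        <p>
          <disp-formula>
            <label>(1)</label>
            <tex-math>m_{tt'} = \sum_{i=1}^{N_v} \frac{w_i}{\sigma_{f_i}} \sqrt{\sum_{j=-k}^{k} \left( F_{i,t+j} - A_{i,t'+j} \right)^2}</tex-math>
          </disp-formula>
        </p>
        <p>With the terms being the following:</p>
        <list list-type="bullet">
          <list-item>
            <p><inline-formula><tex-math>F_t</tex-math></inline-formula> is the current forecast, which needs to be improved.</p>
          </list-item>
          <list-item>
            <p><inline-formula><tex-math>A_{t'}</tex-math></inline-formula> is a past forecast compared to the current forecast.</p>
          </list-item>
          <list-item>
            <p><inline-formula><tex-math>N_v</tex-math></inline-formula> is the number of meteorological variables taken into account when comparing forecasts.</p>
          </list-item>
          <list-item>
            <p><inline-formula><tex-math>w_i</tex-math></inline-formula> is the weight given to variable <inline-formula><tex-math>i</tex-math></inline-formula>.</p>
          </list-item>
          <list-item>
            <p><inline-formula><tex-math>\sigma_{f_i}</tex-math></inline-formula> is the standard deviation of variable <inline-formula><tex-math>i</tex-math></inline-formula>. This term is used to scale the variables, thus making different variables possible to compare. Without this term, the computation would make no physical sense.</p>
          </list-item>
          <list-item>
            <p><inline-formula><tex-math>k</tex-math></inline-formula> is the length of the time window over which the forecasts are compared. Indeed, to compare <inline-formula><tex-math>F_t</tex-math></inline-formula> and <inline-formula><tex-math>A_{t'}</tex-math></inline-formula>, we do not only look at the variables at the times <inline-formula><tex-math>t</tex-math></inline-formula> and <inline-formula><tex-math>t'</tex-math></inline-formula>, but also at their evolution over a period of time. The aim is to make sure the weather pattern in both forecasts is similar.</p>
          </list-item>
          <list-item>
            <p><inline-formula><tex-math>F_{i,t+j}</tex-math></inline-formula> is the value of variable <inline-formula><tex-math>i</tex-math></inline-formula> at time <inline-formula><tex-math>t+j</tex-math></inline-formula> in the current forecast.</p>
          </list-item>
          <list-item>
            <p><inline-formula><tex-math>A_{i,t'+j}</tex-math></inline-formula> is the value of variable <inline-formula><tex-math>i</tex-math></inline-formula> at time <inline-formula><tex-math>t'+j</tex-math></inline-formula> in the past forecast.</p>
          </list-item>
        </list>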
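        <p>The Monache metric of Equation (1) can be sketched in a few lines. The code below is only an illustration: the function name, the dictionary-of-lists layout and the sample values are assumptions made for the example, and the standard deviations are passed in rather than estimated from data.</p>
        <preformat>
```python
import math

def monache_metric(current, past, t, t_past, k, weights, sigmas):
    """Weighted distance between the window around t and the window around t_past.

    current, past: dict mapping a variable name to its list of samples (assumed layout).
    weights, sigmas: dict mapping a variable name to its weight and standard deviation.
    """
    total = 0.0
    for name, w in weights.items():
        squared = 0.0
        for j in range(-k, k + 1):
            diff = current[name][t + j] - past[name][t_past + j]
            squared += diff * diff
        # One term of the outer sum: weight over standard deviation times Euclidean distance.
        total += w / sigmas[name] * math.sqrt(squared)
    return total

series = {"GST": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
weights = {"GST": 1.0}
sigmas = {"GST": 2.0}
d_same = monache_metric(series, series, 1, 1, 1, weights, sigmas)   # identical windows
d_shift = monache_metric(series, series, 1, 4, 1, weights, sigmas)  # windows offset by 3
print(d_same, d_shift)
```
        </preformat>
        <p>A smaller value means a closer analog, so candidate times would be ranked in increasing order of this metric.</p>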
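        <p>Something of note here is the importance of the parameter <inline-formula><tex-math>k</tex-math></inline-formula>: there is no obvious
value for this parameter, so a separate study would be needed to determine
its optimal value. Considering <inline-formula><tex-math>F_t</tex-math></inline-formula> and <inline-formula><tex-math>A_{t'}</tex-math></inline-formula> as two vectors in a space of
dimension <inline-formula><tex-math>2k+1</tex-math></inline-formula>, the Monache metric, presented in Equation (1), can be
rewritten as</p>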
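        <p>
          <disp-formula>
            <label>(2)</label>
            <tex-math>m_1 = \sum_{i=1}^{N_v} \frac{w_i}{\sigma_{f_i}} \left\lVert F_t - A_{t'} \right\rVert ,</tex-math>
          </disp-formula>
where <inline-formula><tex-math>\lVert \cdot \rVert</tex-math></inline-formula> represents the Euclidean norm.</p>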
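        <p>An alternative metric consists of normalizing <inline-formula><tex-math>F_t</tex-math></inline-formula> and <inline-formula><tex-math>A_{t'}</tex-math></inline-formula> in Equation (2),
resulting in the normalized Monache metric presented in Equation (3).</p>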
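        <p>
          <disp-formula>
            <label>(3)</label>
            <tex-math>m_2 = \sum_{i=1}^{N_v} \frac{w_i}{\sigma_{f_i}} \left\lVert \frac{F_t}{\lVert F_t \rVert} - \frac{A_{t'}}{\lVert A_{t'} \rVert} \right\rVert .</tex-math>
          </disp-formula>
        </p>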
        <p>Normalized vectors all have a norm of 1. The idea behind this choice is
to look at the global weather pattern present in both forecasts. The
basic Monache metric looks not only for a similar weather pattern but
also for similar numerical values of the various variables used in the forecast. As
a consequence, it will not keep as an analog a forecast behaving exactly like the
forecast to improve, but at higher or lower values. Normalization aims at solving
this perceived problem. This method is called Normalised Monache, shortened
to Normalised.</p>
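        <p>The effect of normalization can be illustrated with a small sketch. It computes only the per-variable norm term of Equations (2) and (3), leaving out the weighted sum over variables; the function names and sample vectors are hypothetical. Two windows with the same shape but different amplitudes are far apart for the plain metric and nearly identical once normalized.</p>
        <preformat>
```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def normalised_distance(f, a):
    """Distance between the two vectors after scaling each one to unit norm."""
    nf, na = norm(f), norm(a)
    if nf == 0.0 or na == 0.0:
        raise ValueError("a zero vector cannot be normalised")
    return norm([x / nf - y / na for x, y in zip(f, a)])

f = [2.0, 4.0, 6.0]   # current window
a = [1.0, 2.0, 3.0]   # past window: same pattern at half the amplitude
plain = norm([x - y for x, y in zip(f, a)])
scaled = normalised_distance(f, a)
print(plain, scaled)
```
        </preformat>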
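        <p>
          Keeping in line with the previous idea, the cosine of the angle between the
two vectors <inline-formula><tex-math>F_t</tex-math></inline-formula> and <inline-formula><tex-math>A_{t'}</tex-math></inline-formula> can be used, such that
          <disp-formula>
            <tex-math>\cos(\theta) = \frac{A_{t'}^{T} F_t}{\lVert A_{t'} \rVert \, \lVert F_t \rVert} ,</tex-math>
          </disp-formula>
where <inline-formula><tex-math>\theta</tex-math></inline-formula> denotes the angle between the vectors <inline-formula><tex-math>F_t</tex-math></inline-formula> and <inline-formula><tex-math>A_{t'}</tex-math></inline-formula>. The cosine can then
be used to estimate the analog by means of the correlations between the two
vectors, as demonstrated by K. Adachi [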
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This is the idea behind
the metric presented in Equation (4).
        </p>
        <p>
          <disp-formula>
            <label>(4)</label>
            <tex-math>m_3 = \sum_{i=1}^{N_v} \frac{w_i}{\sigma_{f_i}} \frac{A_{t'}^{T} F_t}{\lVert A_{t'} \rVert \, \lVert F_t \rVert} .</tex-math>
          </disp-formula>
        </p>
        <p>It is known that this value behaves like the correlation coefficient, taking values
between -1 and 1, with 1 indicating maximum similarity. The idea is, therefore, to
replace the Monache metric with the cosine of the angle between the forecasts,
keeping as analogs the past forecasts with cosine closest to 1. This is the cosine
method.</p>
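        <p>A minimal sketch of analog selection with the cosine follows. Unlike the distance-based metrics, candidates are ranked in decreasing order, since 1 indicates maximum similarity. The function names and the mapping from time index to window are assumptions made for the example, and only a single variable is handled.</p>
        <preformat>
```python
import math

def cosine_similarity(f, a):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(f, a))
    nf = math.sqrt(sum(x * x for x in f))
    na = math.sqrt(sum(y * y for y in a))
    return dot / (nf * na)

def best_analogs(current, candidates, n_analogs):
    """Return the time indices whose windows have the cosine closest to 1."""
    ranked = sorted(candidates,
                    key=lambda t: cosine_similarity(current, candidates[t]),
                    reverse=True)
    return ranked[:n_analogs]

current = [1.0, 2.0, 3.0]
candidates = {
    10: [2.0, 4.0, 6.0],     # same trend, larger amplitude: cosine of 1
    20: [3.0, 2.0, 1.0],     # opposite trend, positive values
    30: [-1.0, -2.0, -3.0],  # mirrored vector: cosine of -1
}
print(best_analogs(current, candidates, 2))
```
        </preformat>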
        <p>
          Lastly, clustering can be applied to this problem. Clustering is the
partitioning of data into clusters of similar data, as illustrated by Fig. 3. In this case each
multidimensional vector <inline-formula><tex-math>A_{t'}</tex-math></inline-formula> is assigned to a cluster of similar vectors. Clustering is used
to create the analog ensembles. By clustering the database of past forecasts, we
obtain analog ensembles that can then be used immediately. The only task left
is to assign the current forecast to the right analog ensemble, in other words,
to the closest cluster. This method was inspired by Gutierrez and co-workers,
who used clustering in a forecasting problem [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. From now on, this method is
referred to as Kmeans.
        </p>
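        <p>The clustering idea can be sketched with a small K-means loop. This is an illustration only: it uses a plain Lloyd iteration with a deterministic initialisation chosen for reproducibility, which is an assumption and not necessarily the variant used in this study, and all names and sample windows are hypothetical.</p>
        <preformat>
```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], v))

def kmeans(vectors, n_clusters, n_iter=20):
    """Plain Lloyd iteration; assumes n_clusters does not exceed len(vectors)."""
    # Deterministic start: evenly spaced vectors serve as the first centroids.
    step = len(vectors) // n_clusters
    centroids = [list(vectors[i * step]) for i in range(n_clusters)]
    for _ in range(n_iter):
        labels = [nearest(centroids, v) for v in vectors]
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(centroids, v) for v in vectors]
    return centroids, labels

# Past windows: each cluster of similar windows is one ready-made analog ensemble.
past_windows = [[1.0, 1.1, 0.9], [0.9, 1.0, 1.0], [5.0, 5.2, 4.9], [5.1, 4.8, 5.0]]
centroids, labels = kmeans(past_windows, n_clusters=2)

# The current window is assigned to the closest cluster.
current = [4.9, 5.0, 5.1]
cluster = nearest(centroids, current)
analog_indices = [i for i, lab in enumerate(labels) if lab == cluster]
print(labels, analog_indices)
```
        </preformat>
        <p>The clustering is computed once on the training data, so at prediction time only the distance to each centroid has to be evaluated.</p>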
      </sec>
      <sec id="sec-2-4">
        <title>Prediction Methods</title>
        <p>Making predictions using the Analog Ensemble method is very straightforward in
this case. With reference to Fig. 1, there are no NWP predictions
and past predictions. The NWP is replaced by data from other stations, both
in the past (training period) and in the forecasting interval (prediction period).
The principle, however, remains the same. Since there is no forecast to correct
in this case, but only a prediction to make, the prediction is easily obtained
with the Monache and K-means metrics as
          <disp-formula>
            <label>(5)</label>
            <tex-math>F_t = \frac{1}{N_a} \sum_{i=1}^{N_a} F_{t_i} .</tex-math>
          </disp-formula>
It is simply the mean of the past target variable values at the times matching the
analogs. In the case of Cosine and Normalised Monache, however, a simple mean
is not enough. Because what is looked at is trends and not exactly equal
numerical values, there might be a difference between the variable's value at time <inline-formula><tex-math>t'</tex-math></inline-formula>
and the desired value. Therefore, the equation becomes
          <disp-formula>
            <label>(6)</label>
            <tex-math>F_t = \frac{1}{N_a} \sum_{i=1}^{N_a} \left( F_{t_i} + \Delta_{t t_i} \right) ,</tex-math>
          </disp-formula>
where <inline-formula><tex-math>\Delta_{t t_i} = A_t - A_{t_i}</tex-math></inline-formula> accounts for the scale difference between the analog and
the forecast.</p>
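        <p>The two prediction rules can be compared on a toy example. The sketch below assumes that the correction term is the difference between the predictor value at the prediction time and the predictor value at each analog time, as read from the definition above; the function and variable names are hypothetical.</p>
        <preformat>
```python
def predict_mean(target_at_analogs):
    """Plain mean of the target values observed at the analog times."""
    return sum(target_at_analogs) / len(target_at_analogs)

def predict_with_offset(target_at_analogs, predictor_now, predictor_at_analogs):
    """Mean of the target values, each shifted by the predictor offset of its analog."""
    corrected = [f + (predictor_now - a)
                 for f, a in zip(target_at_analogs, predictor_at_analogs)]
    return sum(corrected) / len(corrected)

# Analogs found on trend only: the pattern matches, but at lower values than today.
target_at_analogs = [10.0, 12.0, 14.0]
predictor_at_analogs = [9.0, 11.0, 13.0]
predictor_now = 15.0

plain = predict_mean(target_at_analogs)
shifted = predict_with_offset(target_at_analogs, predictor_now, predictor_at_analogs)
print(plain, shifted)
```
        </preformat>
        <p>In this example the plain mean stays at the level of the analogs, while the corrected mean is moved up to the level of the current predictor value.</p>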
      </sec>
      <sec id="sec-2-5">
        <title>Using two Stations as Predictors</title>
        <p>To improve the accuracy of the hindcast, it is tempting to use data coming
from various weather stations as predictors, instead of data coming from just
one station. This raises the problem of how to treat these additional data. This
problem was solved in two different ways, which were both used in this paper to
determine which method is more adequate to handle data coming from various
stations.</p>
        <p>The first method is called the dependent stations variant. This
variant considers the stations to be nothing more than additional predictor variables,
and as such computes analogs across all stations at once every time. That is to
say, the observation at time <inline-formula><tex-math>t'</tex-math></inline-formula> is deemed to be an analog of the weather at time
<inline-formula><tex-math>t</tex-math></inline-formula> if, and only if, the weather at all stations at time <inline-formula><tex-math>t'</tex-math></inline-formula> is close to the weather
at all stations at time <inline-formula><tex-math>t</tex-math></inline-formula>.</p>
        <p>A second idea is to look for analogs at each station. This is called the
independent stations variant. In other words, the metric is calculated
at each station independently of the others. The prediction is then made
using the mean of the analogs from all the stations. Compared to
the first approach, each station forms a disjoint set of data in which analogs are
searched separately. Weights are then assigned to each station to form the final
set of analogs, so that, for example, 90% of the analogs may come from the study of
the data at the first station and the remaining 10% from the study of the
data at the second station.</p>
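        <p>One possible way of turning station weights into a final set of analogs for the independent variant is sketched below. The rounding rule, the function names and the integer percentage weights are assumptions made for the illustration; the source does not specify how fractional analog counts are handled.</p>
        <preformat>
```python
def split_analog_counts(n_analogs, station_weights):
    """Number of analogs taken from each station, proportional to its integer weight."""
    total = sum(station_weights.values())
    counts = {s: n_analogs * w // total for s, w in station_weights.items()}
    # Analogs lost to rounding down go to the station with the largest weight.
    heaviest = max(station_weights, key=station_weights.get)
    counts[heaviest] += n_analogs - sum(counts.values())
    return counts

def independent_analogs(ranked_by_station, n_analogs, station_weights):
    """Concatenate the best analogs of each station according to the weights."""
    counts = split_analog_counts(n_analogs, station_weights)
    selected = []
    for station, ranked in ranked_by_station.items():
        selected.extend(ranked[:counts[station]])
    return selected

# Hypothetical analog times, already sorted from best to worst at each station.
ranked_by_station = {"wds": list(range(100, 130)), "mnp": list(range(200, 230))}
station_weights = {"wds": 90, "mnp": 10}   # percentages

counts = split_analog_counts(25, station_weights)
selected = independent_analogs(ranked_by_station, 25, station_weights)
print(counts, len(selected))
```
        </preformat>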
      </sec>
      <sec id="sec-2-6">
        <title>Error Assessment</title>
        <p>
          As shown by Chai and Draxler [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], assessing the model accuracy is best done
using various metrics. Three metrics are especially useful when trying to assess
the performance of a forecasting model. The first one is the bias:
          <disp-formula>
            <label>(7)</label>
            <tex-math>\mathrm{Bias} = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - y_i \right) ,</tex-math>
          </disp-formula>
with <inline-formula><tex-math>n</tex-math></inline-formula> being the number of forecasts, <inline-formula><tex-math>x_i</tex-math></inline-formula> the forecast values and <inline-formula><tex-math>y_i</tex-math></inline-formula> the true
values. As its name suggests, the bias measures the bias of the model:
it shows the average error compared to the truth. However, it does not
really show the behavior of the error. It is useful to determine if the model makes
predictions that are lower or higher than the truth, but in itself, it is not enough
to know how well the model performs. It only shows the systematic error of the
model.
        </p>
        <p>Thus, the Root Mean-Squared Error (RMSE) is also used, computed as:
          <disp-formula>
            <label>(8)</label>
            <tex-math>\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - x_i \right)^2} .</tex-math>
          </disp-formula>
This error is useful because the squared terms give a higher weight to large
errors. Thus, the RMSE will be higher if the model makes predictions that are
far from the truth, even if these erroneous predictions are few. The RMSE will
comparatively be lower for a model consistently close to the truth, even if the
forecast still deviates from the truth. It shows the random
errors of the model, that is, errors which happen randomly rather than in a systematic way.
        </p>
        <p>
          The third metric, whose usage alongside the RMSE is recommended by Chai
and Draxler [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], is the Mean Absolute Error (MAE):
          <disp-formula>
            <label>(9)</label>
            <tex-math>\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - x_i \right| .</tex-math>
          </disp-formula>
Compared to the bias, this metric computes the average distance, in absolute
value, to the truth. The bias simply computes the average error, but positive
errors and negative errors can cancel each other out. The MAE then gives a somewhat
more truthful assessment of the average distance to the truth. A low bias and
high MAE means that the model is not really accurate, but that its predictions
are sometimes higher than the truth, and sometimes lower. Thus, including the
MAE is necessary to really understand how the error is distributed in the forecast,
since it also shows a systematic error, but this time in terms of absolute distance.</p>
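        <p>The three error measures can be computed directly from their definitions. The sketch below uses hypothetical function names and toy values chosen so that positive and negative errors cancel, which gives a zero bias together with a large MAE and RMSE.</p>
        <preformat>
```python
import math

def bias(pred, truth):
    """Mean of the signed errors (forecast minus truth)."""
    return sum(x - y for x, y in zip(pred, truth)) / len(pred)

def rmse(pred, truth):
    """Square root of the mean squared error."""
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(pred, truth)) / len(pred))

def mae(pred, truth):
    """Mean of the absolute errors."""
    return sum(abs(y - x) for x, y in zip(pred, truth)) / len(pred)

truth = [0.0, 0.0, 0.0, 0.0]
pred = [2.0, -2.0, 2.0, -2.0]   # errors alternate in sign and cancel in the bias

print(bias(pred, truth), rmse(pred, truth), mae(pred, truth))
```
        </preformat>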
        <p>
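          One last tool used to show the error of a forecasting model is the Taylor
diagram, described by Taylor [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. This diagram shows the proximity between
two variables, which here are the truth and the prediction. Considering
two variables <inline-formula><tex-math>x_i</tex-math></inline-formula> and <inline-formula><tex-math>y_i</tex-math></inline-formula>, each having <inline-formula><tex-math>N</tex-math></inline-formula> components and with means <inline-formula><tex-math>\bar{x}</tex-math></inline-formula> and <inline-formula><tex-math>\bar{y}</tex-math></inline-formula>,
with the correlation coefficient between them being <inline-formula><tex-math>R</tex-math></inline-formula> and their standard deviations
being <inline-formula><tex-math>\sigma_x</tex-math></inline-formula> and <inline-formula><tex-math>\sigma_y</tex-math></inline-formula>, it can be shown that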
        </p>
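        <p>
          <disp-formula>
            <label>(10)</label>
            <tex-math>\mathrm{RMSE}^2 - \mathrm{Bias}^2 = \sigma_x^2 + \sigma_y^2 - 2 \sigma_x \sigma_y R ,</tex-math>
          </disp-formula>
which is the basis of the Taylor diagram representation.</p>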
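        <p>The identity behind the Taylor diagram can be checked numerically. The sketch below uses arbitrary sample values and assumes population (not sample) standard deviations, which is the convention under which the identity holds exactly.</p>
        <preformat>
```python
import math
import statistics

x = [1.0, 3.0, 2.0, 5.0, 4.0]   # arbitrary "prediction" values
y = [2.0, 2.5, 3.5, 4.0, 6.0]   # arbitrary "truth" values
n = len(x)

bias = sum(a - b for a, b in zip(x, y)) / n
mse = sum((a - b) ** 2 for a, b in zip(x, y)) / n   # RMSE squared

mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
sigma_x, sigma_y = statistics.pstdev(x), statistics.pstdev(y)
# Pearson correlation coefficient computed from its definition.
r = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n * sigma_x * sigma_y)

lhs = mse - bias ** 2
rhs = sigma_x ** 2 + sigma_y ** 2 - 2.0 * sigma_x * sigma_y * r
print(lhs, rhs)
```
        </preformat>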
      </sec>
      <sec id="sec-2-7">
        <title>Parameter Selection</title>
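        <p>Various tests were run to determine the most suitable values of <inline-formula><tex-math>k</tex-math></inline-formula>, the
time window, and <inline-formula><tex-math>N_a</tex-math></inline-formula>, the number of analogs kept. It was found that, in this
testing environment, the best value for <inline-formula><tex-math>k</tex-math></inline-formula> is <inline-formula><tex-math>k = 20</tex-math></inline-formula>, which corresponds to a time
window of four hours (two hours before the forecast and two hours after),
since the data contain 10 observations per hour. For <inline-formula><tex-math>N_a</tex-math></inline-formula>, <inline-formula><tex-math>N_a = 25</tex-math></inline-formula> gave
satisfactory results. Of course, these values may change according to the problem;
they are not set in stone and should be adapted to each case.</p>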
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The results of the tests can be divided into two parts. First, the
importance of the choice of stations was assessed, to see how much choosing
the right stations matters when hindcasting the values at another station. Then, since it is
also possible to assign weights to stations, the importance of choosing the right
weights was studied.</p>
      <sec id="sec-3-1">
        <title>Studying the Stations</title>
        <p>The first aim of this study is to evaluate how important the choice of the
stations used for hindcasting is. To evaluate this, the gust speed (GST) data for
the station ykt between the years 2017 and 2018 was reconstructed using three
different pairs of stations, and the value of GST at these pairs of stations.</p>
        <p>Table 1 contains results for an AnEn hindcast whose predictor was based
solely on the mnp station. The results in Table 2 extend the AnEn method
to include a second station as a predictor. Pairs were made from stations mnp,
dom, ykr and wds to assess how consistent the results are across different pairs.</p>
        <p>The results in Table 1 show less error for the Kmeans method, while retaining
a bias similar to that of the Cosine and Normalised methods. This is in line with the results
from Table 2, where the Kmeans method consistently shows better performance.
A simple application of the Monache metric yielded a higher bias (which is also
consistent with the results in Table 2). Normalizing the Monache metric shows
overall improvements in the bias and RMSE, though these are more evident in the results
from Table 2. It is only for the wds,mnp pair of predictors that the Monache
method shows a superior performance, though the Kmeans method still has
a lower RMSE.</p>
        <p>Comparing the results from Table 1 and Table 2, using mnp alone as predictor
is worse than using mnp together with either dom or wds for the Kmeans and Normalised
Monache methods. For the Cosine and Monache methods, it is clearly better
to use both wds and mnp rather than mnp alone for hindcasting. However, for
these methods, it is better to use mnp alone rather than both dom and mnp.</p>
        <p>Comparing the pure Monache metric with the Normalised one, the results
in Table 2 show that the latter leads to a reduction in bias and RMSE. The only
predictor pair where this was not observed was wds,mnp, yet the difference was
not meaningful, as it corresponds to a 4% higher RMSE. The behavior of the cosine
method resembles that of the Normalised metric, though with degraded performance
in the error metrics. This similarity was expected, as both methods find analogs
on differences ranging from -1 to 1, i.e. looking solely for relative patterns in the
time series.</p>
        <p>As can be seen in Table 2, Kmeans behaves in the same way as Normalised
Monache and Cosine, which implies that clustering relies on a similar idea
to these two methods. Its results, however, are noticeably better.</p>
        <p>These results show a rift between the methods: while Kmeans, Cosine and
Normalised Monache all give results following the same trend, Monache's
results go in another direction. There is, however, a rational explanation for this
behavior: Monache looks for analogs by minimizing the distance between the
target variable's value at time <inline-formula><tex-math>t</tex-math></inline-formula> (when the prediction is made) and at time <inline-formula><tex-math>t'</tex-math></inline-formula>
(the analog). The other methods, however, disregard this distance. Instead, they
look for a similar evolution of the weather during the time window. As such, the
results imply that at the station dom, the weather follows a similar pattern to
the station ykt, but because of the different location, meteorological values are
not the same. This difference in value disturbs the Monache method, but not
the others, which look at the underlying weather patterns.</p>
        <p>All methods, however, give their worst results with the dom-mnp pair, and
always by a clear margin, while the wds-mnp pair performs well in all cases.
This suggests that wds is a much better station to predict the weather at ykt
compared to dom, and that ykr is a very good predictor station too, since it
is able to offset the inaccuracies caused by the use of dom as predictor (except
in the case of the Monache method, since Monache places great emphasis on
the numerical distance between values). Overall, the mnp and wds pair is the one
giving the lowest RMSE across all methods.</p>
        <p>As expected, results are improved when using two stations, compared to
using just one. However, the results for Monache and Cosine methods suggest
that the choice of stations is important to really have a gain in performance.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Studying the Weights</title>
        <p>Now that the importance of using the correct stations has been assessed, it
is important to evaluate whether weighting the contribution from each station
can improve on the results, and to compare the two approaches described in
Section 2.5. For this, the focus was placed on the wds and mnp pair, whose
results gave the lowest RMSE overall in the previous test. The question is
whether it is possible to improve these results even further by assigning
weights to these stations.</p>
        <p>For this purpose, both the independent stations and the dependent
stations variants were used. The former variant allows weights to be set for each
individual station, while the latter does not, as detailed in Section 2.5.
As a consequence, in Table 3, tests showing "-" in the "Weight" columns
were run with the dependent method, while tests with numerical values in
the "Weight" columns were run with the independent method. The target
variable was kept the same (GST), for ease of comparison with the previous
results. The results are in Table 3.</p>
        <p>Considering the results from Table 3 and looking at the stations
independently, Monache yields the best results, but only if the weights are equal. However,
setting most weight on ykr is better than setting most weight on mnp.
Normalised Monache, however, prefers to have the analogs searched across all the stations
at once. Cosine shows no big difference between the dependent and
independent methods. The independent method performs slightly better when maximal
weight is assigned to one station. In agreement with previous results, it appears
that independent Kmeans gives its best results when wds has most of the weight.
However, even then it performs clearly worse than dependent Kmeans.</p>
        <p>[Table 3. Columns: Method, Weight wds, Weight mnp, Bias, RMSE, MAE. Rows: four configurations each for Monache, Normalised, Cosine and Kmeans.]</p>
        <p>Fig. 4 presents the Taylor diagram for the best case of each method. It is
possible to see that the Normalised method gives the best results. Its proximity to
the truth indicates a high correlation coefficient with it and a low Root-Mean-Square
(RMS) distance to the truth, and its similar position on the X-axis shows similar
standard deviations. In other words, the forecast obtained by the Normalised
method is close to the truth. Predictions from the Kmeans and Monache methods
are very close to one another, and also close to the truth, with a high correlation
coefficient and a low RMS distance to the truth. However, they are closer to
the origin on the X-axis, indicating that their standard deviation is lower than
that of the truth. In other words, these methods have trouble following the
variations of GST accurately. Cosine is the method that performs the worst:
its correlation coefficient with the truth is lower than for the other methods, and
its RMS distance to the truth is higher. However, it has the closest standard
deviation to the truth, meaning the cosine method is the method that follows
the variations of GST the most accurately.</p>
        <p>[Fig. 4. Taylor diagram. Legend: Monache, Normalised, Cosine, Kmeans. Axis: Correlation Coefficient.]</p>
        <p>Fig. 5 shows the forecasts obtained with the different methods jointly with
the truth values. As expected, the forecast by the Normalised method appears to
be the one closest to the truth. It is interesting to note that while the forecasts by
the Monache and Kmeans methods behave similarly, they look rather different
from one another. Cosine, as expected, is the furthest one from the truth but
displays a lot of variability.</p>
        <p>Qualitatively, the Cosine and Normalised methods give a representation of
the time series with higher fidelity, due to additional variance. Quantitatively,
however, the additional variance introduces mismatches which result in poorer
performance when compared against the Kmeans and Monache methods.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this work, meteorological data were predicted at one location based on multiple
historical datasets from weather stations. To achieve this, the Analog Ensembles
method was applied and several variants were explored by changing the metric
used to determine the analogs in the historical dataset. The prediction horizon
was two years, based on a training period of four years of historical and observed
time series.</p>
      <p>These results clearly show that the choice of stations, and how
to weight them if a weighted approach is used, has a very important bearing on
the hindcasting, and presumably forecasting, accuracy. The problem of selecting
stations for hindcasting and forecasting purposes is a non-trivial one, and from
these experiments it would appear that the best way to make a viable selection
is simply to test hindcasting on known data, to determine which stations are
most suited to forecasting and hindcasting at the target. The use of
the K-means metric reduces the quadratic error by 3% to 30%
when compared against the Monache metric. Increasing the
predictors to two stations improved the performance of the hindcast, lowering the
error by up to 16%, depending on the correlation between the predictor stations.
These features show the improvements that can be made to the existing AnEn
method.</p>
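      <p>To make the percentage figures above concrete, the following Python sketch shows one way such a relative error reduction could be computed. It assumes that "quadratic error" refers to the root-mean-square error, which is an assumption of this example. The helper names rmse and percent_reduction and all sample values are hypothetical and are not taken from the study.</p>
      <preformat>
```python
import math


def rmse(forecast, truth):
    """Root-mean-square error between two equal-length series."""
    n = len(truth)
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(forecast, truth)) / n)


def percent_reduction(baseline_error, candidate_error):
    """Error reduction of the candidate relative to the baseline, in percent."""
    return 100.0 * (baseline_error - candidate_error) / baseline_error


# Hypothetical series, for illustration only (not data from the study).
truth = [10.0, 12.0, 11.0, 13.0]
baseline = [12.0, 10.0, 13.0, 11.0]   # every error has magnitude 2.0
candidate = [11.0, 11.0, 12.0, 12.0]  # every error has magnitude 1.0

reduction = percent_reduction(rmse(baseline, truth), rmse(candidate, truth))
print(round(reduction, 1))  # 50.0
```
      </preformat>
      <p>With this definition, a reduction of 30% means that the candidate metric leaves 70% of the baseline error.</p>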
      <p>As future work, one main possibility is to explore the clustering
approach further. The results look very promising; however, a number of
parameters that must be set were not investigated. In particular, the number of
clusters was left at a basic value. It is possible to tune this value to find the
setting that gives maximal accuracy, or even to control the size of the clusters.
The K-means algorithm is also a fairly basic clustering algorithm, and more
accurate algorithms now exist. It would be interesting to compare their performance
with that of the basic Kmeans in this case.</p>
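      <p>A minimal sketch of such a scan over the number of clusters is given below in Python. It is not the implementation used in this study: it runs a plain one-dimensional Lloyd's algorithm with a deterministic initialisation on hypothetical values, and it scores each setting by the within-cluster sum of squares (inertia) as a stand-in for the hindcast error that would be the relevant criterion here. The function name kmeans_1d and the data are assumptions of this example.</p>
      <preformat>
```python
def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's algorithm on scalars; returns (centres, inertia)."""
    ordered = sorted(values)
    # Deterministic initialisation: k evenly spaced order statistics.
    if k == 1:
        centres = [ordered[len(ordered) // 2]]
    else:
        centres = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            # Assign each value to its nearest centre.
            nearest = min(range(k), key=lambda j: (v - centres[j]) ** 2)
            groups[nearest].append(v)
        # Keep the old centre if a cluster happens to be empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if updated == centres:
            break
        centres = updated
    inertia = sum(min((v - c) ** 2 for c in centres) for v in values)
    return centres, inertia


# Hypothetical observations with three visible groups (illustration only).
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 11.0, 11.2, 10.8]

# Scan the number of clusters and record the inertia for each setting.
inertias = {k: kmeans_1d(data, k)[1] for k in range(1, 4)}
for k, value in inertias.items():
    print(k, round(value, 2))
```
      </preformat>
      <p>On these sample values the inertia falls from about 152 with one cluster to about 24 with two and about 0.24 with three, matching the three groups built into the data. In a hindcasting setting the same loop could score each number of clusters on known data instead.</p>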
      <p>It is also possible to look at larger scales, in terms of the number of
variables, the number of stations, or the distance between stations. This
study focused on a rather simple testing environment, with stations located
close to each other. As a next step, it would be interesting to look at the
performance of these various approaches at larger scales.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>UNIAG, R&amp;D unit funded by the FCT - Portuguese Foundation for the
Development of Science and Technology, Ministry of Science, Technology and Higher
Education. Project no. UID/GES/4752/2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Mathworks, https://www.mathworks.com/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. National Center for Atmospheric Research, https://nar.ucar.edu/</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. National Data Buoy Center, https://www.ndbc.noaa.gov/</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Adachi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Matrix-Based Introduction to Multivariate Data Analysis</article-title>
          . Springer Singapore (
          <year>2016</year>
          ). https://doi.org/10.1007/978-981-10-2341-5
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Alessandrini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monache</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sperati</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cervone</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>An analog ensemble for short-term probabilistic solar power forecast</article-title>
          .
          <source>Applied Energy</source>
          <volume>157</volume>
          ,
          <fpage>95</fpage>
          -
          <lpage>110</lpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1016/j.apenergy.2015.08.011
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Alessandrini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monache</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sperati</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nissen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A novel application of an analog ensemble for short-term wind power forecasting</article-title>
          .
          <source>Renewable Energy</source>
          <volume>76</volume>
          ,
          <fpage>768</fpage>
          -
          <lpage>781</lpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1016/j.renene.2014.11.061
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Alessandrini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monache</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rozoff</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>W.E.</given-names>
          </string-name>
          :
          <article-title>Probabilistic prediction of tropical cyclone intensity with an analog ensemble</article-title>
          .
          <source>Monthly Weather Review</source>
          <volume>146</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1723</fpage>
          -
          <lpage>1744</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1175/mwr-d-17-0314.1
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Draxler</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          :
          <article-title>Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature</article-title>
          .
          <source>Geoscientific Model Development</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          (
          <year>2014</year>
          ). https://doi.org/10.5194/gmd-7-1247-2014
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>van den Dool</surname>
            ,
            <given-names>H.M.</given-names>
          </string-name>
          :
          <article-title>A new look at weather forecasting through analogues</article-title>
          .
          <source>Monthly Weather Review</source>
          <volume>117</volume>
          (
          <issue>10</issue>
          ),
          <fpage>2230</fpage>
          -
          <lpage>2247</lpage>
          (
          <year>1989</year>
          ). https://doi.org/10.1175/1520-0493(1989)117&lt;2230:anlawf&gt;2.0.co;2
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Dool</surname>
            ,
            <given-names>H.V.D.</given-names>
          </string-name>
          :
          <article-title>Searching for analogues, how long must we wait?</article-title>
          <source>Tellus A: Dynamic Meteorology and Oceanography</source>
          <volume>46</volume>
          (
          <issue>3</issue>
          ),
          <fpage>314</fpage>
          -
          <lpage>324</lpage>
          (
          <year>1994</year>
          ). https://doi.org/10.3402/tellusa.v46i3.15481
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Gutiérrez</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cofiño</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cano</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Clustering methods for statistical downscaling in short-range weather forecasts</article-title>
          .
          <source>Monthly Weather Review</source>
          <volume>132</volume>
          (
          <issue>9</issue>
          ),
          <fpage>2169</fpage>
          -
          <lpage>2183</lpage>
          (
          <year>2004</year>
          ). https://doi.org/10.1175/1520-0493(2004)132&lt;2169:cmfsdi&gt;2.0.co;2
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lorenz</surname>
            ,
            <given-names>E.N.</given-names>
          </string-name>
          :
          <article-title>Atmospheric predictability as revealed by naturally occurring analogues</article-title>
          .
          <source>Journal of the Atmospheric Sciences</source>
          <volume>26</volume>
          (
          <issue>4</issue>
          ),
          <fpage>636</fpage>
          -
          <lpage>646</lpage>
          (
          <year>1969</year>
          ). https://doi.org/10.1175/1520-0469(1969)26&lt;636:aparbn&gt;2.0.co;2
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Monache</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eckel</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rife</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nagarajan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Searight</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Probabilistic weather prediction with an analog ensemble</article-title>
          .
          <source>Monthly Weather Review</source>
          <volume>141</volume>
          (
          <issue>10</issue>
          ),
          <fpage>3498</fpage>
          -
          <lpage>3516</lpage>
          (
          <year>2013</year>
          ). https://doi.org/10.1175/mwr-d-12-00281.1
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Monache</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nipen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stull</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Kalman filter and analog schemes to postprocess numerical weather predictions</article-title>
          .
          <source>Monthly Weather Review</source>
          <volume>139</volume>
          (
          <issue>11</issue>
          ),
          <fpage>3554</fpage>
          -
          <lpage>3570</lpage>
          (
          <year>2011</year>
          ). https://doi.org/10.1175/2011mwr3653.1
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          :
          <article-title>Summarizing multiple aspects of model performance in a single diagram</article-title>
          .
          <source>Journal of Geophysical Research: Atmospheres</source>
          <volume>106</volume>
          (
          <issue>D7</issue>
          ),
          <fpage>7183</fpage>
          -
          <lpage>7192</lpage>
          (
          <year>2001</year>
          ). https://doi.org/10.1029/2000jd900719
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Vanvyve</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monache</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monaghan</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>J.O.</given-names>
          </string-name>
          :
          <article-title>Wind resource estimates with an analog ensemble approach</article-title>
          .
          <source>Renewable Energy</source>
          <volume>74</volume>
          ,
          <fpage>761</fpage>
          -
          <lpage>773</lpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1016/j.renene.2014.08.060
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>