Modeling of forecasts variance reduction at multiple time series prediction averaging with ARMA(1, q) functions

Danila Musatov 1 and Denis Petrusevich 1

1 MIREA – Russian Technological University, Prospekt Vernadskogo, 78, Moscow, 119454, Russia

Abstract
Combining time series forecasts is usually considered good practice, but it has little theoretical justification. In this research the variance of individual time series forecasts and the variance of combined models are considered. We are interested in the form of the forecast variance as a function of the model parameters over one, two and three periods, and in the conditions under which averaging of time series predictions leads to an improvement. The paper examines several of the most popular time series models: the moving average models MA(q), the autoregression models AR(p) and their combinations in the form of ARIMA(p, d, q) or ARMA(p, q) models; in particular, AR(1) and ARMA(1, q) are investigated. There is an active line of research on averaging of time series models. Approaches based on bagging and boosting are widely used in classification and regression, and it is appealing to apply such strategies to time series modeling. However, while it is easy to construct training and test sets in classification tasks, this is a complex task for time series: two sets are needed, one to train the individual models and one to construct their combinations. Thus, combinations of time series models, of their forecasts, or of their prediction intervals are the subject of only a few detailed studies. In this paper we investigate the behaviour of the variance of time series predictions in order to provide another useful approach to forecast averaging. Russian macroeconomic time series statistics are used in the experiments.

Keywords
Time series forecasting, prediction averaging, ARIMA, forecast variance, information criteria

1. Introduction

Mathematical models used to predict a certain value are often used in combinations. The simplest combination in time series processing is averaging of all models' predictions or selection of the best model [1]. It is widely believed that if the averaged models are "good" enough and each reflects some part of the described process' behaviour, their average is also a "good" model. However, the mathematical statement of this problem has not been studied well. There are attempts to choose the best model [1], to construct a mean model [2], and to apply the bagging strategy to models of a processed time series [3, 4]. Selection of the best model is the traditional way, but in practice there are always many candidate models and no mathematically strict way to choose the best one. Researchers have therefore argued that multiple models can describe different aspects of the processed time series' behaviour, so that their combination is better than selection of a single best model [5-7]. Bagging [8] of time series is appealing but complex. Usually time series models are built in two stages: a training set is used to construct them and a test set to evaluate their quality. In the case of bagging, however, a third part is needed, or the training set should be subdivided so that some parts are used to construct the models and others to evaluate their combinations [9-11].
In this research the combination of forecasts is investigated from the prediction intervals point of view. The behaviour of prediction intervals is often out of scope in work on ensemble models, but this question is especially important in time series forecasting [11, 12]. In this paper ARIMA models are considered. Although some steps towards combining GARCH models (for heteroscedastic time series) have already been made in [7], combinations of heterogeneous models remain future work. ARIMA models are the most popular, but frequency analysis is sometimes also a good tool [13, 14], so further analysis should consider combinations of such models as well. Here we derive the variance of time series forecasts over 1, 2 and 3 periods and investigate its behaviour under model averaging.

2. Time series prediction intervals

In order to evaluate a prediction interval one has to transform the time series into MA($\infty$) form. According to Wold's theorem [15] this transformation (also called the psi-representation) can be applied to any stationary time series. When ARIMA models are used, stationarity can be achieved by differencing the time series [16]. Each process in this representation is expressed as an (in general infinite) sum of moving average terms. The coefficients $\psi_j$ of this series are usually called psi-weights.

The simplest case is, of course, the moving average MA(q) model:
$$X_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \ldots + \theta_q \varepsilon_{t-q}.$$
Here the current value of the time series $X$ is expressed through white noise terms $\varepsilon_{t-j}$ of lag less than or equal to $q$ [16]. These models are already written in the form where all coefficients are equal to psi-weights. Thus, the variance of their predictions [16] is (1):
$$\mathrm{Var}(\hat{x}_n - x_n) = \sigma^2 \sum_{j=0}^{n-1} \psi_j^2, \qquad (1)$$
where $\hat{x}_n$ is the predicted value at time $n$, $x_n$ is the real value of the time series, $\sigma$ is the standard error (obtained during fitting) and $\psi_j = \theta_j$ for an MA(q) series. If the assumption of normally distributed errors holds, a 95% prediction interval for $\hat{x}_n$ is
$$\left[\hat{x}_n - 1.96\sqrt{\mathrm{Var}(\hat{x}_n - x_n)},\; \hat{x}_n + 1.96\sqrt{\mathrm{Var}(\hat{x}_n - x_n)}\right].$$
Here the variance is computed from the psi-weights by means of (1).

Another significant part of time series models consists of the autoregression AR(p) models. In this case the current value of the time series $X$ depends on its own past values of lag less than or equal to $p$:
$$X_t = c + \varphi_1 X_{t-1} + \ldots + \varphi_p X_{t-p} + \varepsilon_t.$$
Application of Wold's theorem to autoregression processes AR(p) leads to (2):
$$\psi_0 = 1,\quad \psi_1 = \varphi_1,\quad \psi_2 = \varphi_1 \psi_1 + \varphi_2,\quad \psi_3 = \varphi_1 \psi_2 + \varphi_2 \psi_1 + \varphi_3, \ldots \qquad (2)$$
These expressions are recurrent and can be rewritten as a summation (3):
$$\psi_k = \sum_{i=1}^{k} \varphi_i \psi_{k-i}. \qquad (3)$$
In the general case of ARIMA(p, d, q) (or ARMA(p, q)) models the psi-weights are given by (4):
$$\psi_0 = 1,\quad \psi_1 = \varphi_1 + \theta_1,\quad \psi_2 = \varphi_1 \psi_1 + \varphi_2 + \theta_2,\quad \psi_3 = \varphi_1 \psi_2 + \varphi_2 \psi_1 + \varphi_3 + \theta_3, \ldots \qquad (4)$$
A thorough analysis of these expressions can be found in [16, 17]. In this research we consider AR(p) and ARMA(p, q) models with p<3.
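As a practical illustration of (1)-(4) (this sketch is not part of the original paper, and the parameter values are purely illustrative), the psi-weight recursion, the forecast variance and the 95% prediction interval can be computed as follows, assuming numpy is available:

```python
# A minimal sketch (not from the paper) illustrating equations (1)-(4):
# psi-weights of an ARMA(p, q) model, the forecast variance and the
# 95% prediction interval at horizon n.
import numpy as np

def psi_weights(phi, theta, n):
    """Return psi_0..psi_{n-1} for ARMA(p, q) with AR coefficients `phi`
    and MA coefficients `theta`, using the recursion (2)-(4)."""
    phi = np.asarray(phi, dtype=float)
    theta = np.asarray(theta, dtype=float)
    psi = np.zeros(n)
    psi[0] = 1.0
    for j in range(1, n):
        # theta_j contributes only while j <= q
        acc = theta[j - 1] if j - 1 < len(theta) else 0.0
        # recursive AR part: sum_i phi_i * psi_{j-i}
        for i in range(1, min(j, len(phi)) + 1):
            acc += phi[i - 1] * psi[j - i]
        psi[j] = acc
    return psi

def forecast_variance(phi, theta, sigma2, n):
    """Equation (1): Var(x_hat_n - x_n) = sigma^2 * sum_{j=0}^{n-1} psi_j^2."""
    psi = psi_weights(phi, theta, n)
    return sigma2 * np.sum(psi ** 2)

# Illustrative ARMA(1, 1) model: phi_1 = 0.5, theta_1 = 0.3, sigma^2 = 1, horizon 3
var3 = forecast_variance([0.5], [0.3], 1.0, 3)
point_forecast = 0.0                        # hypothetical point forecast
half_width = 1.96 * np.sqrt(var3)           # 95% interval assuming normal errors
print(var3, (point_forecast - half_width, point_forecast + half_width))
```

For an ARMA(1, 1) model with $\varphi_1 = 0.5$ and $\theta_1 = 0.3$ this reproduces the three-step variance $\sigma^2(1 + (\varphi_1 + \theta_1)^2 + \varphi_1^2(\varphi_1 + \theta_1)^2) = 1.8\sigma^2$, which appears later in (13).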
The main goal is to express the forecast variance through the model parameters and to explain the conditions under which the averaging technique improves it. Here only averaging of the models is taken into account, but the same approach can be used for bagging and for non-linear combinations of models. In the calculations below and at the averaging stage of the experiments all models are supposed to share the same seasonal component, because this part is non-linear and summing different seasonal structures would lead to models with another, more complex seasonality.

2.1. Variance of the prediction made with AR(1) model

First of all, AR(p) models are analyzed. In the simplest case of the AR(1) model the psi-weights (2) form a geometric progression:
$$\psi_0 = 1,\quad \psi_1 = \varphi_1,\quad \psi_2 = \varphi_1 \psi_1 = \varphi_1^2,\quad \ldots,\quad \psi_k = \varphi_1 \psi_{k-1} = \varphi_1^k. \qquad (5)$$
The variance of its prediction at time $n$ is the sum of a geometric progression with ratio $\varphi_1^2$:
$$\mathrm{Var}_{AR(1)}(\hat{x}_n - x_n) = \sigma^2 \sum_{j=0}^{n-1} \psi_j^2 = \sigma^2 \Bigl(1 + \sum_{j=1}^{n-1} \varphi_1^{2j}\Bigr) = \sigma^2 \frac{1 - \varphi_1^{2n}}{1 - \varphi_1^2}. \qquad (6)$$
The sum (6) remains finite if $|\varphi_1| < 1$ even as $n \to \infty$. Thus, for an AR(1) process the variance of the forecast over an infinite horizon can be a finite number; intuitively this corresponds to fluctuations with decreasing amplitude (7):
$$\lim_{n \to \infty} \mathrm{Var}_{AR(1)}(\hat{x}_n - x_n) = \frac{\sigma^2}{1 - \varphi_1^2}. \qquad (7)$$
The variances of predictions over 1, 2 and 3 timesteps are given in (8):
$$\mathrm{Var}_{AR(1)}(\hat{x}_1 - x_1) = \sigma^2,\qquad
\mathrm{Var}_{AR(1)}(\hat{x}_2 - x_2) = \sigma^2 (1 + \varphi_1^2),\qquad
\mathrm{Var}_{AR(1)}(\hat{x}_3 - x_3) = \sigma^2 (1 + \varphi_1^2 + \varphi_1^4). \qquad (8)$$
The variance of the one-step-ahead prediction depends only on the standard error of the model under investigation. Plots of these functions for predictions over 2 and 3 timesteps are presented in Figure 1 and Figure 2. They are convex (bulging downward): all powers are even and all signs in (8) are positive. There is only one minimum (the global one), reached at $\varphi_1 = 0$, where the prediction is also zero. If there are several models, their prediction variances correspond to different points on the plots in Figures 1 and 2.

If the predictions of several AR(1) models are averaged, this can be described as averaging of the models themselves (because the ARMA(p, q) model is linear). Thus, the averaged model can be considered as a new ARMA(p, q) model with the same orders p, q but different coefficients. The variance of the combined model is lower than the variance of the source models if the averaged coefficient moves towards the minimum in Figures 1 and 2. If two models have coefficients on the same side of zero, the variance of their combination lies between their variances; so, in the case of two models, one of them has higher variance and the other lower variance than the combined model. At the same time, the combined model has a lower variance than both source models if their coefficients lie on different sides of zero. Essentially the same reasoning holds for an arbitrary number of models. The variance function (8) is convex and has only one minimum, so if all models are marked as points on the plots in Figures 1 and 2, the variance of the averaged model is the same function of the averaged parameter. Because the variance function is convex, the point marking the variance of the averaged model lies below the line connecting the models with the extreme values of the parameter $\varphi_1$ (minimum and maximum):
$$\min_i \mathrm{Var}_{M_i} \;\le\; \mathrm{Var}_{\frac{1}{N}\sum_{i} M_i} \;\le\; \max_i \mathrm{Var}_{M_i}, \qquad (9)$$
where $M_i$, $i = 1, \ldots, N$, denotes the $N$ averaged models of AR(1) type.
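A short numerical illustration of (8) and (9) (not from the paper; the AR(1) coefficients are hypothetical) compares the three-step forecast variances of individual AR(1) models with the variance of the model obtained by averaging their coefficients, in both the same-sign and the mixed-sign situations discussed above:

```python
# A minimal sketch (not from the paper) illustrating expressions (8) and (9):
# forecast variances of individual AR(1) models and of the averaged model
# (sigma^2 is taken equal for all models).
import numpy as np

def ar1_variance(phi1, sigma2, n):
    """Expression (6)/(8): Var(x_hat_n - x_n) = sigma^2 * sum_{j=0}^{n-1} phi1^(2j)."""
    return sigma2 * sum(phi1 ** (2 * j) for j in range(n))

sigma2, horizon = 1.0, 3

# Case 1: all coefficients on the same side of zero -> the averaged model's
# variance lies between the best and the worst individual variances.
same_sign = [0.2, 0.5, 0.7]
ind1 = [ar1_variance(p, sigma2, horizon) for p in same_sign]
avg1 = ar1_variance(np.mean(same_sign), sigma2, horizon)
assert min(ind1) <= avg1 <= max(ind1)

# Case 2: coefficients on both sides of zero -> the averaged coefficient moves
# towards the minimum at phi_1 = 0, so the averaged model can beat all of them.
mixed_sign = [-0.6, 0.2, 0.5]
ind2 = [ar1_variance(p, sigma2, horizon) for p in mixed_sign]
avg2 = ar1_variance(np.mean(mixed_sign), sigma2, horizon)
print(ind1, avg1)
print(ind2, avg2)    # here avg2 is smaller than min(ind2)
```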
Figure 1: Forecast variance of AR(1) model at 2 timesteps ahead

Figure 2: Forecast variance of AR(1) model at 3 timesteps ahead

However, this combination yields the best model (the one with the lowest variance) only if equal numbers of models lie to the left and to the right of zero. This situation is shown in Figure 3. Thus, averaging leads to the best model (or at least to a "good" one, since there are many models) only if the models split into two equal groups, with negative and positive values of $\varphi_1$. Essentially this means that half of the models predict that the value of the time series is going to decrease while the others predict growth. Such a situation can occur if the investigated time series has complex behaviour or the researchers do not have enough data to make a prediction. In the "usual" situation there is always some model better (with lower variance) than the averaged one. If there is no tool to choose the best model, the averaged one is still better than the worst models, so it can be used as another prediction tool of "good" quality.

Figure 3: Variance of handled AR(1) models (shown with transparent dots) and variance of averaged model (black dot)

2.2. Variance of the prediction made with ARMA(1, q) model

Psi-weights of ARMA(1, q) models form a geometric progression starting from lag q:
$$\psi_0 = 1,\quad \psi_1 = \varphi_1 + \theta_1,\quad \psi_2 = \varphi_1 \psi_1 + \theta_2 = \varphi_1(\varphi_1 + \theta_1) + \theta_2,\quad \ldots,\quad \psi_{q+1} = \varphi_1 \psi_q. \qquad (10)$$
Thus, if the order of the moving average part is not smaller than the forecast horizon minus one ($q \ge n - 1$), no geometric progression appears in the variance. For example, the variance of the ARMA(1, 2) forecast over 3 timesteps is
$$\mathrm{Var}_{ARMA(1,2)}(\hat{x}_3 - x_3) = \sigma^2 \sum_{j=0}^{2} \psi_j^2 = \sigma^2 \left[ 1 + (\varphi_1 + \theta_1)^2 + (\varphi_1(\varphi_1 + \theta_1) + \theta_2)^2 \right]. \qquad (11)$$
For longer horizons ($n > q + 1$) the variance includes the sum of a geometric progression starting from the q-th psi-weight:
$$\mathrm{Var}_{ARMA(1,q)}(\hat{x}_n - x_n) = \sigma^2 \sum_{j=0}^{n-1} \psi_j^2 = \sigma^2 \left[ 1 + (\varphi_1 + \theta_1)^2 + (\varphi_1(\varphi_1 + \theta_1) + \theta_2)^2 + \ldots + \psi_q^2 \,\frac{1 - \varphi_1^{2(n-q)}}{1 - \varphi_1^2} \right]. \qquad (12)$$
Here the last term is the sum of the progression starting at the q-th psi-weight.

2.3. Variance of the prediction made with ARMA(1, 1) model

Variances of the ARMA(1, 1) predictions over 1, 2 and 3 timesteps are shown in (13):
$$\mathrm{Var}_{ARMA(1,1)}(\hat{x}_1 - x_1) = \sigma^2,\qquad
\mathrm{Var}_{ARMA(1,1)}(\hat{x}_2 - x_2) = \sigma^2 \left[ 1 + (\varphi_1 + \theta_1)^2 \right],\qquad
\mathrm{Var}_{ARMA(1,1)}(\hat{x}_3 - x_3) = \sigma^2 \left[ 1 + (\varphi_1 + \theta_1)^2 + \varphi_1^2 (\varphi_1 + \theta_1)^2 \right]. \qquad (13)$$
First, consider the variance of the prediction over 2 timesteps. Its extreme values are located where the first derivatives vanish:
$$\frac{\partial \mathrm{Var}_{ARMA(1,1)}(\hat{x}_2 - x_2)}{\partial \varphi_1} = \frac{\partial \mathrm{Var}_{ARMA(1,1)}(\hat{x}_2 - x_2)}{\partial \theta_1} = 2\sigma^2 (\varphi_1 + \theta_1) = 0. \qquad (14)$$
This happens if $\theta_1 = -\varphi_1$; the variance of the predictions of this model ($X_t = c + \varphi_1 X_{t-1} + \varepsilon_t - \varphi_1 \varepsilon_{t-1}$) is then minimal (this can be seen in Figure 4). To check whether this is a minimum, one uses the second derivatives:
$$\frac{\partial^2 \mathrm{Var}_{ARMA(1,1)}(\hat{x}_2 - x_2)}{\partial \varphi_1^2} = \frac{\partial^2 \mathrm{Var}_{ARMA(1,1)}(\hat{x}_2 - x_2)}{\partial \theta_1^2} = \frac{\partial^2 \mathrm{Var}_{ARMA(1,1)}(\hat{x}_2 - x_2)}{\partial \varphi_1 \partial \theta_1} = 2\sigma^2. \qquad (15)$$
The determinant of the Hessian of the variance is zero:
$$H_{ARMA(1,1)}(\hat{x}_2 - x_2) = \begin{vmatrix} 2\sigma^2 & 2\sigma^2 \\ 2\sigma^2 & 2\sigma^2 \end{vmatrix} = 0, \qquad \Delta_1 = 2\sigma^2,\; \Delta_2 = 0. \qquad (16)$$
One cannot draw a definite conclusion by means of Sylvester's criterion: the function does not have a single isolated minimum. A glance at Figure 4 shows that the minimum is achieved by models with $\theta_1 = -\varphi_1$. At the same time this function is also convex, so the results obtained for AR(1) remain true in this case.
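Before turning to the three-step case, expression (12) can be checked numerically (this sketch is not from the paper, and the parameter values are illustrative): for an ARMA(1, 2) model and a horizon beyond q + 1, the direct psi-weight sum coincides with the head terms plus the geometric tail.

```python
# A minimal sketch (not from the paper) checking expression (12) numerically:
# for ARMA(1, 2) and a horizon n > q + 1 the forecast variance equals the first
# psi-weights squared plus a geometric tail with ratio phi_1^2.
import numpy as np

phi1, theta1, theta2, sigma2, n = 0.5, 0.3, -0.2, 1.0, 6   # hypothetical values

# Direct summation of psi-weights (recursion (10)).
psi = [1.0, phi1 + theta1, phi1 * (phi1 + theta1) + theta2]
while len(psi) < n:
    psi.append(phi1 * psi[-1])                  # psi_{k+1} = phi_1 * psi_k for k >= q
direct = sigma2 * sum(p ** 2 for p in psi[:n])

# Closed form (12): fixed head plus the sum of the geometric progression
# starting at the q-th psi-weight (here q = 2).
head = 1.0 + (phi1 + theta1) ** 2
tail = (phi1 * (phi1 + theta1) + theta2) ** 2 * (1 - phi1 ** (2 * (n - 2))) / (1 - phi1 ** 2)
closed = sigma2 * (head + tail)

print(direct, closed)        # the two values coincide
assert np.isclose(direct, closed)
```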
Using the same approach, we now consider the variance of the prediction over three timesteps. Extreme values of the variance are obtained by setting the first derivatives to zero:
$$\frac{\partial \mathrm{Var}_{ARMA(1,1)}(\hat{x}_3 - x_3)}{\partial \varphi_1} = 2\sigma^2 (\varphi_1 + \theta_1)(2\varphi_1^2 + \varphi_1 \theta_1 + 1) = 0,\qquad
\frac{\partial \mathrm{Var}_{ARMA(1,1)}(\hat{x}_3 - x_3)}{\partial \theta_1} = 2\sigma^2 (\varphi_1 + \theta_1)(\varphi_1^2 + 1) = 0. \qquad (17)$$
Here one can see the same result as for the prediction over two timesteps: the second equation in (17) has the solution $\theta_1 = -\varphi_1$. The second derivatives of the three-step prediction variance of the ARMA(1, 1) model are:
$$\frac{\partial^2 \mathrm{Var}}{\partial \varphi_1^2} = 2\sigma^2 \left( 6\varphi_1(\varphi_1 + \theta_1) + \theta_1^2 + 1 \right),\qquad
\frac{\partial^2 \mathrm{Var}}{\partial \theta_1^2} = 2\sigma^2 (\varphi_1^2 + 1),\qquad
\frac{\partial^2 \mathrm{Var}}{\partial \varphi_1 \partial \theta_1} = 2\sigma^2 (3\varphi_1^2 + 2\varphi_1 \theta_1 + 1). \qquad (18)$$
If $\theta_1 = -\varphi_1$, all derivatives in (18) are equal to $2\sigma^2(\varphi_1^2 + 1)$. The Hessian has the form:
$$H_{ARMA(1,1)}(\hat{x}_3 - x_3) = \begin{vmatrix} 2\sigma^2 (6\varphi_1(\varphi_1 + \theta_1) + \theta_1^2 + 1) & 2\sigma^2 (3\varphi_1^2 + 2\varphi_1 \theta_1 + 1) \\ 2\sigma^2 (3\varphi_1^2 + 2\varphi_1 \theta_1 + 1) & 2\sigma^2 (\varphi_1^2 + 1) \end{vmatrix}. \qquad (19)$$
The determinant $\Delta_2$ and the minor $\Delta_1$ have a complicated form, but in the extreme case $\theta_1 = -\varphi_1$ they reduce to $\Delta_1 = 2\sigma^2(\varphi_1^2 + 1) > 0$, $\Delta_2 = 0$. So one cannot formally confirm that $\theta_1 = -\varphi_1$ leads to the model with minimal variance, but it is clearly seen in Figure 5.

Figure 4: Forecast variance of ARMA(1, 1) model at 2 timesteps ahead

Figure 5: Forecast variance of ARMA(1, 1) model at 3 timesteps ahead

2.4. Variance of the prediction made with ARMA(1, 2) model

Variances of the ARMA(1, 2) predictions over 1, 2 and 3 timesteps are shown in (20):
$$\mathrm{Var}_{ARMA(1,2)}(\hat{x}_1 - x_1) = \sigma^2,\qquad
\mathrm{Var}_{ARMA(1,2)}(\hat{x}_2 - x_2) = \sigma^2 \left[ 1 + (\varphi_1 + \theta_1)^2 \right],\qquad
\mathrm{Var}_{ARMA(1,2)}(\hat{x}_3 - x_3) = \sigma^2 \left[ 1 + (\varphi_1 + \theta_1)^2 + (\varphi_1(\varphi_1 + \theta_1) + \theta_2)^2 \right]. \qquad (20)$$
The variances of the forecasts over 1 and 2 timesteps are the same as for the ARMA(1, 1) model (13), so all calculations and conclusions made there remain true. The same holds for ARMA(1, q) models with q > 2: for predictions over 1, 2 and 3 timesteps the variance functions coincide with (20), and differences are expected only for forecasts over more timesteps. For the prediction over 3 timesteps the extreme point is located where
$$\frac{\partial \mathrm{Var}_{ARMA(1,2)}(\hat{x}_3 - x_3)}{\partial \varphi_1} = 2\sigma^2 \left[ \varphi_1 + \theta_1 + (\varphi_1(\varphi_1 + \theta_1) + \theta_2)(2\varphi_1 + \theta_1) \right] = 0,$$
$$\frac{\partial \mathrm{Var}_{ARMA(1,2)}(\hat{x}_3 - x_3)}{\partial \theta_1} = 2\sigma^2 \left[ \varphi_1 + \theta_1 + \varphi_1(\varphi_1(\varphi_1 + \theta_1) + \theta_2) \right] = 0, \qquad (21)$$
$$\frac{\partial \mathrm{Var}_{ARMA(1,2)}(\hat{x}_3 - x_3)}{\partial \theta_2} = 2\sigma^2 \left[ \varphi_1(\varphi_1 + \theta_1) + \theta_2 \right] = 0.$$
Substituting the last equation into the first one in system (21) gives $\theta_1 = -\varphi_1$, and then the last equation gives $\theta_2 = 0$. So the structure of the model with minimal prediction variance is the same as in the ARMA(1, 1) case and in the two-timestep case of ARMA(1, 2). The determinant of the Hessian becomes very cumbersome, so only the second derivatives are shown in (22):
$$\frac{\partial^2 \mathrm{Var}}{\partial \varphi_1^2} = 2\sigma^2 \left[ 1 + (2\varphi_1 + \theta_1)^2 + 2(\varphi_1(\varphi_1 + \theta_1) + \theta_2) \right],\qquad
\frac{\partial^2 \mathrm{Var}}{\partial \theta_1^2} = 2\sigma^2 (\varphi_1^2 + 1),\qquad
\frac{\partial^2 \mathrm{Var}}{\partial \theta_2^2} = 2\sigma^2,$$
$$\frac{\partial^2 \mathrm{Var}}{\partial \varphi_1 \partial \theta_1} = 2\sigma^2 \left[ 1 + \varphi_1(2\varphi_1 + \theta_1) + \varphi_1(\varphi_1 + \theta_1) + \theta_2 \right],\qquad
\frac{\partial^2 \mathrm{Var}}{\partial \varphi_1 \partial \theta_2} = 2\sigma^2 (2\varphi_1 + \theta_1),\qquad
\frac{\partial^2 \mathrm{Var}}{\partial \theta_1 \partial \theta_2} = 2\sigma^2 \varphi_1. \qquad (22)$$
After simplification of the Hessian (adding rows and columns with appropriate multipliers) its determinant takes the form (23):
$$\det H_{ARMA(1,2)}(\hat{x}_3 - x_3) = -8\sigma^6 \left( \varphi_1(\varphi_1 + \theta_1) + \theta_2 \right)^2. \qquad (23)$$
The squared factor in (23) is positive except at the solutions of equations (21); at the extreme points the determinant is zero. This result is close to the previous cases: there is a hyperplane on which the variance is minimal, $\theta_1 = -\varphi_1$, $\theta_2 = 0$.
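The stationary-point calculations (14), (17) and (21) are easy to verify with a computer algebra system. The following sketch (not part of the paper, assuming sympy is available) reproduces the solutions $\theta_1 = -\varphi_1$ and $\theta_2 = 0$ symbolically:

```python
# A minimal sketch (not from the paper) reproducing the calculations behind
# (14), (17) and (21): the gradients of the three-step forecast variances of
# ARMA(1, 1) and ARMA(1, 2) vanish at theta_1 = -phi_1 (and theta_2 = 0).
# sigma^2 is dropped since it only rescales the functions.
import sympy as sp

phi1, th1, th2 = sp.symbols('phi1 theta1 theta2', real=True)

# ARMA(1, 1), expression (13): Var_3 / sigma^2
var3_11 = 1 + (phi1 + th1) ** 2 + phi1 ** 2 * (phi1 + th1) ** 2
sol_11 = sp.solve(sp.diff(var3_11, th1), th1)                     # -> [-phi1]
check_11 = sp.simplify(sp.diff(var3_11, phi1).subs(th1, sol_11[0]))   # -> 0
print(sol_11, check_11)

# ARMA(1, 2), expression (20): Var_3 / sigma^2
var3_12 = 1 + (phi1 + th1) ** 2 + (phi1 * (phi1 + th1) + th2) ** 2
sol_12 = sp.solve([sp.diff(var3_12, th1), sp.diff(var3_12, th2)],
                  [th1, th2], dict=True)                          # -> theta1 = -phi1, theta2 = 0
check_12 = sp.simplify(sp.diff(var3_12, phi1).subs(sol_12[0]))    # -> 0
print(sol_12, check_12)
```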
For models ARMA(1, q), q > 2, the variance of predictions over 1, 2 and 3 timesteps has the same form as in the cases presented above; differences appear only in predictions over longer horizons.

3. Experiments

The dynamic series of macroeconomic statistics of the Russian Federation (the monthly wage index and the monthly income index) [18] have been handled in the experimental part. The last 12 values of each series were used as the test set, while the preceding observations (200 or 300 values, depending on the experiment) were used to train the models. In the previous part only ARIMA(1, d, q) models were handled, so here they are used as components of the combined model. At the same time the best model by the value of the information criteria [16, 17] is also highlighted. To compare the forecasts, the RMSE and MAE error metrics are used:
$$RMSE = \sqrt{\frac{\sum_t (\tau(t) - ts(t))^2}{N}}, \qquad MAE = \frac{\sum_t |\tau(t) - ts(t)|}{N}. \qquad (24)$$
Here $N$ is the length of the test period, $\tau(t)$ denotes the predicted values of the processed time series and $ts(t)$ is the part of the investigated time series in the test period (real data).

Two experiments have been performed with the wage index. In the first one, 200 values were used to train the models and 12 values to test them. The order of the moving average part was limited to 3 and the order of the autoregression part was set to 1. The results are shown in Table 1. The best model by the value of the information criterion is ARIMA(0, 1, 1).

Table 1
The ARIMA(1, d, q), q < 4, models of the wage index

ARIMA(p, d, q) model    Akaike information criterion    RMSE of forecast    MAE of forecast
ARIMA(1, 1, 0)          1518.43                         18.68               12.41
ARIMA(1, 1, 1)          1496.72                         17.94               11.62
ARIMA(1, 1, 2)          1498.41                         17.91               11.60
ARIMA(1, 1, 3)          1495.42                         17.75               11.46
ARIMA(0, 1, 1)          1494.6                          15.14               9.21
Combined model          1495.73                         17.75               11.46

Among the four tested ARIMA(1, 1, q) models the signs of the terms are mixed: the $\varphi_1$ term has one sign in one model and the opposite sign in the other three; the $\theta_1$ term has one sign in one model and the opposite sign in two (for ARIMA(1, 1, 0) it is zero); one $\theta_2$ term is positive and one is negative. So, theoretically, averaging is expected to achieve "good" results. One can see that the combined model is very close to the ARIMA(1, 1, 3) model and is better than the other first-order models both by the value of the Akaike information criterion and by the quality of the forecast on the test period; in particular, it clearly exceeds the results of the worst model, ARIMA(1, 1, 0). The best model has different orders and was not used in the combination, but its results are close to those of the combined model and of the ARIMA(1, 1, 3) one.

In the second experiment the models ARIMA(1, d, q), q < 6, were used. The length of the training set was 300 values and the length of the test set was again 12. The results are shown in Table 2.

Table 2
The ARIMA(1, d, q), q < 6, models of the wage index

ARIMA(p, d, q) model    Akaike information criterion    RMSE of forecast    MAE of forecast
ARIMA(1, 1, 0)          942.44                          37.59               36.50
ARIMA(1, 1, 1)          911.05                          25.79               14.54
ARIMA(1, 1, 2)          901.17                          22.11               14.20
ARIMA(1, 1, 3)          901.51                          22.24               14.16
ARIMA(1, 1, 4)          903.51                          22.23               14.14
ARIMA(1, 1, 5)          905.48                          22.47               14.13
Combined model          906.7                           22.47               14.13

Here one can see that the combined model is better than the worst models but worse than the best ones. In this case almost all terms of the various models have the same signs, which is why averaging does not give results close to the best ones. Nevertheless, it is far better than the worst models used in the combination.
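The experimental procedure described above can be sketched in a few lines (this is not the authors' code: it assumes the statsmodels package, uses a synthetic placeholder series instead of the real wage index from [18], and takes an illustrative 300/12 train/test split):

```python
# A minimal sketch (not the authors' code) of the experimental procedure:
# fit ARIMA(1, 1, q) models, average their forecasts, compare AIC, RMSE and MAE.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def rmse(pred, actual):
    return float(np.sqrt(np.mean((pred - actual) ** 2)))      # expression (24)

def mae(pred, actual):
    return float(np.mean(np.abs(pred - actual)))

# Placeholder data: replace with the real wage index series from [18].
wage_index = pd.Series(100 + np.cumsum(np.random.normal(0.5, 2.0, 312)))

# Hypothetical split: first 300 observations for training, last 12 for the test.
train, test = wage_index[:300], wage_index[300:312]

forecasts, aics = {}, {}
for q in range(0, 6):                                          # ARIMA(1, 1, q), q < 6
    fit = ARIMA(train, order=(1, 1, q)).fit()
    forecasts[q] = fit.forecast(steps=len(test))
    aics[q] = fit.aic

combined = pd.concat(list(forecasts.values()), axis=1).mean(axis=1)   # averaged forecast
for q, f in forecasts.items():
    print(q, round(aics[q], 2), round(rmse(f, test), 2), round(mae(f, test), 2))
print('combined', round(rmse(combined, test), 2), round(mae(combined, test), 2))
```

The combined forecast here is a simple mean of the individual ARIMA(1, 1, q) forecasts, which corresponds to the averaging scheme analysed in Section 2.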
The same experiment as presented in Table 1 has been performed for the income index [18]. Among the four tested models with orders q < 4 the signs of the terms are distributed as follows: the $\varphi_1$ term has one sign in two models and the opposite sign in the other two; the $\theta_1$ term has the same sign in three models (for ARIMA(1, 1, 0) it is zero); the $\theta_2$ term has the same sign in both models with a non-zero MA(2) part. Here we expect averaging to be good practice because the signs of the AR(1) part vary. The results are shown in Table 3.

Table 3
The ARIMA(1, d, q), q < 4, models of the income index

ARIMA(p, d, q) model    Akaike information criterion    RMSE of forecast    MAE of forecast
ARIMA(1, 1, 0)          1742.29                         26.08               15.64
ARIMA(1, 1, 1)          1685.3                          27.02               16.61
ARIMA(1, 1, 2)          1676.16                         28.13               18.21
ARIMA(1, 1, 3)          1678.15                         28.13               18.21
ARIMA(2, 1, 2)          1678.15                         28.13               18.21
Combined model          1678.15                         28.13               18.21

Here four models (including the best one by the information criterion values and the combined model) make predictions of almost the same quality. At the same time, the models with q < 2 make better predictions but have worse information criterion values; this can be explained by overfitting to the training data. Thus, the researcher obtains one more model that is close in quality to a number of the best ones, and it can be used to analyse the signs of the terms where they vary across the fitted models.

In the last experiment the models ARIMA(1, d, q), q < 6, were used. The length of the training set (income index) was 300 values and the length of the test set was again 12. The results are shown in Table 4.

Table 4
The ARIMA(1, d, q), q < 6, models of the income index

ARIMA(p, d, q) model    Akaike information criterion    RMSE of forecast    MAE of forecast
ARIMA(1, 1, 0)          1012.57                         29.16               15.28
ARIMA(1, 1, 1)          976.78                          28.91               16.49
ARIMA(1, 1, 2)          956.53                          29.60               14.66
ARIMA(1, 1, 3)          958.36                          29.89               14.58
ARIMA(1, 1, 4)          903.51                          29.34               14.74
ARIMA(1, 1, 5)          959.62                          29.44               14.74
ARIMA(0, 0, 1)          973.22                          29.60               20.84
Combined model          962.75                          29.44               14.74

There are a few models better than the best one (chosen by the values of the information criteria). Again, in the case of many models, the combined model shows results that are better than those of some models (the worst ones) and worse than those of others (the best ones by prediction quality). Also, one can again see that the parameters of the combined model are close to those of the model with the highest order of the moving average part. Various approaches to the averaging technique have also been tested in [2].

4. Conclusion

In this research averaging of the predictions of ARIMA(p, d, q) time series models is investigated. To describe the variance of various time series models, it has been expressed through psi-weights for AR(1) and ARMA(1, q) models; MA(q) series already have the appropriate form, since their coefficients are equal to the psi-weights. Having an explicit form of the prediction variance over 1, 2 and 3 periods, it is possible to find the models with the lowest variance and to evaluate whether averaging leads to an improvement of the prediction. The models AR(1), ARMA(1, 1) and ARMA(1, 2) have been considered. ARMA(1, q) models with q > 2 have the same prediction variance at these horizons as in the considered cases; differences in their structure appear only for predictions over longer horizons. If there are several models, the averaged one can be the best; this happens when the models predict different behaviour of the investigated time series. In the common case the averaged model has a variance lower than that of the worst model and higher than that of the best one (expression (9)). These results are based on the form of the prediction variance function: it is shown that for ARMA(1, q) models it is convex.
Thus, combined models give the best results, or results close to the best ones, when there are equal numbers of terms with different signs (the terms with large absolute values are the most important because they have larger weights in the averaging). This situation may occur if the investigated time series is difficult to analyse and forecast and the models at hand predict different behaviour (in the case of models with low orders; as the orders increase, some terms in the models naturally take different signs). Time series from Russian macroeconomic statistics [18] have been used as test data for the computational experiments. Combined models are always better than the worst models, so they can be used as another tool of analysis, and in some cases their results are close to the results of the best models. This topic is currently very active: there are papers on averaging and bagging of time series predictions and on bagging with the use of non-linear functions of time series terms [3, 4, 9-11, 19-21]. In future work, bagging of time series and averaging of ARMA(p, q) models with higher orders of the autoregression part (p > 1) are going to be investigated.

5. References

[1] P. Hansen, A. Lunde, J. Nason, Model confidence sets for forecasting models, Econometrica 79.2 (2011) 453-497.
[2] D. Petrusevich, Improvement of time series forecasting quality by means of multiple models prediction averaging, in: E. Semenkin, I. Kovalev (eds.), Proceedings of the III International Workshop on Modeling, Information Processing and Computing (MIP: Computing-2021), Volume 2899, Krasnoyarsk, Russia, 2021, pp. 109-117. doi: 10.47813/dnit-mip3/2021-2899-109-117.
[3] W. Chen, H. Xu, Z. Chen, M. Jiang, A novel method for time series prediction based on error decomposition and nonlinear combination of forecasters, Neurocomputing 426 (2021) 85-103. doi: 10.1016/j.neucom.2020.10.048.
[4] K. Chen, Y. Peng, S. Lu, B. Lin, X. Li, Bagging based ensemble learning approaches for modeling the emission of PCDD/Fs from municipal solid waste incinerators, Chemosphere 274 (2021) 129802. doi: 10.1016/j.chemosphere.2021.129802.
[5] R. J. Hyndman, R. A. Ahmed, G. Athanasopoulos, H. L. Shang, Optimal combination forecasts for hierarchical time series, Computational Statistics & Data Analysis 55.9 (2011) 2579-2589. doi: 10.1016/j.csda.2011.03.006.
[6] N. Shafik, G. Tutz, Boosting nonlinear additive autoregressive time series, Computational Statistics & Data Analysis 53.7 (2009) 2453-2464. doi: 10.1016/j.csda.2008.12.006.
[7] J. M. Matías, M. Febrero-Bande, W. González-Manteiga, J. C. Reboredo, Boosting GARCH and neural networks for the prediction of heteroskedastic time series, Mathematical and Computer Modelling 51.3-4 (2010) 256-271. doi: 10.1016/j.mcm.2009.08.013.
[8] L. Breiman, Random forests, Machine Learning 45 (2001) 5-32. doi: 10.1023/A:1010933404324.
[9] M. H. Dal Molin Ribeiro, L. dos Santos Coelho, Ensemble approach based on bagging, boosting and stacking for short-term prediction in agribusiness time series, Applied Soft Computing 86 (2020) 105837. doi: 10.1016/j.asoc.2019.105837.
[10] F. Petropoulos, R. J. Hyndman, C. Bergmeir, Exploring the sources of uncertainty: Why does bagging for time series forecasting work?, European Journal of Operational Research 268.2 (2018) 545-554. doi: 10.1016/j.ejor.2018.01.045.
[11] E. Meira, F. L. C. Oliveira, J. Jeon, Treating and Pruning: New approaches to forecasting model selection and combination using prediction intervals, International Journal of Forecasting 37.2 (2021) 547-568. doi: 10.1016/j.ijforecast.2020.07.005.
[12] S. Pellegrini, E. Ruiz, A. Espasa, Prediction intervals in conditionally heteroscedastic time series with stochastic components, International Journal of Forecasting 27.2 (2011) 308-319. doi: 10.1016/j.ijforecast.2010.05.007.
[13] K. A. Boikov, M. S. Kostin, G. V. Kulikov, Radiosensory diagnostics of signal integrity in-circuit and peripheral architecture of microprocessor devices, Russian Technological Journal 9.4 (2021) 20-27. (In Russ.) doi: 10.32362/2500-316X-2021-9-4-20-27.
[14] N. M. Legkiy, N. V. Mikheev, Selection of location of radiators in a non-equidistant antenna array, Russian Technological Journal 8.6 (2020) 54-62. (In Russ.) doi: 10.32362/2500-316X-2020-8-6-54-62.
[15] H. Wold, A study in the analysis of stationary time series, 2nd edition, Almqvist and Wiksell Book Co., Uppsala, 1954.
[16] R. J. Hyndman, G. Athanasopoulos, Forecasting: principles and practice, 2nd edition, OTexts, Melbourne, Australia, 2018.
[17] J. H. Stock, M. W. Watson, Introduction to Econometrics, Pearson, 2019.
[18] Dynamic series of macroeconomic statistics of the Russian Federation. Wage index, income index, 2021. URL: http://sophist.hse.ru/hse/nindex.shtml
[19] S. F. Stefenon, M. H. Dal Molin Ribeiro, A. Nied, K.-C. Yow, V. C. Mariani, L. dos Santos Coelho, L. O. Seman, Time series forecasting using ensemble learning methods for emergency prevention in hydroelectric power plants with dam, Electric Power Systems Research 202 (2022) 107584. doi: 10.1016/j.epsr.2021.107584.
[20] M. Larrea, A. Porto, E. Irigoyen, A. J. Barragán, J. M. Andújar, Extreme learning machine ensemble model for time series forecasting boosted by PSO: Application to an electric consumption problem, Neurocomputing 452 (2021) 465-472. doi: 10.1016/j.neucom.2019.12.140.
[21] R. Godahewa, K. Bandara, G. I. Webb, S. Smyl, C. Bergmeir, Ensembles of localised models for time series forecasting, Knowledge-Based Systems 233 (2021) 107518. doi: 10.1016/j.knosys.2021.107518.