An Adaptive Forecasting of Nonlinear Nonstationary Time Series under Short Learning Samples

Elena Mantula1 and Vladimir Mashtalir1

1 Kharkiv National University of Radio Electronics, Informatics Department, Lenin ave., 14, 61166, Kharkiv, Ukraine
Mashtalir@kture.kharkov.ua, ElenaMantula@gmail.com

Abstract. Methods of nonstationary nonlinear time series forecasting under bounded a priori information provide an interdisciplinary application area concerned with learning and adaptation of solutions from a traditional artificial intelligence point of view. It is extremely difficult to solve this type of problem in its general form; therefore, an approach based on the additive nonlinear autoregressive model with exogenous inputs, implemented as a set of parallel adalines, is proposed. To find an optimal combination of forecasts, an improvement of global random search is suggested.

Keywords. Neural networks, forecasting model, combination of forecasts

Key terms. Environment, MathematicalModel

1 Introduction

'Conscious' decision making, in all its possible varieties, is perhaps the principal goal of artificial intelligence systems. The necessary 'creativity' implies the ability to produce novel solutions that are better than previous ones. Computational tools that assist in decision making should take into account all aspects of dissimilarity between a priori and a posteriori uncertainty. Uncertainty is, per se, a manifestation of information deficiency, and relevant information is, on the contrary, a capacity to reduce uncertainty. The elimination of such content-rich gaps provides the groundwork of knowledge engineering and management. In machine intelligence, manifold forecasts can be used for knowledge production. The goal of the paper is a reasonable (ideally optimal) combination of forecasts that provides reliable semantic interpretation of the achieved results for the purpose of knowledge generation.

Nowadays mathematical forecasting models of the behavior of objects, systems and phenomena in a wide variety of applications are well understood, and there is a wealth of publications on this subject. It should be noted that the behavior of the objects is often given in the form of time series, so a variety of approaches to time series analysis can be used for forecasting. Such approaches can be traditional statistical methods (regression, correlation, spectral, Box-Jenkins), adaptive methods based on exponential smoothing and on tuning or learning forecasting models, or intellectual methods using various neural networks.

At present there are many objects (financial, economical, biomedical, etc.) described by time series containing unknown behavior trends, seasonal components, stochastic and random components, which significantly complicate the synthesis of an effective predictive model. This complexity is especially pronounced in environmental monitoring problems [1], where the analyzed time series show in equal measure stochastic and chaotic changes, exhibit apparent nonstationarity and are subjected to striking changes. In these conditions artificial neural networks have proved to be the most useful tools [2-13].
As a rule, they realize the so-called NARX-model [14], which has the form

$$\hat{y}(k) = f\bigl(y(k-1),\dots,y(k-n_A),\,x(k-1),\dots,x(k-n_B)\bigr) \qquad (1)$$

where $\hat{y}(k)$ is an estimate of the forecasted variable $y(k)$ at discrete time $k=1,2,\dots$; $f(\cdot)$ denotes a certain nonlinear transform realized by a neural network; $x(k)$ is the observed exogenous factor that influences the behavior of $y(k)$; $n_A$, $n_B$ are observation memory parameters.

Moreover, it is not only a matter of insufficiency of available observations: the properties of the time series (e.g., such an indicator as air pollution in ecological forecasting) change so often that a neural network does not have time to detect separate stationary parts. In this connection there is a need to construct, on the basis of the neural network approach, simplified predictive models whose training requires a sufficiently small data set.

2 Synthesis of a forecasting model

In conditions of input data lack, instead of the NARX-model (1) it is appropriate to use the so-called ANARX-model introduced in [15, 16] and fully investigated in [17, 18]. In general the ANARX-model can be written as

$$\hat{y}(k) = f_1(y(k-1),x(k-1)) + f_2(y(k-2),x(k-2)) + \dots + f_{\max\{n_A,n_B\}}(y(k-n_A),x(k-n_B)) = \sum_{l=1}^{\max\{n_A,n_B\}} f_l(y(k-l),x(k-l)) \qquad (2)$$

where the original task is decomposed into many local ones with two input variables $y(k-l)$, $x(k-l)$, $l=1,2,\dots,\max\{n_A,n_B\}$.

For such nonlinear transforms it is quite convenient to use the so-called N-adaline (abbr.: adaptive linear element) [19-21], which provides a quadratic approximation of the data sequence. Fig. 1(a) demonstrates the architecture of the N-adaline and Fig. 1(b) illustrates the architecture of the ANARX-model constructed from N-adalines. As can be seen, the N-adaline is a conventional two-input adaline with a nonlinear preprocessor formed by three product blocks, and it evaluates the quadratic combination of the form

$$f_l(y(k-l),x(k-l)) = w_{l0} + w_{l1}y(k-l) + w_{l2}y^2(k-l) + w_{l3}y(k-l)x(k-l) + w_{l4}x^2(k-l) + w_{l5}x(k-l)$$

where each N-adaline contains six synaptic weights $w_{lp}$, $l=1,2,\dots,\max\{n_A,n_B\}$, $p=0,1,\dots,5$. As a matter of fact, the ANARX-model is formed by two lines of delay elements $z^{-1}$ and $\max\{n_A,n_B\}$ N-adalines learned in parallel.

Fig. 1. (a) N-adaline; (b) ANARX model based on N-adalines

Each of the N-adalines can be configured with any of the linear learning algorithms [22]; however, it is clear that a limited amount of a priori information requires the use of time-optimal procedures. One such procedure is, for example, the adaptive-multiplicative modification of the Kaczmarz adaptive algorithm [23], which assumes in this case the form

$$w_l(k) = w_l(k-1) + \lambda\,\frac{y(k)-w_l^T(k-1)\varphi_l(k)}{\rho + \|\varphi_l(k)\|^2}\,\varphi_l(k) \qquad (3)$$

where $w_l = (w_{l0},w_{l1},w_{l2},w_{l3},w_{l4},w_{l5})^T$; $\varphi_l(k) = (1,\,y(k-l),\,y^2(k-l),\,y(k-l)x(k-l),\,x^2(k-l),\,x(k-l))^T$; $0<\lambda\le 2$, $\rho\ge 0$ are algorithm parameters selected on empirical grounds.
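To make the scheme concrete, here is a minimal Python/NumPy sketch (the names phi, anarx_regressor, kaczmarz_step and the synthetic series are illustrative assumptions, not from the paper). Since the ANARX-model (2) built from N-adalines is linear in all its weights, one convenient way to apply update (3) is to stack the per-adaline regressors into a single vector:

```python
import numpy as np

def phi(y_lag, x_lag):
    """Regressor of the l-th N-adaline for the pair (y(k-l), x(k-l)):
    (1, y, y^2, y*x, x^2, x)^T, matching the quadratic form f_l above."""
    return np.array([1.0, y_lag, y_lag**2, y_lag * x_lag, x_lag**2, x_lag])

def anarx_regressor(y_hist, x_hist):
    """Stacked regressor of the whole ANARX-model (2): concatenation of the
    n = max(nA, nB) per-adaline vectors phi_l, l = 1, ..., n."""
    return np.concatenate([phi(yl, xl) for yl, xl in zip(y_hist, x_hist)])

def kaczmarz_step(w, Phi, y_k, lam=1.0, rho=1e-3):
    """Adaptive-multiplicative Kaczmarz update (3):
    w(k) = w(k-1) + lam * (y(k) - w^T Phi) * Phi / (rho + ||Phi||^2),
    with 0 < lam <= 2 and rho >= 0 chosen on empirical grounds."""
    return w + lam * (y_k - w @ Phi) * Phi / (rho + Phi @ Phi)

# Toy run on a synthetic series (illustration only).
rng = np.random.default_rng(0)
n = 3                                  # max(nA, nB)
w = np.zeros(6 * n)                    # n N-adalines, six weights each
y = np.sin(0.1 * np.arange(200)) + 0.05 * rng.standard_normal(200)
x = np.cos(0.05 * np.arange(200))
for k in range(n, 200):
    y_hist = y[k - n:k][::-1]          # y(k-1), ..., y(k-n)
    x_hist = x[k - n:k][::-1]
    Phi = anarx_regressor(y_hist, x_hist)
    y_hat = w @ Phi                    # one-step ANARX forecast (2)
    w = kaczmarz_step(w, Phi, y[k])    # update all weights by (3)
```

The stacked update decomposes into the parallel per-adaline updates of Fig. 1(b), since all adalines share the common prediction error.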
If the data sequences are 'contaminated' by perturbations, instead of the one-step algorithm (3) it is profitable to apply procedures that filter the perturbations and at the same time remain suitable for use in nonstationary conditions. Here a modification of the recursive least squares method on a sliding window can be used [24]. The traditional least squares estimate on a window of $s$ observations has the form

$$w_l(k) = \Bigl(\sum_{\tau=k-s+1}^{k}\varphi_l(\tau)\varphi_l^T(\tau)\Bigr)^{-1}\sum_{\tau=k-s+1}^{k}\varphi_l(\tau)\,y(\tau)$$

and its recurrent version can be presented as

$$\begin{cases}
P(k) = P_s(k-1) - \dfrac{P_s(k-1)\varphi_l(k)\varphi_l^T(k)P_s(k-1)}{1+\varphi_l^T(k)P_s(k-1)\varphi_l(k)},\\[1mm]
P_s(k) = P(k) + \dfrac{P(k)\varphi_l(k-s)\varphi_l^T(k-s)P(k)}{1-\varphi_l^T(k-s)P(k)\varphi_l(k-s)},\\[1mm]
p_s(k) = p_s(k-1) + \varphi_l(k)y(k) - \varphi_l(k-s)y(k-s),\\[1mm]
w_l(k) = P_s(k)p_s(k).
\end{cases} \qquad (4)$$

We also note that while algorithm (3) is in fact a time-optimal gradient procedure, algorithm (4) is produced by a Gauss-Newton optimization procedure.
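A sketch of the sliding-window recursion (4) in the same vein (the class name and the initialization $P_s(0)=\delta I$ with large $\delta$ are conventional assumptions of this illustration, not prescribed by the paper):

```python
import numpy as np
from collections import deque

class SlidingWindowRLS:
    """Recursive least squares on a window of s observations, per (4):
    every step includes the newest pair (phi(k), y(k)) and, once the
    window is full, excludes the oldest pair (phi(k-s), y(k-s))."""

    def __init__(self, dim, s, delta=1e3):
        self.s = s
        self.Ps = delta * np.eye(dim)   # P_s(0): standard large-delta start
        self.ps = np.zeros(dim)         # information vector p_s
        self.buf = deque()              # the s most recent (phi, y) pairs

    def update(self, ph, y):
        # Include the new observation (first relation of (4)).
        P = self.Ps - (self.Ps @ np.outer(ph, ph) @ self.Ps) / (1.0 + ph @ self.Ps @ ph)
        self.buf.append((ph, y))
        self.ps = self.ps + ph * y
        if len(self.buf) > self.s:
            # Exclude the observation leaving the window (second relation).
            ph_old, y_old = self.buf.popleft()
            P = P + (P @ np.outer(ph_old, ph_old) @ P) / (1.0 - ph_old @ P @ ph_old)
            self.ps = self.ps - ph_old * y_old
        self.Ps = P
        return self.Ps @ self.ps        # w_l(k) = P_s(k) p_s(k)

# Example: tracking one N-adaline (dim = 6) with a window of s = 30 samples.
rng = np.random.default_rng(0)
rls = SlidingWindowRLS(dim=6, s=30)
w_true = rng.standard_normal(6)
for k in range(100):
    ph = rng.standard_normal(6)
    w_est = rls.update(ph, w_true @ ph + 0.01 * rng.standard_normal())
```

Each of the $\max\{n_A,n_B\}$ N-adalines would hold its own such estimator; shorter windows track nonstationarity faster at the price of noisier estimates.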
3 Optimal combination of forecasts

In real conditions the choice of the forecasting model structure is not a trivial task, especially since the same time series can be effectively described by a variety of different models. Also, the values of the lag orders $n_A$, $n_B$ remain unknown, which makes it necessary to consider a set of competing models, and the nonstationarity of the analyzed series necessitates the use of various learning algorithms (in this case, (3), (4)) with different values of $\lambda$, $\rho$, $s$. Thus, there arises a set of forecasts of the same process, from which we have to select the best one.

To find the best forecast it is possible to use a sufficiently effective approach based on the optimal combination of forecasts [25], under which a linear combination, optimal in the sense of a given criterion $J_c$, is sought for the set of existing forecasts $\hat{y}_j(k)$, $j=1,2,\dots,m$, of the same series:

$$\hat{y}(k) = \sum_{j=1}^{m} c_j \hat{y}_j(k) \qquad (5)$$

where the parameters of the combination satisfy the unbiasedness condition

$$\sum_{j=1}^{m} c_j = 1. \qquad (6)$$

In [25], an analytical approach to finding the weights $c_j$ in (5) by optimizing the sum of squared forecast errors under the constraint (6) is proposed. The use of the one-step squared forecast error criterion leads to the estimate

$$c_j(k) = \frac{\hat{y}_j(k)}{\sum_{j=1}^{m}\hat{y}_j(k)}.$$

However, analytical parameter estimates can be obtained only under the standard quadratic criterion $J_c$, whose derivatives are linear, so that the solution of the problem reduces to solving a system of linear equations. At the same time, for practitioners the assessment of forecasting quality by the residual variance is, as a rule, unconvincing, and therefore characteristics that estimate the accuracy in percent are generally used, such as the minimum of the mean absolute percentage error

$$MAPE = \frac{100\%}{N}\sum_{k=1}^{N}\frac{|y(k)-\hat{y}(k)|}{y(k)} \qquad (7)$$

or the maximum of the determination coefficient

$$R^2 = \biggl(1 - \frac{\sum_{k=1}^{N}\bigl(y(k)-\hat{y}(k)\bigr)^2}{\sum_{k=1}^{N}\bigl(y(k)-\frac{1}{N}\sum_{k=1}^{N}y(k)\bigr)^2}\biggr)\cdot 100\%. \qquad (8)$$

It is obvious that in this case analytical estimates cannot be obtained, and the use of gradient optimization procedures becomes more complicated due to the sufficiently complex properties of the functions (7), (8).
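Before turning to search procedures, note that the combined forecast (5) and the criteria (7), (8) are straightforward to compute; the following minimal sketch (function names and synthetic data are illustrative assumptions) also serves as the objective for the search below:

```python
import numpy as np

def combine(c, Y_hat):
    """Combined forecast (5): Y_hat is an (m, N) array of individual
    forecasts, c a weight vector satisfying sum(c) = 1 per (6)."""
    return c @ Y_hat

def mape(y, y_hat):
    """Criterion (7): mean absolute percentage error (assumes y(k) != 0)."""
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

def r2(y, y_hat):
    """Criterion (8): determination coefficient, in percent."""
    return 100.0 * (1.0 - np.sum((y - y_hat) ** 2)
                    / np.sum((y - np.mean(y)) ** 2))

# Example: three competing forecasts of the same series.
rng = np.random.default_rng(1)
y = np.linspace(1.0, 2.0, 50) + 0.1 * rng.standard_normal(50)
Y_hat = np.vstack([y + 0.2 * rng.standard_normal(50) for _ in range(3)])
c = np.array([0.5, 0.3, 0.2])          # any weights with sum(c) = 1
y_comb = combine(c, Y_hat)
print(mape(y, y_comb), r2(y, y_comb))
```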
In this connection the use of genetic algorithms is proposed in [26, 27]. Though such algorithms can find the global extremum, their distinctive features are numerical awkwardness, a set of free parameters that must be defined by the user and, finally, a low rate of convergence. Therefore, attention should be paid to a more efficient approach based on random search [28] and its adaptive modifications.

The simplest procedure that allows searching for a global extremum is the walking random global search [28]. In general, this procedure is a stochastic extension of the regular gradient search: to provide the global search, a random disturbance $\xi(k)$ is superimposed on the gradient movement, which creates a stochastic walking mode.

In the continuous case, the gradient method of minimization (maximization) of the goal function $J_c(t)$ is reduced to the motion of a point $c(t) = (c_1(t),\dots,c_j(t),\dots,c_m(t))$ in the $m$-dimensional space of adjustable parameters by a force directed along the antigradient. The trajectory of movement along the antigradient leads the tuning process to a singular point. If the starting point $c(0)$ belongs to the attraction region of the global extremum, then the corresponding trajectory leads to the global minimum of the function $J_c(t)$. But if the point $c(0)$ does not belong to this region, the movement in the direction of the antigradient results in a local minimum, from which it is impossible to escape under the influence of forces directed along the antigradient. Exactly for this reason it is helpful to use a random mechanism. Random shocks may help the point $c(t)$ to overcome the barrier that separates the local minimum in which the learning process is trapped from the area in which the objective function $J_c(t)$ could decrease further. Under the influence of the 'skew' toward the antigradient and random shocks such movement is determined by the differential equation

$$\frac{dc(t)}{dt} = -\eta\,\nabla_c J_c(t) + \xi(t)$$

where $\xi(t)$ is an $m$-dimensional normal random process with zero mathematical expectation, delta-shaped autocorrelation function and component variance $\sigma^2$; $\eta$ is the step parameter; $\nabla_c$ denotes the gradient vector. It should be emphasized that for the function (7) the components of the gradient can take only the values $+1$ or $-1$. Generally, this algorithm provides searching for a global extremum [29].

The search for a global extremum can be sped up by a reasonable selection of $\sigma^2$, and adaptation can be introduced into this process in two ways. First, by introducing inertia into the learning process, it is possible to obtain a search similar to the movement by the 'heavy ball' method [30]. Such movement is described by the differential equation

$$\frac{d^2 c(t)}{dt^2} + b\,\frac{dc(t)}{dt} = -\nabla_c J_c(t) + \xi(t) \qquad (9)$$

where $b$ is the shockproofing coefficient (the larger $b$, the less manifest the inserted inertia). For time series processing, i.e. in discrete time, procedure (9) corresponds to the learning algorithm described by the second-order difference equation [31]

$$c(k) = c(k-1) + b\,c(k-2) - \eta(k)\,\nabla_c J_c(k) + \xi(k) \qquad (10)$$

coinciding for $b=0$ with the walking random search. It is interesting to note that (10) is none other than an ARX-model of the second order.

Second, adaptation in the process of global search can be introduced by controlling the random process $\xi(t)$, for example,

$$\frac{d\xi(t)}{dt} = -\gamma\,\xi(t) - \eta\,\frac{dJ_c(t)}{dt} + \sigma^2 H(t) \qquad (11)$$

where $\gamma > 0$ is the autocorrelation parameter of the random process $\xi(t)$; $H(t)$ is a vector of flat random noise. Let us introduce a modification of (11) in the discrete form

$$\xi(k) = (1-\gamma)\,\xi(k-1) - \eta(k)\,\Delta J_c(k) + \sigma^2 H(k) \qquad (12)$$

where $\Delta$ is the symbol of the first difference (the discrete analogue of the derivative). As is easily seen from (11), (12), the search process can be optimized by an appropriate selection of the parameters $\eta$, $\gamma$ and $\sigma^2$, since each of them acts on certain properties of the search.

Indeed, variation of the autocorrelation parameter $\gamma$ determines the rate of decay of the process $\xi(k)$, which regulates its relation to the past. Thus, one can influence the search, making it more or less dependent on the previous history if necessary. A few words of comment are desirable to explain the interaction of the parameters $\eta$ and $\gamma$. If the search step $\eta$ determines the intensity of accumulation of learning experience, then $\gamma$ characterizes the level of forgetting of this experience during the search. In this sense, these parameters are antagonistic. If $\gamma = 0$ and there is no forgetting, the vector $\xi(k)$ increases in the direction of the antigradient. The variance of the process $\xi(k)$ is determined by the value $\sigma^2$ and the intensity of the flat random noise disturbance $H(k)$. If $\sigma^2$ is sufficiently large, the search may become unstable, while at low values the global properties worsen. Thus, the use of the modified global random search significantly simplifies the tuning of the linear combination $c_j(k)$, $j=1,2,\dots,m$.
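A hedged sketch of the adaptive search, combining the second-order recursion (10) with the noise control (12). Several details are assumptions of this illustration rather than the paper's prescription: the gradient is estimated by finite differences, the scalar increment $\Delta J_c(k)$ in (12) is broadcast over all components of $\xi(k)$, and the weights are renormalized after each step as one simple way to maintain the unbiasedness constraint (6):

```python
import numpy as np

def num_grad(J, c, h=1e-5):
    """Finite-difference estimate of grad J(c) (a stand-in; for MAPE the
    analytic gradient components reduce to signs, as noted above)."""
    g = np.zeros_like(c)
    for j in range(len(c)):
        e = np.zeros_like(c)
        e[j] = h
        g[j] = (J(c + e) - J(c - e)) / (2.0 * h)
    return g

def adaptive_random_search(J, c0, iters=2000, eta=0.05, b=0.1,
                           gamma=0.1, sigma=0.05, seed=0):
    """Modified global random search: recursion (10) with inertia
    coefficient b, driven by a random term xi(k) adapted as in (12)."""
    rng = np.random.default_rng(seed)
    c_prev2 = c_prev = c0.copy()
    xi = np.zeros_like(c0)
    J_prev = J(c0)
    best_c, best_J = c0.copy(), J_prev
    for _ in range(iters):
        g = num_grad(J, c_prev)
        c = c_prev + b * c_prev2 - eta * g + xi     # cf. (10)
        c = c / c.sum()                             # enforce (6), assumed here
        J_cur = J(c)
        # Noise control (12): decay, criterion-increment drift, flat noise.
        xi = ((1.0 - gamma) * xi - eta * (J_cur - J_prev)
              + sigma**2 * rng.uniform(-1.0, 1.0, size=c.shape))
        if J_cur < best_J:
            best_c, best_J = c.copy(), J_cur
        c_prev2, c_prev, J_prev = c_prev, c, J_cur
    return best_c, best_J

# Toy check with a known minimizer on the simplex; with the previous sketch
# one would pass, e.g., J = lambda c: mape(y, c @ Y_hat).
target = np.array([0.6, 0.3, 0.1])
J = lambda c: float(np.sum((c - target) ** 2))
c_opt, J_opt = adaptive_random_search(J, np.ones(3) / 3)
print(c_opt, J_opt)
```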
4 Conclusion

The problem of nonstationary nonlinear time series forecasting under bounded a priori information has been considered. An approach based on the additive nonlinear autoregressive model with exogenous inputs, implemented as a set of parallel adalines, has been proposed. To find an optimal combination of forecasts, an improvement of global random search has been suggested. Distinctive features of the approach are its computational simplicity and the high performance attained by a significant reduction of the number of adjustable parameters.

References

1. Zannetti, P.: Air Pollution Modeling. Van Nostrand Reinhold, New York (1990)
2. Reich, S.L., Gomez, D.R., Dawidowski, L.E.: Artificial Neural Network for the Identification of Unknown Air Pollution Sources. Atmospheric Environment, Vol. 33, pp. 3045-3052 (1999)
3. Perez, P., Trier, A., Reyes, J.: Prediction of PM2.5 Concentration Several Hours in Advance Using Neural Networks in Santiago, Chile. Atmospheric Environment, Vol. 34, pp. 1189-1196 (2000)
4. Niska, H., Hiltunen, T., Karppinen, A., Ruuskanen, J., Kolehmainen, M.: Evolving the Neural Network Model for Forecasting Air Pollution Time Series. Engineering Applications of Artificial Intelligence, Vol. 17, pp. 159-167 (2004)
5. Corani, G.: Air Quality Prediction in Milan: Feed-Forward Neural Networks, Pruned Neural Networks and Lazy Learning. Ecological Modelling, Vol. 185, pp. 513-529 (2005)
6. Athanasiadis, I.N., Karatzas, K.D., Mitkas, P.A.: Classification Techniques for Air Quality Forecasting. In: Brewka, G., Coradeschi, S., Perini, A., Traverso, P. (eds.): Proc. 17th European Conference on Artificial Intelligence, IOS Press, Amsterdam, pp. 4.1-4.7 (2006)
7. Perez, P., Reyes, J.: An Integrated Neural Network Model for PM10 Forecasting. Atmospheric Environment, Vol. 40, pp. 2845-2857 (2006)
8. Lira, T.S., Barrozo, M.A.S., Assis, A.J.: Air Quality Prediction in Uberlandia, Brazil, Using Linear Models and Neural Networks. In: Plesu, V., Agachi, P. (eds.): Proc. 17th European Symp. on Computer Aided Process Engineering, Elsevier, Amsterdam, pp. 1-6 (2007)
9. Kurt, A., Gulbagci, B., Karaca, F., Alagha, O.: An Online Air Pollution Forecasting System Using Neural Networks. Environment International, Vol. 34, pp. 592-598 (2008)
10. Carnevale, C., Finzi, G., Pisoni, E., Volta, M.: Neuro-Fuzzy and Neural Network Systems for Air Quality Control. Atmospheric Environment, Vol. 43, pp. 4811-4821 (2009)
11. Nagendra, S.M.S., Khare, M.: Modelling Urban Air Quality Using Artificial Neural Network. Clean Technologies and Environmental Policy, Vol. 7, pp. 116-126 (2005)
12. Aktan, M., Bayraktar, H.: The Neural Network Modeling of Suspended Particulate Matter with Autoregressive Structure. Ekoloji, Vol. 19, No. 74, pp. 32-37 (2010)
13. Esau, I.: On Application of Artificial Neural Network Methods in Large-Eddy Simulations with Unresolved Urban Surfaces. Modern Applied Science, Vol. 4, No. 8, pp. 3-11 (2010)
14. Nelles, O.: Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Springer, Berlin (2001)
15. Chowdhury, F.N.: Input-Output Modeling of Nonlinear Systems with Time-Varying Linear Models. IEEE Trans. on Automatic Control, Vol. 45, No. 7, pp. 1355-1358 (2000)
16. Kotta, Ü., Sadegh, N.: Two Approaches for State Space Realization of NARMA Models: Bridging the Gap. Mathematical and Computer Modelling of Dynamical Systems, Vol. 8, No. 1, pp. 21-32 (2002)
17. Belikov, J., Vassiljeva, K., Petlenkov, E., Nomm, S.: A Novel Taylor Series Based Approach for Control Computation in NN-ANARX Structure Based Control of Nonlinear Systems. In: Proc. 27th Chinese Control Conference, Beihang University Press, Kunming, pp. 474-478 (2008)
18. Vassiljeva, K., Petlenkov, E., Belikov, J.: State-Space Control of Nonlinear Systems Identified by ANARX and Neural Network Based SANARX Models. In: Proc. WCCI 2010 IEEE World Congress on Computational Intelligence, IEEE CSS, Piscataway, pp. 3816-3823 (2010)
19. Pham, D.T., Liu, X.: Modeling and Prediction Using GMDH Networks of Adalines with Nonlinear Preprocessors. Int. J. Systems Science, Vol. 25, No. 11, pp. 1743-1759 (1994)
20. Pham, D.T., Liu, X.: Neural Networks for Identification, Prediction and Control. Springer, London (1995)
21. Rudenko, O.G., Bodyanskiy, Ie.V.: Artificial Neural Networks. SMIT, Kharkov (2005) (in Russian)
22. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, New York (1999)
23. Raybman, N.S., Chadeev, V.M.: Creating of Manufacture Process Models. Energiya, Moscow (1975) (in Russian)
24. Perelman, I.I.: Operative Identification of Control Objects. Energoatomizdat, Moscow (1982) (in Russian)
25. Sharkey, A.J.C.: On Combining Artificial Neural Nets. Connection Science, Vol. 8, No. 3, pp. 299-314 (1996)
26. Zagoruiko, N.G.: Empirical Prediction. Nauka, Novosibirsk (1979) (in Russian)
27. Zagoruiko, N.G.: Applied Approach of Data Analysis. 264 p. (1999) (in Russian)
28. Rastrigin, L.A.: Statistical Search Technology. Nauka, Moscow (1968) (in Russian)
29. Rastrigin, L.A.: Systems of Extremal Control. Nauka, Moscow (1974) (in Russian)
30. Polyak, B.T.: Introduction into Optimization. Nauka, Moscow (1983) (in Russian)
31. Bodyanskiy, Ie.V., Rudenko, O.G.: Artificial Neural Networks: Architectures, Learning, Applications. TELETEX, Kharkov (2004) (in Russian)