<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Machine Learning for Multi-step Ahead Forecasting of Volatility Proxies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacopo De Stefani</string-name>
          <email>jacopo.de.stefani@ulb.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Caelen</string-name>
          <email>olivier.caelen@worldline.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dalila Hattab</string-name>
          <email>dalila.hattab@equensworldline.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca Bontempi</string-name>
          <email>gianluca.bontempi@ulb.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Equens Worldline R&amp;D, Lille (Seclin)</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Machine Learning Group, Département d'Informatique, Université Libre de Bruxelles</institution>
          ,
          <addr-line>Boulevard du Triomphe CP212, 1050 Brussels</addr-line>
          ,
<country country="BE">Belgium</country>
          <institution>Worldline SA/NV R&amp;D</institution>
          ,
          <addr-line>Bruxelles</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In finance, volatility is defined as a measure of the variation of a trading price series over time. As volatility is a latent variable, several measures, named proxies, have been proposed in the literature to represent this quantity. The purpose of our work is twofold. On the one hand, we aim to perform a statistical assessment of the relationships among the most used proxies in the volatility literature. On the other hand, while the majority of the reviewed studies in the literature focus on a univariate time series model (NAR) using a single proxy, we propose here a NARX model, combining two proxies to predict one of them, showing that it is possible to improve the prediction of the future value of some proxies by using the information provided by the others. Our results, employing artificial neural networks (ANN), k-Nearest Neighbours (kNN) and support vector regression (SVR), show that the supplementary information carried by the additional proxy can be used to reduce the forecasting error of the aforementioned methods. We conclude by explaining how we wish to further investigate this relationship.</p>
      </abstract>
      <kwd-group>
<kwd>financial time series</kwd>
        <kwd>volatility forecasting</kwd>
        <kwd>multi-step ahead forecast</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction and problem statement</title>
      <p>
In time series forecasting, the largest body of research focuses on the prediction
of the future values of a time series, with either a single- or a multiple-step-ahead
forecasting horizon, given historical knowledge about the series itself. In
statistical terms, this problem is equivalent to forecasting the expected value of the
time series in the future, conditioned on the past available information. In the
context of the stock market, the solution to this problem could allow one
to determine the future valuation of a company, thus informing
traders about how to act upon such a change in the valuation. However, from
the traders' standpoint, price is not the only variable of interest. Knowledge
of the intensity of the fluctuations affecting this price (i.e. the stock volatility)
allows them to assess the risk associated with their investment. Since volatility is not
directly observable from the time series, one can compute, according to the granularity and the
type of the available data, different measures, named
volatility proxies [21]. Although volatility proxies based on intraday trading data exist
[17], due to the restrictions on access to such fine-grained data, the rest of our
analysis focuses on proxies for daily data. A standard approach to
volatility forecasting, once a given proxy has been selected, is to apply either a
statistical Generalized AutoRegressive Conditional Heteroskedasticity (GARCH)-like
model [2] or a machine learning model. In addition, several hybrid
approaches are emerging [16, 10, 19], including a non-linear computational
component in the standard GARCH equations. In all the aforementioned cases,
we deal with a univariate problem, where a single time series is used to predict
the future values of the series itself. An exception is represented by the work of
[30], where a volatility proxy is combined with external information (namely the
volume of the queries to a web search engine for a given keyword). This paper
proposes a method for multiple-step-ahead forecasting of a volatility proxy that
incorporates the information from a second proxy in order to improve the prediction
quality. The purpose of our work is twofold. First, we aim to perform a statistical
assessment of the relationships among the most used proxies in the volatility
literature. Second, we explore a NARX (Nonlinear AutoRegressive with eXogenous
input) approach to estimate multiple steps of the output, where the output and
the input are two different proxies. In particular, our preliminary results show
that the statistical dependencies between proxies can be used to improve the
forecasting accuracy. The rest of the paper is structured as follows: Section
2 introduces the notation and provides a unified view of the different
volatility proxies. Section 3 formulates the volatility forecasting
problem as a machine learning task and describes the different tested
models. Section 4 concludes the paper with a discussion of the results and future
research directions.</p>
<p>2 Volatility proxies: definition and notation
In this paper we consider univariate time series whose value at time $t$ is denoted
by the scalar value $y_t$. Let us consider the following quantities of interest, each of
them on a daily time scale: $P_t^{(o)}, P_t^{(c)}, P_t^{(h)}, P_t^{(l)}$, respectively the stock prices at
the opening and closing of the trading day and the maximum and minimum value
for each trading day; and $v_t$, the volume (the number of traded stocks in a given
day). We will assume the availability of a training set of $T$ past observations of
each univariate series.</p>
<p>In the absence of detailed information concerning the price movements within
a given trading day, stock volatility is not directly observable [27]. To cope
with this problem, several different measures (also called proxies) have been
proposed in the econometrics literature [21, 12, 20, 13] to capture this information.
However, there is no consensus in the scientific literature on which volatility
proxy should be employed for a given purpose. We proceed by reviewing the
different types of proxies available in the literature: $\sigma^{SD,n}$, $\sigma^{i}$ and $\sigma^{G}$.</p>
      <p>Volatility as variance. The first proxy corresponds to the natural definition of
volatility [21], that is, a rolling standard deviation of a given stock's continuously
compounded returns over a past time window of size $n$:</p>
<p>$$\sigma_t^{SD,n} = \sqrt{\frac{1}{n-1}\sum_{i=0}^{n-1}\left(r_{t-i} - \bar{r}_n\right)^2} \tag{1}$$
$$r_t = \ln\left(\frac{P_t^{(c)}}{P_{t-1}^{(c)}}\right) \tag{2}$$</p>
      <p>where $r_t$ represents the daily continuously compounded return for day $t$, computed from
the closing prices $P_t^{(c)}$, and $\bar{r}_n$ represents the returns' average over the period
$\{t, \ldots, t-n\}$. In this formulation, $n$ represents the degree of smoothing that is
applied to the original time series.</p>
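<p>As a sketch (not the authors' code; the function names, price values and window size below are illustrative), the rolling standard-deviation proxy of Equations (1) and (2) can be computed from closing prices as follows:</p>

```python
import math

def log_returns(close):
    """Daily continuously compounded returns r_t = ln(P_t / P_{t-1})."""
    return [math.log(close[t] / close[t - 1]) for t in range(1, len(close))]

def sigma_sd(returns, n):
    """Rolling standard deviation of the last n returns (the sigma^{SD,n} proxy)."""
    out = []
    for t in range(n - 1, len(returns)):
        window = returns[t - n + 1 : t + 1]
        mean = sum(window) / n
        var = sum((r - mean) ** 2 for r in window) / (n - 1)
        out.append(math.sqrt(var))
    return out

close = [100.0, 101.5, 99.8, 102.3, 101.0, 103.2, 104.1]
proxy = sigma_sd(log_returns(close), n=5)  # one value per fully covered window
```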
<p>Volatility as a proxy of the coarse-grained intraday information. The $\sigma_t^{i}$
family of proxies is analytically derived in [12] by incorporating supplementary
information (i.e. the opening, maximum and minimum prices for a given trading day)
and trying to optimize the quality of the estimation.</p>
<p>The first estimator $\sigma_t^{0}$, which the authors propose as a benchmark value, simply
consists of the squared value of the returns (i.e. the squared logarithm of the ratio of
consecutive closing prices):</p>
      <p>$$\sigma_t^{0} = \left[\ln\left(\frac{P_{t+1}^{(c)}}{P_t^{(c)}}\right)\right]^2 \tag{3}$$</p>
<p>The second proposition $\sigma_t^{1}$ is able to reduce the variance of the estimator,
by including the opening price and computing a weighted average of two
components, representing respectively the nightly and the daily volatility:</p>
      <p>$$\sigma_t^{1} = \sqrt{\frac{1}{2f}\left[\ln\left(\frac{P_t^{(o)}}{P_{t-1}^{(c)}}\right)\right]^2 + \frac{1}{2(1-f)}\left[\ln\left(\frac{P_t^{(c)}}{P_t^{(o)}}\right)\right]^2} \tag{4}$$</p>
<p>The value of $f \in [0, 1]$ represents the fraction of the trading day during which the
market is closed. In the case of the CAC40, we have that $f &gt; 1 - f$, since trading is
only performed during roughly one third of the day. In this case, the weighting
scheme proposed in (4) gives a higher weight to the intraday volatility with
respect to the nightly one.
The third estimator, derived in [20] through the modeling of the price
evolution as a stochastic diffusion process with unknown variance, is a function of
the variation range (i.e. the difference between the maximum and minimum value
for the current trading day):</p>
      <p>$$\sigma_t^{2} = \sqrt{\frac{a}{f}\left[\ln\left(\frac{P_t^{(o)}}{P_{t-1}^{(c)}}\right)\right]^2 + \frac{1-a}{1-f}\cdot\frac{\left[\ln\left(P_t^{(h)}/P_t^{(l)}\right)\right]^2}{4\ln 2}}$$</p>
<p>Here, $a$ is a weighting parameter whose optimal value, according to the
authors, is shown to be $0.17$, regardless of the value of $f$.</p>
      <p>Furthermore, the same study introduces a family of estimators based on the
normalization of the maximum, minimum and closing values by the opening
price of the considered day. We can then define:</p>
<p>$$u = \ln\left(\frac{P_t^{(h)}}{P_t^{(o)}}\right), \qquad d = \ln\left(\frac{P_t^{(l)}}{P_t^{(o)}}\right), \qquad c = \ln\left(\frac{P_t^{(c)}}{P_t^{(o)}}\right) \tag{5, 6, 7}$$</p>
      <p>where $u$ is the normalized high price, $d$ is the normalized low price and $c$ is
the normalized closing price.</p>
<p>We can derive Equation (8) by starting from a general analytic form for the
estimator and then deriving the optimal values of the coefficients by minimizing
the estimation variance:</p>
      <p>$$\sigma_t^{4} = \sqrt{0.511(u-d)^2 - 0.019\left[c(u+d) - 2ud\right] - 0.383c^2} \tag{8}$$</p>
<p>The values of the coefficients are set assuming that the price dynamics
follows a Brownian motion and enforcing scale-invariance properties and price- and
time-symmetry conditions. For all the details concerning the proof, we refer the
interested reader to [12].</p>
<p>Equation (9) is derived from Equation (8) by eliminating the cross-product
terms:</p>
      <p>$$\sigma_t^{5} = \sqrt{0.5(u-d)^2 - (2\ln 2 - 1)c^2} \tag{9}$$</p>
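<p>A minimal sketch of the two range-based estimators above (our function names and price values are illustrative; the coefficients are those of Equations (8) and (9)):</p>

```python
import math

def normalized_prices(o, h, l, c):
    """u, d, c: high, low and close normalized by the opening price (in logs)."""
    return math.log(h / o), math.log(l / o), math.log(c / o)

def sigma4(o, h, l, c):
    """Optimal analytic estimator of Equation (8)."""
    u, d, cc = normalized_prices(o, h, l, c)
    var = 0.511 * (u - d) ** 2 - 0.019 * (cc * (u + d) - 2 * u * d) - 0.383 * cc ** 2
    return math.sqrt(max(var, 0.0))

def sigma5(o, h, l, c):
    """Equation (9): Equation (8) without the cross-product terms."""
    u, d, cc = normalized_prices(o, h, l, c)
    var = 0.5 * (u - d) ** 2 - (2 * math.log(2) - 1) * cc ** 2
    return math.sqrt(max(var, 0.0))

s4 = sigma4(100.0, 103.0, 98.5, 101.2)
s5 = sigma5(100.0, 103.0, 98.5, 101.2)
```

<p>On typical OHLC data the two estimators give very close values, since the cross-product terms contribute little.</p>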
<p>Last but not least, the best estimator in terms of estimation-variance
efficiency is obtained by combining the overnight volatility measure with the optimal
estimator described in Equation (8):</p>
      <p>$$\sigma_t^{6} = \sqrt{\frac{a}{f}\left[\ln\left(\frac{P_{t+1}^{(o)}}{P_t^{(c)}}\right)\right]^2 + \frac{1-a}{1-f}\left(\sigma_t^{4}\right)^2}$$</p>
      <p>GARCH-based volatility. Even though the GARCH($p$,$q$) [14] (Generalized
AutoRegressive Conditional Heteroskedasticity) family of models is generally
employed for volatility forecasting, we decided to consider it here as a filter
that, given the original time series, returns its estimation of the series' volatility.
All GARCH models assume that the return time series can be expressed as
the sum of two components: a deterministic trend $\mu_t$ and a stochastic
time-varying component $\varepsilon_t$. The stochastic component can be further decomposed
and expressed as the product of a sequence of independent and identically
distributed random variables $Z_t$, with null mean and unit variance, and a
time-varying scaling factor $\sigma_t^{G}$:</p>
      <p>$$r_t = \mu_t + \varepsilon_t \tag{11}$$
$$\varepsilon_t = Z_t\,\sigma_t^{G} \tag{12}$$</p>
      <p>The core of the model is the variance equation, describing how the residuals $\varepsilon_t$
and the past values of $\sigma^{G}$ affect the future volatility:</p>
      <p>$$\sigma_t^{G} = \sqrt{\omega + \sum_{j=1}^{p} \beta_j \left(\sigma_{t-j}^{G}\right)^2 + \sum_{i=1}^{q} \alpha_i\, \varepsilon_{t-i}^2} \tag{13}$$</p>
<p>The coefficients $\omega, \alpha_i, \beta_j$ are fitted according to the maximum-likelihood
estimation procedure proposed in [4]. In the case of our proxies, we consider the
estimation of the volatility made by a GARCH($p = 1$, $q = 1$) model, as suggested
in [13].</p>
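<p>Viewed as a filter with already-fitted coefficients, the GARCH(1,1) recursion of Equation (13) maps a return series to a volatility series. The sketch below uses illustrative coefficient values (not fitted by maximum likelihood) and seeds the recursion with the unconditional variance, a common convention:</p>

```python
import math

def garch_volatility(returns, omega, alpha, beta):
    """GARCH(1,1) filter: var_t = omega + alpha * eps_{t-1}^2 + beta * var_{t-1}.
    Residuals eps_t are taken as demeaned returns; the recursion starts from the
    unconditional variance omega / (1 - alpha - beta)."""
    mu = sum(returns) / len(returns)
    eps = [r - mu for r in returns]
    var = omega / (1.0 - alpha - beta)  # unconditional variance as starting value
    sigmas = []
    for e in eps:
        sigmas.append(math.sqrt(var))
        var = omega + alpha * e ** 2 + beta * var
    return sigmas

rets = [0.01, -0.02, 0.015, -0.005, 0.03, -0.025, 0.002]
vols = garch_volatility(rets, omega=1e-5, alpha=0.1, beta=0.85)
```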
<p>3 Multiple-step-ahead volatility forecasting
The Nonlinear AutoRegressive (NAR) formulation of a univariate time series
as an input-output mapping allows the use of supervised machine learning
techniques for one-step-ahead time series forecasting [6]:</p>
<p>$$y = f(x) + w \tag{14}$$
$$y = [y_{t+1}] \tag{15}$$
$$x = [y_{t-d}, \ldots, y_{t-d-m+1}] \tag{16}$$</p>
<p>More precisely, this model assumes an autoregressive dependence of the
future value of the time series on the past $m$ values ($m$ is the lag or embedding
order), with a given delay $d$ (in the following we assume $d = 0$ for the sake of
simplicity) and an additional null-mean noise term $w$.
With this structure, the forecasting task can be reduced to a two-step process:
first, the mapping $f$ between the inputs $x$ and the outputs $y$ is learned through
a supervised learning task; then, such mapping is used to produce the
one-step-ahead forecast of the future values.</p>
<p>Extensions of this technique allow multiple-step-ahead forecasts
(i.e. $y = [y_{t+H}, \ldots, y_{t+1}]$). Such extensions can be summarised into two main
classes: single-output (Direct and Recursive strategies) and multiple-output
(MIMO) strategies. The former learn a multi-input single-output dependency,
while the latter learn a multi-input multiple-output dependency. We invite the
interested reader to see [24], [6], [25] for more details.</p>
<p>In what follows, we focus on two multi-step-ahead single-output learning
tasks employing the Direct strategy [5], [23], [8]. In the first one (NAR), we
focus on the multiple-step-ahead forecast of a primary volatility proxy $\sigma_t^{P}$ using
only its past values as input information, while in the second one (NARX), the
past values of an additional volatility proxy $\sigma_t^{X}$ are also incorporated in the
model described in (14):</p>
      <p>$$y_{NAR} = y_{NARX} = [\sigma_{t+H}^{P}, \ldots, \sigma_{t+1}^{P}] \tag{18}$$
$$x_{NAR} = [\sigma_{t-d}^{P}, \ldots, \sigma_{t-d-m+1}^{P}] \tag{19}$$
$$x_{NARX} = [\sigma_{t-d}^{P}, \ldots, \sigma_{t-d-m+1}^{P}, \sigma_{t-d}^{X}, \ldots, \sigma_{t-d-m+1}^{X}] \tag{20}$$</p>
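<p>The two input configurations can be sketched as a dataset-building step (a toy illustration with $d = 0$ and invented proxy values, not the authors' implementation):</p>

```python
def make_dataset(primary, m, h, secondary=None):
    """Build (x, y) pairs for the Direct strategy at horizon h.
    x holds the last m values of the primary proxy (NAR) and, if a secondary
    proxy is given, its last m values as well (NARX); y is the value h steps ahead."""
    X, Y = [], []
    for t in range(m - 1, len(primary) - h):
        x = primary[t - m + 1 : t + 1][::-1]          # sigma^P_t, ..., sigma^P_{t-m+1}
        if secondary is not None:
            x += secondary[t - m + 1 : t + 1][::-1]   # sigma^X_t, ..., sigma^X_{t-m+1}
        X.append(x)
        Y.append(primary[t + h])
    return X, Y

p = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
s = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]
x_nar, y_nar = make_dataset(p, m=3, h=3)
x_narx, _ = make_dataset(p, m=3, h=3, secondary=s)
```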
<p>We compare the two approaches for embedding orders $m \in \{2, 5\}$, several
forecasting horizons $h \in \{2, 5, 8, 10, 12\}$ and different estimators of the
dependency $f$. More precisely, as estimators of the dependency, we employ a naive
model, a GARCH(1,1) and three machine learning approaches: a feedforward
Artificial Neural Network, a k-Nearest Neighbours approach and Support Vector
Machine-based regression.</p>
    </sec>
    <sec id="sec-2">
      <title>Naive</title>
<p>The Naive method is employed mainly as a benchmark for comparison with the
other models, and simply consists in taking the last available historical value:</p>
      <p>$$\hat{\sigma}_{t+h}^{P} = \sigma_{t-1}^{P} \tag{17}$$</p>
<p>The GARCH model corresponds to the one described in Section 2, in
Equations (11) and (13), with $p = 1$ and $q = 1$.</p>
    </sec>
    <sec id="sec-3">
<title>Artificial Neural Networks</title>
<p>In machine learning and cognitive science, an artificial neural network (ANN) is a
network of interconnected processing elements, called neurons, which are used to
estimate or approximate functions that can depend on a large number of inputs
and that are generally unknown. For our task, we focus on a specific family
of artificial neural networks, the multi-layer perceptron (MLP), with a single
hidden layer. Equations (21) and (22) describe the structure of the model for a
single forecasting horizon $t + h$ in the context of the Direct strategy, respectively
for a NAR and a NARX model:</p>
      <p>$$\hat{\sigma}_{t+h}^{P} = f_{oh}\Bigg(b_o + \underbrace{\sum_{i=1}^{m} w_{io}\,\sigma_{t-i}^{P}}_{\text{Linear AR}(m)} + \underbrace{\sum_{j=1}^{H} w_{jo}\, f_h\left(\sum_{i=1}^{m} w_{ij}\,\sigma_{t-i}^{P} + b_j\right)}_{\text{Non-linear component}}\Bigg) \tag{21}$$</p>
      <p>$$\hat{\sigma}_{t+h}^{P} = f_{oh}\Bigg(b_o + \underbrace{\sum_{i=1}^{m}\left(w_{io}\,\sigma_{t-i}^{P} + w_{(i+m)o}\,\sigma_{t-i}^{X}\right)}_{\text{Linear ARX}(m)} + \underbrace{\sum_{j=1}^{H} w_{jo}\, f_h\left(\sum_{i=1}^{m}\left(w_{ij}\,\sigma_{t-i}^{P} + w_{(i+m)j}\,\sigma_{t-i}^{X}\right) + b_j\right)}_{\text{Non-linear component}}\Bigg) \tag{22}$$</p>
      <p>It should be noted that, as shown in Equation (21), the model can be
decomposed into a linear autoregressive component of order $m$ and a non-linear
component whose structure depends on the number of hidden nodes $H$ (selected
through k-fold cross-validation). When an external regressor is added, its
influence affects both the linear and the non-linear component, as shown in (22).
In both cases, the activation functions $f_{oh}(\cdot)$ and $f_h(\cdot)$ are logistic functions.
Finally, our implementation of the MLP models is based on the nnet package
for R [28].</p>
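<p>A numeric sketch of the single-hidden-layer forward pass of Equation (21), with random untrained weights rather than nnet's fitted ones (all names and values below are illustrative):</p>

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forecast(x, W_hidden, b_hidden, w_linear, w_out, b_out):
    """NAR-MLP of Equation (21): linear AR part plus H logistic hidden units,
    passed through a logistic output activation."""
    linear = sum(w * xi for w, xi in zip(w_linear, x))
    hidden = [logistic(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(W_hidden, b_hidden)]
    nonlin = sum(w * hj for w, hj in zip(w_out, hidden))
    return logistic(b_out + linear + nonlin)

random.seed(0)
m, H = 2, 3                 # embedding order and number of hidden nodes
x = [0.12, 0.08]            # last m values of the primary proxy
W_hidden = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(H)]
b_hidden = [random.uniform(-1, 1) for _ in range(H)]
w_linear = [random.uniform(-1, 1) for _ in range(m)]
w_out = [random.uniform(-1, 1) for _ in range(H)]
pred = mlp_forecast(x, W_hidden, b_hidden, w_linear, w_out, b_out=0.0)
```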
    </sec>
    <sec id="sec-4">
<title>k-Nearest Neighbours</title>
<p>The k-Nearest Neighbours (kNN) model is a local non-linear model used for
classification and regression. In the case of regression, the prediction for a given
input vector $x$ is obtained through local learning [3], a method that produces
predictions by fitting a simple local model in the neighbourhood of the point to
be predicted. The neighbourhood of a point is defined by taking the $k$ points
having the minimal values for a chosen distance metric defined on the space of
the input vectors [1].</p>
<p>In this case, every data point is represented in the form $(x, y)$, where $x$
represents the vector of input values and $y$ the corresponding output vector, as
described in Figure 1. Then the prediction for an unknown input vector $x$ is
computed as follows:</p>
      <p>$$\hat{y}(x) = \frac{1}{k} \sum_{i \in kNN(x)} y(x_i) \tag{23}$$</p>
      <p>where $y(x_i)$ is the output vector of the $i$th nearest neighbour of the input
vector $x$ in the dataset. The choice of the optimal number of neighbours $k$ is
performed through automatic leave-one-out selection, as described in [5]. Our
implementation of the kNN models is based on the R package gbcode [7].</p>
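<p>The prediction rule of Equation (23) in a few lines (Euclidean distance and uniform averaging; a sketch with made-up data, not the gbcode implementation):</p>

```python
import math

def knn_predict(X, Y, x_query, k):
    """Average the outputs of the k training points closest to x_query."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    ranked = sorted(range(len(X)), key=lambda i: dist(X[i], x_query))
    return sum(Y[i] for i in ranked[:k]) / k

X = [[0.0], [1.0], [2.0], [10.0]]
Y = [0.0, 1.0, 2.0, 10.0]
pred = knn_predict(X, Y, [1.2], k=3)  # averages the three nearest outputs
```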
    </sec>
    <sec id="sec-5">
      <title>Support Vector Regression</title>
<p>Support Vector Regression (SVR) is a regression methodology based on the Support
Vector Machine theoretical framework [9]. The key idea behind SVR is that the
regression model can be expressed using a subset of the input training examples,
called the support vectors. In more formal terms, the model (Equation (24)) is
a linear combination, over all the $n$ support vectors, of a bivariate kernel function
$k(\cdot,\cdot)$ taking as inputs the data point $x$ whose forecast is required and the $i$th
support vector $x_i$. The coefficients $\alpha_i, \alpha_i^*$ are determined through the
minimization of an empirical risk function (cf. [22]), solved as a continuous optimization
problem.</p>
<p>$$y = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^*\right) k(x, x_i) \tag{24}$$
$$k(x, x_i) = e^{-\frac{\|x - x_i\|^2}{2\sigma^2}} \tag{25}$$</p>
<p>Among the different available kernel functions, we employ the radial basis one
(Equation (25)), for which the optimal value of the kernel width parameter is determined
through grid search. Here, the SVM implementation of the R package e1071
[18] is used for the experiments. As for ANN and kNN, we test two
different dataset structures (cf. Figure 1), representing respectively the exclusion
(on the left) and the inclusion (on the right) of an external regressor.</p>
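<p>The prediction of Equations (24) and (25) reduces to a kernel-weighted sum over the support vectors. A sketch with made-up support vectors and coefficients (in practice e1071 fits the coefficients and grid search picks the kernel width):</p>

```python
import math

def rbf_kernel(x, xi, width):
    """Radial basis kernel of Equation (25)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq / (2.0 * width ** 2))

def svr_predict(x, support_vectors, coefs, width):
    """Equation (24): linear combination of kernel evaluations."""
    return sum(c * rbf_kernel(x, sv, width) for sv, c in zip(support_vectors, coefs))

svs = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4]]
coefs = [0.5, -0.2, 0.7]  # alpha_i - alpha_i^*, illustrative values
y_hat = svr_predict([0.15, 0.25], svs, coefs, width=0.5)
```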
<p>Direct NAR: a single model $f_h$ for each horizon $h$; the forecast at step $h$ is
made using the $h$th model. Direct NARX: a single model $f_h$ for each horizon $h$;
the forecast at step $h$ is made using the $h$th model.</p>
      <p>Fig. 1: Comparison of the dataset structure and model identification procedure
for NAR and NARX forecasting strategies. The primary proxy is denoted by
$\sigma_t^{P}$, while the secondary one is $\sigma_t^{X}$. The example datasets are shown for a
model order $m = 3$ and a forecasting horizon $h = 3$.</p>
<p>4 Experimental Results</p>
    </sec>
    <sec id="sec-6">
      <title>Dataset description</title>
<p>The proxies have been computed on the 40 time series of the French stock market
index CAC40, from 05-01-2009 to 22-10-2014 (approximately 6 years), for a total of
1489 OHLC (Opening, High, Low, Close) samples per time series. In
addition to the proxies, we also include the continuously compounded return
and the volume variable (representing the number of traded stocks in a given
trading day).</p>
    </sec>
    <sec id="sec-7">
      <title>Statistical analysis</title>
<p>(Table: pairwise statistical relationships among the returns $r_t$, the volume
and the volatility proxies $\sigma^{0}, \ldots, \sigma^{6}$, $\sigma^{SD,5}$, $\sigma^{SD,15}$,
$\sigma^{SD,21}$ and $\sigma^{G}$.)</p>
<sec id="sec-7-1">
        <title>Forecasting results</title>
        <p>(Table: normalized MASE of the Average, GARCH(1,1), ANN-Dir, kNN-Dir and
SVR-Dir models for each regressor combination, where the additional regressor
(Volume, $\sigma^{SD,5}$, $\sigma^{SD,15}$ or $\sigma^{SD,21}$) either belongs (=) or does
not belong (≠) to the same proxy family as the target.)</p>
      </sec>
      <sec id="sec-7-2">
        <title>Horizon - H</title>
<p>For each forecasting horizon and model order, we performed a number of training and test
tasks by following a rolling-origin strategy [26]. The size of the training set is
$\frac{2N}{3}$ and the procedure is repeated for 50 testing sets of length $H$. The regressor
combinations have been selected in order to test whether belonging (=) or
not (≠) to the same proxy family impacts the forecasting performance. The
employed error measure is the Mean Absolute Scaled Error [15], normalized at
each forecasting horizon by the MASE of the Naive method:</p>
<p>$$\mathrm{MASE} = \frac{\frac{1}{T}\sum_{t=1}^{T}\left|\sigma_t^{P} - \hat{\sigma}_t^{P}\right|}{\frac{1}{T-1}\sum_{t=2}^{T}\left|\sigma_t^{P} - \sigma_{t-1}^{P}\right|} \tag{26}$$</p>
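<p>Equation (26) can be computed as follows (a sketch with invented proxy values; the denominator is the mean absolute one-step difference of the naive forecast):</p>

```python
def mase(actual, predicted):
    """Mean Absolute Scaled Error: forecast MAE scaled by the MAE of the
    naive previous-value forecast computed on the same series."""
    T = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / T
    naive = sum(abs(actual[t] - actual[t - 1]) for t in range(1, T)) / (T - 1)
    return mae / naive

actual = [0.10, 0.12, 0.11, 0.15, 0.14]
predicted = [0.11, 0.11, 0.12, 0.14, 0.15]
score = mase(actual, predicted)  # < 1 means better than the naive forecast
```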
<p>We include in our analysis a GARCH(1,1) method [13] as a baseline reference.
While employing an additional regressor, model orders higher than 2
have not been tested, due to the excessive computational time required by the
corresponding technique for the given task or due to numerical convergence
problems. A first observation from the table is that all the ML methods, both
in the single-input and the multiple-input configuration, are able to outperform
the reference GARCH method. Moreover, both the increase of the model
order $m$ and the introduction of an additional regressor improve the
methods' performances. However, only the addition of an external regressor, for
horizons greater than 8 steps ahead, is shown to bring a statistically significant
improvement (paired t-test, p=0.05). Even though no model appears to clearly
outperform all the others on every horizon, we can observe that the SVR model
family is generally able to produce smaller forecast errors than those based on
ANN and kNN.</p>
<p>5 Conclusion and future work
After having shown the benefits of including an additional proxy in our models,
our main aim is to investigate how the forecasting quality of volatility could be
improved, mainly by tuning three parameters in our methods: the choice of the
additional proxy, the employed machine learning technique and the size of the
training window. To further advance our research, we also plan to study
how the current approach could be generalized in order to include an arbitrary
number of volatility proxies.</p>
<p>Acknowledgments. Jacopo De Stefani acknowledges the support of the
ULB-WORLDLINE agreement. Gianluca Bontempi acknowledges the funding of the
Brufence project (Scalable machine learning for automating defense system),
supported by INNOVIRIS (Brussels Institute for the encouragement of scientific
research and innovation).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Altman</surname>
            ,
            <given-names>N.S.:</given-names>
          </string-name>
          <article-title>An introduction to kernel and nearest-neighbor nonparametric regression</article-title>
          .
          <source>The American Statistician</source>
          <volume>46</volume>
          (
          <issue>3</issue>
          ),
          <volume>175</volume>
-
          <fpage>185</fpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Andersen</surname>
            ,
            <given-names>T.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bollerslev</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Arch and garch models</article-title>
          .
          <source>Encyclopedia of Statistical Sciences</source>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Atkeson</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>A.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Locally weighted learning for control</article-title>
          .
          <source>In: Lazy learning</source>
          , pp.
          <volume>75</volume>
-
          <fpage>113</fpage>
          . Springer (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bollerslev</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Generalized autoregressive conditional heteroskedasticity</article-title>
          .
          <source>Journal of econometrics 31(3)</source>
          ,
          <volume>307</volume>
-
          <fpage>327</fpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bontempi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taieb</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          :
          <article-title>Conditionally dependent strategies for multiple-stepahead prediction in local learning</article-title>
          .
          <source>International journal of forecasting 27(3)</source>
          ,
          <volume>689</volume>
-
          <fpage>699</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bontempi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taieb</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Borgne</surname>
            ,
            <given-names>Y.A.:</given-names>
          </string-name>
          <article-title>Machine learning strategies for time series forecasting</article-title>
          .
          <source>In: Business Intelligence</source>
          , pp.
          <volume>62</volume>
-
          <fpage>77</fpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Bontempi</surname>
          </string-name>
          , Gianluca:
          <article-title>Code from the handbook "statistical foundations of machine learning"</article-title>
          , https://github.com/gbonte/gbcode
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Cheng, H.,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scripps</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Multistep-ahead time series prediction</article-title>
          .
          <source>In: Pacific-Asia Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <fpage>765</fpage>
          –
          <lpage>774</lpage>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine Learning</source>
          <volume>20</volume>
          (
          <issue>3</issue>
          ),
          <fpage>273</fpage>
          –
          <lpage>297</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Dash</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dash</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>An evolutionary hybrid fuzzy computationally efficient EGARCH model for volatility prediction</article-title>
          .
          <source>Applied Soft Computing</source>
          <volume>45</volume>
          ,
          <fpage>40</fpage>
          –
          <lpage>60</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Field</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          :
          <article-title>Meta-analysis of correlation coefficients: a Monte Carlo comparison of fixed- and random-effects methods</article-title>
          .
          <source>Psychological Methods</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>161</fpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Garman</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klass</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>On the estimation of security price volatilities from historical data</article-title>
          .
          <source>Journal of Business</source>
          pp.
          <fpage>67</fpage>
          –
          <lpage>78</lpage>
          (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>P.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lunde</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A forecast comparison of volatility models: does anything beat a GARCH(1,1)?</article-title>
          <source>Journal of Applied Econometrics</source>
          <volume>20</volume>
          (
          <issue>7</issue>
          ),
          <fpage>873</fpage>
          –
          <lpage>889</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Hentschel</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>All in the family: nesting symmetric and asymmetric GARCH models</article-title>
          .
          <source>Journal of Financial Economics</source>
          <volume>39</volume>
          (
          <issue>1</issue>
          ),
          <fpage>71</fpage>
          –
          <lpage>104</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Hyndman</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koehler</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          :
          <article-title>Another look at measures of forecast accuracy</article-title>
          .
          <source>International Journal of Forecasting</source>
          <volume>22</volume>
          (
          <issue>4</issue>
          ),
          <fpage>679</fpage>
          –
          <lpage>688</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Kristjanpoller</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fadic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minutolo</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Volatility forecast using hybrid neural network models</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>41</volume>
          (
          <issue>5</issue>
          ),
          <fpage>2437</fpage>
          –
          <lpage>2442</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Martens</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Measuring and forecasting S&amp;P 500 index-futures volatility using high-frequency data</article-title>
          .
          <source>Journal of Futures Markets</source>
          <volume>22</volume>
          (
          <issue>6</issue>
          ),
          <fpage>497</fpage>
          –
          <lpage>518</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Meyer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimitriadou</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hornik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weingessel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leisch</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          :
          <article-title>e1071: Misc functions of the Department of Statistics, Probability Theory Group (formerly: E1071), TU Wien</article-title>
          , https://cran.r-project.org/web/packages/e1071/
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Monfared</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Enke</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Volatility forecasting using a hybrid GJR-GARCH neural network model</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>36</volume>
          ,
          <fpage>246</fpage>
          –
          <lpage>253</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Parkinson</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The extreme value method for estimating the variance of the rate of return</article-title>
          .
          <source>Journal of Business</source>
          pp.
          <fpage>61</fpage>
          –
          <lpage>65</lpage>
          (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Poon</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Granger</surname>
            ,
            <given-names>C.W.</given-names>
          </string-name>
          :
          <article-title>Forecasting volatility in financial markets: A review</article-title>
          .
          <source>Journal of Economic Literature</source>
          <volume>41</volume>
          (
          <issue>2</issue>
          ),
          <fpage>478</fpage>
          –
          <lpage>539</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Sapankevych</surname>
            ,
            <given-names>N.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sankar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Time series prediction using support vector machines: a survey</article-title>
          .
          <source>IEEE Computational Intelligence Magazine</source>
          <volume>4</volume>
          (
          <issue>2</issue>
          ) (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Sorjamaa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reyhani</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lendasse</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Methodology for long-term prediction of time series</article-title>
          .
          <source>Neurocomputing</source>
          <volume>70</volume>
          (
          <issue>16</issue>
          ),
          <fpage>2861</fpage>
          –
          <lpage>2869</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Taieb</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontempi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atiya</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sorjamaa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>39</volume>
          (
          <issue>8</issue>
          ),
          <fpage>7067</fpage>
          –
          <lpage>7083</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Taieb</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sorjamaa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontempi</surname>
            ,
            <given-names>G.:</given-names>
          </string-name>
          <article-title>Multiple-output modeling for multi-step-ahead time series forecasting</article-title>
          .
          <source>Neurocomputing</source>
          <volume>73</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1950</fpage>
          –
          <lpage>1957</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Tashman</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          :
          <article-title>Out-of-sample tests of forecasting accuracy: an analysis and review</article-title>
          .
          <source>International Journal of Forecasting</source>
          <volume>16</volume>
          (
          <issue>4</issue>
          ),
          <fpage>437</fpage>
          –
          <lpage>450</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Tsay</surname>
            ,
            <given-names>R.S.:</given-names>
          </string-name>
          <article-title>Analysis of Financial Time Series</article-title>
          , vol.
          <volume>543</volume>
          . John Wiley &amp; Sons (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Venables</surname>
            ,
            <given-names>W.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ripley</surname>
            ,
            <given-names>B.D.</given-names>
          </string-name>
          : Modern Applied Statistics with S. Springer, New York, fourth edn. (
          <year>2002</year>
          ), http://www.stats.ox.ac.uk/pub/MASS4, ISBN 0-387-95457-0
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Ward Jr.</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Hierarchical grouping to optimize an objective function</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>58</volume>
          (
          <issue>301</issue>
          ),
          <fpage>236</fpage>
          –
          <lpage>244</lpage>
          (
          <year>1963</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nichols</surname>
            ,
            <given-names>E.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Deep learning stock volatility with Google domestic trends</article-title>
          .
          <source>arXiv preprint arXiv:1512.04916</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>