=Paper=
{{Paper
|id=Vol-1218/bmaw2014_paper_2
|storemode=property
|title=Multi-Process Models -- An Application for the Construction of Financial Factor Models
|pdfUrl=https://ceur-ws.org/Vol-1218/bmaw2014_paper_2.pdf
|volume=Vol-1218
|dblpUrl=https://dblp.org/rec/conf/uai/KeaneC14
}}
==Multi-Process Models -- An Application for the Construction of Financial Factor Models==
Kevin R. Keane and Jason J. Corso
Computer Science and Engineering
University at Buffalo, The State University of New York
Buffalo, NY 14260

Abstract

We present an unsupervised, comprehensive methodology for the construction of financial risk models. We offer qualitative comments on incremental functionality and quantitative measures of superior performance of component and mixture dynamic linear models relative to alternative models. We apply our methodology to a high dimensional stream of daily closing prices for approximately 7,000 US traded stocks, ADRs, and ETFs for the most recent 10 years. Our methodology automatically extracts an evolving set of explanatory time series from the data stream; maintains and updates parameter distributions for component dynamic linear models as the explanatory time series evolve; and, ultimately specifies time-varying asset specific mixture models. Our methodology utilizes a hierarchical Bayesian approach for the specification of component model parameter distributions and for the specification of the mixing weights in the final model. Our approach is insensitive to the exact number of factors, and "effectively" sparse, as irrelevant factors (time series of pure noise) yield posterior parameter distributions with high density around zero. The statistical models obtained serve a variety of purposes, including: outlier detection; portfolio construction; and risk forecasting.

1 INTRODUCTION

We propose a time varying Bayesian statistical model for individual stock returns, depicted in Figure 1. Our goal is to accurately model the high dimensional probability distribution underlying stock returns. We demonstrate success by constructing better performing investment portfolios, where the performance goal is to minimize the standard deviation of returns. Investment professionals seek to minimize the variation in investment outcomes because doing so results in higher economic utility for their clients. To illustrate this point, consider which of two pension scenarios with equal expected value is preferred: one that pays 80% salary with probability 1; or, the other that pays 120% salary with probability 1/2 (if things "go well") and pays 40% salary with probability 1/2 (if things "go poorly"). The economic concept of declining marginal utility, expressed mathematically with concave utility functions U(E(x)) > E(U(x)), implies that given investment scenarios with equal expected return, individuals prefer the scenario with lowest variation. Portfolio managers are concerned with generating acceptable returns with minimal risk; and, risk managers monitor the portfolio managers, verifying financial risks remain within authorized limits. Risk models, defining the n × n asset covariance matrix Σ and precision matrix Σ^{-1}, are used by portfolio managers in conjunction with expected return vectors α to construct optimal portfolio weights ∝ Σ^{-1} α; and, are used by risk managers given portfolio weights w to forecast portfolio variance w^T Σ w.

We describe the construction of a set of Bayesian switching state-space models and qualitatively analyze the on-line behavior of the various component models and mixture models. We focus our discussion on the qualitative aspects, then conclude by providing quantitative measures of our model's superior performance relative to competing models. We look at the tradeoff of adaption rate and stability of parameter estimates, evaluating model responsiveness with both synthetic and real data series. We show that dynamic linear models are robust [1] with respect to pure noise explanatory variables, appropriately generating parameter distributions very concentrated around zero. We comment on the behavior of component models relative to the mixture consensus. We illustrate component and mixture response to outlier observations. In periods with ambiguous posterior model probabilities, we describe the diffusive impact to the mixture distribution; and, we note surprisingly distinct behavior when an outlier component is selected with near certainty. We find unexpectedly similar updating behavior across a range of component models, bringing into question the necessity of more than two components. We inspect the impact of implementing intervention in the estimation process by greatly inflating the variance of a stock's posterior distribution subsequent to a merger event. The intervention is shown to result in extremely rapid convergence to new dynamics, while the same model without intervention is shown to maintain bias for an unacceptably long period. Lastly, we compare our two component mixture model against several alternatives. Analyzing results in Table 1, we show the positive impact of regularization provided by the Bayesian framework relative to PCA, and further improvement of the mixture model as compared to the single process model.

[1] We use the term robust to describe a desirable trait where an estimation method is stable in the presence of outlier observations and irrelevant explanatory variables (noise).

Figure 1: Our final two process Bayesian switching state-space model for log price returns Y_{n,t} = log(p_{n,t}/p_{n,t-1}) for asset n in time period t. We specify the prior probabilities π_α of the switch variable α_{n,t} that controls the observation variance: V_{n,t} = 1 if α_{n,t} = {regular model} and V_{n,t} = 100 if α_{n,t} = {outlier model}. In our final model, we specify the evolution variance W. Other variables are obtained or inferred from the data in an unsupervised manner. The return of an individual asset is modeled as a time varying regression, with regression coefficient vector θ_{n,t}, common factor explanatory vector F_t, and noise: Y_{n,t} = θ_{n,t}^T F_t + σ_{n,t} v_{n,t}, v_{n,t} ~ N(0, V_{n,t}). The regression coefficients are a hidden state vector, evolving as a Markov chain: θ_{n,t} = G_t θ_{n,t-1} + σ_{n,t} w_{n,t}, w_{n,t} ~ N(0, W). G_t captures rotation and scaling of the explanatory vector F_t over time, permitting computation of prior distributions for θ_{n,t} from posterior distributions for θ_{n,t-1}. σ_{n,t} is an asset and time specific scale factor applied to both the observation noise v_{n,t} and the state evolution noise w_{n,t}.

2 BACKGROUND

2.1 RISK MODELS

High dimensional statistical models, central to modern portfolio management, present significant technical challenge in their construction. There is extensive literature on the topic of constructing factor models, or risk models as they are interchangeably known by practitioners. Consider a matrix of observations

  X = [ r_{1,1} ... r_{1,t}
        ...
        r_{n,1} ... r_{n,t} ],   (1)

representing log price returns

  r_{i,j} = log(p_{i,j} / p_{i,j-1}),   (2)

for assets i ∈ 1...n, trading days j ∈ 1...t, and end of day prices p_{i,j}. The general approach to financial risk modeling specifies the covariance of asset returns as a structured matrix. A factor model with p explanatory time series is specified for the n × t matrix X with an n × p matrix of common factor loadings L, a p × t matrix of common factor returns time series F, and an n × t matrix of residual error time series ε:

  X = L F + ε.   (3)

An orthogonal factor model (Johnson and Wichern, 1998, Ch. 9) implies diagonal covariance matrices Cov(F) = I and Cov(ε) = Ψ. In an orthogonal factor model, the covariance of the observation matrix X is simply:

  Cov(X) = L L^T + Ψ.   (4)

By construction, risk models based on principal component and singular value decomposition (Wall et al., 2003), including ours, possess this simple structure of orthogonal common factors and residual errors.

2.2 COMMON FACTORS

In pursuing an unsupervised methodology, we utilize an SVD based approach to identifying characteristic time series. SVD identifies common variation, both row-wise and column-wise, in a matrix of data (Wall et al., 2003). SVD decomposes a rectangular matrix, such as the returns X, into two orthogonal matrices U and V, and a diagonal matrix D:

  X = U D V^T.   (5)

Given the format of the n × t returns matrix X in Equation 3, the mutually orthogonal, unit-length right singular vectors comprising matrix V, representing common variation across the columns (t trading days), are the characteristic time series; the mutually orthogonal, unit-length left singular vectors comprising matrix U, representing common variation across rows (n assets), are the characteristic portfolios; and, the diagonal singular values matrix D captures scale. The entries of D are ordered by magnitude, therefore a p-factor model would use the first p rows of V^T for the factor return time series.

We exploit the fact that SVD methodically extracts common variation, ordered by magnitude. The intuition is that "pervasive" sources of return important to modeling portfolio level return characteristics will be reliably captured in the first several factors. Points of concern for some practitioners with regards to the use of SVD or PCA include:

1. the factors are not readily identifiable (to the human analyst);
2. the factors can be permuted in order and sign for data samples adjacent in time; and,
3. it is not clear how many factors to "keep".

With respect to the first concern, (human) identifiability, we reiterate that our goal is an unsupervised process yielding a high dimensional statistical model with adequate explanatory power, as opposed to semantically meaningful groupings. Examination of assets with significant weight in the characteristic portfolios typically yields meaningful portfolio themes (Johnson and Wichern, 1998, Ex. 8.5). With respect to the second concern, the rotation and scaling of factors in different sample periods, our application incorporates the method of (Keane and Corso, 2012) to identify these rotations and maintain parameter distributions in the presence of rotation and scaling. Although much discussion surrounds the third concern, the identification of the "correct" number of factors (Roll and Ross, 1980; Trzcinka, 1986; Connor and Korajczyk, 1993; Onatski, 2010), we find the regularization provided by Bayesian dynamic linear models results in regression coefficients densely centered around zero for factors that are pure noise. The functioning of our process in this regard is analogous to regularized least squares (RLS) (Bishop, 2006, Ch. 3) and the ability of RLS to successfully estimate regression coefficients when confronted with a large number of candidate explanatory variables.

2.3 DYNAMIC LINEAR MODELS

The Bayesian dynamic linear model (DLM) framework elegantly addresses our need to process streams of data. DLMs are state space models very similar to Kalman filters (Kalman, 1960) and linear dynamical systems (Bishop, 2006, Ch. 13). We summarize the matrix variate notation and results from (West and Harrison, 1997, Ch. 16.4). Define t the time index, p the number of common factors, and n the number of assets. The observations Y_t are generated by matrix variate dynamic linear models characterized by four time varying parameters, F_t, G_t, V_t, W_t, that define the observation and state evolution equations. We now define and comment on the DLM parameters as they pertain to our application:

• Y_t = [Y_{t,1}, ..., Y_{t,n}]^T, log price returns at time t, common to all component DLMs;
• F_t, a p × 1 dynamic regression vector, factor returns at time t, common to all component DLMs;
• G_t, a p × p state evolution matrix, accounting for rotation and scaling of factor return time series at time t, common to all component DLMs;
• V_t, an observational variance scalar, individually specified for each component DLM, greatly inflated in the DLMs generating "outliers" at time t;
• W_t, an evolution variance matrix, individually specified for each component DLM, controlling the rate of change in factor loadings at time t;
• Σ_t = diag(σ_{t,1}^2, ..., σ_{t,n}^2), an unknown diagonal matrix composed of n asset specific variance scales at time t;
• ν_t, an n × 1 vector of unknown observation errors at time t;
• Θ_t = [θ_{t,1}, ..., θ_{t,n}], a p × n unknown state matrix whose columns are factor loadings of individual assets on common factor returns at time t; and,
• Ω_t = [ω_{t,1}, ..., ω_{t,n}], a p × n unknown evolution errors matrix applied to the state matrix at time t.

We specify our model variances in scale free form (West and Harrison, 1997, Ch. 4.5), implying multiplication by asset specific scales σ_{t,i}^2 in univariate DLMs: C_t = σ_{t,i}^2 C_t*, V_t = σ_{t,i}^2 V_t*, and W_t = σ_{t,i}^2 W_t*. When using matrix variate notation, Σ_t is a right variance parameter, discussed below, scaling the n columns of the matrix on which it operates. When we specify models in § 3.5, we will specify scale free parameters V_t* ∈ {1, 100} and W_t* ∈ {.00001, .001, .1}. For simplicity, we shall omit scale free notation.

The observation equation is:

  Y_t = Θ_t^T F_t + ν_t,   ν_t ~ N[0, V_t Σ_t].   (6)

The distribution of the observation errors ν_t is multivariate normal, with mean vector 0 and unknown variance V_t Σ_t. The state evolution is:

  Θ_t = G_t Θ_{t-1} + Ω_t,   Ω_t ~ N[0, W_t, Σ_t].   (7)

The distribution of the state matrix evolution errors Ω_t is matrix normal (Dawid, 1981), with mean matrix 0, left variance matrix W_t controlling variation across rows (factors) of Θ_t, and right variance matrix Σ_t controlling variation across columns (assets) of Θ_t. As our implementation uses diagonal matrices for both W_t and Σ_t, we implicitly assume independence in the evolution of factor loadings across the factors i ∈ 1...p and across the assets j ∈ 1...n.

As typical with a Bayesian approach, our process begins with a prior distribution reflecting our belief about the unknown state parameter matrix Θ_0 and the unknown variance scale matrix Σ_0 before data arrives. Our initial belief is expressed as a matrix normal / inverse Wishart distribution:

  (Θ_0, Σ_0) ~ NW_{δ_0}^{-1} [m_0, C_0, S_0],   (8)

with mean matrix m_0, left variance matrix C_0, right variance matrix S_0, and degrees of freedom δ_0. We allow our estimate of observational variance to vary over time by decaying our sample variance degrees of freedom parameter δ_{t-1} immediately before computation of the prior distribution Equation 10. The marginal distribution for the state matrix Θ_0 is a matrix T distribution:

  Θ_0 ~ T_{δ_0} [m_0, C_0, S_0].   (9)

Let D_t = {Y_t, ..., Y_0} refer to the information available subsequent to observing Y_t. The conjugate parameter distributions are updated as follows.

Prior distribution at t:

  (Θ_t, Σ_t | D_{t-1}) ~ NW_{δ_{t-1}}^{-1} [a_t, R_t, S_{t-1}],   (10)

where a_t = G_t m_{t-1} and R_t = G_t C_{t-1} G_t^T + W_t.

Forecast distribution at t given the dynamic regression vector F_t:

  (Y_t | D_{t-1}) ~ T_{δ_{t-1}} [f_t, Q_t S_{t-1}],   (11)

where f_t = a_t^T F_t and Q_t = V_t + F_t^T R_t F_t.

In our application, F_t is not available until Y_t is observed. Therefore, we accommodate random regression vectors F_t (Wang et al., 2011, § 7). Define μ_{F_t} = E(F_t | D_{t-1}) and Σ_{F_t} = Cov(F_t | D_{t-1}). The forecast distribution with F_t unknown is (Y_t | D_{t-1}) ~ T_{δ_{t-1}} [f̂_t, Q̂_t S_{t-1}], where the moment parameters of the multivariate T forecast distribution are now f̂_t = a_t^T μ_{F_t} and Q̂_t S_{t-1} = [V_t + μ_{F_t}^T R_t μ_{F_t} + tr(R_t Σ_{F_t})] S_{t-1} + a_t^T Σ_{F_t} a_t.

Posterior distribution at t:

  (Θ_t, Σ_t | D_t) ~ NW_{δ_t}^{-1} [m_t, C_t, S_t],   (12)

with m_t = a_t + A_t e_t^T, C_t = R_t - A_t A_t^T Q_t, δ_t = δ_{t-1} + 1, and S_t = δ_t^{-1} [δ_{t-1} S_{t-1} + e_t e_t^T / Q_t], where A_t = R_t F_t / Q_t and e_t = Y_t - f_t.

Mapping the factor model notation of (Johnson and Wichern, 1998) in Equation 3 to the DLM notation of (West and Harrison, 1997) in Equation 6: X → [Y_1 ... Y_t]; L → Θ_t^T; F → [F_1 ... F_t]; and, ε → [ν_1 ... ν_t]. The crucial change in perspective involves the regression coefficients, L → Θ_t^T. Whereas the other matrices in Equation 3 are simply collections of columns present in Equation 6, the static regression coefficients L now evolve with time in Equation 7 as Θ_t^T.
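The conjugate updating cycle of Equations 10 - 12 can be sketched for a single asset as follows. This is a minimal sketch, not the paper's implementation: it assumes a known variance scale (omitting the S_t and δ_t recursions), and the function and variable names are ours. The two-factor example echoes § 2.2: the posterior mean loading on a pure noise factor concentrates near zero.

```python
import numpy as np

def dlm_update(m, C, y, F, G, V, W):
    """One prior -> forecast -> posterior cycle (Equations 10-12), known scale."""
    a = G @ m                      # prior mean, Equation 10
    R = G @ C @ G.T + W            # prior variance, Equation 10
    f = a @ F                      # one step forecast mean, Equation 11
    Q = V + F @ R @ F              # forecast variance, Equation 11
    A = R @ F / Q                  # adaptive gain
    e = y - f                      # forecast error
    return a + A * e, R - np.outer(A, A) * Q   # posterior m_t, C_t, Equation 12

# Synthetic single-asset stream: loading 0.5 on factor 1, zero on a noise factor.
rng = np.random.default_rng(0)
p = 2
G, W = np.eye(p), 1e-5 * np.eye(p)
m, C = np.zeros(p), np.eye(p)
for _ in range(300):
    F = rng.standard_normal(p)
    y = 0.5 * F[0] + 0.01 * rng.standard_normal()
    m, C = dlm_update(m, C, y, F, G, V=1.0, W=W)
# m[0] approaches 0.5 while m[1], the loading on pure noise, stays near zero
```

The small evolution variance W plays the role of the base rate of evolution used in § 3.5; inflating it trades bias for variance exactly as discussed in § 3.2.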
2.4 UNIVARIATE DLMS

In § 2.3, we summarized results for matrix variate DLMs. Setting the number of assets n = 1, results for univariate DLMs immediately follow.

2.5 MULTI-PROCESS MODELS

(West and Harrison, 1997, Ch. 12) define multi-process models composed of component DLMs. Consider a set of DLMs A = {A_1, ..., A_k}. Let α_t reference the component DLM realized at time t, A_{α_t} ∈ A. If the observations Y_t are generated with one unknown DLM α_t = α for all time, the observations are said to follow a multi-process, class I model. If at different times s ≠ t the observations are generated by distinct DLMs α_s ≠ α_t, the observations are said to follow a multi-process, class II model. In an unsupervised modeling process, we need to accommodate the arrival of both typical and outlier observations. We accomplish this with a multi-process class II model, where various component DLMs are appropriate for various subsets of the observations. We assume fixed model selection probabilities, π_t(A_α) = π(A_α).

With class II models, there are |A|^t potential model histories for each asset. We avoid this explosion in model sequences by considering only two periods, t-1 and t, thereby limiting distinct model sequences in our mixtures to |A|^2. As each asset has its own history, no longer sharing a common scale free posterior variance C_t, our mixture models are asset specific, forcing the use of univariate component DLMs. Parameters 1 × 1 in univariate DLMs are now displayed with scalar notation. To avoid clutter, we omit implicit asset subscripts.

Inference with multi-process models is based upon manipulation of various model probabilities: the posterior model probabilities for the last model p_{t-1}(α_{t-1}); the prior model probabilities for the current model π(α_t); and, the model sequence likelihoods p(Y_t | α_t, α_{t-1}, D_{t-1}). Posterior model sequence probabilities for current model α_t and last model α_{t-1} upon observing Y_t are:

  p_t(α_t, α_{t-1}) = Pr[α_t, α_{t-1} | D_t] ∝ p_{t-1}(α_{t-1}) π(α_t) p(Y_t | α_t, α_{t-1}, D_{t-1}).   (13)

The unconditional posterior parameter distributions are computed as mixtures of the |A|^2 component DLM sequences:

  p(θ_t | D_t) = Σ_{α_t=1}^{k} Σ_{α_{t-1}=1}^{k} p_t(θ_t | α_t, α_{t-1}, D_t) p_t(α_t, α_{t-1}).   (14)

The posterior model probabilities are:

  p_t(α_t) = Pr[α_t | D_t] = Σ_{α_{t-1}=1}^{k} p_t(α_t, α_{t-1}).   (15)

The posterior probabilities for the last model α_{t-1} given the current model α_t and information D_t are:

  Pr[α_{t-1} | α_t, D_t] = p_t(α_t, α_{t-1}) / p_t(α_t).   (16)

After each time step, the posterior mixture distribution for each component DLM is approximated with an analytic distribution using the methodology described in (West and Harrison, 1997, Ch. 12.3.4). The Kullback-Leibler directed divergence between the approximation and the mixture is minimized in the parameters m_t(α_t), C_t(α_t), S_t(α_t), and δ_t(α_t). Let S_t(α_t, α_{t-1}) refer to the variance scale estimate obtained with the DLM sequence α_{t-1}, α_t. The parameters of the approximating distributions are as follows. The variance scale estimates S_t(α_t) are:

  S_t(α_t)^{-1} = Σ_{α_{t-1}=1}^{k} p_t(α_t, α_{t-1}) / [ p_t(α_t) S_t(α_t, α_{t-1}) ].   (17)

The weights for computing the moments of the KL divergence minimizing approximation to the posterior distribution are:

  p_t*(α_{t-1}) = [ S_t(α_t) p_t(α_t, α_{t-1}) ] / [ p_t(α_t) S_t(α_t, α_{t-1}) ].   (18)

The mean vector m_t(α_t) for DLM α_t is:

  m_t(α_t) = Σ_{α_{t-1}=1}^{k} p_t*(α_{t-1}) m_t(α_t, α_{t-1}).   (19)

The variance matrix C_t(α_t) for DLM α_t is:

  C_t(α_t) = Σ_{α_{t-1}=1}^{k} p_t*(α_{t-1}) { C_t(α_t, α_{t-1}) + [m_t(α_t) - m_t(α_t, α_{t-1})] [m_t(α_t) - m_t(α_t, α_{t-1})]^T }.   (20)

The degrees of freedom parameter δ_t(α_t) is important to the KL minimization. Intuitively, if the component DLMs are in discord, the resulting mixture may be described as "fat-tailed", and the precision of the unknown variance scale parameter is reduced. We compute δ_t(α_t) using the procedure described by (West and Harrison, 1997, Ex. 12.7), with a further correction term. The approximation West and Harrison utilize, based upon an algorithm for computing the digamma function ψ(x) discussed in (Bernardo, 1976), is appropriate when x → ∞. However, when estimating the reciprocal of the KL minimizing δ_t(α_t), we find the error in the approximation remains rather constant, and we apply a correction to eliminate this constant.
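For a scalar state, the class II updating and collapsing steps can be sketched as follows. This is a simplified sketch rather than the paper's implementation: it assumes a known variance scale, so the likelihood in Equation 13 is normal rather than Student T, the collapsing weights omit the S_t correction of Equations 17 - 18, and G_t = 1; the names are ours.

```python
import numpy as np

def normal_pdf(e, Q):
    return np.exp(-0.5 * e * e / Q) / np.sqrt(2.0 * np.pi * Q)

def multiprocess_step(components, y, F, pi, V, W):
    """One class II step: joint probabilities (13), marginals (15), collapse (19)-(20).
    components: per-DLM posteriors [{"m", "C", "p"}, ...] carried from t-1."""
    k = len(components)
    post = np.zeros((k, k))                  # p_t(a_t, a_{t-1}), Equation 13
    mm = np.zeros((k, k))
    CC = np.zeros((k, k))
    for j, prev in enumerate(components):            # a_{t-1}
        R = prev["C"] + W                            # prior variance, G_t = 1
        e = y - prev["m"] * F                        # forecast error
        for i in range(k):                           # a_t
            Q = V[i] + F * R * F
            A = R * F / Q
            mm[i, j] = prev["m"] + A * e
            CC[i, j] = R - A * A * Q
            post[i, j] = prev["p"] * pi[i] * normal_pdf(e, Q)
    post /= post.sum()
    new = []
    for i in range(k):
        p_i = post[i].sum()                          # Equation 15
        w = post[i] / p_i if p_i > 0 else np.full(k, 1.0 / k)
        m_i = float((w * mm[i]).sum())               # Equation 19
        C_i = float((w * (CC[i] + (m_i - mm[i]) ** 2)).sum())  # Equation 20
        new.append({"m": m_i, "C": C_i, "p": float(p_i)})
    return new

# Two components sharing one state: V = 1 "regular" and V = 100 "outlier".
components = [{"m": 0.0, "C": 1.0, "p": 0.9},
              {"m": 0.0, "C": 1.0, "p": 0.1}]
pi, V = np.array([0.9, 0.1]), np.array([1.0, 100.0])
for y in (0.1, -0.2, 30.0):          # the final observation is an outlier
    components = multiprocess_step(components, y, 1.0, pi, V, 1e-5)
```

After the outlier arrives, essentially all posterior probability shifts to the V = 100 component, while that component's collapsed mean barely moves, mirroring the outlier-ignoring behavior discussed later in § 4.3.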
(a) Synthetic scenario 1: abrupt increase in factor loading θ_t; units are percent per annum. Observations Y_t are synthesized returns using SPY (S&P 500 ETF) until March 30, 2012, and 2 × SPY thereafter. Note the "true" factor loading doubles from approximately 10 to 20. (b) Synthetic scenario 2: abrupt decrease in factor loading θ_t; units are percent per annum. Observations Y_t are synthesized returns using SPY until March 30, 2012, and AGG (Barclays Aggregate Bond ETF) thereafter. Note the "true" factor loading drops from approximately 10 to 0.

Figure 2: Unmonitored mixture models responding to abrupt change. Black line is the true value of the latent state variable θ_{1,t}. Other lines are the posterior mean m_{1,t} (first common factor loading) obtained with three different mixture models discussed in § 3.5. None of the three models responded quickly enough to the dramatic change in dynamics. A method of intervention to improve responsiveness is discussed in § 4.5.

3 APPLICATION DESIGN

3.1 END USERS

Our application enables a proprietary trading group at a financial services firm to better understand the aggregate behavior of stock portfolios. The models provide a statistical framework required for constructing portfolios, assessing portfolio risk, and clustering assets. The assets of primary interest are the common shares of the largest 1000 - 2000 companies in the US stock market. The group focuses on larger companies because the liquidity of larger stocks generally makes them cheaper and easier to transact. The group's strategies include short selling: borrowing stock, selling the borrowed shares, and attempting to profit by repurchasing the shares at a lower price. Larger stocks are generally easier to borrow.

3.2 BIAS-VARIANCE TRADEOFF

In implementing our application, one of the first issues encountered is a machine learning classic, the bias-variance tradeoff (Hastie et al., 2009, Ch. 2.9). With respect to DLMs, a tradeoff is incurred in the effective number of observations as the evolution variance is varied. A model with greater evolution variance will generate parameter distributions with greater variance but lower bias. A model with lower evolution variance will generate parameter distributions with lower variance but greater bias. A relatively smooth, lethargic, slowly adapting model does not track evolving dynamics as quickly as a rapidly adapting model; on the other hand, the quickly adapting model delivers a noisier sequence of parameter distributions. Outside our applied context, the loss function might be specified as squared error or absolute error. In the context of a risk model, the loss function should consider a portfolio manager's cost of over-trading due to a model adapting excessively (variance), as well as a risk manager's problems arising in a system adapting too slowly (bias). The appropriate loss function depends critically on the intended end use. A quantitative trader constructing portfolios with quadratic optimization tends to magnify errors in a model, as the optimization process responds dynamically to parameter estimates (Muller, 1993). In contrast, a firm-wide risk manager, who typically evaluates sums of individual asset exposures, but does not dynamically respond to individual asset risk attributes, may prefer less bias and more variance, as error in the factor loadings of one asset may be offset by error in another asset in the summation process. We construct a variety of mixtures along the bias-variance continuum as discussed in § 3.5 and as illustrated in Figure 2.

3.3 UNIVERSE OF ASSETS

We identify two universes of assets: a relatively narrow set that will be used to construct explanatory time series; and, an all inclusive set for which we will generate factor loading and residual volatility estimates. It is desirable that the assets used to construct the common factor returns trade frequently and with adequate liquidity to minimize pricing errors. We also avoid survivor bias, the methodological error of omitting companies no longer in existence at the time a historical analysis is performed. We eliminate this hazard by defining our common factor estimation universe using daily exchange traded fund (ETF) create / redeem portfolios for the last 10 years. Brokers create ETF shares (in units of 50,000 ETF shares) by delivering a defined portfolio of stock and cash in exchange for ETF shares; or, they redeem ETF shares and receive the defined portfolio of stock and cash. Given the create / redeem definitions that were used as the basis for large transactions during the historical period, the assets in an ETF portfolio represent an institutionally held, survivor bias free, tradeable universe. It is likely the component shares were readily available to borrow, as the component shares were held by custodian banks for the ETF shareholders. Our database permits us to obtain data for surviving and extinct stocks. The ETF we select for our factor estimation universe is the Vanguard Total Stock Market ETF (VTI) (Vanguard Group, Inc., 2014). As of April 2014, including both mutual fund and ETF share classes, the Vanguard Total Stock Market fund size was approximately USD 330 billion. The VTI constituents closely approximate our desired universe, with the number of component stocks typically ranging from 1300 - 1800.

3.4 DATA PREPARATION

Data preparation involves constructing the artifacts demanded by § 2.3: Y_t, F_t, and G_t. Each trading day, using price, dividend, and corporate action data for all 7000 - 8000 stocks, ADRs, and ETFs in our US pricing database (MarketMap Analytic Platform, 2014), we construct dividend and split adjusted log price return observation vectors Y_t. For stocks in the VTI ETF on that day, we construct a variance equalized historical returns matrix r_t = [Y_{t-T+1} ... Y_t] Σ̂_t^{-1/2}, where Σ̂_t is the diagonal matrix of sample variances for the period t-T+1 to t. Using (Keane and Corso, 2012, §3.c), we compute F_t from the first p right singular vectors from a singular value decomposition of r_t. As the vectors are unit length, and we desire unit variance per day, Cov(F_t) = I, we scale the right singular vectors by √T. The scaled characteristic time series from adjacent data windows, r_{t-1} = [Y_{t-T} ... Y_{t-1}] Σ̂_{t-1}^{-1/2} and r_t = [Y_{t-T+1} ... Y_t] Σ̂_t^{-1/2}, are then used to compute G_t as described in (Keane and Corso, 2012, §3.e):

  G_t = F_t F_{t-1}^T (F_{t-1} F_{t-1}^T)^{-1}.   (21)

We are a little flexible with the notation in Equation 21, where F_t and F_{t-1} are p × (T-1) sub-matrices representing time aligned subsets of two factor return matrices, the scaled right singular vectors obtained from the decomposition of r_{t-1} and r_t. Elsewhere, viz. Equation 11, F_t refers to a p × 1 dynamic regression vector, the right most column of the transposed and scaled right singular vectors, corresponding to the desired vector of common factor returns for day t.

3.5 MODEL COMPONENTS

The component DLMs in our mixture model share observations Y_t, common factor scores F_t, and state evolution matrices G_t. The component DLMs are differentiated by the variance parameters: the observational variance scale V_t, and the evolution variance matrix W_t. We construct a set of component DLMs following the approach of (West and Harrison, 1997, Ch. 12.4). For component DLMs that will accommodate "typical" observation variance, we set V_t = 1; for component DLMs that will accommodate outlier observations, we set V_t = 100. For the evolution variance, we similarly select a base rate of evolution W_t = .00001; and, inflate W_t by factors of 100 and 100^2 to permit increasingly rapid changes in the factor loadings.

3.6 OUTPUT

The format of the output risk model is an n × p factor loading matrix and an n × 1 residual volatility vector. These p + 1 numeric attributes for n stocks are stored in various formats for subsequent use throughout the organization.

4 EVALUATION

Figure 3: Distribution of factor loadings for the Vanguard Total Market Index portfolio on April 30, 2014. Left axis is density, dx = 1. Series shown: N(0,1) noise and Factors 1 - 5.

4.1 NUMBER OF FACTORS

A Bayesian DLM updated with an "explanatory" series of pure noise is expected to generate posterior regression parameter distributions densely surrounding zero given sufficient observations. Further, a linear combination of independent DLMs is a DLM (West and Harrison, 1997, Principle of Superposition, p. 188). We use these two points to justify the inclusion of a relatively large number of independent explanatory series. Given the regularization inherent in Bayesian DLMs, we believe the risk of omitting a common factor far exceeds the risk of including noise. In our application, we focus on aggregate (portfolio level) forecasts; therefore, it is extremely important to identify common sources of variation that may appear insignificant at the individual asset level. (West and Harrison, 1997, Ch. 16.3) discuss in detail the hazard of omitted common factors in aggregate forecasts. In Figure 3, we show the distribution of factor loadings for the VTI constituents on April 30, 2014. The distributions of factor loadings for the first five common factors obtained from our two component mixture model are shown in comparison to the distribution of loadings on Gaussian noise, F_t ~ N[0, 1]. Figure 3 is consistent with our viewpoint: note the high density of zero loadings for the noise series, and the relatively diffuse factor loadings for the first five common factor series extracted with SVD.

Figure 4: Multi-Process Models. See § 4.3 for discussion. Panels (a), (c), (e): units are percent per annum. (g) Price histories for IBM and SPY, units are USD. (h) Posterior component probabilities (left axis), color key in (c); mixture model degrees of freedom δ_t (right axis), white line.
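The construction of F_t and G_t described in § 3.4 can be sketched as follows. A simplified sketch, not the production pipeline: variance equalization here divides each asset's return series by its root mean square over the window rather than applying the rolling Σ̂_t^{-1/2}, and the names are ours. `factor_returns` produces the √T-scaled right singular vectors; `rotation` is the least squares estimate of G_t in Equation 21.

```python
import numpy as np

def factor_returns(Y, p):
    """Y: n x T matrix of log price returns. Returns the p x T scaled
    characteristic time series (first p right singular vectors, scaled by
    sqrt(T) so that Cov(F) ~ I, as in Section 3.4)."""
    n, T = Y.shape
    scale = np.sqrt((Y ** 2).mean(axis=1, keepdims=True))  # crude vol proxy
    r = Y / scale                                          # variance equalized
    U, d, Vt = np.linalg.svd(r, full_matrices=False)
    return np.sqrt(T) * Vt[:p, :]

def rotation(F_now, F_prev):
    """Least squares G_t mapping the previous factor basis to the current one
    (Equation 21): G_t = F_t F_{t-1}^T (F_{t-1} F_{t-1}^T)^{-1}."""
    return F_now @ F_prev.T @ np.linalg.inv(F_prev @ F_prev.T)

# One dominant common factor driving 50 synthetic assets over 100 days.
rng = np.random.default_rng(0)
f = rng.standard_normal(100)
b = 1.0 + 0.1 * rng.standard_normal(50)
Y = np.outer(b, f) + 0.1 * rng.standard_normal((50, 100))
F = factor_returns(Y, 2)
G = rotation(F, F)   # a basis maps to itself with the identity rotation
```

Because SVD can permute factor order and flip signs between adjacent windows (the second concern of § 2.2), `rotation` applied to two overlapping windows recovers those permutations and sign flips along with any scaling.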
The factor loadings on the noise series have a mean loading µm = 0 bp2 , and a stan- To supplement the more analytically precise discussion dard deviation of = 2 bp. In contrast, the loadings in § 2.5, we make the following qualitative comments on the first factor have a mean loading of µm = 82bp as to the interaction of the mixture components: and a standard deviation of m = 21bp. • the mixture in Figure 4(e) with larger evolution 4.2 MIXTURE DYNAMICS variance adapts faster; the mixture in Figure 4(a) with smaller evolution variance appears smoother; Figure 4 shows the price movement and model re- sponse for IBM during the second half of 2013. In Fig- • time t component posteriors are 1-period depar- ure 4(g), the price histories for IBM and the S&P 500 tures from the t 1 consensus, see Figure 4(a), ETF SPY are displayed. Note the sharp drop in IBM’s (b), (c), (d), and (e); price on October 17, 2013 corresponding to an earnings • an outlier component “ignores” current observa- announcement. This event is helpful in understand- tions Yt and forecast error |et |, responding to the ing the interaction of components in our multi-process t 1 posterior consensus mt 1 , see Figure 4(b) models. We construct three multi-process models, the and (d); simplest of which is a two component mixture (the “base model”), comprised of a standard component • in periods of noise, the other components return DLM to handle the majority of the observations Yt , to the outlier component’s estimate with 1-period and an outlier component DLM. We specify common lag, see left-hand side of Figure 4(b) and (d); evolution variance Wt = .00001 I; observation vari- • in periods of level change, the outlier follows the ance Vt = 1 for the standard component; and observa- other components’ estimate with 1-period lag, see tion variance Vt = 100 for the outlier component. 
The right-hand side of Figure 4(b) and (d); base model and component estimates for the first fac- tor loading m1,t appear in Figure 4(a) and magnified • when the posterior probability of the outlier com- in Figure 4(b). We specify a three component mixture ponent spikes up, the degrees of freedom parame- (the “adaptive model”) by adding a component DLM ter t usually drops, reducing the precision of the with inflated evolution variance Wt = .001 I. The variance scale estimate St , see Figure 4(h); adaptive model and component estimates for the first factor loading m1,t appear in Figure 4(c) and magni- • however, when the outlier component is selected with very high probability, there is no impact to 2 A basis point (bp) is 10, 000 1 . t as the observation is ignored, see October 17, 28 80 15 70 2 dlm mix 10 60 50 5 40 2 dlm mix + intervention 30 0 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 2012 2013 2014 2012 2013 2014 (a) Price history, units are USD. (b) Posterior mean m1,t , units are percent per annum. Figure 5: In a deal announced April 15, 2013, Life Technologies Corporation was acquired by Thermo Fisher Scientific Inc. for USD 14 billion. The change in price behavior is noticable in Figure 5a. The factor loading estimates for two mixture models are compared in Figure 5b. One mixture model is unmonitored, the other benefits from intervention subsequent to the news event. See discussion in § 4.5 2013 in Figure 4(h), noting that the white line in Figure 5b, we intervene and inflate the prior param- does not drop when Pr{ outlier } ⇡ 1; eter variance following the April 15th announcement, R++ t = Gt (Ct 1 + I) GT + Wt , where the identity • except for the very adaptive component, the re- matrix reflects the increased uncertainty in the param- sponse does not vary with Wt , see Figure 4(d) eter distribution relative to the usual prior variance in and (f), where W=.001 and W=.00001 responses Equation 10. When subsequent updates occur, the are nearly identical. 
DLM with the inflated prior variance adapts to the new dynamics rapidly and satisfactorily. 4.4 UNRESPONSIVENESS TO Wt Table 1: Risk Model Performance The phenomenon we find most surprising is the in- sensitivity of the components to Wt below a cer- Volatility s.e. t-stat tain threshold. Digging further into this phenomenon, in a mixture, the various component models view of GMV portfolio the t 1 posterior parameter distribution are very PCA (1) 4.62 0.07 40.48 similar. Thinking about univariate DLMs, and as- PCA (10) 3.41 0.05 29.87 suming for discussion F = 1 and G = 1, the DLM (10) 2.17 0.03 6.82 magnitude of the adaptive scalar At = Rt /Qt = Mixture (10) 1.96 0.03 (Ct 1 + Wt ) / (Ct 1 + Wt + Vt ). When Wt ⌧ Ct 1 , as describes our situation, At ⇡ Ct 1 / (Ct 1 + Vt ) as MSR portfolio seen in Figure 4(d) and (f). Only when Wt is signifi- cant relative to Ct 1 does response vary noticeably. PCA (1) 4.07 0.06 49.38 PCA (10) 2.06 0.03 28.84 DLM (10) 1.30 0.02 4.39 4.5 INTERVENTION Mixture (10) 1.22 0.02 Our discussion of IBM focused on the mixture mod- els’ processing of unusual observations. We now ex- Cap Weight 20.31 0.29 plore an example where the data generating process Equal Weight 24.62 0.35 changes abruptly, similar to our synthetic illustrations in Figure 2. In April 2013, the acquisition of Life Technologies Corporation by Thermo Fisher Scientific 4.6 RISK MODEL PERFORMANCE Inc. was announced. The stock’s sensitivity to the market, as expressed in its first common factor load- To access the performance of our two component mix- ing, dropped abruptly, as shown in Figure 5. While ture model, we construct daily portfolios from the VTI our goal is an unsupervised estimation processes, the universe for the most recent ten years, May 2004 to Bayesian DLM framework facilitates structured inter- April 2014. The number of trading days during this vention when necessary. For one of the mixture models period was 2,516. 
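The global minimum variance (GMV) strategy used in this evaluation has a closed form when weights are unconstrained: w ∝ Σ⁻¹1. The sketch below is our own illustration with a synthetic factor risk model (dimensions, loadings, and variances are made up, not the paper's fitted estimates), showing how GMV weights follow from a factor covariance Σ = B Ω B⊤ + D.

```python
import numpy as np

rng = np.random.default_rng(1)

n_assets, n_factors = 50, 10

# Illustrative factor risk model: Sigma = B Omega B' + D
B = rng.standard_normal((n_assets, n_factors)) * 0.1   # factor loadings
Omega = np.diag(rng.uniform(0.5, 2.0, n_factors))      # factor covariance
D = np.diag(rng.uniform(0.01, 0.05, n_assets))         # idiosyncratic variances
Sigma = B @ Omega @ B.T + D

# Unconstrained GMV weights: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)
ones = np.ones(n_assets)
w = np.linalg.solve(Sigma, ones)
w /= w.sum()                                           # budget constraint; shorts allowed

port_var = w @ Sigma @ w
print(f"weights sum to {w.sum():.4f}, ex-ante portfolio variance {port_var:.6f}")
```

By construction the GMV portfolio has ex-ante variance no larger than any other fully invested portfolio under the same Σ, which is why realized GMV volatility is a natural yardstick for comparing the risk models.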
In Table 1, we report realized out-of-sample volatility and the standard error (s.e.) of the volatility measure for two strategies: global minimum variance (GMV); and maximum Sharpe ratio (MSR) (Demey et al., 2010). We implement four risk models: 1-factor PCA; 10-factor PCA; 10-factor DLM; and 10-factor 2-component mixture model. We permit short positions, and do not constrain position weights. The PCA models are constructed using the method of Connor and Korajczyk (1988). The mixture model is the same model we presented earlier, with noise “factor ten” present. For each strategy, we report the t-statistics for the various models’ realized volatility compared to the mixture model’s realized volatility. For context on the ambient volatility of the ten year period, we provide realized volatility for two portfolios that are long only and do not use a risk model: the capitalization weighted portfolio; and the equal weighted portfolio.

5 CONCLUSION

The ability to integrate SVD with Bayesian methods allows our application to process large data streams in an unsupervised fashion. We demonstrate that a two component multi-process model achieved greater volatility reduction than alternative models, including a single process model. We find the robustness of Bayesian DLMs with respect to noise inputs of great practical value, allowing us to favor inclusion of factors, potentially capturing pervasive sources of common movement important to aggregate forecasting. The inclusion of an outlier model adds great functionality, delivering robustness to the estimation process. The insensitivity of the mixture models to multiple evolution variance values leads us to favor mixtures of just two components, a typical evolution variance value for both components, and an inflated observation variance in the outlier component. We would recommend generating several models of varying adaptiveness, evaluating the variance-bias tradeoff in light of a user’s specific situation. We favor the judicious use of intervention for events such as mergers. We would like to explore using news feeds to systematically intervene for events known to impact an asset’s dynamics.

References

J.M. Bernardo. Psi (digamma) function. Applied Statistics, 25(3):315–317, 1976.

C.M. Bishop. Pattern recognition and machine learning. Springer, 2006.

G. Connor and R.A. Korajczyk. Risk and return in an equilibrium APT. Journal of Financial Economics, 21(2):255–289, 1988.

G. Connor and R.A. Korajczyk. A test for the number of factors in an approximate factor model. The Journal of Finance, 48(4):1263–1291, 1993.

A.P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68(1):265, 1981.

P. Demey, S. Maillard, and T. Roncalli. Risk based indexation. Technical report, Lyxor Asset Management, Paris, 2010.

T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning, volume 2. Springer, 2009.

R.A. Johnson and D.W. Wichern. Applied multivariate statistical analysis, volume 4. Prentice Hall, 1998.

R.E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

K.R. Keane and J.J. Corso. Maintaining prior distributions across evolving eigenspaces. In International Conference on Machine Learning and Applications. IEEE, 2012.

MarketMap Analytic Platform. North American Pricing. SunGard Data Systems, Inc., 2014.

P. Muller. Empirical tests of biases in equity portfolio optimization. In S.A. Zenios, editor, Financial optimization. Cambridge University Press, 1993.

A. Onatski. Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics, 92(4):1004–1016, 2010.

R. Roll and S.A. Ross. An empirical investigation of the arbitrage pricing theory. The Journal of Finance, 35(5):1073–1103, 1980.

C. Trzcinka. On the number of factors in the arbitrage pricing model. The Journal of Finance, 41(2):347–368, 1986.

The Vanguard Group, Inc. Prospectus. Vanguard Total Stock Market ETF. https://www.vanguard.com, 2014.

M.E. Wall, A. Rechtsteiner, and L.M. Rocha. Singular value decomposition and principal component analysis. In D.P. Berrar, W. Dubitzky, and M. Granzow, editors, A practical approach to microarray data analysis. Springer, 2003.

H. Wang, C. Reeson, and C. Carvalho. Dynamic financial index models: Modeling conditional dependencies via graphs. Bayesian Analysis, 6(4):639–664, 2011.

M. West and J. Harrison. Bayesian forecasting and dynamic models. Springer Verlag, 1997.