                        Multi-Process Models –
    An Application for the Construction of Financial Factor Models



                                   Kevin R. Keane and Jason J. Corso
                                       Computer Science and Engineering
                            University at Buffalo, The State University of New York
                                               Buffalo, NY 14260


                                              Abstract

We present an unsupervised, comprehensive methodology for the construction of financial risk models. We offer qualitative comments on incremental functionality and quantitative measures of superior performance of component and mixture dynamic linear models relative to alternative models. We apply our methodology to a high dimensional stream of daily closing prices for approximately 7,000 US traded stocks, ADRs, and ETFs for the most recent 10 years. Our methodology automatically extracts an evolving set of explanatory time series from the data stream; maintains and updates parameter distributions for component dynamic linear models as the explanatory time series evolve; and, ultimately, specifies time-varying asset specific mixture models. Our methodology utilizes a hierarchical Bayesian approach for the specification of component model parameter distributions and for the specification of the mixing weights in the final model. Our approach is insensitive to the exact number of factors, and "effectively" sparse, as irrelevant factors (time series of pure noise) yield posterior parameter distributions with high density around zero. The statistical models obtained serve a variety of purposes, including: outlier detection; portfolio construction; and risk forecasting.


1   INTRODUCTION

We propose a time varying Bayesian statistical model for individual stock returns, depicted in Figure 1. Our goal is to accurately model the high dimensional probability distribution underlying stock returns. We demonstrate success by constructing better performing investment portfolios, where the performance goal is to minimize the standard deviation of returns. Investment professionals seek to minimize the variation in investment outcomes because doing so results in higher economic utility for their clients. To illustrate this point, consider which of two pension scenarios with equal expected value is preferred: one that pays 80% salary with probability 1; or, the other that pays 120% salary with probability 1/2 (if things "go well") and pays 40% salary with probability 1/2 (if things "go poorly"). The economic concept of declining marginal utility, expressed mathematically with concave utility functions $U(E(x)) > E(U(x))$, implies that given investment scenarios with equal expected return, individuals prefer the scenario with the lowest variation. Portfolio managers are concerned with generating acceptable returns with minimal risk; and, risk managers monitor the portfolio managers, verifying financial risks remain within authorized limits. Risk models, defining the $n \times n$ asset covariance matrix $\Sigma$ and precision matrix $\Sigma^{-1}$, are used by portfolio managers in conjunction with expected return vectors $\alpha$ to construct optimal portfolio weights $\Sigma^{-1}\alpha$; and, are used by risk managers, given portfolio weights $w$, to forecast portfolio variance $w^T \Sigma w$.

We describe the construction of a set of Bayesian switching state-space models and qualitatively analyze the on-line behavior of the various component models and mixture models. We focus our discussion on the qualitative aspects, then conclude by providing quantitative measures of our model's superior performance relative to competing models. We look at the tradeoff of adaptation rate and stability of parameter estimates, evaluating model responsiveness with both synthetic and real data series. We show that dynamic linear models are robust[1] with respect to pure noise explanatory variables, appropriately generating parameter distributions very concentrated around zero. We comment on the behavior of component models relative to the mixture consensus. We illustrate component and mixture response to outlier observations. In periods with ambiguous posterior model probabilities, we describe the diffusive impact to the mixture distribution; and, we note surprisingly distinct behavior when an outlier component is selected with near certainty. We find unexpectedly similar updating behavior across a range of component models, bringing into question the necessity of more than two components. We inspect the impact of implementing intervention in the estimation process by greatly inflating the variance of a stock's posterior distribution subsequent to a merger event. The intervention is shown to result in extremely rapid convergence to new dynamics, while the same model without intervention is shown to maintain bias for an unacceptably long period. Lastly, we compare our two component mixture model against several alternatives. Analyzing results in Table 1, we show the positive impact of regularization provided by the Bayesian framework relative to PCA, and further improvement of the mixture model as compared to the single process model.

[1] We use the term robust to describe a desirable trait where an estimation method is stable in the presence of outlier observations and irrelevant explanatory variables (noise).
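As a quick arithmetic check of the declining marginal utility argument above (illustrative, not from the paper), take the concave utility $U(x) = \sqrt{x}$ and the two pension scenarios, with $x$ expressed as a fraction of salary:

    U(E(x)) = \sqrt{0.8} \approx 0.894, \qquad E(U(x)) = \tfrac{1}{2}\sqrt{1.2} + \tfrac{1}{2}\sqrt{0.4} \approx 0.548 + 0.316 = 0.864,

so the certain 80% pension is preferred even though both scenarios have expected value 0.8.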
Figure 1: Our final two process Bayesian switching state-space model for log price returns $Y_{n,t} = \log(p_{n,t}/p_{n,t-1})$ for asset $n$ in time period $t$. We specify the prior probabilities $\pi_\alpha$ of the switch variable $\alpha_{n,t}$ that controls the observation variance: $V_{n,t} = 1$ if $\alpha_{n,t} = \{\text{regular model}\}$ and $V_{n,t} = 100$ if $\alpha_{n,t} = \{\text{outlier model}\}$. In our final model, we specify the evolution variance $W$. Other variables are obtained or inferred from the data in an unsupervised manner. The return of an individual asset is modeled as a time varying regression, with regression coefficient vector $\theta_{n,t}$, common factor explanatory vector $F_t$, and noise: $Y_{n,t} = \theta_{n,t}^T F_t + \sigma_{n,t} v_{n,t}$, $v_{n,t} \sim N(0, V_{n,t})$. The regression coefficients are a hidden state vector, evolving as a Markov chain: $\theta_{n,t} = G_t \theta_{n,t-1} + \sigma_{n,t} w_{n,t}$, $w_{n,t} \sim N(0, W)$. $G_t$ captures rotation and scaling of the explanatory vector $F_t$ over time, permitting computation of prior distributions for $\theta_{n,t}$ from posterior distributions for $\theta_{n,t-1}$. $\sigma_{n,t}$ is an asset and time specific scale factor applied to both the observation noise $v_{n,t}$ and the state evolution noise $w_{n,t}$.


2   BACKGROUND

2.1   RISK MODELS

High dimensional statistical models, central to modern portfolio management, present significant technical challenge in their construction. There is extensive literature on the topic of constructing factor models, or risk models as they are interchangeably known by practitioners. Consider a matrix of observations

    X = \begin{bmatrix} r_{1,1} & \cdots & r_{1,t} \\ \vdots & \ddots & \vdots \\ r_{n,1} & \cdots & r_{n,t} \end{bmatrix},                              (1)

representing log price returns

    r_{i,j} = \log\left( \frac{p_{i,j}}{p_{i,j-1}} \right),                              (2)

for assets $i \in 1 \ldots n$, trading days $j \in 1 \ldots t$, and end of day prices $p_{i,j}$. The general approach to financial risk modeling specifies the covariance of asset returns as a structured matrix. A factor model with $p$ explanatory time series is specified for the $n \times t$ matrix $X$ with an $n \times p$ matrix of common factor loadings $L$, a $p \times t$ matrix of common factor returns time series $F$, and an $n \times t$ matrix of residual error time series $\epsilon$:

    X = L F + \epsilon.                              (3)

An orthogonal factor model (Johnson and Wichern, 1998, Ch. 9) implies diagonal covariance matrices $\mathrm{Cov}(F) = I$ and $\mathrm{Cov}(\epsilon) = \Psi$. In an orthogonal factor model, the covariance of the observation matrix $X$ is simply:

    \mathrm{Cov}(X) = L L^T + \Psi.                              (4)

By construction, risk models based on principal component and singular value decomposition (Wall et al., 2003), including ours, possess this simple structure of orthogonal common factors and residual errors.
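As a minimal sketch of this structure (not the authors' code), the NumPy snippet below assembles the implied asset covariance $L L^T + \Psi$ from a hypothetical loading matrix and residual variance vector, and shows that a portfolio variance $w^T \Sigma w$ can be evaluated from the factor representation alone, without ever forming an unstructured $n \times n$ estimate.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 5                                  # illustrative sizes: 50 assets, 5 factors
    L = rng.normal(scale=0.01, size=(n, p))       # hypothetical factor loading matrix (n x p)
    psi = rng.uniform(1e-4, 4e-4, size=n)         # hypothetical residual variances (diagonal of Psi)
    w = np.full(n, 1.0 / n)                       # an equal weighted portfolio

    # Dense form of Equation 4: Cov(X) = L L^T + Psi.
    Sigma = L @ L.T + np.diag(psi)
    var_dense = w @ Sigma @ w

    # Equivalent factor form: only the n x p loadings and n residual variances are needed.
    f_exposure = L.T @ w                          # p x 1 vector of portfolio factor exposures
    var_factor = f_exposure @ f_exposure + (w ** 2) @ psi

    print(var_dense, var_factor)                  # identical up to floating point error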
2.2   COMMON FACTORS

In pursuing an unsupervised methodology, we utilize an SVD based approach to identifying characteristic time series. SVD identifies common variation, both row-wise and column-wise, in a matrix of data (Wall et al., 2003). SVD decomposes a rectangular matrix, such as the returns $X$, into two orthogonal matrices $U$ and $V$; and, a diagonal matrix $D$:

    X = U D V^T.                              (5)

Given the format of the $n \times t$ returns matrix $X$ in Equation 3, the mutually orthogonal, unit-length right singular vectors comprising matrix $V$ that represent common variation across the columns ($t$ trading days) are the characteristic time series; the mutually orthogonal, unit-length left singular vectors comprising matrix $U$ that represent common variation across rows ($n$ assets) are the characteristic portfolios; and, the diagonal singular values matrix $D$ captures scale. The entries of $D$ are ordered by magnitude, therefore a $p$-factor model would use the first $p$ rows of $V^T$ for the factor return time series.

We exploit the fact that SVD methodically extracts common variation, ordered by magnitude. The intuition is that "pervasive" sources of return important to modeling portfolio level return characteristics will be reliably captured in the first several factors. Points of concern for some practitioners with regards to the use of SVD or PCA include:

 1. the factors are not readily identifiable (to the human analyst);

 2. the factors can be permuted in order and sign for data samples adjacent in time; and

 3. it is not clear how many factors to "keep".

With respect to the first concern of (human) identifiability, we reiterate that our goal is an unsupervised process yielding a high dimensional statistical model with adequate explanatory power, as opposed to semantically meaningful groupings. Examination of assets with significant weight in the characteristic portfolios typically yields meaningful portfolio themes (Johnson and Wichern, 1998, Ex. 8.5). With respect to the second concern, the rotation and scaling of factors in different sample periods, our application incorporates the method of (Keane and Corso, 2012) to identify these rotations and maintain parameter distributions in the presence of rotation and scaling. Although much discussion surrounds the third concern, the identification of the "correct" number of factors (Roll and Ross, 1980; Trzcinka, 1986; Connor and Korajczyk, 1993; Onatski, 2010), we find the regularization provided by Bayesian dynamic linear models results in regression coefficients densely centered around zero for factors that are pure noise. The functioning of our process in this regard is analogous to regularized least squares (RLS) (Bishop, 2006, Ch. 3) and the ability of RLS to successfully estimate regression coefficients when confronted with a large number of candidate explanatory variables.
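A bare-bones sketch of the extraction step described at the start of this section, on random stand-in data rather than the paper's universe; it assumes the rows of X are variance-equalized log returns.

    import numpy as np

    rng = np.random.default_rng(1)
    n, t, p = 1500, 250, 10                       # illustrative: 1500 assets, 250 days, 10 factors

    X = rng.normal(scale=0.01, size=(n, t))       # stand-in for the variance-equalized return matrix

    # Thin SVD: X = U D V^T (Equation 5).
    U, d, Vt = np.linalg.svd(X, full_matrices=False)

    factor_returns = Vt[:p, :]                    # first p rows of V^T: characteristic time series
    char_portfolios = U[:, :p]                    # first p left singular vectors: characteristic portfolios
    print(factor_returns.shape, char_portfolios.shape)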

2.3   DYNAMIC LINEAR MODELS

The Bayesian dynamic linear model (DLM) framework elegantly addresses our need to process streams of data. DLMs are state space models very similar to Kalman filters (Kalman, 1960) and linear dynamical systems (Bishop, 2006, Ch. 13). We summarize the matrix variate notation and results from (West and Harrison, 1997, Ch. 16.4). Define $t$ the time index, $p$ the number of common factors, and $n$ the number of assets. The observations $Y_t$ are generated by matrix variate dynamic linear models characterized by four time varying parameters, $F_t$, $G_t$, $V_t$, $W_t$, that define the observation and state evolution equations. We now define and comment on the DLM parameters as they pertain to our application:

  • $Y_t = [Y_{t,1}, \ldots, Y_{t,n}]^T$, log price returns at time $t$, common to all component DLMs;

  • $F_t$, a $p \times 1$ dynamic regression vector, factor returns at time $t$, common to all component DLMs;

  • $G_t$, a $p \times p$ state evolution matrix, accounts for rotation and scaling of factor return time series at time $t$, common to all component DLMs;

  • $V_t$, an observational variance scalar, individually specified for each component DLM, greatly inflated in the DLMs generating "outliers" at time $t$;

  • $W_t$, an evolution variance matrix, individually specified for each component DLM, controls the rate of change in factor loadings at time $t$;

  • $\Sigma_t = \mathrm{diag}(\sigma^2_{t,1}, \ldots, \sigma^2_{t,n})$, an unknown diagonal matrix composed of $n$ asset specific variance scales at time $t$;

  • $\nu_t$, an $n \times 1$ vector of unknown observation errors at time $t$;

  • $\Theta_t = [\theta_{t,1}, \ldots, \theta_{t,n}]$, a $p \times n$ unknown state matrix whose columns are factor loadings of individual assets on common factor returns at time $t$; and,

  • $\Omega_t = [\omega_{t,1}, \ldots, \omega_{t,n}]$, a $p \times n$ unknown evolution errors matrix applied to the state matrix at time $t$.

We specify our model variances in scale free form (West and Harrison, 1997, Ch. 4.5), implying multiplication by asset specific scales $\sigma^2_{t,i}$ in univariate DLMs: $C_t = \sigma^2_{t,i} C^*_t$, $V_t = \sigma^2_{t,i} V^*_t$, and $W_t = \sigma^2_{t,i} W^*_t$. When using matrix variate notation, $\Sigma_t$ is a right variance parameter, discussed below, scaling the $n$ columns of the matrix on which it operates. When we specify models in § 3.5, we will specify scale free parameters $V^*_t \in \{1, 100\}$ and $W^*_t \in \{.00001, .001, .1\}$. For simplicity, we shall omit scale free notation.

The observation equation is:

    Y_t = \Theta_t^T F_t + \nu_t,    \nu_t \sim N[0, V_t \Sigma_t].                              (6)

The distribution of the observation errors $\nu_t$ is multivariate normal, with mean vector $0$ and unknown variance $V_t \Sigma_t$.

The state evolution is:

    \Theta_t = G_t \Theta_{t-1} + \Omega_t,    \Omega_t \sim N[0, W_t, \Sigma_t].                              (7)

The distribution of the state matrix evolution errors $\Omega_t$ is matrix normal (Dawid, 1981), with mean matrix $0$, left variance matrix $W_t$ controlling variation across rows (factors) of $\Theta_t$, and right variance matrix $\Sigma_t$ controlling variation across columns (assets) of $\Theta_t$. As our implementation uses diagonal matrices for both $W_t$ and $\Sigma_t$, we implicitly assume independence in the evolution of factor loadings across the factors $i \in 1 \ldots p$ and across the assets $j \in 1 \ldots n$.

Mapping the factor model notation of (Johnson and Wichern, 1998) in Equation 3 to the DLM notation of (West and Harrison, 1997) in Equation 6: $X \rightarrow [Y_1 \ldots Y_t]$; $L \rightarrow \Theta_t^T$; $F \rightarrow [F_1 \ldots F_t]$; and, $\epsilon \rightarrow [\nu_1 \ldots \nu_t]$. The crucial change in perspective involves the regression coefficients, $L \rightarrow \Theta_t^T$. Whereas the other matrices in Equation 3 are simply collections of columns present in Equation 6, the static regression coefficients $L$ now evolve with time in Equation 7 as $\Theta_t^T$.

As typical with a Bayesian approach, our process begins with a prior distribution reflecting our belief about the unknown state parameter matrix $\Theta_0$ and the unknown variance scale matrix $\Sigma_0$ before data arrives. Our initial belief is expressed as a matrix normal/inverse Wishart distribution:

    (\Theta_0, \Sigma_0) \sim NW^{-1}_{\delta_0}[m_0, C_0, S_0],                              (8)

with mean matrix $m_0$, left variance matrix $C_0$, right variance matrix $S_0$, and degrees of freedom $\delta_0$. We allow our estimate of observational variance to vary over time by decaying our sample variance degrees of freedom parameter $\delta_{t-1}$ immediately before computation of the prior distribution in Equation 10.

The marginal distribution for the state matrix $\Theta_0$ is a matrix T distribution:

    \Theta_0 \sim T_{\delta_0}[m_0, C_0, S_0].                              (9)

Let $D_t = [Y_t \ldots Y_0]$ refer to the information available subsequent to observing $Y_t$. The conjugate parameter distributions are updated as follows.

Prior distribution at $t$:

    (\Theta_t, \Sigma_t \mid D_{t-1}) \sim NW^{-1}_{\delta_{t-1}}[a_t, R_t, S_{t-1}],                              (10)

where $a_t = G_t m_{t-1}$ and $R_t = G_t C_{t-1} G_t^T + W_t$.

Forecast distribution at $t$ given the dynamic regression vector $F_t$:

    (Y_t \mid D_{t-1}) \sim T_{\delta_{t-1}}[f_t, Q_t S_{t-1}],                              (11)

where $f_t = a_t^T F_t$ and $Q_t = V_t + F_t^T R_t F_t$.

In our application, $F_t$ is not available until $Y_t$ is observed. Therefore, we accommodate random regression vectors $F_t$ (Wang et al., 2011, § 7). Define $\mu_{F_t} = E(F_t \mid D_{t-1})$ and $\Sigma_{F_t} = \mathrm{Cov}(F_t \mid D_{t-1})$. The forecast distribution with $F_t$ unknown is $(Y_t \mid D_{t-1}) \sim T_{\delta_{t-1}}[\hat{f}_t, \hat{Q}_t S_{t-1}]$, where the moment parameters of the multivariate T forecast distribution are now $\hat{f}_t = a_t^T \mu_{F_t}$ and $\hat{Q}_t S_{t-1} = \left( V_t + \mu_{F_t}^T R_t \mu_{F_t} + \mathrm{tr}(R_t \Sigma_{F_t}) \right) S_{t-1} + a_t^T \Sigma_{F_t} a_t$.

Posterior distribution at $t$:

    (\Theta_t, \Sigma_t \mid D_t) \sim NW^{-1}_{\delta_t}[m_t, C_t, S_t],                              (12)

with $m_t = a_t + A_t e_t^T$, $C_t = R_t - A_t A_t^T Q_t$, $\delta_t = \delta_{t-1} + 1$, and $S_t = \delta_t^{-1}\left( \delta_{t-1} S_{t-1} + e_t e_t^T / Q_t \right)$, where $A_t = R_t F_t / Q_t$ and $e_t = Y_t - f_t$.

2.4   UNIVARIATE DLMS

In § 2.3, we summarized results for matrix variate DLMs. Setting the number of assets $n = 1$, results for univariate DLMs immediately follow.
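Setting $n = 1$ and treating $F_t$ as known, one recursion of Equations 10-12 reduces to a few lines. The sketch below is a schematic univariate implementation under those simplifications (scalar observation, known regression vector, no decay of the degrees of freedom), with illustrative inputs; it is not the authors' production code.

    import numpy as np

    def dlm_update(m, C, S, delta, y, F, G, V, W):
        """One univariate DLM step: prior (Eq. 10), forecast (Eq. 11), posterior (Eq. 12)."""
        # Prior at t.
        a = G @ m                                    # prior state mean
        R = G @ C @ G.T + W                          # prior state variance (scale free)
        # One-step forecast.
        f = a @ F                                    # forecast mean
        Q = V + F @ R @ F                            # forecast variance scale
        # Posterior at t.
        e = y - f                                    # forecast error
        A = R @ F / Q                                # adaptive vector
        m_new = a + A * e
        C_new = R - np.outer(A, A) * Q
        delta_new = delta + 1
        S_new = (delta * S + e * e / Q) / delta_new  # updated variance scale estimate
        return m_new, C_new, S_new, delta_new

    # Illustrative call with p = 2 factors.
    p = 2
    m, C, S, delta = np.zeros(p), np.eye(p) * 0.01, 1.0, 5.0
    G, W = np.eye(p), np.eye(p) * 1e-5
    m, C, S, delta = dlm_update(m, C, S, delta, y=0.004,
                                F=np.array([0.002, -0.001]), G=G, V=1.0, W=W)
    print(m, S, delta)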
2.5   MULTI-PROCESS MODELS

(West and Harrison, 1997, Ch. 12) define multi-process models composed of component DLMs. Consider a set of DLMs $A = \{A_1, \ldots, A_k\}$. Let $\alpha_t$ reference the component DLM realized at time $t$, $A_{\alpha_t} \in A$. If the observations $Y_t$ are generated with one unknown DLM, $\alpha_t = \alpha$ for all time, the observations are said to follow a multi-process, class I model. If at different times $s \neq t$ the observations are generated by distinct DLMs, $\alpha_s \neq \alpha_t$, the observations are said to follow a multi-process, class II model. In an unsupervised modeling process, we need to accommodate the arrival of both typical and outlier observations. We accomplish this with a multi-process class II model, where various component DLMs are appropriate for various subsets of the observations. We assume fixed model selection probabilities, $\pi_t(A_\alpha) = \pi(A_\alpha)$.

With class II models, there are $|A|^t$ potential model histories for each asset. We avoid this explosion in model sequences by considering only two periods, $t-1$ and $t$, thereby limiting distinct model sequences in our mixtures to $|A|^2$. As each asset has its own history, no longer sharing a common scale free posterior variance $C_t$, our mixture models are asset specific, forcing the use of univariate component DLMs. Parameters that are $1 \times 1$ in univariate DLMs are now displayed with scalar notation. To avoid clutter, we omit implicit asset subscripts.

Inference with multi-process models is based upon manipulation of various model probabilities: the posterior model probabilities for the last model, $p_{t-1}(\alpha_{t-1})$; the prior model probabilities for the current model, $\pi(\alpha_t)$; and, the model sequence likelihoods, $p(Y_t \mid \alpha_t, \alpha_{t-1}, D_{t-1})$. Posterior model sequence probabilities for current model $\alpha_t$ and last model $\alpha_{t-1}$ upon observing $Y_t$ are:

    p_t(\alpha_t, \alpha_{t-1}) = \Pr[\alpha_t, \alpha_{t-1} \mid D_t] \propto p_{t-1}(\alpha_{t-1})\, \pi(\alpha_t)\, p(Y_t \mid \alpha_t, \alpha_{t-1}, D_{t-1}).                              (13)

The unconditional posterior parameter distributions are computed as mixtures of the $|A|^2$ component DLM sequences:

    p(\theta_t \mid D_t) = \sum_{\alpha_t = 1}^{k} \sum_{\alpha_{t-1} = 1}^{k} p_t(\theta_t \mid \alpha_t, \alpha_{t-1}, D_t)\, p_t(\alpha_t, \alpha_{t-1}).                              (14)

The posterior model probabilities are

    p_t(\alpha_t) = \Pr[\alpha_t \mid D_t] = \sum_{\alpha_{t-1} = 1}^{k} p_t(\alpha_t, \alpha_{t-1}).                              (15)

The posterior probabilities for the last model $\alpha_{t-1}$, given the current model $\alpha_t$ and information $D_t$, are

    \Pr[\alpha_{t-1} \mid \alpha_t, D_t] = \frac{p_t(\alpha_t, \alpha_{t-1})}{p_t(\alpha_t)}.                              (16)

After each time step, the posterior mixture distribution for each component DLM is approximated with an analytic distribution using the methodology described in (West and Harrison, 1997, Ch. 12.3.4). The Kullback-Leibler directed divergence between the approximation and the mixture is minimized in the parameters: $m_t(\alpha_t)$, $C_t(\alpha_t)$, $S_t(\alpha_t)$, and $\delta_t(\alpha_t)$. Let $S_t(\alpha_t, \alpha_{t-1})$ refer to the variance scale estimate obtained with the DLM sequence $\alpha_{t-1}, \alpha_t$. The parameters of the approximating distributions are as follows. The variance scale estimates $S_t(\alpha_t)$ are:

    S_t(\alpha_t)^{-1} = \frac{1}{p_t(\alpha_t)} \sum_{\alpha_{t-1} = 1}^{k} \frac{p_t(\alpha_t, \alpha_{t-1})}{S_t(\alpha_t, \alpha_{t-1})}.                              (17)

The weights for computing the moments of the KL divergence minimizing approximation to the posterior distribution are:

    p^*_t(\alpha_{t-1}) = \frac{S_t(\alpha_t)}{p_t(\alpha_t)} \cdot \frac{p_t(\alpha_t, \alpha_{t-1})}{S_t(\alpha_t, \alpha_{t-1})}.                              (18)

The mean vector $m_t(\alpha_t)$ for DLM $\alpha_t$ is:

    m_t(\alpha_t) = \sum_{\alpha_{t-1} = 1}^{k} p^*_t(\alpha_{t-1})\, m_t(\alpha_t, \alpha_{t-1}).                              (19)

The variance matrix $C_t(\alpha_t)$ for DLM $\alpha_t$ is:

    C_t(\alpha_t) = \sum_{\alpha_{t-1} = 1}^{k} p^*_t(\alpha_{t-1}) \left\{ C_t(\alpha_t, \alpha_{t-1}) + [m_t(\alpha_t) - m_t(\alpha_t, \alpha_{t-1})]\,[m_t(\alpha_t) - m_t(\alpha_t, \alpha_{t-1})]^T \right\}.                              (20)

The degrees of freedom parameter $\delta_t(\alpha_t)$ is important to the KL minimization. Intuitively, if the component DLMs are in discord, the resulting mixture may be described as "fat-tailed", and the precision of the unknown variance scale parameter is reduced. We compute $\delta_t(\alpha_t)$ using the procedure described by (West and Harrison, 1997, Ex. 12.7), with a further correction term. The approximation West and Harrison utilize, based upon an algorithm for computing the digamma function $\psi(x)$ discussed in (Bernardo, 1976), is appropriate when $x \rightarrow \infty$. However, when estimating the reciprocal of the KL minimizing $\delta_t(\alpha_t)$, we find the error in the approximation remains rather constant, and we apply a correction to eliminate this constant.
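A schematic sketch of one time step of the two-period collapse, restricted for brevity to the model probabilities, variance scales, and means (Equations 13, 15, 17-19); the per-sequence likelihoods, means, and scales are illustrative placeholders rather than fitted quantities from the paper.

    import numpy as np

    k = 2                                          # e.g., {regular, outlier} components
    pi = np.array([0.9, 0.1])                      # fixed model selection probabilities pi(alpha_t)
    p_prev = np.array([0.95, 0.05])                # posterior p_{t-1}(alpha_{t-1})

    # Placeholder inputs for one asset at time t.
    lik = np.array([[0.8, 0.6],                    # lik[i, j] = p(Y_t | alpha_t=i, alpha_{t-1}=j, D_{t-1})
                    [0.2, 0.3]])
    m_seq = np.array([[0.010, 0.012],              # m_t(alpha_t, alpha_{t-1})
                      [0.011, 0.013]])
    S_seq = np.array([[1.0, 1.1],                  # S_t(alpha_t, alpha_{t-1})
                      [1.2, 1.3]])

    # Equation 13: joint posterior over the |A|^2 two-period sequences.
    p_joint = pi[:, None] * p_prev[None, :] * lik
    p_joint /= p_joint.sum()

    # Equation 15: marginal posterior model probabilities p_t(alpha_t).
    p_t = p_joint.sum(axis=1)

    # Equation 17: harmonic-style combination of the variance scale estimates.
    S_t = p_t / (p_joint / S_seq).sum(axis=1)

    # Equations 18-19: KL-minimizing weights and collapsed means per component.
    w_star = (S_t[:, None] / p_t[:, None]) * (p_joint / S_seq)
    m_t = (w_star * m_seq).sum(axis=1)

    print(p_t, S_t, m_t)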
Figure 2: Unmonitored mixture models responding to abrupt change. (a) Synthetic scenario 1: abrupt increase in factor loading $\theta_t$; units are percent per annum. Observations $Y_t$ are synthesized returns using SPY (S&P 500 ETF) until March 30, 2012; and, 2 x SPY thereafter. Note the "true" factor loading doubles from approximately 10 to 20. (b) Synthetic scenario 2: abrupt decrease in factor loading $\theta_t$; units are percent per annum. Observations $Y_t$ are synthesized returns using SPY until March 30, 2012; and, AGG (Barclays Aggregate Bond ETF) thereafter. Note the "true" factor loading drops from approximately 10 to 0. In both panels, the black line is the true value of the latent state variable $\theta_{1,t}$. Other lines are the posterior mean $m_{1,t}$ (first common factor loading) obtained with three different mixture models discussed in § 3.5. None of the three models responded quickly enough to the dramatic change in dynamics. A method of intervention to improve responsiveness is discussed in § 4.5.


3   APPLICATION DESIGN

3.1   END USERS

Our application enables a proprietary trading group at a financial services firm to better understand the aggregate behavior of stock portfolios. The models provide a statistical framework required for constructing portfolios, assessing portfolio risk, and clustering assets. The assets of primary interest are the common shares of the largest 1000 - 2000 companies in the US stock market. The group focuses on larger companies because the liquidity of larger stocks generally makes them cheaper and easier to transact. The group's strategies include short selling: borrowing stock, selling borrowed shares, and attempting to profit by repurchasing the shares at a lower price. Larger stocks are generally easier to borrow.

3.2   BIAS-VARIANCE TRADEOFF

In implementing our application, one of the first issues encountered is a machine learning classic, the bias-variance tradeoff (Hastie et al., 2009, Ch. 2.9). With respect to DLMs, a tradeoff is incurred in the effective number of observations as the evolution variance is varied. A model with greater evolution variance will generate parameter distributions with greater variance but lower bias. A model with lower evolution variance will generate parameter distributions with lower variance but greater bias. A relatively smooth, lethargic, slowly adapting model does not track evolving dynamics as quickly as a rapidly adapting model; on the other hand, the quickly adapting model delivers a noisier sequence of parameter distributions. Outside our applied context, the loss function might be specified as squared error or absolute error. In the context of a risk model, the loss function should consider a portfolio manager's cost of over-trading due to a model adapting excessively (variance), as well as a risk manager's problems arising in a system adapting too slowly (bias). The appropriate loss function depends critically on the intended end use. A quantitative trader constructing portfolios with quadratic optimization tends to magnify errors in a model, as the optimization process responds dynamically to parameter estimates (Muller, 1993). In contrast, a firm-wide risk manager, who typically evaluates sums of individual asset exposures, but does not dynamically respond to individual asset risk attributes, may prefer less bias and more variance, as error in the factor loadings of one asset may be offset by error in another asset in the summation process. We construct a variety of mixtures along the bias-variance continuum as discussed in § 3.5 and as illustrated in Figure 2.
3.3   UNIVERSE OF ASSETS

We identify two universes of assets: a relatively narrow set that will be used to construct explanatory time series; and, an all inclusive set for which we will generate factor loading and residual volatility estimates. It is desirable that the assets used to construct the common factor returns trade frequently and with adequate liquidity to minimize pricing errors. We also avoid survivor bias, the methodological error of omitting companies no longer in existence at the time a historical analysis is performed. We eliminate this hazard by defining our common factor estimation universe using daily exchange traded fund (ETF) create / redeem portfolios for the last 10 years. Brokers create ETF shares (in units of 50,000 ETF shares) by delivering a defined portfolio of stock and cash in exchange for ETF shares; or, they redeem ETF shares and receive the defined portfolio of stock and cash. Given the create / redeem definitions that were used as the basis for large transactions during the historical period, the assets in an ETF portfolio represent an institutionally held, survivor bias free, tradeable universe. It is likely the component shares were readily available to borrow, as the component shares were held by custodian banks for the ETF shareholders. Our database permits us to obtain data for surviving and extinct stocks. The ETF we select for our factor estimation universe is the Vanguard Total Stock Market ETF (VTI) (Vanguard Group, Inc., 2014). As of April 2014, including both mutual fund and ETF share classes, the Vanguard Total Stock Market fund size was approximately USD 330 billion. The VTI constituents closely approximate our desired universe, with the number of component stocks typically ranging from 1300 - 1800.

3.4   DATA PREPARATION

Data preparation involves constructing the artifacts demanded by § 2.3: $Y_t$, $F_t$, and $G_t$. Each trading day, using price, dividend, and corporate action data for all 7000 - 8000 stocks, ADRs, and ETFs in our US pricing database (MarketMap Analytic Platform, 2014), we construct dividend and split adjusted log price return observation vectors $Y_t$. For stocks in the VTI ETF on that day, we construct a variance equalized historical returns matrix $r_t = [Y_{t-T+1} \ldots Y_t]\, \hat{\Sigma}_t^{-\frac{1}{2}}$, where $\hat{\Sigma}_t$ is the diagonal matrix of sample variances for the period $t-T+1$ to $t$. Using (Keane and Corso, 2012, §3.c), we compute $F_t$ from the first $p$ right singular vectors of a singular value decomposition of $r_t$. As the vectors are unit length, and we desire unit variance per day, $\mathrm{Cov}(F_t) = I$, we scale the right singular vectors by $\sqrt{T}$. The scaled characteristic time series from adjacent data windows, $r_{t-1} = [Y_{t-T} \ldots Y_{t-1}]\, \hat{\Sigma}_{t-1}^{-\frac{1}{2}}$ and $r_t = [Y_{t-T+1} \ldots Y_t]\, \hat{\Sigma}_t^{-\frac{1}{2}}$, are then used to compute $G_t$ as described in (Keane and Corso, 2012, §3.e):

    G_t = \left( F_t F_t^T \right)^{-1} F_t F_{t-1}^T.                              (21)

We are a little flexible with the notation in Equation 21, where $F_t$ and $F_{t-1}$ are $p \times (T-1)$ sub-matrices representing time aligned subsets of two factor return matrices, the scaled right singular vectors obtained from the decomposition of $r_{t-1}$ and $r_t$. Elsewhere, viz. Equation 11, $F_t$ refers to a $p \times 1$ dynamic regression vector, the right most column of the transposed and scaled right singular vectors, corresponding to the desired vector of common factor returns for day $t$.
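The sketch below illustrates this preparation step on random stand-in data: variance equalization, extraction of the scaled characteristic time series, and the regression of Equation 21 relating adjacent windows. Matrix orientations and the alignment of the two windows follow our reading of the text and (Keane and Corso, 2012), so treat it as schematic rather than as the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(2)
    n, T, p = 1500, 250, 10                        # illustrative universe size, window, factor count

    Y = rng.normal(scale=0.01, size=(n, T + 1))    # stand-in log returns, columns are days

    def scaled_factors(window):
        """Variance equalize a window of returns and return p scaled characteristic time series."""
        scale = window.std(axis=1, ddof=1)         # per-asset sample volatility
        r = window / scale[:, None]                # variance equalized returns
        _, _, Vt = np.linalg.svd(r, full_matrices=False)
        return np.sqrt(window.shape[1]) * Vt[:p, :]  # p x T, unit variance per day

    F_prev = scaled_factors(Y[:, :-1])             # window ending at day t-1
    F_curr = scaled_factors(Y[:, 1:])              # window ending at day t

    # Equation 21 on the time aligned overlap of the two windows (the first T-1 columns of the
    # current window line up with the last T-1 columns of the previous window).
    Fc, Fp = F_curr[:, :-1], F_prev[:, 1:]
    G_t = np.linalg.solve(Fc @ Fc.T, Fc @ Fp.T)    # (F_t F_t^T)^{-1} F_t F_{t-1}^T

    F_t = F_curr[:, -1]                            # today's p x 1 vector of common factor returns
    print(G_t.shape, F_t.shape)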

3.5   MODEL COMPONENTS

The component DLMs in our mixture model share observations $Y_t$, common factor scores $F_t$, and state evolution matrices $G_t$. The component DLMs are differentiated by the variance parameters: the observational variance scale $V_t$, and the evolution variance matrix $W_t$. We construct a set of component DLMs following the approach of (West and Harrison, 1997, Ch. 12.4). For component DLMs that will accommodate "typical" observation variance, we set $V_t = 1$; for component DLMs that will accommodate outlier observations, we set $V_t = 100$. For the evolution variance, we similarly select a base rate of evolution $W_t = .00001$; and, inflate $W_t$ by a factor of 100 and 100^2 to permit increasingly rapid changes in the factor loadings.

3.6   OUTPUT

The format of the output risk model will be an $n \times p$ factor loading matrix and an $n \times 1$ residual volatility vector. These $p + 1$ numeric attributes for $n$ stocks are stored in various formats for subsequent use throughout the organization.


4   EVALUATION

Figure 3: Distribution of factor loadings for the Vanguard Total Market Index portfolio on April 30, 2014. The loadings on the first five common factors (Factor 1 through Factor 5) are compared with the loadings on an N(0,1) noise series; the horizontal axis spans approximately -0.010 to 0.005. The left axis is density; the densities integrate to 1.

Figure 4: Multi-Process Models. See § 4.3 for discussion. Figure 4(a), (c), (e) units are percent per annum. [Panels (a)/(b): base model (mixture; V=1 W=.00001; V=100 W=.00001). Panels (c)/(d): adaptive model (adds V=1 W=.001). Panels (e)/(f): very adaptive model (adds V=1 W=.1). Panels (b), (d), and (f) magnify (a), (c), and (e) respectively. Panel (g): price histories for IBM and SPY; units are USD. Panel (h): posterior component probabilities (left axis, color key in (c)) and mixture model degrees of freedom $\delta_t$ (right axis, white line). Horizontal axes span July through December 2013.]
4.1   NUMBER OF FACTORS

A Bayesian DLM updated with an "explanatory" series of pure noise is expected to generate posterior regression parameter distributions densely surrounding zero, given sufficient observations. Further, a linear combination of independent DLMs is a DLM (West and Harrison, 1997, Principle of Superposition, p. 188). We use these two points to justify the inclusion of a relatively large number of independent explanatory series. Given the regularization inherent in Bayesian DLMs, we believe the risk of omitting a common factor far exceeds the risk of including noise. In our application, we focus on aggregate (portfolio level) forecasts; therefore, it is extremely important to identify common sources of variation that may appear insignificant at the individual asset level. (West and Harrison, 1997, Ch. 16.3) discuss in detail the hazard of omitted common factors in aggregate forecasts. In Figure 3, we show the distribution of factor loadings for the VTI constituents on April 30, 2014. The distributions of factor loadings for the first five common factors obtained from our two component mixture model are shown in comparison to the distribution of loadings on Gaussian noise, $F_t \sim N[0, 1]$. Figure 3 is consistent with our viewpoint: note the high density of zero loadings for the noise series, and the relatively diffuse factor loadings for the first five common factor series extracted with SVD. The factor loadings on the noise series have a mean loading $\mu_m = 0$ bp[2] and a standard deviation of $\sigma_m = 2$ bp. In contrast, the loadings on the first factor have a mean loading of $\mu_m = 82$ bp and a standard deviation of $\sigma_m = 21$ bp.

[2] A basis point (bp) is $10{,}000^{-1}$.

4.2   MIXTURE DYNAMICS

Figure 4 shows the price movement and model response for IBM during the second half of 2013. In Figure 4(g), the price histories for IBM and the S&P 500 ETF SPY are displayed. Note the sharp drop in IBM's price on October 17, 2013, corresponding to an earnings announcement. This event is helpful in understanding the interaction of components in our multi-process models. We construct three multi-process models, the simplest of which is a two component mixture (the "base model"), comprised of a standard component DLM to handle the majority of the observations $Y_t$, and an outlier component DLM. We specify common evolution variance $W_t = .00001\, I$; observation variance $V_t = 1$ for the standard component; and observation variance $V_t = 100$ for the outlier component. The base model and component estimates for the first factor loading $m_{1,t}$ appear in Figure 4(a) and magnified in Figure 4(b). We specify a three component mixture (the "adaptive model") by adding a component DLM with inflated evolution variance $W_t = .001\, I$. The adaptive model and component estimates for the first factor loading $m_{1,t}$ appear in Figure 4(c) and magnified in Figure 4(d). Finally, we specify a four component mixture (the "very adaptive model") by adding a component with further inflated evolution variance $W_t = .1\, I$. The very adaptive model and component estimates for the first factor loading $m_{1,t}$ appear in Figure 4(e) and magnified in Figure 4(f). The posterior component model probabilities for the very adaptive model appear as a bar chart in Figure 4(h), where the bottom bar corresponds to the probability of the outlier component, the second bar corresponds to the very adaptive component DLM, the third bar corresponds to the adaptive component DLM, and the top bar corresponds to the base component DLM. We specified the fixed DLM selection (prior) probabilities as .01543, .00887, .0887, .887 for the components {outlier, very adaptive, adaptive, base} respectively. Note several occurrences where the posterior probability of an outlier observation significantly exceeds the DLM selection probability. The white line in Figure 4(h) corresponds to the degrees of freedom parameter $\delta_t$ for the T-distribution that approximates the mixture model's posterior parameter distribution.
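For concreteness, the following sketch enumerates the component DLMs of the very adaptive model as (V, W, prior probability) triples; the dictionary layout is ours, not the authors', but the values are those quoted above.

    import numpy as np

    # Component DLMs of the four component ("very adaptive") mixture:
    # scale free observation variance V*, evolution variance W*, fixed selection probability.
    components = {
        "outlier":       dict(V=100.0, W=0.00001, prior=0.01543),
        "very adaptive": dict(V=1.0,   W=0.1,     prior=0.00887),
        "adaptive":      dict(V=1.0,   W=0.001,   prior=0.0887),
        "base":          dict(V=1.0,   W=0.00001, prior=0.887),
    }

    priors = np.array([c["prior"] for c in components.values()])
    assert abs(priors.sum() - 1.0) < 1e-9          # selection probabilities sum to one

    # Prior over the |A|^2 two-period model sequences, given a last posterior p_{t-1}.
    p_prev = priors.copy()                         # e.g., start from the fixed selection probabilities
    seq_prior = priors[:, None] * p_prev[None, :]  # rows: alpha_t, columns: alpha_{t-1}
    print(seq_prior.sum())                         # 1.0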


4.3   QUALITATIVE COMMENTS

To supplement the more analytically precise discussion in § 2.5, we make the following qualitative comments as to the interaction of the mixture components:

  • the mixture in Figure 4(e) with larger evolution variance adapts faster; the mixture in Figure 4(a) with smaller evolution variance appears smoother;

  • time $t$ component posteriors are 1-period departures from the $t-1$ consensus, see Figure 4(a), (b), (c), (d), and (e);

  • an outlier component "ignores" current observations $Y_t$ and forecast error $|e_t|$, responding to the $t-1$ posterior consensus $m_{t-1}$, see Figure 4(b) and (d);

  • in periods of noise, the other components return to the outlier component's estimate with 1-period lag, see the left-hand side of Figure 4(b) and (d);

  • in periods of level change, the outlier follows the other components' estimate with 1-period lag, see the right-hand side of Figure 4(b) and (d);

  • when the posterior probability of the outlier component spikes up, the degrees of freedom parameter $\delta_t$ usually drops, reducing the precision of the variance scale estimate $S_t$, see Figure 4(h);


  • however, when the outlier component is selected with very high probability, there is no impact to $\delta_t$, as the observation is ignored; see October 17, 2013 in Figure 4(h), noting that the white line does not drop when Pr{outlier} ≈ 1;

  • except for the very adaptive component, the response does not vary with $W_t$, see Figure 4(d) and (f), where the W=.001 and W=.00001 responses are nearly identical.

4.4   UNRESPONSIVENESS TO $W_t$

The phenomenon we find most surprising is the insensitivity of the components to $W_t$ below a certain threshold. Digging further into this phenomenon, in a mixture, the various component models' views of the $t-1$ posterior parameter distribution are very similar. Thinking about univariate DLMs, and assuming for discussion $F = 1$ and $G = 1$, the magnitude of the adaptive scalar is $A_t = R_t / Q_t = (C_{t-1} + W_t) / (C_{t-1} + W_t + V_t)$. When $W_t \ll C_{t-1}$, as describes our situation, $A_t \approx C_{t-1} / (C_{t-1} + V_t)$, as seen in Figure 4(d) and (f). Only when $W_t$ is significant relative to $C_{t-1}$ does the response vary noticeably.
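To see the scale of the effect, take illustrative values (not from the paper) $C_{t-1} = 0.05$ and $V_t = 1$:

    A_t = \frac{0.05 + 10^{-5}}{1.05 + 10^{-5}} \approx 0.0476, \qquad
    A_t = \frac{0.05 + 10^{-3}}{1.05 + 10^{-3}} \approx 0.0485, \qquad
    A_t = \frac{0.05 + 0.1}{1.05 + 0.1} \approx 0.1304.

The first two adaptive coefficients, corresponding to $W_t = .00001$ and $W_t = .001$, are nearly identical; only the very adaptive setting $W_t = .1$ moves $A_t$ appreciably.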
4.5   INTERVENTION

Figure 5: In a deal announced April 15, 2013, Life Technologies Corporation was acquired by Thermo Fisher Scientific Inc. for USD 14 billion. (a) Price history, units are USD. (b) Posterior mean $m_{1,t}$, units are percent per annum. The change in price behavior is noticeable in Figure 5(a). The factor loading estimates for two mixture models ("2 dlm mix" and "2 dlm mix + intervention") are compared in Figure 5(b). One mixture model is unmonitored; the other benefits from intervention subsequent to the news event. See the discussion in § 4.5.

Our discussion of IBM focused on the mixture models' processing of unusual observations. We now explore an example where the data generating process changes abruptly, similar to our synthetic illustrations in Figure 2. In April 2013, the acquisition of Life Technologies Corporation by Thermo Fisher Scientific Inc. was announced. The stock's sensitivity to the market, as expressed in its first common factor loading, dropped abruptly, as shown in Figure 5. While our goal is an unsupervised estimation process, the Bayesian DLM framework facilitates structured intervention when necessary. For one of the mixture models in Figure 5(b), we intervene and inflate the prior parameter variance following the April 15th announcement, $R_t^{++} = G_t (C_{t-1} + I) G_t^T + W_t$, where the identity matrix reflects the increased uncertainty in the parameter distribution relative to the usual prior variance in Equation 10. When subsequent updates occur, the DLM with the inflated prior variance adapts to the new dynamics rapidly and satisfactorily.
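A schematic of the intervention step for a single asset, inserted where the usual prior variance would be computed; the values and function shape are illustrative rather than the authors' code.

    import numpy as np

    def prior_variance(C_prev, G, W, intervene=False):
        """Prior state variance R_t = G C_{t-1} G^T + W, optionally inflated at an intervention."""
        if intervene:
            C_prev = C_prev + np.eye(C_prev.shape[0])   # R_t^{++}: add I to reflect increased uncertainty
        return G @ C_prev @ G.T + W

    # Illustrative p = 2 factor example around a merger announcement date.
    p = 2
    C_prev = np.eye(p) * 0.01
    G, W = np.eye(p), np.eye(p) * 1e-5

    R_usual = prior_variance(C_prev, G, W)                  # Equation 10 prior variance
    R_intervention = prior_variance(C_prev, G, W, True)     # inflated prior variance after the news event
    print(np.diag(R_usual), np.diag(R_intervention))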
4.6   RISK MODEL PERFORMANCE

To assess the performance of our two component mixture model, we construct daily portfolios from the VTI universe for the most recent ten years, May 2004 to April 2014. The number of trading days during this period was 2,516. In Table 1, we report realized out-of-sample volatility and the standard error (s.e.) of the volatility measure for two strategies: global minimum variance (GMV); and, maximum Sharpe ratio (MSR) (Demey et al., 2010). We implement four risk models: 1-factor PCA; 10-factor PCA; 10-factor DLM; and, 10-factor 2-component mixture model. We permit short positions, and do not constrain position weights. The PCA models are constructed using (Connor and Korajczyk, 1988). The mixture model is the same model we presented earlier, with noise "factor ten" present. For each strategy, we report the t-statistics for the various models' realized volatility compared to the mixture model's realized volatility. For context on the ambient volatility of the ten year period, we provide realized volatility for two portfolios that are long only and do not use a risk model: the capitalization weighted portfolio; and the equal weighted portfolio.

Table 1: Risk Model Performance

                      Volatility    s.e.    t-stat
  GMV portfolio
    PCA (1)               4.62      0.07     40.48
    PCA (10)              3.41      0.05     29.87
    DLM (10)              2.17      0.03      6.82
    Mixture (10)          1.96      0.03
  MSR portfolio
    PCA (1)               4.07      0.06     49.38
    PCA (10)              2.06      0.03     28.84
    DLM (10)              1.30      0.02      4.39
    Mixture (10)          1.22      0.02
  Cap Weight             20.31      0.29
  Equal Weight           24.62      0.35
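For reference, the two evaluation strategies reduce to closed forms in the unconstrained case; the sketch below computes GMV weights proportional to $\Sigma^{-1}\mathbf{1}$ and MSR weights proportional to $\Sigma^{-1}\alpha$ from a model covariance matrix. The inputs are random stand-ins, and the simple normalization mirrors the description above rather than any published implementation detail.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 200, 10                                 # illustrative universe and factor count
    L = rng.normal(scale=0.01, size=(n, p))        # stand-in factor loadings from a risk model
    psi = rng.uniform(1e-4, 4e-4, size=n)          # stand-in residual variances
    Sigma = L @ L.T + np.diag(psi)                 # model covariance, Equation 4
    alpha = rng.uniform(0.02, 0.10, size=n)        # stand-in expected returns (used by MSR only)

    def normalize(w):
        return w / w.sum()                         # fully invested; short positions permitted

    w_gmv = normalize(np.linalg.solve(Sigma, np.ones(n)))   # global minimum variance weights
    w_msr = normalize(np.linalg.solve(Sigma, alpha))        # maximum Sharpe ratio weights

    for w in (w_gmv, w_msr):
        print(np.sqrt(252 * w @ Sigma @ w))        # annualized forecast volatility of each portfolio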
                                                              statistical analysis, volume 4. Prentice Hall, 1998.
                                                            R.E. Kalman. A new approach to linear filtering and
5   CONCLUSION                                                prediction problems. Journal of Basic Engineering,
                                                              82(1):35–45, 1960.
The ability to integrate SVD with Bayesian methods          K.R. Keane and J.J. Corso. Maintaining prior distri-
allows our application to process large data streams in       butions across evolving eigenspaces. In International
an unsupervised fashion. We demonstrate that a two            Conference on Machine Learning and Applications.
component multi-process model achieved better reduc-          IEEE, 2012.
tion in volatility than alternative models. The two
component model out-performed alternative models            MarketMap Analytic Platform. North American Pric-
including a single process model. We find the robust-        ing. SunGard Data Systems, Inc., 2014.
ness of Bayesian DLMs with respect to noise inputs          P. Muller. Empirical tests of biases in equity portfo-
of great practical value, allowing us to favor inclusion      lio optimization. In S.A. Zenios, editor, Financial
of factors, potentially capturing pervasive sources of        optimization. Cambridge University Press, 1993.
common movement important to aggregate forecast-            A. Onatski. Determining the number of factors from
ing. The inclusion of an outlier model adds great             empirical distribution of eigenvalues. The Review of
functionality, delivering robustness to the estimation        Economics and Statistics, 92(4):1004–1016, 2010.
process. The insensitivity of the mixture models to
                                                            R. Roll and S.A. Ross. An empirical investigation of
multiple evolution variance values leads us to favor
                                                              the arbitrage pricing theory. The Journal of Fi-
mixtures of just two components, a typical evolution
                                                              nance, 35(5):1073–1103, 1980.
variance value for both components, and an inflated
observation variance in the outlier component. We           C. Trzcinka. On the number of factors in the arbitrage
would recommend generating several models of vary-            pricing model. the Journal of Finance, 41(2):347–
ing adaptiveness, evaluating the variance-bias tradeo↵        368, 1986.
in light of a user’s specific situation. We favor the ju-   The Vanguard Group, Inc. Prospectus. Vanguard To-
dicious use of intervention for events such as mergers.       tal Stock Market ETF. https://www.vanguard.
We would like to explore using news feeds to system-          com, 2014.
atically intervene for events known to impact an assets     M.E. Wall, A. Rechtsteiner, and L.M. Rocha. Singular
dynamics.                                                    value decomposition and principal component anal-
                                                             ysis. In D.P. Berrar, W. Dubitzky, and M. Granzow,
                                                             editors, A practical approach to microarray data
References                                                   analysis. Springer, 2003.
                                                            H. Wang, C. Reeson, and C. Carvalho. Dynamic finan-
J.M. Bernardo. Psi (digamma) function.           Applied
                                                              cial index models: Modeling conditional dependen-
  Statistics, 25(3):315–317, 1976.
                                                              cies via graphs. Bayesian Analysis, 6(4):639–664,
C.M. Bishop. Pattern recognition and machine learn-           2011.
  ing. Springer, 2006.                                      M. West and J. Harrison. Bayesian forecasting and
G. Connor and R.A. Korajczyk. Risk and return in an          dynamic models. Springer Verlag, 1997.

                                                      30