Proceedings of the 1st International Workshop on Advanced Analytics and Learning on Temporal Data (AALTD 2015)




     Estimating Dynamic Graphical Models from
                  Multivariate Time-Series Data

                    Alexander J. Gibberd and James D.B. Nelson
Department of Statistical Science, University College London, Gower Street, London,
                                    WC1E 6BT

       Abstract. We consider the problem of estimating dynamic graphical models that describe the time-evolving conditional dependency structure between a set of data-streams. The bulk of work in such graphical structure learning problems has focused on the stationary i.i.d. setting. However, when one introduces dynamics to such models we are forced to make additional assumptions about how the estimated distributions may vary over time. In order to examine the effect of such assumptions we introduce two regularisation schemes that encourage piecewise constant structure within Gaussian graphical models. This article reviews previous work in the field and gives an introduction to our current research.

1    Introduction
As the current data explosion continues, governments, business, and academia are now not only harvesting more data points but also measuring an ever-increasing number of variables. The complex systems represented by such datasets arise in many socio-scientific domains, such as cyber-security, neurology, genetics, and economics. In order to understand such systems, we must focus our analytic and experimental resources on investigating the most important relationships. However, searching for significant relationships between variables is a complex task: the number of possible graphs that encode such dependencies between variables grows exponentially as the number of variables increases. Such computational issues are only compounded when the graphs vary over time.
    From a statistical estimation viewpoint, the significance of a model component can often be viewed in terms of a model selection problem. Generally, one may construct an estimate of model fit (a lower score implies a better fit) L(M, θ, Y), relating a given model M ∈ M and parameters θ ∈ Θ(M) to some observed data Y ∈ Ω. Additionally, to account for differences in perceived model complexity one should penalise this by a measure of complexity R(M, θ) (larger is more complex). An optimal model and identification of parameters can then be found by balancing the two terms, i.e.:

$$(\hat{M}, \hat{\theta}) \;=\; \operatorname*{arg\,min}_{M \in \mathcal{M},\; \theta \in \Theta(M)} \Big[\, L(M, \theta, Y) + R(M, \theta) \,\Big]. \qquad (1)$$

In statistics such a formulation is referred to as an M-estimator [15]; however, such frameworks are popular across all walks of science [2]: for example, maximum-likelihood (ML), least-squares, robust (Huber loss), and penalised ML estimators can all be discussed in this context.

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

The principal idea is to make a mathematical (and therefore objectively communicable) statement to the effect of Occam's razor: given similar model fit, one should prefer the simpler model. Depending on the specification of the functions L(·) and R(·) and the associated model/parameter spaces, the problem in (1) can be either very easy or very difficult (for example, depending on whether the functions are smooth, convex, etc.).
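
For example, the familiar lasso regression estimator is an instance of (1) with a fixed model class, a squared-error fit term, and an ℓ1 complexity penalty (a worked illustration on our part, not an example drawn from the text):

```latex
% Lasso regression as an M-estimator: L is the squared-error fit,
% R is the l1 complexity penalty with weight lambda.
\hat{\theta} \;=\; \operatorname*{arg\,min}_{\theta \in \mathbb{R}^{P}}
  \Big[\; \underbrace{\lVert y - X\theta \rVert_2^2}_{L(\theta,\, Y)}
  \;+\; \underbrace{\lambda \lVert \theta \rVert_1}_{R(\theta)} \;\Big]
```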
    In the next section we introduce the canonical Gaussian graphical model
(GGM), and study the estimation of such models within the M-estimation frame-
work. This lays the foundations for our proposed dynamical extensions. We con-
clude with an example of an estimated dynamic GGM, some recovery properties
of our estimators and discuss future research directions.


2    Gaussian graphical models

A Gaussian graphical model is a generative model which encodes the conditional dependency structure between a set of P variables $(Y_1, \dots, Y_P) \sim \mathcal{N}(0, \Sigma)$ as a graph G(V, E). For now we discuss the traditional i.i.d. setting; in Section 4 we demonstrate ways in which we may relax the assumption of the distribution being identical over time.
     In the standard case, the vertex set V = {1, . . . , P} identifies variables and the edge set E = {(i, j), . . . , (l, m)} contains an edge if variables are conditionally dependent; specifically, if (i, j) ∉ E we can decompose the joint distribution as $P(Y_i, Y_j \mid Y_{V \setminus \{i,j\}}) = P(Y_i \mid Y_{V \setminus \{i,j\}})\, P(Y_j \mid Y_{V \setminus \{i,j\}})$. The aim of our work is to estimate an edge-set that appropriately represents a given data-set. Within the GGM setting, learning such representations not only provides insight by suggesting key dependencies, but also specifies a robust probabilistic model which we can use for tasks such as anomaly detection.
     It is well known that the edges in a GGM are encoded by the non-zero off-diagonal entries of the precision matrix Θ := Σ⁻¹; specifically, (i, j) ∈ E ⟺ Θ_{i,j} ≠ 0 (see [12] for details). Learning the structure within the GGM can then be linked with the general framework of (1) through an ML or maximum a-posteriori (MAP) paradigm. Assuming T observations Y ∈ R^{P×T} drawn as i.i.d. samples, the model fit function L(·) can be related to the likelihood specified by the multivariate normal. Typically, one prefers to work with the log-likelihood, which if we assume µ = 0 (we assume this throughout) is given by:

$$\log P(Y \mid \Theta)/T \;=\; \tfrac{1}{2} \log\det(\Theta) \;-\; \tfrac{1}{2}\operatorname{trace}(\hat{S}\Theta) \;-\; \tfrac{P}{2} \ln(2\pi),$$

where Ŝ = Y Y⊤/T. Taking L(·) = −log det(Θ) + trace(ŜΘ) gives (in the setting where T > P) a well-behaved smooth, convex function describing how well a given parameterisation Θ represents the data Y.
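
For concreteness, this loss can be evaluated directly; the following is a minimal numpy sketch (our own illustration, with hypothetical variable names):

```python
import numpy as np

def ggm_loss(Theta, S_hat):
    """Scaled negative log-likelihood -log det(Theta) + trace(S_hat @ Theta),
    dropping the constant (P/2) ln(2*pi) term."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    return -logdet + np.trace(S_hat @ Theta)

# Empirical covariance from T observations of P variables (mu = 0 assumed)
Y = np.random.randn(5, 200)            # P = 5, T = 200, sampled i.i.d. N(0, I)
S_hat = Y @ Y.T / Y.shape[1]           # S_hat = Y Y^T / T

# Edges correspond to non-zero off-diagonal precision entries
Theta = np.linalg.inv(S_hat)
edges = [(i, j) for i in range(5) for j in range(i + 1, 5)
         if abs(Theta[i, j]) > 1e-8]
print(ggm_loss(Theta, S_hat), edges)
```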

3    Penalising complexity

If one considers Eq. (1) with the function R(·) = 0, i.e. no complexity penalty, then the precision matrix estimator

$$\hat{\Theta} \;:=\; \operatorname*{arg\,min}_{\Theta \succeq 0,\ \Theta \in \mathbb{R}^{P \times P}} \Big[ -\log\det(\Theta) + \operatorname{trace}(\hat{S}\Theta) \Big]$$

demonstrates some undesirable properties indicative of over-fitting:

– The estimator exhibits large variance when T ≈ P and is very sensitive to changes in the observations, leading to poor generalisation performance.
– In the high-dimensional setting (P > T), the sample estimator is rank deficient (rank(Ŝ) < P) and there is no unique estimator Θ̂.

In order to avoid estimating a complete GGM graph (where all vertices are connected to each other), one must actively select edges according to some criterion. In the asymptotic setting where T ≫ P we can test for the significance of edges by considering the asymptotic distribution of the empirical partial correlation coefficients ($\rho_{ij} = -\Theta_{ij} / (\Theta_{ii}^{1/2}\, \Theta_{jj}^{1/2})$) [4]. However, such a procedure cannot be performed in the high-dimensional setting (this is important for the dynamical extensions, see Sec. 4) as we require that the empirical estimate be positive semi-definite.
    An alternative approach to testing is to consider prior knowledge about the number of edges in the graph. If we assume a flat prior on the model M and parameters Θ(M), maximising the approximate posterior probability over models P(M|Y) then leads to the Bayesian information criterion for GGM [5]: $\mathrm{BIC}(\hat{\Theta}_{ML}) = N\big(-\log\det(\hat{\Theta}_{ML}) + \operatorname{trace}(\hat{S}\hat{\Theta}_{ML})\big) + \hat{p}\log(N)$, where p̂ is given by the number of unique non-zeros within the ML-estimated precision matrix Θ̂_ML. Unfortunately, interpreting BIC under the framework in Eq. (1), we find the complexity penalty R(·) = p̂ log(N) is non-convex (p̂ ∝ ∥Θ∥₀; it essentially counts the number of estimated edges). In order to arrive at a global minimum, an exhaustive search over the model space (all $O(2^{P^2})$ possible graphs) is required.



    Alternatively, one can place an informative prior on the parameterisation and model (i.e. the GGM sparsity pattern) to encourage a parsimonious representation. One popular approach [6,11,17,20] is to place a Laplace-type prior on the precision matrix in an effort to directly shrink off-diagonal values. Whilst one could choose to perform full Bayesian inference for the posterior P(Θ|Y, γ) (as demonstrated in [17]), a computationally less demanding approach is to perform MAP estimation, resulting in the graphical lasso problem [6]:

$$\hat{\Theta}_{GL} \;:=\; \operatorname*{arg\,min}_{\Theta \succ 0} \Big[ -\log\det(\Theta) + \operatorname{trace}(\hat{S}\Theta) + (\gamma/N)\|\Theta\|_1 \Big], \qquad (2)$$

where $\|\Theta\|_1 = \sum_{1 \le i,j \le P} |\Theta_{i,j}|$ is the ℓ1 norm of Θ. The graphical lasso problem can yet again be interpreted within the general framework, except this time with R(·) = (γ/N)∥Θ∥₁. Unlike BIC, this complexity penalty is convex and thus we can quickly find a global minimum.
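
In practice one need not implement (2) from scratch; for instance, scikit-learn ships a solver for exactly this problem. A brief usage sketch (the regularisation value here is arbitrary):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))     # T = 200 samples, P = 10 variables

# alpha plays the role of the regularisation weight gamma/N in (2)
model = GraphicalLasso(alpha=0.1).fit(X)
Theta_hat = model.precision_

# Recover the estimated edge set from non-zero off-diagonal entries
edges = np.argwhere(np.triu(np.abs(Theta_hat) > 1e-8, k=1))
print(len(edges), "edges selected")
```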

4    Introducing dynamics
In this section we extend the basic GGM model to a dynamic setting whereby the estimated graph is permitted to change as a function of time. Consider the P-variate time-series data Y ∈ R^{P×T} as before; however, we now permit the generative distribution to be a function of time, i.e.:

$$(Y_1^t, \dots, Y_P^t) \sim \mathcal{N}(0, \Sigma^t); \qquad (3)$$

the challenge is now to learn a GGM via (Σ^t)⁻¹ for each time point t = 1, . . . , T. Clearly such a model is far more flexible than the identically distributed version: instead of O(P²) parameters we now have O(P²T). In such a semi-parametric model the potential complexity can scale with the amount of data we have available. Our aim is to harness this additional flexibility to identify potential changes within the graphical models which may shed light on the dynamics of the data-generating system.

Local kernel/window estimation
Zhou et al. [20] consider the dynamic GGM model in a continuous setting such that the underlying graphs are assumed to vary smoothly as a function of time. To provide a local estimate of the covariance they suggest the estimator $\hat{S}(t) = \sum_s w_{st}\, y_s y_s^\top \big/ \sum_s w_{st}$, where $w_{st} = K(|s - t|/h_T)$ are weights derived from a symmetric non-negative kernel (typically one may use a box-car or Gaussian function) with bandwidth h_T. The idea is that by replacing Ŝ with Ŝ(t) in the graphical lasso problem (Eq. 2) it is possible to obtain a temporally localised estimate of the graph Θ̂(t)_GL. Given some smoothness conditions on the true covariance matrices one can demonstrate [20] that the estimator is consistent (the estimator risk converges in probability, $R(\hat{\Sigma}(t)) - R(\Sigma^*(t)) \xrightarrow{P} 0$) even in the dynamic (non-identically distributed) case.
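
A minimal sketch of this local estimator, assuming a Gaussian kernel (the function name and defaults are ours):

```python
import numpy as np

def local_covariance(Y, t, h):
    """Kernel-weighted covariance S_hat(t) = sum_s w_st y_s y_s^T / sum_s w_st,
    with w_st = K(|s - t| / h) for a Gaussian kernel K. Y is P x T."""
    P, T = Y.shape
    s = np.arange(T)
    w = np.exp(-0.5 * ((s - t) / h) ** 2)   # Gaussian kernel weights
    return (Y * w) @ Y.T / w.sum()          # weighted sum of y_s y_s^T

Y = np.random.randn(10, 500)
S_t = local_covariance(Y, t=250, h=20.0)    # plug into (2) in place of S_hat
```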

Piecewise constant GGM
The seminal work by Zhou et al. [20] focused on the setting where graphs continuously and smoothly evolve over time. However, there are many situations where we might expect the smoothness assumptions to be broken. Our research [7,8,9] focuses on how we can incorporate different smoothness assumptions when estimating dynamic GGM. In particular we wish to study piecewise constant GGM where the generative distribution is strictly stationary within regions separated by a set of changepoints T = {τ₁, . . . , τ_K}, τᵢ ∈ {1, . . . , T}, such that:

$$P(Y^t) = P(Y^{t+i}) \quad \forall\ t,\, (t+i) \in \{\tau_k, \dots, \tau_{k+1}\} \ \text{for}\ k = 0, \dots, K-1.$$

    If we keep the Gaussian assumption of Eq. (3), then estimation relates to finding a set of K − 1 GGM describing the distribution between changepoints. Such a definition extends the usual definition of a changepoint [13] to multivariate distributions; it is expected that the number of changepoints should be small relative to the total period of measurement, i.e. K ≪ T, and that such points may lead to insight about changes within the observed systems.
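
To make the model concrete, one can simulate from it; a small sketch with one changepoint and arbitrary (hypothetical) precision matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
P, T, tau = 3, 200, 100                 # one changepoint at tau

# Two precision matrices: an edge (0,1) before tau, an edge (1,2) after
Theta1 = np.array([[1.0, 0.4, 0.0], [0.4, 1.0, 0.0], [0.0, 0.0, 1.0]])
Theta2 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.4], [0.0, 0.4, 1.0]])

Y = np.empty((P, T))
for t in range(T):
    Sigma = np.linalg.inv(Theta1 if t < tau else Theta2)
    Y[:, t] = rng.multivariate_normal(np.zeros(P), Sigma)
```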

5    Structure learning with dynamic GGM
Our approach to searching for changepoints falls naturally into the M-estimation framework of Eq. (1). As has already been discussed, appropriate complexity penalties R(·) may act to induce sparsity in a given set of parameters. We propose two sparsity-aware estimators that use such properties not only to estimate the graphical model, but also to jointly extract a sparse set of changepoints.

Independent Fusing
Our first approach (see [7,9]; also related to [14,3,18]) constructs a model fit function $L(\Theta, Y) = \sum_{t=1}^T \big( -\log\det(\Theta^t) + \operatorname{tr}(\hat{S}^t \Theta^t) \big)$, where $\hat{S}^t = y^t (y^t)^\top / 2$ is an estimate of the covariance for a specific time t. Clearly, there is not enough information within Ŝ^t to recover a graph, as we are effectively trying to estimate with only one data point. To solve this problem we introduce an explicit prior on the smoothness of the graph via a complexity function

$$R_{IFGL}(\Theta) = \lambda_1 \sum_{t=1}^T \|\Theta^t\|_1 + \lambda_2 \sum_{t=2}^T \|\Theta^t - \Theta^{t-1}\|_1, \qquad (4)$$

where λ₁, λ₂ control the level of sparsity and the number of changepoints in the model. Unlike in the work of Zhou et al., our prior encodes an assumption that the model has a piecewise constant parameterisation (this is similar to the fused lasso, see [16,10]). We refer to the problem $\{\hat{\Theta}^t\}_{t=1}^T = \operatorname*{arg\,min}_{\Theta \succeq 0} \big[ L(\cdot) + R_{IFGL}(\cdot) \big]$, as defined above, as the independently fused graphical lasso (IFGL); it estimates changepoints at an individual edge level, such that changepoints do not necessarily coincide between edges.
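
As a sketch (our own illustration, not the estimation algorithm itself), the IFGL objective for a candidate sequence {Θ^t} can be evaluated as follows:

```python
import numpy as np

def ifgl_objective(Thetas, Y, lam1, lam2):
    """IFGL objective: per-time fit terms plus the penalty in (4).
    Thetas: list of T (P x P) matrices; Y: P x T data."""
    T = Y.shape[1]
    fit = 0.0
    for t in range(T):
        S_t = np.outer(Y[:, t], Y[:, t]) / 2   # single-point estimate, /2 as in the text
        sign, logdet = np.linalg.slogdet(Thetas[t])
        fit += -logdet + np.trace(S_t @ Thetas[t])
    sparsity = lam1 * sum(np.abs(Th).sum() for Th in Thetas)
    fusion = lam2 * sum(np.abs(Thetas[t] - Thetas[t - 1]).sum()
                        for t in range(1, T))   # element-wise l1 on differences
    return fit + sparsity + fusion
```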

Group Fusing
Sometimes we have a priori knowledge that particular variables may change in a grouped manner, that is, changepoints across the edges which connect variables may coincide. Examples might include genes associated with a specific biological function (see the example in Fig. 1), or stocks within a given asset class. In order to encode such prior structure for changepoints one can adapt the smoothing prior to act over a group of edges; for such cases we suggest the group-fused graphical lasso (GFGL) penalty [9]:

$$R_{GFGL}(\Theta) = \lambda_1 \sum_{t=1}^T \|\Theta^t\|_1 + \lambda_2 \sum_{t=2}^T \|\Theta^t - \Theta^{t-1}\|_2. \qquad (5)$$
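
Relative to (4), only the fusion term changes: the element-wise ℓ1 norm on differences is replaced by an ℓ2 norm over the whole difference matrix, which couples the edges so that their changepoints tend to coincide. A sketch of the penalty (interpreting ∥·∥₂ as the Frobenius norm of the difference, our reading):

```python
import numpy as np

def gfgl_penalty(Thetas, lam1, lam2):
    """GFGL penalty (5): l1 sparsity plus an l2 norm on whole-matrix differences."""
    sparsity = lam1 * sum(np.abs(Th).sum() for Th in Thetas)
    fusion = lam2 * sum(np.linalg.norm(Thetas[t] - Thetas[t - 1])  # Frobenius norm
                        for t in range(1, len(Thetas)))
    return sparsity + fusion
```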


Optimisation
Both IFGL and GFGL form non-smooth convex optimisation problems which can be tackled within a variety of optimisation schemes. We have developed an alternating direction method of multipliers (ADMM) algorithm that efficiently solves both of the above problems by taking advantage of subtle separability properties of the estimators. Typically, one can solve for several changepoints (K = 1, . . . , ∼10) on problems of size T ≈ 100–1000, P ≈ 10–100 in a few minutes. Due to the convex formulation, scaling is linear in time, O(TP³K²), for GFGL, which is a considerable advantage when compared to the quadratic time complexity of dynamic programming approaches [1].
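
Our ADMM algorithm for IFGL/GFGL is not reproduced here; as a hedged illustration of the general approach, the sketch below applies the standard ADMM splitting to the static graphical lasso (2), alternating a log-det proximal step, element-wise soft-thresholding, and a dual update:

```python
import numpy as np

def graphical_lasso_admm(S, lam, rho=1.0, n_iter=200):
    """ADMM for min -log det(Theta) + tr(S Theta) + lam * ||Theta||_1,
    split as Theta = Z (the classical scheme for the static problem (2))."""
    P = S.shape[0]
    Z = np.eye(P)
    U = np.zeros((P, P))
    for _ in range(n_iter):
        # Theta-update: prox of -log det + tr(S .), solved by eigendecomposition
        evals, Q = np.linalg.eigh(rho * (Z - U) - S)
        theta_evals = (evals + np.sqrt(evals ** 2 + 4 * rho)) / (2 * rho)
        Theta = Q @ np.diag(theta_evals) @ Q.T
        # Z-update: element-wise soft-thresholding at lam / rho
        A = Theta + U
        Z = np.sign(A) * np.maximum(np.abs(A) - lam / rho, 0.0)
        # Dual update
        U += Theta - Z
    return Z   # sparse precision estimate
```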


[Figure 1: (a) graphical models estimated by GFGL across the Drosophila life-cycle stages (Embryo, Larva, Pupa, Adult); (b) F1-score graph-recovery curves for GFGL and IFGL, plotted against T (P = 10) and against P (T = 20).]

Fig. 1: a) Example of estimated graphical models (using GFGL) describing gene dependency through the life-cycle of Drosophila melanogaster (the common fruit fly). The generative distribution is assumed to be stationary between the blue bars, which indicate changepoints. The dashed line indicates the number of active edges in the recovered graph. b) Results from an empirical study [9] considering how recovery of the graph changes with problem size.



6     Conclusion
To date, we have examined some properties of the IFGL and GFGL estimators in an empirical setting (see Fig. 1). Through the use of a wavelet framework, our work [8] has also considered how one could allow for trends and changes in the mean parameter for dynamic GGM. Empirical results suggest some desirable properties for the proposed estimators (graph recovery improves when one increases the size and amount of data available within the stationary segments, see Fig. 1b); however, we have yet to examine the theoretical consistency properties. Theoretical analysis is complicated by the fact that we regularise in multiple directions (over the graph and over time); it is possible that some insight in this direction can be gained from results in the regression setting [19].

References
1. D. Angelosante and G. B. Giannakis. Sparse graphical modeling of piecewise-stationary time series. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
2. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
3. P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2013.
4. M. Drton and M. D. Perlman. Model selection for Gaussian concentration graphs. Biometrika, 2004.
5. R. Foygel and M. Drton. Extended Bayesian information criteria for Gaussian graphical models. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23. 2010.
6. J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2008.
7. A. J. Gibberd and J. D. B. Nelson. High dimensional changepoint detection with a dynamic graphical lasso. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
8. A. J. Gibberd and J. D. B. Nelson. Estimating multi-resolution dependency graphs within a locally stationary wavelet framework. In review, 2015.
9. A. J. Gibberd and J. D. B. Nelson. Regularized estimation of piecewise constant Gaussian graphical models: the group-fused graphical lasso. In review, 2015.
10. Z. Harchaoui and C. Lévy-Leduc. Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 2010.
11. J. Lafferty, H. Liu, and L. Wasserman. Sparse nonparametric graphical models. Statistical Science, 2012.
12. S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.
13. M. A. Little and N. S. Jones. Generalized methods and solvers for noise removal from piecewise constant signals. II. New methods. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2011.
14. R. P. Monti, P. Hellyer, D. Sharp, R. Leech, C. Anagnostopoulos, and G. Montana. Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage, 2014.
15. S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 2012.
16. R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2005.
17. H. Wang. Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis, 2012.
18. S. Yang, Z. Pan, X. Shen, P. Wonka, and J. Ye. Fused multiple graphical lasso. arXiv, 2012.
19. B. Zhang, J. Geng, and L. Lai. Multiple change-points estimation in linear regression models via sparse group lasso. IEEE Transactions on Signal Processing, 2015.
20. S. Zhou, J. Lafferty, and L. Wasserman. Time varying undirected graphs. Machine Learning, 2010.