Introduction

Cities

An Exploratory Analysis of Multiple Multivariate Time Series

Lynne Billard

Ahlame Douzal-Chouakria

Seyed Yaser Samadi

0 0 Department of Mathematics, Southern Illinois University , USA 1 Department of Statistics, University of Georgia , USA 2 Universite Grenoble Alpes , CNRS - LIG/AMA , France

2015

1 2 14

Our aim is to extend standard principal component analysis for non-time series data to explore and highlight the main structure of multiple sets of multivariate time series. To this end, standard variancecovariance matrices are generalized to lagged cross-autocorrelation matrices. The methodology produces principal component time series, which can be analysed in the usual way on a principal component plot, except that the plot also includes time as an additional dimension.

Introduction

dependencies can be used in a principal component analysis. This produces principal component time series, which in turn allows the projection of the original time series observations onto three dimensional principal component by time space. The basic methodology is outlined in Section2, and illustrated in Section 3. 2 2.1

Methodology

Cross-Autocorrelation functions for S > 1 series and p > 1 dimensions Let Xst = {(Xstj ); j = 1; : : : ; p}, t = 1; : : : ; Ns, s = 1; : : : ; S, be a p-dimensional time series of length Ns, for each series s. For notational simplicity, assume Ns = N for all s. Let us also assume the observations have been suitably di erenced/transformed so that the data are stationary.

For a standard single univariate series time series where S = 1 and p = 1, it is well-known that the sample autocovariance function at lag k is (dropping the s = S = 1 and j = p = 1 subscripts) ^(k) = 1 N−k

∑ (Xt − X)(Xt+k − X); k = 0; 1; : : : ; N t=1 X = 1 ∑N Xt;

N t=1 (2.1) and the sample autocorrelation function at lag k is ^(k) = ^(k)=^(0), k = 0; 1; : : :.

These autocorrelation functions provide a measure of the time dependence between observations changes as their distance apart, lag k. They are used to identify the type of model and also to estimate model parameters. See, many of the basic texts on time series, e.g., Box et al. (2011); Brockwell and Davis (1991); Cryer and Chan (2008). Note that the divisor in Eq.(2:1) is N , rather than N − k. This ensures that the sample autocovariance matrix is non-negative de nite.

For a single multivariate time series where S = 1 and p ≥ 1, the crossautocovariance function between variables (j; j′) at lag k is the p × p matrix (k) with elements estimated by and the cross-autocorrelation function between variables (j; j′) at lag k is the p × p matrix, (k), with elements { jj′ (k); j; j′ = 1; : : : ; p} estimated by ^jj′ (k) = ^jj′ (k)={^jj (0)^j′j′ (0)}1=2; k = 0; 1; : : : : (2.3)

Unlike the autocorrelation function obtained from Eq.(2:1) with its single value at each lag k, Eq.(2:3) produces a p × p matrix at each lag k. The function Eq.(2:2) was rst given by Whittle (1963) and shown to be nonsymmetric by Jones (1964). In general, jj′ (k) ̸= j′j (k) for variables j ̸= j′, except for k = 0, but (k) = ′(−k); see, e.g., Brockwell and Davis (1991).

When there are S ≥ 1 series and p ≥ 1 variables, the de nition of Eqs.(2:2)(2:3) can be extended to give a p×p sample cross-autocovariance function matrix between variables (j; j′) at lag k, ^ (k), with elements given by, for j; j′ = 1; : : : ; p;

N1S ∑s=S1 N∑t=−1k(Xstj − Xj )(Xs;t+k;j′ − Xj′ ); k = 0; 1; (2.4) with Xj = N1S ∑s=S1 ∑t=N1 Xstj ; and the p × p sample cross-autocorrelation matrix at lag k, ^( 1 )(k), has elements ^jj′ (k), j; j′ = 1; : : : ; p, obtained by substituting Eq.(2:4) into Eq.(2:3). This cross-autocovariance function in Eq.(2:4) is a measure of time dependence between observations k units apart for a given variable pair (j; j′), calculated across all S series. Notice, the sample means Xj in Eq.(2:4) are calculated across all N S observations.

An alternative approach is to calculate these sample means by series. In this case, the cross-autocovariance matrix ^ (k) has elements estimated by, for j; j′ = 1; : : : ; p, s = 1; : : : ; S,

N1S ∑s=S1 N∑t=−1k(Xstj − Xsj )(Xs;t+k;j − Xsj′ ); k = 0; 1; (2.5) with Xsj = 1 ∑N Xstj ; N t=1 and the corresponding p × p cross-autocorrelation function matrix ^(2)(k) has elements ^jj′ (k) found by substituting Eq.(2:5) into Eq.(2:3).

Other model structures can be considered, which would provide other options for obtaining the relevant sample means. These include class structures, lag k structures, weighted series and/or weighted variable structures, and the like; see Billard et al. (2015). 2.2

Principal Components for Time Series In a standard classical principal component analysis on a set of p-dimensional multivariate observations X = {Xij ; i = 1; : : : n; j = 1; : : : ; p}, each observation is projected into a corresponding th order principal component, P C (i), through the linear combination of the observation's variables,

P C (i) = w 1Xi1 + · · · + w pXip; = 1; : : : ; p; (2.6) where w = (w 1; : : : ; w p) is the th eigenvector of the correlation matrix (or, equivalently for non-standardized observations, the variance-covariance matrix ). The eigenvalues satisfy 1 ≥ 2 ≥ : : : ≥ p ≥ 0, and ∑p=1 = p (or, 2 for non-standardized data). A detailed description of this methodology for standard data can be found in any of the numerous texts on multivariate analysis, e.g., Joli e (1986) and Johnson and Wichern (2007) for an applied approach, and Anderson (1984) for theoretical details.

For time series data, the correlation matrix is replaced by the crossautocorrelation matrix (k), for a speci c lag k = 1; 2; : : : , and the th order principal component of Eq.(2:6) becomes P C (s; t) = w 1Xs1t + · · · + w pXspt; = 1; : : : ; p; t = 1; : : : ; N; s = 1; : : : ; S: (2.7) The elements of (k) can be estimated by ^jj′ (k) from Eq.(2:4) or from Eq.(2:5) (or from other choices of model structure). The problem of non-positive de niteness, for lag k > 0, for the cross-autocorrelation matrix has been studied by Rousseeuw and Molenberghs (1993) and Jackel (2002), with the recommendation that negative eigenvalues be re-set at zero. 3

Illustration

To illustrate, take a data set (<http://dss.ucar.edu/datasets/ds578.5>) where the observations are time series of monthly temperatures at S = 14 cities (weather stations) in China over the years 1923-88. In the present analysis, each month is taken to be a single variable corresponding to the twelve months (January, : : : , December, respectively); hence, p = 12. Clearly, these variables are dependent as re ected in the cross-autocovariances when j ̸= j′.

Let us limit the discussion to using the cross-autocorrelation functions at lag k = 1, evaluated from Eq.(2:4) and Eq.(2:3), and shown in Table 1. We obtain the corresponding eigenvalues and eigenvectors, and hence we calculate the principal components P C , = 1; : : : ; p, from Eq.(2:6). A plot of P C1 × P C2× time is displayed in Figure 1, and that for P C1 × P C3× time is given in Figure 2. An interesting feature of these data highlighted by the methodology is that it is the P C1 × P C3 pair that distinguishes more readily the city groupings. Figure 3 displays the P C1 × P C3 values for all series and all times without tracking time (i.e., the 3-dimensional P C1 × P C3 × time values are projected onto the P C1 × P C3 plane). Hence, we are able to discriminate between cities. Thus, we observe that cities 1-4 (Hailaer, HaErBin, MuDanJiang and ChangChun, respectively), color coded in black (and indicated by the symbol black ◦ and full lines (`lty=1')) have similar temperatures and are located in the north-eastern region of China. Cities 5-7 (TaiYuan, BeiJing, TianJin), identi ed by red (△ and lines − · − (`lty=4')), are in the north, and have similar but di erent temperature trends than do those in the north-eastern region. Two (BeiJing and TianJin) are located close to sea-level, while the third (TaiYuan) is further south (and so might be expected to have higher temperatures) but its elevation is very high so decreasing its temperature patterns to be more in line with BeiJing and TianJin. Cities 8-11 (ChengDu, WuHan, ChangSha, HangZhou), green (∗) with lines · · · (`lty=3'), are located in central regions with ChengDu further west but elevated. Finally, cities 12-14 (FuZhou, XiaMen, GuangZhou), blue ( ) with lines − − − (`lty=8'), are in the southeast part of the country.

Pearson correlations between the variables Xj , j = 1; : : : ; 12, and the principal components P C , = 1; : : : ; 12, sand correlation circles (not shown) show that all months have an impact on P C1 with the months of June, July and August having a slightly negative in uence on P C2. Plots for other k ̸= 1 values give comparable results. Likewise, analyses using the cross-autocorrelations of Eq.(2:5) also produce similar conclusions. 4

Conclusion

The methodology has successfully identi ed cities with similar temperature trends, which trends a priori could not have been foreshadowed, but which do conform with other geophysical information thus con rming the usefulness of the methodology. The cross-autocorrelation functions for a p-dimensional multivariate time series have been extended to the case where there are S ≥ 1 multivariate time series. These replaced the standard variance-covariance matrices for use in a principal component analysis, thus retaining measures of the time dependencies inherent to time series data. The methodology produces principal component time series, which can be compared in the usual way on a principal component plot, except that the plot also includes time as an additional plot dimension.

Figure 1 - Temperature Data: P C1 × P C2 over Time { All Cities, k = 1

40 50 60 70 0 200 400 600 800 0 10 20 30 Time

PC1 Figure 2 - Temperature Data: P C1 × P C3 over Time { All Cities, k = 1

Cities 8−11 Cities 12−14

Cities 1−4 PC1 −400 −200 0 200 400 600

Figure 3 - Temperature

Data: P C1 × P C3 { All

Cities

, All Times, k = 1