Automatic Time-Series Clustering via Network Inference

Kohei Obata

Yasuko Matsubara

Koki Kawabata

Yasushi Sakurai

SANKEN, Osaka University

Given a collection of multidimensional time-series that contains an unknown type and number of network structures between variables, how can we efficiently find typical patterns and their points of variation? How can we interpret important relationships with the obtained patterns? In this paper, we propose a new method of model-based clustering, which we call network clustering via graphical lasso (NGL). Our method has the following properties: (a) Interpretable: it provides interpretable network structures and cluster assignments for the data; (b) Automatic: it determines the optimal cut points and the number of clusters automatically; (c) Accurate: it provides reliable clustering performance thanks to the automated algorithm. We evaluate our NGL algorithm on both real and synthetic datasets, obtaining interpretable network structure results and outperforming state-of-the-art baselines in terms of accuracy.

Keywords: time-series, network structure, graphical lasso

1. Introduction

Many applications generate time-series data, including those used in automobiles [1], biology, social networks, and finance. In most cases, these data are multidimensional, and it is important to find typical patterns, each of which has a specific network structure. In practice, real-life data contain multiple distinct patterns, which differ in their network structures. For example, automobile sensor data from a driving session can be composed of some basic actions and some abrupt actions (e.g., going straight, turning right, turning left, slowing down, sudden braking, sudden turning). The network structure is equal to the graph structure. In this case, sensors can be represented as nodes, and sensor interactions can be represented as edges. For a turning action, lateral acceleration and steering angle may have an edge, and for a braking action, brake pedal stroke and longitudinal acceleration may have an edge.

In this paper, we focus on finding a network structure automatically from multidimensional time-series data. Understanding the structure of these networks is useful because it allows us to devise models of sensor interaction, which can be used to analyze such behaviours as fuel-efficient driving. However, there are many network structures in the data, which change over time, and it is difficult to find a meaningful segmentation point since no one knows how the data change. Moreover, the number of clusters should be selected automatically to find abrupt changes or for an extension to online learning, because in most cases, we do not know the optimal number of clusters in advance.

In this paper, we propose an automatic algorithm, called network clustering via graphical lasso (NGL), which enables us to summarize multidimensional time-series into meaningful patterns efficiently based on the graphical lasso problem. Intuitively, the problem we wish to solve is as follows.

Informal Problem 1. Given a large collection of multidimensional time-series data $X$ with underlying network structures, find a compact description of $X$, which consists of:

1. a set of segments and their cut points
2. a set of segment groups (i.e., clusters) of similar network structures
3. the optimal number of clusters

Contrast with Competitors. We compare NGL with existing methods from the viewpoint of network inference. Network estimation with time-series information has been studied as a method for analyzing economic data and biological signal data because of the high interpretability of its graphical model [2]. Graphical lasso [3, 4] is a network estimation method that provides an interpretable sparse inverse covariance matrix due to the ℓ1-norm. Time-varying graphical lasso (TVGL) [5] is a network estimation method that takes time information into account. Although this method can find change points by comparing the network structure before and after a change, it cannot find clusters. Toeplitz inverse covariance-based clustering (TICC) [6] and time adaptive Gaussian model (TAGM) [7] are clustering methods based on network structure. TICC uses Markov random fields (MRF) and Toeplitz matrices to capture the inherent relationships among variables. TAGM is a fusion of a hidden Markov model (HMM) and a Gaussian mixture model (GMM). These methods find clusters depending on the network structure of each subsequence. This provides clusters with interpretability and allows us to discover patterns that other traditional clustering methods are unable to find. Both incorporate a graphical lasso to capture the interaction between variables but require the number of clusters to be specified as prior information. Consequently, only our approach both satisfies the need for interpretability and finds the optimal number of clusters automatically.

Contributions. The main contribution of this work is the concept and design of NGL, which has the following desirable properties:

1. Interpretable: NGL provides the underlying graphical structures and cluster assignments in data, which help us to interpret important relationships between variables.
2. Automatic: We formulate data encoding schemes to reveal distinct patterns/clusters, each of which captures the network structure. The proposed method requires no parameter tuning to specify the number of clusters.
3. Accurate: NGL is a simple yet powerful algorithm for time-series segmentation using a graphical lasso, which outperforms state-of-the-art competitors in terms of accuracy.

Proceedings of the VLDB 2022 PhD Workshop, September 5, 2022, Sydney, Australia.
Contact: obata88@sanken.osaka-u.ac.jp (K. Obata); yasuko@sanken.osaka-u.ac.jp (Y. Matsubara); koki@sanken.osaka-u.ac.jp (K. Kawabata); yasushi@sanken.osaka-u.ac.jp (Y. Sakurai)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

2. Preliminary

In this paper we investigate automatic network structure clustering for large multidimensional time-series data. We describe a few key concepts and background materials in this section.

2.1. Problem definition

Consider a set of $p$-dimensional time-series data consisting of $T$ sequential bundle observations, $X = \{x_1, x_2, \ldots, x_T\}$, where there are $|x_t| \geq 1$ different observations at each time $t$. $x_t \in \mathbb{R}^{p \times |x_t|}$ is the $t$-th multidimensional bundle observation, and each observation vector in a bundle is sampled from a multivariate normal distribution $\mathcal{N}(0, \Sigma)$. In the network structure, each variable represents a node, and the structure is given by the inverse covariance matrix $\theta = \Sigma^{-1}$, whose non-zero off-diagonal elements form the edges. Our goal is to find the cluster assignments of these bundle observations into $K$ clusters $\mathcal{F} = \{\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_K\}$, where $\mathcal{F}_k \subset \{1, 2, \ldots, T\}$ is a cluster assignment set, and the clusters are represented by a set of matrices, i.e., $\Theta = \{\theta_1, \theta_2, \ldots, \theta_K\}$. Therefore, letting $\mathcal{M}_k = \{\theta_k, \mathcal{F}_k\}$ be a compact description for $X$ in the $k$-th cluster, the full parameter set that we want to estimate consists of $\mathcal{M} = \{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_K\}$ and the number of clusters $K$.

2.2. Graphical lasso problem

We first consider the static inference that estimates $\theta$. The optimization problem is written as follows:

$$\underset{\theta \in S^p_{++}}{\text{minimize}} \; -ll(X, \theta) + \lambda \|\theta\|_{od,1},$$

$$ll(X, \theta) = |X| \left( \log\det\theta - \mathrm{Tr}(S\theta) \right),$$

where $S$ is the empirical covariance $(1/|X|) \sum_{i=1}^{|X|} x_i x_i^{\top}$, $x_i$ are the different samples, $S^p_{++}$ is the space of positive definite matrices, $\lambda$ determines the sparsity level of the network, and $\|\cdot\|_{od,1}$ is the off-diagonal ℓ1-norm. This is a convex optimization problem, which imposes the ℓ1-norm restriction.

2.3. TVGL problem

To infer a time-varying sequence of networks, TVGL [5] extends the above approach and is designed to infer a set of inverse covariance matrices $\Theta$. TVGL solves the problem below:

$$\underset{\Theta \in S^p_{++}}{\text{minimize}} \; \sum_{t=1}^{T} \left( -ll(x_t, \theta_t) + \lambda \|\theta_t\|_{od,1} \right) + \beta \sum_{t=2}^{T} \psi(\theta_t - \theta_{t-1}),$$

where $\beta$ determines how strongly correlated neighboring covariance estimations should be. The penalty function $\psi$ encourages similarity between $\theta_t$ and $\theta_{t-1}$. Different types of $\psi$ allow us to enforce different restrictions in the time-varying similarity. This problem is solved by employing the alternating direction method of multipliers (ADMM) [8], which is a well-established method for solving convex optimization problems. For more details, please see, e.g., [5]. Although TVGL can find a change point by comparing $\theta_t$ and $\theta_{t-1}$, it cannot find clusters simultaneously. Throughout this paper our method uses TVGL to optimize the graphical lasso problem.

3. Algorithms

The previous section described how to estimate the inverse covariance matrices $\Theta$. The remaining questions are (a) how to describe the model, (b) how to find optimal cut points, and (c) how to assign segments to optimal clusters. There are three main ideas behind our model:

1. Model description cost: We use the minimum description length (MDL) principle as a model selection criterion for choosing between alternative segmentation and cluster descriptions. We propose a novel cost function to estimate the description cost of the graphical lasso model.
2. CutPointSearch: We modify the generic bottom-up algorithm [9] to enhance its ability to handle time-series data. Initially we adopt short segments and iteratively merge an adjacent pair that satisfies the cost restriction.
3. NGL: We use the EM algorithm to cluster the segments obtained by CutPointSearch while determining the optimal number of clusters automatically.

3.1. Model description cost

The MDL explains the model in a parsimonious way that calculates the required number of bits. Thus, it follows the assumption that the more we can compress the data, the more we can generalize its underlying structures. In a nutshell, we want to find the minimum number of graphical lasso models needed to express the data. The goodness of the model can be described as follows:

$$\langle X; \mathcal{M} \rangle = \langle \mathcal{M} \rangle + \langle X \mid \mathcal{M} \rangle, \quad (1)$$

where $\langle \mathcal{M} \rangle$ shows the cost of describing the model $\mathcal{M}$, and $\langle X \mid \mathcal{M} \rangle$ represents the cost of describing the data $X$ given the model $\mathcal{M}$.

Model Coding Cost. The description complexity of model $\mathcal{M}$ is the sum of the following elements: The number of clusters $K$ requires $\log^*(K)$, where $\log^*$ is the universal code length for integers. The total number of observations of each cluster requires $\sum_{k=1}^{K} \log^*(|\mathcal{F}_k|)$. The mean value of each cluster, which has a size of $p \times 1$, requires $\sum_{k=1}^{K} (c_F \times p)$. The inverse covariance matrix of each cluster, which has a size of $p \times p$, requires $\sum_{k=1}^{K} |\theta_k|_{\neq 0} (2\log(p) + c_F) + \log^*(|\theta_k|_{\neq 0})$, where $|\cdot|_{\neq 0}$ describes the number of non-zero elements in a matrix and $c_F$ is the floating point cost (we used 4 × 8 bits in our setting).

Data Coding Cost. Given a model $\mathcal{M}$, the encoding cost of the data $X$ using Huffman coding [10] is computed by $\langle X \mid \mathcal{M} \rangle = \sum_{k=1}^{K} -ll(X_{\mathcal{F}_k}, \theta_k)$, where $X_{\mathcal{F}_k}$ denotes the observations assigned to the $k$-th cluster. Our next goal is to find the best model $\mathcal{M}$ that minimizes Equation (1).

3.2. Automatic cut point detection

So far, we have described how we calculate the MDL cost for our model. The next question is how to find optimal cut points that minimize the MDL cost efficiently. We still have numerous candidates to merge when summarizing similar subsequences into a compact model, and thus we modify the bottom-up algorithm to prevent a pattern explosion. We answer this question in two steps, MergeSegment and CutPointSearch.

3.2.1. MergeSegment (inner loop)

Assuming that neighboring segments tend to belong to the same cluster, we update cut points through MergeSegment. We consider given cut points $cp = \{0, 1, \ldots, m\}$ and a set of inverse covariance matrices at each segment $\Theta = \{\theta_{s,0}, \theta_{0,1}, \ldots, \theta_{m,e}\}$ (with $s$ and $e$ marking the start and end of the sequence), where the number of segments is $|cp| + 1$. We also consider the sets of inverse covariance matrices for the segments that are delimited by only the even-numbered or only the odd-numbered cut points, $\Theta_{even} = \{\theta_{s,0}, \theta_{0,2}, \ldots\}$ and $\Theta_{odd} = \{\theta_{s,1}, \theta_{1,3}, \ldots\}$. $X_{i,j}$, $\mathcal{M}_{i,j}$, and $\theta_{i,j}$ are the data, model, and inverse covariance matrix from cut point $i$ to cut point $j$. Our goal is to determine if a segment should be merged with its neighboring segment. As shown in Figure 1, we have three candidates as updated cut points: (a) Solo has three segments all separated, while (b) Left and (c) Right have two segments in which one side is merged. We compare the MDL costs using Equation (1) in these three cases, (a) vs. (b) vs. (c), and select the cut points that minimize the local MDL cost. For example, if (b) has the lowest cost, cut point $i+2$ is added to the updated cut points. If (a) has the lowest cost, there is no change from the previous cut points. We iterate this process throughout the whole sequence.

3.2.2. CutPointSearch (outer loop)

This algorithm finds the optimal cut points. We are now given bundle $X$ and initial cut points $cp$. The user decides the interval of the initial cut points. Since TVGL enforces a time-varying similarity with neighboring networks, we calculate $\Theta$, $\Theta_{even}$, and $\Theta_{odd}$ using the TVGL optimization method. After obtaining each $\Theta$, we run the MergeSegment algorithm to update the cut points. We iterate this process until the cut points are stable.

3.3. Automatic clustering: NGL

Now we have optimal cut points, which means that there are a limited number of segments that have enough samples with which to estimate the network structure. Next, we assign segments to a cluster and find the optimal number of clusters. As Algorithm 1 shows, we use the EM algorithm to classify each segment. For each iteration we vary $K = 1, 2, 3, \ldots$ and minimize the function below:

$$\arg\min \sum_{k=1}^{K} -ll(X_{\mathcal{F}_k}, \theta_k) + \lambda \|\theta_k\|_{od,1}, \quad (2)$$

In the E-step, we assign each segment to the optimal cluster, so that the log likelihood is maximized. In the M-step, we calculate $\theta_k$ of each cluster using a normal graphical lasso optimization algorithm. We keep varying $K$ until the cost function increases, so as to minimize the cost function.
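To make the E-step concrete, the following is a minimal sketch, not the authors' implementation. It assumes the inverse covariance matrix of each cluster is already available (the M-step, which would require a graphical lasso solver, is omitted), and it evaluates the log likelihood in the form used in Section 2.2, i.e., the number of samples times (log det of theta minus the trace of S times theta). The function names `empirical_covariance`, `log_det`, `log_likelihood`, and `e_step`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
import math

def empirical_covariance(samples):
    """S = (1/|X|) * sum_i x_i x_i^T for zero-mean samples (a list of p-vectors)."""
    n, p = len(samples), len(samples[0])
    return [[sum(x[a] * x[b] for x in samples) / n for b in range(p)]
            for a in range(p)]

def log_det(theta):
    """log det of a symmetric positive definite matrix via Cholesky factorization."""
    p = len(theta)
    chol = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = theta[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("theta must be positive definite")
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(p))

def log_likelihood(samples, theta):
    """ll(X, theta) = |X| * (log det theta - Tr(S theta))."""
    S = empirical_covariance(samples)
    p = len(theta)
    trace = sum(S[a][b] * theta[b][a] for a in range(p) for b in range(p))
    return len(samples) * (log_det(theta) - trace)

def e_step(segments, thetas):
    """Assign each segment to the cluster index with the highest log likelihood."""
    return [max(range(len(thetas)), key=lambda k: log_likelihood(seg, thetas[k]))
            for seg in segments]

# Toy example (assumed values): cluster 0 has no edge, cluster 1 couples the two variables.
thetas = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, -1.5], [-1.5, 2.0]],
]
segments = [
    [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)],    # uncorrelated samples
    [(1.0, 1.0), (-1.0, -1.0), (1.2, 0.8), (-0.8, -1.2)],  # positively correlated samples
]
assignments = e_step(segments, thetas)
print(assignments)  # prints [0, 1]
```

In a full EM loop, the assignments would be followed by re-estimating each cluster's matrix from its assigned segments, and the loop would be repeated for increasing numbers of clusters until the cost stops decreasing.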

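As an illustration of the three-way comparison in MergeSegment (Section 3.2.1), the sketch below picks the cheapest of the Solo, Left, and Right candidates for three adjacent segments. It is only a sketch under stated assumptions: `toy_cost` is a hypothetical stand-in (a fixed per-segment model cost plus a squared-deviation data cost on one-dimensional values) and is not the MDL cost of Equation (1), which would require fitting a graphical lasso model to each candidate segment. The reading of Left and Right as "middle segment merged into its left or right neighbor" is also an assumption.

```python
def toy_cost(segment):
    """Stand-in for the cost of one segment (NOT the paper's MDL cost):
    a fixed model cost of 1.0 plus the squared deviation from the segment mean."""
    mean = sum(segment) / len(segment)
    return 1.0 + sum((v - mean) ** 2 for v in segment)

def merge_decision(left, middle, right, cost=toy_cost):
    """Compare the Solo / Left / Right candidates and return the cheapest label."""
    candidates = {
        "solo": cost(left) + cost(middle) + cost(right),  # keep all three segments
        "left": cost(left + middle) + cost(right),        # merge middle into left
        "right": cost(left) + cost(middle + right),       # merge middle into right
    }
    return min(candidates, key=candidates.get)

# The middle segment resembles the left one, so merging to the left is cheapest.
decision = merge_decision([0.0, 0.1, 0.0], [0.05, 0.0, 0.1], [5.0, 5.1, 4.9])
print(decision)  # prints left
```

Passing the cost as a parameter keeps the decision rule separate from the cost model, so a description-length cost could be substituted without changing the comparison itself.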
4. Experiments

We evaluate our method on both synthetic and real datasets.

4.1. Accuracy on synthetic data

In this section, we demonstrate the accuracy of NGL on synthetic data. We do so because there are clear ground truth networks with which to test the accuracy.

Experimental Setup. We randomly generate synthetic multidimensional data in $\mathbb{R}^5$, which follows a multivariate normal distribution $x \sim \mathcal{N}(0, \theta^{-1})$. Each of the clusters has a mean of $\vec{0}$, so that the clustering result is based entirely on the structure of the data. For each cluster, we generate a random ground truth inverse covariance as follows [11]:

1. Set $A \in \mathbb{R}^{5 \times 5}$ equal to the adjacency matrix of an Erdős-Rényi directed random graph, where every edge has a 20% chance of being selected.
2. For every selected edge in $A$, set $A_{ij} \sim \mathrm{Uniform}([-0.6, -0.3] \cup [0.3, 0.6])$. We enforce a symmetry constraint whereby every $A_{ij} = A_{ji}$.
3. Let $c = \min(\lambda(A))$ be the smallest eigenvalue of $A$, and set $\theta = A + (0.1 + |c|) I$, where $I$ is an identity matrix.

We run our experiments on four different temporal sequences: "1,2,1", "1,2,3,2,1", "1,2,3,4,1,2,3,4", "1,2,2,1,3,3,3,1". We generate each dataset 10 times and report the mean and standard deviation of the macro-F1 score.

Baseline Methods. We compare our method with three state-of-the-art methods and one ablation method. TICC [6] and TAGM [7] take network structure into account. Since both methods need the number of clusters to be specified, we gave the true number of clusters only to these methods. AutoPlait [12] is a multi-level HMM-based automatic method for time-series clustering. NGL no-cps is our NGL method without CutPointSearch. Our method, including NGL no-cps, requires us to specify initial cut points. We set the interval at every 5 points throughout the synthetic experiments.

Clustering Accuracy. We set each segment in each of the examples to have 100 observations in $\mathbb{R}^5$ (for example, "1,2,1" has a total of 300 observations). Table 1 shows the macro-F1 scores of clustering accuracy for each dataset. As shown, NGL significantly outperforms the baselines. Our method consistently achieves the highest accuracy and lowest standard deviation. AutoPlait does not consider network structure, so it does not find any clusters. Although we gave the true number of clusters to TICC and TAGM, the average accuracy of our method is more than 10% higher. The NGL no-cps result shows that finding large segments with CutPointSearch is meaningful, because it groups adjacent observations into the same cluster.

[Table 1: Macro-F1 score of clustering accuracy for four different temporal sequences ("1,2,1", "1,2,3,2,1", "1,2,3,4,1,2,3,4", "1,2,2,1,3,3,3,1"), comparing NGL with state-of-the-art methods: TAGM (KDD'21), TICC (KDD'17), AutoPlait (SIGMOD'14), and NGL no-cps.]

Effect of Total Number of Samples. We next focus on the number of samples required for each method to accurately find clusters. We take the "1,2,3,4,1,2,3,4" example and vary the number of samples. As shown in Figure 2, our method outperforms the baselines for almost all segment lengths. Our method has a consistently high average, even for relatively small segment lengths. This is because our CutPointSearch algorithm correctly finds cut points even if the sample size is small.

[Figure 2: ... number of samples for NGL and two other state-of-the-art methods.]

4.2. Case study

Here, we show that NGL provides an interpretable result with real-world financial data. In general, stocks, bonds, and currency prices are correlated. By examining historical financial data, we can infer a financial network structure to reveal the relationships between them. We use hourly currency exchange rate data of AUD/USD, EUR/USD, GBP/USD, and USD/CAD from 2005 to 2018. Assuming that the underlying network structure is consistent for a week, we normalized the data for each week. We also set the initial cut points at a week to capture the weekly correlation trend. The top of Figure 3 shows the clustering result obtained with NGL. During the global financial crisis (from mid-2007 to early 2009), we found that the network structure changed. There are abrupt changes from 2016/5/16 to 2016/6/5; the bottom of Figure 3 shows how the correlation changed during this period. As we can see, a correlation related to the United Kingdom changed significantly. This was in response to the United Kingdom European Union membership referendum on 2016/6/23, which may have caused public concern.

5. Conclusion and Future work

In this paper, we presented NGL, which is an interpretable clustering algorithm. We focused on the problem of the interpretable clustering of multidimensional time-series data with underlying network structures. Our proposed NGL indeed exhibits all the desirable properties; it is Interpretable, Automatic, and Accurate.

In future work, we will focus on the following direction: Online learning. In several situations, network inference needs to operate in an online fashion. To the best of our knowledge, no study has dealt with online clustering based on network structure. In this context, we will develop an extension of our method by utilizing the sliding window and bottom-up (SWAB) algorithm [9].

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research Number MIC/SCOPE 192107004, JST-AIP JPMJCR21U4, ERCA Environment Research and Technology Development Fund JPMEERF20201R02,

References

[1] Miyajima, Nishiwaki, Ozawa, Wakita, Itou, Takeda, Itakura, Driver modeling based on driving behavior and its evaluation in driver identification, IEEE 95 (2007) 427-437.

[2] Tomasi, Tozzo, Barla, Temporal pattern detection in time-varying graphical models, in: ICPR, 2021, pp. 4481-4488. doi: 10.1109/ICPR48806.2021.9413203.

[3] Friedman, Hastie, Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9 (2008) 432-441.

[4] Tomasi, Tozzo, Salzo, Verri, Latent variable time-varying network inference, in: KDD, 2018, pp. 2338-2346. URL: https://doi.org/10.1145/3219819.3220121. doi: 10.1145/3219819.3220121.

[5] Hallac, Park, S. P. Boyd, Leskovec, Network inference via the time-varying graphical lasso, in: KDD, 2017, pp. 205-213. URL: https://doi.org/10.1145/3097983.3098037. doi: 10.1145/3097983.3098037.

[6] Hallac, Vare, S. P. Boyd, Leskovec, Toeplitz inverse covariance-based clustering of multivariate time series data, in: KDD, 2017, pp. 215-223. URL: https://doi.org/10.1145/3097983.3098060. doi: 10.1145/3097983.3098060.

[7] Tozzo, Ciech, Garbarino, Verri, Statistical models coupling allows for complex local multivariate time series analysis, in: KDD, 2021, pp. 1593-1603. URL: https://doi.org/10.1145/3447548.3467362. doi: 10.1145/3447548.3467362.

[8] S. P. Boyd, Parikh, Chu, Peleato, Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (2011) 1-122. URL: https://doi.org/10.1561/2200000016. doi: 10.1561/2200000016.

[9] E. J. Keogh, Chu, D. M. Hart, M. J. Pazzani, An online algorithm for segmenting time series, in: Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November - 2 December 2001, San Jose, California, USA, IEEE Computer Society, 2001, pp. 289-296. URL: https://doi.org/10.1109/ICDM.2001.989531. doi: 10.1109/ICDM.2001.989531.

[10] Böhm, Faloutsos, J.-Y. Pan, Plant, Ric: Parameter-free noise-robust clustering, TKDD 1 (2007) 10-es.

[11] Mohan, P. London, Fazel, Witten, S.-I. Lee, Node-based learning of multiple gaussian graphical models, J. Mach. Learn. Res. 15 (2014) 445-488.

[12] Matsubara, Sakurai, Faloutsos, Autoplait: Automatic mining of co-evolving time sequences, in: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, Association for Computing Machinery, New York, NY, USA, 2014, p. 193-204. URL: https://doi.org/10.1145/2588555.2588556. doi: 10.1145/2588555.2588556.