=Paper=
{{Paper
|id=Vol-3186/paper6
|storemode=property
|title=Automatic Time-Series Clustering via Network Inference
|pdfUrl=https://ceur-ws.org/Vol-3186/paper_6.pdf
|volume=Vol-3186
|authors=Kohei Obata
|dblpUrl=https://dblp.org/rec/conf/vldb/Obata22
}}
==Automatic Time-Series Clustering via Network Inference==
Kohei Obata, Yasuko Matsubara, Koki Kawabata and Yasushi Sakurai
SANKEN, Osaka University

Abstract
Given a collection of multidimensional time-series that contains an unknown type and number of network structures between variables, how efficiently can we find typical patterns and their points of variation? How can we interpret important relationships with the obtained patterns? In this paper, we propose a new method of model-based clustering, which we call network clustering via graphical lasso (NGL). Our method has the following properties: (a) Interpretable: it provides interpretable network structures and cluster assignments for the data; (b) Automatic: it determines the optimal cut points and the number of clusters automatically; (c) Accurate: it provides reliable clustering performance thanks to the automated algorithm. We evaluate our NGL algorithm on both real and synthetic datasets, obtaining interpretable network structure results and outperforming state-of-the-art baselines in terms of accuracy.

Keywords: time-series, network structure, graphical lasso

1. Introduction

Many applications generate time-series data, including those from automobiles [1], biology, social networks, and finance. In most cases, these data are multidimensional, and it is important to find typical patterns, each of which has a specific network structure. In practice, real-life data contain multiple distinct patterns with different network structures. For example, automobile sensor data from a driving session can be composed of some basic actions and some abrupt actions (i.e., going straight, turning right, turning left, slowing down, sudden braking, sudden turning). The network structure is equal to the graph structure: sensors can be represented as nodes, and sensor interactions can be represented as edges. For a turning action, lateral acceleration and steering angle may share an edge; for a braking action, brake pedal stroke and longitudinal acceleration may share an edge.

In this paper, we focus on finding network structures automatically from multidimensional time-series data. Understanding the structure of these networks is useful because it allows us to devise models of sensor interaction, which can be used to analyze behaviours such as fuel-efficient driving. However, the data contain many network structures that change over time, and it is difficult to find meaningful segmentation points since no one knows in advance how the data change. Moreover, the number of clusters should be selected automatically, both to find abrupt changes and to allow an extension to online learning, because in most cases we do not know the optimal number of clusters in advance.

In this paper, we propose an automatic algorithm, called network clustering via graphical lasso (NGL), which enables us to summarize multidimensional time-series into meaningful patterns efficiently based on the graphical lasso problem. Intuitively, the problem we wish to solve is as follows.

Informal Problem 1. Given a large collection of multidimensional time-series data X with underlying network structures, find a compact description of X, which consists of:
1. a set of segments and their cut points;
2. a set of segment groups (i.e., clusters) of similar network structures;
3. the optimal number of clusters.

Contrast with Competitors. We compare NGL with existing methods from the viewpoint of network inference. Network estimation with time-series information has been studied as a method for analyzing economic data and biological signal data because of the high interpretability of its graphical model [2]. Graphical lasso [3, 4] is a network estimation method that provides an interpretable sparse inverse covariance matrix thanks to the ℓ1-norm. Time-varying graphical lasso (TVGL) [5] is a network estimation method that takes time information into account. Although it can find change points by comparing the network structure before and after a change, it cannot find clusters. Toeplitz inverse covariance-based clustering (TICC) [6] and the time adaptive Gaussian model (TAGM) [7] are clustering methods based on network structure. TICC uses Markov random fields (MRFs) and Toeplitz matrices to capture the inherent relationships among variables. TAGM is a fusion of a hidden Markov model (HMM) and a Gaussian mixture model (GMM). These methods find clusters depending on the network structure of each subsequence, which provides clusters with interpretability and allows us to discover patterns that other traditional clustering methods are unable to find. Both incorporate a graphical lasso to capture the interaction between variables but require the number of clusters to be specified as prior information.

(VLDB'22 PhD Workshop, September 5, 2022, Sydney, Australia. Contact: obata88@sanken.osaka-u.ac.jp (K. Obata); yasuko@sanken.osaka-u.ac.jp (Y. Matsubara); koki@sanken.osaka-u.ac.jp (K. Kawabata); yasushi@sanken.osaka-u.ac.jp (Y. Sakurai). Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.)
Consequently, only our approach satisfies the need for interpretability while finding the optimal number of clusters automatically.

Contributions. The main contribution of this work is the concept and design of NGL, which has the following desirable properties:
1. Interpretable: NGL provides the underlying graphical structures and cluster assignments in the data, which help us to interpret important relationships between variables.
2. Automatic: We formulate data encoding schemes to reveal distinct patterns/clusters, each of which captures a network structure. The proposed method requires no parameter tuning to specify the number of clusters.
3. Accurate: NGL is a simple yet powerful algorithm for time-series segmentation using a graphical lasso, which outperforms state-of-the-art competitors in terms of accuracy.

2. Preliminary

In this paper we investigate automatic network-structure clustering for large multidimensional time-series data. This section describes a few key concepts and background materials.

2.1. Problem definition

Consider a set of d-dimensional time-series data consisting of n sequential bundle observations, X = {x_1, x_2, ..., x_n}, where there are |x_i| ≥ 1 different observations at each time i. Here, x_i ∈ R^d is the i-th multidimensional bundle observation, and each bundle observation vector is sampled from one of K multivariate normal distributions, x_i ~ N(0, Σ_i). The network structure is equal to the graph structure: given x_i ~ N(0, Σ_i), each variable represents a node, and the covariance matrix Σ_i forms the edges. Our goal is to find the cluster assignments of these n bundle observations into K clusters F = {F_1, F_2, ..., F_K}, where F_j (j = 1, 2, ..., K) is a cluster assignment set of x_i ∈ X, represented by a set of K matrices, i.e., Θ = {θ_1, θ_2, ..., θ_K}. Therefore, letting C_j = {θ_j, F_j} be a compact description for θ_j in the j-th cluster, the full parameter set that we want to estimate consists of C = {C_1, C_2, ..., C_K} and the number of clusters K.

2.2. Graphical lasso problem

We first consider the static inference that estimates θ_i. The optimization problem is written as follows:

  minimize_{θ_i ∈ S^p_++}  −ll(x_i, θ_i) + λ ||θ_i||_{od,1},
  ll(x_i, θ_i) = |x_i| (log det θ_i − tr(S_i θ_i)),

where S_i = (1/|x_i|) Σ_{j=1}^{|x_i|} x_j x_j^T is the empirical covariance, the x_j are the different samples, S^p_++ is the space of positive definite matrices, λ determines the sparsity level of the network, and || · ||_{od,1} is the off-diagonal ℓ1-norm. This is a convex optimization problem, which imposes the ℓ1-norm restriction.

2.3. TVGL problem

To infer a time-varying sequence of networks, TVGL [5] extends the above approach and is designed to infer a set of inverse covariance matrices Θ. TVGL solves the problem below:

  minimize_{θ_i ∈ S^p_++}  Σ_{i=1}^{n} ( −ll(x_i, θ_i) + λ ||θ_i||_{od,1} ) + β Σ_{i=2}^{n} ψ(θ_i − θ_{i−1}),

where β determines how strongly correlated neighboring covariance estimations should be, and the penalty function ψ encourages similarity between θ_i and θ_{i−1}; different choices of ψ enforce different restrictions on the time-varying similarity. This problem is solved by employing the alternating direction method of multipliers (ADMM) [8], a well-established method for solving convex optimization problems; for more details, see, e.g., [5]. Although TVGL can find a change point by comparing θ_i and θ_{i−1}, it cannot simultaneously find clusters. Throughout this paper, our method uses TVGL to optimize the graphical lasso problem.
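For concreteness, the objectives of Sections 2.2 and 2.3 can be evaluated numerically. The sketch below (function names are ours, not from the paper) computes ll, the off-diagonal ℓ1-norm, and the TVGL objective for a given sequence of precision matrices; it only evaluates the objective and does not solve the optimization, which the paper delegates to ADMM.

```python
import numpy as np

def log_likelihood(X, theta):
    """ll(x_i, theta_i) = |x_i| * (log det theta - tr(S theta)),
    with S the empirical covariance of the samples in X (zero mean assumed)."""
    n = X.shape[0]
    S = X.T @ X / n                       # empirical covariance
    sign, logdet = np.linalg.slogdet(theta)
    assert sign > 0, "theta must be positive definite"
    return n * (logdet - np.trace(S @ theta))

def od_l1(theta):
    """Off-diagonal l1-norm ||theta||_{od,1}."""
    return np.abs(theta).sum() - np.abs(np.diag(theta)).sum()

def tvgl_objective(Xs, thetas, lam, beta, psi=lambda D: np.abs(D).sum()):
    """Sum of penalized negative log-likelihoods plus the temporal penalty
    beta * sum_i psi(theta_i - theta_{i-1}); psi defaults to an element-wise
    l1 penalty, one of several choices discussed in the TVGL paper."""
    obj = sum(-log_likelihood(X, th) + lam * od_l1(th)
              for X, th in zip(Xs, thetas))
    obj += beta * sum(psi(thetas[i] - thetas[i - 1])
                      for i in range(1, len(thetas)))
    return obj
```

A smaller objective means the precision sequence explains the data better under the chosen sparsity (λ) and smoothness (β) trade-offs.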
3. Algorithms

The previous section described how to estimate a set of inverse covariance matrices Θ. The remaining questions are (a) how to describe the model, (b) how to find optimal cut points, and (c) how to assign segments to optimal clusters. There are three main ideas behind our model:

1. Model description cost: We use the minimum description length (MDL) principle as a model selection criterion for choosing between alternative segmentation and cluster descriptions. We propose a novel cost function to estimate the description cost of the graphical lasso model.
2. CutPointSearch: We modify the generic bottom-up algorithm [9] to enhance its ability to handle time-series data. Initially we adopt short segments and iteratively merge adjacent pairs that satisfy the cost restriction.

3. NGL: We use the EM algorithm to cluster the segments obtained by CutPointSearch while determining the optimal number of clusters automatically.

3.1. Model description cost

The MDL principle describes the model parsimoniously in terms of the required number of bits; it thus follows the assumption that the more we can compress the data, the better we can generalize its underlying structure. In a nutshell, we want to find the minimum number of graphical lasso models needed to express the data. The goodness of a model C can be described as follows:

  <X; C> = <C> + <X|C>,   (1)

where <C> is the cost of describing the model C, and <X|C> is the cost of describing the data X given the model C.

Model Coding Cost. The description complexity of model C is the sum of the following elements. The number of clusters K requires log*(K), where log* is the universal code length for integers. The total number of observations in each cluster requires Σ_{j=1}^{K} log*(|F_j|). The mean value of each cluster, which has size d × 1, requires Σ_{j=1}^{K} (d × c_F). The inverse covariance matrix of each cluster, which has size d × d, requires Σ_{j=1}^{K} ( |θ_j|_{≠0} (2 log d + c_F) + log*(|θ_j|_{≠0}) ), where |·|_{≠0} is the number of non-zero elements in a matrix and c_F is the floating-point cost (4 × 8 bits in our setting).

Data Coding Cost. Given a model C, the encoding cost of the data X using Huffman coding [10] is computed by <X|C> = Σ_{j=1}^{K} −ll(x_j, θ_j). Our next goal is to find the best model C that minimizes Equation (1).

3.2. Automatic cut point detection

So far, we have described how we calculate the MDL cost of our model. The next question is how to find optimal cut points that minimize the MDL cost efficiently; we still have numerous merge candidates with which to summarize similar subsequences into a compact model, and thus we modify the bottom-up algorithm to prevent a pattern explosion. We answer this question in two steps, MergeSegment and CutPointSearch.

3.2.1. MergeSegment (inner loop)

Assuming that neighboring segments tend to belong to the same cluster, we update cut points through MergeSegment. We consider given cut points cp = {t_0, t_1, ..., t_m} and the set of inverse covariance matrices of each segment Θ_S = {θ_{·,t_0}, θ_{t_0,t_1}, ..., θ_{t_m,·}}, where the number of segments is m + 1, together with the sets of inverse covariance matrices built from only the even- or odd-numbered cut points, Θ_E = {θ_{·,t_0}, θ_{t_0,t_2}, ...} and Θ_O = {θ_{·,t_1}, θ_{t_1,t_3}, ...}. Here, x_{t_i,t_j}, C_{t_i,t_j}, and θ_{t_i,t_j} denote the data, model, and inverse covariance matrix between cut points t_i and t_j. Our goal is to determine whether a segment should be merged with its neighboring segment. As shown in Figure 1, we have three candidates for the updated cut points: (a) Solo keeps all three segments separated, while (b) Left and (c) Right have two segments, in which one side is merged. We compare the MDL costs of these three cases using Equation (1), (a) vs. (b) vs. (c), and select the cut points that minimize the local MDL cost. For example, if (b) has the lowest cost, t_{i+2} is added to the updated cut points; if (a) has the lowest cost, the previous cut points remain unchanged. We iterate this process throughout the whole sequence.

Figure 1: Illustration of the three candidates. We compare the MDL costs of these candidates.

3.2.2. CutPointSearch (outer loop)

This algorithm finds the optimal cut points. We are given the bundle X and initial cut points cp; the user decides the interval of the initial cut points. Since TVGL enforces time-varying similarity between neighboring networks, we calculate Θ_S, Θ_O, and Θ_E using the TVGL optimization method. After obtaining each Θ, we run the MergeSegment algorithm to update the cut points. We iterate this process until the cut points are stable.
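The bottom-up merge decision can be sketched as follows. This is a simplified pairwise variant of the paper's three-way Solo/Left/Right comparison: for each interior cut point we compare the cost of keeping it against the cost of merging across it. The MDL cost is abstracted as a user-supplied `cost(a, b)` callable (in the paper this would involve a graphical-lasso fit and Equation (1)); the function name and this reduction are ours.

```python
from typing import Callable, List

def merge_segments(cut_points: List[int],
                   cost: Callable[[int, int], float]) -> List[int]:
    """One MergeSegment-style pass: for each interior cut point, compare
    keeping the two segments it separates ("solo") against merging them,
    and drop the cut point whenever merging lowers the local cost.
    cost(a, b) returns the description cost of modeling the segment [a, b)."""
    pts = list(cut_points)
    i = 1
    while i < len(pts) - 1:
        a, b, c = pts[i - 1], pts[i], pts[i + 1]
        solo = cost(a, b) + cost(b, c)   # keep both segments separate
        merged = cost(a, c)              # merge across the cut point
        if merged <= solo:
            del pts[i]                   # merging is cheaper: remove cut point
        else:
            i += 1
    return pts
```

With a cost that charges a fixed per-model overhead, merging always wins and all interior cut points disappear; with a cost that grows superlinearly in segment length, splits are kept, mirroring how the MDL trade-off decides between candidates.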
3.3. Automatic clustering: NGL

Now we have optimal cut points, which means that there is a limited number of segments, each with enough samples to estimate its network structure. Next, we assign segments to clusters and find the optimal number of clusters. As Algorithm 1 shows, we use the EM algorithm to classify each segment. For each iteration we vary K = 1, 2, 3, ..., and minimize the function below:

  arg min Σ_{j=1}^{K} −ll(x_j, θ_j) + λ ||θ_j||_{od,1}.   (2)

In the E-step, we assign each segment to the optimal cluster so that the log-likelihood is maximized. In the M-step, we calculate the θ_j of each cluster using a standard graphical lasso optimization algorithm. We vary K so as to minimize the cost function, and stop as soon as the cost function increases.

Algorithm 1 NGL(X, cp)
1: Input: bundle X, initial cut point set cp
2: Output: cluster parameters Θ and cluster assignments F
3: cp_opt = CutPointSearch(X, cp); K = 1;
4: while improving the total cost <X; C> do
5:   Θ = ModelInitialization(cp_opt, K);
6:   repeat
7:     F = AssignToCluster(X, Θ, cp_opt); /* E-step, Equation (2) */
8:     Θ = GraphicalLasso(X, F); /* M-step */
9:   until convergence;
10:  Compute <X; C>; // C = {Θ, F}
11:  K = K + 1;
12: end while
13: return C = {Θ, F};
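The outer loop of Algorithm 1 can be sketched in Python. This is a minimal stand-in, not the authors' implementation: the M-step fits a ridge-regularized empirical precision matrix instead of a graphical lasso, a BIC-flavoured penalty stands in for the paper's MDL coding cost, and model initialization simply fits one precision matrix per seed segment; all function names are ours.

```python
import numpy as np

def segment_ll(X, theta):
    """ll(x, theta) = |x| (log det theta - tr(S theta)), S = empirical covariance."""
    n = X.shape[0]
    S = X.T @ X / n
    return n * (np.linalg.slogdet(theta)[1] - np.trace(S @ theta))

def fit_precision(segment_list, eps=1e-3):
    """M-step stand-in: pooled empirical precision with a small ridge
    (the paper fits this with a graphical lasso instead)."""
    X = np.vstack(segment_list)
    S = X.T @ X / len(X)
    return np.linalg.inv(S + eps * np.eye(S.shape[0]))

def ngl_like(segments, max_K=5, iters=10):
    """Algorithm-1-style loop: increase K while the total cost keeps improving."""
    d = segments[0].shape[1]
    N = sum(len(s) for s in segments)
    best = None
    for K in range(1, min(max_K, len(segments)) + 1):
        # ModelInitialization stand-in: one precision per seed segment
        thetas = [fit_precision([segments[j]]) for j in range(K)]
        for _ in range(iters):
            # E-step: assign each segment to the cluster maximizing its ll
            labels = np.array([int(np.argmax([segment_ll(s, th) for th in thetas]))
                               for s in segments])
            # M-step: refit each cluster (fall back to all data if empty)
            thetas = [fit_precision([s for s, l in zip(segments, labels) if l == j]
                                    or segments) for j in range(K)]
        cost = -sum(segment_ll(s, thetas[l]) for s, l in zip(segments, labels))
        cost += 0.5 * K * d * (d + 1) / 2 * np.log(N)   # model coding penalty
        if best is not None and cost >= best[0]:
            break                                       # stop once cost worsens
        best = (cost, K, labels)
    return best[1], best[2]
```

On data drawn from two clearly different precision structures, the loop settles on K = 2 because a third cluster buys too little likelihood to pay its coding penalty, which is the mechanism the paper relies on.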
4. Experiments

We evaluate our method on both synthetic and real datasets.

4.1. Accuracy on synthetic data

In this section, we demonstrate the accuracy of NGL on synthetic data, for which there are clear ground-truth networks against which to test accuracy.

Experimental Setup. We randomly generate synthetic multidimensional data in R^5 that follow a multivariate normal distribution N(0, Θ^{-1}). Each of the K clusters has a mean of 0, so that the clustering result is based entirely on the structure of the data. For each cluster, we generate a random ground-truth inverse covariance as follows [11]:
1. Set A ∈ R^{5×5} equal to the adjacency matrix of an Erdős–Rényi directed random graph, where every edge has a 20% chance of being selected.
2. For every selected edge in A, set A_ij ~ Uniform([−0.6, −0.3] ∪ [0.3, 0.6]). We enforce a symmetry constraint whereby A_ij = A_ji.
3. Let λ = λ_min(A) be the smallest eigenvalue of A, and set Θ = A + (0.1 + |λ|)I, where I is the identity matrix.

We run our experiments on four different temporal sequences: "1,2,1", "1,2,3,2,1", "1,2,3,4,1,2,3,4", and "1,2,2,1,3,3,3,1". We generate each dataset 10 times and report the mean and standard deviation of the macro-F1 score.
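The three generation steps above translate directly into NumPy. The function name is ours; the recipe is the paper's, with the diagonal shift guaranteeing a positive definite precision matrix that can be inverted for sampling.

```python
import numpy as np

def random_precision(d=5, p=0.2, seed=0):
    """Ground-truth inverse covariance per Section 4.1: Erdos-Renyi edges,
    symmetric weights in +-[0.3, 0.6], then a diagonal shift of
    0.1 + |lambda_min| to make the matrix positive definite."""
    rng = np.random.default_rng(seed)
    A = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            if rng.random() < p:                      # edge with 20% probability
                w = rng.uniform(0.3, 0.6) * rng.choice([-1.0, 1.0])
                A[i, j] = A[j, i] = w                 # symmetry constraint
    lam_min = np.linalg.eigvalsh(A).min()
    return A + (0.1 + abs(lam_min)) * np.eye(d)       # smallest-eigenvalue shift

theta = random_precision()
samples = np.random.default_rng(1).multivariate_normal(
    np.zeros(5), np.linalg.inv(theta), size=100)      # x ~ N(0, theta^{-1})
```

By construction the smallest eigenvalue of the returned matrix is at least 0.1, so the sampling covariance Θ^{-1} is always well defined.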
Baseline Methods. We compare our method to three state-of-the-art methods and one ablation method. TICC [6] and TAGM [7] take network structure into account; since both need the number of clusters to be specified, we gave the true number of clusters to these methods only. AutoPlait [12] is a multi-level HMM-based automatic method for time-series clustering. NGL no-cps is our NGL method without CutPointSearch. Our method, including NGL no-cps, requires us to specify initial cut points; we set the interval at every 5 points throughout the synthetic experiments.

Clustering Accuracy. We set each segment in each of the examples to have 100 observations in R^5 (for example, "1,2,1" has a total of 300 observations). Table 1 shows the clustering accuracy (macro-F1 score) for each dataset. As shown, NGL significantly outperforms the baselines: our method consistently achieves the highest accuracy and the lowest standard deviation. AutoPlait does not consider network structure, so it does not find any clusters. Although we gave the true number of clusters K to TICC and TAGM, the average accuracy of our method is more than 10% higher. NGL no-cps shows that finding large segments with CutPointSearch has the effect of grouping adjacent observations into the same cluster.

Table 1: Macro-F1 score of clustering accuracy for four different temporal sequences, comparing NGL with state-of-the-art methods (higher is better).

Model           | NGL         | TAGM (KDD'21) | TICC (KDD'18) | AutoPlait (SIGMOD'14) | NGL no-cps
1,2,1           | 0.93 ± 0.05 | 0.83 ± 0.25   | 0.85 ± 0.26   | 0.67                  | 0.62 ± 0.13
1,2,3,2,1       | 0.96 ± 0.03 | 0.74 ± 0.21   | 0.89 ± 0.18   | 0.40                  | 0.66 ± 0.15
1,2,3,4,1,2,3,4 | 0.94 ± 0.03 | 0.78 ± 0.26   | 0.82 ± 0.21   | 0.25                  | 0.66 ± 0.11
1,2,2,1,3,3,3,1 | 0.93 ± 0.05 | 0.89 ± 0.17   | 0.83 ± 0.26   | 0.38                  | 0.62 ± 0.07

Effect of Total Number of Samples. We next focus on the number of samples each method requires to find clusters accurately. We take the "1,2,3,4,1,2,3,4" example and vary the number of samples. As shown in Figure 2, our method outperforms the baselines for almost all segment lengths and maintains a consistently high average even for relatively small segment lengths. This is because our CutPointSearch algorithm correctly finds cut points even when the sample size is small.

Figure 2: Plot of clustering accuracy (macro-F1 score) vs. number of samples for NGL and two other state-of-the-art methods.
4.2. Case study

Here, we show that our NGL provides an interpretable result with real-world financial data. In general, stock, bond, and currency prices are correlated, and by examining historical financial data we can infer a financial network structure that reveals the relationships between them. We use hourly currency exchange rate data (https://github.com/FutureSharks/financial-data) for AUD/USD, EUR/USD, GBP/USD, and USD/CAD from 2005 to 2018. Assuming that the underlying network structure is consistent within a week, we normalized the data for each week, and we set the initial cut points at one-week intervals to capture the weekly correlation trend. The top of Figure 3 shows the clustering result obtained with NGL. During the global financial crisis (from mid-2007 to early 2009), we found that the network structure changed. There are also abrupt changes during 2016/5/16 – 2016/6/5; the bottom of Figure 3 shows how the correlation changed during this period. As we can see, the correlations related to the United Kingdom changed significantly. This was in response to the United Kingdom European Union membership referendum on 2016/6/23, which may have caused public concern.

Figure 3: Clustering result of NGL using the currency datasets.

5. Conclusion and Future work

In this paper, we presented NGL, an interpretable clustering algorithm. We focused on the problem of the interpretable clustering of multidimensional time-series data with underlying network structures. Our proposed NGL indeed exhibits all the desirable properties: it is Interpretable, Automatic, and Accurate.

In future work, we will focus on the following direction: online learning. In several situations, network inference needs to operate in an online fashion, and to the best of our knowledge, no study has dealt with online clustering based on network structure. In this context, we will develop an extension of our methods by utilizing the sliding window and bottom-up (SWAB) algorithm [9].

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research Numbers JP20H00585, JP21H03446, and JP22K17896, NICT 21481014, MIC/SCOPE 192107004, JST-AIP JPMJCR21U4, and ERCA Environment Research and Technology Development Fund JPMEERF20201R02.

References

[1] C. Miyajima, Y. Nishiwaki, K. Ozawa, T. Wakita, K. Itou, K. Takeda, F. Itakura. Driver modeling based on driving behavior and its evaluation in driver identification. Proceedings of the IEEE 95 (2007) 427–437.
[2] F. Tomasi, V. Tozzo, A. Barla. Temporal pattern detection in time-varying graphical models. In: ICPR, 2021, pp. 4481–4488. doi:10.1109/ICPR48806.2021.9413203.
[3] J. Friedman, T. Hastie, R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (2008) 432–441.
[4] F. Tomasi, V. Tozzo, S. Salzo, A. Verri. Latent variable time-varying network inference. In: KDD, 2018, pp. 2338–2346. doi:10.1145/3219819.3220121.
[5] D. Hallac, Y. Park, S. P. Boyd, J. Leskovec. Network inference via the time-varying graphical lasso. In: KDD, 2017, pp. 205–213. doi:10.1145/3097983.3098037.
[6] D. Hallac, S. Vare, S. P. Boyd, J. Leskovec. Toeplitz inverse covariance-based clustering of multivariate time series data. In: KDD, 2017, pp. 215–223. doi:10.1145/3097983.3098060.
[7] V. Tozzo, F. Ciech, D. Garbarino, A. Verri. Statistical models coupling allows for complex local multivariate time series analysis. In: KDD, 2021, pp. 1593–1603. doi:10.1145/3447548.3467362.
[8] S. P. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 (2011) 1–122. doi:10.1561/2200000016.
[9] E. J. Keogh, S. Chu, D. M. Hart, M. J. Pazzani. An online algorithm for segmenting time series. In: ICDM, 2001, pp. 289–296. doi:10.1109/ICDM.2001.989531.
[10] C. Böhm, C. Faloutsos, J.-Y. Pan, C. Plant. RIC: Parameter-free noise-robust clustering. TKDD 1 (2007) 10-es.
[11] K. Mohan, P. London, M. Fazel, D. Witten, S.-I. Lee. Node-based learning of multiple Gaussian graphical models. Journal of Machine Learning Research 15 (2014) 445–488.
[12] Y. Matsubara, Y. Sakurai, C. Faloutsos. AutoPlait: Automatic mining of co-evolving time sequences. In: SIGMOD, 2014, pp. 193–204. doi:10.1145/2588555.2588556.