Proceedings 1st International Workshop on Advanced Analytics and Learning on Temporal Data AALTD 2015 Fuzzy Clustering of Series Using Quantile Autocovariances Borja Lafuente-Rego and Jose A. Vilar Research Group on Modeling, Optimization and Statistical Inference (MODES), Department of Mathematics, Computer Science Faculty, University of A Coruña Abstract. Unlike conventional clustering, fuzzy cluster analysis allows data elements to belong to more than one cluster by assigning member- ship degrees of each data to clusters. This work proposes a fuzzy C– medoids algorithm to cluster time series based on comparing their esti- mated quantile autocovariance functions. The behaviour of the proposed algorithm is studied on different simulated scenarios and its effectiveness is concluded by comparison with alternative approaches. 1 Introduction In classical cluster analysis each datum is assigned to exactly one cluster, thus producing a “hard” partition of the data set into several disjoint subsets. This approach can be inadequate in the presence of data objects that are equally dis- tant to two ore more clusters. Fuzzy cluster analysis allows gradual memberships of data objects to clusters, thus providing versatility to reflect the certainty with which each data is assigned to the different clusters. An interesting overview of present fuzzy clustering methods is provided by [3]. Interest in this approach has increased in recent years. Proof of this is the large amount of publications in this field (e.g. [6] and [5]). In this paper, a fuzzy C–medoids algorithm to cluster time series using the quantile autocovariance functions is proposed. Motivation behind this approach is twofold. First, quantile autocovariances have shown a high capability to clus- ter time series generated from a broad range of dependence models [10]. On the other hand, the use of a fuzzy approach for clustering time series is justified in order to gain adaptivity for constructing the centroids and to obtain a better characterization of the temporal pattern of the series (see discussion in [7]). To illustrate the merits of the proposed algorithm, an extensive simulation study comparing our fuzzy approach with other fuzzy procedures has been carried out. Specifically, we have focused on the classification of heteroskedastic models, which are of great importance in many applications (e.g. to model many finan- cial time series) and have received relatively little attention in the clustering literature. 2 A dissimilarity based on quantile autocovariances  (j) (j) Consider a set of p series S = X (1) , . . . , X (p) , with X (j) = (X1 , . . . , XT ) (j) being a T -length partial realization from a real valued process {Xt , t ∈ Z}. Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes. 2 B. Lafuente-Rego, J.A. Vilar We wish to perform cluster analysis on S in such a way that series with similar generating processes are grouped together. To achieve this goal, we propose to measure dissimilarity between two series by comparing the estimators of their quantile autocovariance functions (QAF), which are formally defined below. Let X1 , . . . , XT an observed stretch of a strictly stationary process {Xt ; t ∈ Z}. Denote by F the marginal distribution of Xt and by qτ = F −1 (τ ), τ ∈ [0, 1], the corresponding quantile function. Fixed l ∈ Z and an arbitrary couple of quantile levels (τ, τ 0 ) ∈ [0, 1]2 , consider the cross-covariance of the indicator functions I (Xt ≤ qτ ) and I (Xt+l ≤ qτ 0 ) given by γl (τ, τ 0 ) = cov {I (Xt ≤ qτ ) , I (Xt+l ≤ qτ 0 )} = P (Xt ≤ qτ , Xt+l ≤ qτ 0 ) − τ τ 0 . (1) Function γl (τ, τ 0 ), with (τ, τ 0 ) ∈ [0, 1]2 , is called quantile autocovariance func- tion of lag l. Replacing in (1) the theoretical quantiles of the marginal distribu- tion F , qτ and qτ 0 , by the corresponding empirical quantiles based on X1 , . . . , XT , q̂τ and q̂τ 0 , we obtain the estimated quantile autocovariance function given by T −l 1 X γ̂l (τ, τ 0 ) = I (Xt ≤ q̂τ ) I (Xt+l ≤ q̂τ 0 ) − τ τ 0 . (2) T − l t=1 As the quantile autocovariances are able to account for high level dynamic (1) (2) features, a simple dissimilarity criterion between two series Xt and Xt con- sists in comparing their estimated quantile autocovariances on a common range of selected quantiles. Thus, for L prefixed lags, l1 , . . . , lL , and r quantile levels, 0 < τ1 < . . . < τr < 1, we construct the vectors Γ (u) , u = 1, 2, given by     (u) (u) (u) (u) Γ (u) = Γ l1 , . . . , Γ lL , with Γ li = γ̂li (τj , τk ); j, k = 1 . . . , r , (3) (1) (2) for i = 1, . . . , L, and γ̂ given in (2). Then, the distance between Xt and Xt is defined as the squared Euclidean distance between their representations Γ (1) and Γ (2) , i.e.   (1) (2) dQAF Xt , Xt Γ (1) − Γ (2) ||22 = ||Γ (4) Computing dQAF for all pairs of series in S allows us to set a pairwise dissim- ilarity matrix, which can be taken as starting point of a conventional hierarchical clustering algorithm. Alternatively, a partitioning clustering, such as the k-means algorithm, can be performed averaging the Γ representations to determine the centroids. Then, dQAF is also used to calculate the distances between series and centroids involved in the iterative refinement of the cluster solution. 3 Fuzzy clustering based on quantile autocovariances Time series are dynamic objects and therefore different temporal patterns may be necessary to characterize the serial behaviour in different periods of time. In Fuzzy Clustering of Series Using Quantile Autocovariances 3 other words, the series are not distributed accurately within a given number of clusters, but they can belong to two or even more clusters. This problem can be adequately treated using a fuzzy clustering procedure, which associates a fuzzy label vector to each element stating its memberships to the set of clusters. In this section we propose a fuzzy C-medoids clustering algorithm for time series by plugging the QAF-dissimilarity introduced in Section 2.  Let S = X (1) , . . . , X (p) be a set of p time series and Γ = Γ (1) , . . . , Γ (p) a set of quantile autocovariances selected to perform n clustering.o The fuzzy C- (1) medoids clustering finds the subset of , Γ Γ̃ = Γ̃ , . . . , Γ̃ (C) , and the p × C matrix of fuzzy coefficients Ω = (ui,c ) that lead to solve the minimization problem: p X C C X 2 X min um ic Γ (i) − Γ (c) , subject to uic = 1 and uic ≥ 0. (5) Γ̃ ,Ω 2 i=1 c=1 c=1 Each uic ∈ [0, 1] represents the membership degree of the i-th series to the c-th cluster and the parameter m > 1 controls the fuzziness of the partition. As the value of m increases, the boundaries between clusters become softer and therefore the classification is fuzzier. If m = 1, the hard version of the clustering procedure is obtained, i.e. uic ∈ {0, 1}, that leads to a classical K-means partition of S. PC The constraints c=1 uic = 1 and uic ≥ 0 ensure that no cluster is empty and that all series are included in the cluster partition. The objective function (5) cannot be minimized directly, and an iterative algorithm that alternately optimizes the membership degrees and the medoids must be used. The update formula for the membership degrees is given by  1 −1 C 2 ! m−1 X Γ (i) − Γ (c) 2 uic =   , for i = 1, . . . , p. (6) 0 Γ (i) − Γ (c0 ) 2 c =1 2 Then, the QAF-based fuzzy C–medoids clustering algorithm is implemented as follows. n o i. Pick an initial set of medoids Γe = Γe (1) , . . . , Γe (C) and the fuzzifier m. ii. Set Γe OLD = Γe . iii. Compute uic using (6). n o iv. Update the medoids, let’s say Γb = Γb (1) , . . . , Γb (C) , by minimizing the objective function with the new uic . Denote by p X 00 0 2 q = argmin1≤i0

0,P q ( i=1 αi + j=1 βj ) < 1. Then, the dissimilarities are defined as follows. 1. Dissimilarity based on the autoregressive representation of the GARCH mod- 0 els [12, 11]. Given X (k) and X (k ) in S, we define R 0 X 2 d2AR (X X (k) , X (k ) ) = πrk − π (b brk0 ) , r=1 Pmin(q,r) with πbrz an estimator of the r-th coefficient πr = (αr +βr )+ j=1 βj πr−j , for the series z, z = k, k 0 . Parameter R determines the maximum number of Fuzzy Clustering of Series Using Quantile Autocovariances 5 autoregressive coefficients πr . A GARCH–based fuzzy C-medoids clustering is proposed in [4] by considering the optimization problem: p X X C R X C X 2 min um ic πri − π (b brc ) , subject to uic = 1 and uic ≥ 0. (8) Π̂,Ω i=1 c=1 r=1 c=1 2. GARCH-based distance measure [1] given by 0 0 −1 dGARCH (XX (k) , X (k ) ) = (Lk − Lk0 ) (Vk + Vk0 ) (Lk − Lk0 ) (9)   with Lj = αb j , βbj the vector of estimated parameters and Vj the esti- mated covariance matrix for Lj , for j = k, k 0 . An alternative GARCH-based fuzzy C-medoids clustering is proposed in [4] by minimizing: p X C h i 0 −1 X um ic (Li − Lc ) (Vi + Vc ) (Li − Lc ) , (10) i=1 c=1 PC subject to c=1 uic = 1 and uic ≥ 0. The three fuzzy clustering algorithms were performed using a fuzziness pa- rameter m = 1.5 on N = 100 trials for each scenario. At each trial, the quality of the clustering procedure was evaluated comparing the experimental cluster solution with the true cluster partition. Two different agreement measures were used, namely the Gavrilov index [8] and the adjusted Rand index [9]. The mean values and standard deviations of these indexes based on the 100 trials using both hard and fuzzy cluster analysis are provided in Table 1. Table 1. Averages and standard deviations (in brackets) of two cluster similarity indexes obtained from 100 trials. Scenario 1 Scenario 2 Gavrilov Adj. Rand Gavrilov Adj. Rand Hard cluster dAR 0.859 (.109) 0.685 (.198) 0.712 (.146) 0.469 (.215) dGARCH 0.574 (.059) 0.286 (.072) 0.504 (.078) 0.137 (.116) dQAF 0.843 (.109) 0.726 (.152) 0.918 (.081) 0.825 (.135) Fuzzy cluster dAR 0.541 (.056) 0.271 (.080) 0.486 (.076) 0.128 (.100) dGARCH 0.553 (.088) 0.241 (.132) 0.535 (.076) 0.188 (.107) dQAF 0.842 (.116) 0.704 (.181) 0.925 (.072) 0.833 (.125) Results from Table 1 show that the metrics based on quantile autocovariances and on the AR representation led to the best scores in Scenario 1 when the hard cluster is carried out. When the fuzzy approaches were considered, the behaviour of the dAR substantially worsened, while the very similar (even somewhat higher) 6 B. Lafuente-Rego, J.A. Vilar results were obtained with dQAF . The worst results were obtained with the GARCH-based dissimilarity both for the hard and the fuzzy versions. The metric based on quantile autocovariances also obtained the best results in Scenario 2, with indexes of agreement above 0.8 and a slight improvement by using the fuzzy clustering. The GARCH-based metrics, dAR and dGARCH where strongly affected by the model misspecification and produced the worst results for both the hard and the fuzzy versions of the cluster analysis. To assess the effect of the fuzziness parameter in the partitions the algorithm was implemented for several values of m. However this results were here omitted due to the limitation of space. 5 A case study In this section, the proposed fuzzy C–medoids clustering algorithm is used to per- form clustering on a set of series of electricity demand. Specifically, our database consists of hourly electricity demand in the Spanish market from 1st January 2011 to 31th December 2012. All data are sourced from the official website of Operador del Mercado Iberico de Energia1 . Records corresponding to Saturdays and Sundays have been removed from the database because electricity demand is lower in the weekends. Thus we have 24 time series (one for each hour of the day) of length T = 731. Since all series are non–stationary in mean, the original series are transformed taking one regular difference. Table 2 presents the membership degrees for the case with two and three clusters. The results obtained for the two–cluster partition formed by C1 = {H24, H1, H2, H3, H4, H5, H6, H7} and C2 grouping the remaining series. The cluster C1 corresponds with the hours of the day where the electricity demand is low, while the C2 identifies the time of the day when the power consumption is greater. In the case of the three–cluster partition, the cluster C1 is divided in two subclusters. One formed with the hours of the day with the lowest demand of electricity, and a second cluster with an intermediate electricity consumption. 6 Concluding remarks In this paper, we focus on the classification of time series featuring a fuzzy clustering algorithm in the framework of a partitioning around medoids. A dissimilarity–based approach is considered. In particular, we propose a C–medoids fuzzy clustering algorithm using an innovative dissimilarity measure based on the quantile autocovariances (dQAF ). The simulation study shows that the proposed dissimilarity produces satisfac- tory results by performing fuzzy cluster analysis. The proposed clustering algo- rithm was tested against two GARCH–based fuzzy clustering algorithm present in the literature in two different heteroskedastic scenarios. The fuzzy clustering algorithm based on dQAF led to the best results. In fact, apart from dQAF , none 1 http://http://www.omel.es/files/flash/ResultadosMercado.swf Fuzzy Clustering of Series Using Quantile Autocovariances 7 Table 2. Membership degrees obtained with QAF–based FCM with m = 1.5 consid- ering 2 and 3 clusters. 2 Clusters 3 Clusters Membership degrees Crisp Membership degrees Crisp C1 C2 C1 C2 C3 H1 0.63044 0.36956 1 1.00000 0.00000 0.00000 1 H2 1.00000 0.00000 1 0.99878 0.00098 0.00024 1 H3 0.98282 0.01718 1 0.99484 0.00395 0.00121 1 H4 0.94118 0.05882 1 0.99229 0.00574 0.00197 1 H5 1.00000 0.00000 1 0.92388 0.06811 0.00801 1 H6 0.99923 0.00077 1 0.30793 0.66185 0.03021 2 H7 0.98282 0.01718 1 0.00000 1.00000 0.00000 2 H8 0.00003 0.99997 2 0.05793 0.09475 0.84733 3 H9 0.00003 0.99997 2 0.00097 0.00294 0.99610 3 H10 0.00077 0.99923 2 0.00045 0.00149 0.99806 3 H11 0.00077 0.99923 2 0.00171 0.00507 0.99323 3 H12 0.00002 0.99998 2 0.00011 0.00049 0.99940 3 H13 0.00002 0.99998 2 0.00027 0.00178 0.99794 3 H14 0.00002 0.99998 2 0.00032 0.00107 0.99861 3 H15 0.00002 0.99998 2 0.00134 0.00940 0.98927 3 H16 0.00002 0.99998 2 0.00088 0.00636 0.99277 3 H17 0.00002 0.99998 2 0.00000 0.00000 1.00000 3 H18 0.00056 0.99944 2 0.00051 0.00498 0.99451 3 H19 0.00000 1.00000 2 0.00002 0.00014 0.99984 3 H20 0.00003 0.99997 2 0.00020 0.00135 0.99846 3 H21 0.00056 0.99944 2 0.00047 0.00384 0.99569 3 H22 0.01718 0.98282 2 0.00136 0.01488 0.98377 3 H23 0.00003 0.99997 2 0.01539 0.13091 0.85370 3 H24 0.99998 0.00002 1 0.00206 0.98054 0.01740 2 of the remaining examined dissimilarities shown acceptable results by cluster- ing heteroskedastic processes, thus emphasizing the usefulness of dQAF in this framework. Note that a limitation of our procedure is that series are assumed to be strictly stationary and hence further research must be carried out. Although we have followed a dissimilarity-based approach, it is worthy to emphasize that model-based techniques can be also an interesting alternative. Likewise the fuzzy approach, the use of probabilistic models such as mixture models (see e.g. [2]) allows us to assign each datum to one single cluster although this assignment relies on a probabilistic approach since the mixing proportions are estimated from the data. Unlike the fuzzy approach, no fuzziness parameter is required by using mixture models, although the model selection problem must be solved in the latter case. References 1. Caiado, J., Crato, N.: A garch-based method for clustering of financial time series: International stock markets evidence. Mpra paper, University Library of Munich, Germany (2007), http://EconPapers.repec.org/RePEc:pra:mprapa:2074 2. Chen, W.C., Maitra, R.: Model-based clustering of regression time series data via apecman aecm algorithm sung to an even faster beat. Statistical Analysis and Data Mining 4(6), 567–578 (2011) 8 B. Lafuente-Rego, J.A. Vilar 3. Döring, C., Lesot, M.J., Kruse, R.: Data analysis with fuzzy clustering methods. Computational Statistics & Data Analysis 51(1), 192 – 214 (2006) 4. D’Urso, P., Cappelli, C., Lallo, D.D., Massari, R.: Clustering of financial time series. Physica A 392(9), 2114–2129 (2013) 5. D’Urso, P., Giovanni, L.D., Massari, R.: Time series clustering by a robust autore- gressive metric with application to air pollution 141(15), 107–124 (2015) 6. D’Urso, P., Giovanni, L.D., Massari, R., Lallo, D.D.: Noise fuzzy clustering of time series by the autoregressive metric 71(3), 217–243 (2013) 7. D’Urso, P., Maharaj, E.A.: Autocorrelation-based fuzzy clustering of time series. Fuzzy Sets Syst. 160(24), 3565–3589 (2009) 8. Gavrilov, M., Anguelov, D., Indyk, P., Motwani, R.: Mining the stock market (extended abstract): Which measure is best? In: Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 487–496. KDD’00, ACM, New York, USA (2000) 9. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985) 10. Lafuente-Rego, B., Vilar, J.A.: Clustering of time series using quantile autocovari- ances. Advances in Data Analysis and Classification pp. 1–25 (2015) 11. Maharaj, E.A.: Clusters of time series. J. Classification 17(2), 297–314 (2000) 12. Piccolo, D.: A distance measure for classifying arima models. J. Time Series Anal. 11(2), 153–164 (1990)