SE4TeC: A Scalable Engine For Efficient and Expressive Time Series Classification

Jingwei Zuo, Karine Zeitouni, Yehia Taher
DAVID Lab, University of Versailles, University of Paris-Saclay, Versailles, France
jingwei.zuo@uvsq.fr, karine.zeitouni@uvsq.fr, yehia.taher@uvsq.fr

Abstract—Time Series (TS) data are ubiquitous in numerous application fields, such as medicine, multimedia, and finance. In this paper, we present SE4TeC: a Scalable Engine for efficient and expressive Time Series Classification, which brings novel optimizations to the state-of-the-art shapelet-based algorithm: (i) more efficient feature extraction; (ii) better interpretability of both the feature extraction and the classification process; (iii) scalability in the context of Big Data. SE4TeC's effectiveness and efficiency are experimentally demonstrated over real-life datasets.

I. INTRODUCTION

Tony is informed of an unhealthy heart status on the basis of his electrocardiogram (ECG) during a medical diagnosis. An experienced doctor can easily correlate the abnormal ECG with the disease, then explain to Tony the symbolic abnormality in the ECG and the relevant treatment. Nowadays, Machine Learning techniques can partly replace the role of an experienced doctor and perform the diagnosis very accurately. In Tony's case, the electrical activity of the heart is collected from a physiological sensor as a Time Series (TS). From the perspective of the machine, the diagnosis can be considered as a Time Series Classification (TSC) problem.

The classical approaches [1], [2] to TSC problems are usually based on statistical features extracted from the time series, such as the mean and standard deviation of subsequences, which are assumed to represent the global characteristics of the series. Intuitively, such features capture very superficial information with low noise tolerance. To address these limitations, the Shapelet [3] has attracted great interest over the past years, owing to its highly discriminative nature and good interpretability. However, extracting shapelets from data series comes at a large computational cost: even for small-sized datasets, the algorithm can take days. This is mainly due to the repeated similarity search between a subsequence (i.e., a candidate shapelet) and the TS instances in the database. Typical speed-up techniques (i.e., indexing [4], lower bounding [5], and early abandoning [3]) always introduce extra parameters, which are difficult to set without prior knowledge. Low-dimensional representation methods have also been proposed, such as Piecewise Aggregate Approximation (PAA) [6], which computes the mean of each subsequence of a given length and transforms the raw data into coarse-grained sub-components. Symbolic Aggregate approXimation (SAX) [7] transforms subsequences of the raw time series into value-characterized symbols, which enables a hierarchical index, iSAX [8], to accelerate similarity search. Fast Shapelet [9] proves its efficiency, and even its scalability, based on SAX and random projection. However, a loss of feature information is unavoidable after the reduction of dimensionality, which requires a trade-off between efficiency and accuracy.

Our work leans towards raw time series processing, which yields higher accuracy but has a relatively high time complexity [3], [10]. Unlike hardware-based implementations, such as using GPUs to accelerate the similarity calculation [11], we focus here on the scalability of TSC based on shapelet extraction. Traditional TSC algorithms on raw TS data are not applicable in a big data context because of their low scalability. Our contributions in this paper are:
1) We propose a novel method to evaluate shapelets in batches.
2) We introduce a scalable engine to extract shapelets expressively.
3) Based on the scalable engine, we propose an acceleration strategy to extract shapelets more efficiently.

The rest of this paper is organized as follows. In Section 2, we review the background and state the research problems. We present our scalable engine for Time Series Classification in Section 3. Section 4 shows an empirical evaluation and a performance comparison with rival solutions. Finally, we give our conclusions and perspectives for future work in Section 5.

II. BACKGROUND

A. Definition and notation

We start by defining shapelets and the other notions used in the paper.

Definition 1: A Time Series T is a sequence of real-valued numbers T = (t_1, t_2, ..., t_i, ..., t_n), where n is the length of T.

Definition 2: A subsequence T_{i,m} of T is a continuous subset of values from T of length m starting at position i: T_{i,m} = (t_i, t_{i+1}, ..., t_{i+m-1}), where i ∈ [0, n − m + 1].

Definition 3: A Shapelet ŝ is a time series subsequence whose shape is particularly representative of a class. As such, it constitutes an interpretable feature which can distinguish one class from the others.

Fig. 2: Matrix Profile between a source time series T and a target time series T′, where n is the length of T. Intuitively, MP_i shares the same offset as the source T.

Definition 4: A Dataset D is a collection of time series T_i and their class labels c_i. Formally, D = {<T_1, c_1>, ..., <T_N, c_N>}, where N is the number of instances in D. C = {c_1, c_2, ..., c_|C|} is the collection of class labels, where |C| denotes the number of labels.

Definition 5: The Z-Normalized Time Series is a formal representation of a Time Series, defined as ZNormal(T) = (T − µ)/σ, where µ is the sample mean and σ the standard deviation:

  µ = (1/m) Σ_{i=1}^{m} t_i ,   σ² = (1/m) Σ_{i=1}^{m} t_i² − µ²   (1)

Z-Normalization allows us to focus on the structural features of T rather than on its amplitude, and it addresses the problem of data stability. For instance, the Euclidean Distance (ED) between two time series T_{x,m}, T_{y,m} is expressed as follows:

  ED_{x,y} = sqrt( Σ_{i=1}^{m} (t_{x,i} − t_{y,i})² )   (2)

Small changes (e.g., noise with a peak value) can cause an evident bias in this result; Z-Normalization is a way of smoothing out that bias.

Definition 6: The Normalized Euclidean Distance (N-ED) is expressed by the formula sqrt( (1/m) Σ_{i=1}^{m} (t_{x,i} − t_{y,i})² ).

Definition 7: The Distance Profile DP_i is a vector which stores the Normalized Euclidean Distance between a given subsequence/query T_{i,m} and every subsequence T′_{j,m} of a target Time Series T′. Formally, DP^m_{i,j} = dist(T_{i,m}, T′_{j,m}), ∀j ∈ [0, n′ − m + 1].

B. Evaluation of candidate Shapelets

The quality of a candidate shapelet ŝ can be assessed by its ability to separate the instances of the different classes in the dataset D. A prerequisite of the quality measure for ŝ is that a set of distances D_ŝ must be calculated, where D_ŝ = {D_{ŝ,1}, D_{ŝ,2}, ..., D_{ŝ,n}} and n is the number of time series T in the dataset D.

Information Gain, an evaluation method based on Decision Trees, is widely adopted in previous works [3], [10], [12]. An iterative test over D_ŝ is conducted to extract the split distance which brings the highest Information Gain. This distance is then kept as a property of the candidate shapelet, to check the inclusion between the candidate and a time series. Another simple approach is to use the F-Statistic [13], based on the difference of means in an Analysis of Variance (ANOVA). The main idea of this statistical method is to assess the difference in the distributions of ŝ's distances between the classes.

C. Problem Statement

The high degree of coupling inside classical TSC algorithms [10], [12] leads to the problem of not being able to parallelize them.
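To fix ideas, the distance notions above (Definitions 5-7) can be sketched in plain Python. This is an illustrative sketch of ours, not SE4TeC's actual code, and the function names are hypothetical:

```python
import math

def z_normalize(t):
    """Z-normalize a sequence (Definition 5): subtract the mean, divide by sigma."""
    m = len(t)
    mu = sum(v for v in t) / m
    var = max(sum(v * v for v in t) / m - mu * mu, 0.0)  # as in Eq. (1)
    sigma = math.sqrt(var)
    if sigma == 0:  # constant sequence: only the mean can be removed
        return [v - mu for v in t]
    return [(v - mu) / sigma for v in t]

def n_ed(x, y):
    """Normalized Euclidean Distance between equal-length sequences (Definition 6)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def distance_profile(query, target):
    """Distance Profile (Definition 7): N-ED between the z-normalized query
    and every z-normalized subsequence of the target, computed naively."""
    m = len(query)
    q = z_normalize(query)
    return [n_ed(q, z_normalize(target[j:j + m]))
            for j in range(len(target) - m + 1)]
```

The naive loop above is the baseline that FFT-based similarity search such as MASS [14] accelerates: both produce the same vector, but MASS avoids re-scanning each subsequence.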
Definition 8: MASS, namely Mueen's ultra-fast Algorithm for Similarity Search, computes the Distance Profile based on the Fast Fourier Transform (FFT), and requires only O(n log n) time, rather than the O(nm²) time of classical N-ED similarity search.

Fig. 1: Distance Profile between a query T_{i,m} and a target time series T′, where n′ is the length of T′. Obviously, DP_{i,j} shares the same offset as the target T′.

Definition 9: The Matrix Profile MP is a vector of distances between each subsequence T_{i,m} of a source T and its nearest neighbor T′_{j,m} in a target T′. Formally, MP^m_i = min(DP^m_i), where i ∈ [0, n − m + 1].

Like the distance profile, the matrix profile can be considered as a meta time series annotating the source time series T. The profile has many interesting and exploitable properties. For example, the highest point on the profile corresponds to the time series discord, while the (tied) lowest points correspond to the positions of queries which have a similar match in the target time series.

Speed-up methods such as Early Abandoning [3] are based on the classical Euclidean Distance measure, which has a time complexity of O(N²n⁴), several orders of magnitude higher than MASS [14]: O(N²n³ log n). Another common trick played by previous work [3], [10], [12] is the following: if we know that ŝ is a low-quality candidate, then any subsequence ŝ′ similar to ŝ must also be of low quality, and therefore the costly computation of the distance set D_{ŝ′} (the evaluation of ŝ′) can be skipped. However, a candidate shapelet is evaluated by its quality ranking among all candidates of the same length. Assuming that the distributed nodes have generated a collection of candidates ŝ_l from the dataset, an aggregation operation between the nodes is required to extract the candidate with the best quality, and extra aggregations are needed at each iteration over the candidate length. Apparently, the acceleration brought by the classical pruning techniques can easily be offset by the communication cost caused by this aggregation.

III. SYSTEM OVERVIEW

Strategies should be adopted to make the system scalable. The main idea of our system is that the calculation should be shared and executed independently: the less communication between the nodes, the more powerful the algorithm.

Fig. 3: System overview (labeled TS database, shapelet extraction over the cluster nodes, early classifier, and prediction over the unlabelled TS database).

A. Main structure

Conventional Time Series classification problems are tackled with the nearest neighbor (1NN) algorithm due to its simple design. As shown in Figure 3, on the basis of 1NN, an early classifier [15] adopted in the system allows giving the prediction result as early as possible, without waiting for the entire sequence. The processing of the labelled Time Series data needs to be flexibly distributed over the nodes of the cluster. To this end, a suitable algorithm is applied which assigns computing tasks that are relatively independent of each other.

Algorithm 1: SMAP (Shapelet on MAtrix Profile)
Input: Dataset D, classSet Ĉ, k, offset
Output: Ŝ
1  minLength ← 2
2  maxLength ← getMinLen(D)
3  double[] DiscmP ← []; double[] DistThresh ← []
4  Ŝ ← ∅
5  D.cache() // cache the dataset in the cluster; each time series has a unique ID
6  MapPartition (set of <ID, T>: T_set)
7    for <ID, T> ∈ T_set do
8      for m ← minLength to maxLength do
9        DiscmP[m], DistThresh[m] ← computeDiscmP(T, D, m, offset)
10       DiscmP[m] ← DiscmP[m] · √(1/l)
11     DiscmP ← pruning(DiscmP)
12     emit(DiscmP, DistThresh)
13 MapAggregation (class, (DiscmP, DistThresh))
14   for c ∈ Ĉ do
15     Ŝ′ ← getTopk(DiscmP[c], DistThresh[c], k)
16     for ŝ ∈ Ŝ′ do
17       ŝ.matchingIndices ← getMatchingIndices(ŝ, D)
18     Ŝ ← Ŝ ∪ Ŝ′
19 return Ŝ

B. SMAP: Shapelet extraction on MAtrix Profile

The Matrix Profile provides meta-data which facilitates the representation of a complex correlation between two time series. SMAP takes the time series as the smallest processing unit between nodes, utilizes a normalized quality to extract the most important parts in each processing unit, and then merges them by an aggregation process. For this reason, the number of candidate shapelets can be greatly reduced. Moreover, a single aggregation operation is required to get the global shapelet result for the different classes. In line 5, the dataset is broadcast to the distributed nodes in order to reduce the communication cost caused by accessing the common data. Then, each cluster partition shares the computing tasks for a set of time series. The function computeDiscmP computes in batches the quality of the candidate shapelets; the process is visualized in Figure 4.

The batch quality of the instances in a time series is defined by the Discrimination Profile, which relies on the concept of Representative Profile:

  RP(T_i^C, D) = avg(MP_{T_i^C, T_j}), where T_j ∈ D^C   (3)

The Representative Profile thus targets the minimal processing unit (i.e., a time series) in SMAP, and is a vector of the Representative Power of each instance in the processing unit. Put simply, the Representative Power of a subsequence with respect to class C is its normalized distance to the global cluster of instances of class C. Intuitively, it represents the relevance between the subsequence (i.e., the candidate shapelet) and the class.

The fact that a subsequence is discriminative for its class against the others can be expressed by the difference of Representative Power between class C and the others (OVA, one-vs-all). The Discrimination Profile is then defined as follows:

  DiscmProfile(T_i^C, D) = −(RP(T_i^C, D^C) − RP(T_i^C, D^{!C}))   (4)

The quality normalization in line 10 allows assessing the Discrimination Power of shapelets of different lengths in a uniform way. Similar in purpose to Information Gain, the Discrimination Profile is a more interpretable technique for assessing candidate shapelets. Moreover, in this manner a split distance can be given directly, rather than iterating over every possible distance and selecting the one with the highest Information Gain, which takes O(N²n²) time. A strategy to check whether T contains a shapelet can then be defined as follows:

  sInT(T, ŝ^C) = { true, if dist(T, ŝ^C) ≤ RP(ŝ^C, D^C) ; false, otherwise }   (5)
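The quality measures of Equations (3)-(5) can be sketched as follows. This is a simplified, single-machine illustration of ours, not the distributed implementation: min_dist is a naive stand-in for one entry of the MASS-based Matrix Profile, and all function names are hypothetical:

```python
import math

def min_dist(query, series):
    """Nearest-neighbor z-normalized Euclidean distance from `query` to any
    subsequence of `series` (one entry of a Matrix Profile, Definition 9)."""
    m = len(query)
    def znorm(t):
        mu = sum(t) / len(t)
        sd = math.sqrt(max(sum(v * v for v in t) / len(t) - mu * mu, 0.0))
        return [(v - mu) / sd if sd > 0 else 0.0 for v in t]
    q = znorm(query)
    best = float("inf")
    for j in range(len(series) - m + 1):
        s = znorm(series[j:j + m])
        best = min(best, math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)) / m))
    return best

def representative_power(candidate, class_set):
    """Eq. (3): average nearest-neighbor distance of the candidate
    over the instances of one class."""
    return sum(min_dist(candidate, t) for t in class_set) / len(class_set)

def discrimination(candidate, same_class, other_classes):
    """Eq. (4): RP towards the other classes minus RP towards the own class;
    larger values mean a more discriminative candidate."""
    return (representative_power(candidate, other_classes)
            - representative_power(candidate, same_class))

def s_in_t(series, candidate, same_class):
    """Eq. (5): T is deemed to contain the shapelet if its distance to the
    candidate does not exceed the candidate's RP within its own class."""
    return min_dist(candidate, series) <= representative_power(candidate, same_class)
```

Note that discrimination returns RP(T_i^C, D^{!C}) − RP(T_i^C, D^C), i.e., exactly the sign convention of Equation (4): a candidate close to its own class and far from the others scores high.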
C. Acceleration strategy

The pruning function in line 11 reduces the number of candidate shapelets, and thereby the communication cost of the aggregation process. We can simply adopt a "TopK" strategy, which extracts the K biggest values of DiscmP. However, such a technique does nothing to lighten the computation inside the MapPartition process. Since each processing unit should be independent of the others, a tenable technique for updating the profile of a longer query can be adopted. The Lower Bounding distance [5] estimates the minimal possible Z-Normalized Euclidean Distance between two subsequences T_{i,l+k} and T_{j,l+k}, based on the distance already computed between T_{i,l} and T_{j,l}. Compared to the linear time complexity of computing the exact distance, the LB Distance Profile can be computed in constant time, which greatly accelerates the computation of the Matrix Profile in Figure 4. For example, when going from shapelet length l = m to m + 1, the time complexity of computing the distance d^{l+1}_{i,j} exactly is O(n · m(m−1)/2 · m), where j ∈ [0, n · m(m−1)/2] indexes the subsequences of D, n is the number of instances in D, and m is the length of the longest instance in D. The LB distance instead takes O(n · m(m−1)/2), a clear advantage when the query length is relatively long. The Lower Bounding distance [5] is defined as:

  LB(d^{l+k}_{i,j}) = √l · σ_{j,l}/σ_{j,l+k},           if q_{i,j} ≤ 0
  LB(d^{l+k}_{i,j}) = √(l(1 − q_{i,j}²)) · σ_{j,l}/σ_{j,l+k},  otherwise   (6)

where q_{i,j} = ( (1/l) Σ_{p=1}^{l} t_{i+p−1} t_{j+p−1} − µ_{i,l} µ_{j,l} ) / (σ_{i,l} σ_{j,l}).

Fig. 4: Discrimination Profile Extraction (Matrix Profiles of T_i^C against class C and class !C yield the Representative Powers, whose difference gives the Discriminative Power).

Empirically, the matching subsequence T_{j,l} which is the nearest neighbor of T_{i,l} allows deducing a longer subsequence T_{j,l+1}, which is probably the nearest neighbor of T_{i,l+1}. Assuming that the matching subsequence stays at the same position in T_target when the length of the query T_{i,l} increases, the time complexity for computing the minimal distance between T_{i,l} and T_target is O(l), instead of O(l(n − l + 1)). As mentioned in Definition 9, MP^m_i = min(DP^m_i); the main idea here is to use the LB Distance to accelerate the computation of min(DP^m_i), rather than computing the entire DP_i at a higher time complexity.

IV. EVALUATION

The baseline of the evaluation is USE [16], which uses the traditional method for shapelet extraction based on Information Gain. USE_MASS uses MASS as the similarity measure strategy, and SMAP_LB applies the acceleration technique based on the Lower Bounding Distance. A KNN (K=10) classifier is applied for all accuracy tests. All implementations are based on Python 3.6 with PySpark, the Python API of Apache Spark. The programs are tested on an AWS EMR cluster (each node has 16GB of RAM and 8 vCPUs, with a maximal parallelism of 6, as 2 vCPUs are reserved for the cluster management system). The number of nodes is adjustable.

A. ECG medical diagnosis

The dataset ECG200, from the MIT-BIH Long-Term ECG Database (ltdb), is collected by two electrodes which record the electrical activity of the heart at distinct body positions. Each heartbeat has an assigned label of normal or abnormal. All abnormal heartbeats are representative of a cardiac pathology known as supraventricular premature beat. ECG200 contains 100 labelled records with a fixed length of 96.

Fig. 5: Shapelets in the ECG raw sequence (top-5 shapelets for class 1 and class 2).

The shapelets extracted by SMAP and the raw time series are shown in Figure 5. Intuitively, the shapelets of different lengths extracted from the ECG instances cannot be found directly by the naked eye. The performance results are shown in Table I. SMAP_LB shows a clear gain in performance: it is 23.8X faster than USE, with a higher prediction accuracy (84% versus 76%). Note that the speed-up depends on the computing power of the cluster; we are more inclined to pay attention to the parallelization capability, which can be assessed by the shuffle cost between the distributed nodes. From local mode (1 node) to cluster mode (6 nodes), the shuffle cost increases by 106%, while the Distance Measure cost drops to 8.6%. Obviously, considering the gain, the shuffle cost can be ignored when we expand the cluster to a larger scale.

TABLE I: Performance comparison (L. = local, C. = cluster)

  Algorithm  | Time  | Accuracy | Shuffle (L./C.) | Dist. M. (L./C.)  | TopK (L./C.)
  USE        | 2547s | 76%      | -               | -                 | -
  USE_MASS   | 1782s | 76%      | -               | -                 | -
  SMAP       | 139s  | 84%      | 1.52s/3.08s     | 1558s/135.57s     | 1.15s/0.27s
  SMAP_LB    | 107s  | 84%      | 1.48s/3.19s     | 1198s/103.23s     | 1.21s/0.31s

B. Phalanges Outlines Recognition

To put the evaluation into a larger-scale context, we choose the time series dataset PhalangesOutlinesCorrect, which contains 1800 training records with a fixed length of 80. The task is to identify the bones of the middle finger from the outlines extracted from images; three human evaluators labelled the output of the image outlining as correct or incorrect.

USE requires more than 12960 minutes to finish the computation task. SMAP in a single-node environment is more than 7 times faster (precisely, it takes 1860 minutes for shapelet extraction). The accuracy was 86%. Figure 6 shows the relation between the time cost and the cluster's capacity, for both the Similarity Measure and the Aggregation process. The result supports the conclusion obtained in the previous experiment.

Fig. 6: Time consumed as a function of the number of distributed nodes (Similarity Measure and Aggregation/Shuffle time in the remote cluster).

V. CONCLUSION

In this paper, we propose a novel methodology, namely SMAP, for Time Series Classification. SMAP adopts the concept of Matrix Profile and extracts the shapelet features in a scalable and interpretable manner. Within SMAP, the Discrimination Profile is defined to assess in batches the quality of the candidate shapelets. On the basis of Lower Bounding, we also propose an acceleration strategy which is well suited to a distributed environment. The satisfactory results prove the efficiency of the scalable approach and testify to its competitiveness. Several optimizations are on our planning list: specifically, extending SMAP to longer time series by reducing the data dimensionality, while preserving its high accuracy and scalability.

ACKNOWLEDGEMENT

This research was supported by the DATAIA convergence institute as part of the Programme d'Investissement d'Avenir (ANR-17-CONV-0003), operated by DAVID Lab, University of Versailles Saint-Quentin, and by the MASTER project, which has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement N. 777695.

REFERENCES

[1] N. Ravi et al., "Activity recognition from accelerometer data," in Proc. IAAI'05. AAAI Press, 2005, pp. 1541-1546.
[2] L. Bao and S. S. Intille, "Activity recognition from user-annotated acceleration data." Springer, 2004, pp. 1-17.
[3] L. Ye and E. Keogh, "Time series shapelets: a new primitive for data mining," in Proc. 15th ACM SIGKDD, 2009.
[4] D.-E. Yagoubi et al., "DPiSAX: Massively distributed partitioned iSAX," in Proc. ICDM 2017, Nov. 2017, pp. 1-6.
[5] M. Linardi, Y. Zhu et al., "Matrix Profile X: VALMOD - scalable discovery of variable-length motifs in data series," 2018.
[6] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, "Dimensionality reduction for fast similarity search in large time series databases," Knowledge and Information Systems, pp. 263-286, Aug 2001.
[7] J. Lin et al., "A symbolic representation of time series, with implications for streaming algorithms," in Proc. 8th ACM SIGMOD Workshop, ser. DMKD '03, 2003.
[8] A. Camerra et al., "iSAX 2.0: Indexing and mining one billion time series," in Proc. ICDM 2010, 2010, pp. 58-67.
[9] T. Rakthanmanon and E. Keogh, "Fast Shapelets: A scalable algorithm for discovering time series shapelets," in Proc. 2013 SIAM International Conference on Data Mining, 2013.
[10] A. Mueen et al., "Logical-shapelets: an expressive primitive for time series classification," in Proc. 17th ACM SIGKDD, 2011.
[11] K. Chang et al., "Efficient pattern-based time series classification on GPU," in Proc. ICDM 2012, 2012, pp. 131-140.
[12] L. Ye and E. Keogh, "Time series shapelets: a novel technique that allows accurate, interpretable and fast classification," DMKD, 2011.
[13] J. Lines, L. M. Davis, J. Hills, and A. Bagnall, "A shapelet transform for time series classification," Tech. Rep., 2012.
[14] C.-C. M. Yeh et al., "Matrix Profile I: All pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets," in Proc. ICDM '16, 2016, pp. 1317-1322.
[15] Z. Xing, J. Pei, and P. S. Yu, "Early prediction on time series: a nearest neighbor approach," in Proc. 21st IJCAI, 2009, pp. 1297-1302.
[16] R. Mousheimish, Y. Taher, and K. Zeitouni, "Automatic learning of predictive CEP rules: bridging the gap between data mining and complex event processing," in Proc. DEBS '17, 2017, pp. 158-169.