<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Clustering Trend Data Time-Series through Segmentation of FFT-decomposed Signal Constituents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniyal Kazempour</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Beer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oliver Schrüfer</string-name>
          <email>oliver.schruefer@campus.lmu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Seidl</string-name>
          <email>seidlg@dbs.ifi.lmu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ludwig-Maximilians-Universität München</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Given trend data for different keywords, scientists may want to cluster them in order to detect specific terms which exhibit a similar trending behavior. For this purpose, a periodic regression can be performed on each of the time-series. In this work we ask: what if we cluster not simply the regression models of each time-series, but the periodic signal constituents? The impact of such an approach is twofold: first, we would see at a regression level how similar or dissimilar two time-series are regarding their periodic models, and secondly, we would be able to see similarities between different time-series based on single signal constituents. This carries the semantic that although time-series may be different on a regression level, they may be similar on a constituent level, reflecting other periodic influences. The results of this approach reveal commonalities between time-series on a constituent level that are not visible at first glance from their plain regression models.</p>
      </abstract>
      <kwd-group>
        <kwd>Clustering</kwd>
        <kwd>Time-series</kwd>
        <kwd>FFT</kwd>
        <kwd>Signal decomposition</kwd>
        <kwd>Constituents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Whenever data is collected over a period of time, we have time-series. Be it
time-series in the life sciences, economy, health, social sciences or many other
research fields, we are faced with time-series containing information on attributes
changing over time. In the context of time-series, the detection of periodicity is of
special interest, such as the periodicity in the predator-prey behavior among
animals, power-consumption cycles, or trends of certain terms in search engines.
As a motivation we introduce in this work the following use-case: Suppose
there is a journal which publishes four times a year. Its content encompasses
topics ranging from health over society to education and many more. The editor
in chief now decides to include in each issue the topics which are most trending
among the readers. The questions which concern the editor in chief at this point
are: What is most trending? And how can such information be obtained? Or, in
other terms: How can the editorial staff decide which of the most trending topics
belong together in an issue? For this purpose one of the editors comes up with
the idea to mine the Google Trends platform (https://trends.google.de/trends/). From an editor-compiled list of
topics, the trend data over the past years is collected from the trends platform.
The journal's data scientists first obtain the time-series for the keyword "turkey" (the
poultry) and perform a periodic regression on it. We elaborate
on periodic regression in Section 3 in more detail. The data scientists, however,
recognize that the regression approach neglects much information of the original
time-series, as can be seen in Figure 1 (left). Then one of the data scientists
comes up with the idea to perform a Fast Fourier Transformation (FFT), by
which the time-series for "turkey" is decomposed through the frequency domain
into its so-called constituents, as seen in Figure 1 (right). In fact, the sum over
all the constituents resembles the original signal. What the data scientists can
do now is to cluster the keywords by using the constituent information instead
of clustering the regressions of the time-series. We shall see in the experiments
section of this work that clustering on the basis of signal constituents yields
more meaningful clusters compared to the case where we just use the regression of
the time-series. By the constituent-based approach the editors can identify the topics
for each calendar quarter of the year, providing articles in their issues which
satisfy the (seasonal) interests of readers.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Apart from our journal example, in [1] the authors list 25 applications of
clustering on time-series data from 11 different domains (aviation, energy, finance
etc.). Each of the clustering applications comes with references to at least one
paper, emphasizing that the topic of time-series clustering is undoubtedly of
significance and broad in range regarding its applications. Concluding this
introductory part, our contributions in this work are as follows: (1) Providing a
representation of decomposed signal constituents from a periodic signal as
parameter-space vectors. (2) A clustering method on the regression and on the
constituent level of time-series, applied in the periodic parameter space.</p>
    </sec>
    <sec id="sec-2">
      <sec id="sec-2-2">
        <title>Related Work</title>
        <p>Most related work can be found in the context of gene expression patterns, since
many of them are associated with "circannual (yearly), circadian (daily),
cell-cycle and other periodic biological processes [that] are known to be rhythmic"
[7]. Early methods like, e.g., [3] use the standard correlation coefficient between
the normalized vectors representing the correlations, since it captures similarity
in the shape of periodically correlated datasets well. However, it does not fulfill
the triangle inequality (and is thus no metric); furthermore, magnitude and shift of
time-series are neglected here, which can lead to undesired results. [3] apply a
pairwise average-linkage cluster analysis and focus on organizing and graphically
displaying data in a way that allows users to explore it in an intuitive manner.
RAGE [7] is phase-independent and uses a true distance metric based on the
undirected Hausdorff distance. Nevertheless, it is based on finding the periodic
representation of each time-series by fitting it to an ideal synthetic sinusoid or
another already known (gene expression) profile. CorrCos [4] first generates over
100000 synthetic models and then matches each time-series to one of those models
using cross-correlation. In a more recent work, "a periodic covariance function
based on a projection of the Matérn covariance" [5] is used. The authors regard noise
and allow time-series to share only a periodic component, structuring them in a
hierarchical manner. However, they need some information, such as the phase-length,
beforehand, which may be common in this field, but is not generally accessible
for new datasets. The representation of time-series in the form of an FFT and other
approaches is quite common in signal processing, as discussed in [9], which also
gives a brief overview of some of the most common techniques. With regard to
time-series distance measures there is a wealth of literature, such
as the more recent work of [8]. To the best of our knowledge, there is no
algorithm regarding the singular components of each periodically correlated
time-series, but only such working on simple periodical correlations.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Clustering via Signal Constituents</title>
        <p>In this section we elaborate on our approach of clustering periodic time-series
by clustering signal constituents. For this purpose we first describe the periodic
feature vectors and the parameter space, and proceed with periodic regression. We
contrast the regression approach by describing how we perform signal
decomposition using the FFT. Having obtained the signal constituents, we cluster them,
which requires specific handling, elaborated on in the last subsection.</p>
        <sec id="sec-2-3-1">
          <title>Periodic Feature Vectors and Parameter Space</title>
          <p>A periodic function can be represented as follows:</p>
          <p>Definition 1 (Periodic Function). Given an object (x, y) ∈ R², a periodic
sinusoidal function is defined as
φ(x) = a · sin(f · x + p) + v,
where a is the amplitude, f the frequency, p the phase-shift and v the vertical
shift. As such, a periodic feature vector φ is defined as φ = (a, f, p, v)^T.</p>
          <p>The periodic feature vector φ can be regarded as a model which describes a
periodicity within a given time-series. At this point we recapitulate that we do
not aim at comparing multiple time-series with each other residue-by-residue,
but their respective models. For this purpose we define the periodic parameter
space P as a feature space spanned by the set of parameters of a periodic
function {a, f, p, v}, with φ ∈ P.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>The FFT and Periodic Regression</title>
          <p>The most common usage of the Fourier transform is to convert a given signal from
the so-called time domain to the frequency domain. One core aspect on which
the Fourier transform relies is that every periodic function can be represented
as a sum of sine functions. A Fourier transform decomposes the time signal and
reveals information about the frequency of all involved sine waves that generate
the original signal. For time-series of evenly spaced values, such as time-series
having data for every second, hour, week etc., the so-called Discrete Fourier
Transform (DFT) is defined as follows:</p>
        </sec>
        <sec id="sec-2-3-3">
          <title>Definition 2 (Discrete Fourier Transformation)</title>
          <p>X_k = Σ_{n=0}^{N−1} x_n · e^{−2πi·k·n/N},
where N denotes the number of samples, n the current sample, x_n the value of
the signal at the sample point n, k the current frequency and X_k the result of the
DFT encapsulating amplitude and phase.</p>
          <p>Since the computation of the DFT as shown in Definition 2 has a quadratic
runtime complexity of O(N²), we use in this work the so-called Fast Fourier
Transform (FFT) [2], which reduces the runtime to O(N log N) by recursively
dividing a given DFT into smaller DFTs.</p>
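<p>To make the decomposition concrete, the following sketch (an illustration, not the authors' code; it assumes numpy, which the experiments section mentions) recovers amplitude, frequency and phase-shift of each constituent from the complex FFT coefficients:</p>

```python
import numpy as np

def fft_constituents(y, dt=1.0):
    """Decompose an evenly spaced signal into its sinusoidal constituents.

    Returns per-frequency amplitudes, frequencies and phase-shifts,
    recovered from the complex FFT coefficients.
    """
    n = len(y)
    coeffs = np.fft.rfft(y)           # complex spectrum of a real input
    freqs = np.fft.rfftfreq(n, d=dt)  # the corresponding frequencies
    amps = 2.0 * np.abs(coeffs) / n   # amplitude = scaled magnitude
    amps[0] /= 2.0                    # the DC term (vertical shift) is not doubled
    phases = np.angle(coeffs)         # phase-shift = angle of the coefficient
    return freqs, amps, phases

# A noiseless test signal: 3*sin(2*pi*0.1*t) with vertical shift 5.
t = np.arange(100)
y = 3.0 * np.sin(2 * np.pi * 0.1 * t) + 5.0
freqs, amps, phases = fft_constituents(y)
peak = np.argmax(amps[1:]) + 1        # strongest non-DC constituent
```

<p>For this signal the peak lands at frequency 0.1 with amplitude 3, while the DC bin carries the vertical shift 5.</p>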
          <p>A periodic regression is defined as:</p>
        </sec>
        <sec id="sec-2-3-4">
          <title>Definition 3 (Periodic Regression)</title>
          <p>Given a set DB of objects from e.g. a
time-series (x_i, y_i) of (independent and dependent) variables, the Damped Least
Squares (DLS, also known as the Levenberg-Marquardt algorithm) aims to find the
parameters β of a model curve φ(x, β) s.t. the sum of squared deviations
S(β) is minimized:
S(β) = Σ_{i=1}^{m} (y_i − φ(x_i, β))²,
with an initial guess β₀ = (a₀, f₀, p₀, v₀) which is obtained from an FFT by taking
the frequency, phase-shift and vertical shift with the highest amplitude.</p>
          <p>For a periodic regression we first perform an FFT on a given time-series. We
take the peak frequency in the frequency domain from the FFT and discard all
other frequencies. The peak refers to that particular frequency with the highest
energy (= amplitude). Intuitively, this can be compared to a principal component
analysis (PCA), where for dimensionality-reduction purposes we keep, for example,
only the principal component with the largest eigenvalue. The amplitude is
determined by computing the absolute value of the complex FFT coefficient. Having
the frequency with the highest amplitude (= peak), we then determine the
phase-shift, which is easily obtained by computing the angle of the complex
FFT coefficient. With the amplitude, frequency and phase we can then
compute the missing vertical shift. The FFT-obtained frequency, phase-shift,
vertical shift and amplitude are taken as an initial guess for a DLS as stated
in Definition 3. We now have the periodic feature vector φ which represents the
regression of a given periodic time-series.</p>
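<p>The regression procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: scipy's Levenberg-Marquardt fitter (curve_fit with method="lm") stands in for the DLS, and the FFT peak supplies the initial guess (a₀, f₀, p₀, v₀):</p>

```python
import numpy as np
from scipy.optimize import curve_fit

def periodic_model(x, a, f, p, v):
    # phi(x) = a * sin(f*x + p) + v, the model of Definition 1
    return a * np.sin(f * x + p) + v

def periodic_regression(x, y):
    """Fit one sinusoid by damped least squares (Levenberg-Marquardt),
    seeded with the FFT peak as the initial guess (a0, f0, p0, v0)."""
    n = len(y)
    coeffs = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(n, d=x[1] - x[0])
    k = np.argmax(np.abs(coeffs[1:])) + 1        # peak (non-DC) frequency bin
    a0 = 2.0 * np.abs(coeffs[k]) / n             # amplitude at the peak
    f0 = 2.0 * np.pi * freqs[k]                  # angular frequency
    p0 = np.angle(coeffs[k]) + np.pi / 2.0       # sine (not cosine) phase-shift
    v0 = np.abs(coeffs[0]) / n                   # vertical shift from the DC term
    params, _ = curve_fit(periodic_model, x, y, p0=(a0, f0, p0, v0), method="lm")
    return params                                # the feature vector (a, f, p, v)

x = np.arange(200) * 0.1                         # evenly spaced samples
y = 2.0 * np.sin(2 * np.pi * 0.25 * x + 0.4) + 3.0
a, f, p, v = periodic_regression(x, y)
```

<p>With a signal that falls exactly on an FFT bin, the initial guess is already near-exact and the DLS converges immediately; on real trend data the fit refines the FFT estimate.</p>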
          <p>However, the regression approach comes with the drawback that we may
lose information on the constituents which actually contribute to the detailed
"shape" of a time-series. This is an issue which we shall approach in the upcoming
subsection.</p>
        </sec>
        <sec id="sec-2-3-5">
          <title>Signal Constituent Tensors through Signal Decomposition</title>
          <p>What are the most determinant constituents of an original time-series signal?
Or, to express it rather in terms of a principal component analysis: what are
the most determinant principal components? For this purpose we perform an
FFT as described in the previous section and keep the k strongest signals, where
the signal strength is determined by the amplitude. The other constituents are
discarded. A possible approach by which k can be chosen is, e.g., the
elbow method [6]. As a result we obtain a tensor of the following structure:</p>
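<p>A sketch of extracting the top-k constituents into such a tensor (our naming; a numpy-based illustration, not the authors' code):</p>

```python
import numpy as np

def constituent_tensor(y, k, dt=1.0):
    """Top-k signal constituents (a_i, f_i, p_i, v_i), ordered by amplitude."""
    n = len(y)
    coeffs = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(n, d=dt)
    amps = 2.0 * np.abs(coeffs) / n
    phases = np.angle(coeffs)
    v = np.abs(coeffs[0]) / n                    # vertical shift from the DC term
    order = np.argsort(amps[1:])[::-1][:k] + 1   # k strongest non-DC constituents
    # One row per constituent: (amplitude, frequency, phase-shift, vertical shift).
    return np.column_stack([amps[order], freqs[order], phases[order],
                            np.full(k, v)])

# Two superimposed sinusoids; the tensor rows come out ordered by amplitude.
t = np.arange(300)
y = 4.0 * np.sin(2 * np.pi * 0.05 * t) + 1.5 * np.sin(2 * np.pi * 0.2 * t) + 2.0
gamma = constituent_tensor(y, k=2)
```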
        </sec>
        <sec id="sec-2-3-6">
          <title>Definition 4 (Signal Constituent Tensor)</title>
          <p>Given the top k signal constituents φ_i = (a_i, f_i, p_i, v_i)^T from an FFT on a
time-series τ, the resulting signal constituent tensor is defined as
Γ(τ) = (φ_0^T, ..., φ_{k−1}^T)^T,
where the single constituents are ordered in a way such that it holds:
a_0 &gt; a_1 &gt; ... &gt; a_{k−1}.</p>
          <p>The signal constituent tensor from Definition 4 now provides us with a
representation of the top k strongest signal constituents. They pose the very basis
for computing similarities between different time-series among their respective
constituent signals.</p>
        </sec>
        <sec id="sec-2-3-7">
          <title>Similarity and Clustering of Signal Constituents</title>
          <p>Having obtained the signal constituent tensors, we use them as a basis for
performing a clustering. As a design decision we use hierarchical clustering
with average linkage here. In order to perform hierarchical clustering we need a
distance matrix. For such a distance matrix, the question arises of how to actually
compute the distance between the signal constituent tensors of two time-series.
For this purpose we introduce the MinCT-dist distance in this work.</p>
        </sec>
        <sec id="sec-2-3-8">
          <title>Definition 5 (Minimum Constituent Tensor Distance (MinCT-dist))</title>
          <p>Given two signal constituent tensors Γ(τ_1) and Γ(τ_2), each of them from one
time-series τ_1 and τ_2, respectively. The minimum constituent tensor distance
is defined as
dMinCT(Γ(τ_1), Γ(τ_2)) = min{ δ(φ_0(τ_1), φ_0(τ_2)), ..., δ(φ_{k−1}(τ_1), φ_{k−1}(τ_2)) },
where δ(φ_i(τ_1), φ_j(τ_2)) denotes the distance between two constituents
φ_i(τ_1) ∈ Γ(τ_1) and φ_j(τ_2) ∈ Γ(τ_2).</p>
          <p>The intuition behind Definition 5 is that we take the minimum over the
constituent distances between two constituent tensors as the distance. Through this
approach we link those time-series which have a small distance on a specific
constituent. By that definition we also obtain some sort of lower bound, since the
distance between two time-series cannot be smaller than any of the constituent
distances.</p>
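<p>The paper leaves the constituent distance δ unspecified; as a minimal sketch we assume a Euclidean distance between the (a, f, p, v) rows of two constituent tensors and take the minimum over all constituent pairs, following the membership clause of Definition 5. The function name is ours:</p>

```python
import numpy as np

def minct_dist(gamma1, gamma2):
    """Minimum Constituent Tensor Distance: the smallest pairwise
    (here: Euclidean) distance between any two constituents."""
    # All pairwise row distances between the two tensors via broadcasting.
    diffs = gamma1[:, None, :] - gamma2[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    return d.min()

# Two toy tensors sharing one identical constituent -> distance 0.
g1 = np.array([[3.0, 0.1, 0.0, 1.0], [1.0, 0.5, 0.2, 1.0]])
g2 = np.array([[3.0, 0.1, 0.0, 1.0], [0.5, 0.9, 0.1, 2.0]])
```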
          <p>Such an approach as provided by MinCT-dist works well on coarse
resolutions (such as one measurement per week), but fails on finer temporal
resolutions, since MinCT-dist neglects repetitive patterns on a daily or weekly
level. To mitigate this effect, we provide a different distance measure taking
the different phase-shifts of each time-series into account. This leads to a first
definition of our next distance measure:</p>
          <p>Definition 6 (Phase-shift based distance). Given two time-series τ_i, τ_j,
where each of the time-series has a common series of different frequency intervals
f = [f_1, f_2, ..., f_n], where e.g. f_1 = [23h, 35h]. For each time-series an FFT
is computed, obtaining the respective amplitudes a and phase-shifts p for the
frequency intervals f. The distance between two time-series τ_i and τ_j is defined as
dphase(τ_i, τ_j) = Σ_{k=0}^{n−1} |p(τ_i, f_k) − p(τ_j, f_k)|.</p>
          <p>Definition 6 also relies on taking the top-k most impactful frequencies for each of
the frequency intervals. However, this phase-shift-centric approach comes with
the major drawback that all frequency intervals have the same impact, being
considered as 'equally important'. Looking at a keyword like e.g. 'beer garden'
(Biergarten in German), the daily cycles may be more important, since people
query their search engine of choice for the nearest beer garden in the afternoon.
Looking at an annual resolution, the summer may be of more relevance, since
people would look for a beer garden to spend their time in summer rather than
in winter. To counteract this effect, we multiply each of the absolute phase-shift
distances with the associated amplitudes, yielding the next stage of our distance
definition:</p>
        </sec>
        <sec id="sec-2-3-9">
          <title>Definition 7 (Amplitude-weighted phase-shift based distance)</title>
          <p>Given two time-series τ_i, τ_j. The amplitude-weighted variant of the phase-shift
based distance is defined as
daw-phase(τ_i, τ_j) = Σ_{k=0}^{n−1} a(τ_i, f_k) · a(τ_j, f_k) · |p(τ_i, f_k) − p(τ_j, f_k)|.</p>
          <p>By including the amplitudes of both time-series as weighting factors as
in Definition 7, frequency intervals with small impact (small amplitude) have a
smaller impact on the overall distance. In the scenario that both time-series
have high amplitudes in the same frequency intervals, the phase-shift distance is
also meaningful for the overall distance. However, this approach will fail in the
case that both time-series do not share impactful frequencies. This would result
in small distances, rendering it impossible to tell whether they have similar
phase-shifts or whether the distance itself is just insignificant. This effect is,
however, mitigated by adding up the multiplied amplitudes without the phase-shifts
and normalizing Definition 7 by this sum.</p>
        </sec>
        <sec id="sec-2-3-10">
          <title>Definition 8 (Normalized amplitude-weighted phase-shift based distance)</title>
          <p>Given two time-series τ_i, τ_j. The normalized amplitude-weighted variant of the
phase-shift based distance is defined as
dnaw-phase(τ_i, τ_j) = (1 / Σ_{k=0}^{n−1} a(τ_i, f_k) · a(τ_j, f_k)) · Σ_{k=0}^{n−1} a(τ_i, f_k) · a(τ_j, f_k) · |p(τ_i, f_k) − p(τ_j, f_k)|.</p>
          <p>With Definition 8 we now have a distance function which computes the
distance with respect to (a) the phase-shift, (b) the amplitudes and thus the impact,
and (c) different frequency intervals. We shall see in the experiments section the
effects of our defined distance functions.</p>
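<p>Definition 8 reduces to a few lines of code. The following sketch (our naming; numpy assumed) takes the per-interval amplitudes and phase-shifts of two time-series as arrays:</p>

```python
import numpy as np

def naw_phase_dist(a_i, p_i, a_j, p_j):
    """Normalized amplitude-weighted phase-shift distance (Definition 8).

    a_i, p_i: amplitudes and phase-shifts of one time-series, one entry per
    frequency interval; a_j, p_j likewise for the other time-series.
    """
    w = a_i * a_j                          # joint amplitude weights per interval
    d = np.sum(w * np.abs(p_i - p_j))      # amplitude-weighted phase distance
    return d / np.sum(w)                   # normalize by the total weight

# Two toy series: aligned in the dominant interval, shifted in the weak one.
a1 = np.array([5.0, 1.0]); p1 = np.array([0.0, 2.0])
a2 = np.array([4.0, 1.0]); p2 = np.array([0.0, 1.0])
```

<p>Here the dominant interval (weight 20) agrees in phase, so the weak interval's phase-shift of 1 contributes only 1/21 to the normalized distance, illustrating the weighting of Definition 8.</p>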
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>In order to re-connect to the initial journal use-case from the introduction,
we take for the conducted experiments 10 time-series from Google Trends (http://trends.google.com),
namely: college dorm, fireworks, fitness, flu, healthy eating, holidays, i have a
dream, sunburn, superbowl and turkey, from a time frame between 2004 and
today in the region of the United States. We have the relevance for each of these key
terms, which denotes the vertical axis, and each sample represents one month. This
yields a total of 183 samples per key term. We used the FFT from the numpy
library, on top of which we wrote routines to extract the amplitude, phase-shift and
vertical shift of each constituent, ranking the constituents by amplitude and
encapsulating them into a constituent tensor. We then performed a hierarchical
clustering using first a periodic regression and then all three of our distance measures.
Computing first the clustering of the regression models for each of the keywords, we
obtain the result seen in Figure 2.</p>
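<p>The clustering step can be sketched with scipy's hierarchical clustering on a precomputed distance matrix (average linkage, as chosen above). scipy, the helper name and the toy keyword distances here are our illustrative assumptions, not the experimental data:</p>

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_by_distance(names, dist, n_clusters):
    """Average-linkage hierarchical clustering on a precomputed distance matrix."""
    condensed = squareform(dist, checks=False)   # square matrix -> condensed form
    Z = linkage(condensed, method="average")     # average linkage, as in Sec. 3
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return dict(zip(names, labels))

names = ["fitness", "healthy eating", "turkey", "fireworks"]
dist = np.array([[0.0, 0.1, 0.9, 0.8],
                 [0.1, 0.0, 0.8, 0.9],
                 [0.9, 0.8, 0.0, 0.2],
                 [0.8, 0.9, 0.2, 0.0]])
clusters = cluster_by_distance(names, dist, n_clusters=2)
```

<p>Any of the three distance measures can fill the matrix; only the way the matrix is computed changes between the experiments.</p>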
      <p>In the periodic regression model clustering we can observe one cluster block
on the diagonal with low intra-cluster distance (center) and two with high
intra-cluster distance. The clustering, however, does not seem to make much
sense, since Martin Luther King day (i have a dream) and turkey, as well
as flu or super bowl, do not share any temporal closeness. They all occur annually
(except flu, maybe) but do not fall on the same time of the year (phase).</p>
      <p>Comparing the regression result against the minimum constituent tensor
distance, we obtain the cluster map shown in Figure 3. It shows that our
approach yields a more meaningful result. We have two clusters with very low
intra-cluster distances. The bottom-right cluster encompasses the keywords healthy
eating, i have a dream, fitness and holidays. Taking a closer look at the
keywords reveals that healthy eating and fitness are terms of high relevance during
the beginning of the year (mostly new year's resolutions). Also, Martin Luther
King day is in January (20.01). In the top-left cluster we can see fireworks, flu,
turkey, college dorm, sunburn and super bowl. Fireworks has its peak around
independence day (04.07), and sunburn also matches well, since this is in the
summer time. Flu takes some kind of special role, since people seem to look
for this keyword in the search engine mostly during October, which is also close
to thanksgiving (turkey). College dorm is close to the summer vacations and as
such also suitable in the cluster. Just the term super bowl does not seem to fit
the pattern, since it is in February. Compared to the simple regression model,
a clustering based on decomposed signal constituents seems to provide a much
more intuitive result.</p>
      <p>However, in this first experiment the time-series data was obtained on a
weekly basis. What happens if we take different keywords at different temporal
resolutions? The frequency intervals encompass 8, 12, 24, 84, 168, 420, 735, 802,
981, 1103, 1471, 2207, 2943, 3924, 4415, 7064, 8830 and 11773 hours, representing
working days, weeks, months, seasons etc. For each of the keywords, for all of the
frequency intervals, the respective amplitudes and phase-shifts are computed
through an FFT. As keywords for this second data set we chose weather (Wetter),
outdoor pool (Freibad), winter tires (Winterreifen), bakery (Bäcker), MVV
(Munich local public transport), beer garden (Biergarten), sunburn (Sonnenbrand),
tv program (Fernsehprogramm), sunrise (Sonnenaufgang), Twitter, Club, Ikea,
Christmas (Weihnachten), Ski and Firework (Feuerwerk). Computing clusters
on this trend data with MinCT-dist yields the clustermap seen in Figure 4.
While club and twitter are together in one cluster, most of the rest of the data
set is put into one massive cluster in which Christmas and outdoor pool are put
together. The result seems arbitrary.</p>
      <p>If we look at the clustermap yielded by the normalized amplitude-weighted
phase-shift based distance, one can see that more fine-grained clusters are
revealed, as seen in Figure 5. Weather and outdoor pool are put together into one
cluster, which makes sense, since people would go to an outdoor pool only if the
weather is suitable. Further, sunburn, beer garden, outdoor pool and MVV fall
into the same cluster.</p>
      <p>Taking a look at the time-series on a 30-days interval reveals no common
periodic pattern as seen in Figure 6.</p>
      <p>However, if we change from the 30-days interval to a 7-days interval, one can
observe that all keywords have a high(er) number of queries between 8am and
10pm, as seen in Figure 7. Individually, MVV peaks at around 8am and
9pm, which corresponds roughly to the morning time when people get to work
and the evening time when they return from evening events. Outdoor pool peaks
at around lunch time, which reflects people seeking to go swimming in
the afternoon. In the late afternoon, beer garden peaks, which is due to the effect
that people look for beer gardens to spend their evening at. Lastly, as an
interesting insight, we observe an increased number of queries for sunburn around
9pm-10pm; it may be speculated that people have got a sunburn and look for
ways to heal or alleviate the pain from it.</p>
      <p>In this work-in-progress we have presented two distance functions for clustering
time-series based on their signal constituents which emerged from an FFT. The
first experiments on real-world data showed that our method reveals similarities
between time-series on the constituent models which would not have been visible
by simply clustering the regression models. While our first approach goes beyond
a simple periodic regression, it fails when considering different frequency intervals
with their different phase-shifts. For this purpose we developed the normalized
amplitude-weighted phase-shift based distance, which provides a better reflection
of distances by taking phase-shift and amplitudes on different frequency
intervals into account. While this work may seem like 'yet another distance
measure for temporal data', it aims to explore the computation of distances based on
single constituents, respecting the influences of phase-shift, amplitude and
frequency intervals. We hope that this work leads to new discoveries on time-series
data, and the team in our fictional journal to more time-specific topics for their
readers.</p>
      <sec id="sec-3-1">
        <title>Acknowledgement</title>
        <p>This work has been funded by the German Federal Ministry of Education and
Research (BMBF) under Grant No. 01IS18036A. The authors of this work take
full responsibility for its content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aghabozorgi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shirkhorshidi</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wah</surname>
          </string-name>
          , T.Y.:
          <article-title>Time-series clustering - a decade review</article-title>
          .
          <source>Information Systems</source>
          <volume>53</volume>
          ,
          <fpage>16</fpage>
          -
          <fpage>38</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cooley</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tukey</surname>
            ,
            <given-names>J.W.:</given-names>
          </string-name>
          <article-title>An algorithm for the machine calculation of complex Fourier series</article-title>
          .
          <source>Mathematics of computation 19(90)</source>
          ,
          <volume>297</volume>
          -
          <fpage>301</fpage>
          (
          <year>1965</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Eisen</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spellman</surname>
            ,
            <given-names>P.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>P.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Botstein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Cluster analysis and display of genome-wide expression patterns</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>95</volume>
          (
          <issue>25</issue>
          ),
          <volume>14863</volume>
          -
          <fpage>14868</fpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Harmer</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogenesch</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Straume</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            , H.S., Han,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kreps</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kay</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Orchestrated transcription of key pathways in arabidopsis by the circadian clock</article-title>
          .
          <source>Science</source>
          <volume>290</volume>
          (
          <issue>5499</issue>
          ),
          <volume>2110</volume>
          -
          <fpage>2113</fpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hensman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rattray</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>N.D.</given-names>
          </string-name>
          :
          <article-title>Fast nonparametric clustering of structured time-series</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>37</volume>
          (
          <issue>2</issue>
          ),
          <fpage>383</fpage>
          -
          <lpage>393</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ketchen</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shook</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>The application of cluster analysis in strategic management research: an analysis and critique</article-title>
          .
          <source>Strategic Management Journal</source>
          <volume>17</volume>
          (
          <issue>6</issue>
          ),
          <fpage>441</fpage>
          -
          <lpage>458</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Langmead</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McClung</surname>
            ,
            <given-names>C.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donald</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          :
          <article-title>Phase-independent rhythmic analysis of genome-wide expression patterns</article-title>
          .
          <source>Journal of Computational Biology</source>
          <volume>10</volume>
          (
          <issue>3-4</issue>
          ),
          <fpage>521</fpage>
          -
          <lpage>536</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lucas</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shifaz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelletier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Neill</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaidi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goethals</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petitjean</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Webb</surname>
            ,
            <given-names>G.I.</given-names>
          </string-name>
          :
          <article-title>Proximity forest: an effective and scalable distance-based classifier for time series</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>33</volume>
          (
          <issue>3</issue>
          ),
          <fpage>607</fpage>
          -
          <lpage>635</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Popivanov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          :
          <article-title>Similarity search over time-series data using wavelets</article-title>
          .
          In:
          <source>Proceedings 18th International Conference on Data Engineering</source>
          . pp.
          <fpage>212</fpage>
          -
          <lpage>221</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>