=Paper=
{{Paper
|id=Vol-2259/aics_7
|storemode=property
|title=Kernel Methods for Time Series Classification and Regression
|pdfUrl=https://ceur-ws.org/Vol-2259/aics_7.pdf
|volume=Vol-2259
|authors=Mourtadha Badiane,Martin O'Reilly,Padraig Cunningham
|dblpUrl=https://dblp.org/rec/conf/aics/BadianeOC18
}}
==Kernel Methods for Time Series Classification and Regression==
Mourtadha Badiane, Martin O'Reilly and Pádraig Cunningham

University College Dublin

Abstract. The purpose of this paper is to collect together various SVM kernels presented in the literature for time series classification and regression and to test them against each other to see if there is a clear winner. Support Vector Machines (SVMs) are a state-of-the-art family of algorithms which utilize a kernel function in their processing. This paper re-evaluates certain kernels that can be used for time series classification and regression, paying particular attention to whether the kernels are positive semi-definite (PSD) and hence admissible. It explains the popular dynamic time warping kernel as well as its variations. We test the LTW (linear time warping), DTW (dynamic time warping), and SAX kernels in various forms against each other on a time series classification problem and a time series regression problem. We find that a variation of the DTW kernel, the global alignment (GA) kernel, wins when the time series are short (fewer than about 100 temporal dimensions), but when the time series become too long the GA kernel values explode towards infinity and, as a result, simpler kernels such as LTW with a linear local kernel perform better. Furthermore, we find that the original DTW kernel performs poorly as it does not produce PSD kernel matrices.

Keywords: Support Vector Machine, Time Series Regression, Time Series Classification, Dynamic Time Warping, Global Alignment Kernel, SAX

1 Introduction

Support vector machines were developed in Russia in the 1960s by Vladimir Vapnik and Alexey Chervonenkis. The Support Vector Machine (SVM) was introduced in [1] and the training algorithm was presented in [3]. They were originally invented for classification but were quickly adapted for regression. Support vector machines have been successfully applied to pattern recognition, and it is the hope of the authors to transfer that success to time series analysis. This paper serves as an introduction to time series analysis and aims to determine the most promising kernels for use in support vector machine classification and regression.

This paper defines the concept of a positive semi-definite kernel in Section 2, as well as the motivations for the paper. In Section 3 the reader is introduced to perhaps the simplest solution for performing time series analysis on series of differing lengths: the linear time warping (LTW) kernels, introduced in [7]. In Section 4 we introduce the more sophisticated dynamic time warping (DTW) kernels and their variations. In Section 5 we briefly introduce a bag-of-features approach, Symbolic Aggregate Approximation (SAX), which is a symbolic approach to time series analysis. Section 6 details two experiments on time series classification and regression, and Section 7 concludes the paper.

2 Positive Semi-Definite (PSD) Kernels

A support vector machine is only as good as the kernel function it employs; in fact, it is helpful to think of a support vector machine as an algorithm which allows us to use a kernel function to make predictions. A kernel function is a measure of similarity between two vectors, and it is meant to correspond to an inner (dot) product in a higher-dimensional feature space.
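To make the feature-space interpretation concrete, the following minimal Python sketch (our illustration, not code from the paper) verifies that the homogeneous quadratic kernel on 2-dimensional inputs equals an ordinary dot product under an explicit feature map; the function names are ours.

```python
import numpy as np

def quadratic_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2: K(u, v) = (u . v)^2."""
    return np.dot(u, v) ** 2

def quadratic_feature_map(u):
    """Explicit feature map phi with phi(u) . phi(v) = (u . v)^2 for 2-D input."""
    x1, x2 = u
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

# Both values agree: the kernel computes the inner product in the
# higher-dimensional feature space without constructing it explicitly.
print(quadratic_kernel(u, v))                                      # 1.0
print(np.dot(quadratic_feature_map(u), quadratic_feature_map(v)))  # 1.0
```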
A sufficient condition to ensure that the kernel defines a dot product in a feature space is that all the eigenvalues in its eigenfunction expansion are positive. To guarantee that these coefficients are positive, it is necessary and sufficient (Mercer's Theorem) that condition (1) be satisfied ([3]). Therefore, for a kernel function to be admissible (i.e. usable in an SVM) it must be positive semi-definite; if the kernel is not positive semi-definite then the support vector machine algorithm may not terminate at a global optimum and may instead produce very suboptimal results. The Gram matrix is the matrix $K_{ij} = K(x_i, x_j)$, where $x_i$ is the $i$th training example and $K(\cdot,\cdot)$ is the kernel function. The kernel function $K$ is positive semi-definite if and only if it satisfies Mercer's condition:

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} K(u, v)\, g(u)\, g(v)\, du\, dv \geq 0 \qquad (1)$$

for all $g$ such that

$$\int_{-\infty}^{\infty} g^2(u)\, du < \infty. \qquad (2)$$

In the discrete case, for any set of scalars $\{c_1, \ldots, c_n\}$, $c_i \in \mathbb{R}$ for $i = 1, \ldots, n$, the following criterion must be satisfied by the Gram (kernel) matrix $K_{ij} = K(x_i, x_j)$:

$$\sum_{i,j=1}^{n} c_i K_{ij} c_j \geq 0. \qquad (3)$$

We may note that, by definition, a matrix $A$ is positive semi-definite if and only if $x \cdot Ax \geq 0$ for all $x \in \mathbb{R}^d$. Hence for an eigenvector $x$ with corresponding eigenvalue $\lambda$,

$$x \cdot Ax = x \cdot \lambda x = \lambda\, x \cdot x = \lambda \lVert x \rVert^2.$$

We require this expression to be non-negative, which holds when the eigenvalue $\lambda$ is non-negative. So a positive semi-definite matrix must have no negative eigenvalues. We may go further and show that a symmetric matrix with only non-negative eigenvalues is positive semi-definite. If an $n \times n$ real matrix is symmetric it has $n$ orthogonal eigenvectors $x_1, \ldots, x_n$ with corresponding eigenvalues $\lambda_1, \ldots, \lambda_n$, and these eigenvectors form a basis for $\mathbb{R}^n$. That is, any vector $y$ can be written as a linear combination $c_1 x_1 + \ldots + c_n x_n$ for suitable constants $c_i$. We then have

$$y \cdot Ay = \sum_{i=1}^{n} c_i x_i \cdot A \sum_{j=1}^{n} c_j x_j = \sum_{i=1}^{n} c_i x_i \cdot \sum_{j=1}^{n} c_j A x_j.$$

Now since each $x_j$ is an eigenvector, $A x_j = \lambda_j x_j$, so we have

$$y \cdot Ay = \sum_{i=1}^{n} c_i x_i \cdot \sum_{j=1}^{n} c_j \lambda_j x_j = \sum_{i,j=1}^{n} \lambda_j c_i c_j\, x_i \cdot x_j.$$

Now since $x_1, \ldots, x_n$ are orthogonal vectors, $x_i \cdot x_j = 0$ if $i \neq j$. This yields

$$y \cdot Ay = \sum_{i=1}^{n} c_i^2 \lambda_i\, x_i \cdot x_i = \sum_{i=1}^{n} c_i^2 \lambda_i \lVert x_i \rVert^2,$$

which is non-negative if all the eigenvalues $\lambda_1, \ldots, \lambda_n$ are non-negative.

We reiterate that a kernel is a similarity measure, so we would like it to be large when two vectors are close and small when they are far apart. It is also worth noting that we do not directly require the training examples when training our SVM; all that is required is the Gram matrix $K_{ij}$. The following functions may be used as kernels for static vectors ([6]):

– Polynomial (homogeneous): $K(u, v) = (u \cdot v)^d$
– Polynomial (inhomogeneous): $K(u, v) = (u \cdot v + 1)^d$
– Gaussian radial basis function: $K(u, v) = e^{-\gamma \lVert u - v \rVert^2}$
– Hyperbolic tangent: $K(u, v) = \tanh(\kappa\, u \cdot v + c)$

One of the best performing kernels is the radial basis function (RBF) [4], which relies on computing a distance $\lVert u - v \rVert$ between two vectors. These are kernels for static vectors, but our intuition is to extend the RBF kernel to time series data simply by formulating a distance measure that can be applied to time series, such as the dynamic time warping distance, which can then be plugged into the Gaussian RBF kernel exactly as in the static case.
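As a hedged illustration of this "distance substitution" idea (ours, not code from the paper), the sketch below builds a Gaussian RBF Gram matrix from an arbitrary distance function and checks admissibility by inspecting the eigenvalues of the resulting matrix, as suggested by the discrete Mercer criterion (3); the function names are ours.

```python
import numpy as np

def rbf_from_distance(dist, X, sigma=1.0):
    """Gram matrix K_ij = exp(-dist(x_i, x_j)^2 / sigma^2) for a generic distance."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-dist(X[i], X[j]) ** 2 / sigma ** 2)
    return K

def is_psd(K, tol=1e-8):
    """A symmetric matrix is PSD iff its smallest eigenvalue is (numerically) non-negative."""
    return np.linalg.eigvalsh(K).min() >= -tol

# With the Euclidean distance this construction recovers the ordinary Gaussian RBF
# kernel, whose Gram matrices are always PSD.
rng = np.random.default_rng(0)
X = list(rng.normal(size=(10, 5)))
K = rbf_from_distance(lambda u, v: np.linalg.norm(u - v), X)
print(is_psd(K))  # True
```

The same check can be re-run with any time series distance substituted for the Euclidean one, which is exactly how the non-PSD behaviour discussed next shows up in practice.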
Unfortunately, before the reader gets their hopes up, it turns out that the popular dynamic time warping distance is not actually a metric, which means that the induced kernel is not positive semi-definite (PSD) and so is not an admissible kernel. Fortunately, it has been used successfully for SVM time series analysis by some authors before [2].

3 Linear Time Warping

One of the first problems confronting anyone attempting to perform regression on time series vectors is that the time series may not all be of the same length. There is a simple and straightforward way around this problem: linear time warping (LTW), first introduced in [7]. We simply warp each vector to the same length. In theory we could choose an arbitrary length to which to map each time series; in practice, to guarantee that the derived LTW kernel is PSD (so long as the underlying kernel on static vectors is PSD), we should warp each time series in the data set to a vector of fixed length $l$. The maximum length over all the time series is a good candidate for $l$, the length to which all the time series will be warped. Let $\pi^i$ be the warping function for example $i$; then for the $i$th training example $x^*_i = \{x_{\pi^i(k)}\}_{k=1}^{l}$, where $l$ is the length of the longest time series and

$$\pi^i(k) = \left\lceil \frac{km}{l} \right\rceil \qquad (4)$$

where $m$ is the length of $x_i$ and $\lceil \cdot \rceil$ denotes the ceiling function. We can then simply calculate the kernel of $x$ and $y$ as

$$K(x, y) = \sum_{i=1}^{l} k_L(x_{\pi^1(i)}, y_{\pi^2(i)})$$

where $k_L$ is the local kernel (acting on static vectors), for example linear or quadratic, $\pi^1$ is the LTW function for $x$, and $\pi^2$ is the LTW function for $y$. We could also consider the linear time warping RBF kernel defined as

$$K(x, y) = e^{-\sum_{i=1}^{l} \lVert x_{\pi^1(i)} - y_{\pi^2(i)} \rVert / \sigma^2}$$

where $\sigma^2$ is the variance. Every $x^*_i$ is warped to length $l$ simply by repeating certain time points in the series so that all the series are the same length. Linear time warping is attractive since it nearly always produces kernels which are PSD and is a trivial way of comparing time series of different lengths. The equations above summarize those presented in [7].

4 Dynamic Time Warping

So that the paper may retain its self-contained nature, we will re-introduce the DTW algorithm presented in [7] as well as in [8] and [9]. We begin this section with the definition of an alignment between two time series. We then define the DTW kernel in terms of those alignments. Both of these definitions were made in [7]. The rest of this section deals with variations of the original DTW algorithm described in the literature.

4.1 The original algorithm

Fig. 1. Misalignment correction by DTW

To compare two time series that do not line up on the x-axis we may use the dynamic time warping distance. As can be seen in Figure 1, when comparing two time series X and Y that are similar in shape except that the crest of one is shifted along the x-axis, the DTW kernel will reflect this similarity by warping the time axis so that the two time series align. In contrast, the Euclidean distance completely ignores the inherent similarity between the two series as a result of the misalignment. In summary, DTW is elastic and the Euclidean distance is not. The dynamic time warping distance is not a metric since it does not satisfy the triangle inequality¹. However, on some problems the kernel still performs well.
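Before formalising DTW alignments, the linear time warping kernel of Section 3 is concrete enough to sketch in a few lines of Python. This is our illustration rather than the authors' code, and for simplicity the two series are warped to the length of the longer of the pair rather than to a dataset-wide maximum as recommended in Section 3.

```python
import math
import numpy as np

def ltw_warp(series, l):
    """Stretch a series of length m to length l via pi(k) = ceil(k * m / l), eq. (4)."""
    m = len(series)
    return np.array([series[math.ceil(k * m / l) - 1] for k in range(1, l + 1)])

def ltw_kernel(x, y, local_kernel=lambda a, b: a * b):
    """LTW kernel: warp both series to a common length and sum the local kernel values."""
    l = max(len(x), len(y))          # in the paper, l is fixed over the whole data set
    xw, yw = ltw_warp(x, l), ltw_warp(y, l)
    return sum(local_kernel(a, b) for a, b in zip(xw, yw))

x = np.array([0.0, 1.0, 2.0, 1.0])
y = np.array([0.0, 1.5, 2.0, 1.5, 0.0, -1.0])
print(ltw_kernel(x, y))  # LTW kernel value with a linear local kernel
```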
It is usually a good idea to plot the eigenvalues of the Gram (kernel) matrix; many negative eigenvalues usually imply poor performance.

To calculate the dynamic time warping distance we must first define what is meant by a good alignment. An alignment $\pi$ is a pair of functions $\pi^1$ and $\pi^2$ which satisfy the following properties (writing $[v] = \{1, \ldots, v\}$):

$$\pi^1 : [m] \to [l] \qquad (5)$$
$$\pi^2 : [n] \to [l] \qquad (6)$$

where $l$ is known as the length of the alignment;

$$\pi^k_1 = 1 \quad \text{for } k \in [2] \qquad (7)$$
$$\pi^1_l = m \qquad (8)$$
$$\pi^2_l = n \qquad (9)$$
$$\pi^k_i - \pi^k_{i-1} \in \{0, 1\} \quad \forall k \in [2],\ \forall i \in \{2, \ldots, l\} \qquad (10)$$
$$\pi^a_i = \pi^a_{i-1} \implies \pi^b_i - \pi^b_{i-1} = 1 \quad \forall a, b \in [2],\ a \neq b,\ \forall i \in \{2, \ldots, l\}. \qquad (11)$$

We may summarize the criteria as follows: both $\pi^1$ and $\pi^2$ must be monotonic functions from $[m]$ and $[n]$ onto $[l]$ such that they contain no simultaneous repetitions (11). Once we have the alignment $\pi$ we may define the dynamic time warping distance between two time series $x$ of length $m$ and $y$ of length $n$:

$$d(x, y) = \min_{\pi \in A(x,y)} \sum_{k=1}^{l} \lVert x(t_{\pi^1_k}) - y(t_{\pi^2_k}) \rVert \qquad (12)$$

where $A(x, y)$ is the set of all possible alignments and $\lVert \cdot \rVert$ is the regular Euclidean distance. We may calculate the dynamic time warping distance in $O(mn)$ via the recurrence relation

$$M_{i,j} = \min(M_{i-1,j}, M_{i-1,j-1}, M_{i,j-1}) + \lVert x_i - y_j \rVert. \qquad (13)$$

The resulting $M_{m,n}$ is the dynamic time warping distance between $x$ and $y$. Note that it is customary to use a warping window, which limits the maximum warping that may occur between the two time series. This is trivial to implement: if TOL is our tolerance (maximum warping) then we simply ensure that $M_{i,j} = \infty$ whenever $|i - j| > TOL$. By doing this we ensure there is an upper bound on the warping. We may form the kernel in the usual way:

$$K_{GDTW}(x, y) = e^{-d(x,y)^2 / \sigma^2} \qquad (14)$$

where $\sigma$ is the standard deviation of the distances. This is the Gaussian dynamic time warping kernel $K_{GDTW}$. The negative dynamic time warping kernel is defined as

$$K_{NDTW}(x, y) = -d(x, y); \qquad (15)$$

it is simply the negated DTW distance.

¹ The triangle inequality, which is a necessary condition for a distance measure $d$ on a vector space $\chi$ to be a metric, states: $d(u, v) \leq d(u, w) + d(w, v)$ for all $u, v, w \in \chi$. If the distance underlying a kernel fails this test, the kernel will fail to be PSD.

4.2 Variations

We may also utilize DTW with a linear local kernel. This kernel is defined similarly to $K_{GDTW}$:

$$K_{LDTW}(x, y) = \max_{\pi \in A(x,y)} \sum_{k=1}^{|\pi|} x_{\pi^1(k)} \cdot y_{\pi^2(k)}. \qquad (16)$$

We may calculate $K_{LDTW}$ in a similar way to $K_{GDTW}$, in $O(mn)$, via the recurrence relation

$$M_{i,j} = \max(M_{i-1,j}, M_{i-1,j-1}, M_{i,j-1}) + x_i \cdot y_j; \qquad (17)$$

the resulting $M_{m,n}$ is the dynamic time warping kernel with a linear local kernel. We may extend the dynamic time warping algorithm to any non-linear local kernel $k_L$ by modifying equation (16) to

$$K(x, y) = \max_{\pi \in A(x,y)} \sum_{i=1}^{|\pi|} k_L(x_{\pi^1(i)}, y_{\pi^2(i)}) \qquad (18)$$

and hence equation (17) would be modified to

$$M_{i,j} = \max(M_{i-1,j}, M_{i-1,j-1}, M_{i,j-1}) + k_L(x_i, y_j), \qquad (19)$$

which defines the DTW kernel induced from a local kernel $k_L$. Some choices for $k_L$ include $x \cdot y$, $(x \cdot y)^n$, and $\tanh(x \cdot y)$.

In regular DTW we only consider the path through the cost matrix $M$ which minimizes the total cost and therefore the path which minimizes the cumulative distance between the two time series. The Global Alignment kernel instead considers every single admissible alignment $\pi \in A(x, y)$.
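The following Python sketch (ours, not the authors' code) implements the DTW recurrence (13) with an optional warping window, the Gaussian DTW kernel (14), and, looking ahead to the next subsection, the soft-minimum recurrence of the global alignment kernel. The cost-matrix boundary values and the restriction to univariate series are our assumptions.

```python
import numpy as np

def dtw_distance(x, y, window=None):
    """DTW distance via recurrence (13), with an optional warping window (tolerance)."""
    m, n = len(x), len(y)
    M = np.full((m + 1, n + 1), np.inf)
    M[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if window is not None and abs(i - j) > window:
                continue  # M[i, j] stays infinite outside the warping window
            cost = abs(x[i - 1] - y[j - 1])  # Euclidean distance for univariate series
            M[i, j] = min(M[i - 1, j], M[i - 1, j - 1], M[i, j - 1]) + cost
    return M[m, n]

def gdtw_kernel(x, y, sigma=1.0):
    """Gaussian DTW kernel (14): the DTW distance plugged into an RBF."""
    d = dtw_distance(x, y)
    return np.exp(-d ** 2 / sigma ** 2)

def ga_kernel(x, y):
    """Soft-minimum recurrence of the global alignment kernel described next:
    the min in (13) becomes a sum and local costs enter through an exponential."""
    m, n = len(x), len(y)
    M = np.zeros((m + 1, n + 1))
    M[0, 0] = 1.0  # boundary condition assumed for this sketch
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i, j] = (M[i - 1, j] + M[i, j - 1] + M[i - 1, j - 1]) * np.exp(abs(x[i - 1] - y[j - 1]))
    return M[m, n]

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 2.0, 1.0])
print(dtw_distance(x, y), gdtw_kernel(x, y, sigma=2.0), ga_kernel(x, y))
```

As a small check of the metric claim above, `dtw_distance([0.0], [2.0, 2.0, 2.0])` evaluates to 6, whereas `dtw_distance([0.0], [0.0, 2.0]) + dtw_distance([0.0, 2.0], [2.0, 2.0, 2.0])` is only 2 + 2 = 4, a concrete instance of the triangle-inequality failure noted in Section 4.1.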
First we define the cumulative cost $S$ of an alignment:

$$S(\pi) = \sum_{i=1}^{|\pi|} \lVert x_{\pi^1(i)} - y_{\pi^2(i)} \rVert \qquad (20)$$

and then the Global Alignment (GA) kernel is defined as

$$K(x, y) = \sum_{\pi \in A(x,y)} e^{S(\pi)}. \qquad (21)$$

The Global Alignment (GA) kernel is discussed in detail in [8] and [9], where it is shown to produce PSD kernels. We can calculate the GA kernel in $O(mn)$ time via the recurrence relation

$$M_{i,j} = (M_{i-1,j} + M_{i,j-1} + M_{i-1,j-1})\, e^{\lVert x_i - y_j \rVert}. \qquad (22)$$

As can be seen, the $\min(\cdot)$ in equation (13) is replaced by a summation here, so this kernel is sometimes called the soft-minimum kernel.

5 SAX

The approaches to time series analysis above are numerical. Here we introduce a symbolic approach to time series analysis: Symbolic Aggregate Approximation (SAX). One motivation for moving towards a symbolic approach is that we could then utilize the wealth of data mining techniques pertaining to string representations, one example being edit distance. Another source of motivation is that a symbolic approach may yield significant dimensionality reductions. For further explanation see [10].

The first step of SAX is to discretize the input space. First we set the alphabet size ($a > 2$) for the problem. Next, for every time series $C = c_1, \ldots, c_n$ we must assign each $c_i$ to a corresponding symbol in the alphabet. So if $a = 3$ and our alphabet is $\{a, b, c\}$, then for the time series $c_1, \ldots, c_n$ we must map each $c_i$ to a letter in the alphabet. Our approach to discretization is to first transform the data into the Piecewise Aggregate Approximation (PAA) representation and then symbolize the PAA representation into a discrete string. The two main benefits of this process are the well documented dimensionality reduction of PAA ([11], [12]) and lower bounding: our distance measure between two symbolic strings lower-bounds the true distance between the original time series ([13], [12]).

Fig. 2. Discretization via PAA

We can represent a time series $C$ of length $n$ in a $w$-dimensional space by a vector $\bar{C} = \bar{c}_1, \ldots, \bar{c}_w$. As in [10], we can calculate $\bar{c}_i$ by

$$\bar{c}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1) + 1}^{\frac{n}{w} i} c_j. \qquad (23)$$

We have reduced the time series from $n$ (temporal) dimensions to $w$ dimensions by dividing the data into $w$ equally sized frames; $\bar{c}_i$ is simply the mean value of the time series within frame $i$. We may think of this process as attempting to approximate our time series with a linear combination of box functions. It is worth noting that it is important to z-normalise each time series. We then define breakpoints which determine the letter in our alphabet to which each $\bar{c}_i$ will be mapped. Usually we do this by analyzing the statistics of our time series and choosing breakpoints so that each letter in the alphabet is as likely to appear as any other letter; in other words, we choose breakpoints to spread out the data evenly. For a more thorough explanation see [10]. Once the breakpoints are determined it is straightforward to map our PAA representation to a string consisting of letters from our alphabet. The breakpoints, chosen from the statistics of the data, control the proportion of values that fall into each bin. Whereas the PAA coefficient regulates the discretization of the time (x) axis, the alphabet size regulates the discretization of the value (y) axis. When we have a long time series our approach is to first discretize the time series into a long string and then extract a bag of words.
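Before describing the word extraction, the discretization just outlined (PAA followed by breakpoint mapping) can be sketched as follows. This is our illustrative Python rather than code from the paper; the breakpoint values shown are the standard Gaussian ones for a three-letter alphabet cited in [10].

```python
import numpy as np

def paa(series, w):
    """Piecewise Aggregate Approximation (23): mean of each of w equal-sized frames."""
    series = np.asarray(series, dtype=float)
    series = (series - series.mean()) / series.std()   # z-normalise first
    frames = np.array_split(series, w)                  # handles n not divisible by w
    return np.array([frame.mean() for frame in frames])

def sax_word(series, w, breakpoints, alphabet="abc"):
    """Map each PAA coefficient to a letter according to the breakpoints."""
    coeffs = paa(series, w)
    # np.searchsorted returns, for each coefficient, the index of its bin.
    return "".join(alphabet[np.searchsorted(breakpoints, c)] for c in coeffs)

# Alphabet size a = 3 with breakpoints that split a standard normal
# distribution into three (roughly) equally likely bins.
breakpoints = [-0.43, 0.43]
series = np.sin(np.linspace(0, 2 * np.pi, 60))
print(sax_word(series, w=6, breakpoints=breakpoints))  # a six-letter SAX word
```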
We determine a sliding window, usually found by parameter optimization, and this sliding window length becomes the length of each word in our bag of words. We thus turn one long string of letters representing the original time series into a series of words, each word being the length of the sliding window. The first word starts from the first index in the original long string, the second word starts from the second index, and we proceed until all the words have been extracted. As a distance measure between the time series we could use the Euclidean distance; however, to make the comparison more elastic we use edit distance. Time series that are similar but not aligned will then be marked as similar by our distance measure, edit distance, since it is elastic. It is also worth noting that edit distance is not a metric as it does not satisfy the triangle inequality; however, in the experiments reported here, time series discretized with SAX combined with edit distance and a Gaussian RBF-like kernel produce PSD Gram (kernel) matrices.

6 Experiments

For both experiments we can summarise our methods into four categories:

1. Linear time warping (LTW), described previously, paired with various local kernels: ltw - linear, quadratic, quadratic inhomogeneous, cubic, cubic inhomogeneous, tanh, and rbf.
2. Dynamic time warping (DTW) paired with various local kernels, plus the global alignment variation. The kernels are: GDTW; NDTW; DTW - linear, quadratic, quadratic inhomogeneous, cubic, cubic inhomogeneous, tanh; and the GA kernel.
3. SAX with edit distance. The two kernels are SAX - rbf and SAX - negative.
4. Gaussian Euclidean rbf kernel.

SAX - rbf is the edit distance calculated as above (Section 5) paired with a Gaussian rbf kernel. SAX - negative is simply the negated edit distance calculated as above (Section 5) used as the kernel. NDTW stands for the negative DTW kernel and is simply the negative of the DTW distance used as the kernel. GDTW stands for Gaussian DTW and is simply the DTW distance plugged into the Gaussian RBF kernel (thus replacing the Euclidean distance). GA stands for the global alignment kernel discussed with the other variations of DTW above.

6.1 Experiment I - Time Series Regression

This experiment consisted of training and testing numerous time series regression models on the Proximal Phalanx dataset available from the UCR archive [14]. The examples are univariate time series representing bone measurements, gathered in an attempt to automate the job of a radiographer in estimating the age of a patient given a radiograph of the non-dominant hand. The dataset consists of 399 training samples and 204 test samples. The target variable is the bone age; there are 6 classes (3-8), but for the purposes of this paper we treat this as a regression exercise. Every time series is of length 80. The parameters used for the different models are summarized in the Appendix.

Fig. 3. Experiment I - Proximal Phalanx (Bones)

The results of this experiment are presented in Figure 3. Red bars indicate that the method is not PSD. Here the global alignment (GA) kernel produces a PSD kernel matrix; when that happens the GA kernel usually performs outstandingly well, as is the case here. It is an easy winner.

6.2 Experiment II - Time Series Classification

This experiment consisted of training and testing numerous time series classification models on the Jumping dataset [15].
This dataset consists of univariate time series created by recording, with a Shimmer sensor, people performing a jumping exercise. It is a classification task with one 'correct' class and two 'incorrect' classes. There are 419 training samples and 179 test samples. The original time series vary in length from 1231 to 6710.

The results of this experiment are shown in Figure 4. Again, red corresponds to having at least one negative eigenvalue. We observe again that better performance usually goes together with having a PSD kernel matrix. The simple techniques perform really well here, with LTW besting DTW as in the first experiment.

In Figure 5 we can see the eigenvalues of the GDTW method. The largest eigenvalue is 343.76 and has been removed. Negative eigenvalues are coloured red whilst non-negative eigenvalues are coloured blue. We see that GDTW does better than average despite the fact that it is not PSD; however, the method is far outperformed by the PSD linear time warping with linear local kernel method. The smallest eigenvalue of the GDTW kernel matrix is -0.9178, which is not that negative. This explains why the method does not perform that badly; nevertheless, those negative eigenvalues are always going to hamper performance. We also note that GDTW is PSD in the first experiment but not in the second. The method is not PSD in general: for some datasets it may be PSD and for others it may not, exactly as we observe here.

Fig. 4. Experiment II - Jumping

Fig. 5. Eigenvalues of the GDTW kernel matrix (largest eigenvalue 343.76 excluded)

7 Conclusion

It is clear from the results above that PSD kernels usually outperform non-PSD kernels; likewise, the original DTW algorithm fails to produce PSD kernel matrices and therefore performs particularly poorly in both investigations. Linear time warping with a linear local kernel is perhaps the easiest possible way to compare time series of differing lengths, yet despite its simplicity it performs remarkably well. Other techniques, though significantly more sophisticated, struggle to beat linear time warping. Furthermore, the computation of LTW kernels requires only linear time, O(l) where l is the length of the longest time series, which is much more time efficient than dynamic time warping. It is interesting to note that SAX paired with an RBF kernel performed reasonably well on both datasets; moreover, it produces PSD kernel matrices. Although it is known that edit distance is not a metric, it would be interesting to investigate SAX paired with the RBF kernel further and see whether it always produces PSD kernels. One noteworthy point is that mapping input vectors to a higher dimensional space becomes less effective the higher the input dimension is ([5]). Basically, as the number of input dimensions increases, the performance of the linear SVM improves greatly in comparison to other kernels which involve mapping to a higher dimensional feature space, such as polynomial or RBF kernels. This is perhaps why LTW with a linear kernel performs best in experiment 2, where the input vectors' (temporal) dimensions vary from 1231 to 6710. It is also worth noting that the global alignment kernel should always produce good results and a PSD kernel.
However, in experiment 2 above, GA failed to produce a PSD kernel and failed to perform well; this was most likely because the total number of possible alignments grew so large that the values stored in the computer started to overflow. The GA kernel should always produce PSD matrices; however, because values stored in a computer break down when they become too large, the GA kernel can fail when the time series are too long. We recommend GA as the superior kernel so long as the time series are not too long, or the computer can handle the large magnitude of the numbers being stored. In that case the GA kernel needs no adjusting: it will produce PSD kernels which outperform both the regular DTW and the simplistic LTW. However, when the time series are too long or the computer cannot handle the magnitude of the numbers being stored, the algorithm needs to be adjusted to work on the machine; unfortunately this adjustment causes the kernel to no longer be PSD, and therefore it performs poorly. To understand this, recall that the global alignment kernel is defined as

$$K(x, y) = \sum_{\pi \in A(x,y)} e^{S(\pi)}. \qquad (24)$$

Now when x and y are time series of length 40,000 (as in the Jumping dataset), |A(x, y)| is the Delannoy number D(40000, 40000), which is well over $2^{40000} \approx 10^{12041}$, an astronomically large number. This number is (the index of) what we are summing over to calculate the kernel on x and y. In summary, we recommend the simple LTW technique, as DTW is poorly suited to SVMs since it does not produce PSD kernels and therefore usually performs poorly. For a sophisticated technique to beat LTW, one should try the global alignment (GA) kernel, so long as the time series are not so long that the numbers explode. GA, like DTW, takes into consideration misalignment on the x-axis (warping) but, unlike DTW, should produce PSD kernel matrices.

Acknowledgments

This publication has resulted from research supported in part by a grant from Science Foundation Ireland (SFI) under Grant Number 16/RC/3872 and is co-funded under the European Regional Development Fund.

References

1. B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers.
2. S. Gudmundsson, T. P. Runarsson, and S. Sigurdsson. Support vector machines and dynamic time warping for time series.
3. C. Cortes and V. N. Vapnik. Support-vector networks.
4. E. J. R. Justino, F. Bortolozzi, and R. Sabourin. A comparison of SVM and HMM classifiers in the off-line signature verification.
5. C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification.
6. https://en.wikipedia.org/wiki/Support_vector_machine
7. H. Shimodaira, K. I. Noma, M. Nakai, and S. Sagayama. Dynamic time-alignment kernel in support vector machine.
8. M. Cuturi. Fast global alignment kernels.
9. M. Cuturi et al. A kernel for time series based on global alignments.
10. J. Lin, E. Keogh, L. Wei, and S. Lonardi (2007). Experiencing SAX: a novel symbolic representation of time series.
11. E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra (2001). Locally adaptive dimensionality reduction for indexing large time series databases.
12. B. K. Yi and C. Faloutsos (2000). Fast time sequence indexing for arbitrary Lp norms.
13. E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra (2001). Dimensionality reduction for fast similarity search in large time series databases.
14. Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista (2015). The UCR Time Series Classification Archive, www.cs.ucr.edu/~eamonn/time_series_data/
15. V. Mahato, M. O'Reilly, and P. Cunningham (2018). A Survey of k-NN Methods for Time Series Classification and Regression, under review for AICS 2018.

Appendix

Parameters for Experiment 1

– GDTW & NDTW - warping window: 2.
– SAX (negative & Gaussian) - sliding window: 15; PAA coefficient: 10; alphabet size: 5; numerosity reduction: none; Z threshold: 0.01.

Parameters for Experiment 2

– GDTW & NDTW - warping window: 26.
– SAX (negative & Gaussian) - sliding window: 10; PAA coefficient: 10; alphabet size: 15; numerosity reduction: none; Z threshold: 0.1.