<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ICASSP.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/ICASSP.2012.6289085</article-id>
      <title-group>
        <article-title>An Efficient Subsequence Similarity Search on Modern Intel Many-core Processors for Data Intensive Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Yana Kraeva, Mikhail Zymbler</string-name>
          <email>kraevaya@susu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2018)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>South Ural State University (National Research University)</institution>
          ,
          <addr-line>Chelyabinsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <volume>6289085</volume>
      <fpage>5173</fpage>
      <lpage>5176</lpage>
      <abstract>
<p>Many time series analytical problems arising in a wide spectrum of data intensive applications require subsequence similarity search as a subtask. Currently, Dynamic Time Warping (DTW) is considered the best similarity measure in most domains. Since DTW is computationally expensive, parallel algorithms have been developed for FPGA, GPU, and Intel Xeon Phi. In our previous work, we accelerated subsequence similarity search on the Knights Corner generation of Intel Xeon Phi by means of a CPU+Phi computational scheme. Such an approach needs significant changes for the Phi's second-generation product, Knights Landing (KNL), which is an independent bootable device supporting only native applications. In this paper, we present a novel parallel algorithm for subsequence similarity search in time series for the Intel Xeon Phi KNL many-core processor. In order to efficiently exploit the vectorization capabilities of Phi KNL, the algorithm provides a sophisticated data layout and computational scheme. We performed experiments on synthetic and real-world datasets, which showed good scalability of the algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Nowadays, time series are pervasive in a wide spectrum of
applications with data intensive analytics, e.g. climate
modelling [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], economic forecasting [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], medical
monitoring [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], etc. Many time series analytical problems
require subsequence similarity search as a subtask, which
assumes that a query subsequence and a longer time series
are given, and we are to find a subsequence of the time
series, whose similarity to the query is the maximum among
all the subsequences.
      </p>
      <p>
        At the present time, Dynamic Time Warping (DTW) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
is considered as the best time series subsequence similarity
measure in most domains [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], since it allows the
subsequence to have some stretching, shrinking, warping,
or different length in comparison to the query. Since
computation of DTW is time-consuming (O(n²), where n
is the length of the query), a number of speedup
techniques have been proposed, including algorithmic
developments (indexing methods [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], early abandoning
strategies, embedding and computation reuse
strategies [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], etc.) and parallel hardware-based solutions
for FPGA [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and GPU [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and Intel Xeon Phi [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        In this paper, the Intel Xeon Phi many-core system is the
subject of our efforts. Architecture of Phi provides a large
number of compute cores with a high local memory
bandwidth and 512-bit wide vector processing units. Being
based on the Intel x86 architecture, Phi supports
thread-level parallelism and the same programming tools as a
regular Intel Xeon CPU and serves as an attractive
alternative to FPGA and GPU. Now, Intel offers two
generations of Phi products, namely Knights Corner
(KNC) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Knights Landing (KNL) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The former is
a coprocessor with up to 61 cores, which supports native
applications as well as offloading of calculations from a host
CPU. The latter provides up to 72 cores and as opposed to
predecessor is an independent bootable device, which runs
applications only in native mode.
      </p>
      <p>
        In our previous works [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], we accelerated
subsequence similarity search on Phi KNC by means of the
CPU+Phi computational scheme. Such an approach needs
significant changes for Phi KNL because, without a host CPU,
in addition to parallelization, we have to efficiently
vectorize computations in order to achieve high
performance.
      </p>
<p>In this paper, we address accelerating subsequence
similarity search on Phi KNL for the case when the time series
involved in the computations fit in main memory. The paper
makes the following basic contributions. We developed a
novel parallel algorithm of subsequence similarity search in
time series for the Intel Xeon Phi KNL processor. The
algorithm efficiently exploits the vectorization capabilities of
Phi KNL by means of a sophisticated data layout and
computational scheme. We performed experiments on
real-world datasets, which showed good scalability of the
algorithm.</p>
      <p>The rest of the paper is organized as follows. Section 2
discusses related works. Section 3 gives formal notation and
statement of the problem. In Section 4, we present the
proposed parallel algorithm. We describe experimental
evaluation of our algorithm in Section 5. Finally, in
Section 6, we summarize the results obtained and propose
directions for further research.</p>
      <p>In our previous CPU+Phi scheme, the algorithm selects candidate
subsequences and offloads them to the coprocessor in order to
compute DTW. Such a scheme significantly outperformed the
naïve approach.</p>
<p>This development, however, cannot be directly applied
to Phi KNL since it is an independent bootable many-core
processor and supports only native applications. Thus, our
previous approach needs significant changes. Moreover, in
order to achieve high performance of similarity search on
Phi KNL, in addition to parallelization, we should provide a
data layout and computational scheme that allow exploiting
the vectorization capabilities of the many-core processor
in the most efficient way.</p>
    </sec>
    <sec id="sec-2">
      <title>3 Notation and problem background</title>
      <sec id="sec-2-1">
        <title>3.1 Definitions and notations</title>
<p>Definition 1. A time series T is a sequence of real-valued
elements: T = t_1, t_2, …, t_m. The length of a time series T is
denoted by |T|.</p>
        <p>Definition 2. Given two time series X = x_1, …, x_n and
Y = y_1, …, y_n, the Dynamic Time Warping (DTW) distance
between X and Y is denoted by DTW(X, Y) and defined as below:
DTW(X, Y) = d(n, n), (1)
d(i, j) = (x_i − y_j)² + min { d(i − 1, j), d(i, j − 1), d(i − 1, j − 1) }, (2)
d(0, 0) = 0; d(i, 0) = d(0, j) = ∞; i = j = 1, …, n. (3)
In (2), the set of d(i, j) is considered as an n × n
warping matrix for the alignment of the two respective time
series.</p>
        <p>A warping path is a contiguous set of warping matrix
elements that defines a mapping between the two time
series. The warping path must start and finish in diagonally
opposite corner cells of the warping matrix, the steps in the
warping path are restricted to adjacent cells, and the points
in the warping path must be monotonically spaced in time.</p>
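<p>As a minimal illustration of the recurrence in Definition 2, the (squared) DTW distance can be sketched in Python as follows; this is an O(n²) reference implementation, not the paper's optimized native Phi code:</p>

```python
import math

def dtw(x, y):
    """Squared DTW distance between equal-length sequences x and y,
    following d(i, j) = (x_i - y_j)^2 + min of the three neighbors,
    with d(0, 0) = 0 and infinite boundary values."""
    n = len(x)
    # d[i][j] holds the cumulative warping cost; row/column 0 hold boundaries.
    d = [[math.inf] * (n + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][n]
```

<p>For identical sequences the distance is zero, and the warping matrix d coincides with the matrix described above.</p>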
<p>Definition 3. A subsequence T_{i,n} of a time series T is its
contiguous subset of n elements, which starts from position
i: T_{i,n} = t_i, t_{i+1}, …, t_{i+n−1}, 1 ≤ i ≤ m − n + 1.</p>
<p>Definition 4. Given a time series T and a time series Q as a
user-specified query, where m = |T| ≫ |Q| = n, the best matching
subsequence T_{i,n} meets the property
∀ k: DTW(Q, T_{i,n}) ≤ DTW(Q, T_{k,n}), 1 ≤ i, k ≤ m − n + 1. (4)
In what follows, where there is no ambiguity, we refer to a
subsequence T_{i,n} as C, a candidate in match to the query Q.
Strictly speaking, DTW allows comparing two time series of
different lengths. However, for the sake of simplicity, we assume
time series of equal lengths, since this is possible without losing
the generality [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].</p>
        <p>The problem of subsequence similarity search under the
DTW measure has been extensively studied in the recent decade.
Since computation of DTW is time-consuming, a number of speedup
techniques have been proposed: indexing
methods [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], early abandoning strategies, embedding and
computation reuse strategies [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], etc. Despite the fact that these
techniques reduce the number of calls of the DTW
calculation procedure, computation of the DTW
measure still takes up to 80 percent of the total run time of
the similarity search [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. For this reason, a number of
parallel hardware-based solutions have been developed.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], the authors proposed subsequence similarity search
on a CPU cluster. Subsequences starting from different
positions of the time series are sent to different nodes, and
each node calculates DTW in the naïve way. In [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], the authors
accelerated subsequence similarity search on an SMP
system. They distribute different queries over different
cores, and each subsequence is sent to different cores to be
compared with different patterns in the naïve way. In both
implementations, the data transfer becomes the bottleneck.</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], a GPU-based implementation was proposed.
The warping matrix is generated in parallel, but the warping
path is searched serially. Since the matrix generation step
and the path search step are split into two kernels, this leads
to overheads for storage and transmission of the warping
matrix for each DTW calculation.
        </p>
<p>Definition 5. The z-normalization of a time series T is
defined as a time series T̂ = t̂_1, t̂_2, …, t̂_n, where
t̂_i = (t_i − μ) / σ, (5)
μ = (1/n) Σ_{i=1}^{n} t_i, σ² = (1/n) Σ_{i=1}^{n} t_i² − μ². (6)</p>
        <p>Definition 6. The Euclidean distance (ED) between two
(z-normalized) subsequences X and Y, where |X| = |Y| = n, is
defined as below:
ED(X, Y) = √( Σ_{i=1}^{n} (x_i − y_i)² ). (7)</p>
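<p>Definitions 5 and 6 can be sketched in Python as follows; a minimal illustrative version, with the squared ED used later in place of the square-rooted one:</p>

```python
import math

def znormalize(t):
    """Z-normalize a sequence: zero mean, unit standard deviation
    (Definition 5), using sigma^2 = E[t^2] - mu^2."""
    n = len(t)
    mu = sum(t) / n
    sigma = math.sqrt(sum(v * v for v in t) / n - mu * mu)
    return [(v - mu) / sigma for v in t]

def ed_squared(x, y):
    """Squared Euclidean distance between equal-length sequences
    (Definition 6 without the square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))
```

<p>After z-normalization, the sample mean of the result is zero and its sample variance is one, which is exactly what the comparison of subsequences requires.</p>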
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Serial algorithm</title>
        <p>
          Currently, UCR-DTW [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is the fastest serial algorithm of
subsequence similarity search, which integrates a large
number of algorithmic speedup techniques. Since our
algorithm is based on UCR-DTW, in this section, we briefly
describe its basic features that are essentially exploited in
our approach.
        </p>
<p>Squared distances. Instead of using the square root in DTW
and ED distance calculations, it is possible to use the squares
thereof. Since both functions are monotonic and concave, this
does not change the relative rankings of subsequences. In
what follows, where there is no ambiguity, we will still use
DTW and ED assuming the squared versions of them.</p>
        <p>Z-normalization. Both the query subsequence and each
subsequence of the time series need to be z-normalized
before the comparison. Z-normalization shifts and scales the
time series such that the mean is zero and the standard
deviation is one.</p>
<p>
          Cascading Lower Bounds. A lower bound (LB) is an easily
computable threshold of the DTW distance measure used to
identify and prune clearly dissimilar subsequences [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In
what follows, we refer to this threshold as the best-so-far
distance (or bsf for brevity), since UCR-DTW tries to
improve (decrease) it while scanning the time series. If the
lower bound exceeds bsf, the DTW distance will exceed it
as well, and the respective subsequence is
assumed to be clearly dissimilar and is pruned without
calculation of DTW. UCR-DTW exploits the following
LBs, namely LBKimFL [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], LBKeoghEC and LBKeoghEQ [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
and they are applied in a cascade.
        </p>
<p>The LBKimFL lower bound uses the distances between
the First (Last) pair of points from Q and C as a lower
bound, and is defined as below:
LBKimFL(Q, C) := (q̂_1 − ĉ_1)² + (q̂_n − ĉ_n)². (9)</p>
        <p>The LBKeoghEC lower bound is the distance from a candidate
subsequence to the closer of the two envelopes of the query.
Here U = u_1, …, u_n and L = ℓ_1, …, ℓ_n are the upper envelope
and lower envelope of the query, respectively, where
u_i = max(q̂_{i−r}, …, q̂_{i+r}) and ℓ_i = min(q̂_{i−r}, …, q̂_{i+r}),
and the warping path cannot deviate more than r cells from the diagonal of
the warping matrix. LBKeoghEC is defined as below:
LBKeoghEC(Q, C) := Σ_{i=1}^{n} { (ĉ_i − u_i)², if ĉ_i &gt; u_i; (ĉ_i − ℓ_i)², if ĉ_i &lt; ℓ_i; 0, otherwise }. (10)</p>
        <p>The LBKeoghEQ lower bound is the distance from the
query to the closer of the two envelopes of a candidate
subsequence. That is, as opposed to LBKeoghEC, the roles of
the query and the candidate subsequence are reversed:
LBKeoghEQ(Q, C) := LBKeoghEC(C, Q). (11)</p>
<p>UCR-DTW performs as follows. At first, bsf
is assumed to be equal to infinity. Then
the algorithm scans the input time series, applying the
cascade of LBs to the current subsequence. If the
subsequence is not pruned, then the DTW distance is
calculated. Next, bsf is updated if it is greater than the value
of the DTW distance calculated above. By doing so, in the end,
UCR-DTW finds the best matching subsequence of the
given time series.</p>
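<p>The lower bounds and the cascade described above can be sketched in Python; this is an illustrative version (not the paper's native Phi code), with the envelope and pruning logic written out explicitly:</p>

```python
def envelope(q, r):
    """Upper/lower envelopes of q under warping constraint r:
    u_i = max(q[i-r..i+r]), l_i = min(q[i-r..i+r])."""
    n = len(q)
    u, l = [], []
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        window = q[lo:hi]
        u.append(max(window))
        l.append(min(window))
    return u, l

def lb_keogh(c, u, l):
    """LB_Keogh: squared distance from sequence c to the closer envelope."""
    s = 0.0
    for ci, ui, li in zip(c, u, l):
        if ci > ui:
            s += (ci - ui) ** 2
        elif ci < li:
            s += (ci - li) ** 2
    return s

def lb_kim_fl(q, c):
    """LB_KimFL: squared distances between the first and last pairs of points."""
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2

def cascade_prune(q, c, r, bsf):
    """True if the candidate c is clearly dissimilar to q, i.e. some LB
    in the cascade already reaches the best-so-far distance bsf."""
    if lb_kim_fl(q, c) >= bsf:
        return True
    u, l = envelope(q, r)
    if lb_keogh(c, u, l) >= bsf:          # LB_KeoghEC
        return True
    uc, lc = envelope(c, r)
    if lb_keogh(q, uc, lc) >= bsf:        # LB_KeoghEQ: roles reversed
        return True
    return False
```

<p>Only candidates surviving all three lower bounds proceed to the expensive DTW computation, which is the essence of the cascade.</p>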
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Method</title>
<p>In this section, we present a novel computational scheme
and data layout, which allow efficient parallelization and
vectorization of subsequence similarity search on Intel
Xeon Phi KNL.</p>
      <sec id="sec-3-1">
        <title>Vectorization</title>
<p>Vectorization plays the key role in getting high
performance on Phi KNL: the compiler aims to
transform the loops into sequences of vector operations that
utilize the VPUs. Thus, in order to provide high
performance of subsequence similarity search on Phi KNL,
we should organize computations with as many
auto-vectorizable loops as possible. However, many
auto-vectorizable loops alone are not enough
for the overall good performance of an algorithm.</p>
        <p>
          Unaligned memory access is one of the basic factors that
can cause inefficient vectorization due to timing overhead
for loop peeling [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. If the start address of the processed data
is not aligned by the VPU width (i.e. by the number of floats
that can be loaded in VPU), the loop is split into three parts
by the compiler. The first part of the iterations, which accesses
memory from the start address up to the first aligned address, is
peeled off and vectorized separately. The remaining iterations,
from the last aligned address to the end address, are
split off and vectorized separately as well.
        </p>
<p>Following this argumentation, we propose a data
layout that provides aligned access to subsequences
of the time series. Applying this data layout, we also
propose the respective computational scheme, where as
many computations as possible are implemented as
auto-vectorizable for-loops.</p>
        <sec id="sec-3-1-1">
          <title>4.1 Data layout</title>
<p>Definition 7. Given a subsequence C of length n and the VPU
width w (i.e. the number of floats that can be loaded in the VPU),
we define the alignment of C as a subsequence C̃ of n + pad
elements, where pad zeros are appended so that the length becomes
a multiple of w:
C̃ := c_1, c_2, …, c_n, 0, 0, …, 0, if n mod w &gt; 0;
C̃ := C, otherwise. (14)</p>
<p>According to Def. 2, ∀ Q, C: |Q| = |C|,
DTW(Q, C) = DTW(Q̃, C̃). Thus, in what follows, we still write Q
and C for the query subsequence and a subsequence of the input
time series, assuming, however, the aligned versions
thereof. Alignment of subsequences allows us to avoid timing
overheads for loop peeling.</p>
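<p>Definition 7 amounts to zero-padding each subsequence up to a multiple of the VPU width; a minimal Python sketch (in the actual native code, the base address would additionally be aligned, e.g. to 64 bytes, which this sketch does not model):</p>

```python
def align(c, w):
    """Pad subsequence c with zeros so its length becomes a multiple
    of the VPU width w (Definition 7); unchanged if already a multiple."""
    pad = (-len(c)) % w  # number of zeros needed to reach a multiple of w
    return list(c) + [0.0] * pad
```

<p>For a 512-bit VPU operating on 32-bit floats, w = 16, so a subsequence of length 100 would be padded to 112 elements.</p>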
          <p>Next, we store all the (aligned) subsequences of a time
series as a matrix in order to provide auto-vectorization of
computations.</p>
<p>Definition 8. Let us denote the number of all the
subsequences of length n in a time series T as N,
N = |T| − n + 1 = m − n + 1. We define the subsequence
matrix (i.e. the matrix of all aligned subsequences of length
n from a time series T), S^T ∈ ℝ^{N×(n+pad)}, as below:
S^T(i, j) := t̃_{i+j−1}. (15)</p>
<p>We also establish two more matrices regarding lower
bounding of all subsequences of the input time series. The
first matrix stores, for each subsequence, the values of all the
LBs in the cascade (cf. Sect. 3.2). The second one is a bitmap
matrix, which stores, for each subsequence, the result of
comparison of bsf with every LB.</p>
<p>Definition 9. Let us denote the number of LBs exploited
by the subsequence similarity search algorithm as lbmax
and enumerate them according to their order in the lower
bounding cascade. Given a time series T, we define the
LB-matrix of all subsequences of length n from T, L^T ∈
ℝ^{N×lbmax}, as below:
L^T(i, j) := LB_j(S^T(i,·)). (16)</p>
          <p>Definition 10. Given a time series T, we define the
bitmap matrix of all subsequences of length n from T,
B^T ∈ 𝔹^{N×lbmax}, as below:
B^T(i, j) := (L^T(i, j) &lt; bsf). (17)</p>
<p>Finally, we establish a matrix to store candidate
subsequences, i.e. those subsequences from the S^T matrix
which have not been pruned after the lower bounding. This
matrix will be processed in parallel by calculating the DTW
distance measure between each row, which represents a
subsequence, and the query.</p>
          <p>Definition 11. Let us denote the number of threads
employed by the parallel algorithm as p (p ≥ 1). Given the
segment size s ∈ ℤ+ (s ≤ ⌈N/p⌉), we define the candidate
matrix C^T ∈ ℝ^{(p·s)×(n+pad)} as the matrix composed of the
rows of the subsequence matrix that pass all the lower bounds:
C^T(k,·) := S^T(i,·) : ∀ j, 1 ≤ j ≤ lbmax, B^T(i, j) = 1. (18)</p>
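<p>The data layout of Definitions 8 and 10 can be sketched in Python; an illustrative version with plain lists (the actual implementation stores these as contiguous aligned arrays):</p>

```python
def subsequence_matrix(t, n, w):
    """Matrix of all aligned subsequences of length n from t (Definition 8):
    one row per start position, each row zero-padded to a multiple of
    the VPU width w."""
    pad = (-n) % w
    return [list(t[i:i + n]) + [0.0] * pad for i in range(len(t) - n + 1)]

def bitmap_matrix(lb_matrix, bsf):
    """Bitmap matrix (Definition 10): 1 where the lower bound is below
    bsf, i.e. the subsequence is still a potential candidate."""
    return [[1 if v < bsf else 0 for v in row] for row in lb_matrix]
```

<p>Rows of the subsequence matrix whose bitmap row is all ones would then be copied into the candidate matrix of Definition 11.</p>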
        </sec>
        <sec id="sec-3-1-2">
<title>4.2 Computational scheme</title>
<p>We are finally in a position to introduce our approach. In
order to start the search, the algorithm initializes bsf by
calculating the DTW distance measure between the query
and the first subsequence.</p>
<p>After that, the algorithm carries out the following loop
as long as there are subsequences that have not been pruned
during the search. At first, lower bounding is performed and
the bitmap matrix is calculated in parallel. Next, promising
candidates are added to the candidate matrix in serial mode.
Then, the DTW distance measure between the query and
each subsequence of the candidate matrix is calculated in
parallel, and the minimal distance is found. At the end of the loop,
we output the index of the subsequence with the minimal
distance measure. Below, we describe these steps in detail.</p>
          <p>Algorithm PHIBESTMATCH (IN: time series T, query Q;
OUT: bsf, similarity of the best match subsequence;
bestmatch, index of the best match subsequence)
1: CALCENVELOPE(Q, r, U, L)
2: CALCLOWERBOUNDS(S^T, Q, U, L)
3: bsf ← UCRDTW(S^T(1,·), Q, r, ∞)
4: numcand ← 1
5: while numcand &gt; 0 do
6:   LOWERBOUNDING(L^T, B^T, bsf)
7:   numcand ← FILLCANDMATR(S^T, C^T, B^T, bsf)
8:   if numcand &gt; 0 then
9:     bestmatch ← CALCCANDMATR(C^T, Q, r, bsf)
11: return bestmatch</p>
          <p>Algorithm CALCLOWERBOUNDS (IN: subsequence matrix S^T,
query Q, envelopes U and L of the query; OUT: LB-matrix L^T)
1: #pragma omp parallel for num_threads(p)
2: for i from 1 to N do
3:   ZNORMALIZE(S^T(i,·))
4:   L^T(i,1) ← LBKIMFL(Q, S^T(i,·))
5:   L^T(i,2) ← LBKEOGHEC(Q, S^T(i,·))
6:   CALCENVELOPE(S^T(i,·), r, U, L)
7:   L^T(i,3) ← LBKEOGHEQ(S^T(i,·), Q, U, L)</p>
          <p>Figure 9 Calculation of lower bounds</p>
<p>Strictly speaking, this step brings redundant calculations
to our algorithm. In contrast, UCR-DTW calculates the next
LB in the cascade only if the current subsequence has not been
recognized as clearly dissimilar after the calculation of the
previous LB. As opposed to UCR-DTW, we calculate all the LBs and
z-normalized versions for all the subsequences because of the
following reasons. Firstly, it is possible to perform such
computations once and before the scanning of all the
subsequences. Secondly, such computations can
be efficiently vectorized by the compiler.</p>
<p>Lower bounding. Figure 3 depicts the pseudo-code for
lower bounding of subsequences. The algorithm performs
lower bounding by scanning the LB-matrix and calculating
the respective row of the bitmap matrix. In the next step of
the algorithm, a subsequence corresponding to a row of
the bitmap matrix where each element equals one will be
added to the candidate matrix in order to further calculate
the DTW distance measure.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
<title>4.3 Lower bounding and candidate matrix processing</title>
<p>Algorithm LOWERBOUNDING (IN: LB-matrix L^T,
best-so-far similarity distance bsf; OUT: bitmap matrix B^T)
1: #pragma omp parallel num_threads(p)
2: whoami ← omp_get_thread_num()
3: for i from segindex(whoami) to s · whoami do
4:   for j from 1 to lbmax do
5:     B^T(i, j) ← (L^T(i, j) &lt; bsf)</p>
        <p>The algorithm proceeds as long as subsequences that are not
clearly dissimilar exist. Parallel processing is based on the
following technique. The LB-matrix is logically divided
into p equal-sized segments, and each thread scans its own
segment. In order to avoid scanning each segment from
scratch, we establish the segment index as an array of p
elements, where each element keeps the index of the most
recent candidate subsequence in the respective segment, i.e.
segindex = (idx_1, …, idx_p), where
idx_k := max { i : s · (k − 1) + 1 ≤ i ≤ s · k ∧ ∀ j, 1 ≤ j ≤ lbmax: L^T(i, j) &lt; bsf },
and idx_k := 0 if no such i exists. (19)</p>
<p>The algorithm scans the bitmap matrix
along the segments. We start scanning not from the
beginning of a segment but from the respective segment’s
index, which stores the number of the most recent candidate
subsequence in the segment. If a subsequence is promising,
it is added to the candidate matrix.</p>
<p>In order to output the index of the best match subsequence,
we establish the candidate subsequence index as an array of
p · s elements, where each element keeps the starting
position of a candidate subsequence in the input time series:
candindex = (pos_1, …, pos_{p·s}), where pos_k := i,
1 ≤ i ≤ m − n + 1, such that the k-th row of the candidate
matrix C^T stores the subsequence T_{i,n}. (20)</p>
        <p>Algorithm CALCCANDMATR (IN: candidate matrix C^T, query Q;
OUT: bsf, similarity of the best-so-far subsequence;
bestmatch, index of the best-so-far subsequence)
1: #pragma omp parallel for num_threads(p) shared(bsf, idx) private(distance)
2: for i from 1 to numcand do
3:   distance ← UCRDTW(C^T(i,·), Q, r, bsf)
4:   #pragma omp critical
5:   if bsf &gt; distance then
6:     bsf ← distance
7:     bestmatch ← candindex(i)
9: return bestmatch</p>
<p>Figure 12 Processing of the candidate matrix</p>
        <p>The algorithm performs as follows. For each row of the
candidate matrix, we calculate the DTW distance measure
between the respective candidate and the query by means of
the UCR-DTW algorithm. If this distance is less than bsf,
then bsf is updated. The loop is parallelized by means of
the OpenMP pragma, where bsf is indicated as a variable
shared across all threads, while the distance variable is
indicated as private for each thread. In order to correctly
update the shared variable, we use the pragma with a critical
section.</p>
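<p>The shared/private pattern above can be mimicked in a Python sketch (illustrative only; the paper's implementation uses native OpenMP code, and the lock here plays the role of the critical section):</p>

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def process_candidates(candidates, query, distance_fn, n_threads=4):
    """Parallel scan of the candidate matrix: each worker computes the
    distance of its candidates to the query (private state) and updates
    the shared best-so-far value under a lock (the critical section)."""
    best = {"bsf": float("inf"), "index": -1}
    lock = threading.Lock()

    def work(item):
        i, cand = item
        d = distance_fn(cand, query)   # private to the worker
        with lock:                     # critical section: shared update
            if d < best["bsf"]:
                best["bsf"] = d
                best["index"] = i

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, enumerate(candidates)))
    return best["bsf"], best["index"]
```

<p>Because every update of the shared minimum happens under the lock, the result is the global minimum regardless of thread scheduling, which mirrors the correctness argument for the OpenMP critical section.</p>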
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 Experiments</title>
      <sec id="sec-4-1">
        <title>5.1 Experimental setup</title>
<p>Objectives. In the experiments, we compared the
performance of our algorithm with UCR-DTW.
We also evaluated the scalability of our algorithm on
Intel Xeon Phi for different datasets. We measured the run
time (after deduction of I/O time) and calculated the
algorithm’s speedup and parallel efficiency. Here we
understand these characteristics of parallel algorithm
scalability as follows. The speedup and parallel efficiency of a
parallel algorithm employing p threads are calculated,
respectively, as S(p) = t_1 / t_p and E(p) = S(p) / p,
where t_1 and t_p are the run times of the algorithm when one
and p threads are employed, respectively.</p>
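<p>These two definitions translate directly into code; a trivial sketch with hypothetical run times:</p>

```python
def speedup(t1, tp):
    """S(p) = t_1 / t_p: ratio of single-thread to p-thread run time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(p) = S(p) / p: speedup normalized by the number of threads."""
    return speedup(t1, tp) / p
```

<p>For example, a run that takes 100 s on one thread and 10 s on 20 threads has a speedup of 10× and a parallel efficiency of 50 percent.</p>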
<p>Hardware. We performed our experiments on a node
of the Tornado SUSU supercomputer [8] with the
characteristics summarized in Table 1.</p>
        <sec id="sec-4-1-2">
          <title>5.2 Results and discussion</title>
<p>Figure 6 depicts the performance of PHIBESTMATCH in
comparison with the UCR-DTW algorithm. As we can see,
our algorithm is two times faster for both the EPG and Random
Walk datasets when UCR-DTW runs on one CPU core and
PHIBESTMATCH runs on 240 cores of Intel Xeon Phi. Being
run on Intel Xeon Phi, UCR-DTW is obviously slower than
PHIBESTMATCH (from 10 to 15 times).</p>
          <p>On the Random Walk dataset, speedup is close to linear
while the number of threads does not exceed the number of
physical cores. However, when more than one thread per physical
core is used, speedup becomes sub-linear, and efficiency decreases
accordingly. That is, speedup stops increasing at 80×
when 120 threads are employed, and efficiency drops from
60 percent for 120 threads to 30 percent for 240 threads.</p>
<p>On EPG data, the algorithm shows close to linear
speedup and efficiency (up to 50× and at least 80 percent,
respectively) if the number of threads employed is at most the
number of physical cores. When more than one thread per
physical core is used, speedup slowly increases up to 78×,
and efficiency drops accordingly (similar to the picture for
the Random Walk dataset).</p>
<p>We can conclude that the proposed algorithm
demonstrates close to linear scalability when the number of
threads it runs on does not exceed the number of physical cores of
the Intel Xeon Phi many-core processor. However, when
more than one thread per physical core is used, speedup and
parallel efficiency decrease significantly.</p>
          <p>There are two possible reasons for this. Firstly, our
algorithm is not completely parallel, and the candidate matrix
filling is its serial part, which limits speedup. Secondly,
according to its nature, the DTW calculation (cf. Def. 2) can
hardly ever be auto-vectorized. Thus, if during the (seamlessly
auto-vectorizable) lower bounding step many subsequences
have not been pruned as clearly dissimilar, then they will be
processed by many threads in the DTW calculation step, but
without the auto-vectorization that might be expected.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 Conclusion</title>
<p>
        In this paper, we address the problem of accelerating
subsequence similarity search on the modern Intel Xeon Phi
system of the Knights Landing (KNL) generation. Phi KNL is
an independent bootable device, which provides up to 72
compute cores with a high local memory bandwidth and
512-bit wide vector processing units. Being based on the
x86 architecture, the Intel Phi KNL many-core processor
supports thread-level parallelism and the same
programming tools as a regular Intel Xeon CPU, and serves
as an attractive alternative to FPGA and GPU. We consider
the case when the time series involved in the computations fit
in main memory. We developed a novel parallel algorithm of
subsequence similarity search for Intel Xeon Phi KNL,
called PHIBESTMATCH. Our algorithm is based on
UCR-DTW [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which is the fastest serial algorithm of
subsequence similarity search, since it integrates cascading
lower bounding and many other algorithmic speedup
techniques. PHIBESTMATCH efficiently exploits the
vectorization capabilities of Phi KNL by means of a
sophisticated data layout and computational scheme.
      </p>
<p>We performed experiments on synthetic and
real-world datasets, which showed the following.
PHIBESTMATCH, being run on Intel Xeon Phi, is two times
faster than UCR-DTW being run on Intel Xeon. The
proposed algorithm demonstrates close to linear
speedup and parallel efficiency when the number of
threads it runs on does not exceed the number of physical cores
of Intel Xeon Phi.</p>
<p>In further research, we plan to advance our approach
in the following directions: improve the parallelization of
the UCR-DTW algorithm for Intel Xeon Phi KNL, and
extend our algorithm to computer cluster systems
with nodes equipped with Intel Xeon Phi KNL.</p>
      <p>Acknowledgments. This work was financially
supported by the Russian Foundation for Basic Research
(grant No. 17-07-00463), by Act 211 of the Government of the
Russian Federation (contract No. 02.A03.21.0011) and
by the Ministry of Education and Science of the Russian
Federation (government order 2.7905.2017/8.9).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Abdullaev</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhelnin</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenskaya</surname>
            ,
            <given-names>O.Y.</given-names>
          </string-name>
          :
          <article-title>The structure of mesoscale convective systems in central Russia</article-title>
          .
          <source>Russian Meteorology and Hydrology</source>
          .
          <volume>37</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>12</fpage>
          -
          <lpage>20</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bacon</surname>
            ,
            <given-names>D.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graham</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharp</surname>
            ,
            <given-names>O.J.:</given-names>
          </string-name>
          <article-title>Compiler transformation for high-performance computing</article-title>
          .
          <source>ACM Computing Surveys. 26</source>
          , pp.
          <fpage>345</fpage>
          -
          <lpage>420</lpage>
          (
          <year>1994</year>
). doi:10.1145/197405.197406
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Berndt</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clifford</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Using dynamic time warping to find patterns in time series</article-title>
. In: Fayyad, U.M., Uthurusamy, R. (eds.) KDD Workshop, pp.
          <fpage>359</fpage>
          -
          <lpage>370</lpage>
          . AAAI Press (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Chrysos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Intel Xeon Phi coprocessor (codename Knights Corner)</article-title>
          . In:
          <source>2012 IEEE Hot Chips 24th Symposium (HCS)</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          (
          <year>2012</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/HOTCHIPS.2012.7476487</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trajcevski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scheuermann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          :
          <article-title>Querying and mining of time series data: experimental comparison of representations and distance measures</article-title>
          .
          <source>PVLDB</source>
          <volume>1</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1542</fpage>
          -
          <lpage>1552</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Epishev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isaev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miniakhmetov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          et al.:
          <article-title>Physiological data mining system for elite sports</article-title>
          .
          <source>Bull. of South Ural State University. Series: Comput. Math. and Soft. Eng.</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>44</fpage>
          -
          <lpage>54</lpage>
          ,
          <year>2013</year>
          (
          <year>2013</year>
          ). (in Russian) doi:
          <pub-id pub-id-type="doi">10.14529/cmse130105</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ratanamahatana</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Exact indexing of dynamic time warping</article-title>
          .
          <source>Knowl. Inf. Syst.</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>406</fpage>
          -
          <lpage>417</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Kostenetskiy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Safonov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SUSU supercomputer resources</article-title>
          . In:
          <string-name>
            <surname>Sokolinsky</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Starodubov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (eds.) PCT'2016, pp.
          <fpage>561</fpage>
          -
          <lpage>573</lpage>
          . CEUR-WS, vol.
          <volume>1576</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Miniakhmetov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Movchan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zymbler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Accelerating time series subsequence matching on the Intel Xeon Phi many-core coprocessor</article-title>
          . In:
          <string-name>
            <surname>Biljanovic</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butkovic</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          et al. (eds.)
          <source>MIPRO</source>
          <year>2015</year>
          , pp.
          <fpage>1399</fpage>
          -
          <lpage>1404</lpage>
          (
          <year>2015</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/MIPRO.2015</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Pearson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The problem of the random walk</article-title>
          .
          <source>Nature</source>
          <volume>72</volume>
          (
          <issue>1865</issue>
          ),
          <fpage>342</fpage>
          (
          <year>1905</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1038/072342a0</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Rakthanmanon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campana</surname>
            ,
            <given-names>B.J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mueen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batista</surname>
            ,
            <given-names>G.E.A.P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Westover</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zakaria</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          :
          <article-title>Searching and mining trillions of time series subsequences under dynamic time warping</article-title>
          . In:
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (eds.) KDD, pp.
          <fpage>262</fpage>
          -
          <lpage>270</lpage>
          . ACM (
          <year>2012</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1145/2339530.2339576</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Ratanamahatana</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          :
          <article-title>Three myths about Dynamic Time Warping Data Mining</article-title>
          . In:
          <string-name>
            <surname>Kargupta</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamath</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.)
          <source>SDM</source>
          <year>2005</year>
          , pp.
          <fpage>506</fpage>
          -
          <lpage>510</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1137/1.9781611972757.50</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Sakurai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faloutsos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamamuro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Stream monitoring under the time warping distance</article-title>
          . In:
          <string-name>
            <surname>Chirkova</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dogac</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tamer Ozsu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sellis</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          (eds.)
          <source>ICDE</source>
          <year>2007</year>
          , pp.
          <fpage>1046</fpage>
          -
          <lpage>1055</lpage>
          . IEEE Computer Society (
          <year>2007</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/ICDE.2007.368963</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Sart</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mueen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Najjar</surname>
            ,
            <given-names>W.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niennattrakul</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Accelerating dynamic time warping subsequence search with GPUs and FPGAs</article-title>
          . In:
          <string-name>
            <surname>Webb</surname>
            ,
            <given-names>G.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gunopulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (eds.)
          <source>ICDM</source>
          <year>2010</year>
          , pp.
          <fpage>1001</fpage>
          -
          <lpage>1006</lpage>
          . IEEE Computer Society (
          <year>2010</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/ICDM.2010.21</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Sodani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Knights Landing (KNL): 2nd generation Intel Xeon Phi processor</article-title>
          . In:
          <source>2015 IEEE Hot Chips 27th Symposium (HCS)</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          . IEEE Computer Society (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Sokolinskaya</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sokolinsky</surname>
            ,
            <given-names>L.B.</given-names>
          </string-name>
          :
          <article-title>Scalability evaluation of NSLP algorithm for solving nonstationary linear programming problems on cluster computing systems</article-title>
          . In:
          <string-name>
            <surname>Voevodin</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sobolev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (eds.)
          <source>RuSCDays 2017</source>
          . CCIS, vol.
          <volume>793</volume>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>53</lpage>
          . Springer, Heidelberg (
          <year>2017</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1007/978-3-319-71255-0_4</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Srikanthan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Implementing the dynamic time warping algorithm in multithreaded environments for real time and unsupervised pattern discovery</article-title>
          . In:
          <source>IEEE ICCCT</source>
          , pp.
          <fpage>394</fpage>
          -
          <lpage>398</lpage>
          . IEEE Computer Society (
          <year>2011</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/ICCCT.2011.6075111</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Takahashi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoshihisa</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakurai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanazawa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A parallelized data stream processing system using Dynamic Time Warping distance</article-title>
          . In:
          <string-name>
            <surname>Barolli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xhafa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>H.-H.</given-names>
          </string-name>
          (eds.) CISIS, pp.
          <fpage>1100</fpage>
          -
          <lpage>1105</lpage>
          (
          <year>2009</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/CISIS.2009.77</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Yu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Accelerating subsequence similarity search based on dynamic time warping distance with FPGA</article-title>
          . In:
          <string-name>
            <surname>Hutchings</surname>
            ,
            <given-names>B.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betz</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (eds.) ACM/SIGDA FPGA'13
          , pp.
          <fpage>53</fpage>
          -
          <lpage>62</lpage>
          . ACM (
          <year>2013</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1145/2435264.2435277</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adl</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glass</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Fast spoken query detection using lower-bound Dynamic Time Warping on Graphical Processing Units</article-title>
          . In:
          <source>ICASSP 2012</source>
          , pp.
          <fpage>5173</fpage>
          -
          <lpage>5176</lpage>
          . IEEE (
          <year>2012</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/ICASSP.2012.6289085</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Zymbler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Best-Match time series subsequence search on the Intel Many Integrated Core architecture</article-title>
          . In:
          <string-name>
            <surname>Morzy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valduriez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellatreche</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <source>ADBIS 2015</source>
          . LNCS, vol.
          <volume>9282</volume>
          , pp.
          <fpage>275</fpage>
          -
          <lpage>286</lpage>
          . Springer, Heidelberg (
          <year>2015</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1007/978-3-319-23135-8_19</pub-id>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>