An Efficient Subsequence Similarity Search on Modern Intel Many-core Processors for Data Intensive Applications

© Yana Kraeva   © Mikhail Zymbler
South Ural State University (National Research University), Chelyabinsk, Russia
kraevaya@susu.ru   mzym@susu.ru

Abstract. Many time series analytical problems arising in a wide spectrum of data intensive applications require subsequence similarity search as a subtask. Currently, Dynamic Time Warping (DTW) is considered the best similarity measure in most domains. Since DTW is computationally expensive, parallel algorithms have been developed for FPGA, GPU, and Intel Xeon Phi. In our previous work, we accelerated subsequence similarity search on the Intel Xeon Phi Knights Corner generation by a CPU+Phi computational scheme. Such an approach needs significant changes for Phi's second-generation product, Knights Landing (KNL), which is an independent bootable device supporting only native applications. In this paper, we present a novel parallel algorithm for subsequence similarity search in time series for the Intel Xeon Phi KNL many-core processor. In order to efficiently exploit the vectorization capabilities of Phi KNL, the algorithm provides a sophisticated data layout and computational scheme. We performed experiments on synthetic and real-world datasets, which showed good scalability of the algorithm.

Keywords: time series, subsequence similarity search, dynamic time warping, parallel algorithm, OpenMP, vectorization, Intel Xeon Phi Knights Landing.

1 Introduction

Nowadays, time series are pervasive in a wide spectrum of applications with data intensive analytics, e.g. climate modelling [1], economic forecasting [16], medical monitoring [6], etc. Many time series analytical problems require subsequence similarity search as a subtask, which assumes that a query subsequence and a longer time series are given, and we are to find a subsequence of the time series whose similarity to the query is the maximum among all the subsequences.

At the present time, Dynamic Time Warping (DTW) [3] is considered the best time series subsequence similarity measure in most domains [5], since it allows the subsequence to have some stretching, shrinking, warping, or a different length in comparison to the query. Since computation of DTW is time-consuming ($O(n^2)$, where $n$ is the length of the query), speedup techniques have been proposed, including algorithmic developments (indexing methods [7], early abandoning strategies, embedding and computation reuse strategies [13], etc.) and parallel hardware-based solutions for FPGA [19], GPU [14], and Intel Xeon Phi [21].

In this paper, the Intel Xeon Phi many-core system is the subject of our efforts. The architecture of Phi provides a large number of compute cores with a high local memory bandwidth and 512-bit wide vector processing units. Being based on the Intel x86 architecture, Phi supports thread-level parallelism and the same programming tools as a regular Intel Xeon CPU, and serves as an attractive alternative to FPGA and GPU. Currently, Intel offers two generations of Phi products, namely Knights Corner (KNC) [4] and Knights Landing (KNL) [15]. The former is a coprocessor with up to 61 cores, which supports native applications as well as offloading of calculations from a host CPU. The latter provides up to 72 cores and, as opposed to its predecessor, is an independent bootable device, which runs applications only in native mode.

In our previous works [9] and [21], we accelerated subsequence similarity search on Phi KNC by means of the CPU+Phi computational scheme. Such an approach needs significant changes for Phi KNL because, with no host CPU, in addition to parallelization we have to efficiently vectorize computations in order to achieve high performance.

In this paper, we address accelerating subsequence similarity search on Phi KNL for the case when the time series involved in the computations fit in main memory. The paper makes the following basic contributions.
We developed a novel parallel algorithm of subsequence similarity search in time series for the Intel Xeon Phi KNL processor. The algorithm efficiently exploits the vectorization capabilities of Phi KNL by means of a sophisticated data layout and computational scheme. We performed experiments on real-world datasets, which showed good scalability of the algorithm.

The rest of the paper is organized as follows. Section 2 discusses related works. Section 3 gives formal notation and the statement of the problem. In Section 4, we present the proposed parallel algorithm. We describe the experimental evaluation of our algorithm in Section 5. Finally, in Section 6, we summarize the results obtained and propose directions for further research.

Proceedings of the XX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2018), Moscow, Russia, October 9-12, 2018

2 Related works

The problem of subsequence similarity search under the DTW measure has been extensively studied in the recent decade. Since computation of DTW is time-consuming, speedup techniques have been proposed. Algorithmic developments include indexing methods [7], early abandoning strategies, embedding and computation reuse strategies [13], etc. Although these techniques focus on reducing the number of calls to the DTW calculation procedure, computation of the DTW distance measure still takes up to 80 percent of the total run time of the similarity search [20]. For this reason, a number of parallel hardware-based solutions have been developed.

In [17], the authors proposed subsequence similarity search on a CPU cluster. Subsequences starting from different positions of the time series are sent to different nodes, and each node calculates DTW in the naïve way. In [18], the authors accelerated subsequence similarity search on an SMP system. They distribute different queries to different cores, and each subsequence is sent to different cores to be compared with different patterns in the naïve way. In both implementations, the data transfer becomes the bottleneck.

In [20], a GPU-based implementation was proposed. The warping matrix is generated in parallel, but the warping path is searched serially. Since the matrix generation step and the path search step are split into two kernels, this leads to overheads for storage and transmission of the warping matrix for each DTW calculation.

In [14], GPU and FPGA implementations of subsequence similarity search were presented. The GPU implementation is based on the same ideas as [20]. The FPGA system consists of two modules, namely Normalizer (z-normalization of subsequences) and Warper (DTW calculation), and is generated by a C-to-VHDL tool, which exploits the fine-grained parallelism of the DTW. However, this implementation suffers from a lack of flexibility, i.e. it must be recompiled if the length of the query is changed.

In [19], the authors proposed a framework for FPGA-based subsequence similarity search, which utilizes the data reusability of continuous DTW calculations to reduce the bandwidth and exploit the coarse-grain parallelism.

The aforementioned developments, however, cannot be directly applied to the x86-based many-core systems Intel Xeon Phi KNC [4] and Intel Xeon Phi KNL [15].

In our previous works [9] and [21], we accelerated subsequence similarity search on the Intel Xeon Phi KNC many-core coprocessor. A naïve approach, where all the subsequences of a time series are simply distributed across the threads of both CPU and Phi for calculation of the DTW distance measure, gave unsatisfactory performance and speedup because of insufficient workload of the coprocessor. We proposed the CPU+Phi computational scheme where the coprocessor is exploited only for DTW distance computations, whereas the CPU performs lower bounding and prepares subsequences for the coprocessor. The CPU maintains a queue of candidate subsequences and offloads it to the coprocessor to compute DTW. Such a scheme significantly outperformed the naïve approach.

This development, however, cannot be directly applied to Phi KNL since it is an independent bootable many-core processor and supports only native applications. Thus, our previous approach needs significant changes. Moreover, in order to achieve high performance of similarity search on Phi KNL, in addition to parallelization, we should provide a data layout and computational scheme that allow exploiting the vectorization capabilities of the many-core processor in the most efficient way.

3 Notation and problem background

3.1 Definitions and notations

Definition 1. A time series $T$ is a sequence of real-valued elements: $T = t_1, t_2, \dots, t_m$. The length of a time series $T$ is denoted by $|T|$.
Definition 2. Given two time series $X = x_1, x_2, \dots, x_m$ and $Y = y_1, y_2, \dots, y_m$, the Dynamic Time Warping (DTW) distance between $X$ and $Y$ is denoted by $DTW(X, Y)$ and defined as below. (Strictly speaking, DTW allows comparing two time series of different lengths; however, for the sake of simplicity, we assume time series of equal lengths, which is possible without losing generality [12].)

$DTW(X, Y) = d(m, m)$  (1)

$d(i, j) = (x_i - y_j)^2 + \min\{d(i-1, j),\ d(i, j-1),\ d(i-1, j-1)\}$  (2)

$d(0, 0) = 0; \quad d(i, 0) = d(0, j) = \infty; \quad i, j = 1, \dots, m$  (3)

In (2), the set of $d(i, j)$ is considered as an $m \times m$ warping matrix for the alignment of the two respective time series.

A warping path is a contiguous set of warping matrix elements that defines a mapping between the two time series. The warping path must start and finish in diagonally opposite corner cells of the warping matrix, the steps in the warping path are restricted to adjacent cells, and the points in the warping path must be monotonically spaced in time.

Definition 3. A subsequence $T_{i,k}$ of a time series $T$ is its contiguous subset of $k$ elements, which starts from position $i$: $T_{i,k} = t_i, t_{i+1}, \dots, t_{i+k-1}$, $1 \le i \le m - k + 1$.

Definition 4. Given a time series $T$ and a time series $Q$ as a user-specified query, where $m = |T| \gg |Q| = n$, the best matching subsequence $T_{i,n}$ meets the property

$\forall k\ DTW(Q, T_{i,n}) \le DTW(Q, T_{k,n}), \quad 1 \le i, k \le m - n + 1$  (4)

In what follows, where there is no ambiguity, we refer to a subsequence $T_{i,n}$ as $C$, a candidate in match to the query $Q$.

Definition 5. The z-normalization of a time series $T$ is defined as a time series $\hat{T} = \hat{t}_1, \hat{t}_2, \dots, \hat{t}_m$, where

$\hat{t}_i = \frac{t_i - \mu}{\sigma}$  (5)

$\mu = \frac{1}{m} \sum_{i=1}^{m} t_i$  (6)

$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} t_i^2 - \mu^2$  (7)
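To make Definition 5 concrete, the following minimal C sketch z-normalizes a series in place according to (5)-(7); it is an illustration under our own naming, not code from the implementation described below.

#include <math.h>

/* Z-normalize t[0..m-1] in place: subtract the mean (6) and
   divide by the standard deviation (7), cf. (5). */
void z_normalize(float *t, long m) {
    float sum = 0.0f, sqsum = 0.0f;
    for (long i = 0; i < m; ++i) {            /* one pass over the series */
        sum += t[i];
        sqsum += t[i] * t[i];
    }
    float mu = sum / m;                       /* (6) */
    float sigma = sqrtf(sqsum / m - mu * mu); /* (7) */
    for (long i = 0; i < m; ++i)
        t[i] = (t[i] - mu) / sigma;           /* (5) */
}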
Definition 6. The Euclidean distance (ED) between two (z-normalized) subsequences $Q$ and $C$, where $|Q| = |C| = n$, is defined as below.

$ED(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$  (8)

3.2 Serial algorithm

Currently, UCR-DTW [11] is the fastest serial algorithm of subsequence similarity search, integrating a large number of algorithmic speedup techniques. Since our algorithm is based on UCR-DTW, in this section we briefly describe its basic features that are essentially exploited in our approach.

Squared distances. Instead of using the square root in the DTW and ED distance calculations, it is possible to use the squares thereof. Since both functions are monotonic and concave, this does not change the relative rankings of subsequences. In what follows, where there is no ambiguity, we still write DTW and ED, assuming the squared versions of them.

Z-normalization. Both the query and each subsequence of the time series need to be z-normalized before the comparison. Z-normalization shifts and scales the time series such that the mean is zero and the standard deviation is one.

Cascading lower bounds. A lower bound (LB) is an easily computable underestimate of the DTW distance measure used to identify and prune clearly dissimilar subsequences [5]. Pruning is performed against a threshold, referred to as the best-so-far distance (or bsf for brevity), which UCR-DTW tries to improve (decrease) while scanning the time series. If a lower bound exceeds bsf, the DTW distance will exceed bsf as well, and the respective subsequence is assumed to be clearly dissimilar and is pruned without calculation of DTW. UCR-DTW exploits the following LBs, namely LB_KimFL [11], LB_KeoghEC and LB_KeoghEQ [7], and they are applied in a cascade.

The LB_KimFL lower bound uses the distances between the First (Last) pairs of points of $C$ and $Q$ as a lower bound, and is defined as below.

$LB_{KimFL}(Q, C) := ED(\hat{q}_1, \hat{c}_1) + ED(\hat{q}_n, \hat{c}_n)$  (9)

The LB_KeoghEC lower bound is the distance from a candidate subsequence to the closer of the two so-called envelopes of the query, and is defined as below.

$LB_{KeoghEC}(Q, C) := \sqrt{\sum_{i=1}^{n} \begin{cases} (\hat{c}_i - u_i)^2, & \text{if } \hat{c}_i > u_i \\ (\hat{c}_i - \ell_i)^2, & \text{if } \hat{c}_i < \ell_i \\ 0, & \text{otherwise} \end{cases}}$  (10)

In (10), the subsequences $U = u_1, \dots, u_n$ and $L = \ell_1, \dots, \ell_n$ are the upper envelope and lower envelope of the query, respectively, defined as below.

$u_i = \max_{i-r \le k \le i+r} \hat{q}_k$  (11)

$\ell_i = \min_{i-r \le k \le i+r} \hat{q}_k$  (12)

Here the parameter $r$ ($1 \le r \le n$) denotes the Sakoe–Chiba band constraint [5], which states that the warping path cannot deviate more than $r$ cells from the diagonal of the warping matrix.

The LB_KeoghEQ lower bound is the distance from the query to the closer of the two envelopes of a candidate subsequence. That is, as opposed to LB_KeoghEC, the roles of the query and the candidate subsequence are reversed:

$LB_{KeoghEQ}(Q, C) := LB_{KeoghEC}(C, Q)$  (13)
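For illustration, a minimal C sketch of LB_KeoghEC per (10) is given below; it assumes the query envelopes u and l from (11)-(12) are precomputed, returns the squared value (cf. the squared-distances convention above), and abandons the summation early once it exceeds the best-so-far threshold, as the serial cascade allows. Identifiers are ours.

/* Squared LB_KeoghEC per (10): distance from candidate c to the
   closer of the query envelopes u (upper) and l (lower).
   Stops early once the partial sum already exceeds bsf. */
float lb_keogh_ec(const float *c, const float *u, const float *l,
                  int n, float bsf) {
    float sum = 0.0f;
    for (int i = 0; i < n && sum < bsf; ++i) {
        float d = 0.0f;
        if (c[i] > u[i]) d = c[i] - u[i];        /* above the upper envelope */
        else if (c[i] < l[i]) d = c[i] - l[i];   /* below the lower envelope */
        sum += d * d;
    }
    return sum;
}

In the parallel version of Sect. 4, the early exit would be dropped: a straight-line loop over all n (padded) elements is what the compiler can auto-vectorize.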
Finally, UCR-DTW performs as follows. Firstly, the z-normalized version of the query and its envelopes are calculated, and bsf is assumed to be equal to infinity. Then the algorithm scans the input time series, applying the cascade of LBs to the current subsequence. If the subsequence is not pruned, the DTW distance is calculated. Next, bsf is updated if it is greater than the value of the DTW distance calculated above. By doing so, in the end, UCR-DTW finds the best matching subsequence of the given time series.

4 Method

In this section, we present a novel computational scheme and data layout, which allow efficient parallelization and vectorization of subsequence similarity search on Intel Xeon Phi KNL.

Vectorization plays the key role in achieving high performance of computations on parallel architectures [2]. This means a compiler's ability to transform loops into sequences of vector operations that utilize the VPUs. Thus, in order to provide high performance of subsequence similarity search on Phi KNL, we should organize computations with as many auto-vectorizable loops as possible.

However, many auto-vectorizable loops are not enough for the overall good performance of an algorithm. Unaligned memory access is one of the basic factors that can cause inefficient vectorization, due to the timing overhead of loop peeling [2]. If the start address of the processed data is not aligned by the VPU width (i.e. by the number of floats that can be loaded into the VPU), the loop is split into three parts by the compiler. The first part of iterations, which accesses memory from the start address to the first aligned address, is peeled off and vectorized separately. The last part of iterations, from the last aligned address to the end address, is split off and vectorized separately as well, while the middle, aligned part forms the main vectorized loop.

According to this argumentation, we propose a data layout which provides aligned access to subsequences of the time series. Applying this data layout, we also propose the respective computational scheme, where as many computations as possible are implemented as auto-vectorizable for-loops.

4.1 Data layout

Firstly, we provide alignment of the processed subsequences.

Definition 7. Given a subsequence $C$ and VPU width $w$, we denote the pad length as $pad = w - (n \bmod w)$ and define the alignment of $C$ as a subsequence $\tilde{C}$ as below.

$\tilde{C} := \begin{cases} c_1, c_2, \dots, c_n, \underbrace{0, 0, \dots, 0}_{pad}, & \text{if } n \bmod w > 0 \\ C, & \text{otherwise} \end{cases}$  (14)

According to Def. 2, $\forall Q, C: |Q| = |C|$, $DTW(Q, C) = DTW(\tilde{Q}, \tilde{C})$. Thus, in what follows, we still write $Q$ and $C$ for the query and a subsequence of the input time series, assuming, however, the aligned versions thereof. Alignment of subsequences allows us to avoid the timing overhead of loop peeling.

Next, we store all the (aligned) subsequences of a time series as a matrix in order to provide auto-vectorization of computations.

Definition 8. Let us denote the number of all the subsequences of length $n$ of a time series $T$ as $N$, $N = |T| - n + 1 = m - n + 1$. We define the subsequence matrix (i.e. the matrix of all aligned subsequences of length $n$ from a time series $T$), $S_T^n \in \mathbb{R}^{N \times (n + pad)}$, as below.

$S_T^n(i, j) := \tilde{t}_{i+j-1}$  (15)
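A possible way to allocate and fill such a matrix is sketched below in C (identifiers ours): with w floats per vector register (w = 16 for a 512-bit VPU and single precision), every row occupies n + pad floats and starts at a 64-byte-aligned address, so loops over a row need no peeling.

#include <stdlib.h>
#include <string.h>

/* Build the subsequence matrix S_T^n (Def. 8): N rows of n + pad
   floats, zero-padded per (14), each row 64-byte aligned. */
float *build_subseq_matrix(const float *t, long m, long n, long *row_len) {
    const long w = 16;                              /* floats per 512-bit VPU */
    const long pad = (n % w == 0) ? 0 : w - n % w;  /* Def. 7 */
    const long len = n + pad;                       /* row length, multiple of w */
    const long N = m - n + 1;
    float *s = aligned_alloc(64, (size_t)(N * len) * sizeof(float));
    for (long i = 0; i < N; ++i) {
        memcpy(&s[i * len], &t[i], (size_t)n * sizeof(float));   /* (15) */
        memset(&s[i * len + n], 0, (size_t)pad * sizeof(float)); /* zero pad */
    }
    *row_len = len;
    return s;
}

The price of this layout is redundancy: each element of T is stored in up to n rows, which is consistent with the paper's assumption that the data fit in main memory.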
We also establish two more matrices regarding lower bounding of all subsequences of the input time series. The first matrix stores, for each subsequence, the values of all the LBs in the cascade (cf. Sect. 3.2). The second one is a bitmap matrix, which stores, for each subsequence, the result of comparison of every LB with bsf.

Definition 9. Let us denote the number of LBs exploited by the subsequence similarity search algorithm as $lb_{max}$ and enumerate them according to their order in the lower bounding cascade. Given a time series $T$, we define the LB-matrix of all subsequences of length $n$ from $T$, $L_T^n \in \mathbb{R}^{N \times lb_{max}}$, as below.

$L_T^n(i, j) := LB_j(T_{i,n})$  (16)

Definition 10. Given a time series $T$, we define the bitmap matrix of all subsequences of length $n$ from $T$, $B_T^n \in \mathbb{B}^{N \times lb_{max}}$, as below.

$B_T^n(i, j) := L_T^n(i, j) < bsf$  (17)

Finally, we establish a matrix to store candidate subsequences, i.e. those subsequences from the $S_T^n$ matrix which have not been pruned after the lower bounding. This matrix will be processed in parallel by calculating the DTW distance measure between each row (representing a subsequence) and the query.

Definition 11. Let us denote the number of threads employed by the parallel algorithm as $p$ ($p \ge 1$). Given the segment size $s \in \mathbb{Z}_+$ ($s \le \lceil N/p \rceil$), we define the candidate matrix $C_T^n \in \mathbb{R}^{(s \cdot p) \times (n + pad)}$ as below.

$C_T^n(i, \cdot) := S_T^n(k, \cdot):\ \forall j,\ 1 \le j \le lb_{max},\ B_T^n(k, j) = 1$  (18)

4.2 Computational scheme

We are finally in a position to introduce our approach. Figure 1 depicts the overall scheme of the algorithm.

Algorithm PHIBESTMATCH
Input:
  T  time series to search
  Q  query subsequence
  r  warping constraint
  p  number of threads employed
  s  segment size
Output:
  bsf  similarity of the best match subsequence
Returns:
  index of the best match subsequence
 1: CALCENVELOPE(Q, r, U, L)
 2: CALCLOWERBOUNDS(S_T^n, Q, r, L_T^n)
 3: bsf ← UCRDTW(T_{1,n}, Q, r, ∞)
 4: numcand ← N
 5: while numcand > 0 do
 6:   LOWERBOUNDING(L_T^n, bsf, B_T^n)
 7:   numcand ← FILLCANDMATR(S_T^n, B_T^n, C_T^n, p, s)
 8:   if numcand > 0 then
 9:     bestmatch ← CALCCANDMATR(C_T^n, numcand, r, p, bsf)
10: return bestmatch

Figure 1  Overall computational scheme

The algorithm performs as follows. Firstly, the lower envelope and the upper envelope of the query are calculated, and the query is z-normalized. Then, for each subsequence in the subsequence matrix, all the LBs of the lower bounding cascade are preliminarily calculated (as well as the z-normalized version of each subsequence). Additionally, in order to start the search, the algorithm initializes bsf by calculating the DTW distance measure between the query and the first subsequence.
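The UCRDTW kernel invoked in line 3 (and later, for each candidate) is the UCR-DTW distance computation [11]; a simplified stand-in is sketched below in C: a squared DTW per (1)-(3) restricted to the Sakoe-Chiba band of width r, abandoning the computation once the minimum over the current row of the band already exceeds bsf. The real kernel includes further optimizations; identifiers are ours.

#include <float.h>

/* Squared DTW between q and c of length n within the Sakoe-Chiba
   band r, computed row by row with two linear buffers; returns
   FLT_MAX as soon as the whole row of the band exceeds bsf. */
float dtw_banded(const float *q, const float *c, int n, int r, float bsf) {
    float prev[n + 1], cost[n + 1];               /* C99 VLAs for brevity */
    for (int j = 1; j <= n; ++j) prev[j] = FLT_MAX;
    prev[0] = 0.0f;                               /* d(0,0) = 0, cf. (3) */
    for (int i = 1; i <= n; ++i) {
        int lo = i - r < 1 ? 1 : i - r;
        int hi = i + r > n ? n : i + r;
        float rowmin = FLT_MAX;
        for (int j = 0; j < lo; ++j) cost[j] = FLT_MAX;  /* outside the band */
        for (int j = lo; j <= hi; ++j) {
            float best = prev[j];                        /* d(i-1, j)   */
            if (cost[j - 1] < best) best = cost[j - 1];  /* d(i, j-1)   */
            if (prev[j - 1] < best) best = prev[j - 1];  /* d(i-1, j-1) */
            float d = q[i - 1] - c[j - 1];
            cost[j] = d * d + best;                      /* (2), squared */
            if (cost[j] < rowmin) rowmin = cost[j];
        }
        for (int j = hi + 1; j <= n; ++j) cost[j] = FLT_MAX;
        if (rowmin >= bsf) return FLT_MAX;               /* early abandoning */
        for (int j = 0; j <= n; ++j) prev[j] = cost[j];
    }
    return prev[n];                                      /* d(n,n), cf. (1) */
}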
After that, the algorithm carries out the following loop as long as there are subsequences that have not been pruned during the search. At first, lower bounding is performed, and the bitmap matrix is calculated in parallel. Next, promising candidates are added to the candidate matrix in serial mode. Then, the DTW distance measure between the query and each subsequence of the candidate matrix is calculated in parallel, and the minimal distance is found. By the end of the loop, we output the index of the subsequence with the minimal distance measure. Below, we describe these steps in detail.

Calculation of LBs. Figure 2 depicts the pseudo-code for the calculation of lower bounds.

Algorithm CALCLOWERBOUNDS
Input:
  S_T^n  subsequence matrix
  Q  query subsequence
  r  warping constraint
  p  number of threads employed
Output:
  L_T^n  LB-matrix
 1: #pragma omp parallel for num_threads(p)
 2: for i from 1 to N do
 3:   ZNORMALIZE(S_T^n(i, ·))
 4:   L_T^n(i, 1) ← LBKIMFL(Q, S_T^n(i, ·))
 5:   L_T^n(i, 2) ← LBKEOGHEC(Q, S_T^n(i, ·))
 6:   CALCENVELOPE(S_T^n(i, ·), r, U, L)
 7:   L_T^n(i, 3) ← LBKEOGHEQ(S_T^n(i, ·), Q, U, L)

Figure 2  Calculation of lower bounds

Strictly speaking, this step brings redundant calculations into our algorithm. In contrast, UCR-DTW calculates the next LB in the cascade only if the current subsequence has not yet been recognized as clearly dissimilar by the previous LB. As opposed to UCR-DTW, we calculate all the LBs and the z-normalized versions for all the subsequences for the following reasons. Firstly, it is possible to perform such computations once, before the scanning of all the subsequences. Secondly, these computations are parallelizable in a simple way based on the data parallelism paradigm. Finally, being combined, these computations can be efficiently vectorized by the compiler.
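Under this scheme, the step of Figure 2 maps onto a single OpenMP parallel for over the rows of the subsequence matrix; a simplified C sketch follows (identifiers ours; z_normalize and lb_keogh_ec as sketched earlier, lb_kim_fl and lb_keogh_eq assumed analogous).

#include <float.h>

/* Precompute the LB-matrix lb (N x 3) for all subsequences
   (cf. Figure 2). Rows are independent, so the loop is embarrassingly
   parallel; the loops inside the helpers are left to the auto-vectorizer. */
void calc_lower_bounds(float *s, long N, long row_len, int n,
                       const float *q, const float *u, const float *l,
                       int r, int p, float *lb) {
    #pragma omp parallel for num_threads(p) schedule(static)
    for (long i = 0; i < N; ++i) {
        float *c = &s[i * row_len];
        z_normalize(c, n);
        lb[i * 3 + 0] = lb_kim_fl(q, c, n);
        lb[i * 3 + 1] = lb_keogh_ec(c, u, l, n, FLT_MAX); /* no early exit */
        lb[i * 3 + 2] = lb_keogh_eq(q, c, n, r); /* builds c's envelope inside */
    }
}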
Lower bounding. Figure 3 depicts the pseudo-code for lower bounding of subsequences. The algorithm performs lower bounding by scanning the LB-matrix and calculating the respective rows of the bitmap matrix. In the next step of the algorithm, a subsequence corresponding to a row of the bitmap matrix where each element equals one will be added to the candidate matrix, in order to further calculate the DTW distance measure.

Algorithm LOWERBOUNDING
Input:
  L_T^n  LB-matrix
  bsf  best-so-far similarity distance
  p  number of threads employed
Output:
  B_T^n  bitmap matrix
 1: #pragma omp parallel num_threads(p)
 2: whoami ← omp_get_thread_num()
 3: for i from pos_whoami to whoami · ⌈N/p⌉ do
 4:   for j from 1 to lb_max do
 5:     B_T^n(i, j) ← L_T^n(i, j) < bsf

Figure 3  Lower bounding of subsequences

During the search, we perform many scans of the LB-matrix in parallel, as long as subsequences that are not clearly dissimilar exist. Parallel processing is based on the following technique. The LB-matrix is logically divided into p equal-sized segments, and each thread scans its own segment. In order to avoid scanning each segment from scratch, we establish the segment index as an array of p elements, where each element keeps the index of the most recent candidate subsequence in the respective segment, i.e. $SegmIndex = (pos_1, \dots, pos_p)$, where

$pos_i := \begin{cases} 0, & bsf = \infty \\ k:\ \lceil N/p \rceil \cdot (i-1) + 1 \le k \le \lceil N/p \rceil \cdot i \ \wedge\ \forall j,\ 1 \le j \le lb_{max},\ L_T^n(k, j) < bsf, & \text{otherwise} \end{cases}$  (19)

The candidate matrix filling. Figure 4 depicts the pseudo-code for the procedure of filling the candidate matrix.

Algorithm FILLCANDMATR
Input:
  S_T^n  subsequence matrix
  B_T^n  bitmap matrix
  p  number of threads employed
  s  segment size
Output:
  C_T^n  candidate matrix
Returns:
  number of subsequences added to the candidate matrix
 1: numcand ← 0
 2: for i from 1 to p do
 3:   for k from 1 to s do
 4:     if ⋀_{j=1}^{lb_max} B_T^n(pos_i + k, j) = 1 then
 5:       if numcand < s · p then
 6:         numcand ← numcand + 1
 7:         pos_i ← pos_i + k
 8:         C_T^n(numcand, ·) ← S_T^n(pos_i, ·)
 9:         idx_numcand ← (pos_i − 1) · n + 1
10:       else
11:         break
12:   if numcand = s · p then
13:     break
14: return numcand

Figure 4  Filling of the candidate matrix

The algorithm scans the bitmap matrix along the segments. We start scanning not from the beginning of a segment but from the respective segment's index, which stores the number of the most recent candidate subsequence in the segment. If a subsequence is promising, it is added to the candidate matrix.
In order to output the index of the best match subsequence, we establish the candidate subsequence index as an array of $s \cdot p$ elements, where each element keeps the starting position of a candidate subsequence in the input time series, i.e. $Index = (idx_1, \dots, idx_{s \cdot p})$, where

$idx_i := k:\ 1 \le k \le m - n + 1 \ \wedge\ (S_T^n(i, \cdot) \text{ represents } T_{i,n} \iff k = (i - 1) \cdot n + 1)$  (20)

Processing of the candidate matrix. Figure 5 depicts the pseudo-code for calculating the DTW distance measure of the candidate subsequences.

Algorithm CALCCANDMATR
Input:
  C_T^n  candidate matrix
  numcand  number of candidate subsequences
  Q  query subsequence
  r  warping constraint
  p  number of threads employed
Output:
  bsf  similarity of the best-so-far subsequence
Returns:
  index of the best-so-far subsequence
 1: #pragma omp parallel for num_threads(p)
 2:   shared(bsf, idx) private(distance)
 3: for i from 1 to numcand do
 4:   distance ← UCRDTW(C_T^n(i, ·), Q, r, bsf)
 5:   #pragma omp critical
 6:   if bsf > distance then
 7:     bsf ← distance
 8:     bestmatch ← idx_i
 9: return bestmatch

Figure 5  Processing of the candidate matrix

The algorithm performs as follows. For each row of the candidate matrix, we calculate the DTW distance measure between the respective candidate and the query by means of the UCR-DTW algorithm. If this distance is less than bsf, then bsf is updated. The loop is parallelized by means of the OpenMP pragma, where bsf is indicated as a variable shared across all threads, while the distance variable is indicated as private for each thread. In order to correctly update the shared variable, we use the pragma with a critical section.
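A C sketch of this update pattern is given below (ours, simplified): the distance is computed outside any lock, a cheap unsynchronized test filters out most candidates, and the critical section re-checks before writing, so the shared bsf and bestmatch stay consistent.

/* Parallel processing of the candidate matrix (cf. Figure 5).
   idx[i] is the starting position of candidate i in the time series. */
long calc_cand_matr(const float *cand, long numcand, long row_len,
                    const float *q, int n, int r, int p,
                    const long *idx, float *bsf) {
    long bestmatch = -1;
    #pragma omp parallel for num_threads(p)
    for (long i = 0; i < numcand; ++i) {
        float distance = dtw_banded(q, &cand[i * row_len], n, r, *bsf);
        if (distance < *bsf) {         /* unsynchronized pre-check */
            #pragma omp critical
            if (distance < *bsf) {     /* re-check under the lock */
                *bsf = distance;
                bestmatch = idx[i];
            }
        }
    }
    return bestmatch;
}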
5 Experiments

5.1 Experimental setup

Objectives. In the experiments, we compared the performance of our algorithm with UCR-DTW. We also evaluated the scalability of our algorithm on Intel Xeon Phi for different datasets. We measured the run time (after deduction of I/O time) and calculated the algorithm's speedup and parallel efficiency. Here we understand these characteristics of parallel algorithm scalability as follows. The speedup and parallel efficiency of a parallel algorithm employing $p$ threads are calculated, respectively, as $s(p) = t_1 / t_p$ and $e(p) = s(p) / p$, where $t_1$ and $t_p$ are the run times of the algorithm when one and $p$ threads are employed, respectively.

Hardware. We performed our experiments on a node of the Tornado SUSU supercomputer [8] with the characteristics summarized in Table 1.

Table 1  Specifications of hardware

Specification               | Xeon Phi | 2×Xeon CPU
Model, Intel                | SE10X    | X5680
# physical cores            | 61       | 2×6
Hyper-threading             | 4×       | 2×
# logical cores             | 244      | 24
Frequency, GHz              | 1.1      | 3.33
VPU width, bit              | 512      | 128
Peak performance, TFLOPS    | 1.076    | 0.371
RAM, Gb                     | 8        | 8

Datasets. In the experiments, we used the following datasets (cf. Table 2). Random Walk is a dataset that was synthetically generated according to the model of the same name [10] and used in [11]. The EPG dataset [14] is a set of signals from the so-called Electrical Penetration Graph, reflecting the behavior of the Aster leafhopper (Macrosteles quadrilineatus). A critical task that researchers perform is to search for patterns in such time series, because in US agriculture, for one state and one crop, this insect causes losses of more than two million dollars a year [14].

Table 2  Specifications of datasets

Dataset     | Type      | m       | n
Random Walk | Synthetic | 10^6    | 128
EPG         | Real      | 2.5·10^5 | 360

5.2 Results and discussion

Figure 6 depicts the performance of PHIBESTMATCH in comparison with the UCR-DTW algorithm. As we can see, our algorithm is two times faster for both the EPG and Random Walk datasets when UCR-DTW runs on one CPU core and PHIBESTMATCH runs on 240 cores of Intel Xeon Phi. Being run on Intel Xeon Phi, UCR-DTW is obviously slower than PHIBESTMATCH (from 10 to 15 times).

Figure 6  Algorithm's performance: a) Random Walk dataset, b) EPG dataset

Figures 7 and 8 depict the experimental results on the Random Walk dataset and the EPG dataset, respectively. On Random Walk data, both speedup and parallel efficiency are linear when the number of threads matches the number of physical cores the algorithm is running on. However, when more than one thread per physical core is used, speedup becomes sub-linear, and efficiency decreases accordingly. That is, speedup stops increasing at about 80× when 120 threads are employed, and efficiency drops from 60 percent for 120 threads to 30 percent for 240 threads.

Figure 7  Algorithm's scalability on the Random Walk dataset: a) Speedup, b) Parallel efficiency

On EPG data, the algorithm shows closer-to-linear speedup and efficiency (up to 50× and at least 80 percent, respectively) if the number of threads employed is up to the number of physical cores. When more than one thread per physical core is used, speedup slowly increases up to 78×, and efficiency drops accordingly (similarly to the picture for the Random Walk dataset).

Figure 8  Algorithm's scalability on the EPG dataset: a) Speedup, b) Parallel efficiency

We can conclude that the proposed algorithm demonstrates closer-to-linear scalability when the number of threads it runs on is up to the number of physical cores of the Intel Xeon Phi many-core processor. However, when more than one thread per physical core is used, speedup and parallel efficiency decrease significantly.

There are two possible reasons for this. Firstly, our algorithm is not completely parallel: the candidate matrix filling is its serial part, which limits speedup. Secondly, by their nature, DTW calculations (cf. Def. 2) can hardly ever be auto-vectorized. Thus, if during the (seamlessly auto-vectorizable) lower bounding step many subsequences have not been pruned as clearly dissimilar, then they will be processed by many threads in the DTW calculation step, but without the auto-vectorization that might be expected.

6 Conclusion

In this paper, we addressed the problem of accelerating subsequence similarity search on the modern Intel Xeon Phi system of the Knights Landing (KNL) generation. Phi KNL is an independent bootable device, which provides up to 72 compute cores with a high local memory bandwidth and 512-bit wide vector processing units. Being based on the x86 architecture, the Intel Phi KNL many-core processor supports thread-level parallelism and the same programming tools as a regular Intel Xeon CPU, and serves as an attractive alternative to FPGA and GPU. We consider the case when the time series involved in the computations fit in main memory.

We developed a novel parallel algorithm of subsequence similarity search for Intel Xeon Phi KNL, called PHIBESTMATCH. Our algorithm is based on UCR-DTW [11], which is the fastest serial algorithm of subsequence similarity search since it integrates cascading lower bounds and many other algorithmic speedup techniques. PHIBESTMATCH efficiently exploits the vectorization capabilities of Phi KNL by means of a sophisticated data layout and computational scheme.

We performed experiments on synthetic and real-world datasets, which showed the following. PHIBESTMATCH being run on Intel Xeon Phi is two times faster than UCR-DTW being run on Intel Xeon. The proposed algorithm demonstrates closer-to-linear speedup and parallel efficiency when the number of threads it runs on is up to the number of physical cores of Intel Xeon Phi.

In further research, we plan to develop our approach in the following directions: advance the parallelization of the UCR-DTW algorithm for Intel Xeon Phi KNL, and extend our algorithm to computer cluster systems with nodes equipped with Intel Xeon Phi KNL.
Acknowledgments. This work was financially supported by the Russian Foundation for Basic Research (grant No. 17-07-00463), by Act 211 of the Government of the Russian Federation (contract No. 02.A03.21.0011), and by the Ministry of Education and Science of the Russian Federation (government order 2.7905.2017/8.9).

References

[1] Abdullaev, S.M., Zhelnin, A.A., Lenskaya, O.Y.: The structure of mesoscale convective systems in central Russia. Russian Meteorology and Hydrology 37(1), pp. 12-20 (2012)
[2] Bacon, D.F., Graham, S.L., Sharp, O.J.: Compiler transformations for high-performance computing. ACM Computing Surveys 26, pp. 345-420 (1994). doi: 10.1145/197405.197406
[3] Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time series. In: Fayyad, U.M., Uthurusamy, R. (eds.) KDD Workshop, pp. 359-370. AAAI Press (1994)
[4] Chrysos, G.: Intel Xeon Phi coprocessor (codename Knights Corner). In: 2012 IEEE Hot Chips 24th Symposium (HCS), pp. 1-31 (2012). doi: 10.1109/HOTCHIPS.2012.7476487
[5] Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.J.: Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1(2), pp. 1542-1552 (2008)
[6] Epishev, V., Isaev, A., Miniakhmetov, R. et al.: Physiological data mining system for elite sports. Bulletin of South Ural State University. Series: Computational Mathematics and Software Engineering 2(1), pp. 44-54 (2013) (in Russian). doi: 10.14529/cmse130105
[7] Keogh, E., Ratanamahatana, C.: Exact indexing of dynamic time warping. Knowledge and Information Systems 7(3), pp. 406-417 (2005)
[8] Kostenetskiy, P., Safonov, A.: SUSU supercomputer resources. In: Sokolinsky, L., Starodubov, I. (eds.) PCT'2016, pp. 561-573. CEUR-WS, vol. 1576 (2016)
[9] Miniakhmetov, R., Movchan, A., Zymbler, M.: Accelerating time series subsequence matching on the Intel Xeon Phi many-core coprocessor. In: Biljanovic, P., Butkovic, Z. et al. (eds.) MIPRO 2015, pp. 1399-1404 (2015). doi: 10.1109/MIPRO.2015.7160493
[10] Pearson, K.: The problem of the random walk. Nature 72(1865), 342 (1905). doi: 10.1038/072342a0
[11] Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: Yang, Q., Agarwal, D., Pei, J. (eds.) KDD 2012, pp. 262-270. ACM (2012). doi: 10.1145/2339530.2339576
[12] Ratanamahatana, C., Keogh, E.J.: Three myths about Dynamic Time Warping data mining. In: Kargupta, H., Srivastava, J., Kamath, C., Goodman, A. (eds.) SDM 2005, pp. 506-510 (2005). doi: 10.1137/1.9781611972757.50
[13] Sakurai, Y., Faloutsos, C., Yamamuro, M.: Stream monitoring under the time warping distance. In: Chirkova, R., Dogac, A., Tamer Ozsu, M., Sellis, T.K. (eds.) ICDE 2007, pp. 1046-1055. IEEE Computer Society (2007). doi: 10.1109/ICDE.2007.368963
[14] Sart, D., Mueen, A., Najjar, W.A., Keogh, E.J., Niennattrakul, V.: Accelerating dynamic time warping subsequence search with GPUs and FPGAs. In: Webb, G.I., Liu, B., Zhang, C., Gunopulos, D., Wu, X. (eds.) ICDM 2010, pp. 1001-1006. IEEE Computer Society (2010). doi: 10.1109/ICDM.2010.21
[15] Sodani, A.: Knights Landing (KNL): 2nd generation Intel Xeon Phi processor. In: 2015 IEEE Hot Chips 27th Symposium (HCS), pp. 1-24. IEEE Computer Society (2015)
[16] Sokolinskaya, I., Sokolinsky, L.B.: Scalability evaluation of NSLP algorithm for solving non-stationary linear programming problems on cluster computing systems. In: Voevodin, V., Sobolev, S. (eds.) RuSCDays 2017. CCIS, vol. 793, pp. 40-53. Springer, Heidelberg (2017). doi: 10.1007/978-3-319-71255-0_4
[17] Srikanthan, S., Kumar, A., Gupta, R.: Implementing the dynamic time warping algorithm in multithreaded environments for real time and unsupervised pattern discovery. In: IEEE ICCCT 2011, pp. 394-398. IEEE Computer Society (2011). doi: 10.1109/ICCCT.2011.6075111
[18] Takahashi, N., Yoshihisa, T., Sakurai, Y., Kanazawa, M.: A parallelized data stream processing system using Dynamic Time Warping distance. In: Barolli, L., Xhafa, F., Hsu, H.-H. (eds.) CISIS 2009, pp. 1100-1105 (2009). doi: 10.1109/CISIS.2009.77
[19] Wang, Z., Huang, S., Wang, L., Li, H., Wang, Y., Yang, H.: Accelerating subsequence similarity search based on dynamic time warping distance with FPGA. In: Hutchings, B.L., Betz, V. (eds.) ACM/SIGDA FPGA'13, pp. 53-62. ACM (2013). doi: 10.1145/2435264.2435277
[20] Zhang, Y., Adl, K., Glass, J.: Fast spoken query detection using lower-bound Dynamic Time Warping on Graphical Processing Units. In: ICASSP 2012, pp. 5173-5176 (2012). doi: 10.1109/ICASSP.2012.6289085
[21] Zymbler, M.: Best-match time series subsequence search on the Intel Many Integrated Core architecture. In: Morzy, T., Valduriez, P., Bellatreche, L. (eds.) ADBIS 2015. LNCS, vol. 9282, pp. 275-286. Springer, Heidelberg (2015). doi: 10.1007/978-3-319-23135-8_19