An Efficient Subsequence Similarity Search on Modern Intel Many-core Processors for Data Intensive Applications

© Yana Kraeva   © Mikhail Zymbler
South Ural State University (National Research University), Chelyabinsk, Russia
kraevaya@susu.ru   mzym@susu.ru

Abstract. Many time series analytical problems arising in a wide spectrum of data intensive applications require subsequence similarity search as a subtask. Currently, Dynamic Time Warping (DTW) is considered the best similarity measure in most domains. Since DTW is computationally expensive, parallel algorithms have been developed for FPGA, GPU, and Intel Xeon Phi. In our previous work, we accelerated subsequence similarity search on the Intel Xeon Phi Knights Corner generation by a CPU+Phi computational scheme. Such an approach needs significant changes for Phi's second-generation product, Knights Landing (KNL), which is an independent bootable device supporting only native applications. In this paper, we present a novel parallel algorithm for subsequence similarity search in time series for the Intel Xeon Phi KNL many-core processor. In order to efficiently exploit the vectorization capabilities of Phi KNL, the algorithm provides a sophisticated data layout and computational scheme. We performed experiments on synthetic and real-world datasets, which showed good scalability of the algorithm.

Keywords: time series, subsequence similarity search, dynamic time warping, parallel algorithm, OpenMP, vectorization, Intel Xeon Phi Knights Landing.

1 Introduction

Nowadays, time series are pervasive in a wide spectrum of applications with data intensive analytics, e.g. climate modelling [1], economic forecasting [16], medical monitoring [6], etc. Many time series analytical problems require subsequence similarity search as a subtask, which assumes that a query subsequence and a longer time series are given, and we are to find a subsequence of the time series whose similarity to the query is the maximum among all the subsequences.

At the present time, Dynamic Time Warping (DTW) [3] is considered the best time series subsequence similarity measure in most domains [5], since it allows the subsequence to have some stretching, shrinking, warping, or a different length in comparison to the query. Since computation of DTW is time-consuming ($O(n^2)$, where $n$ is the length of the query), speedup techniques have been proposed, including algorithmic developments (indexing methods [7], early abandoning strategies, embedding and computation reuse strategies [13], etc.) and parallel hardware-based solutions for FPGA [19], GPU [14], and Intel Xeon Phi [21].

In this paper, the Intel Xeon Phi many-core system is the subject of our efforts. The architecture of Phi provides a large number of compute cores with a high local memory bandwidth and 512-bit wide vector processing units. Being based on the Intel x86 architecture, Phi supports thread-level parallelism and the same programming tools as a regular Intel Xeon CPU, and serves as an attractive alternative to FPGA and GPU. Currently, Intel offers two generations of Phi products, namely Knights Corner (KNC) [4] and Knights Landing (KNL) [15]. The former is a coprocessor with up to 61 cores, which supports native applications as well as offloading of calculations from a host CPU. The latter provides up to 72 cores and, as opposed to its predecessor, is an independent bootable device, which runs applications only in native mode.

In our previous works [9] and [21], we accelerated subsequence similarity search on Phi KNC by means of the CPU+Phi computational scheme. Such an approach needs significant changes for Phi KNL because, with no host CPU, in addition to parallelization we have to efficiently vectorize computations in order to achieve high performance.

In this paper, we address accelerating subsequence similarity search on Phi KNL for the case when the time series involved in the computations fit in main memory. The paper makes the following basic contributions.
We developed a novel parallel algorithm of subsequence similarity search in time series for the Intel Xeon Phi KNL processor. The algorithm efficiently exploits the vectorization capabilities of Phi KNL by means of a sophisticated data layout and computational scheme. We performed experiments on real-world datasets, which showed good scalability of the algorithm.

The rest of the paper is organized as follows. Section 2 discusses related works. Section 3 gives formal notation and the statement of the problem. In Section 4, we present the proposed parallel algorithm. We describe the experimental evaluation of our algorithm in Section 5. Finally, in Section 6, we summarize the results obtained and propose directions for further research.

Proceedings of the XX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2018), Moscow, Russia, October 9-12, 2018

2 Related works

The problem of subsequence similarity search under the DTW measure has been extensively studied in the recent decade. Since computation of DTW is time-consuming, speedup techniques have been proposed. Algorithmic developments include indexing methods [7], early abandoning strategies, embedding and computation reuse strategies [13], etc. Although these techniques focus on reducing the number of calls to the DTW calculation procedure, computation of the DTW distance measure still takes up to 80 percent of the total run time of the similarity search [20]. For this reason, a number of parallel hardware-based solutions have been developed.

In [17], the authors proposed subsequence similarity search on a CPU cluster. Subsequences starting from different positions of the time series are sent to different nodes, and each node calculates DTW in the naïve way. In [18], the authors accelerated subsequence similarity search on an SMP system. They distribute different queries to different cores, and each subsequence is sent to different cores to be compared with different patterns in the naïve way. In both implementations, the data transfer becomes the bottleneck.

In [20], a GPU-based implementation was proposed. The warping matrix is generated in parallel, but the warping path is searched serially. Since the matrix generation step and the path search step are split into two kernels, this leads to overheads for storage and transmission of the warping matrix for each DTW calculation.

In [14], GPU and FPGA implementations of subsequence similarity search were presented. The GPU implementation is based on the same ideas as [20]. The FPGA system consists of two modules, namely Normalizer (z-normalization of subsequences) and Warper (DTW calculation), and is generated by a C-to-VHDL tool, which exploits the fine-grained parallelism of the DTW. However, this implementation suffers from a lack of flexibility, i.e. it must be recompiled if the length of the query is changed.

In [19], the authors proposed a framework for FPGA-based subsequence similarity search, which utilizes the data reusability of continuous DTW calculations to reduce the bandwidth and exploit the coarse-grain parallelism.

The aforementioned developments, however, cannot be directly applied to the x86-based many-core systems Intel Xeon Phi KNC [4] and Intel Xeon Phi KNL [15].

In our previous works [9] and [21], we accelerated subsequence similarity search on the Intel Xeon Phi KNC many-core coprocessor. A naïve approach, where all the subsequences of a time series are simply distributed across the threads of both CPU and Phi for calculation of the DTW distance measure, gave unsatisfactory performance and speedup because of insufficient workload of the coprocessor. We proposed the CPU+Phi computational scheme where the coprocessor is exploited only for DTW distance computations, whereas the CPU performs lower bounding and prepares subsequences for the coprocessor. The CPU maintains a queue of candidate subsequences and offloads it to the coprocessor to compute DTW. Such a scheme significantly outperformed the naïve approach.

This development, however, cannot be directly applied to Phi KNL since it is an independent bootable many-core processor and supports only native applications. Thus, our previous approach needs significant changes. Moreover, in order to achieve high performance of similarity search on Phi KNL, in addition to parallelization, we should provide a data layout and computational scheme that allow exploiting the vectorization capabilities of the many-core processor in the most efficient way.

3 Notation and problem background

3.1 Definitions and notations

Definition 1. A time series $T$ is a sequence of real-valued elements: $T = t_1, t_2, \dots, t_m$. The length of a time series $T$ is denoted by $|T|$.
Definition 2. Given two time series $X = x_1, x_2, \dots, x_m$ and $Y = y_1, y_2, \dots, y_m$, the Dynamic Time Warping (DTW) distance between $X$ and $Y$ is denoted by $DTW(X, Y)$ and defined as below. (Strictly speaking, DTW allows comparing two time series of different lengths; however, for the sake of simplicity, we assume time series of equal lengths, which is possible without losing generality [12].)

$DTW(X, Y) = d(m, m)$  (1)

$d(i, j) = (x_i - y_j)^2 + \min\{d(i-1, j),\ d(i, j-1),\ d(i-1, j-1)\}$  (2)

$d(0, 0) = 0; \quad d(i, 0) = d(0, j) = \infty; \quad i, j = 1, \dots, m$  (3)

In (2), the set of $d(i, j)$ is considered as an $m \times m$ warping matrix for the alignment of the two respective time series.

A warping path is a contiguous set of warping matrix elements that defines a mapping between the two time series. The warping path must start and finish in diagonally opposite corner cells of the warping matrix, the steps in the warping path are restricted to adjacent cells, and the points in the warping path must be monotonically spaced in time.

Definition 3. A subsequence $T_{i,k}$ of a time series $T$ is its contiguous subset of $k$ elements, which starts from position $i$: $T_{i,k} = t_i, t_{i+1}, \dots, t_{i+k-1}$, $1 \le i \le m - k + 1$.

Definition 4. Given a time series $T$ and a time series $Q$ as a user-specified query, where $m = |T| \gg |Q| = n$, the best matching subsequence $T_{i,n}$ meets the property

$\forall k\ DTW(Q, T_{i,n}) \le DTW(Q, T_{k,n}), \quad 1 \le i, k \le m - n + 1$  (4)

In what follows, where there is no ambiguity, we refer to a subsequence $T_{i,n}$ as $C$, a candidate in match to the query $Q$.

Definition 5. The z-normalization of a time series $T$ is defined as a time series $\hat{T} = \hat{t}_1, \hat{t}_2, \dots, \hat{t}_m$, where

$\hat{t}_i = \frac{t_i - \mu}{\sigma}$  (5)

$\mu = \frac{1}{m} \sum_{i=1}^{m} t_i$  (6)

$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} t_i^2 - \mu^2$  (7)
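To make Definition 5 concrete, the following minimal C sketch z-normalizes a series in place according to (5)-(7); it is an illustration under our own naming, not code from the implementation described below.

#include <math.h>

/* Z-normalize t[0..m-1] in place: subtract the mean (6) and
   divide by the standard deviation (7), cf. (5). */
void z_normalize(float *t, long m) {
    float sum = 0.0f, sqsum = 0.0f;
    for (long i = 0; i < m; ++i) {            /* one pass over the series */
        sum += t[i];
        sqsum += t[i] * t[i];
    }
    float mu = sum / m;                       /* (6) */
    float sigma = sqrtf(sqsum / m - mu * mu); /* (7) */
    for (long i = 0; i < m; ++i)
        t[i] = (t[i] - mu) / sigma;           /* (5) */
}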
Definition 6. The Euclidean distance (ED) between two (z-normalized) subsequences $Q$ and $C$, where $|Q| = |C| = n$, is defined as below.

$ED(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$  (8)

3.2 Serial algorithm

Currently, UCR-DTW [11] is the fastest serial algorithm of subsequence similarity search, integrating a large number of algorithmic speedup techniques. Since our algorithm is based on UCR-DTW, in this section we briefly describe its basic features that are essentially exploited in our approach.

Squared distances. Instead of using the square root in the DTW and ED distance calculations, it is possible to use the squares thereof. Since both functions are monotonic and concave, this does not change the relative rankings of subsequences. In what follows, where there is no ambiguity, we still write DTW and ED, assuming the squared versions of them.

Z-normalization. Both the query and each subsequence of the time series need to be z-normalized before the comparison. Z-normalization shifts and scales the time series such that the mean is zero and the standard deviation is one.

Cascading lower bounds. A lower bound (LB) is an easily computable underestimate of the DTW distance measure used to identify and prune clearly dissimilar subsequences [5]. Pruning is performed against a threshold, referred to as the best-so-far distance (or bsf for brevity), which UCR-DTW tries to improve (decrease) while scanning the time series. If a lower bound exceeds bsf, the DTW distance will exceed bsf as well, and the respective subsequence is assumed to be clearly dissimilar and is pruned without calculation of DTW. UCR-DTW exploits the following LBs, namely LB_KimFL [11], LB_KeoghEC and LB_KeoghEQ [7], and they are applied in a cascade.

The LB_KimFL lower bound uses the distances between the First (Last) pairs of points of $C$ and $Q$ as a lower bound, and is defined as below.

$LB_{KimFL}(Q, C) := ED(\hat{q}_1, \hat{c}_1) + ED(\hat{q}_n, \hat{c}_n)$  (9)

The LB_KeoghEC lower bound is the distance from a candidate subsequence to the closer of the two so-called envelopes of the query, and is defined as below.

$LB_{KeoghEC}(Q, C) := \sqrt{\sum_{i=1}^{n} \begin{cases} (\hat{c}_i - u_i)^2, & \text{if } \hat{c}_i > u_i \\ (\hat{c}_i - \ell_i)^2, & \text{if } \hat{c}_i < \ell_i \\ 0, & \text{otherwise} \end{cases}}$  (10)

In (10), the subsequences $U = u_1, \dots, u_n$ and $L = \ell_1, \dots, \ell_n$ are the upper envelope and lower envelope of the query, respectively, defined as below.

$u_i = \max_{i-r \le k \le i+r} \hat{q}_k$  (11)

$\ell_i = \min_{i-r \le k \le i+r} \hat{q}_k$  (12)

Here the parameter $r$ ($1 \le r \le n$) denotes the Sakoe–Chiba band constraint [5], which states that the warping path cannot deviate more than $r$ cells from the diagonal of the warping matrix.

The LB_KeoghEQ lower bound is the distance from the query to the closer of the two envelopes of a candidate subsequence. That is, as opposed to LB_KeoghEC, the roles of the query and the candidate subsequence are reversed:

$LB_{KeoghEQ}(Q, C) := LB_{KeoghEC}(C, Q)$  (13)
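For illustration, a minimal C sketch of LB_KeoghEC per (10) is given below; it assumes the query envelopes u and l from (11)-(12) are precomputed, returns the squared value (cf. the squared-distances convention above), and abandons the summation early once it exceeds the best-so-far threshold, as the serial cascade allows. Identifiers are ours.

/* Squared LB_KeoghEC per (10): distance from candidate c to the
   closer of the query envelopes u (upper) and l (lower).
   Stops early once the partial sum already exceeds bsf. */
float lb_keogh_ec(const float *c, const float *u, const float *l,
                  int n, float bsf) {
    float sum = 0.0f;
    for (int i = 0; i < n && sum < bsf; ++i) {
        float d = 0.0f;
        if (c[i] > u[i]) d = c[i] - u[i];        /* above the upper envelope */
        else if (c[i] < l[i]) d = c[i] - l[i];   /* below the lower envelope */
        sum += d * d;
    }
    return sum;
}

In the parallel version of Sect. 4, the early exit would be dropped: a straight-line loop over all n (padded) elements is what the compiler can auto-vectorize.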
Finally, UCR-DTW performs as follows. Firstly, the z-normalized version of the query and its envelopes are calculated, and bsf is assumed to be equal to infinity. Then the algorithm scans the input time series, applying the cascade of LBs to the current subsequence. If the subsequence is not pruned, the DTW distance is calculated. Next, bsf is updated if it is greater than the value of the DTW distance calculated above. By doing so, in the end, UCR-DTW finds the best matching subsequence of the given time series.

4 Method

In this section, we present a novel computational scheme and data layout, which allow efficient parallelization and vectorization of subsequence similarity search on Intel Xeon Phi KNL.

Vectorization plays the key role in achieving high performance of computations on parallel architectures [2]. This means a compiler's ability to transform loops into sequences of vector operations that utilize the VPUs. Thus, in order to provide high performance of subsequence similarity search on Phi KNL, we should organize computations with as many auto-vectorizable loops as possible.

However, many auto-vectorizable loops are not enough for the overall good performance of an algorithm. Unaligned memory access is one of the basic factors that can cause inefficient vectorization, due to the timing overhead of loop peeling [2]. If the start address of the processed data is not aligned by the VPU width (i.e. by the number of floats that can be loaded into the VPU), the loop is split into three parts by the compiler. The first part of iterations, which accesses memory from the start address to the first aligned address, is peeled off and vectorized separately. The last part of iterations, from the last aligned address to the end address, is split off and vectorized separately as well, while the middle, aligned part forms the main vectorized loop.

According to this argumentation, we propose a data layout which provides aligned access to subsequences of the time series. Applying this data layout, we also propose the respective computational scheme, where as many computations as possible are implemented as auto-vectorizable for-loops.

4.1 Data layout

Firstly, we provide alignment of the processed subsequences.

Definition 7. Given a subsequence $C$ and VPU width $w$, we denote the pad length as $pad = w - (n \bmod w)$ and define the alignment of $C$ as a subsequence $\tilde{C}$ as below.

$\tilde{C} := \begin{cases} c_1, c_2, \dots, c_n, \underbrace{0, 0, \dots, 0}_{pad}, & \text{if } n \bmod w > 0 \\ C, & \text{otherwise} \end{cases}$  (14)

According to Def. 2, $\forall Q, C: |Q| = |C|$, $DTW(Q, C) = DTW(\tilde{Q}, \tilde{C})$. Thus, in what follows, we still write $Q$ and $C$ for the query and a subsequence of the input time series, assuming, however, the aligned versions thereof. Alignment of subsequences allows us to avoid the timing overhead of loop peeling.

Next, we store all the (aligned) subsequences of a time series as a matrix in order to provide auto-vectorization of computations.

Definition 8. Let us denote the number of all the subsequences of length $n$ of a time series $T$ as $N$, $N = |T| - n + 1 = m - n + 1$. We define the subsequence matrix (i.e. the matrix of all aligned subsequences of length $n$ from a time series $T$), $S_T^n \in \mathbb{R}^{N \times (n + pad)}$, as below.

$S_T^n(i, j) := \tilde{t}_{i+j-1}$  (15)
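A possible way to allocate and fill such a matrix is sketched below in C (identifiers ours): with w floats per vector register (w = 16 for a 512-bit VPU and single precision), every row occupies n + pad floats and starts at a 64-byte-aligned address, so loops over a row need no peeling.

#include <stdlib.h>
#include <string.h>

/* Build the subsequence matrix S_T^n (Def. 8): N rows of n + pad
   floats, zero-padded per (14), each row 64-byte aligned. */
float *build_subseq_matrix(const float *t, long m, long n, long *row_len) {
    const long w = 16;                              /* floats per 512-bit VPU */
    const long pad = (n % w == 0) ? 0 : w - n % w;  /* Def. 7 */
    const long len = n + pad;                       /* row length, multiple of w */
    const long N = m - n + 1;
    float *s = aligned_alloc(64, (size_t)(N * len) * sizeof(float));
    for (long i = 0; i < N; ++i) {
        memcpy(&s[i * len], &t[i], (size_t)n * sizeof(float));   /* (15) */
        memset(&s[i * len + n], 0, (size_t)pad * sizeof(float)); /* zero pad */
    }
    *row_len = len;
    return s;
}

The price of this layout is redundancy: each element of T is stored in up to n rows, which is consistent with the paper's assumption that the data fit in main memory.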
We also establish two more matrices regarding lower bounding of all subsequences of the input time series. The first matrix stores, for each subsequence, the values of all the LBs in the cascade (cf. Sect. 3.2). The second one is a bitmap matrix, which stores, for each subsequence, the result of comparison of every LB with bsf.

Definition 9. Let us denote the number of LBs exploited by the subsequence similarity search algorithm as $lb_{max}$ and enumerate them according to their order in the lower bounding cascade. Given a time series $T$, we define the LB-matrix of all subsequences of length $n$ from $T$, $L_T^n \in \mathbb{R}^{N \times lb_{max}}$, as below.

$L_T^n(i, j) := LB_j(T_{i,n})$  (16)

Definition 10. Given a time series $T$, we define the bitmap matrix of all subsequences of length $n$ from $T$, $B_T^n \in \mathbb{B}^{N \times lb_{max}}$, as below.

$B_T^n(i, j) := L_T^n(i, j) < bsf$  (17)

Finally, we establish a matrix to store candidate subsequences, i.e. those subsequences from the $S_T^n$ matrix which have not been pruned after the lower bounding. This matrix will be processed in parallel by calculating the DTW distance measure between each row (representing a subsequence) and the query.

Definition 11. Let us denote the number of threads employed by the parallel algorithm as $p$ ($p \ge 1$). Given the segment size $s \in \mathbb{Z}_+$ ($s \le \lceil N/p \rceil$), we define the candidate matrix $C_T^n \in \mathbb{R}^{(s \cdot p) \times (n + pad)}$ as below.

$C_T^n(i, \cdot) := S_T^n(k, \cdot):\ \forall j,\ 1 \le j \le lb_{max},\ B_T^n(k, j) = 1$  (18)

4.2 Computational scheme

We are finally in a position to introduce our approach. Figure 1 depicts the overall scheme of the algorithm.

Algorithm PHIBESTMATCH
Input:
  T  time series to search
  Q  query subsequence
  r  warping constraint
  p  number of threads employed
  s  segment size
Output:
  bsf  similarity of the best match subsequence
Returns:
  index of the best match subsequence
 1: CALCENVELOPE(Q, r, U, L)
 2: CALCLOWERBOUNDS(S_T^n, Q, r, L_T^n)
 3: bsf ← UCRDTW(T_{1,n}, Q, r, ∞)
 4: numcand ← N
 5: while numcand > 0 do
 6:   LOWERBOUNDING(L_T^n, bsf, B_T^n)
 7:   numcand ← FILLCANDMATR(S_T^n, B_T^n, C_T^n, p, s)
 8:   if numcand > 0 then
 9:     bestmatch ← CALCCANDMATR(C_T^n, numcand, r, p, bsf)
10: return bestmatch

Figure 1  Overall computational scheme

The algorithm performs as follows. Firstly, the lower envelope and the upper envelope of the query are calculated, and the query is z-normalized. Then, for each subsequence in the subsequence matrix, all the LBs of the lower bounding cascade are preliminarily calculated (as well as the z-normalized version of each subsequence). Additionally, in order to start the search, the algorithm initializes bsf by calculating the DTW distance measure between the query and the first subsequence.
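The UCRDTW kernel invoked in line 3 (and later, for each candidate) is the UCR-DTW distance computation [11]; a simplified stand-in is sketched below in C: a squared DTW per (1)-(3) restricted to the Sakoe-Chiba band of width r, abandoning the computation once the minimum over the current row of the band already exceeds bsf. The real kernel includes further optimizations; identifiers are ours.

#include <float.h>

/* Squared DTW between q and c of length n within the Sakoe-Chiba
   band r, computed row by row with two linear buffers; returns
   FLT_MAX as soon as the whole row of the band exceeds bsf. */
float dtw_banded(const float *q, const float *c, int n, int r, float bsf) {
    float prev[n + 1], cost[n + 1];               /* C99 VLAs for brevity */
    for (int j = 1; j <= n; ++j) prev[j] = FLT_MAX;
    prev[0] = 0.0f;                               /* d(0,0) = 0, cf. (3) */
    for (int i = 1; i <= n; ++i) {
        int lo = i - r < 1 ? 1 : i - r;
        int hi = i + r > n ? n : i + r;
        float rowmin = FLT_MAX;
        for (int j = 0; j < lo; ++j) cost[j] = FLT_MAX;  /* outside the band */
        for (int j = lo; j <= hi; ++j) {
            float best = prev[j];                        /* d(i-1, j)   */
            if (cost[j - 1] < best) best = cost[j - 1];  /* d(i, j-1)   */
            if (prev[j - 1] < best) best = prev[j - 1];  /* d(i-1, j-1) */
            float d = q[i - 1] - c[j - 1];
            cost[j] = d * d + best;                      /* (2), squared */
            if (cost[j] < rowmin) rowmin = cost[j];
        }
        for (int j = hi + 1; j <= n; ++j) cost[j] = FLT_MAX;
        if (rowmin >= bsf) return FLT_MAX;               /* early abandoning */
        for (int j = 0; j <= n; ++j) prev[j] = cost[j];
    }
    return prev[n];                                      /* d(n,n), cf. (1) */
}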
After that, the algorithm carries out the following loop as long as there are subsequences that have not been pruned during the search. At first, lower bounding is performed, and the bitmap matrix is calculated in parallel. Next, promising candidates are added to the candidate matrix in serial mode. Then, the DTW distance measure between the query and each subsequence of the candidate matrix is calculated in parallel, and the minimal distance is found. By the end of the loop, we output the index of the subsequence with the minimal distance measure. Below, we describe these steps in detail.

Calculation of LBs. Figure 2 depicts the pseudo-code for the calculation of lower bounds.

Algorithm CALCLOWERBOUNDS
Input:
  S_T^n  subsequence matrix
  Q  query subsequence
  r  warping constraint
  p  number of threads employed
Output:
  L_T^n  LB-matrix
 1: #pragma omp parallel for num_threads(p)
 2: for i from 1 to N do
 3:   ZNORMALIZE(S_T^n(i, ·))
 4:   L_T^n(i, 1) ← LBKIMFL(Q, S_T^n(i, ·))
 5:   L_T^n(i, 2) ← LBKEOGHEC(Q, S_T^n(i, ·))
 6:   CALCENVELOPE(S_T^n(i, ·), r, U, L)
 7:   L_T^n(i, 3) ← LBKEOGHEQ(S_T^n(i, ·), Q, U, L)

Figure 2  Calculation of lower bounds

Strictly speaking, this step brings redundant calculations into our algorithm. In contrast, UCR-DTW calculates the next LB in the cascade only if the current subsequence has not yet been recognized as clearly dissimilar by the previous LB. As opposed to UCR-DTW, we calculate all the LBs and the z-normalized versions for all the subsequences for the following reasons. Firstly, it is possible to perform such computations once, before the scanning of all the subsequences. Secondly, these computations are parallelizable in a simple way based on the data parallelism paradigm. Finally, being combined, these computations can be efficiently vectorized by the compiler.
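Under this scheme, the step of Figure 2 maps onto a single OpenMP parallel for over the rows of the subsequence matrix; a simplified C sketch follows (identifiers ours; z_normalize and lb_keogh_ec as sketched earlier, lb_kim_fl and lb_keogh_eq assumed analogous).

#include <float.h>

/* Precompute the LB-matrix lb (N x 3) for all subsequences
   (cf. Figure 2). Rows are independent, so the loop is embarrassingly
   parallel; the loops inside the helpers are left to the auto-vectorizer. */
void calc_lower_bounds(float *s, long N, long row_len, int n,
                       const float *q, const float *u, const float *l,
                       int r, int p, float *lb) {
    #pragma omp parallel for num_threads(p) schedule(static)
    for (long i = 0; i < N; ++i) {
        float *c = &s[i * row_len];
        z_normalize(c, n);
        lb[i * 3 + 0] = lb_kim_fl(q, c, n);
        lb[i * 3 + 1] = lb_keogh_ec(c, u, l, n, FLT_MAX); /* no early exit */
        lb[i * 3 + 2] = lb_keogh_eq(q, c, n, r); /* builds c's envelope inside */
    }
}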
Lower bounding. Figure 3 depicts the pseudo-code for lower bounding of subsequences. The algorithm performs lower bounding by scanning the LB-matrix and calculating the respective rows of the bitmap matrix. In the next step of the algorithm, a subsequence corresponding to a row of the bitmap matrix where each element equals one will be added to the candidate matrix, in order to further calculate the DTW distance measure.

Algorithm LOWERBOUNDING
Input:
  L_T^n  LB-matrix
  bsf  best-so-far similarity distance
  p  number of threads employed
Output:
  B_T^n  bitmap matrix
 1: #pragma omp parallel num_threads(p)
 2: whoami ← omp_get_thread_num()
 3: for i from pos_whoami to whoami · ⌈N/p⌉ do
 4:   for j from 1 to lb_max do
 5:     B_T^n(i, j) ← L_T^n(i, j) < bsf

Figure 3  Lower bounding of subsequences

During the search, we perform many scans of the LB-matrix in parallel, as long as subsequences that are not clearly dissimilar exist. Parallel processing is based on the following technique. The LB-matrix is logically divided into p equal-sized segments, and each thread scans its own segment. In order to avoid scanning each segment from scratch, we establish the segment index as an array of p elements, where each element keeps the index of the most recent candidate subsequence in the respective segment, i.e. $SegmIndex = (pos_1, \dots, pos_p)$, where

$pos_i := \begin{cases} 0, & bsf = \infty \\ k:\ \lceil N/p \rceil \cdot (i-1) + 1 \le k \le \lceil N/p \rceil \cdot i \ \wedge\ \forall j,\ 1 \le j \le lb_{max},\ L_T^n(k, j) < bsf, & \text{otherwise} \end{cases}$  (19)

The candidate matrix filling. Figure 4 depicts the pseudo-code for the procedure of filling the candidate matrix.

Algorithm FILLCANDMATR
Input:
  S_T^n  subsequence matrix
  B_T^n  bitmap matrix
  p  number of threads employed
  s  segment size
Output:
  C_T^n  candidate matrix
Returns:
  number of subsequences added to the candidate matrix
 1: numcand ← 0
 2: for i from 1 to p do
 3:   for k from 1 to s do
 4:     if ⋀_{j=1}^{lb_max} B_T^n(pos_i + k, j) = 1 then
 5:       if numcand < s · p then
 6:         numcand ← numcand + 1
 7:         pos_i ← pos_i + k
 8:         C_T^n(numcand, ·) ← S_T^n(pos_i, ·)
 9:         idx_numcand ← (pos_i − 1) · n + 1
10:       else
11:         break
12:   if numcand = s · p then
13:     break
14: return numcand

Figure 4  Filling of the candidate matrix

The algorithm scans the bitmap matrix along the segments. We start scanning not from the beginning of a segment but from the respective segment's index, which stores the number of the most recent candidate subsequence in the segment. If a subsequence is promising, it is added to the candidate matrix.
In order to output the index of the best match subsequence, we establish the candidate subsequence index as an array of $s \cdot p$ elements, where each element keeps the starting position of a candidate subsequence in the input time series, i.e. $Index = (idx_1, \dots, idx_{s \cdot p})$, where

$idx_i := k:\ 1 \le k \le m - n + 1 \ \wedge\ (S_T^n(i, \cdot) \text{ represents } T_{i,n} \iff k = (i - 1) \cdot n + 1)$  (20)

Processing of the candidate matrix. Figure 5 depicts the pseudo-code for calculating the DTW distance measure of the candidate subsequences.

Algorithm CALCCANDMATR
Input:
  C_T^n  candidate matrix
  numcand  number of candidate subsequences
  Q  query subsequence
  r  warping constraint
  p  number of threads employed
Output:
  bsf  similarity of the best-so-far subsequence
Returns:
  index of the best-so-far subsequence
 1: #pragma omp parallel for num_threads(p)
 2:   shared(bsf, idx) private(distance)
 3: for i from 1 to numcand do
 4:   distance ← UCRDTW(C_T^n(i, ·), Q, r, bsf)
 5:   #pragma omp critical
 6:   if bsf > distance then
 7:     bsf ← distance
 8:     bestmatch ← idx_i
 9: return bestmatch

Figure 5  Processing of the candidate matrix

The algorithm performs as follows. For each row of the candidate matrix, we calculate the DTW distance measure between the respective candidate and the query by means of the UCR-DTW algorithm. If this distance is less than bsf, then bsf is updated. The loop is parallelized by means of the OpenMP pragma, where bsf is indicated as a variable shared across all threads, while the distance variable is indicated as private for each thread. In order to correctly update the shared variable, we use the pragma with a critical section.
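A C sketch of this update pattern is given below (ours, simplified): the distance is computed outside any lock, a cheap unsynchronized test filters out most candidates, and the critical section re-checks before writing, so the shared bsf and bestmatch stay consistent.

/* Parallel processing of the candidate matrix (cf. Figure 5).
   idx[i] is the starting position of candidate i in the time series. */
long calc_cand_matr(const float *cand, long numcand, long row_len,
                    const float *q, int n, int r, int p,
                    const long *idx, float *bsf) {
    long bestmatch = -1;
    #pragma omp parallel for num_threads(p)
    for (long i = 0; i < numcand; ++i) {
        float distance = dtw_banded(q, &cand[i * row_len], n, r, *bsf);
        if (distance < *bsf) {         /* unsynchronized pre-check */
            #pragma omp critical
            if (distance < *bsf) {     /* re-check under the lock */
                *bsf = distance;
                bestmatch = idx[i];
            }
        }
    }
    return bestmatch;
}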
5 Experiments

5.1 Experimental setup

Objectives. In the experiments, we compared the performance of our algorithm with UCR-DTW. We also evaluated the scalability of our algorithm on Intel Xeon Phi for different datasets. We measured the run time (after deduction of I/O time) and calculated the algorithm's speedup and parallel efficiency. Here we understand these characteristics of parallel algorithm scalability as follows. The speedup and parallel efficiency of a parallel algorithm employing $p$ threads are calculated, respectively, as $s(p) = t_1 / t_p$ and $e(p) = s(p) / p$, where $t_1$ and $t_p$ are the run times of the algorithm when one and $p$ threads are employed, respectively.

Hardware. We performed our experiments on a node of the Tornado SUSU supercomputer [8] with the characteristics summarized in Table 1.

Table 1  Specifications of hardware

Specification               | Xeon Phi | 2×Xeon CPU
Model, Intel                | SE10X    | X5680
# physical cores            | 61       | 2×6
Hyper-threading             | 4×       | 2×
# logical cores             | 244      | 24
Frequency, GHz              | 1.1      | 3.33
VPU width, bit              | 512      | 128
Peak performance, TFLOPS    | 1.076    | 0.371
RAM, Gb                     | 8        | 8

Datasets. In the experiments, we used the following datasets (cf. Table 2). Random Walk is a dataset that was synthetically generated according to the model of the same name [10] and used in [11]. The EPG dataset [14] is a set of signals from the so-called Electrical Penetration Graph, reflecting the behavior of the Aster leafhopper (Macrosteles quadrilineatus). A critical task that researchers perform is to search for patterns in such time series, because in US agriculture, for one state and one crop, this insect causes losses of more than two million dollars a year [14].

Table 2  Specifications of datasets

Dataset     | Type      | m       | n
Random Walk | Synthetic | 10^6    | 128
EPG         | Real      | 2.5·10^5 | 360

5.2 Results and discussion

Figure 6 depicts the performance of PHIBESTMATCH in comparison with the UCR-DTW algorithm. As we can see, our algorithm is two times faster for both the EPG and Random Walk datasets when UCR-DTW runs on one CPU core and PHIBESTMATCH runs on 240 cores of Intel Xeon Phi. Being run on Intel Xeon Phi, UCR-DTW is obviously slower than PHIBESTMATCH (from 10 to 15 times).

Figure 6  Algorithm's performance: a) Random Walk dataset, b) EPG dataset

Figures 7 and 8 depict the experimental results on the Random Walk dataset and the EPG dataset, respectively. On Random Walk data, both speedup and parallel efficiency are linear when the number of threads matches the number of physical cores the algorithm is running on. However, when more than one thread per physical core is used, speedup becomes sub-linear, and efficiency decreases accordingly. That is, speedup stops increasing at about 80× when 120 threads are employed, and efficiency drops from 60 percent for 120 threads to 30 percent for 240 threads.

Figure 7  Algorithm's scalability on the Random Walk dataset: a) Speedup, b) Parallel efficiency

On EPG data, the algorithm shows closer-to-linear speedup and efficiency (up to 50× and at least 80 percent, respectively) if the number of threads employed is up to the number of physical cores. When more than one thread per physical core is used, speedup slowly increases up to 78×, and efficiency drops accordingly (similarly to the picture for the Random Walk dataset).

Figure 8  Algorithm's scalability on the EPG dataset: a) Speedup, b) Parallel efficiency

We can conclude that the proposed algorithm demonstrates closer-to-linear scalability when the number of threads it runs on is up to the number of physical cores of the Intel Xeon Phi many-core processor. However, when more than one thread per physical core is used, speedup and parallel efficiency decrease significantly.

There are two possible reasons for this. Firstly, our algorithm is not completely parallel: the candidate matrix filling is its serial part, which limits speedup. Secondly, by their nature, DTW calculations (cf. Def. 2) can hardly ever be auto-vectorized. Thus, if during the (seamlessly auto-vectorizable) lower bounding step many subsequences have not been pruned as clearly dissimilar, then they will be processed by many threads in the DTW calculation step, but without the auto-vectorization that might be expected.

6 Conclusion

In this paper, we addressed the problem of accelerating subsequence similarity search on the modern Intel Xeon Phi system of the Knights Landing (KNL) generation. Phi KNL is an independent bootable device, which provides up to 72 compute cores with a high local memory bandwidth and 512-bit wide vector processing units. Being based on the x86 architecture, the Intel Phi KNL many-core processor supports thread-level parallelism and the same programming tools as a regular Intel Xeon CPU, and serves as an attractive alternative to FPGA and GPU. We consider the case when the time series involved in the computations fit in main memory.

We developed a novel parallel algorithm of subsequence similarity search for Intel Xeon Phi KNL, called PHIBESTMATCH. Our algorithm is based on UCR-DTW [11], which is the fastest serial algorithm of subsequence similarity search since it integrates cascading lower bounds and many other algorithmic speedup techniques. PHIBESTMATCH efficiently exploits the vectorization capabilities of Phi KNL by means of a sophisticated data layout and computational scheme.

We performed experiments on synthetic and real-world datasets, which showed the following. PHIBESTMATCH being run on Intel Xeon Phi is two times faster than UCR-DTW being run on Intel Xeon. The proposed algorithm demonstrates closer-to-linear speedup and parallel efficiency when the number of threads it runs on is up to the number of physical cores of Intel Xeon Phi.

In further research, we plan to develop our approach in the following directions: advance the parallelization of the UCR-DTW algorithm for Intel Xeon Phi KNL, and extend our algorithm to computer cluster systems with nodes equipped with Intel Xeon Phi KNL.
Acknowledgments. This work was financially supported by the Russian Foundation for Basic Research (grant No. 17-07-00463), by Act 211 of the Government of the Russian Federation (contract No. 02.A03.21.0011), and by the Ministry of Education and Science of the Russian Federation (government order 2.7905.2017/8.9).

References

[1] Abdullaev, S.M., Zhelnin, A.A., Lenskaya, O.Y.: The structure of mesoscale convective systems in central Russia. Russian Meteorology and Hydrology 37(1), pp. 12-20 (2012)
[2] Bacon, D.F., Graham, S.L., Sharp, O.J.: Compiler transformations for high-performance computing. ACM Computing Surveys 26, pp. 345-420 (1994). doi: 10.1145/197405.197406
[3] Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time series. In: Fayyad, U.M., Uthurusamy, R. (eds.) KDD Workshop, pp. 359-370. AAAI Press (1994)
[4] Chrysos, G.: Intel Xeon Phi coprocessor (codename Knights Corner). In: 2012 IEEE Hot Chips 24th Symposium (HCS), pp. 1-31 (2012). doi: 10.1109/HOTCHIPS.2012.7476487
[5] Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.J.: Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1(2), pp. 1542-1552 (2008)
[6] Epishev, V., Isaev, A., Miniakhmetov, R. et al.: Physiological data mining system for elite sports. Bulletin of South Ural State University. Series: Computational Mathematics and Software Engineering 2(1), pp. 44-54 (2013) (in Russian). doi: 10.14529/cmse130105
[7] Keogh, E., Ratanamahatana, C.: Exact indexing of dynamic time warping. Knowledge and Information Systems 7(3), pp. 406-417 (2005)
[8] Kostenetskiy, P., Safonov, A.: SUSU supercomputer resources. In: Sokolinsky, L., Starodubov, I. (eds.) PCT'2016, pp. 561-573. CEUR-WS, vol. 1576 (2016)
[9] Miniakhmetov, R., Movchan, A., Zymbler, M.: Accelerating time series subsequence matching on the Intel Xeon Phi many-core coprocessor. In: Biljanovic, P., Butkovic, Z. et al. (eds.) MIPRO 2015, pp. 1399-1404 (2015). doi: 10.1109/MIPRO.2015.7160493
[10] Pearson, K.: The problem of the random walk. Nature 72(1865), 342 (1905). doi: 10.1038/072342a0
[11] Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: Yang, Q., Agarwal, D., Pei, J. (eds.) KDD 2012, pp. 262-270. ACM (2012). doi: 10.1145/2339530.2339576
[12] Ratanamahatana, C., Keogh, E.J.: Three myths about Dynamic Time Warping data mining. In: Kargupta, H., Srivastava, J., Kamath, C., Goodman, A. (eds.) SDM 2005, pp. 506-510 (2005). doi: 10.1137/1.9781611972757.50
[13] Sakurai, Y., Faloutsos, C., Yamamuro, M.: Stream monitoring under the time warping distance. In: Chirkova, R., Dogac, A., Tamer Ozsu, M., Sellis, T.K. (eds.) ICDE 2007, pp. 1046-1055. IEEE Computer Society (2007). doi: 10.1109/ICDE.2007.368963
[14] Sart, D., Mueen, A., Najjar, W.A., Keogh, E.J., Niennattrakul, V.: Accelerating dynamic time warping subsequence search with GPUs and FPGAs. In: Webb, G.I., Liu, B., Zhang, C., Gunopulos, D., Wu, X. (eds.) ICDM 2010, pp. 1001-1006. IEEE Computer Society (2010). doi: 10.1109/ICDM.2010.21
[15] Sodani, A.: Knights Landing (KNL): 2nd generation Intel Xeon Phi processor. In: 2015 IEEE Hot Chips 27th Symposium (HCS), pp. 1-24. IEEE Computer Society (2015)
[16] Sokolinskaya, I., Sokolinsky, L.B.: Scalability evaluation of NSLP algorithm for solving non-stationary linear programming problems on cluster computing systems. In: Voevodin, V., Sobolev, S. (eds.) RuSCDays 2017. CCIS, vol. 793, pp. 40-53. Springer, Heidelberg (2017). doi: 10.1007/978-3-319-71255-0_4
[17] Srikanthan, S., Kumar, A., Gupta, R.: Implementing the dynamic time warping algorithm in multithreaded environments for real time and unsupervised pattern discovery. In: IEEE ICCCT 2011, pp. 394-398. IEEE Computer Society (2011). doi: 10.1109/ICCCT.2011.6075111
[18] Takahashi, N., Yoshihisa, T., Sakurai, Y., Kanazawa, M.: A parallelized data stream processing system using Dynamic Time Warping distance. In: Barolli, L., Xhafa, F., Hsu, H.-H. (eds.) CISIS 2009, pp. 1100-1105 (2009). doi: 10.1109/CISIS.2009.77
[19] Wang, Z., Huang, S., Wang, L., Li, H., Wang, Y., Yang, H.: Accelerating subsequence similarity search based on dynamic time warping distance with FPGA. In: Hutchings, B.L., Betz, V. (eds.) ACM/SIGDA FPGA'13, pp. 53-62. ACM (2013). doi: 10.1145/2435264.2435277
[20] Zhang, Y., Adl, K., Glass, J.: Fast spoken query detection using lower-bound Dynamic Time Warping on Graphical Processing Units. In: ICASSP 2012, pp. 5173-5176 (2012). doi: 10.1109/ICASSP.2012.6289085
[21] Zymbler, M.: Best-match time series subsequence search on the Intel Many Integrated Core architecture. In: Morzy, T., Valduriez, P., Bellatreche, L. (eds.) ADBIS 2015. LNCS, vol. 9282, pp. 275-286. Springer, Heidelberg (2015). doi: 10.1007/978-3-319-23135-8_19