<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ICASSP.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/ICASSP.2012.6289085</article-id>
      <title-group>
        <article-title>An Efficient Subsequence Similarity Search on Modern Intel Many-core Processors for Data Intensive Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Yana Kraeva, Mikhail Zymbler</string-name>
          <email>kraevaya@susu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2018)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>South Ural State University (National Research University)</institution>
          ,
          <addr-line>Chelyabinsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <volume>6289085</volume>
      <fpage>5173</fpage>
      <lpage>5176</lpage>
      <abstract>
<p>Many time series analytical problems arising in a wide spectrum of data intensive applications require subsequence similarity search as a subtask. Currently, Dynamic Time Warping (DTW) is considered the best similarity measure in most domains. Since DTW is computationally expensive, parallel algorithms have been developed for FPGA, GPU, and Intel Xeon Phi. In our previous work, we accelerated subsequence similarity search on the Knights Corner generation of Intel Xeon Phi by means of a CPU+Phi computational scheme. Such an approach needs significant changes for the Phi's second-generation product, Knights Landing (KNL), which is an independent bootable device supporting only native applications. In this paper, we present a novel parallel algorithm for subsequence similarity search in time series for the Intel Xeon Phi KNL many-core processor. In order to efficiently exploit the vectorization capabilities of Phi KNL, the algorithm provides a sophisticated data layout and computational scheme. We performed experiments on synthetic and real-world datasets, which showed good scalability of the algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Nowadays, time series are pervasive in a wide spectrum of
applications with data intensive analytics, e.g. climate
modelling [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], economic forecasting [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], medical
monitoring [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], etc. Many time series analytical problems
require subsequence similarity search as a subtask, which
assumes that a query subsequence and a longer time series
are given, and we are to find a subsequence of the time
series, whose similarity to the query is the maximum among
all the subsequences.
      </p>
      <p>
        At the present time, Dynamic Time Warping (DTW) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
is considered as the best time series subsequence similarity
measure in most domains [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], since it allows the
subsequence to have some stretching, shrinking, warping,
or different length in comparison to the query. Since
computation of DTW is time-consuming (O(n²), where n
is the length of the query), a number of speedup
techniques have been proposed, including algorithmic
developments (indexing methods [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], early abandoning
strategies, embedding and computation reuse
strategies [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], etc.) and parallel hardware-based solutions
for FPGA [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and GPU [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and Intel Xeon Phi [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        In this paper, the Intel Xeon Phi many-core system is the
subject of our efforts. Architecture of Phi provides a large
number of compute cores with a high local memory
bandwidth and 512-bit wide vector processing units. Being
based on the Intel x86 architecture, Phi supports
thread-level parallelism and the same programming tools as a
regular Intel Xeon CPU and serves as an attractive
alternative to FPGA and GPU. Now, Intel offers two
generations of Phi products, namely Knights Corner
(KNC) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Knights Landing (KNL) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The former is
a coprocessor with up to 61 cores, which supports native
applications as well as offloading of calculations from a host
CPU. The latter provides up to 72 cores and as opposed to
predecessor is an independent bootable device, which runs
applications only in native mode.
      </p>
      <p>
        In our previous works [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], we accelerated
subsequence similarity search on Phi KNC by means of the
CPU+Phi computational scheme. Such an approach needs
significant changes for Phi KNL because, without a host CPU,
in addition to parallelization, we have to efficiently
vectorize computations in order to achieve high
performance.
      </p>
<p>In this paper, we address accelerating subsequence
similarity search on Phi KNL for the case when the time series
involved in the computations fit in main memory. The paper
makes the following basic contributions. We developed a
novel parallel algorithm of subsequence similarity search in
time series for the Intel Xeon Phi KNL processor. The
algorithm efficiently exploits the vectorization capabilities of
Phi KNL by means of a sophisticated data layout and
computational scheme. We performed experiments on
real-world datasets, which showed good scalability of the
algorithm.</p>
      <p>The rest of the paper is organized as follows. Section 2
discusses related works. Section 3 gives formal notation and
statement of the problem. In Section 4, we present the
proposed parallel algorithm. We describe experimental
evaluation of our algorithm in Section 5. Finally, in
Section 6, we summarize the results obtained and propose
directions for further research.</p>
      <p>In our previous CPU+Phi scheme, the algorithm selects candidate
subsequences and offloads them to the coprocessor in order to
compute DTW. Such a scheme significantly outperformed the
naïve approach.</p>
<p>This development, however, cannot be directly applied
to Phi KNL since it is an independent bootable many-core
processor and supports only native applications. Thus, our
previous approach needs significant changes. Moreover, in
order to achieve high performance of similarity search on
Phi KNL, in addition to parallelization, we should provide a
data layout and computational scheme that allow exploiting
the vectorization capabilities of the many-core processor
in the most efficient way.</p>
    </sec>
    <sec id="sec-2">
      <title>3 Notation and problem background</title>
      <sec id="sec-2-1">
        <title>3.1 Definitions and notations</title>
<p>Definition 1. A time series T is a sequence of real-valued
elements: T = t_1, t_2, …, t_m. The length of a time series T is
denoted by |T|.</p>
        <p>Definition 2. Given two time series X = x_1, …, x_n and
Y = y_1, …, y_n, the Dynamic Time Warping (DTW) distance
between X and Y is denoted by DTW(X, Y) and defined as below:
DTW(X, Y) = d(n, n), (1)
d(i, j) = (x_i − y_j)² + min { d(i − 1, j), d(i, j − 1), d(i − 1, j − 1) }, (2)
d(0, 0) = 0; d(i, 0) = d(0, j) = ∞; i = j = 1, …, n. (3)
In (2), the set of d(i, j) is considered as an n × n
warping matrix for the alignment of the two respective time
series.</p>
        <p>A warping path is a contiguous set of warping matrix
elements that defines a mapping between the two time
series. The warping path must start and finish in diagonally
opposite corner cells of the warping matrix, the steps in the
warping path are restricted to adjacent cells, and the points
in the warping path must be monotonically spaced in time.</p>
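<p>As a minimal illustration of the recurrence in Definition 2, the (squared) DTW distance can be sketched in Python as follows; this is an O(n²) reference implementation, not the paper's optimized native Phi code:</p>

```python
import math

def dtw(x, y):
    """Squared DTW distance between equal-length sequences x and y,
    following d(i, j) = (x_i - y_j)^2 + min of the three neighbors,
    with d(0, 0) = 0 and infinite boundary values."""
    n = len(x)
    # d[i][j] holds the cumulative warping cost; row/column 0 hold boundaries.
    d = [[math.inf] * (n + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][n]
```

<p>For identical sequences the distance is zero, and the warping matrix d coincides with the matrix described above.</p>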
<p>Definition 3. A subsequence T_{i,n} of a time series T is its
contiguous subset of n elements, which starts from position
i: T_{i,n} = t_i, t_{i+1}, …, t_{i+n−1}, 1 ≤ i ≤ m − n + 1.</p>
<p>Definition 4. Given a time series T and a time series Q as a
user-specified query, where m = |T| ≫ |Q| = n, the best matching
subsequence T_{i,n} meets the property
∀ k: DTW(Q, T_{i,n}) ≤ DTW(Q, T_{k,n}), 1 ≤ i, k ≤ m − n + 1. (4)
In what follows, where there is no ambiguity, we refer to a
subsequence T_{i,n} as C, a candidate in match to the query Q.
Strictly speaking, DTW allows comparing two time series of
different lengths. However, for the sake of simplicity, we assume
time series of equal lengths, since this is possible without losing
the generality [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].</p>
        <p>The problem of subsequence similarity search under the
DTW measure has been extensively studied in the recent decade.
Since computation of DTW is time-consuming, a number of speedup
techniques have been proposed: indexing
methods [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], early abandoning strategies, embedding and
computation reuse strategies [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], etc. Despite the fact that these
techniques reduce the number of calls of the DTW
calculation procedure, computation of the DTW
measure still takes up to 80 percent of the total run time of
the similarity search [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. For this reason, a number of
parallel hardware-based solutions have been developed.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], the authors proposed subsequence similarity search
on a CPU cluster. Subsequences starting from different
positions of the time series are sent to different nodes, and
each node calculates DTW in the naïve way. In [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], the authors
accelerated subsequence similarity search on an SMP
system. They distribute different queries over different
cores, and each subsequence is sent to different cores to be
compared with different patterns in the naïve way. In both
implementations, the data transfer becomes the bottleneck.</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], a GPU-based implementation was proposed.
The warping matrix is generated in parallel, but the warping
path is searched serially. Since the matrix generation step
and the path search step are split into two kernels, this leads
to overheads for storage and transmission of the warping
matrix for each DTW calculation.
        </p>
<p>Definition 5. The z-normalization of a time series T is
defined as a time series T̂ = t̂_1, t̂_2, …, t̂_n, where
t̂_i = (t_i − μ) / σ, (5)
μ = (1/n) Σ_{i=1}^{n} t_i, σ² = (1/n) Σ_{i=1}^{n} t_i² − μ². (6)</p>
        <p>Definition 6. The Euclidean distance (ED) between two
(z-normalized) subsequences X and Y, where |X| = |Y| = n, is
defined as below:
ED(X, Y) = √( Σ_{i=1}^{n} (x_i − y_i)² ). (7)</p>
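<p>Definitions 5 and 6 can be sketched in Python as follows; a minimal illustrative version, with the squared ED used later in place of the square-rooted one:</p>

```python
import math

def znormalize(t):
    """Z-normalize a sequence: zero mean, unit standard deviation
    (Definition 5), using sigma^2 = E[t^2] - mu^2."""
    n = len(t)
    mu = sum(t) / n
    sigma = math.sqrt(sum(v * v for v in t) / n - mu * mu)
    return [(v - mu) / sigma for v in t]

def ed_squared(x, y):
    """Squared Euclidean distance between equal-length sequences
    (Definition 6 without the square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))
```

<p>After z-normalization, the sample mean of the result is zero and its sample variance is one, which is exactly what the comparison of subsequences requires.</p>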
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Serial algorithm</title>
        <p>
          Currently, UCR-DTW [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is the fastest serial algorithm of
subsequence similarity search, which integrates a large
number of algorithmic speedup techniques. Since our
algorithm is based on UCR-DTW, in this section, we briefly
describe its basic features that are essentially exploited in
our approach.
        </p>
<p>Squared distances. Instead of using the square root in DTW
and ED distance calculations, it is possible to use the squares
thereof. Since both functions are monotonic and concave, this
does not change the relative rankings of subsequences. In
what follows, where there is no ambiguity, we will still use
DTW and ED assuming the squared versions of them.</p>
        <p>Z-normalization. Both the query subsequence and each
subsequence of the time series need to be z-normalized
before the comparison. Z-normalization shifts and scales the
time series such that the mean is zero and the standard
deviation is one.</p>
<p>
          Cascading Lower Bounds. A lower bound (LB) is an easily
computable threshold of the DTW distance measure used to
identify and prune clearly dissimilar subsequences [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In
what follows, we refer to this threshold as the best-so-far
distance (or bsf for brevity), since UCR-DTW tries to
improve (decrease) it while scanning the time series. If the
lower bound exceeds bsf, the DTW distance will exceed it
as well, and the respective subsequence is
assumed to be clearly dissimilar and is pruned without
calculation of DTW. UCR-DTW exploits the following
LBs, namely LBKimFL [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], LBKeoghEC and LBKeoghEQ [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
and they are applied in a cascade.
        </p>
<p>The LBKimFL lower bound uses the distances between
the First (Last) pair of points from Q and C as a lower
bound, and is defined as below:
LBKimFL(Q, C) := (q̂_1 − ĉ_1)² + (q̂_n − ĉ_n)². (9)</p>
        <p>The LBKeoghEC lower bound is the distance from a candidate
subsequence to the closer of the two envelopes of the query.
Here U = u_1, …, u_n and L = ℓ_1, …, ℓ_n are the upper envelope
and lower envelope of the query, respectively, where
u_i = max(q̂_{i−r}, …, q̂_{i+r}) and ℓ_i = min(q̂_{i−r}, …, q̂_{i+r}),
and the warping path cannot deviate more than r cells from the diagonal of
the warping matrix. LBKeoghEC is defined as below:
LBKeoghEC(Q, C) := Σ_{i=1}^{n} { (ĉ_i − u_i)², if ĉ_i &gt; u_i; (ĉ_i − ℓ_i)², if ĉ_i &lt; ℓ_i; 0, otherwise }. (10)</p>
        <p>The LBKeoghEQ lower bound is the distance from the
query to the closer of the two envelopes of a candidate
subsequence. That is, as opposed to LBKeoghEC, the roles of
the query and the candidate subsequence are reversed:
LBKeoghEQ(Q, C) := LBKeoghEC(C, Q). (11)</p>
<p>UCR-DTW performs as follows. At first, bsf
is assumed to be equal to infinity. Then
the algorithm scans the input time series, applying the
cascade of LBs to the current subsequence. If the
subsequence is not pruned, then the DTW distance is
calculated. Next, bsf is updated if it is greater than the value
of the DTW distance calculated above. By doing so, in the end,
UCR-DTW finds the best matching subsequence of the
given time series.</p>
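<p>The lower bounds and the cascade described above can be sketched in Python; this is an illustrative version (not the paper's native Phi code), with the envelope and pruning logic written out explicitly:</p>

```python
def envelope(q, r):
    """Upper/lower envelopes of q under warping constraint r:
    u_i = max(q[i-r..i+r]), l_i = min(q[i-r..i+r])."""
    n = len(q)
    u, l = [], []
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        window = q[lo:hi]
        u.append(max(window))
        l.append(min(window))
    return u, l

def lb_keogh(c, u, l):
    """LB_Keogh: squared distance from sequence c to the closer envelope."""
    s = 0.0
    for ci, ui, li in zip(c, u, l):
        if ci > ui:
            s += (ci - ui) ** 2
        elif ci < li:
            s += (ci - li) ** 2
    return s

def lb_kim_fl(q, c):
    """LB_KimFL: squared distances between the first and last pairs of points."""
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2

def cascade_prune(q, c, r, bsf):
    """True if the candidate c is clearly dissimilar to q, i.e. some LB
    in the cascade already reaches the best-so-far distance bsf."""
    if lb_kim_fl(q, c) >= bsf:
        return True
    u, l = envelope(q, r)
    if lb_keogh(c, u, l) >= bsf:          # LB_KeoghEC
        return True
    uc, lc = envelope(c, r)
    if lb_keogh(q, uc, lc) >= bsf:        # LB_KeoghEQ: roles reversed
        return True
    return False
```

<p>Only candidates surviving all three lower bounds proceed to the expensive DTW computation, which is the essence of the cascade.</p>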
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Method</title>
<p>In this section, we present a novel computational scheme
and data layout, which allow efficient parallelization and
vectorization of subsequence similarity search on Intel
Xeon Phi KNL.</p>
      <sec id="sec-3-1">
        <title>Vectorization</title>
<p>Vectorization plays the key role in getting high
performance on Phi KNL: the compiler aims to
transform the loops into sequences of vector operations that
utilize the VPUs. Thus, in order to provide high
performance of subsequence similarity search on Phi KNL,
we should organize computations with as many
auto-vectorizable loops as possible. However, many
auto-vectorizable loops alone are not enough
for the overall good performance of an algorithm.</p>
        <p>
          Unaligned memory access is one of the basic factors that
can cause inefficient vectorization due to timing overhead
for loop peeling [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. If the start address of the processed data
is not aligned by the VPU width (i.e. by the number of floats
that can be loaded in VPU), the loop is split into three parts
by the compiler. The first part of the iterations, which accesses
memory from the start address up to the first aligned address, is
peeled off and vectorized separately. The remaining iterations,
from the last aligned address to the end address, are
split off and vectorized separately as well.
        </p>
<p>Following this argumentation, we propose a data
layout that provides aligned access to subsequences
of the time series. Applying this data layout, we also
propose the respective computational scheme, where as
many computations as possible are implemented as
auto-vectorizable for-loops.</p>
        <sec id="sec-3-1-1">
          <title>4.1 Data layout</title>
<p>Definition 7. Given a subsequence C of length n and the VPU
width w (i.e. the number of floats that can be loaded in the VPU),
we define the alignment of C as a subsequence C̃ of n + pad
elements, where pad zeros are appended so that the length becomes
a multiple of w:
C̃ := c_1, c_2, …, c_n, 0, 0, …, 0, if n mod w &gt; 0;
C̃ := C, otherwise. (14)</p>
<p>According to Def. 2, ∀ Q, C: |Q| = |C|,
DTW(Q, C) = DTW(Q̃, C̃). Thus, in what follows, we still write Q
and C for the query subsequence and a subsequence of the input
time series, assuming, however, the aligned versions
thereof. Alignment of subsequences allows us to avoid timing
overheads for loop peeling.</p>
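<p>Definition 7 amounts to zero-padding each subsequence up to a multiple of the VPU width; a minimal Python sketch (in the actual native code, the base address would additionally be aligned, e.g. to 64 bytes, which this sketch does not model):</p>

```python
def align(c, w):
    """Pad subsequence c with zeros so its length becomes a multiple
    of the VPU width w (Definition 7); unchanged if already a multiple."""
    pad = (-len(c)) % w  # number of zeros needed to reach a multiple of w
    return list(c) + [0.0] * pad
```

<p>For a 512-bit VPU operating on 32-bit floats, w = 16, so a subsequence of length 100 would be padded to 112 elements.</p>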
          <p>Next, we store all the (aligned) subsequences of a time
series as a matrix in order to provide auto-vectorization of
computations.</p>
<p>Definition 8. Let us denote the number of all the
subsequences of length n in a time series T as N,
N = |T| − n + 1 = m − n + 1. We define the subsequence
matrix (i.e. the matrix of all aligned subsequences of length
n from a time series T), S^T ∈ ℝ^{N×(n+pad)}, as below:
S^T(i, j) := t̃_{i+j−1}. (15)</p>
<p>We also establish two more matrices regarding lower
bounding of all subsequences of the input time series. The
first matrix stores, for each subsequence, the values of all the
LBs in the cascade (cf. Sect. 3.2). The second one is a bitmap
matrix, which stores, for each subsequence, the result of
comparison of bsf with every LB.</p>
<p>Definition 9. Let us denote the number of LBs exploited
by the subsequence similarity search algorithm as lbmax
and enumerate them according to their order in the lower
bounding cascade. Given a time series T, we define the
LB-matrix of all subsequences of length n from T, L^T ∈
ℝ^{N×lbmax}, as below:
L^T(i, j) := LB_j(S^T(i,·)). (16)</p>
          <p>Definition 10. Given a time series T, we define the
bitmap matrix of all subsequences of length n from T,
B^T ∈ 𝔹^{N×lbmax}, as below:
B^T(i, j) := (L^T(i, j) &lt; bsf). (17)</p>
<p>Finally, we establish a matrix to store candidate
subsequences, i.e. those subsequences from the S^T matrix
which have not been pruned after the lower bounding. This
matrix will be processed in parallel by calculating the DTW
distance measure between each row, which represents a
subsequence, and the query.</p>
          <p>Definition 11. Let us denote the number of threads
employed by the parallel algorithm as p (p ≥ 1). Given the
segment size s ∈ ℤ+ (s ≤ ⌈N/p⌉), we define the candidate
matrix C^T ∈ ℝ^{(p·s)×(n+pad)} as the matrix composed of the
rows of the subsequence matrix that pass all the lower bounds:
C^T(k,·) := S^T(i,·) : ∀ j, 1 ≤ j ≤ lbmax, B^T(i, j) = 1. (18)</p>
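<p>The data layout of Definitions 8 and 10 can be sketched in Python; an illustrative version with plain lists (the actual implementation stores these as contiguous aligned arrays):</p>

```python
def subsequence_matrix(t, n, w):
    """Matrix of all aligned subsequences of length n from t (Definition 8):
    one row per start position, each row zero-padded to a multiple of
    the VPU width w."""
    pad = (-n) % w
    return [list(t[i:i + n]) + [0.0] * pad for i in range(len(t) - n + 1)]

def bitmap_matrix(lb_matrix, bsf):
    """Bitmap matrix (Definition 10): 1 where the lower bound is below
    bsf, i.e. the subsequence is still a potential candidate."""
    return [[1 if v < bsf else 0 for v in row] for row in lb_matrix]
```

<p>Rows of the subsequence matrix whose bitmap row is all ones would then be copied into the candidate matrix of Definition 11.</p>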
        </sec>
        <sec id="sec-3-1-2">
<title>4.2 Computational scheme</title>
<p>We are finally in a position to introduce our approach. In
order to start the search, the algorithm initializes bsf by
calculating the DTW distance measure between the query
and the first subsequence.</p>
<p>After that, the algorithm carries out the following loop
as long as there are subsequences that have not been pruned
during the search. At first, lower bounding is performed and
the bitmap matrix is calculated in parallel. Next, promising
candidates are added to the candidate matrix in serial mode.
Then, the DTW distance measure between the query and
each subsequence of the candidate matrix is calculated in
parallel, and the minimal distance is found. At the end of the loop,
we output the index of the subsequence with the minimal
distance measure. Below, we describe these steps in detail.</p>
          <p>Algorithm PHIBESTMATCH (IN: time series T, query Q;
OUT: bsf, similarity of the best match subsequence;
bestmatch, index of the best match subsequence)
1: CALCENVELOPE(Q, r, U, L)
2: CALCLOWERBOUNDS(S^T, Q, U, L)
3: bsf ← UCRDTW(S^T(1,·), Q, r, ∞)
4: numcand ← 1
5: while numcand &gt; 0 do
6:   LOWERBOUNDING(L^T, B^T, bsf)
7:   numcand ← FILLCANDMATR(S^T, C^T, B^T, bsf)
8:   if numcand &gt; 0 then
9:     bestmatch ← CALCCANDMATR(C^T, Q, r, bsf)
11: return bestmatch</p>
          <p>Algorithm CALCLOWERBOUNDS (IN: subsequence matrix S^T,
query Q, envelopes U and L of the query; OUT: LB-matrix L^T)
1: #pragma omp parallel for num_threads(p)
2: for i from 1 to N do
3:   ZNORMALIZE(S^T(i,·))
4:   L^T(i,1) ← LBKIMFL(Q, S^T(i,·))
5:   L^T(i,2) ← LBKEOGHEC(Q, S^T(i,·))
6:   CALCENVELOPE(S^T(i,·), r, U, L)
7:   L^T(i,3) ← LBKEOGHEQ(S^T(i,·), Q, U, L)</p>
          <p>Figure 9 Calculation of lower bounds</p>
<p>Strictly speaking, this step brings redundant calculations
to our algorithm. In contrast, UCR-DTW calculates the next
LB in the cascade only if the current subsequence has not been
recognized as clearly dissimilar after the calculation of the
previous LB. As opposed to UCR-DTW, we calculate all the LBs and
z-normalized versions for all the subsequences because of the
following reasons. Firstly, it is possible to perform such
computations once and before the scanning of all the
subsequences. Secondly, such computations can
be efficiently vectorized by the compiler.</p>
<p>Lower bounding. Figure 3 depicts the pseudo-code for
lower bounding of subsequences. The algorithm performs
lower bounding by scanning the LB-matrix and calculating
the respective row of the bitmap matrix. In the next step of
the algorithm, a subsequence corresponding to a row of
the bitmap matrix where each element equals one will be
added to the candidate matrix in order to further calculate
the DTW distance measure.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
<title>4.3 Lower bounding and candidate matrix processing</title>
<p>Algorithm LOWERBOUNDING (IN: LB-matrix L^T,
best-so-far similarity distance bsf; OUT: bitmap matrix B^T)
1: #pragma omp parallel num_threads(p)
2: whoami ← omp_get_thread_num()
3: for i from segindex(whoami) to s · whoami do
4:   for j from 1 to lbmax do
5:     B^T(i, j) ← (L^T(i, j) &lt; bsf)</p>
        <p>The algorithm proceeds as long as subsequences that are not
clearly dissimilar exist. Parallel processing is based on the
following technique. The LB-matrix is logically divided
into p equal-sized segments, and each thread scans its own
segment. In order to avoid scanning each segment from
scratch, we establish the segment index as an array of p
elements, where each element keeps the index of the most
recent candidate subsequence in the respective segment, i.e.
segindex = (idx_1, …, idx_p), where
idx_k := max { i : s · (k − 1) + 1 ≤ i ≤ s · k ∧ ∀ j, 1 ≤ j ≤ lbmax: L^T(i, j) &lt; bsf },
and idx_k := 0 if no such i exists. (19)</p>
<p>The algorithm scans the bitmap matrix
along the segments. We start scanning not from the
beginning of a segment but from the respective segment’s
index, which stores the number of the most recent candidate
subsequence in the segment. If a subsequence is promising,
it is added to the candidate matrix.</p>
<p>In order to output the index of the best match subsequence,
we establish the candidate subsequence index as an array of
p · s elements, where each element keeps the starting
position of a candidate subsequence in the input time series:
candindex = (pos_1, …, pos_{p·s}), where pos_k := i,
1 ≤ i ≤ m − n + 1, such that the k-th row of the candidate
matrix C^T stores the subsequence T_{i,n}. (20)</p>
        <p>Algorithm CALCCANDMATR (IN: candidate matrix C^T, query Q;
OUT: bsf, similarity of the best-so-far subsequence;
bestmatch, index of the best-so-far subsequence)
1: #pragma omp parallel for num_threads(p) shared(bsf, idx) private(distance)
2: for i from 1 to numcand do
3:   distance ← UCRDTW(C^T(i,·), Q, r, bsf)
4:   #pragma omp critical
5:   if bsf &gt; distance then
6:     bsf ← distance
7:     bestmatch ← candindex(i)
9: return bestmatch</p>
<p>Figure 12 Processing of the candidate matrix</p>
        <p>The algorithm performs as follows. For each row of the
candidate matrix, we calculate the DTW distance measure
between the respective candidate and the query by means of
the UCR-DTW algorithm. If this distance is less than bsf,
then bsf is updated. The loop is parallelized by means of
the OpenMP pragma, where bsf is indicated as a variable
shared across all threads, while the distance variable is
indicated as private for each thread. In order to correctly
update the shared variable, we use the pragma with a critical
section.</p>
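<p>The shared/private pattern above can be mimicked in a Python sketch (illustrative only; the paper's implementation uses native OpenMP code, and the lock here plays the role of the critical section):</p>

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def process_candidates(candidates, query, distance_fn, n_threads=4):
    """Parallel scan of the candidate matrix: each worker computes the
    distance of its candidates to the query (private state) and updates
    the shared best-so-far value under a lock (the critical section)."""
    best = {"bsf": float("inf"), "index": -1}
    lock = threading.Lock()

    def work(item):
        i, cand = item
        d = distance_fn(cand, query)   # private to the worker
        with lock:                     # critical section: shared update
            if d < best["bsf"]:
                best["bsf"] = d
                best["index"] = i

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, enumerate(candidates)))
    return best["bsf"], best["index"]
```

<p>Because every update of the shared minimum happens under the lock, the result is the global minimum regardless of thread scheduling, which mirrors the correctness argument for the OpenMP critical section.</p>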
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 Experiments</title>
      <sec id="sec-4-1">
        <title>5.1 Experimental setup</title>
<p>Objectives. In the experiments, we compared the
performance of our algorithm with UCR-DTW.
We also evaluated the scalability of our algorithm on
Intel Xeon Phi for different datasets. We measured the run
time (after deduction of I/O time) and calculated the
algorithm’s speedup and parallel efficiency. Here we
understand these characteristics of parallel algorithm
scalability as follows. The speedup and parallel efficiency of a
parallel algorithm employing p threads are calculated,
respectively, as S(p) = t_1 / t_p and E(p) = S(p) / p,
where t_1 and t_p are the run times of the algorithm when one
and p threads are employed, respectively.</p>
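<p>These two definitions translate directly into code; a trivial sketch with hypothetical run times:</p>

```python
def speedup(t1, tp):
    """S(p) = t_1 / t_p: ratio of single-thread to p-thread run time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(p) = S(p) / p: speedup normalized by the number of threads."""
    return speedup(t1, tp) / p
```

<p>For example, a run that takes 100 s on one thread and 10 s on 20 threads has a speedup of 10× and a parallel efficiency of 50 percent.</p>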
<p>Hardware. We performed our experiments on a node
of the Tornado SUSU supercomputer [8] with the
characteristics summarized in Table 1.</p>
        <sec id="sec-4-1-2">
          <title>5.2 Results and discussion</title>
<p>Figure 6 depicts the performance of PHIBESTMATCH in
comparison with the UCR-DTW algorithm. As we can see,
our algorithm is two times faster for both the EPG and Random
Walk datasets when UCR-DTW runs on one CPU core and
PHIBESTMATCH runs on 240 cores of Intel Xeon Phi. Being
run on Intel Xeon Phi, UCR-DTW is obviously slower than
PHIBESTMATCH (from 10 to 15 times).</p>
          <p>On the Random Walk dataset, speedup is close to linear
while the number of threads does not exceed the number of
physical cores. However, when more than one thread per physical
core is used, speedup becomes sub-linear, and efficiency decreases
accordingly. That is, speedup stops increasing at 80×
when 120 threads are employed, and efficiency drops from
60 percent for 120 threads to 30 percent for 240 threads.</p>
<p>On EPG data, the algorithm shows close to linear
speedup and efficiency (up to 50× and at least 80 percent,
respectively) if the number of threads employed is at most the
number of physical cores. When more than one thread per
physical core is used, speedup slowly increases up to 78×,
and efficiency drops accordingly (similar to the picture for
the Random Walk dataset).</p>
<p>We can conclude that the proposed algorithm
demonstrates close to linear scalability when the number of
threads it runs on does not exceed the number of physical cores of
the Intel Xeon Phi many-core processor. However, when
more than one thread per physical core is used, speedup and
parallel efficiency decrease significantly.</p>
          <p>There are two possible reasons for this. Firstly, our
algorithm is not completely parallel, and the candidate matrix
filling is its serial part, which limits speedup. Secondly,
according to its nature, the DTW calculation (cf. Def. 2) can
hardly ever be auto-vectorized. Thus, if during the (seamlessly
auto-vectorizable) lower bounding step many subsequences
have not been pruned as clearly dissimilar, then they will be
processed by many threads in the DTW calculation step, but
without the auto-vectorization that might be expected.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 Conclusion</title>
<p>
        In this paper, we address the problem of accelerating
subsequence similarity search on the modern Intel Xeon Phi
system of the Knights Landing (KNL) generation. Phi KNL is
an independent bootable device, which provides up to 72
compute cores with a high local memory bandwidth and
512-bit wide vector processing units. Being based on the
x86 architecture, the Intel Phi KNL many-core processor
supports thread-level parallelism and the same
programming tools as a regular Intel Xeon CPU, and serves
as an attractive alternative to FPGA and GPU. We consider
the case when the time series involved in the computations fit
in main memory. We developed a novel parallel algorithm of
subsequence similarity search for Intel Xeon Phi KNL,
called PHIBESTMATCH. Our algorithm is based on
UCR-DTW [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which is the fastest serial algorithm of
subsequence similarity search, since it integrates cascading
lower bounding and many other algorithmic speedup
techniques. PHIBESTMATCH efficiently exploits the
vectorization capabilities of Phi KNL by means of a
sophisticated data layout and computational scheme.
      </p>
<p>We performed experiments on synthetic and
real-world datasets, which showed the following.
PHIBESTMATCH, being run on Intel Xeon Phi, is two times
faster than UCR-DTW being run on Intel Xeon. The
proposed algorithm demonstrates close to linear
speedup and parallel efficiency when the number of
threads it runs on does not exceed the number of physical cores
of Intel Xeon Phi.</p>
<p>In further research, we plan to advance our approach
in the following directions: improve the parallelization of
the UCR-DTW algorithm for Intel Xeon Phi KNL, and
extend our algorithm to computer cluster systems
with nodes equipped with Intel Xeon Phi KNL.</p>
      <p>Acknowledgments. This work was financially
supported by the Russian Foundation for Basic Research
(grant No. 17-07-00463), by Act 211 of the Government of the
Russian Federation (contract No. 02.A03.21.0011) and
by the Ministry of Education and Science of the Russian
Federation (government order 2.7905.2017/8.9).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Abdullaev</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhelnin</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenskaya</surname>
            ,
            <given-names>O.Y.</given-names>
          </string-name>
          :
          <article-title>The structure of mesoscale convective systems in central Russia</article-title>
          .
          <source>Russian Meteorology and Hydrology</source>
          .
          <volume>37</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>12</fpage>
          -
          <lpage>20</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bacon</surname>
            ,
            <given-names>D.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graham</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharp</surname>
            ,
            <given-names>O.J.:</given-names>
          </string-name>
          <article-title>Compiler transformation for high-performance computing</article-title>
          .
          <source>ACM Computing Surveys. 26</source>
          , pp.
          <fpage>345</fpage>
          -
          <lpage>420</lpage>
          (
          <year>1994</year>
). doi:10.1145/197405.197406
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Berndt</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clifford</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Using dynamic time warping to find patterns in time series</article-title>
. In: Fayyad, U.M., Uthurusamy, R. (eds.) KDD Workshop, pp.
          <fpage>359</fpage>
          -
          <lpage>370</lpage>
          . AAAI Press (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Chrysos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Intel Xeon Phi coprocessor (codename Knights Corner)</article-title>
          . In:
          <source>2012 IEEE Hot Chips 24th Symposium (HCS)</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          (
          <year>2012</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/HOTCHIPS.2012.7476487</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trajcevski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scheuermann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          :
          <article-title>Querying and mining of time series data: experimental comparison of representations and distance measures</article-title>
          .
          <source>PVLDB</source>
          <volume>1</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1542</fpage>
          -
          <lpage>1552</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Epishev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isaev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miniakhmetov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          et al.:
          <article-title>Physiological data mining system for elite sports</article-title>
          .
          <source>Bull. of South Ural State University. Series: Comput. Math. and Soft. Eng.</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>44</fpage>
          -
          <lpage>54</lpage>
          ,
          <year>2013</year>
          (
          <year>2013</year>
          ). (in Russian) doi:
          <pub-id pub-id-type="doi">10.14529/cmse130105</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ratanamahatana</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Exact indexing of dynamic time warping</article-title>
          .
          <source>Knowl. Inf. Syst.</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>406</fpage>
          -
          <lpage>417</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Kostenetskiy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Safonov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SUSU supercomputer resources</article-title>
          . In:
          <string-name>
            <surname>Sokolinsky</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Starodubov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (eds.) PCT'2016, pp.
          <fpage>561</fpage>
          -
          <lpage>573</lpage>
          . CEUR-WS, vol.
          <volume>1576</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Miniakhmetov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Movchan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zymbler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Accelerating time series subsequence matching on the Intel Xeon Phi many-core coprocessor</article-title>
          . In:
          <string-name>
            <surname>Biljanovic</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butkovic</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          et al. (eds.)
          <source>MIPRO</source>
          <year>2015</year>
          , pp.
          <fpage>1399</fpage>
          -
          <lpage>1404</lpage>
          (
          <year>2015</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/MIPRO.2015</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Pearson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The problem of the random walk</article-title>
          .
          <source>Nature</source>
          <volume>72</volume>
          (
          <issue>1865</issue>
          ),
          <fpage>342</fpage>
          (
          <year>1905</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1038/072342a0</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Rakthanmanon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campana</surname>
            ,
            <given-names>B.J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mueen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batista</surname>
            ,
            <given-names>G.E.A.P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Westover</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zakaria</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          :
          <article-title>Searching and mining trillions of time series subsequences under dynamic time warping</article-title>
          . In:
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (eds.) KDD, pp.
          <fpage>262</fpage>
          -
          <lpage>270</lpage>
          . ACM (
          <year>2012</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1145/2339530.2339576</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Ratanamahatana</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          :
          <article-title>Three myths about Dynamic Time Warping Data Mining</article-title>
          . In:
          <string-name>
            <surname>Kargupta</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamath</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.)
          <source>SDM</source>
          <year>2005</year>
          , pp.
          <fpage>506</fpage>
          -
          <lpage>510</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1137/1.9781611972757.50</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Sakurai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faloutsos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamamuro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Stream monitoring under the time warping distance</article-title>
          . In:
          <string-name>
            <surname>Chirkova</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dogac</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tamer Ozsu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sellis</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          (eds.)
          <source>ICDE</source>
          <year>2007</year>
          , pp.
          <fpage>1046</fpage>
          -
          <lpage>1055</lpage>
          . IEEE Computer Society (
          <year>2007</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/ICDE.2007.368963</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Sart</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mueen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Najjar</surname>
            ,
            <given-names>W.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niennattrakul</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Accelerating dynamic time warping subsequence search with GPUs and FPGAs</article-title>
          . In:
          <string-name>
            <surname>Webb</surname>
            ,
            <given-names>G.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gunopulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (eds.)
          <source>ICDM</source>
          <year>2010</year>
          , pp.
          <fpage>1001</fpage>
          -
          <lpage>1006</lpage>
          . IEEE Computer Society (
          <year>2010</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/ICDM.2010.21</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Sodani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Knights Landing (KNL): 2nd generation Intel Xeon Phi processor</article-title>
          . In:
          <source>2015 IEEE Hot Chips 27th Symposium (HCS)</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          . IEEE Computer Society (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Sokolinskaya</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sokolinsky</surname>
            ,
            <given-names>L.B.</given-names>
          </string-name>
          :
          <article-title>Scalability evaluation of NSLP algorithm for solving nonstationary linear programming problems on cluster computing systems</article-title>
          . In:
          <string-name>
            <surname>Voevodin</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sobolev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (eds.)
          <source>RuSCDays 2017</source>
          . CCIS, vol.
          <volume>793</volume>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>53</lpage>
          . Springer, Heidelberg (
          <year>2017</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1007/978-3-319-71255-0_4</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Srikanthan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Implementing the dynamic time warping algorithm in multithreaded environments for real time and unsupervised pattern discovery</article-title>
          . In:
          <source>IEEE ICCCT</source>
          , pp.
          <fpage>394</fpage>
          -
          <lpage>398</lpage>
          . IEEE Computer Society (
          <year>2011</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/ICCCT.2011.6075111</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Takahashi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoshihisa</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakurai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanazawa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A parallelized data stream processing system using Dynamic Time Warping distance</article-title>
          . In:
          <string-name>
            <surname>Barolli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xhafa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>H.-H.</given-names>
          </string-name>
          (eds.) CISIS, pp.
          <fpage>1100</fpage>
          -
          <lpage>1105</lpage>
          (
          <year>2009</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/CISIS.2009.77</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Yu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Accelerating subsequence similarity search based on dynamic time warping distance with FPGA</article-title>
          . In:
          <string-name>
            <surname>Hutchings</surname>
            ,
            <given-names>B.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betz</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (eds.) ACM/SIGDA FPGA'13
          , pp.
          <fpage>53</fpage>
          -
          <lpage>62</lpage>
          . ACM (
          <year>2013</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1145/2435264.2435277</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adl</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glass</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Fast spoken query detection using lower-bound Dynamic Time Warping on Graphical Processing Units</article-title>
          . In:
          <source>ICASSP 2012</source>
          , pp.
          <fpage>5173</fpage>
          -
          <lpage>5176</lpage>
          . IEEE (
          <year>2012</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/ICASSP.2012.6289085</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Zymbler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Best-Match time series subsequence search on the Intel Many Integrated Core architecture</article-title>
          . In:
          <string-name>
            <surname>Morzy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valduriez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellatreche</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <source>ADBIS 2015</source>
          . LNCS, vol.
          <volume>9282</volume>
          , pp.
          <fpage>275</fpage>
          -
          <lpage>286</lpage>
          . Springer, Heidelberg (
          <year>2015</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1007/978-3-319-23135-8_19</pub-id>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>