High-performance DTW-based signal comparison for brain electroencephalogram analysis

A.I. Makarova¹, V.V. Sulimova¹
¹ Tula State University, Lenin Ave., 92, 300012, Tula, Russia

Abstract
Automatic analysis of electroencephalograms (EEGs) is a promising area of research whose results can be used, in particular, to build systems for the mental control of objects. In this work, the Dynamic Time Warping (DTW) procedure is used to compare signals representing EEGs. An important feature of the problem under consideration is the need for multiple signal comparisons at the machine learning stage, which requires enormous computational cost. We propose a parallel algorithm that was implemented in C++ using the MPI technology and tested on the resources of the supercomputer complex of Moscow State University, "Lomonosov". Testing on real data showed that the proposed method achieves an almost linear speedup and reduces the total calculation time from 29 days to 3.5 hours using 128 processes, which opens the possibility of improving the quality of automatic EEG analysis.

Keywords: high-performance computing; comparison of electroencephalogram fragments; Dynamic Time Warping; the MPI technology; the supercomputer complex of MSU "Lomonosov"

1. Introduction
Electroencephalography is a method of reading electrical signals using electrodes located on the scalp. These signals are generated by the brain in the course of its activity [1,2] and are recorded as electroencephalograms for subsequent analysis. Automatic analysis of brain electroencephalograms (EEGs) is a promising area of scientific research whose results can be used, in particular, to build systems for the mental control of external devices, e.g. a computer (brain-computer interface, BCI) [3].
The idea of such a brain-computer interface is based on the fact that signals generated by the brain under the influence of certain (target) stimuli, corresponding to a specific task, contain specific components that are absent under the influence of other stimuli. This happens, in particular, when recognizing an object of a given type among a number of other objects. The problem of automatically recognizing the target object on the basis of EEG analysis therefore plays one of the key roles in the construction of a system for the mental control of external objects [4,5], and it is especially important to improve the quality of its solution.

In this paper, following [5], we focus on an experimental research scheme in which a person is shown a series of mammogram images and asked to find among them the mammograms containing pathologies (the so-called target mammograms). EEG signals from 66 electrodes fixed on different parts of the head are recorded while the images are being viewed. The task in this case is the automatic detection of the signals registered while target mammograms are being viewed. Figure 1a shows the process of registering electroencephalograms, and Figures 1b and 1c show examples of mammograms with and without pathology, respectively [5].

Fig. 1. a) The process of registering electroencephalograms, b) an example of a target object (a mammogram with a pathology), c) an example of a non-target object (a mammogram without pathologies).

Figure 2 shows examples of signals recorded from two of the 66 electrodes during the viewing of target (solid lines) and non-target (dashed lines) objects. The signals are subjected to preprocessing (filtering, scaling and smoothing by a sliding window), which we use to improve the quality of target object recognition.

3rd International conference "Information Technology and Nanotechnology 2017". High-Performance Computing / A.I. Makarova, V.V. Sulimova

From a mathematical point of view, the electroencephalograms that need to be analyzed for each of the 66 electrodes fixed on different parts of the head are single-component discrete signals.

Fig. 2. Examples of signals received from two electrodes. The solid lines show the signals corresponding to the target objects, and the dotted lines to the non-target objects.

Obviously, the analysis of electroencephalograms must inevitably be based on comparing the signals representing them. Since the form of the response to a stimulus and the time of its onset can vary, an adapted version of the Dynamic Time Warping (DTW) procedure is used in this work to compare the signals. This algorithm is based on searching for the optimal pairwise alignment of the compared signals by local compression and stretching of their axes. Initially, the DTW method was proposed for comparing speech signals [6], but it was later adapted for many other areas [7,8].

The task of analyzing electroencephalograms within the framework of this work is formulated as a two-class pattern recognition problem, whose solution is carried out in two stages: training and recognition [9]. At the training stage, a decision rule for assigning new signals to the target or non-target class is formed on the basis of processing a certain finite set of signals with known class membership. At the recognition stage, the constructed decision rule is applied to new signals. Recognition can be carried out very quickly, while the process of constructing a good decision rule, which allows new signals to be classified with high accuracy, is very laborious. Modern machine learning methods for constructing decision rules are based on measures of object comparison and make it possible to automatically choose the most suitable ones during training, improving the quality of the solution [10].
The great complexity of training is due to the need for multiple comparisons of long signals when calculating a whole series of matrices of their pairwise dissimilarity (for different electrodes, different types of preprocessing, and different values of the parameters of the comparison algorithm), since the most suitable of these cannot be chosen a priori. Accordingly, increasing the performance of electroencephalogram comparison is an extremely topical task.

There are a number of ways to speed up the Dynamic Time Warping procedure; however, they either do not guarantee finding the optimal solution [6,11-14], or provide a performance improvement only when comparing close or sparse signals [15,16], and such situations are rare in the analysis of electroencephalograms. The use of modern parallel computing technologies is a fundamentally different direction for increasing the performance of signal comparison. However, known parallel implementations of the DTW procedure (as well as of similar tasks with loops having diagonal dependencies) do not give the desired effect because of the need for frequent synchronization of processes or threads, and in some cases may even increase the running time compared to the serial version because of less efficient use of the cache memory [17-21].

In this paper, we propose a parallel algorithm that significantly improves the performance of calculating the matrix of pairwise comparisons for any signals representing fragments of an electroencephalogram. The performance is increased without loss of accuracy by taking the features of the task into account, which allows parallelization to be implemented at a higher level. The proposed algorithm is implemented in the C++ programming language using the MPI technology [22] and tested using the resources of the supercomputer complex of Moscow State University "Lomonosov" [23].
The results of the research on real data showed that the proposed method achieves a near-linear speedup and reduces the total calculation time from 29 days to 3.5 hours using 128 processes, which opens the possibility of improving the quality of automatic analysis of electroencephalograms.

2. Comparison of fragments of electroencephalograms based on DTW

2.1. The mathematical formulation of the problem of comparing two signals
Let x = (x_1, x_2, ..., x_{N_x}) and y = (y_1, y_2, ..., y_{N_y}) be two single-component discrete signals of lengths N_x and N_y, respectively, consisting of elements x_i, y_j ∈ R, i = 1, ..., N_x, j = 1, ..., N_y. A specific pairwise warping w = (w_1, ..., w_{N_w}), w_s = (w_{s,1}, w_{s,2}), of two signals x and y uniquely determines the correspondence of the signals' elements and has a length N_w equal to the number of such pairwise correspondences. The first and last elements of the signals always correspond to each other: w_{1,1} = w_{1,2} = 1, w_{N_w,1} = N_x, w_{N_w,2} = N_y. A warping of two signals is considered optimal if it ensures the minimum value of the following optimality criterion:

J(x, y, w) = Σ_{s=1}^{N_w} (x_{w_{s,1}} − y_{w_{s,2}})² + β·m(w),   (1)

where β ≥ 0 is the penalty for non-parallel correspondences between elements of the signals (local warpings of the axes) and m(w) denotes the number of such non-parallel correspondences in w. The optimal value of the criterion can be considered as a dissimilarity of the signals: r(x, y) = √J(x, y, ŵ), where ŵ is the optimal warping.

2.2. A sequential algorithm for computing the dissimilarity of two signals
The minimum of the optimality criterion (1) can be found by means of the dynamic programming procedure [24].
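Before turning to the minimization, it may help to see criterion (1) evaluated directly for one fixed warping; the following is a toy sketch (the path representation and the function name are ours, not part of the original method description):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Value of criterion (1) for a fixed warping w, given as a sequence of
// 0-based index pairs (w_s1, w_s2). Each non-diagonal (horizontal or
// vertical) step, i.e. a non-parallel correspondence, is penalized by beta.
double criterion(const std::vector<double>& x, const std::vector<double>& y,
                 const std::vector<std::pair<int, int>>& w, double beta) {
    double J = 0.0;
    for (std::size_t s = 0; s < w.size(); ++s) {
        const double d = x[w[s].first] - y[w[s].second];
        J += d * d;                                       // squared difference
        if (s > 0 && (w[s].first == w[s - 1].first ||     // horizontal step
                      w[s].second == w[s - 1].second))    // vertical step
            J += beta;
    }
    return J;
}
```

For example, for x = (1, 2) and y = (1, 2), the purely diagonal warping ((1,1), (2,2)) gives J = 0, while the warping ((1,1), (1,2), (2,2)) gives J = 1 + 2β.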
It is convenient to represent the algorithm for finding the optimal warping in terms of an oriented graph of pairwise correspondences (Figure 3), in which the nodes correspond to comparisons of the signals' elements, the horizontal and vertical edges correspond to local warpings of the axes (passing through them is penalized with the positive penalty β), and the diagonal edges correspond to parallel correspondences between the signals' elements. The algorithm for finding the optimal pairwise warping consists in consecutively passing through all the vertices of the graph, beginning with the upper left and ending with the bottom right vertex. An incomplete value of the optimality criterion J̃_{i,j} is calculated at each vertex on the basis of the initial parts of the signals x_{1..i} = (x_1, ..., x_i) and y_{1..j} = (y_1, ..., y_j):

J̃_{1,1} = (x_1 − y_1)²,
J̃_{i,j} = (x_i − y_j)² + min( J̃_{i−1,j−1}, J̃_{i−1,j} + β, J̃_{i,j−1} + β ),

where out-of-range terms are treated as +∞. The sought value of the dissimilarity of two signals is r(x, y) = √(J̃_{N_x,N_y}).

Fig. 3. A graph of pairwise correspondences of signal samples representing an electroencephalogram.

2.3. Calculation of the matrix of dissimilarity values of electroencephalograms
As already mentioned above, at the stage of training the computer algorithms for data analysis, it is necessary to calculate matrices of dissimilarity values for all pairs of signals representing fragments of electroencephalograms from a certain training set X = {x_1, x_2, ..., x_K}. The matrix of pairwise dissimilarities calculated in accordance with the algorithm given in Section 2.2 is symmetric and contains zero values on the main diagonal, so it is sufficient to calculate only the values belonging to the upper (or lower) triangle (Figure 4), the number of which is K(K − 1)/2.

Fig. 4. Matrix of values of pairwise dissimilarity of signals.
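The sequential algorithm of Section 2.2 can be sketched as follows; this is a minimal illustration (the variable names and the boundary handling via +∞ are ours), not the authors' exact implementation:

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Dissimilarity r(x, y) = sqrt(J[Nx][Ny]) computed by the DTW dynamic
// programming recurrence, with penalty beta for each non-diagonal step.
double dtw_dissimilarity(const std::vector<double>& x,
                         const std::vector<double>& y, double beta) {
    const std::size_t nx = x.size(), ny = y.size();
    const double inf = std::numeric_limits<double>::infinity();
    // J[i][j] holds the incomplete criterion for prefixes x[0..i-1], y[0..j-1];
    // out-of-range cells are represented by +infinity.
    std::vector<std::vector<double>> J(nx + 1, std::vector<double>(ny + 1, inf));
    J[0][0] = 0.0;  // virtual start before both signals
    for (std::size_t i = 1; i <= nx; ++i) {
        for (std::size_t j = 1; j <= ny; ++j) {
            const double d = (x[i - 1] - y[j - 1]) * (x[i - 1] - y[j - 1]);
            double best = J[i - 1][j - 1];                          // diagonal
            if (J[i - 1][j] + beta < best) best = J[i - 1][j] + beta;  // vertical
            if (J[i][j - 1] + beta < best) best = J[i][j - 1] + beta;  // horizontal
            J[i][j] = d + best;
        }
    }
    return std::sqrt(J[nx][ny]);
}
```

For signals of equal length N the procedure takes O(N²) time; memory can be reduced to O(N) by keeping only two rows of J, since each cell depends only on the current and previous rows.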
Since the number of computations depends quadratically on the number of signals being compared, the cost is considerable even for a small training set. In this work, the training set consists of K = 755 fragments of electroencephalograms (for each electrode), so 284635 pairwise comparisons of signals are needed to calculate one matrix. Thus, 56357730 pairwise comparisons are needed to calculate such matrices for all 66 electrodes and for three different values of the penalty for warping the signals' axes. Since one pairwise comparison takes, on average, 0.045 seconds, the calculation of all the matrices takes about 29 days, which makes experiments on real data infeasible and requires special measures to improve computing performance.

3. Parallel comparison of fragments of electroencephalograms

The task of computing several matrices of pairwise dissimilarity of fragments of electroencephalograms has several levels of data parallelism. The following can be calculated independently of each other:
1) the incomplete values of the criterion located on one minor diagonal of the graph of pairwise correspondences;
2) all elements belonging to the upper (or lower) triangle of the matrix of pairwise dissimilarity values (Figure 4);
3) all matrices of pairwise dissimilarity.
Parallelization at the first level requires frequent interaction of processes or threads and incurs synchronization costs, which can also lead to inefficient use of the cache [17-21] and, accordingly, does not provide the desired acceleration of computations. Parallelization at the third level obviously makes sense only if the number of matrices to be calculated is greater than the number of available computing units.
It also seems inappropriate, since in this study we rely on the computational capabilities of the supercomputer complex "Lomonosov" of Moscow State University [23], which consists of more than 5000 nodes and more than 12000 processor cores.

So, in the framework of this paper, parallelization is performed at the second level, i.e. the calculations of the individual elements of a single matrix of pairwise dissimilarities for a set of signals are treated as the parallel tasks. If several matrices need to be calculated, they are computed sequentially.

The MPI technology [22] has been chosen for organizing the parallel computations, as it makes it possible to organize the interaction of processes running on different computing nodes. Since, even for the "Lomonosov" supercomputer, the number of elements of each calculated matrix is much larger than the number of processors, the parallel tasks are aggregated by means of a one-time distribution of the matrix elements among the processes. In this paper, the following scheme is used to distribute M = K(K − 1)/2 elements among P processes: the first M mod P processes receive (M − M mod P)/P + 1 elements each, and the remaining P − M mod P processes receive (M − M mod P)/P elements each, where mod is the modulo operation giving the remainder after division of one integer by another. This scheme distributes the work among the processes as evenly as possible. The one-time even distribution of elements proves to be the most effective in this situation, since all compared signals have the same length and, accordingly, the comparison of any pair of signals takes approximately the same time. As a result, this scheme ensures the most efficient use of the resources of the computing system.
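A minimal sketch of this distribution, together with the mapping from a linear upper-triangle index back to a matrix position (0-based indices; the function names are ours):

```cpp
#include <cassert>
#include <utility>

// Number of elements in the strict upper triangle of a K x K matrix.
long long pair_count(long long K) { return K * (K - 1) / 2; }

// {count, first linear index} for process `rank` when M elements are
// distributed as evenly as possible among P processes: the first
// M mod P processes receive floor(M/P) + 1 elements, the rest floor(M/P).
std::pair<long long, long long> share(long long M, long long P, long long rank) {
    const long long base = M / P, extra = M % P;
    const long long count = base + (rank < extra ? 1 : 0);
    const long long start = rank * base + (rank < extra ? rank : extra);
    return {count, start};
}

// Map a linear index t enumerating the strict upper triangle row by row
// to its 0-based (row, column) position in the K x K matrix.
std::pair<long long, long long> cell(long long t, long long K) {
    long long row = 0, row_len = K - 1;  // row 0 holds K - 1 elements
    while (t >= row_len) { t -= row_len; ++row; --row_len; }
    return {row, row + 1 + t};
}
```

For M = 284635 and P = 128, for instance, this scheme gives the first 91 processes 2224 elements each and the remaining 37 processes 2223 each.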
A graphical representation of the distribution of data among processes is shown in Fig. 5.

Fig. 5. Scheme of data distribution for three processes.

According to the proposed scheme of parallel computing, each process, independently of the others:
1) reads the original signals representing fragments of electroencephalograms from the input file;
2) determines the number of matrix elements that it must process and the linear index of the first element belonging to its range, according to the above-mentioned principle of even distribution of elements among processes;
3) calculates, from the found linear index, the row and column numbers determining the position of the given element in the matrix;
4) sequentially traverses the consecutive elements of the matrix belonging to the upper triangle, starting from the element determined in step 2; for each element of its range it compares the corresponding signals (according to the sequential algorithm described in Section 2.2) and stores the calculated value in its copy of the dissimilarity matrix.
Calculations continue until the number of elements determined in step 2 has been processed. After each process has computed its elements of the dissimilarity matrix, all the results are merged on the process with number 0 using the MPI_Reduce function.

4. Experimental study

Brain electroencephalograms obtained in the course of the study described in [5] were used to investigate the proposed algorithm. In total, it is required to calculate three matrices (for different values of the penalty) of pairwise dissimilarities of 755 signals for each of the 66 electrodes, i.e. 198 matrices in total. However, the calculation of each matrix takes approximately the same time, so it is sufficient to perform testing on one of them. The performance study was carried out using the resources of the MSU supercomputer complex "Lomonosov" [23].
Testing was conducted for different matrix sizes and different numbers of processes in order to identify possible behavioral features of the proposed algorithm. The time measurements for the calculation of one matrix of pairwise signal dissimilarities are given in Table 1. The acceleration obtained in each case was calculated as the ratio of the sequential computation time to the calculation time on P processes. The results are shown in Table 2.

Table 1. Time of calculating the matrices of pairwise comparison of signals for different matrix sizes and different numbers of processes.

Number of    Running time of the algorithm (sec), by number of processes
signals          1          2          4          8         16         32         64        128
10           2.14918    1.12535    0.701826   0.48118    0.21083    0.191736   0.205418   0.353755
20           9.04609    4.61894    2.38828    1.15214    0.619012   0.381232   0.297953   0.392734
40           37.1698    18.7533    9.42092    4.82778    2.34974    1.26236    0.822015   0.615929
100          232.34     117.278    58.3369    29.1305    14.696     7.72897    4.09282    2.46599
200          943.653    472.307    238.218    117.96     59.0047    30.2108    15.3994    8.34048
300          2124.3     1083.46    541.713    268.855    133.068    67.1879    34.5807    17.4532
500          5912.42    3010.1     1498.6     736.2      371.859    185.836    93.3412    49.3313
755          13675.7    6864.46    3418.42    1700.52    847.277    422.34     212.281    106.928

Table 2. Acceleration obtained in calculating the matrices of pairwise comparison of signals for different matrix sizes and different numbers of processes.

Number of    Acceleration of the algorithm, by number of processes
signals          1          2          4          8         16         32         64        128
10           1          1.909788   3.062269   4.466478   10.1939    11.20906   10.46247   6.075335
20           1          1.958477   3.787701   7.851554   14.61376   23.72857   30.3608    23.03363
40           1          1.98204    3.945453   7.69915    15.81869   29.44469   45.21791   60.34754
100          1          1.981105   3.982728   7.975833   15.80974   30.06093   56.76771   94.21774
200          1          1.997965   3.9613     7.999771   15.99284   31.23562   61.27856   113.1413
300          1          1.960663   3.921449   7.901285   15.96402   31.6173    61.43022   121.7141
500          1          1.964194   3.945296   8.030997   15.89963   31.81526   63.34202   119.8513
755          1          1.992247   4.000591   8.042069   16.14077   32.38078   64.42263   127.8963

Figure 6 shows graphs illustrating the data from Tables 1 and 2.

Fig. 6. Dependence of a) the running time and b) the acceleration of the algorithm on the number of processes.

As expected, the acceleration achieved by the proposed parallel algorithm increases with the size of the calculated matrix of signal dissimilarities, and for the full matrix the acceleration is almost linear. Thus, the proposed approach, launched on 128 processes of the "Lomonosov" supercomputer complex, reduced the time of calculating one complete matrix of dissimilarities of electroencephalogram fragments by a factor of 127.89. The calculation time for one matrix was reduced from 3.8 hours to 1.78 minutes, and the total calculation time for all 198 matrices was reduced from 29 days to 3.5 hours.

5. Conclusion

In this paper, a high-performance algorithm is proposed for calculating the matrix of dissimilarities of signals representing fragments of brain electroencephalograms. The proposed algorithm was implemented in the C++ programming language using the MPI parallel programming technology.
It was tested using the resources of the "Lomonosov" supercomputer complex at Moscow State University [23]. Experimental studies on real data have shown that the proposed algorithm achieves an almost linear speedup and reduces the total calculation time from 29 days to 3.5 hours using 128 processes, opening the possibility of improving the quality of automatic analysis of electroencephalograms.

Acknowledgments

This work is supported by the Russian Foundation for Basic Research, grant 15-07-08967. The authors would like to thank the rector of Lomonosov Moscow State University, Viktor Sadovnichiy, and the Moscow State University Supercomputing Center for providing the "Lomonosov" supercomputer complex to perform the experimental study.

References

[1] Zenkov LR, Zenkov KS. Clinical electroencephalography (with elements of epileptology). 3rd ed. Moscow: MEDPRESS-INFORM, 2004; 368 p. (in Russian)
[2] Teplan M. Fundamentals of EEG measurement. Measurement Science Review 2002; 2(2): 1–11.
[3] Wolpaw JR, McFarland DJ, Neat GW, Forneris CA. An EEG-based brain-computer interface for cursor control. Electroencephalography & Clinical Neurophysiology 1991; 78(3): 252–259.
[4] Tran L. EEG Features for the Detection of Event-Related Potentials Evoked Using Rapid Serial Visual Presentation. PhD Thesis, 2014; 63 p.
[5] Hope C, Sterr A, Langovan PE, Geades N, Windridge D, Young K, Wells K. High Throughput Screening for Mammography using a Human-Computer Interface with Rapid Serial Visual Presentation (RSVP). Proc. SPIE 8673, Medical Imaging: Image Perception, Observer Performance, and Technology Assessment 2013; 867303. DOI:10.1117/12.2007557
[6] Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing 1978; 26(1): 43–49.
[7] Berndt DJ, Clifford J. Using dynamic time warping to find patterns in time series. AAAI Workshop on Knowledge Discovery in Databases 1994: 229–248.
[8] Keogh E, Pazzani M. Scaling up dynamic time warping for data mining applications. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, USA 2000: 285–289.
[9] Vapnik V. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
[10] Tatarchuk A, Sulimova V, Mottl V, Windridge D. Supervised Selective Kernel Fusion for Membrane Protein Prediction. Lecture Notes in Computer Science 2014; 8626: 98–109.
[11] Myers C, Rabiner LR, Rosenberg AE. Performance tradeoffs in dynamic time warping algorithms for isolated word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing 1980; 28(6): 623–635.
[12] Keogh E, Ratanamahatana C. Exact indexing of dynamic time warping. Knowledge and Information Systems 2004; 7(3): 358–386.
[13] Lemire D. Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recognition 2009; 42(9): 2169–2180.
[14] Salvador S, Chan P. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis 2007; 11(5): 561–580.
[15] Al-Naymat G, Chawla S, Taheri J. SparseDTW: A Novel Approach to Speed up Dynamic Time Warping, 2012.
[16] Silva DF. Speeding up all-pairwise dynamic time warping matrix calculation, 2016. URL: http://sites.labic.icmc.usp.br/prunedDTW
[17] Lamport L. The parallel execution of DO loops. Communications of the ACM 1974; 17(2): 83–93.
[18] Babichev AV, Lebedev VG. Parallelization of program cycles. Programming 1983; 5: 52–63. (in Russian)
[19] Fernandez A, Llaberia JM, Valero-Garcia M. Loop Transformation Using Nonunimodal Matrices. IEEE Transactions on Parallel and Distributed Systems 1995; 6(8): 832–840.
[20] Abu Khalil JM, Morylev RI, Shteinberg BI. Parallel Algorithm of Global Alignment with Optimal Memory Usage.
Modern Problems of Science and Education 2013; 1. URL: http://www.science-education.ru/107-8139 (in Russian)
[21] Steinberg BI. Optimizing the use of the memory cache in computational tasks and optimizing compilation. All-Russian Scientific Conference on Informatics Problems SPISOK-2013, Faculty of Mathematics and Mechanics, St. Petersburg State University, St. Petersburg, 2013. (in Russian)
[22] Antonov AS. Parallel Programming Using the MPI Technology: a tutorial. Moscow: MGU, 2004; 71 p. (in Russian)
[23] Voevodin VlV, Zhumatiy SA, Sobolev SI, Antonov AS, Bryzgalov PA, Nikitenko DA, Stefanov KS, Voevodin VadV. Practice of the 'Lomonosov' Supercomputer. Open Systems. Moscow: "Open Systems" Publishing House 2012; 7: 36–39. (in Russian)
[24] Bellman R, Kalaba RM. Dynamic Programming and Modern Control Theory. Nauka, 1969; 118 p. (in Russian)