<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Partition Around Medoids Clustering on the Intel Xeon Phi Many-Core Coprocessor</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>South Ural State University</institution>
          ,
          <addr-line>Chelyabinsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>29</fpage>
      <lpage>41</lpage>
      <abstract>
<p>The paper addresses the implementation of the Partition Around Medoids (PAM) clustering algorithm for the Intel Many Integrated Core architecture. PAM is a form of the well-known k-Medoids clustering algorithm and is applied in various subject domains, e.g. bioinformatics, text analysis, intelligent transportation systems, etc. An optimized version of PAM for the Intel Xeon Phi coprocessor is introduced, where the OpenMP parallelizing technology, loop vectorization, the tiling technique and efficient distance matrix computation for the Euclidean metric are used. Experimental results for different data sets confirm the efficiency of the proposed algorithm.</p>
      </abstract>
      <kwd-group>
        <kwd>data mining</kwd>
        <kwd>clustering</kwd>
        <kwd>k-Medoids</kwd>
<kwd>Partition Around Medoids</kwd>
        <kwd>Intel Many Integrated Core architecture</kwd>
<kwd>Intel Xeon Phi coprocessor</kwd>
        <kwd>parallel computing</kwd>
        <kwd>tiling</kwd>
        <kwd>vectorization</kwd>
        <kwd>OpenMP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Clustering is one of the basic problems of data mining, aimed at organizing a set
of data objects into subsets (clusters) such that objects in a cluster are similar to
one another, yet dissimilar to objects in other clusters. Similarity is commonly
defined in terms of how close the objects are and is based on a specified distance
metric.</p>
<p>The most fundamental method of clustering is partitioning, which organizes
the objects of a set into several exclusive groups. More formally, given a set of
n objects, a partitioning algorithm constructs k partitions of the data, where
each partition represents a cluster and k ≤ n. The algorithm divides the data
objects into k clusters. An object is assigned to the closest cluster based on the
distance measure between the object and the cluster center. The algorithm then
iteratively improves the within-cluster variation by computing a new cluster
center using the objects assigned to the cluster in the previous iteration. After
this the cluster centers are updated and all the objects are reassigned using the
new cluster centers. The iterations continue until the assignment is stable, that
is, the clusters formed in the current round are the same as those formed in the
previous round.</p>
<p>This work was financially supported by the Ministry of Education and Science
of the Russian Federation ("Research and development on priority directions of
scientific-technological complex of Russia for 2014-2020" Federal Program, contract
No. 14.574.21.0035).</p>
<p>
        Partitioning clustering algorithms differ in the way cluster
centers are calculated, e.g. the k-Means [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and k-Modes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] algorithms use the mean and mode values
of the clustered objects respectively, whereas the k-Medoids algorithm uses an object of
the clustered data set (called a medoid).
      </p>
<p>
        The Partition Around Medoids (PAM) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] algorithm is a variation of k-Medoids, which
is used in a wide spectrum of applications, e.g. text analysis, bioinformatics,
intelligent transport systems, etc. The complexity of each iteration of the PAM
algorithm is O(k(n − k)^2). For large values of n and k the computations are very
costly. That is why there are approaches to speeding up the k-Means and PAM
algorithms by means of GPUs [
        <xref ref-type="bibr" rid="ref10 ref3">3, 10</xref>
        ]. At the same time there are none for modern
accelerators based on the Intel Many Integrated Core (MIC) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] architecture. In
spite of many recent developments for many-core platforms in data mining [
        <xref ref-type="bibr" rid="ref13 ref15 ref23">13,
15, 23</xref>
        ] and databases [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ] there are few for the Intel MIC architecture.
      </p>
<p>In this paper we present a parallel version of PAM for MIC accelerators. The
remaining part of the paper is organized as follows. Section 2 gives an overview
of the serial PAM algorithm and discusses related work. In Section 3 we describe
the parallelization of PAM adapted for the Intel MIC architecture. The results of
the experiments evaluating the algorithm are presented in Section 4. Section 5
contains a summary and directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>Background of the Research</title>
<p>
        Serial PAM Algorithm.
To provide a formal description of the PAM [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] algorithm we will use the following
notation. Let O = {o1, o2, ..., on} be a set of objects to be clustered, where each
object is a tuple consisting of p real-valued attributes. Let k be the number
of clusters, k ≤ n, let C = {c1, c2, ..., ck} be a set of medoids, C ⊆ O, and let
ρ : O × C → R be a distance metric.
      </p>
<p>The algorithm takes the form of a steepest ascent hill climber, using a simple
swap neighbourhood operation. In each iteration a medoid object ci and a
non-medoid object oj are selected that produce the best clustering when their roles
are switched. The objective function used is the sum of the distances from each
object to the closest medoid:</p>
<p>E = Σ_{j=1}^{n} min_{1≤i≤k} ρ(ci, oj).   (1)</p>
<p>Algorithm 1 depicts the PAM pseudocode. PAM consists of two phases, namely
BUILD and SWAP. In the first phase an initial clustering is obtained by the
successive selection of representative objects until k objects have been found.</p>
      <sec id="sec-2-1">
<title>Algorithm 1: PAM</title>
<p>Input: set of objects O, number of clusters k
Output: set of k clusters
1 Init C;
2 repeat
3   Calculate Tmin;
4   Swap cmin and omin;
5 until Tmin ≥ 0;
The first object c1 is the one for which the sum of the distances to all other
objects is as small as possible:</p>
<p>Object c1 is the most centrally located object in the set O. Subsequently, at each step
another object is selected which decreases the objective function as much as
possible: for every remaining object, the distance to the candidate is compared with the
minimal distance to the already selected medoids, and the candidate yielding the smallest
resulting sum is chosen.
This process is continued until k objects have been found.</p>
<p>In the second phase of the algorithm, it is attempted to improve C (i.e. the
set of medoids) and therefore also to improve the clustering yielded by this
set. The algorithm searches for a pair of objects (cmin, omin) which minimizes the
objective function. This is done by considering all pairs of objects (ci, oh) where
ci is a medoid and oh is not a medoid. It is determined what effect is obtained
on the objective function when a swap is carried out, i.e., when object ci is no
longer selected as a medoid but object oh is. Let us denote this effect by Tih; the
minimum value Tmin is achieved with the pair (cmin, omin). If Tmin ≥ 0 then the C
set can not be improved, so the algorithm stops.</p>
<p>Let us consider the calculation of the effect Tih using the following notation. Let
D = {d1, d2, ..., dn} be the set of distances from each object to the closest medoid.
Let S = {s1, s2, ..., sn} be the set of distances from each object to the second closest
medoid. Let Cjih be the contribution of a non-selected object oj to the effect Tih of
a swap between ci and oh on the objective function. In this case Tih is the sum
of the contributions Cjih:</p>
<p>
          Tih = Σ_{j=1}^{n} Cjih.   (6)
Algorithm 2 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] depicts the pseudocode for calculating Cjih.
        </p>
<p>Input: oj, ci, oh, dj, sj
Output: Cjih
1 if ρ(oj, ci) &gt; dj and ρ(oj, oh) &gt; dj then
2   Cjih ← 0
3 else if ρ(oj, ci) = dj then
4   if ρ(oj, oh) &lt; sj then
5     Cjih ← ρ(oj, oh) − dj
6   else
7     Cjih ← sj − dj
8   end
9 else if ρ(oj, oh) &lt; dj then
10  Cjih ← ρ(oj, oh) − dj
11 end</p>
<p>
          A significant amount of work has been done in the area of cluster analysis. The
classical k-Means and k-Medoids algorithms were suggested in [
          <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
          ]. The original
PAM algorithm was proposed in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
<p>
          The research devoted to accelerating clustering algorithms using parallel
hardware includes the following. In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] FPGA and GPU implementations of
k-Means are compared. The authors of [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] describe improvements of k-Means
reducing data transfers between the CPU and GPU. In [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] a technique improving
data distribution among GPU threads in k-Means is suggested. A k-Means
implementation for the Hadoop framework with GPUs is described in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. In [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] several
clustering methods, including k-Medoids, are implemented on GPU. A GPU-based
framework for clustering genetic data using k-Medoids is described in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
<p>
          In our opinion, the potential of the Intel MIC accelerators for
cluster analysis is currently underestimated. Paper [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] proposes a modification of the DBSCAN
density-based clustering algorithm for the Intel MIC architecture. In [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] a
version of k-Means for a heterogeneous CPU and Intel MIC architecture is presented,
where the authors used vectorization and a sophisticated layout scheme to improve
data locality. The contribution of this paper is a technique for accelerating
the Partitioning Around Medoids clustering algorithm with the Intel Xeon Phi
many-core coprocessor.
        </p>
      </sec>
    </sec>
<sec id="sec-4">
      <title>Parallel PAM Algorithm for MIC Accelerators</title>
<p>
        In this section we describe an approach to the implementation of the PAM algorithm for
the Intel Xeon Phi coprocessor [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The Intel Xeon Phi coprocessor is an x86-based
SMP-on-a-chip with over fifty cores. It supports 4 hardware threads per
core and contains a 512-bit wide vector processing unit (VPU). Each core has two
levels of cache memory: a 32 KB L1 data cache, a 32 KB L1 instruction cache,
and a core-private 512 KB unified L2 cache. The Intel Xeon Phi coprocessor
is connected to other devices via the PCIe bus. The Intel Xeon Phi coprocessor is
based on the Intel x86 architecture and supports the same programming tools and
models as a regular Intel Xeon processor. Our approach is based on the following
principles.
      </p>
<p>Data parallelism and vectorization. Using the OpenMP technology we perform
simultaneous execution on multiple cores of the same function across the elements
of a dataset. Most loops of the original PAM algorithm with arithmetic
operations were implemented so that such operations are converted from scalar
form to vector form and thus computed effectively by the coprocessor's VPUs.</p>
<p>Our implementation strives to provide data locality as much as possible, i.e.
the program uses data close to recently accessed locations. Since the coprocessor
loads a chunk of memory around an accessed location into the cache, locations
close to recently accessed locations are also likely to be in the cache, which ultimately
increases the algorithm's performance.</p>
<p>Algorithm 3 depicts the PAM pseudocode adapted for use on the Intel Xeon Phi
many-core coprocessor.</p>
      <sec id="sec-4-1">
<title>Algorithm 3: Parallel PAM</title>
<p>Input: set of objects O, number of clusters k
Output: set of k clusters
1 Offload O, k from CPU to coprocessor;
2 M ← PrepareDistanceMatrix(O);
3 C ← BuildMedoids(M);               /* BUILD phase */
4 repeat                             /* SWAP phase */
5   Tmin ← FindBestSwap(M, C);
6   Swap cmin and omin;
7 until Tmin ≥ 0;
8 Offload C from coprocessor to CPU;
The summary of the parallel PAM subalgorithms is presented in Tab. 1.</p>
<p>To improve performance we use a precomputation technique: distances between
all objects of the set O are calculated in advance. There is no need for
repeated calculation of distances at each iteration, since distances can simply be
looked up in the matrix M.</p>
<p>
          The PAM algorithm deals with many data arrays which do not fit into the Intel
Xeon Phi L2 cache. We process data in chunks of L bytes to satisfy the data
locality requirement. It is recommended [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] to set L to 16, to try multiplying
or dividing it by 2, and to use n divisible by L. In our work we use L = 32.
        </p>
<p>
          The PrepareDistanceMatrix subalgorithm initializes the distance matrix (see
Algorithm 4). Unlike [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we store the matrix in full form (not in upper triangular
form) to provide better data locality for the rest of the subalgorithms. To achieve
better performance of this subalgorithm we use the tiling technique [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
<title>Algorithm 4: PrepareDistanceMatrix</title>
<p>Input: set of objects O
Output: distance matrix M
1 parallel for oi such that 1 ≤ i ≤ n do
2   for j = 1 to n step L do
3     for k = 1 to p do
4       for l such that j ≤ l &lt; j + L do   /* vectorized; access to ol is tiled */
5         mil ← mil + (oi[k] − ol[k])^2;
6       end
7     end
8     for l such that j ≤ l &lt; j + L do     /* vectorized */
9       mil ← √mil;
10    end
11  end
12 endfor</p>
        <p>Tiling is a technique for improving data reuse in cache architectures. Cache
architectures generally employ least recently used (LRU) methods to determine
which data is evicted from the cache as new data is requested. Therefore, the
longer data remains unused, the more likely it will be evicted from the cache
and no longer available immediately when needed. Tiling the access pattern can
exploit data that remains in the cache from recent, previous iterations.</p>
<p>The BuildMedoids subalgorithm implements the BUILD phase (see Alg. 5)
according to formulas (2)-(5). The FindBestSwap subalgorithm implements the SWAP
phase (see Alg. 6). It checks all pairs of objects (ci, oh) where ci is a medoid and
oh is not a medoid, calculates the effect Tih of each swap, and returns the
minimal one.</p>
      </sec>
      <sec id="sec-4-3">
<title>Algorithm 5: BuildMedoids</title>
<p>Input: distance matrix M
Output: set of medoids C
1 parallel for i = 1 to n do
2   if Σ_{j=1}^{n} mij is minimal then
3     c1 ← oi;
4   end
5 endfor
6 Init D with distances to the nearest medoid;
7 for l = 2 to k do
8   parallel for i = 1 to n do
9     if Σ_{j=1}^{n} min(dj, mij) is minimal then
10      cl ← oi;
11    end
12  endfor
13  Update D;
14 end</p>
<p>Algorithm 6 (FindBestSwap) considers each pair (ci, oh) such that 1 ≤ h ≤ n and oh is not a medoid.</p>
<p>The SWAP phase executes many logical operations (see Alg. 2). For this reason
two versions of the PAM algorithm were implemented. PAM-1 executes more
logical operations with less temporary data. PAM-2 executes fewer logical
operations but stores more temporary data. The matrix preparation function and the BUILD
phase are the same in both versions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Evaluation</title>
<p>To evaluate the developed algorithm we performed experiments on the hardware
specified in Tab. 2. The experiments were performed on single precision data; the
coprocessor was used in offload mode. We measured the PAM runtime while varying the
number of clustered objects and investigated the influence of dataset properties
on the runtime of the PAM subalgorithms.</p>
<p>Experimental results for the FCS Human dataset are presented in Fig. 7(a). The FCS
Human dataset has a large dimension, so most of the time is taken by the calculation of
the distance matrix. Calculation of the distance matrix on the Intel Xeon Phi is two
times faster than on the Intel Xeon. There is no significant difference between
PAM-1 and PAM-2 for this dataset.</p>
<p>Experimental results for the Corel Image Histogram dataset are presented in
Fig. 7(b). The data dimension is small, so preparing the distance matrix does not require
much time. PAM-1 shows similar performance on both the CPU and the Intel Xeon Phi.
The PAM-2 algorithm is two times slower on the Intel Xeon than on the Intel
Xeon Phi.</p>
<p>Experimental results for the MixSim dataset are presented in Fig. 7(c). Again
PAM-1 shows similar performance on both the CPU and the Intel Xeon Phi. PAM-2
on the Intel Xeon Phi shows the best result on this dataset.</p>
<p>Fig. 7. PAM runtime: (a) FCS Human dataset, (b) Corel Image Histogram dataset, (c) MixSim dataset, (d) Letter Recognition dataset.</p>
<p>Experimental results for the Letter Recognition dataset are presented in
Fig. 7(d). PAM-1 shows the best result on the Intel Xeon. Both PAM-1 and
PAM-2 show similar results on the Intel Xeon Phi.</p>
<p>Intuitively, PAM-2 is the better implementation for the Intel Xeon Phi. This
suggestion is confirmed by the experiments. In all tests PAM-2 is twice as fast on the
Intel Xeon Phi as on the Intel Xeon. At the same time, PAM-1 is the best on the
Intel Xeon only once; in the other tests there is no significant difference.</p>
<p>To investigate this further we made more experiments to see the contribution
of every PAM subalgorithm (Fig. 8). Figures 8(a) and 8(c) show the time of matrix
calculation and the BUILD phase. The Intel Xeon Phi outperforms the Intel Xeon
in both subalgorithms. Figures 8(b) and 8(d) show the average time of one iteration
of the SWAP phase. In these figures we can see that the Intel Xeon Phi performance
degrades faster than that of the Intel Xeon. The PAM-2 implementation loses to PAM-1 in
the SWAP phase for big datasets, so we need to continue improving PAM-2 for
the Intel Xeon Phi.</p>
<p>Fig. 8. (a) MixSim: BUILD phase and prepare distance matrix timings; (b) MixSim: PAM-1 and PAM-2 iteration timings; (c) Letter Recognition: BUILD phase and prepare distance matrix timings; (d) Letter Recognition: PAM-1 and PAM-2 iteration timings.</p>
<p>The experiments show that PAM performance depends on the nature of the clustered data.
The most expensive part for large-dimension data is the calculation of the distance
matrix. In the case of small-dimension data the rest of the PAM subalgorithms take a
significantly larger part of the runtime than the distance matrix calculation. The BUILD
phase is more effective on the Intel Xeon Phi. The SWAP phase performs better on
the Intel Xeon. PAM execution on the Letter Recognition dataset requires more
iterations than the MixSim experiment. For this reason PAM-1 shows the best result with
Letter Recognition and PAM-2 shows the best result with MixSim.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
<p>The paper has described a parallel version of the Partitioning Around Medoids
clustering algorithm for the Intel Xeon Phi many-core coprocessor. An optimized
version of PAM for the Intel Xeon Phi coprocessor is introduced, where the OpenMP
parallelizing technology, loop vectorization, the tiling technique and efficient
distance matrix computation for the Euclidean metric are used. The algorithm stores data
in contiguous arrays and processes data in chunks to achieve data locality for
better performance.</p>
<p>The experimental results show the effectiveness of the suggested approach. The
experiments show that PAM performance depends on the nature of the clustered data. The most
expensive part for large-dimension data is the calculation of the distance matrix. In the case of
small-dimension data the rest of the PAM subalgorithms take a significantly larger
part of the runtime than the distance matrix calculation. The BUILD phase is more effective
on the Intel Xeon Phi. The SWAP phase performs better on the Intel Xeon. PAM-1
shows the best result with the Letter Recognition dataset and PAM-2 shows the best result
with MixSim.</p>
<p>As future work we plan to extend our research in the following directions: to
implement our algorithm for the cases of several coprocessors and of a cluster system
based on nodes equipped with Intel Xeon Phi coprocessor(s).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Besedin</surname>
            ,
            <given-names>K.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kostenetskiy</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prikazchikov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          :
<article-title>Increasing efficiency of data transfer between main memory and Intel Xeon Phi coprocessor or NVIDIA GPUs with data compression</article-title>
          . In: Malyshkin,
          <string-name>
            <surname>V</surname>
          </string-name>
          . (ed.)
          <source>Parallel Computing Technologies - 13th International Conference, PaCT</source>
          <year>2015</year>
          , Petrozavodsk, Russia,
          <source>August 31 - September 4</source>
          ,
          <year>2015</year>
          , Proceedings. LNCS, vol.
          <volume>9251</volume>
          , pp.
          <volume>319</volume>
          {
          <fpage>323</fpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Engreitz</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jr.</surname>
            ,
            <given-names>B.J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marshall</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Altman</surname>
          </string-name>
          , R.B.:
          <article-title>Independent component analysis: Mining microarray data for fundamental human gene expression modules</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>43</volume>
          (
          <issue>6</issue>
          ),
          <volume>932</volume>
          {
          <fpage>944</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Espenshade</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pangborn</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>von Laszewski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cavenaugh</surname>
            ,
            <given-names>J.S.:</given-names>
          </string-name>
<article-title>Accelerating partitional algorithms for flow cytometry on GPUs</article-title>
          .
          <source>In: IEEE International Symposium on Parallel and Distributed Processing with Applications</source>
          ,
          <source>ISPA</source>
          <year>2009</year>
          , Chengdu, Sichuan, China,
          <fpage>10</fpage>
          -
          <lpage>12</lpage>
          August
          <year>2009</year>
          . pp.
          <volume>226</volume>
          {
          <fpage>233</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Frey</surname>
            ,
            <given-names>P.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slate</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          :
<article-title>Letter recognition using Holland-style adaptive classifiers</article-title>
          .
          <source>Machine Learning</source>
          <volume>6</volume>
          ,
          <volume>161</volume>
          {
          <fpage>182</fpage>
          (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Extensions to the k-means algorithm for clustering large data sets with categorical values</article-title>
          .
          <source>Data Min. Knowl. Discov</source>
          .
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <volume>283</volume>
          {
          <fpage>304</fpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hussain</surname>
            ,
            <given-names>H.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benkrid</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ebrahim</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erdogan</surname>
            ,
            <given-names>A.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seker</surname>
          </string-name>
          , H.:
<article-title>Novel dynamic partial reconfiguration implementation of k-means clustering on FPGAs: comparative results with GPPs and GPUs</article-title>
          .
          <source>Int. J. Reconfig. Comp</source>
          .
          <year>2012</year>
          ,
          <volume>135926</volume>
          :1{
          <fpage>135926</fpage>
          :
          <fpage>15</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ivanova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sokolinsky</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Decomposition of natural join based on domain-interval fragmented column indices</article-title>
          .
          <source>In: Information and Communication Technology, Electronics and Microelectronics (MIPRO)</source>
          ,
          <year>2015</year>
          38th International Convention on. pp.
          <volume>210</volume>
          {
          <issue>213</issue>
          (May
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
<string-name>
            <surname>Jeffers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reinders</surname>
          </string-name>
          , J.:
          <article-title>Intel Xeon Phi Coprocessor High-Performance Programming</article-title>
          . Morgan Kaufmann,
          <volume>1</volume>
          <fpage>edn</fpage>
          . (3
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kaufman</surname>
          </string-name>
          , L.,
          <string-name>
            <surname>Rousseeuw</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          :
          <article-title>Finding Groups in Data: An Introduction to Cluster Analysis</article-title>
          . John Wiley (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kohlho</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sosnick</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
          </string-name>
          , W.T.,
          <string-name>
            <surname>Pande</surname>
            ,
            <given-names>V.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Altman</surname>
          </string-name>
          , R.B.:
          <article-title>CAMPAIGN: an open-source library of gpu-accelerated data clustering algorithms</article-title>
          .
          <source>Bioinformatics</source>
          <volume>27</volume>
          (
          <issue>16</issue>
          ),
          <volume>2321</volume>
          {
          <fpage>2322</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lloyd</surname>
            ,
            <given-names>S.P.</given-names>
          </string-name>
          :
          <article-title>Least squares quantization in PCM</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <volume>28</volume>
          (
          <issue>2</issue>
          ),
          <volume>129</volume>
          {
          <fpage>136</fpage>
          (
          <year>1982</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Melnykov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maitra</surname>,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>MixSim: An R package for simulating data to study performance of clustering algorithms</article-title>
          .
          <source>Journal of Statistical Software</source>
          <volume>51</volume>
          (
          <issue>12</issue>
          ) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Movchan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zymbler</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          :
          <article-title>Time series subsequence similarity search under dynamic time warping distance on the Intel many-core accelerators</article-title>
          . In:
          <string-name>
            <surname>Amato</surname>,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Connor</surname>,
            <given-names>R.C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gennaro</surname>,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>Similarity Search and Applications - 8th International Conference, SISAP 2015</source>
          , Glasgow, UK, October 12-14, 2015, Proceedings. LNCS, vol.
          <volume>9371</volume>
          , pp.
          <fpage>295</fpage>
          -
          <lpage>306</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ortega</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rui</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porkaew</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehrotra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>T.S.</given-names>
          </string-name>
          :
          <article-title>Supporting ranked boolean similarity queries in MARS</article-title>
          .
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>10</volume>
          (
          <issue>6</issue>
          ),
          <fpage>905</fpage>
          -
          <lpage>925</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zymbler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A parallel algorithm for market basket analysis on the cell processor</article-title>
          .
          <source>Bulletin of the South Ural State University. Series: Mathematical Modelling, Programming and Computer Software</source>
          .
          <volume>5</volume>
          ,
          <fpage>48</fpage>
          -
          <lpage>57</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Patwary</surname>
            ,
            <given-names>M.M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satish</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sundaram</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manne</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Habib</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Pardicle: Parallel approximate density-based clustering</article-title>
          . In:
          <string-name>
            <surname>Damkroger</surname>,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dongarra</surname>,
            <given-names>J.</given-names>
          </string-name>
          (eds.)
          <source>International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014</source>
          , New Orleans, LA, USA, November 16-21, 2014. pp.
          <fpage>560</fpage>
          -
          <lpage>571</lpage>
          . IEEE (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rechkalov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zymbler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Accelerating medoids-based clustering with the Intel Many Integrated Core architecture</article-title>
          . In:
          <source>Application of Information and Communication Technologies - 9th International Conference, AICT 2015</source>
          , Rostov-on-Don, Russia, October 14-16, 2015. Proceedings. pp.
          <fpage>413</fpage>
          -
          <lpage>417</lpage>
          . IEEE (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richards</surname>,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de la Iglesia</surname>,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rayward-Smith</surname>,
            <given-names>V.J.</given-names>
          </string-name>
          :
          <article-title>Clustering rules: A comparison of partitioning and hierarchical clustering algorithms</article-title>
          .
          <source>J. Math. Model. Algorithms</source>
          <volume>5</volume>
          (
          <issue>4</issue>
          ),
          <fpage>475</fpage>
          -
          <lpage>504</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A vectorized k-means algorithm for Intel Many Integrated Core architecture</article-title>
          . In:
          <string-name>
            <surname>Wu</surname>,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>,
            <given-names>A.</given-names>
          </string-name>
          (eds.)
          <source>Advanced Parallel Processing Technologies - 10th International Symposium, APPT 2013</source>
          , Stockholm, Sweden, August 27-28, 2013, Revised Selected Papers. LNCS, vol.
          <volume>8299</volume>
          , pp.
          <fpage>277</fpage>
          -
          <lpage>294</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leung</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>GPU accelerated spherical k-means training</article-title>
          . In:
          <string-name>
            <surname>Lee</surname>,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirose</surname>,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hou</surname>,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kil</surname>,
            <given-names>R.M.</given-names>
          </string-name>
          (eds.)
          <source>Neural Information Processing - 20th International Conference, ICONIP 2013</source>
          , Daegu, Korea, November 3-7, 2013. Proceedings, Part II. LNCS, vol.
          <volume>8227</volume>
          , pp.
          <fpage>392</fpage>
          -
          <lpage>399</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>DVT-PKM: an improved GPU based parallel k-means algorithm</article-title>
          . In:
          <string-name>
            <surname>Huang</surname>,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jo</surname>,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <source>Intelligent Computing Methodologies - 10th International Conference, ICIC 2014, Taiyuan, China, August 3-6</source>
          ,
          <year>2014</year>
          . Proceedings. LNCS, vol.
          <volume>8589</volume>
          , pp.
          <fpage>591</fpage>
          -
          <lpage>601</lpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Accelerate k-means algorithm by using GPU in the hadoop framework</article-title>
          . In:
          <string-name>
            <surname>Chen</surname>,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balke</surname>,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>,
            <given-names>T.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hwang</surname>,
            <given-names>E.</given-names>
          </string-name>
          (eds.)
          <source>Web-Age Information Management - WAIM 2014 International Workshops: BigEM, HardBD, DaNoS, HRSUNE, BIDASYS</source>
          , Macau, China, June 16-18, 2014, Revised Selected Papers. LNCS, vol.
          <volume>8597</volume>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>186</lpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Zymbler</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          :
          <article-title>Best-match time series subsequence search on the Intel Many Integrated Core architecture</article-title>
          . In:
          <string-name>
            <surname>Morzy</surname>,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valduriez</surname>,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellatreche</surname>,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <source>Advances in Databases and Information Systems - 19th East European Conference, ADBIS 2015</source>
          , Poitiers, France, September 8-11, 2015, Proceedings. LNCS, vol.
          <volume>9282</volume>
          , pp.
          <fpage>275</fpage>
          -
          <lpage>286</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>