<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Partition Around Medoids Clustering on the Intel Xeon Phi Many-Core Coprocessor</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Timofey</forename><forename type="middle">V</forename><surname>Rechkalov</surname></persName>
							<email>trechkalov@yandex.ru</email>
							<affiliation key="aff0">
								<orgName type="institution">South Ural State University</orgName>
								<address>
									<settlement>Chelyabinsk</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Partition Around Medoids Clustering on the Intel Xeon Phi Many-Core Coprocessor</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C47A27A8AF47BBAA3FDB8782C670046D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:49+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>data mining</term>
					<term>clustering</term>
					<term>k-Medoids</term>
					<term>Partition Around Medoids</term>
					<term>Intel Many Integrated Core architecture</term>
					<term>Intel Xeon Phi coprocessor</term>
					<term>parallel computing</term>
					<term>tiling</term>
					<term>vectorization</term>
					<term>OpenMP</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The paper addresses the problem of implementing the Partition Around Medoids (PAM) clustering algorithm for the Intel Many Integrated Core architecture. PAM is a variant of the well-known k-Medoids clustering algorithm and is applied in various subject domains, e.g. bioinformatics, text analysis, intelligent transportation systems, etc. An optimized version of PAM for the Intel Xeon Phi coprocessor is introduced, which uses the OpenMP parallelizing technology, loop vectorization, a tiling technique, and efficient distance matrix computation for the Euclidean metric. Experimental results for different data sets confirm the efficiency of the proposed algorithm.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Clustering is one of the basic problems of data mining, aimed at organizing a set of data objects into subsets (clusters) such that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters. Similarity is commonly defined in terms of how close the objects are and is based on a specified distance metric.</p><p>The most fundamental method of clustering is partitioning, which organizes the objects of a set into several exclusive groups. More formally, given a set of n objects, a partitioning algorithm constructs k partitions of the data, where each partition represents a cluster and k ≤ n. The algorithm divides the data objects into k clusters. An object is assigned to the closest cluster based on the distance between the object and the cluster center. The algorithm then iteratively improves the within-cluster variation: new cluster centers are computed from the objects assigned to each cluster in the previous iteration, and all objects are reassigned using the new centers. The iterations continue until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in the previous round.</p><p>This work was financially supported by the Ministry of Education and Science of the Russian Federation ("Research and development on priority directions of scientific-technological complex of Russia for 2014-2020" Federal Program, contract No. 14.574.21.0035).</p><p>Partitioning clustering algorithms differ in the way cluster centers are calculated, e.g. 
k-Means <ref type="bibr" target="#b10">[11]</ref> and k-Modes <ref type="bibr" target="#b4">[5]</ref> use the mean and mode values of the clustered objects respectively, whereas k-Medoids uses an object of the clustered data set (called a medoid).</p><p>The Partition Around Medoids (PAM) algorithm <ref type="bibr" target="#b17">[18]</ref> is a variation of k-Medoids, which is used in a wide spectrum of applications, e.g. text analysis, bioinformatics, intelligent transport systems, etc. The complexity of each iteration of the PAM algorithm is O(k(n − k) 2 ). For large values of n and k the computations are very costly. That is why there are approaches to speeding up the k-Means and PAM algorithms by means of GPUs <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b9">10]</ref>. At the same time there are none for modern accelerators based on the Intel Many Integrated Core (MIC) <ref type="bibr" target="#b7">[8]</ref> architecture. In spite of many recent developments for many-core platforms in data mining <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b22">23]</ref> and databases <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b6">7]</ref> there are few for the Intel MIC architecture.</p><p>In this paper we present a parallel version of PAM for MIC accelerators. The remaining part of the paper is organized as follows. Section 2 gives an overview of the serial PAM algorithm and discusses related work. In Section 3 we describe the parallelization of PAM adapted for the Intel MIC architecture. The results of the experiments evaluating the algorithm are presented in Section 4. Section 5 contains a summary and directions for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background of the Research</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Serial PAM Algorithm</head><p>To provide a formal description of the PAM <ref type="bibr" target="#b8">[9]</ref> algorithm we will use the following notation. Let O = {o 1 , o 2 , . . . , o n } be the set of objects to be clustered, where each object is a tuple consisting of p real-valued attributes. Let k be the number of clusters, k ≪ n, let C = {c 1 , c 2 , . . . , c k } be the set of medoids, C ⊂ O, and</p><formula xml:id="formula_0">ρ : O × C → R is a distance metric.</formula><p>The algorithm takes the form of a steepest ascent hill climber, using a simple swap neighbourhood operation. In each iteration a medoid object c i and a non-medoid object o j are selected such that swapping their roles produces the best clustering. The objective function used is the sum of the distances from each object to its closest medoid:</p><formula xml:id="formula_1">E = n j=1 min 1≤i≤k ρ(c i , o j ).<label>(1)</label></formula><p>Algorithm 1 depicts the PAM pseudocode. PAM consists of two phases, namely BUILD and SWAP. In the first phase an initial clustering is obtained by the successive selection of representative objects until k objects have been found. The first object c 1 is the one for which the sum of the distances to all other objects is as small as possible:</p><formula xml:id="formula_3">c 1 = arg min 1≤h≤n n j=1 ρ(o h , o j ).<label>(2)</label></formula><p>Object c 1 is the most centrally located object of the set O. Subsequently, at each step another object is selected, which decreases the objective function as much as possible. 
This object is the one for which the sum, over all objects, of the minimum between the distance to the closest already selected medoid and the distance to the candidate object is as small as possible:</p><formula xml:id="formula_5">c 2 = arg min 1≤h≤n n j=1 min(ρ(c 1 , o j ), ρ(o h , o j )),<label>(3)</label></formula><formula xml:id="formula_6">c 3 = arg min 1≤h≤n n j=1 min( min 1≤l≤2 (ρ(c l , o j )), ρ(o h , o j )),<label>(4)</label></formula><p>...</p><formula xml:id="formula_7">c k = arg min 1≤h≤n n j=1 min( min 1≤l≤k−1 (ρ(c l , o j )), ρ(o h , o j )).<label>(5)</label></formula><p>This process is continued until k objects have been found.</p><p>In the second phase the algorithm attempts to improve C (i.e. the set of medoids) and therefore also the clustering yielded by this set. The algorithm searches for a pair of objects (c min , o min ) which minimizes the objective function. This is done by considering all pairs of objects (c i , o h ) where c i is a medoid and o h is not. It is determined what effect a swap has on the objective function, i.e. what happens when object c i is no longer selected as a medoid but object o h is. Let us denote this effect by T ih ; the minimum value T min is achieved by the pair (c min , o min ). If T min &gt; 0 then the set C cannot be improved, so the algorithm stops.</p><p>Let us consider the calculation of the effect T ih using the following notation. Let D = {d 1 , d 2 , . . . , d n } be the set of distances from each object to its closest medoid, and let S = {s 1 , s 2 , . . . , s n } be the set of distances from each object to its second-closest medoid. Let C jih be the contribution of a non-selected object o j to the effect T ih of a swap between c i and o h on the objective function. 
In this case T ih is the sum of the contributions C jih :</p><formula xml:id="formula_9">T ih = n j=1 C jih .<label>(6)</label></formula><p>Algorithm 2 <ref type="bibr" target="#b8">[9]</ref> depicts the pseudocode for calculating C jih .</p><formula xml:id="formula_10">Input : oj, ci, o h , dj, sj Output: C jih 1 if ρ(oj, ci) &gt; dj and ρ(oj, o h ) &gt; dj then 2 C jih ← 0 3 else if ρ(oj, ci) = dj then 4 if ρ(oj, o h ) &lt; sj then 5 C jih ← ρ(oj, o h ) − dj 6 else 7 C jih ← sj − dj 8 end 9 else if ρ(oj, o h ) &lt; dj then 10 C jih ← ρ(oj, o h ) − dj 11 end Fig. 2. Calculating C jih</formula></div>
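To make the two phases concrete, the BUILD selection (formulas (2)-(5)) and the swap effect T ih (Alg. 2, formula (6)) can be sketched in NumPy. This is an illustrative serial sketch, not the authors' code: the function names, the exclusion of already selected objects from the candidate set, and the assumption of a precomputed distance matrix dist are ours.

```python
import numpy as np

def build_medoids(dist, k):
    """BUILD phase, formulas (2)-(5): greedily pick the object that most
    reduces the sum of distances to the closest selected medoid."""
    medoids = [int(dist.sum(axis=1).argmin())]      # formula (2)
    nearest = dist[medoids[0]].copy()               # d_j: distance to closest medoid
    while len(medoids) != k:
        # candidate h scores sum_j min(nearest_j, dist[h, j]); formulas (3)-(5)
        scores = np.minimum(nearest, dist).sum(axis=1)
        scores[medoids] = np.inf                    # exclude already chosen medoids
        h = int(scores.argmin())
        medoids.append(h)
        nearest = np.minimum(nearest, dist[h])
    return medoids

def swap_effect(dist, d, s, medoids, i, h):
    """T_ih, formula (6): change of the objective when medoid i and
    non-medoid h are swapped; d, s hold the closest / second-closest
    medoid distances per object (the cases of Alg. 2)."""
    t = 0.0
    for j in range(dist.shape[0]):
        if j == h or j in medoids:
            continue
        if dist[j, i] > d[j] and dist[j, h] > d[j]:
            c = 0.0                                  # o_j keeps its medoid
        elif dist[j, i] == d[j]:                     # o_j loses its closest medoid
            c = min(dist[j, h], s[j]) - d[j]
        elif dist[j, h] > d[j]:
            c = 0.0
        else:
            c = dist[j, h] - d[j]                    # o_j switches to o_h
        t += c
    return t
```

A negative T ih means the swap improves objective (1); PAM applies the most negative swap and stops once the minimum becomes positive.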
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Related Work</head><p>A significant amount of work has been done in the area of cluster analysis. The classical k-Means and k-Medoids algorithms were suggested in <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b10">11]</ref>. The original PAM algorithm was proposed in <ref type="bibr" target="#b8">[9]</ref>.</p><p>The research devoted to accelerating clustering algorithms using parallel hardware includes the following. In <ref type="bibr" target="#b5">[6]</ref> FPGA and GPU implementations of k-Means are compared. The authors of <ref type="bibr" target="#b19">[20]</ref> describe improvements of k-Means reducing data transfers between the CPU and the GPU. In <ref type="bibr" target="#b20">[21]</ref> a technique improving data distribution among GPU threads in k-Means is suggested. A k-Means implementation for the Hadoop framework with GPUs is described in <ref type="bibr" target="#b21">[22]</ref>. In <ref type="bibr" target="#b2">[3]</ref> several clustering methods, including k-Medoids, are implemented on GPU. A GPU-based framework for clustering genetic data using k-Medoids is described in <ref type="bibr" target="#b9">[10]</ref>.</p><p>In our opinion the potential of the Intel MIC accelerators for cluster analysis is currently underestimated. Paper <ref type="bibr" target="#b15">[16]</ref> proposes a modification of the DBSCAN density-based clustering algorithm for the Intel MIC architecture. In <ref type="bibr" target="#b18">[19]</ref> a version of k-Means for a heterogeneous CPU and Intel MIC architecture is presented, where the authors used vectorization and a sophisticated layout scheme to improve data locality. The contribution of this paper is a technique for accelerating the Partitioning Around Medoids clustering algorithm with the Intel Xeon Phi many-core coprocessor.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Parallel PAM Algorithm for MIC Accelerators</head><p>In this section we describe an approach to the implementation of the PAM algorithm for the Intel Xeon Phi coprocessor <ref type="bibr" target="#b16">[17]</ref>. The Intel Xeon Phi coprocessor is an x86-based SMP-on-a-chip with over fifty cores. It supports four hardware threads per core and contains a 512-bit wide vector processing unit (VPU). Each core has two levels of cache memory: a 32 KB L1 data cache, a 32 KB L1 instruction cache, and a core-private 512 KB unified L2 cache. The Intel Xeon Phi coprocessor is connected to other devices via the PCIe bus. Since it is based on the Intel x86 architecture, it supports the same programming tools and models as a regular Intel Xeon processor. Our approach is based on the following principles.</p><p>Data parallelism and vectorization. Using the OpenMP technology we perform simultaneous execution of the same function on multiple cores across the elements of a dataset. Most arithmetic loops of the original PAM algorithm were implemented so that scalar operations are converted to vector form and computed efficiently by the coprocessor's VPUs.</p><p>Our implementation strives to provide data locality as much as possible, i.e. the program uses data close to recently accessed locations. Since the coprocessor loads a chunk of memory around an accessed location into the cache, locations close to recently accessed ones are also likely to be in the cache, which increases the algorithm's performance.</p><p>Algorithm 3 depicts the PAM pseudocode adapted for the Intel Xeon Phi many-core coprocessor. The summary of the parallel PAM subalgorithms is presented in Tab. 1. The PAM algorithm deals with many data arrays which do not fit into the Intel Xeon Phi L2 cache. We process data in chunks of size L to satisfy the data locality requirement. 
It is recommended in <ref type="bibr" target="#b7">[8]</ref> to set L to 16, then to try multiplying or dividing it by 2, and to use n divisible by L. In our work we use L = 32.</p></div>
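As an illustration of the divisibility requirement above, a small sketch follows. The padding-with-copies scheme is our assumption, not from the paper: the paper only requires n divisible by L, and this is one hypothetical way to guarantee it before chunked processing.

```python
import numpy as np

def pad_to_chunks(objects, L=32):
    """Pad the object array with copies of the last row so that the number
    of rows is divisible by the chunk size L; returns the padded array and
    the original row count (illustrative scheme, not the authors' code)."""
    n = objects.shape[0]
    rem = (-n) % L                      # rows missing up to the next multiple of L
    if rem:
        objects = np.vstack([objects, np.repeat(objects[-1:], rem, axis=0)])
    return objects, n
```

With n divisible by L, the chunk loops in the subalgorithms need no remainder handling.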
<div xmlns="http://www.tei-c.org/ns/1.0"><p>To improve performance we use a precomputing technique: the distances between all objects of the set O are calculated in advance. There is no need for repeated calculation of distances at each iteration, since distances can simply be looked up in the matrix M.</p><p>The PrepareDistanceMatrix subalgorithm initializes the distance matrix (see Algorithm 4). Unlike in <ref type="bibr" target="#b8">[9]</ref> we store the matrix in full form (not in upper triangular form) to provide better data locality for the rest of the subalgorithms. To achieve better performance of this subalgorithm we use the tiling technique <ref type="bibr" target="#b7">[8]</ref>. Tiling is a technique for improving data reuse in cache architectures. Cache architectures generally employ least recently used (LRU) methods to determine which data is evicted from the cache as new data is requested. Therefore, the longer data remains unused, the more likely it is to be evicted from the cache and no longer available immediately when needed. Tiling the access pattern can exploit data that remains in the cache from recent, previous iterations.</p><formula xml:id="formula_11">Input : Set of objects O Output: Distance matrix M 1 parallel for oi such that 1 ≤ i ≤ n do 2 for j = 1 to n step L do 3 for k = 1 to p do 4 for l such that j ≤ l ≤ j + L do /* vectorized */ 5 m il ← m il + (oi[k] − o l [k]) 2 ; /* access to o l is tiled */</formula><p>The BuildMedoids subalgorithm implements the BUILD phase (see Alg. 5) according to formulas (2)-(<ref type="formula" target="#formula_7">5</ref>). The FindBestSwap subalgorithm implements the SWAP phase (see Alg. 6). It checks all pairs of objects (c i , o h ) where c i is a medoid and o h is not, calculates the effect T ih of each swap, and returns the minimal one. The SWAP phase executes many logical operations (see Alg. 2). For this reason two versions of the PAM algorithm were implemented. PAM-1 executes more logical operations but stores less temporary data. 
PAM-2 executes fewer logical operations but stores more temporary data. The distance matrix preparation and the BUILD phase are the same in both versions.</p><formula xml:id="formula_12">Input : Distance matrix M Output: Set of medoids C 1 parallel for i = 1 to n do 2 if n j=1 mij is minimal then /* sum is vectorized */ 3 c1 ← oi;</formula></div>
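The tiled distance matrix computation of Alg. 4 can be sketched as follows. NumPy's vectorized subtraction stands in for the VPU inner loop, and the outer loop over rows corresponds to the OpenMP parallel for. The function name and the final square root are our additions, since Alg. 4 shows only the accumulation of squared differences.

```python
import numpy as np

def prepare_distance_matrix(objects, L=32):
    """Tiled Euclidean distance matrix (sketch of Alg. 4).
    Processing columns in chunks of L keeps the tile of objects hot in
    cache while every row i is compared against it."""
    n, p = objects.shape
    m = np.zeros((n, n))
    for j in range(0, n, L):                 # tile over columns
        tile = objects[j:j + L]              # L objects reused for every row i
        for i in range(n):                   # 'parallel for' in the OpenMP version
            diff = objects[i] - tile         # vectorized over the whole tile
            m[i, j:j + L] = (diff * diff).sum(axis=1)
    return np.sqrt(m)                        # square root not shown in Alg. 4
```

Each tile of L objects is reused across all n rows before the next tile is loaded, which is exactly the reuse pattern the tiling discussion above describes.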
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Evaluation</head><p>To evaluate the developed algorithm we performed experiments on the hardware specified in Tab. 2. The experiments were performed on single-precision data, and the coprocessor was used in offload mode. We measured the PAM runtime while varying the number of clustered objects and investigated the influence of dataset properties on the runtime of the PAM subalgorithms. Experimental results for the FCS Human dataset are presented in Fig. <ref type="figure" target="#fig_4">7</ref>(a). The FCS Human dataset has a large dimension, so most of the time is taken by the calculation of the distance matrix. The calculation of the distance matrix on the Intel Xeon Phi is two times faster than on the Intel Xeon. There is no significant difference between PAM-1 and PAM-2 for this dataset.</p><p>Experimental results for the Corel Image Histogram dataset are presented in Fig. <ref type="figure" target="#fig_4">7(b</ref>). The data dimension is small, so preparing the distance matrix does not require much time. PAM-1 shows similar performance on both the CPU and the Intel Xeon Phi. The PAM-2 algorithm is two times slower on the Intel Xeon than on the Intel Xeon Phi.</p><p>Experimental results for the MixSim dataset are presented in Fig. <ref type="figure" target="#fig_4">7(c</ref>). Again PAM-1 shows similar performance on both the CPU and the Intel Xeon Phi. PAM-2 on the Intel Xeon Phi shows the best result on this dataset. Experimental results for the Letter Recognition dataset are presented in Fig. <ref type="figure" target="#fig_4">7(d)</ref>. PAM-1 shows the best result on the Intel Xeon. Both PAM-1 and PAM-2 show similar results on the Intel Xeon Phi.</p><p>Intuitively PAM-2 is a better implementation for the Intel Xeon Phi. This suggestion is confirmed by the experiments. In all tests PAM-2 is twice as fast on the Intel Xeon Phi as on the Intel Xeon. At the same time PAM-1 is the best on the Intel Xeon only once. 
In other tests there is no significant difference.</p><p>To investigate this fact deeper we performed additional experiments to measure the contribution of each PAM subalgorithm, shown in Fig. <ref type="figure" target="#fig_6">8</ref>. Figures <ref type="figure" target="#fig_6">8(a)</ref> and 8(c) show the time of the matrix calculation and the BUILD phase: the Intel Xeon Phi outperforms the Intel Xeon in both subalgorithms. Figures 8(b) and 8(d) show the average time of one iteration of the SWAP phase: here the performance of the Intel Xeon Phi degrades faster than that of the Intel Xeon. The PAM-2 implementation loses to PAM-1 in the SWAP phase for big datasets, so PAM-2 needs further improvement for the Intel Xeon Phi.</p><p>The experiments show that PAM performance depends on the nature of the clustered data. For high-dimensional data the most expensive part is the calculation of the distance matrix. For low-dimensional data the rest of the PAM subalgorithms take a significantly larger part of the runtime than the distance matrix calculation. The BUILD phase is more efficient on the Intel Xeon Phi, whereas the SWAP phase performs better on the Intel Xeon. PAM execution on the Letter Recognition dataset requires more iterations than on MixSim. For this reason PAM-1 shows the best result on Letter Recognition and PAM-2 on MixSim.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>The paper has described a parallel version of the Partitioning Around Medoids clustering algorithm for the Intel Xeon Phi many-core coprocessor. The optimized version of PAM uses the OpenMP parallelizing technology, loop vectorization, a tiling technique, and efficient distance matrix computation for the Euclidean metric. The algorithm stores data in contiguous arrays and processes it in chunks to achieve the data locality needed for better performance.</p><p>Experimental results show the effectiveness of the suggested approach. The experiments show that PAM performance depends on the nature of the clustered data. For high-dimensional data the most expensive part is the calculation of the distance matrix. For low-dimensional data the rest of the PAM subalgorithms take a significantly larger part of the runtime than the distance matrix calculation. The BUILD phase is more efficient on the Intel Xeon Phi, whereas the SWAP phase performs better on the Intel Xeon. PAM-1 shows the best result on the Letter Recognition dataset and PAM-2 on MixSim.</p><p>As future work we plan to extend our research in the following directions: implementing our algorithm for the cases of several coprocessors and of a cluster system based on nodes equipped with Intel Xeon Phi coprocessor(s).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. PAM</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Parallel PAM for the Intel Xeon Phi coprocessor</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Prepare Distance Matrix</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 5 .</head><label>5</label><figDesc>Fig. 5. BUILD phase</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 7 .</head><label>7</label><figDesc>Fig. 7. Performance of the PAM algorithm</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Fig. 8 .</head><label>8</label><figDesc>Fig. 8. Deep comparison of the PAM algorithm implementations</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Summary of parallel PAM subalgorithms</figDesc><table><row><cell>Name</cell><cell cols="2">Complexity Parallelizing technique(s)</cell></row><row><cell>PrepareDistanceMatrix</cell><cell>O(pn 2 )</cell><cell>OpenMP, vectorization</cell></row><row><cell>BuildMedoids</cell><cell>O(kn 2 )</cell><cell>OpenMP, vectorization</cell></row><row><cell>FindBestSwap</cell><cell>O(k(n − k) 2 )</cell><cell>OpenMP</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Specifications of hardware</figDesc><table><row><cell>Specifications</cell><cell cols="2">Processor Coprocessor</cell></row><row><cell>Model</cell><cell cols="2">Xeon X5680 Xeon Phi SE10X</cell></row><row><cell>Cores</cell><cell>6</cell><cell>61</cell></row><row><cell>Frequency, GHz</cell><cell>3.33</cell><cell>1.1</cell></row><row><cell>Threads per core</cell><cell>2</cell><cell>4</cell></row><row><cell>Peak performance, TFLOPS</cell><cell>0.371</cell><cell>1.076</cell></row><row><cell cols="3">Datasets used in experiments are summarized in Tab. 3.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 .</head><label>3</label><figDesc>Datasets Summary</figDesc><table><row><cell>Dataset</cell><cell>p k</cell><cell>n, ×2 10 min max</cell><cell cols="2">Max data size, Mb Time to transfer to coprocessor, sec</cell></row><row><cell>FCS Human [2]</cell><cell cols="2">423 10 2 18</cell><cell>29.74</cell><cell>0.005</cell></row><row><cell cols="3">Corel Image Histogram [14] 32 15 5 35</cell><cell>4.38</cell><cell>0.001</cell></row><row><cell>MixSim [12]</cell><cell cols="2">5 10 5 35</cell><cell>0.68</cell><cell>0.001</cell></row><row><cell>Letter Recognition [4]</cell><cell cols="2">16 26 2 18</cell><cell>1.13</cell><cell>0.001</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Increasing efficiency of data transfer between main memory and Intel Xeon Phi coprocessor or NVIDIA GPUs with data compression</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Y</forename><surname>Besedin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Kostenetskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">O</forename><surname>Prikazchikov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Parallel Computing Technologies -13th International Conference, PaCT 2015</title>
		<title level="s">Proceedings. LNCS</title>
		<editor>
			<persName><forename type="first">V</forename><surname>Malyshkin</surname></persName>
		</editor>
		<meeting><address><addrLine>Petrozavodsk, Russia</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015-09-04">August 31 -September 4, 2015. 2015</date>
			<biblScope unit="volume">9251</biblScope>
			<biblScope unit="page" from="319" to="323" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Independent component analysis: Mining microarray data for fundamental human gene expression modules</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Engreitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J D</forename><surname>Jr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Marshall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">B</forename><surname>Altman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Biomedical Informatics</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="932" to="944" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Accelerating partitional algorithms for flow cytometry on GPUs</title>
		<author>
			<persName><forename type="first">J</forename><surname>Espenshade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pangborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Von Laszewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Cavenaugh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2009</title>
				<meeting><address><addrLine>Chengdu, Sichuan, China</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2009-08">August 2009. 2009</date>
			<biblScope unit="page" from="226" to="233" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Letter recognition using holland-style adaptive classifiers</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">W</forename><surname>Frey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Slate</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="161" to="182" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Extensions to the k-means algorithm for clustering large data sets with categorical values</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Min. Knowl. Discov</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="283" to="304" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Novel dynamic partial reconfiguration implementation of k-means clustering on FPGAs: Comparative results with GPPs and GPUs</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">M</forename><surname>Hussain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Benkrid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ebrahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Erdogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Seker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. J. Reconfig. Comp</title>
		<imprint>
			<biblScope unit="volume">135926</biblScope>
			<biblScope unit="page">15</biblScope>
			<date type="published" when="2012">2012. 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Decomposition of natural join based on domain-interval fragmented column indices</title>
		<author>
			<persName><forename type="first">E</forename><surname>Ivanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sokolinsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information and Communication Technology, Electronics and Microelectronics (MIPRO)</title>
				<imprint>
			<date type="published" when="2015-05">2015. May 2015</date>
			<biblScope unit="page" from="210" to="213" />
		</imprint>
	</monogr>
	<note>38th International Convention on</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Intel Xeon Phi Coprocessor High-Performance Programming</title>
		<author>
			<persName><forename type="first">J</forename><surname>Jeffers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reinders</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
			<publisher>Morgan Kaufmann</publisher>
		</imprint>
	</monogr>
	<note>1 edn</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Finding Groups in Data: An Introduction to Cluster Analysis</title>
		<author>
			<persName><forename type="first">L</forename><surname>Kaufman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Rousseeuw</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1990">1990</date>
			<publisher>John Wiley</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Kohlhoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">H</forename><surname>Sosnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">T</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">S</forename><surname>Pande</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">B</forename><surname>Altman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="issue">16</biblScope>
			<biblScope unit="page" from="2321" to="2322" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Least squares quantization in PCM</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Lloyd</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Information Theory</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="129" to="136" />
			<date type="published" when="1982">1982</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">MixSim: An R package for simulating data to study performance of clustering algorithms</title>
		<author>
			<persName><forename type="first">V</forename><surname>Melnykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Maitra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Statistical Software</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">12</biblScope>
			<date type="published" when="2012-01">Jan 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Time series subsequence similarity search under dynamic time warping distance on the Intel many-core accelerators</title>
		<author>
			<persName><forename type="first">A</forename><surname>Movchan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Zymbler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Similarity Search and Applications - 8th International Conference, SISAP 2015</title>
		<title level="s">Proceedings. LNCS</title>
		<editor>
			<persName><forename type="first">G</forename><surname>Amato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><forename type="middle">C H</forename><surname>Connor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Falchi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Gennaro</surname></persName>
		</editor>
		<meeting><address><addrLine>Glasgow, UK</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">October 12-14, 2015</date>
			<biblScope unit="volume">9371</biblScope>
			<biblScope unit="page" from="295" to="306" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Supporting ranked boolean similarity queries in MARS</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ortega</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Rui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Porkaew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mehrotra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">S</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Knowl. Data Eng</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="905" to="925" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A parallel algorithm for market basket analysis on the cell processor</title>
		<author>
			<persName><forename type="first">C</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zymbler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bulletin of the South Ural State University. Series: Mathematical Modelling, Programming and Computer Software</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="48" to="57" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Pardicle: Parallel approximate density-based clustering</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M A</forename><surname>Patwary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Satish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sundaram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Manne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Habib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dubey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Damkroger</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Dongarra</surname></persName>
		</editor>
		<meeting><address><addrLine>New Orleans, LA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2014">November 16-21, 2014</date>
			<biblScope unit="page" from="560" to="571" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Accelerating medoids-based clustering with the Intel Many Integrated Core architecture</title>
		<author>
			<persName><forename type="first">T</forename><surname>Rechkalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zymbler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Application of Information and Communication Technologies - 9th International Conference, AICT 2015, Proceedings</title>
		<meeting><address><addrLine>Rostov-on-Don, Russia</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2015">October 14-16, 2015</date>
			<biblScope unit="page" from="413" to="417" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Clustering rules: A comparison of partitioning and hierarchical clustering algorithms</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Reynolds</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Richards</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>De La Iglesia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">J</forename><surname>Rayward-Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Math. Model. Algorithms</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="475" to="504" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A vectorized k-means algorithm for Intel Many Integrated Core architecture</title>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advanced Parallel Processing Technologies - 10th International Symposium, APPT 2013</title>
		<title level="s">Revised Selected Papers. LNCS</title>
		<editor>
			<persName><forename type="first">C</forename><surname>Wu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Cohen</surname></persName>
		</editor>
		<meeting><address><addrLine>Stockholm, Sweden</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">August 27-28, 2013</date>
			<biblScope unit="volume">8299</biblScope>
			<biblScope unit="page" from="277" to="294" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">GPU accelerated spherical k-means training</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Leung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Information Processing - 20th International Conference, ICONIP 2013</title>
		<title level="s">Proceedings, Part II. LNCS</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Lee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Hirose</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Hou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Kil</surname></persName>
		</editor>
		<meeting><address><addrLine>Daegu, Korea</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">November 3-7, 2013</date>
			<biblScope unit="volume">8227</biblScope>
			<biblScope unit="page" from="392" to="399" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">DVT-PKM: an improved GPU based parallel k-means algorithm</title>
		<author>
			<persName><forename type="first">B</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zheng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Intelligent Computing Methodologies - 10th International Conference, ICIC 2014</title>
		<title level="s">Proceedings. LNCS</title>
		<editor>
			<persName><forename type="first">D</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Jo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</editor>
		<meeting><address><addrLine>Taiyuan, China</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">August 3-6, 2014</date>
			<biblScope unit="volume">8589</biblScope>
			<biblScope unit="page" from="591" to="601" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Accelerate k-means algorithm by using GPU in the Hadoop framework</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Web-Age Information Management - WAIM 2014 International Workshops: BigEM, HardBD, DaNoS, HRSUNE, BIDASYS</title>
		<title level="s">Revised Selected Papers. LNCS</title>
		<editor>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Balke</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Jin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Lin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><forename type="middle">Y</forename><surname>Tang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Hwang</surname></persName>
		</editor>
		<meeting><address><addrLine>Macau, China</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">June 16-18, 2014</date>
			<biblScope unit="volume">8597</biblScope>
			<biblScope unit="page" from="177" to="186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Best-match time series subsequence search on the Intel Many Integrated Core architecture</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Zymbler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Databases and Information Systems - 19th East European Conference, ADBIS 2015</title>
		<title level="s">Proceedings. LNCS</title>
		<editor>
			<persName><forename type="first">T</forename><surname>Morzy</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Valduriez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Bellatreche</surname></persName>
		</editor>
		<meeting><address><addrLine>Poitiers, France</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">September 8-11, 2015</date>
			<biblScope unit="volume">9282</biblScope>
			<biblScope unit="page" from="275" to="286" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
