<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Igescu</string-name>
          <email>stefan.igescu@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viktor Sanca</string-name>
          <email>viktor.sanca@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eleni Zapridou</string-name>
          <email>eleni.zapridou@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Ailamaki</string-name>
          <email>anastasia.ailamaki@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EPFL</institution>
          ,
          <addr-line>Lausanne</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Google</institution>
          ,
          <addr-line>Sunnyvale</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>K-Means</institution>
          ,
          <addr-line>Speculative Execution, Sampling, AI for DB, Clustering, Parallelization, Algorithm Co-Design, Analytics</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Workshop Proceedings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <issue>2023</issue>
      <abstract>
        <p>K-means is one of the fundamental unsupervised data clustering and machine learning methods. It has been well studied over the years: parallelized, approximated, and optimized for different cases and applications. With increasingly higher parallelism leading to processing becoming bandwidth-bound, further speeding up iterative k-means would require data reduction using sampling at the cost of accuracy. We examine the use of speculation techniques to expedite the convergence of the iterative k-means algorithm without affecting the accuracy of the results. We achieve this by introducing two cooperative and concurrent phases: one works on the overall input data, and the other speculates and explores the space faster using sampling. At the end of every iteration, the two phases vote and choose the best centroids according to the objective function. Our speculative technique reduces the number of steps needed for convergence without compromising accuracy and, at the same time, provides an effective mechanism to escape local minima without prior initialization cost through resampling.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>K-means clustering</title>
        <p>
          K-means clustering is an unsupervised machine learning algorithm used for grouping data points into a predefined number of clusters. It is one of the best-known and most commonly used clustering algorithms and, despite being formulated in the 1960s, it is still widely applicable and practical due to its simplicity and effectiveness [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>K-means</title>
        <p>Given a dataset X ⊂ ℝ^d with d-dimensional real entries and a positive integer k, find a set of k centroids (means) C ⊂ ℝ^d such that the objective function L minimizes the within-cluster sum of squares.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Algorithm 1 K-means Clustering</title>
        <p>1: Initialize k centroids randomly
2: repeat
3:   Calculate the distance between each data point and the centroids
4:   Assign each data point to the nearest centroid
5:   Update the centroids by computing the mean of all points assigned to each cluster
6: until the centroids no longer move or a maximum number of iterations is reached
7: Return the final clusters and centroids</p>
        <p>The assignment step comprises lines 3 and 4, and the update step line 5. The objective function is:</p>
        <p>L(C, X) = Σ_{c ∈ C} Σ_{x ∈ X_c} ||x − c||²   (Objective function)</p>
      </sec>
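Algorithm 1 maps directly onto array operations. The following is a minimal NumPy sketch of Lloyd's iteration, for illustration only (not the authors' implementation); function and parameter names are our own:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal Lloyd's K-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Line 1: initialize k centroids randomly from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Lines 3-4: distance of every point to every centroid, assign nearest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Line 5: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Line 6: stop when centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```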
      <sec id="sec-1-8">
        <title>Lloyd's variant</title>
        <p>This process is iterated until no points change their assignments or the number of assigned iterations runs out. The well-known Lloyd's [2] variant is described in Algorithm 1.</p>
        <p>∗Corresponding author. †Work done entirely at EPFL. ‡GitHub Repository. ORCID: 0000-0002-4799-8467 (V. Sanca); 0000-0002-5025-6835 (E. Zapridou); 0000-0002-9949-3639 (A. Ailamaki)</p>
      </sec>
      <sec id="sec-1-9">
        <title>Complexity</title>
        <p>By processing all the data points, K-means has an overall time complexity of O(n · k · d) and space complexity of O(d · (n + k)), where n is the number of data points, d is the data dimensionality, and k is the set number of clusters.</p>
      </sec>
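As a back-of-the-envelope illustration of the O(n · k · d) per-pass cost (the sizes below are made-up, not from the paper):

```python
# One Lloyd pass computes a distance for every (point, centroid) pair.
n, k, d = 100_000, 8, 16      # hypothetical dataset sizes
distance_ops = n * k          # distances computed per iteration
flops_per_distance = 3 * d    # subtract, square, accumulate per dimension
total_flops = distance_ops * flops_per_distance
print(total_flops)            # roughly n * k * 3 * d floating-point ops
```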
      <sec id="sec-1-10">
        <title>Scaling to large datasets</title>
        <p>In the case of large datasets, the cost is dominated by the input cardinality. Even if the data were partitioned and processed in a data-parallel memory-bound way, the algorithm's complexity and the required number of steps for convergence would remain the same.</p>
        <p>To address the increasingly higher data volumes and lower the computation time and memory capacity requirements, a possible approach is to take a random sample of the given dataset. Still, sampling trades off accuracy for a more compact data representation, which comes at a cost, as random sampling can lead to bias, inadvertently skipping data points that would change the clustering. A random sample cannot always, in an unsupervised fashion and without further information, represent the entire dataset. As a consequence, the final centroids that are found might not be the ones that correctly represent the initial dataset and minimize the objective function. Selecting the sample size dictates the tradeoff between execution time and centroid quality: the more we want to improve performance, the smaller the sample size has to be, sacrificing how representative the resulting centroids are. Furthermore, more elaborate sampling schemes to find a biased sample might introduce overheads that are not compatible with, and maskable by, the algorithm throughput [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
        <p>Ideally, while observing the iterative nature of the algorithm and the mutual dependencies in the steps, suppose there exists an oracle that informs the next step, at no cost, about the centroids it found, avoiding full data access. This would allow full data skipping with just the cost of assigning data points to centroids. Suppose next that this oracle manages to look into the future over several next steps and to propose, with some uncertainty, what the centroids after several iterations should be. This would lead not only to data skipping but to full iteration skipping, in an ideal scenario reducing the algorithm to a single step and fast-forwarding all the work.</p>
        <p>In practice, such approaches exist under the umbrella term of speculation, as in hardware in microprocessors [<xref ref-type="bibr" rid="ref4">4</xref>], and have also recently been applied in the domain of complex analytical queries [<xref ref-type="bibr" rid="ref5">5</xref>]. The key characteristic of speculation is to achieve seamless and fast speculative steps and to have a lightweight mechanism to detect and repair wrong proposed decisions while avoiding performance degradation.</p>
        <p>We take the idea of speculation, tailor it for iterative K-means, introduce a novel way of guiding the objective function to converge in fewer steps without sacrificing the resulting quality, and provide the following contributions by:
• formulating the speculative K-means algorithm in section 3 and showing how to break iterative step dependencies,
• bridging speculative actions, which are approximate, with the exact execution phase in section 4, and demonstrating cooperative, iterative dependency reconstruction,
• showing in section 5 that the speculative phase is enhanced by sampling, which allows the speculative algorithm to quickly explore subsets of data and escape local minima,
• evaluating our approach across datasets in section 6, showing that speculative K-means converges on average in fewer steps towards the optimal solution, often finding a better solution than methods with specialized centroid initialization phases.</p>
        <p>We provide a novel perspective for tuning a well-established unsupervised learning algorithm and an initial blueprint for using speculation in iterative algorithms.</p>
        <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
      </sec>
      <sec id="sec-1-11">
        <title>2. Background and Related Work</title>
        <p>K-means clustering is a widely used unsupervised machine learning algorithm for data partitioning and clustering, and this section provides an overview of the basic K-means algorithm, the existing initialization techniques, the approximations which can be done to improve performance, and the parallelization methods. We outline the desirable characteristics of these approaches and describe speculation for breaking dependencies in analytical queries [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
        <p>2.1. Clustering quality: inertia</p>
        <p>Inertia measures how well the K-means algorithm clustered a dataset. Inertia I is defined as the sum of squared distances of data points x to their closest cluster center c, weighted if required (w_x):</p>
        <p>I(C, X) = Σ_{c ∈ C} Σ_{x ∈ X_c} w_x · ||x − c||²   (Inertia)</p>
        <p>It is exactly the objective function (Equation Objective function) of the K-means algorithm, where a good model is one with low inertia for a given number of clusters k since, naturally, as the number of clusters increases, the inertia decreases.</p>
        <p>2.2. K-means++: improved initialization</p>
      </sec>
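Inertia, as defined in subsection 2.1, is a single reduction over the dataset. A minimal NumPy sketch for illustration (our own naming; weights default to uniform when no w_x is given):

```python
import numpy as np

def inertia(X, centroids, weights=None):
    """Sum of squared distances of each point to its nearest centroid,
    optionally weighted per point (w_x); uniform weights by default."""
    if weights is None:
        weights = np.ones(len(X))
    # Distance from every point to every centroid, keep the nearest one.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    nearest_sq = dists.min(axis=1) ** 2
    return float(np.sum(weights * nearest_sq))
```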
      <sec id="sec-1-12">
        <title>K-means is sensitive to centroid initialization</title>
        <p>In its classical form, the algorithm starts by randomly assigning a set of centroids, which can lead to suboptimal solutions when centroids are not carefully selected. To mitigate this, it is important to use an appropriate initialization strategy [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
        <p>One such approach is K-means++ [<xref ref-type="bibr" rid="ref7">7</xref>], which selects initial cluster centroids using sampling based on an empirical probability distribution that reflects the contribution of the points to the overall inertia. This helps find the global optimum and produce better-quality clusters than random initialization. In Algorithm 2, we describe the steps of the centroid initialization.</p>
      </sec>
      </sec>
      <sec id="sec-1-13">
        <title>Algorithm 2 K-means++ Initialization</title>
        <p>1: Choose one center uniformly at random among the data points.
2: for each data point x not chosen yet do
3:   Compute D(x), the distance between x and the nearest center that has already been chosen.
4: end for
5: Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
6: repeat
7:   Steps 2 to 5
8: until k centers have been chosen
9: Return the initial centers</p>
        <p>Elaborate in-memory sampling schemes might be impractical and costly, especially if the computation replaced via data volume reduction does not match the sampling overhead and overall throughput [<xref ref-type="bibr" rid="ref3">3</xref>], or they require workload-aware and efficient online strategies [<xref ref-type="bibr" rid="ref11">11</xref>].</p>
        <p>Therefore, sampling uniformly at random provides a data reduction and speedup strategy at the cost of impacting the non-uniform distribution of clusters in the original dataset.</p>
      </sec>
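The D² sampling of Algorithm 2 can be sketched as follows. This is an illustrative NumPy version, not the authors' code; names are our own:

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """K-means++ seeding: sample each new center with probability
    proportional to D(x)^2, the squared distance to the nearest
    already-chosen center (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    # Step 1: first center uniformly at random among the data points.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # Steps 2-4: D(x)^2 for every point w.r.t. the chosen centers.
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # Step 5: weighted draw proportional to D(x)^2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

Note that points far from every chosen center dominate the distribution, which is exactly what spreads the seeds across clusters.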
      <sec id="sec-1-14">
        <title>Initialization cost</title>
        <p>The initialization involves the computation of the distances between each data point and the nearest center. This is consequently an expensive operation, repeated every time a new centroid is sampled. Therefore, all the aforementioned benefits come with an initial time-cost tradeoff to pay before running the iterative clustering phases of K-means, which can be prohibitive in particular applications, especially in the case of large datasets.</p>
        <p>2.3. Sampling: Mini-batch and online K-means</p>
        <p>Sampling-based approximations aim to improve performance, usually execution time or memory constraints, by using a randomized subset of the initial input data to reduce the data volume. In addition, these algorithms are practical when dealing with large datasets that would be too costly to process fully in-memory in one go [<xref ref-type="bibr" rid="ref8">8</xref>].</p>
        <p>Mini-batch K-means is an iterative algorithm that uses a subset of the data to update the cluster centers. At each K-means iteration, a new mini-batch is sampled [<xref ref-type="bibr" rid="ref9">9</xref>]. Online K-means is a streaming algorithm that updates the cluster centers using one data point at a time [<xref ref-type="bibr" rid="ref10">10</xref>].</p>
        <p>These methods reduce the execution time or the amount of memory needed to run K-means clustering. However, this comes at a cost to the accuracy of the final centroids found. Not using the entire dataset to find the centroids leads to exploring a subspace of possible centroids and could lead to solutions with higher inertia. Intuitively, when an unbiased sample is taken, data populations that the clustering was intended to discover in the first place might be lost.</p>
        <p>2.4. Parallelization</p>
        <p>With increasingly large datasets and system improvements such as larger memory capacities and parallel and distributed architectures, one approach for addressing K-means on large datasets is to adapt it to use the available hardware.</p>
        <p>Distributed frameworks such as Spark [<xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>] and MapReduce [<xref ref-type="bibr" rid="ref14">14</xref>] take advantage of distributed computational resources to process large datasets quickly and efficiently while keeping the dataset in memory in a scale-out manner.</p>
        <p>On the other hand, multi-processor setups and the introduction of GPUs with higher memory capacities were driving the formulation of scale-up implementations, which exploit available data and task parallelism and high-bandwidth data access [<xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>].</p>
        <p>Conceptually, the main goal of parallelization is to efficiently use abundant and available resources. Still, as in the case of data parallelism, Amdahl's law suggests that an increase in resources often results in diminishing returns [<xref ref-type="bibr" rid="ref17">17</xref>]. Bottlenecks such as data exchange, data distribution imbalance, and communication, especially in the critical sections, may lead to resources remaining idle or underutilized: adding more resources does not necessarily lead to proportionally faster execution.</p>
        <p>Finally, parallelization does not change the algorithmic complexity of the task. For example, the data-parallel formulation conceptually just partitions the input data over multiple workers, with defined synchronization overhead. In the case of iterative algorithms, and K-means in particular, there is a dependency between two steps (Algorithm 1), where the iterations must be executed one at a time, in order, necessitating a full pass over the data to proceed to the next step.</p>
        <p>2.5. Speculative Execution for Analytics</p>
        <p>Speculation for analytics [<xref ref-type="bibr" rid="ref5">5</xref>] was introduced to strike a balance between data and task parallelism and to demonstrate that idle hardware resources can be redirected to breaking dependencies in complex queries that would predominantly use data parallelism.</p>
        <p>This novel processing paradigm accelerates inter-dependent queries using speculation, through approximate query processing and a subsequent repair mechanism, to achieve results equivalent to purely data-parallel execution. The speculative technique used allows for relaxing dependencies between queries, increasing task parallelism, and reducing end-to-end latency. It works by detecting dependencies between outer and inner subqueries and creating two separate execution paths: one for speculative execution and one for validation/repairs. The speculative execution path executes the outer query by resolving any cross-query conditions with the help of a branch predictor. At the same time, the validation/repair execution path performs three steps: (i) it fully computes the inner query, (ii) it validates the speculative decisions, and (iii) it repairs the results in case of mispredictions.</p>
        <p>This technique decouples the inner and outer queries and allows them to run in parallel or share data and operators if they process rows from the same table.</p>
        <p>Speculation allows better resource utilization through data and task parallelism and subsequently opens opportunities to use techniques such as scan sharing or shared-query execution. Still, a lightweight speculation and repair mechanism is required to exploit the proposed optimization opportunities.</p>
        <p>Figure 1: Conceptual Design of Speculative K-means</p>
        <p>Speculative K-means (Figure 1, Algorithm 3) combines the regular (slow) and speculative (fast) execution in a task-parallel manner. Fast exploration is based on sampling, managing to finish multiple iterations until synchronization with the single slow iteration. The key component of speculation is merging and repairing at the end of every slow iteration. At that point, the centroids from the fast execution (an unreliable oracle) are compared with those from the slow execution using inertia (subsection 2.1), and the better centroids are selected.</p>
      </sec>
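The mini-batch update described in section 2.3 can be sketched as below. This is an illustrative step in the style of the mini-batch algorithm [9], not the authors' implementation; the per-center 1/count learning rate and all names are our own assumptions:

```python
import numpy as np

def minibatch_step(X, centroids, counts, batch_size, rng):
    """One mini-batch K-means update: sample a batch, assign it, and
    move each centroid towards its batch points with a per-center
    learning rate of 1/count."""
    batch = X[rng.choice(len(X), size=batch_size, replace=False)]
    dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]  # decaying per-center learning rate
        centroids[j] = (1 - eta) * centroids[j] + eta * x
    return centroids, counts
```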
    </sec>
    <sec id="sec-2">
      <title>3. Speculative K-Means</title>
      <sec id="sec-2-1">
        <title>Taking into account the desirable characteristics of the K-means algorithm, an improved algorithm would:</title>
      </sec>
      <sec id="sec-2-2">
        <title>O1: reduce the number of steps and time needed to converge,</title>
      </sec>
      <sec id="sec-2-3">
        <title>O2: achieve equal or better centroid result quality,</title>
      </sec>
      <sec id="sec-2-4">
        <title>O3: avoid costly initialization while escaping local minima.</title>
        <p>These objectives are seemingly contradictory and mutually conflicting. O1 would imply an approximate solution that conflicts with O2, while O3, in conjunction with O1, would result in approximate and potentially significantly reduced result quality. There is a clear tradeoff between the resulting quality and the necessary (pre)computation time [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
        <p>To bridge these objectives, we propose a speculative formulation where, in conjunction with the exact execution, a speculative step happens that satisfies the objectives and terminates with equal (or better) cluster characteristics, i.e., lower inertia as a measure of clustering quality (Equation Inertia). We refer to Algorithm 1, with the assignment step (A) in lines 3 and 4 and the update step (B) in line 5, and to the strict cyclical dependency between A and B that we want to break, allowing the algorithm to traverse the chain of iterations towards the final solution more quickly.</p>
      </sec>
      <sec id="sec-2-5">
        <title>The fast worker allows no-regret exploration and potentially avoids traversing the entire chain to reach the final solution by skipping some intermediate steps using the sample-based prediction of clusters.</title>
        <p>When the slow execution finishes processing its stage,
the fast one is interrupted. Both the fast and slow
execution propose the centroids they found, and the best ones
are chosen by computing their inertia over the entire
dataset (i.e., objective function). The centroids which
lead to lower inertia are selected. This procedure is
repeated until convergence or algorithm termination,
starting from the newly selected centroids.</p>
        <p>Algorithm 3 K-means Clustering using Speculation
1: Initialize centroids C₀ randomly
2: Create two workers: slow and fast
3: Run 1 Assignment stage and 1 Update stage on the slow execution starting from Cᵢ
4: Run as many stages as possible on a sample of the dataset on the fast execution starting from Cᵢ
5: When the slow execution finishes, stop the fast execution. Get the proposed centroids C_slow, C_fast
6: Compute the inertia I(X, C_slow), I(X, C_fast) on the entire dataset X
7: Cᵢ₊₁ = argmin(I(X, C_slow), I(X, C_fast)), selecting the centroids minimizing the inertia
8: while not (centroids no longer move | a maximum number of iterations is reached) do
9:   repeat 2 to 7
10: end while
11: Return the final clusters and centroids</p>
      </sec>
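Algorithm 3 can be sketched sequentially as follows. The two workers are simulated in turn rather than run as concurrent threads, and `sample_frac`, `fast_stages`, and all function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def stage(X, C):
    """One K-means stage: one Assignment step and one Update step."""
    labels = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2).argmin(axis=1)
    return np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                     for j in range(len(C))])

def inertia(X, C):
    """Objective function: sum of squared nearest-centroid distances."""
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

def speculative_kmeans(X, k, max_stages=50, sample_frac=0.2, fast_stages=5, seed=0):
    """Sequential sketch of Algorithm 3: per outer iteration, one slow
    stage on the full data and several fast stages on a fresh sample;
    the centroids with the lower full-data inertia win the vote."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]          # line 1
    for _ in range(max_stages):
        C_slow = stage(X, C)                                  # line 3: slow worker
        sample = X[rng.choice(len(X), size=max(k, int(sample_frac * len(X))),
                              replace=False)]
        C_fast = C.copy()                                     # line 4: fast worker
        for _ in range(fast_stages):                          # speculative stages
            C_fast = stage(sample, C_fast)
        # Lines 6-7: vote on full-data inertia, keep the better centroids.
        C_next = C_slow if inertia(X, C_slow) <= inertia(X, C_fast) else C_fast
        if np.allclose(C_next, C):                            # line 8: convergence
            break
        C = C_next
    return C
```

Because the vote recomputes the objective on the entire dataset, a wrong speculation can never make the result worse than the slow worker's own stage.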
      <sec id="sec-2-6">
        <title>Convergence and quality</title>
        <p>This approach is not slower to converge than standard K-means in terms of steps, and it produces better clusters than approximate methods thanks to the centroid voting and consolidation, which we demonstrate in section 4.</p>
        <p>Furthermore, it can converge closer to the global minimum without costly centroid initialization as in K-means++, since the centroids are resampled each time during the fast execution. As the K-means quality is strictly conditioned by its initialization, by resampling the centroids from which the fast execution starts each time, we study the effect of avoiding local optima through random space exploration in section 5.</p>
        <p>We satisfy the objectives (Algorithm 3) collaboratively using:
O1: sampling-based speculation to guide the objective function (lines 4, 5),
O2: merging and voting on better centroids along the slow step (lines 6, 7),
O3: space exploration by dropping out points using sampling (line 4).</p>
        <p>The assignment of data points depends on the current cluster centers, which are, in turn, updated based on the assignments. This creates a cyclical pattern in the K-means algorithm. We can group the K-means steps into stages, each containing an Assignment step (A) and an Update step (B), with many of these stages repeated until convergence. One way to speculate is to try to run the Assignment and the Update step in parallel, thus reducing the overall execution time of a single stage. In particular, we would speculate the result of the Assignment (or the Update), allowing the consequent Update (or the Assignment) to start running concurrently. Finally, we could perform a repair stage once the correct result of the Assignment (or the Update) is available.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Exploitation: Fast and Accurate</title>
      <p>Firstly, we are interested in the qualitative behavior of speculative K-means against vanilla K-means (Lloyd's algorithm) and K-means++ as an approach with a better initialization strategy. We observe and compare the evolution of the clusters' inertia against the computed estimate of the optimal clustering and its respective inertia. This is essential to understand whether our method reduces the inertia faster than existing methods and to verify where our method converges with respect to the global minimum and, in particular, whether it can reach it.</p>
      <p>As the fast execution (Figure 1) performs several assignment and update steps, allowing it to converge faster, the slow execution computes the result equivalent to vanilla K-means on the entire dataset. This means that the fast execution can suggest centroids that are closer to the final solution, reducing the total number of steps needed to converge and, consequently, the overall execution time. At the same time, the quality of the final solution is still comparable to the case of working with the entire dataset, thanks to the slow execution, which can find the best (naive) position of the centroids to reduce the inertia in case of bad speculation.</p>
      <p>4.1. Breaking the K-means dependencies</p>
      <p>The vanilla (naive) K-means Algorithm 1 consists of two main steps: the assignment step and the update step. The assignment step includes lines 3 and 4, while the update step includes line 5. The two steps are dependent on each other.</p>
      <p>Figure 2: Evolution of the ratio between the execution time of the Assignment and Update steps.</p>
      <p>After a study of the execution times (Figure 2), the Assignment step takes more time than the Update step. In particular, the ratio between the time of an Assignment step and the time of the Update step increases with the number of clusters k. Consequently, parallelizing Assignment and Update within each stage is inefficient, since the gain of the added task parallelism is irrelevant when the Assignment time strongly dominates the Update time.</p>
      <p>Therefore, we must use an approach that allows fast speculation (and repair) when implementing iterative speculation. Instead of running Assignment and Update concurrently for each stage, we run two independent and concurrent executions of K-means (Fast and Slow, Figure 1). Both executions start with the same given centroids, but the slow execution processes one stage consisting of one Assignment step and one Update step using the full dataset, while the fast execution runs on a sample of the dataset and is able to process multiple stages in the same amount of time as the slow execution. As soon as the slow execution finishes, the fast execution is interrupted, guaranteeing that they both take the same amount of time. Finally, we compare the centroids proposed by both executions, compute the overall objective function over the entire dataset given both proposed sets of centroids, and pick the centroids with the lowest inertia. This step can be done in a single scan of the data, overlapping it with the mandatory step of the Slow execution path.</p>
      <p>The fast execution of the K-means algorithm is used to speculate and predict an approximation of the future centroids that the slow execution may encounter by traversing a speculative path. If the prediction is correct, it allows skipping several stages of the K-means algorithm, breaking the cyclic dependency. This is possible since the fast execution works on a subset of the dataset, allowing it to execute faster and dive deeper into the algorithm, guiding the objective faster to convergence. However, as shown in section 2.3, the approximate approach can lead to less accurate results when applied to the full dataset due to using an unbiased data sample.</p>
      <p>Figure 3: Evolution of the inertia of two runs starting from the same initial centroids: one based on vanilla K-means, the other on Speculation K-means, and K-means++. The figure illustrates the estimated optimal achievable inertia and the inertia of the centroids suggested by the slow run and the fast run at each stage.</p>
      <p>To address this issue, the slow execution is deployed to compute the result using the entire dataset, ensuring that Speculation K-means will converge toward a solution that considers the entire dataset. Thus, the two executions have different purposes, with the fast execution being used to speculate and the slow execution being used to ensure accuracy.</p>
      <p>In Figure 1, the green shapes represent the steps executed, whereas the red ones indicate the steps we would save in case the speculation of the Fast execution suggests a set of centroids with lower inertia.</p>
      <p>The main difference between the speculative execution used in [<xref ref-type="bibr" rid="ref5">5</xref>], discussed in section 2.5, and the Speculation K-means technique is that in the former, the speculation is used to break the dependency between the inner and outer query to increase parallelism, while in the latter, the speculation is used to predict possible future results (i.e., centroids). Therefore, we are breaking the dependencies in a different sense: we are traversing the entire chain of dependencies to arrive at the final result. Additionally, the correction phase in our approach ensures the accurate convergence of a sequence of actions in an iterative algorithm, rather than just the result of dependent subqueries.</p>
      <p>4.2. Reconstructing the dependencies</p>
      <p>As the algorithm approaches convergence, the accuracy and effectiveness of the speculation carried out by the fast execution decrease, as shown in Figure 3. In the initial stages, the inertia of the centroids proposed by the fast execution is significantly smaller than that of the centroids of the slow execution. Still, it becomes comparable in the following 3 to 4 stages. Additionally, as we approach convergence, the inertia of the fast execution's centroids is constantly greater than that of the slow execution. This is because the last stages of K-means before convergence need the entire dataset to compute the exact position of the centroids, and we cannot do this by using a sample of it. This leads us to question what to do with the fast execution and whether we can improve its prediction even during the last stages before convergence.</p>
      <p>
        In Figure 3, we compare vanilla K-means, K-means++ (with better initialization), and speculative K-means as proposed until now. All start from the same initial centroids, except K-means++, which has an additional step of initial centroid computation that is not represented. After this, they compute k = 8 clusters on the OpenML [<xref ref-type="bibr" rid="ref18">18</xref>] Dataset 23395 [<xref ref-type="bibr" rid="ref19">19</xref>] and run for a maximum of 50 stages. Additionally, the optimal achievable inertia is estimated by running K-means++ 50 times with different initial centroids and then picking the solution with minimum inertia. The run with speculation is more efficient, as it reaches convergence approximately after 10 stages, while the vanilla implementation takes 30 stages. Furthermore, the accuracy of the final result is the same for both executions, showing how the slow execution and the correction phase ensure a comparable final inertia.</p>
      <p>This is especially noticeable as, in this formulation, the speculative approach cannot offer better result quality, similar to K-means++. First, we propose keeping track of the previously predicted centroids in the fast execution. Every time a fast execution starts, we sample a different subset of the dataset on which it will work. Each subset will lead to a different centroid prediction. Therefore, we decide to keep a memory of them using a tracing method, which consists in updating the newly discovered centroids C_new with the past centroids C_past in the following way:</p>
      <p>C_traced = α · C_new + (1 − α) · C_past   (1)</p>
      <p>Again, we can see how α plays the role of a hyperparameter which determines how much we should keep track of the past speculated centroids. This type of tracing leads to an exponential decay of the memory of the past centroids. The reason why this method should help the prediction of the fast execution in the late stages of K-means is that it tries to combine the contributions of different speculation passes. Suppose the problem is that a dataset sample cannot represent the entire dataset in the computation of the centroids. In that case, combining the contributions of different samples should mitigate this limitation. In Figure 4, we can see how, with tracing, the prediction of the fast execution is more accurate even in the last stages and has lower variance, guided by past speculations, improving the accuracy of our predictions by giving an increasingly complete view of the data.</p>
      <p>While this approach allows fast convergence informed by previous speculative steps, Figure 4 demonstrates that speculation does not yet approach the inertia of dedicated centroid initialization (K-means++).</p>
      <p>5. Exploration: Escaping Local Minima
execution.
      </p>
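      <p>The tracing update in Equation (1) is an exponentially weighted moving average over the centroids proposed by successive fast executions. The following minimal sketch illustrates it; the function name, the symbol beta, and its default value are our choices for illustration, not the prototype's:</p>

```python
import numpy as np

def trace_centroids(past, new, beta=0.7):
    """Blend newly speculated centroids with the traced past ones.

    Older predictions decay geometrically: after s further passes,
    a pass still contributes proportionally to beta * (1 - beta)**s.
    """
    return beta * new + (1 - beta) * past

# Toy example: three speculation passes over k=2 centroids in 2-D.
rng = np.random.default_rng(0)
traced = rng.normal(size=(2, 2))                            # initial centroids
for _ in range(3):
    proposed = traced + rng.normal(scale=0.1, size=(2, 2))  # fast-execution proposal
    traced = trace_centroids(traced, proposed)
```

      <p>With beta = 1, tracing is disabled and only the newest speculation is kept; smaller values of beta give older speculation passes a longer-lasting, geometrically decaying influence.</p>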
      <p>An alternative approach is to gradually increase the sample size of the data points that the fast execution algorithm is working with. As previously mentioned in subsection 2.3, there is a trade-off between performance and accuracy when using subsamples in K-means, with smaller sample sizes leading to faster stages but less accurate predictions. At the beginning of the algorithm, it may be beneficial to use a small sample size to quickly explore different speculation paths and arrive at centroid estimates that are close to the final solution. In this phase, the accuracy of the predicted centroids is not as important as being able to delve into the speculation path for numerous stages. However, as we approach convergence, it becomes more crucial to have accurate speculation. In this phase, we need to make small adjustments to the centroids to find their optimal positions, and to do this, we need access to the full dataset. Thus, we can gradually increase the sample size to reduce the number of stages processed by the fast execution (which is not a significant concern at this point since we are close to convergence) and improve the accuracy of our predictions by having an increasingly complete view of the data.</p>
      5. Exploration: Escaping Local Minima
      <p>Figure 3 illustrates that both Speculation K-means and the vanilla version do not converge to the optimal minimum nearly as well as K-Means++ does in these runs. As discussed in Section 2, this corroborates that the initialization of K-means plays a crucial role in determining the quality of the final clustering. Initialization techniques like K-means++ improve the final results, but have an initial time cost (which we do not include in the comparison). We propose an alternative approach that avoids this initial cost while still leading closer to a global minimum by escaping the initial centroids through fast and randomized space exploration.</p>
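      <p>The gradual growth of the fast execution's sample size described above can be sketched as a simple schedule. The geometric ramp, the starting fraction, and the growth factor below are illustrative assumptions rather than tuned values from the paper:</p>

```python
import numpy as np

def sample_for_stage(data, stage, start_frac=0.01, growth=1.5, rng=None):
    """Return the subsample the fast execution works on at a given stage.

    Early stages see a tiny fraction of the data (fast, rough speculation);
    the fraction grows geometrically until the full dataset is used near
    convergence.
    """
    rng = rng or np.random.default_rng()
    frac = min(1.0, start_frac * growth ** stage)
    n = max(1, int(frac * len(data)))
    idx = rng.choice(len(data), size=n, replace=False)
    return data[idx]

data = np.random.default_rng(1).normal(size=(10_000, 2))
# Sample sizes grow monotonically toward len(data) as stages progress.
sizes = [len(sample_for_stage(data, s)) for s in (0, 7, 14)]
```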
      <p>Figure 1 and Algorithm 3 show how the fast execution starts from the same centroids as the slow execution. However, by doing this, both executions are conditioned by the first initialization of the centroids. Instead, we modify our speculative approach and fast execution to start from a slightly different set of centroids than the slow one. This allows taking advantage of the fast execution to explore more of the solution space, searching for global minima. At the same time, in the repair and consolidation stage, where we compare the discovered centroids with the slow phase by computing their inertia, the results will be at least as good as the vanilla approach if the more aggressive exploration failed. This method is similar to having vanilla K-means executed many times from different initial centroids and picking the best-obtained solution. Still, it is implemented only on the fast execution, and it is distributed over many speculative steps.</p>
      <p>Each time before starting a fast speculation phase, we sample k data points from the dataset as new initial centroids c′ and modify the starting centroids of the fast execution by combining the given initial centroids and the sampled ones linearly:</p>
      <p>c_fast = α ⋅ c_init + (1 - α) ⋅ c′     (2)</p>
      <p>The value α determines how intensely we want to perturb and mix the initial centroids: the closer it is to 0, the more random the centroids are, and the more we explore the solution space. This parameter expresses the degree of exploration we want from the fast execution. It can be set statically before the K-means execution to an optimal value, which may vary from dataset to dataset, or it can be adapted at run time by tuning the randomness based on the quality of the solutions we find stage after stage.</p>
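      <p>The perturbation of Equation (2) can be sketched as follows; the helper name and the default value of alpha are illustrative assumptions, not the prototype's code:</p>

```python
import numpy as np

def perturb_centroids(data, centroids, alpha=0.8, rng=None):
    """Mix the inherited centroids with k points drawn from the dataset.

    alpha close to 1 keeps the fast execution near the inherited
    centroids; alpha close to 0 explores the solution space more.
    """
    rng = rng or np.random.default_rng()
    k = len(centroids)
    sampled = data[rng.choice(len(data), size=k, replace=False)]
    return alpha * centroids + (1 - alpha) * sampled
```

      <p>With alpha = 1, the fast execution starts exactly from the inherited centroids, recovering the earlier formulation, while alpha = 0 corresponds to a fully random restart.</p>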
      <p>In Figure 5, we can see how using the centroid resampling technique with α = 0.8 improves the inertia of the final clusters.</p>
      <p>Table 1: Selection conditions over OpenML dataset metadata: lower and upper bounds on NumberOfInstances and NumberOfNumericFeatures, and equality constraints on NumberOfMissingValues and NumberOfSymbolicFeatures.</p>
      <p>With centroid resampling and starting centroid perturbation, we achieve guided and bounded randomization such that the increased exploration opportunities can potentially break a bad initialization without the explicit and expensive initial centroid computation of K-Means++. Still, while this speculative approach is probabilistic, it will not be worse than vanilla K-means, but it might not yield better results in all cases, conditional on the available hyperparameters such as the perturbation and sampling rate.</p>
      <p>Figure 5: Evolution of the inertia starting from the same initial centroids: one based on Vanilla K-means, the other on Speculation K-means with centroid resampling, and K-Means++ with its own initialization step, α = 0.8.</p>
      6. Holistic Evaluation
      <p>In this section, we evaluate our speculative K-means formulation against vanilla (Lloyd’s) K-means and K-means++ with a better initialization step. We focus on comparing parameters such as the inertia and the number of stages before convergence to allow an algorithm-oriented analysis of the speed and quality of clustering.</p>
      <p>We implement a prototype of speculative K-Means using Python and compare it against the K-means baselines available in the widely used Scikit-learn [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] machine learning library in Python.</p>
      6.1. Datasets
      <p>The datasets on which the evaluations are done are retrieved from OpenML [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], an online platform for machine learning research that provides a database of machine learning datasets. We query the platform for 10 datasets that satisfy the conditions shown in Table 1. This allows us to report an averaged-out summary of our findings that is not biased toward a given dataset and to improve the empirical evaluation using datasets available on the platform.</p>
      <sec id="sec-3-1">
        <title>6.2. Faster convergence with fewer stages</title>
        <sec id="sec-3-1-1">
          <p>In this section, we evaluate how speculative K-means can improve the performance of vanilla K-means by reducing the number of stages required to reach convergence. We conduct experiments on 10 datasets with the characteristics listed in Table 1, using both vanilla K-means and speculative K-means, with the number of clusters k ranging from 3 to 10. We measure the number of stages each algorithm takes to reach convergence, which we define as the point when the relative difference in inertia between two stages is below 10<sup>-3</sup> or when the assignments are no longer changing. Next, we compute the ratio of the number of stages taken by speculative K-means to the number of stages taken by vanilla K-means and plot a histogram of these ratios. Additionally, we compute the ratio of the inertia of the final centroids obtained from the two algorithms.</p>
          <p>In this setup, the goal is to have most of the stage ratios below the value of 0.5, and ideally closer to 0. This is because we are trying to reduce the convergence time, and to achieve a 50% reduction in overall time it is therefore necessary to reduce the number of stages by at least half. The results, shown in Figure 6, demonstrate that speculative K-means is often successful in reducing the number of stages to less than half the original amount, with most of the ratios in the range of 0 to 0.5.</p>
          <p>On the other hand, the ratio of the inertias on the same graph is better when the value approaches 1, showing that the quality of the final centroids obtained using Speculative K-means is not compromised, and may even be improved, despite the increased performance. This suggests that Speculative K-means has the potential to significantly improve the efficiency of the K-means algorithm without sacrificing solution quality.</p>
          <p>Previously, we showed that the speculation technique can significantly reduce the number of stages required for convergence in K-means clustering. However, there are some time offsets to consider due to the repair and speculation processes, including sampling a subset of points for the fast execution and the repair step, which is the most time-consuming due to the computation of the inertia of the centroids from the fast and slow execution threads.</p>
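          <p>The repair step chooses between the slow and the speculated centroids by comparing their inertia and keeping the better set. A minimal sketch of this comparison follows; the helper names are ours, and the prototype may organize it differently:</p>

```python
import numpy as np

def inertia(data, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def repair(data, slow_centroids, fast_centroids):
    """Keep the speculated centroids only if they do not increase inertia."""
    candidates = (fast_centroids, slow_centroids)
    scores = [inertia(data, c) for c in candidates]
    return candidates[int(np.argmin(scores))]
```

          <p>Because the worse candidate is always discarded, the consolidated result is never worse than what the slow execution alone would produce.</p>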
          <p>This computation is similar in complexity to the Assignment step, the most time-consuming step in K-means, as shown in Figure 2. In a simple implementation, there would be 1 Assignment step for the slow execution and 2 Inertia steps for the repair, totaling 3 steps comparable to the Assignment. This can be reduced to 2 steps by using the results from the inertia computation in one stage to compute the Assignment step in the next stage.</p>
          <p>As the Assignment step is the most time-consuming, each stage in Speculative K-means requires at most approximately double the time of a stage in Vanilla K-means. While halving the total number of stages can compensate for the increased time per stage, the overall execution time would then still be the same. Improving the computation of inertia through better parallelization, evaluation using hardware accelerators, techniques such as scan sharing, or alternative approaches such as avoiding the explicit computation of inertia could further improve the performance of Speculative K-means. These approaches remain part of future work and directions.</p>
          <p>Figure 7: The ratio between the inertia of the centroids found by Vanilla K-means, Speculative K-means, and K-means++ (ratio_pp) and the optimal inertia. Speculative K-means used a sampling rate of 0.01 and centroid resampling with α = 0.5.</p>
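          <p>Reusing one stage's distance computations for both the Assignment and the inertia, as described in this section, can be sketched with numpy as follows; this is an illustrative version, not the prototype's code:</p>

```python
import numpy as np

def assign_and_inertia(data, centroids):
    """One pass over the point-to-centroid distances yields both outputs.

    Computing these distances dominates a K-means stage; reusing them
    avoids a second, equally expensive inertia-only pass.
    """
    # Squared distances, shape (n_points, k).
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)        # Assignment step
    inertia = d2.min(axis=1).sum()    # inertia from the same distances
    return labels, inertia
```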
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>6.3. Exploration and escaping the local minima</title>
        <p>In this section, we show how Speculative K-means can escape local minima by using the centroid resampling technique described in section 5. To demonstrate this, we conducted executions with increasing values of k in a range between 3 and 10. We run three versions of K-means: vanilla K-means, K-means++, and Speculative K-means with centroid resampling. For all executions, we also compute the optimal solution and the respective inertia and then calculate the ratio of the inertia of the solution found by each of the three methods to the optimal one. Finally, we plot histograms of these ratios for all three methods. In these executions, Vanilla and Speculative K-means always started from the same initial centroids, whereas K-means++ used its own initialization technique.</p>
        <p>As depicted in Figure 7, Speculative K-means had the highest number of runs where the ratio was close to 1, indicating that it was the most successful in reaching closer to the global minimum inertia. It is followed closely by K-means++ and finally by Vanilla K-means. These measurements suggest that Speculative K-means, with centroid resampling, is indeed able to escape local minima without an initial time cost. In this particular case, it also outperforms K-means++ due to faster convergence and better exploration.</p>
        <p>The same observation holds when the distribution of the number of steps is plotted in Figure 8. While this corroborates that the initialization step is important, as K-means++ requires fewer steps than vanilla K-means, speculation allows convergence in even fewer steps.</p>
        <sec id="sec-3-2-1">
          <title>6.4. Adversarial dataset evaluation</title>
          <p>Finally, we evaluate the performance of speculative K-Means clustering over typical adversarial datasets depicted in Figure 9. The K-means algorithm family cannot provide satisfactory clustering and data separation that follows the given patterns. Still, using these kinds of datasets allows for exploring the impact of initialization and speculative execution.</p>
          <p>We first compare the inertia obtained by Vanilla K-Means, K-Means++, and Speculative K-Means in Figure 10. While the ideal inertia cannot be captured by the objective function of K-means clustering, the results show that Speculative K-Means is always better than or equal to Vanilla K-Means. On the other hand, the default parameters of the exploration mechanism did not escape the initial centroids in two cases, in comparison to K-Means++. This motivates a more detailed future case study regarding adaptive sampling and runtime tuning of the parameters that we introduced in the speculative version of K-means.</p>
          <p>Still, the number of steps to convergence, measured after the initialization step for K-Means++, as presented in Figure 11, demonstrates that Speculative K-Means allows consistently faster convergence with fewer iterative steps.</p>
        </sec>
        <sec id="sec-3-2-2">
          <p>We have extensively evaluated our approach against two baselines (K-means++ and vanilla K-Means) and multiple datasets, and we demonstrated our initial hypothesis that speculation brings faster convergence without loss of accuracy and allows probabilistic exploration to escape local minima without prior initialization. Future directions include a thorough evaluation of hardware-oriented optimizations, resource allocation, and the impact of speculative execution on the overall parallel execution cost.</p>
        </sec>
      </sec>
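    <p>Adversarial shapes of the kind shown in Figure 9, such as interleaved half-moons and concentric circles, violate the convex-cluster assumption of the K-means family and can be generated with scikit-learn. The sizes and noise levels below are illustrative, not necessarily those used in the evaluation:</p>

```python
from sklearn.datasets import make_moons, make_circles
from sklearn.cluster import KMeans

# Two classic adversarial shapes for centroid-based clustering.
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)
X_circ, y_circ = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

# K-means still converges, but its spherical-cluster bias means the
# learned partition cannot follow the moon or ring structure.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
```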
    </sec>
    <sec id="sec-4">
      <title>7. Conclusion</title>
      <p>Despite K-Means being a long-standing and well-studied machine learning algorithm, systems- and data-analytics-inspired tuning and optimizations such as speculation can bridge strict trade-offs between different desirable characteristics of algorithm variants. We combine randomized, sampling-based space exploration to allow escaping local minima while achieving strictly equal or better results than the original K-Means formulation. We break the cyclic dependency in the iterative K-Means formulation while preserving the original algorithm's output quality.</p>
      <p>The key characteristic of speculative K-Means is that while its phases concurrently optimize task-local objectives, cooperative merging and the lightweight repair designed for speculation in iterative algorithms allow merging and tuning toward the desired objective function. We extend speculative execution in data analytics to a first iterative machine learning algorithm and combine initially conflicting but advantageous methods through cooperative algorithmic and system design.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>We thank the anonymous reviewers for their insightful comments and detailed feedback. This work has been partially supported by a Facebook Next-generation Data Infrastructure Research Award (2021).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Data clustering: 50 years beyond k-means</article-title>
          ,
          <source>Pattern recognition letters 31</source>
          (
          <year>2010</year>
          )
          <fpage>651</fpage>
          -
          <lpage>666</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          ,
          <article-title>Least squares quantization in pcm</article-title>
          ,
          <source>IEEE transactions on information theory 28</source>
          (
          <year>1982</year>
          )
          <fpage>129</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          ,
          <article-title>Sampling-based aqp in modern analytical engines</article-title>
          ,
          <source>in: Data Management on New Hardware</source>
          , DaMoN'22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          . URL: https://doi.org/10.1145/3533737.3535095. doi:10.1145/3533737.3535095.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Hennessy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <article-title>Hardware-based speculation</article-title>
          , in: Computer Architecture: A Quantitative Approach, 6th ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>2017</year>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sioulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanca</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Mytilinis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          ,
          <article-title>Accelerating complex analytics using speculation, 2021</article-title>
          . URL: https://www.cidrdb.org/cidr2021/papers/cidr2021_paper03.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fränti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sieranoja</surname>
          </string-name>
          ,
          <article-title>How much can k-means be improved by using better initialization and repeats?</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>93</volume>
          (
          <year>2019</year>
          )
          <fpage>95</fpage>
          -
          <lpage>112</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0031320319301608. doi:10.1016/j.patcog.2019.04.014.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Arthur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          ,
          <article-title>K-means++: The advantages of careful seeding</article-title>
          , volume
          <volume>8</volume>
          ,
          <year>2007</year>
          , pp.
          <fpage>1027</fpage>
          -
          <lpage>1035</lpage>
          . doi:10.1145/1283383.1283494.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bejarano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brannan</surname>
          </string-name>
          , A. Thomas,
          <string-name>
            <given-names>K.</given-names>
            <surname>Adragni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Neerchal</surname>
          </string-name>
          , G. Ostrouchov,
          <article-title>Sampling within k-means algorithm to cluster large datasets</article-title>
          ,
          <source>UMBC Student Collection</source>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          ,
          <article-title>Web-scale k-means clustering</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on World Wide Web, WWW '10</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2010</year>
          , p.
          <fpage>1177</fpage>
          -
          <lpage>1178</lpage>
          . URL: https://doi.org/10.1145/1772690.1772862. doi:10.1145/1772690.1772862.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Convergence properties of the k-means algorithms</article-title>
          , in: G. Tesauro, D. Touretzky, T. Leen (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>7</volume>
          , MIT Press,
          <year>1994</year>
          . URL: https://proceedings.neurips.cc/paper/1994/ file/a1140a3d0df1c81e24ae954d935e8926-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chrysogelos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          ,
          <article-title>LAQy: Efficient and reusable query approximations via lazy sampling</article-title>
          ,
          <year>2023</year>
          , p.
          <fpage>15</fpage>
          . doi:10.1145/3589319.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>Parallelizing k-means-based clustering on spark</article-title>
          ,
          <source>in: 2016 International Conference on Advanced Cloud and Big Data (CBD)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>36</lpage>
          . doi:10.1109/CBD.2016.016.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kusuma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ma'Sum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habibie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jatmiko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Suhartanto</surname>
          </string-name>
          ,
          <article-title>Design of intelligent k-means based on spark for big data clustering</article-title>
          ,
          <source>in: 2016 international workshop on Big Data and information security (IWBIS)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Parallel k-means clustering based on mapreduce</article-title>
          , in: Cloud Computing: First International Conference, CloudCom
          <year>2009</year>
          , Beijing, China, December 1-
          <issue>4</issue>
          ,
          <year>2009</year>
          . Proceedings 1, Springer,
          <year>2009</year>
          , pp.
          <fpage>674</fpage>
          -
          <lpage>679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <article-title>The study of parallel k-means algorithm</article-title>
          ,
          <source>in: 2006 6th World Congress on Intelligent Control and Automation</source>
          , volume
          <volume>2</volume>
          , IEEE,
          <year>2006</year>
          , pp.
          <fpage>5868</fpage>
          -
          <lpage>5871</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Farivar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rebolledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <article-title>A parallel implementation of k-means clustering on gpus</article-title>
          .,
          <source>in: Pdpta</source>
          , volume
          <volume>13</volume>
          ,
          <year>2008</year>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <article-title>Rules of thumb in data engineering</article-title>
          ,
          <source>in: Proceedings of 16th International Conference on Data Engineering (Cat. No. 00CB37073)</source>
          , IEEE, ????, pp.
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. N. van Rijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bischl</surname>
          </string-name>
          , L. Torgo,
          <article-title>Openml: networked science in machine learning</article-title>
          ,
          <source>SIGKDD Explorations</source>
          <volume>15</volume>
          (
          <year>2013</year>
          )
          <fpage>49</fpage>
          -
          <lpage>60</lpage>
          . URL: http://doi.acm.org/10.1145/2641190.2641198. doi:10.1145/2641190.2641198.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          OpenML dataset 23395. URL: https://www.openml.org/d/23395.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          ,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>