1. Introduction

Clustering Classical Data with Quantum k-Means⋆

Alessandro Poggiali

Alessandro Berti

Anna Bernasconi

Gianna M. Del Corso

Riccardo Guidotti

1 0 University of Piemonte Orientale , Italy 1 University of Pisa , Italy

2022

7 9

In the last years, we have witnessed an unstoppable growth of data created, captured, copied, and consumed globally by more and more devices. The demand for such an increasing amount of information to be processed led to research towards higher computational power systems and specialized algorithms. Among them, quantum computing is a promising paradigm based on quantum theory for performing fast computations. Quantum algorithms are expected to surpass their classical counterparts in terms of computational complexity for certain kinds of tasks, and machine learning is one of them, so the subfield of Quantum Machine Learning is one of the most promising. In this work, we design a hybrid quantum algorithm for k-Means. The main idea of our algorithm is to compute in a quantum way the distance between pairs of records in the input dataset. We show that our quantum algorithm could be, in principle, more eficient than the classical k-Means, yet obtain comparable clustering results.

eol>Quantum Machine Learning Clustering Data Mining

1. Introduction

Quantum Machine Learning (QML) is the branch of Quantum Computing (QC) that attempts to adapt classical data mining and machine learning algorithms, or their expensive subroutines, to run on a potential quantum computer. Indeed, the expectation is that such machines will be commonly available for applications in the near future. Many QML algorithms have been recently studied [ 1 ]. However, the translation of classical algorithms into their quantum counterpart named “quantization” is not trivial and hides many dificulties. In this work, we focus on the quantization of -Means [ 2 ], which is one of the most famous algorithms used for clustering. The -Means clustering algorithm is an unsupervised learning algorithm, and its goal is to find natural groups of elements in a data set. In particular, the elements inside a group are more similar to the central element of the group than to the central elements of other groups, according to a specific distance measure. Building a quantum version of this algorithm means creating a quantum circuit that takes classical data as input and exploits quantum gates to perform the computation, satisfying all the quantum mechanical constraints. Nowadays quantum computers are in the noisy intermediate-scale quantum (NISQ) era[ 3 ]. This means that today, quantum computers are prone to noises that generate errors in computation. This is known as the decoherence problem. Typically, the deeper the circuit, the more prone the output is to errors that make it unreliable.

For this reason, this work presents a hybrid -Means that exploits a quantum subroutine to boost the distance computation between each record and each centroid while the overall algorithm remains classical. This solution permits the mitigation of the decoherence problem in this NISQ era. In our analysis, we use the classical --Means algorithm to provide a measure of error for our hybrid -Means that we call --Means algorithm. The experiments show that our quantum algorithm could be, in principle, more eficient than its classical counterpart yet obtain comparable clustering results.

The rest of the paper organizes as follows. Section 2 provides the related works, Section 3 sets up the stage by defining the notations, Section 4 describes the Quantum k-Means, Section 5 reports the experimental results, and eventually, Section 6 summarizes contributions.

2. Related Works

In general, the problem of quantum clustering can be addressed in several ways. Some studies were inspired by quantum theory. For instance, the classical clustering method proposed in [ 4 ] is based on physical intuition derived from quantum mechanics. In [ 5 ], the authors perform clustering by exploiting a well-known reduction from clustering to the Maximum-Cut problem that is then solved using a quantum algorithm for approximate combinatorial optimization. In [ 6 ], the authors present an unsupervised quantum learning algorithm for -Means clustering based on adiabatic quantum computing [ 7 ].

The general idea for quantizing classical clustering algorithms is to substitute the most expensive parts in the algorithm with more eficient quantum subroutines. For instance, in [ 8] the fidelity distance measure is used for distance computation between each pair of states in the dataset. The fidelity is eficiently estimated with a quantum circuit containing only a few gates (two Hadamard gates and a Control-Swap). In this way, the algorithm can perform clustering directly on quantum states. However, the paper lacks a discussion on how the algorithm could deal with classical data.

As alternative approaches, full quantum routines for clustering have been proposed. In [9] two subroutines based on Grover’s search algorithm [10] are used to accelerate classical clustering methods. For the -Means quantization, the authors use the subroutines in this way. First, the total distance of each state to all other states of one cluster is computed with the help of an oracle that calculates the distance between two quantum states. Then, another routine finds the smallest value of this distance function in order to select a quantum state as the new centroid for the cluster. Unfortunately, this approach cannot be used in practice because the oracle is not described in detail.

Finally, in [11] it is proposed a quantum version of -Means (called -Means) that provides an exponential speedup in the number of records of the dataset compared to the classical version. Moreover, -Means returns explicit classical descriptions of the final centroids. Although Means looks promising from a practical point of view, the paper discussion is strictly theoretical. The experiments are performed using a classical algorithm ( --Means) simulating -Means, instead of a real quantum version.

Diferent from the literature, our work concentrates on practical problems arising when implementing a quantum clustering algorithm, with particular attention to the encoding of classical data.

3. Setting the stage

We keep our paper self-contained by summarizing the key concepts necessary to comprehend our work. Given a dataset = {⃗1, . . . , ⃗ } of records where every record ⃗ is a dimensional vector of numerical value, the goal of clustering is to assign each record to one out of diferent clusters {1, . . . , }, represented by centroids {⃗1, . . . , ⃗} respectively, so that similar records share the same assignment. The -Means algorithm is one of the most famous clusterings algorithms [12]. After randomly1 choosing initial centroids, the algorithm consists of two repeated steps until a certain convergence condition is met. The first step is the cluster assignment step: every element in the dataset has to be assigned to its closest centroid according to a specific distance measure. As distance, we consider the Eucledian distance that is defined as (⃗, ⃗) = √︁∑︀=1 ( − )2, where and are two -values vectors. The second step is the centroids update step, where for every cluster we recompute, the new cluster center to be used as a centroid for the next iteration. In this work, we concentrate on the cluster assignment step whose time complexity is ( ) where is the number of centroids, is the number of records, and their dimension.

As already mentioned in Section 2, a quantum version of -Means, called -Means, is proposed in [11]. The authors evaluate -Means by means of --Means, a “quantum-approximation” of -Means that simulates the quantum calculus that -Means is supposed to do. More precisely, --Means simulates the classical -Means algorithm as performed in a quantum environment. 1We adopt the clever initialization proposed in [13].

|0⟩

Since a quantum algorithm can introduce errors due to decoherence and noise present in quantum machines, --Means simulates such errors by introducing some noise in both steps of -Means. However, as discussed in Section 4, in the quantum version of -Means we propose in this paper, we keep the second step (i.e., centroids update) classical. For this reason, we will consider in the experiments a slightly diferent version of --Means where we introduce the noise only in the cluster assignment step.

In Algorithm 1, we show the pseudocode of the updated version of the considered --Means. Let ⃗ be the centroid closest to the point ⃗. Then, --Means defines the set of possible labels (⃗) for ⃗ as follows: (⃗) = {⃗ : ⃒⃒⃒ 2(⃗, ⃗) − 2(⃗, ⃗)⃒⃒ ≤ }.

⃒ When = 0, --Means is equivalent to the standard -Means since no uncertainty is included and (⃗) contains only the closest centroid. On the other hand, a high value of allows -Means to include in (⃗) not very close centroids, bringing more noise in the entire procedure. Indeed, the assignment rule selects randomly a cluster label from the set (⃗) (see line 5 in Algorithm 1). In [11] it is proven that if the data are “well-cluserable” (see [11] for the detailed definition) and the centroids are well separated, the --Means algorithm succeeds assigning the right cluster to most of the points for a suitable value of depending on the separation of the centroids.

4. Quantum k-Means

Classical information can be encoded in diferent ways into a quantum state. In [ 14], the authors revisit several data encoding strategies and quantum distance algorithms. The process of encoding input numerical features into the amplitude of a quantum system is called amplitude encoding [15]. Amplitude encoding allows the design of quantum circuits that compute distances between quantum states.

One distance measure commonly used in ML is the Euclidean distance [12]. We show here a general circuit for computing a quantum Euclidean distance (, ) between two general quantum states | ⟩, |⟩ encoded in a register via amplitude encoding [16]. To compute this distance, we use an additional ancilla qubit entangled with the two states | ⟩ and |⟩. This can be accomplished by first applying an Hadamard gate on the ancilla , and then by loading in the register the two states | ⟩ and | ⟩ conditioned on the ancilla. Then, the initial state |0⟩ |00 · · · 0⟩ evolves in √12 (|0⟩ | ⟩ + |1⟩ |⟩). Eventually, we apply a Hadamard gate |0⟩ |0⟩⊗ ^ |0⟩

ampl. | ⟩ 0 1 on ancilla . The corresponding state becomes: 12 (︀ |0⟩ (| ⟩ + |⟩) + |1⟩ (| ⟩ − | ⟩)︀) . The probability of measuring the ancilla in the state |0⟩ is given by = 14 ‖ + ‖22 which corresponds to = 1 − 41 ‖ − ‖22 = 1 − 14 (, )2, since and are unit vectors. The overall circuit computing the Euclidean distance between two states and is illustrated in Figure 1.

In the rest of this section, we propose our quantum version of -Means. In particular, what we want to “make quantum” is the computation of the distance between two records. This is similar to what some previous works have proposed [8], but here we give a practical implementation of the entire algorithm.

In order to eficiently load classical data in a suitable quantum state, we employ the FFQRAM algorithm [17]. In particular, we deal with -feature vectors, and we use the FF-QRAM algorithm to amplitude encode each vector.

Once data have been loaded, we build a quantum circuit that computes the distance between a single pair of vectors. In the -Means context, we compute distances between every record and every centroid in order to assign a cluster label to every record in the dataset. This can be accomplished by using the quantum circuit shown in Figure 1 that computes the Euclidean distance between two quantum states generated by the amplitude encoding technique.

Figure 2 illustrates the whole quantum circuit (QC). QC can compute the Euclidean distance simultaneously between the features of two vectors. The ^-qubit register |⟩ is the index register for the features of each vector. It consists of ^ = ⌈log2 ⌉ qubits which control the rotation of a qubit |⟩.

To get a proper estimate of the Euclidean distance, we need to repeat the execution of this circuit a certain number of times . In this way, we can estimate the Euclidean distance as √︂ (, ) = 4 − 4 ︁( #|0′⟩ ︁) , where # |0⟩ is the number of times the ancilla qubit is measured in the state |0⟩. Actually, also the FF-QRAM encoding procedure requires a post-selection on the qubit |⟩ (see [17, 14] for more details). Thus, # |0⟩ is the number of times the outcome for ancilla is 0 after having post-selected the result where qubit |⟩ was in 1. Notice that, here, ′ < is not the total number of repetitions, but the number of times the post-selection on |⟩ is successful.

Algorithm 2 describes the pseudocode of the proposed procedure, --Means, for cluster computation. We adopt -Means++ [13] as centroid initialization strategy (line 1). Then, we Algorithm 2: q--Means

Input: - input data, - number of clusters, - number of quantum shots

Output: - records to clusters assignment, C - centroids 7 compute the quantum Euclidean distance between each record ⃗ and each centroid ⃗. In particular, the quantum distance computation is repeated for each of the pairs of vectors, where is the number of records in the dataset and is the number of centroids (lines 3-7). Then, the algorithm assigns ⃗ to the cluster such that the distance between ⃗ and the centroid ⃗ of the cluster is the minimum (lines 8-10). Eventually, each centroid ⃗ is updated (lines 11-12). The output of --Means consists of a final cluster assignment and the corresponding centroids .

Our hybrid approach improves the cluster assignment step with respect to the classical Means by a factor if we do not consider the QRAM preparation. In particular, the overall complexity of the cluster assignment step is ( ), where is the number of records and is the number of centroids, plus the cost of the QRAM preparation.

5. Experiments

In this section, we assess the efectiveness of the --Means algorithm2. We intend to evaluate the capability of --Means in terms of clustering quality by simulating it on a number of datasets. Since quantum computers currently available are not large enough to test --Means, we exploit the qasm_simulator provided by qiskit3. 2Python code at: https://github.com/AlessandroPoggiali/Qkmeans-ICTCS 3Qiskit library: https://qiskit.org/

For an evaluation of the algorithm as much complete as possible, the parameters that we are going to test follow: • shots (): how many times we repeat the execution of a quantum circuit. • sc_thresh: it rules the stopping condition of the algorithm. It is defined as the relative tolerance with regards to the Frobenius norm of the diference in the cluster centers of two consecutive iterations to declare convergence. • max_iterations (max_ite): the maximum number of iterations the algorithm can perform.

If not diferently specified, we tested --Means with the following parameters: max_ite: 10, sc_thresh: 0.0001, and shots: 8192 on all four synthetic datasets.

In order to evaluate the algorithm performance and the cluster quality, our analysis considers the following measures.

• n_ite (ite): the actual number of iterations --Means performs to converge. • avg_similarity (sim): the concept of similarity is defined in terms of how accurately --Means assigns the right centroids to records with respect to the classical assignments. The avg_similarity measure is basically the average similarity among all iterations of the algorithm. • silhouette (sil): the Silhouette Coeficient measures how an element is similar to its cluster with respect to the other clusters. • SSE [18]: it corresponds to the sum of the squared diferences between each record and its centroid. • v_measure (vm) [19]: it measures the correctness of the clustering assignments with respect to ground truth.

Dataset. For the --Means assessment, we consider two groups of datasets: the first group of four synthetic datasets and two real datasets. The synthetic datasets come from the clustering guide of scikit-learn4. These datasets show the characteristics of diferent clustering algorithms on datasets that are “interesting” but still in two dimensions. For these datasets, we also have the ground truth (i.e., the actual clustering we want to obtain) so that we can have an objective evaluation of the algorithm. They include: aniso, blobs, blobs2, and moon (also known as noisymoon). While instead, the real datasets we take into consideration are: iris, and diabetes. Regarding the real datasets, we have no ground truth. Thus, we have to choose a suitable number of centroids . As synthetic datasets, the real datasets are available on scikit-learn.

Data Preprocessing. The essential data preprocessing required by --Means consists of two steps: standardization and normalization. The standardization of the dataset has the efect of having zero mean and unit variance among all samples. This is common practice in ML to compensate for scaling efects and to ensure that the data does not only populate a small subspace of the input space. In fact, input spaces in higher dimensions lead to indistinguishably small distances between data points. The normalization applies row by row and allows us to deal with unit length vectors. While the dataset standardization is always applied, we have two diferent preprocessing techniques available: the normalization to unit length ( 1-norm) and the inf-norm preprocessing, which is equivalent to dividing the entries of the vector by the component of maximum modulus.

Finally, vector values are converted to suitable angles in order to encode them in the FFQRAM. Note that the ancilla post-selection probability for the quantum Euclidean distance is always around 0.5 [15]. However, we can enhance the FF-QRAM post-selection probability applying the inf-norm normalization [20].

Looking at the algorithm assessments, the execution with the inf-norm gives comparable results with respect to the 1-norm (see Table 1). In this test, we executed --Means on blobs2 dataset using 1024 shots. The Silhouette score tells us that in the inf-norm case, we obtain a better clustering, and also the v_measure tells us that the final assignment is closer to the ground truth. Instead, the SSE in the inf-norm case is worse than in the 1-norm case. This depends on the range of input data, and hence it does make sense only when comparing results whereby input data have their range of values comparable. We report in Figure 3 the output in the original space of the final clustering assignment. We observe that we get a more accurate clustering in the second case, even though bad-clustered points are still present. Henceforth, we adopt the inf-norm as the default preprocessing.

Results on Synthetic Datasets. Table 2 reports the evaluation measures observed for -Means on synthetic datasets. The values show good performance on aniso, blobs, and blobs2. In fact, the measures sil and vm highlight a good clustering output. While, in the moon dataset, in spite of having a high similarity value, the vm returned is low. This probably happens because the moon dataset, as it is generated, is not a well-clusterable dataset. In fact, it has no spherical clusters hence algorithms like -Means will not be able to identify the right shapes of its clusters. Table 2 reports also the results obtained with --Means and -Means. In particular, --Means produces a clustering with error when the --Means similarity is comparable to the --Means similarity. A final consideration concerns the confusion matrices in Table 3, where we report the percentages of True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). We observe that -Means and --Means typically output the same results on the datasets blobs, blobs2, and moon. In fact, the value + in these datasets is quite small. This does not hold for the aniso dataset where, instead, + = 8.71%.

Results on Real Datasets. With real datasets, we have no prior knowledge about the clustering result we should achieve. In other words, the number of clusters has to be selected properly in order to get good clusterization. A very common heuristic used in clustering to determine the number of clusters in a dataset is the elbow method [12]. Our first test aims at performing the elbow method with the -Means and --Means algorithm separately on the iris dataset. We repeat the execution of both algorithms by varying from 2 to 8. We obtain the curves in Figure 4. A suitable value of for both algorithms is 3 because this is the point where the SSE stops decreasing sharply. The comparison between the classical and the quantum algorithm with this configuration is reported in Table 4. We repeated the same experiment on the diabetes dataset. Again, we select the right using the elbow method with -Means, and then we compare the result that --Means gives us for the same . Figure 4 shows that a good value for is 8 for this dataset. Hence, by executing the --Means algorithm with = 8, we obtain the result in Table 4. The result shows that --Means performs not well compared to --Means and -Means according to SIL measure. A possible explanation can be related to the use of PCA to reduce the number of features from 10 to 4, which is necessary for executing the algorithm in a reasonable amount of time, due to the technological limits. This makes records belonging to diferent clusters too similar, and the number of shots was insuficient to estimate a suficiently precise Euclidean distance, which therefore afected the quality of the clustering. Eventually, we observe that the SIL measures of -Means and --Means on the iris dataset are similar, while in the diabetes dataset, -Means performs better than --Means.

--Means on Real Quantum Hardware. All tests reported up to now were carried out using the qasm simulator, a simulator provided by qiskit which simulates quantum computation by using classical hardware. Here, we show the best we can do with currently available quantum computers. IBM ofers cloud access 5 to some of their quantum computers, so it is possible to delegate the execution of a quantum circuit to a real quantum machine. We had access to quantum computers with no more than five qubits. For this reason, we must consider simple instances of our --Means where no more than five qubits are necessary.

The first test considers --Means where we use three qubits: one qubit for the ancilla |⟩, one for the register |⟩, and one for addressing = 2 features (i.e., |⟩). Instead of simulating each of the quantum circuits locally, we first check the least busy quantum computer available and send to it every circuit to be executed. This requires several steps, like waiting on the queue of a quantum computer and receiving the result, so it introduces an overhead. Furthermore, this communication overhead will be paid per pair of vectors, so it could highly afect the overall performance of the algorithm, especially for big datasets. For this reason, we took into account for this experiment a small dataset (blobs3) consisting of = 16 two dimensional vectors, which form two well-distinguishable spherical clusters (Fig. 5). Notice that the number of clusters is not involved in the quantum circuit preparation, but we chose = 2 to simplify the overall execution.

We compare the output of --Means executed using real quantum hardware with the output of the algorithm executed by the simulator. In Table 5 we report this comparison, while in Figure 6 we show the clustering obtained.

From the table, we can see that both clusterization were successful with respect to the ground truth with the same number of iterations. However, to conclude, until a large-scale noise-free quantum computer is available, testing complex quantum circuits on real quantum hardware will be an unfeasible task.

6. Conclusion

We have proposed --Means, a hybrid approach for clustering classical data. The algorithm implements a quantum subroutine to boost the cluster assignment step of the classical -Means. In particular, this quantum subroutine computes the Euclidean distance between two -dimensional vectors, i.e., a record and a cluster centroid. The complexity of this step is ( ), where and are the number of records and the number of centroids, respectively. The experiments show that --Means could be in principle more eficient than classical -Means, yet obtaining comparable clustering results. In this work, we exploited quantum parallelism only to compute distances. Our future work is to design and analyze two variants of --Means that leverage quantum parallelism to compute (i) the distances between a single record and centroids simultaneously, and (ii) the distances between records and centroids simultaneously.

Acknowledgment. This work is supported by Università di Pisa under the “PRA – Progetti di Ricerca di Ateneo” (Institutional Research Grants) - Pr. no. PRA 2020-2021 92, “Quantum Computing, Technologies and Applications”, and the work of Gianna Del Corso is also supported by GNCS-INdAM. This work has been partially supported by the European Community Horizon 2020 programme under the funding schemes: H2020-INFRAIA-2019-1: Research Infrastructure G.A. 871042 SoBigData++. [8] E. Aïmeur, G. Brassard, S. Gambs, Machine learning in a quantum world, in: Conference of the Canadian Society for Computational Studies of Intelligence, Springer, 2006, pp. 431–442. [9] E. Aïmeur, G. Brassard, S. Gambs, Quantum clustering algorithms, in: Proceedings of the 24th international conference on machine learning, 2007, pp. 1–8. [10] L. K. Grover, A fast quantum mechanical algorithm for database search, in: Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, 1996, pp. 212–219. [11] I. Kerenidis, J. Landman, A. Luongo, A. Prakash, q-means: A quantum algorithm for unsupervised machine learning, Advances in Neural Information Processing Systems 32 (2019). [12] P. Tan, et al., Introduction to Data Mining, Addison-Wesley, 2005. [13] S. Vassilvitskii, D. Arthur, k-means++: The advantages of careful seeding, in: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2006, pp. 1027– 1035. [14] A. Berti, A. Bernasconi, G. M. Del Corso, R. Guidotti, Efect of Diferent Encodings and Distance Functions on Quantum Instance-based Classifiers, in: Proceedings 26th Pacific-Asia Conference on Knowledge Discovery and Data Mining PAKDD, 2022. [15] M. Schuld, M. Fingerhuth, F. Petruccione, Implementing a distance-based classifier with a quantum interference circuit, EPL (Europhysics Letters) 119 (2017) 60002. [16] M. Schuld, et al., Supervised learning with quantum computers, Springer, 2018. [17] D. K. Park, F. Petruccione, J.-K. K. Rhee, Circuit-based quantum random access memory for classical data, Scientific reports 9 (2019) 1–8. [18] P. J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of computational and applied mathematics 20 (1987) 53–65. [19] A. Rosenberg, J. Hirschberg, V-measure: A conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLPCoNLL), 2007, pp. 410–420. [20] T. M. de Veras, I. C. De Araujo, D. K. Park, A. J. da Silva, Circuit-based quantum random access memory for classical data with continuous amplitudes, IEEE Transactions on Computers 70 (2020) 2125–2135.

[1]

Schuld ,

Sinayskiy ,

Petruccione , An introduction to quantum machine learning , Contemporary Physics 56 ( 2015 ) 172 - 185 .

[2]

MacQueen , et al., Some methods for classification and analysis of multivariate observations , in: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability , volume 1 , Oakland, CA, USA, 1967 , pp. 281 - 297 .

[3]

Preskill , Quantum computing in the NISQ era and beyond , Quantum 2 ( 2018 ) 79 .

[4]

Horn ,

Gottlieb , Algorithm for data clustering in pattern recognition problems based on quantum mechanics , Physical review letters 88 ( 2001 ) 018702 .

[5]

Otterbach ,

Manenti ,

Alidoust ,

Bestwick ,

Block ,

Bloom ,

Caldwell ,

Didier ,

E. S.

Fried ,

Hong , et al., Unsupervised machine learning on a hybrid quantum computer , arXiv preprint arXiv:1712.05771 ( 2017 ).

[6]

Lloyd ,

Mohseni ,

Rebentrost , Quantum algorithms for supervised and unsupervised machine learning , arXiv preprint arXiv:1307.0411 ( 2013 ).

[7]

Farhi ,

Goldstone ,

Gutmann ,

Sipser , Quantum computation by adiabatic evolution , arXiv preprint quant-ph/0001106 ( 2000 ).