A QUBO Formulation of the k-Medoids Problem

C. Bauckhage (1,3,4), N. Piatkowski (2), R. Sifa (1,3), D. Hecker (1,3), and S. Wrobel (1,3,4)

1 Fraunhofer Center for Machine Learning, Sankt Augustin, Germany
2 AI Group, TU Dortmund, Dortmund, Germany
3 Fraunhofer IAIS, Sankt Augustin, Germany
4 B-IT, University of Bonn, Bonn, Germany

Abstract. We are concerned with k-medoids clustering and propose a quadratic unconstrained binary optimization (QUBO) formulation of the problem of identifying k medoids among n data points without having to cluster the data. Given our QUBO formulation of this NP-hard problem, it should be possible to solve it on adiabatic quantum computers.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Quadratic unconstrained binary optimization problems (QUBOs) are concerned with finding an n-dimensional binary vector that minimizes a quadratic objective

\[ z^\ast = \operatorname*{argmin}_{z \in \{0,1\}^n} \; z^\top Q z + z^\top q . \tag{1} \]

They thus pose combinatorial optimization problems which generally prove to be NP-hard. Indeed, difficult problems such as capital budgeting or task allocation in operations research, constraint satisfaction in AI, or maximum diversity, maximum clique, or graph partitioning in data mining and machine learning are but a few examples of where QUBOs occur in practice [1].

Owing to their practical importance, there exists a venerable literature on QUBOs and efficient solution strategies (for an extensive survey, see [2] and the references therein). More recently, however, interest in QUBOs has also arisen in a different context. Since adiabatic quantum computers such as those produced by D-Wave [3,4] are designed to solve them, research on QUBO reformulations of various problems has noticeably intensified. Indeed, Boolean satisfiability [5], graph cuts [6,7,8], graph isomorphisms [9], binary clustering [6,10], and classifier training [11,12] have lately been solved via adiabatic quantum computing.

Following this line of research, we propose a QUBO formulation of the problem of identifying k medoids among n data points which, to our knowledge, has neither been attempted nor reported before. While we do not consider a quantum computing implementation itself, our result indicates that prototype extraction can be accomplished on quantum computers.

Next, we briefly summarize the notion of a medoid and the ideas behind k-medoids clustering. We then present our QUBO formulation of the k-medoids problem and discuss simple examples that illustrate the practical performance of our approach. Finally, we review related work and summarize our contributions.

2 k-Medoids Clustering

Consider a sample $\mathcal{X} = \{x_1, \ldots, x_n\}$ of n Euclidean data points $x_i \in \mathbb{R}^m$. The well established sample mean

\[ \mu = \operatorname*{argmin}_{x \in \mathbb{R}^m} \sum_{x_i \in \mathcal{X}} \lVert x_i - x \rVert^2 = \frac{1}{n} \sum_{x_i \in \mathcal{X}} x_i \tag{2} \]

is a frequently used summary statistic for such data. The so-called sample medoid

\[ m = \operatorname*{argmin}_{x_j \in \mathcal{X}} \sum_{x_i \in \mathcal{X}} \lVert x_i - x_j \rVert^2 , \tag{3} \]

on the other hand, is arguably less widely known but has interesting applications as well.

Fig. 1: Three didactic data samples ((a) a Gaussian blob, (b) a ring, (c) a spiral) together with their means and medoids.

Figure 1 illustrates characteristics of means and medoids. Both minimize sums of distances to available sample points. However, while the mean is not necessarily contained in the sample, the medoid always coincides with an element of the sample and, for squared Euclidean distances as in (3), it is easy to prove that this element is the sample point closest to the mean [13].
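To make the difference between (2) and (3) concrete, both statistics can be computed in a few lines of NumPy. The sketch below is ours and not part of the original paper; the function names sample_mean and sample_medoid are purely illustrative.

import numpy as np

def sample_mean(X):
    """Sample mean as in equation (2); X is an (n, m) array of data points."""
    return X.mean(axis=0)

def sample_medoid(X):
    """Sample medoid as in equation (3): the data point whose summed squared
    Euclidean distance to all other sample points is smallest."""
    # pairwise squared distances D_ij = ||x_i - x_j||^2
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    j = np.argmin(D.sum(axis=0))  # column sums: total squared distance to all points
    return X[j]

# usage: the medoid of a small 2D sample is an actual sample point,
# whereas the mean generally is not
X = np.random.randn(7, 2)
print(sample_mean(X), sample_medoid(X))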
An interesting feature of medoids is that they result from evaluating squared distances between given data only. (Note once again that $x$ in equation (2) may not be contained in $\mathcal{X}$, whereas $x_i$ and $x_j$ in equation (3) always are.) As these distances can be precomputed and stored in an $n \times n$ distance matrix $D$ with $D_{ij} = d^2(x_i, x_j)$, medoids can be computed from relational data. Contrary to means, medoids may thus also be estimated on sets of strings or graphs or other non-numeric data as long as there is an appropriate distance measure $d(\cdot,\cdot)$ [13]. Moreover, since medoids coincide with actual data points, analysts often find them easy to interpret. This is why k-medoids clustering is an increasingly popular tool whenever explainable or physically plausible prototypes need to be determined [14,15,16,17,18].

Conceptually, k-medoids clustering is almost identical to k-means clustering. To accomplish it, we may, for instance, simply adapt Lloyd's algorithm [19]. In other words, given randomly initialized medoids, we may determine clusters by assigning data points to their closest medoid, update the medoid of each cluster, and repeat these steps until clusters stabilize. Note, however, that (local) medoids, too, are relational objects that can be determined solely by analyzing a distance matrix. Given such a distance matrix for a data set $\mathcal{X}$, k-medoids clustering can thus be performed on index sets only.

We demonstrate this in Algorithm 1, where $\mathcal{I} = \{1, 2, \ldots, n\}$ indexes the data in $\mathcal{X}$. Having initialized a set $\mathcal{M} = \{j_1, j_2, \ldots, j_k\} \subset \mathcal{I}$ of medoid indices, each cluster $\mathcal{C}_l$ is an index set containing the indices of those data points $x_i$ that are closest to medoid $m_{j_l}$ (equation (4)). Given the clusters, each medoid index $j_l$ is then set to the index of the medoid of the points indexed by $\mathcal{C}_l$ (equation (5)).

Algorithm 1: k-medoids clustering via Lloyd's algorithm
Require: index set $\mathcal{I} = \{1, \ldots, n\}$, distance matrix $D \in \mathbb{R}^{n \times n}$, and parameter $k \in \mathbb{N}$
  initialize a set of k cluster medoid indices $\mathcal{M} = \{j_1, j_2, \ldots, j_k\} \subset \mathcal{I}$
  repeat
    for $l = 1, \ldots, k$ do: determine cluster $\mathcal{C}_l = \{\, i \in \mathcal{I} \mid D_{i j_l} \le D_{i j_q} \ \forall q \,\}$   (4)
    for $l = 1, \ldots, k$ do: update medoid index $j_l = \operatorname*{argmin}_{j \in \mathcal{C}_l} \sum_{i \in \mathcal{C}_l} D_{ij}$   (5)
  until clusters stabilize

Similar to k-means clustering, k-medoids clustering subdivides the input space into convex cells, each of which is defined w.r.t. a prototype. Just as in k-means clustering via MacQueen's algorithm [20], k-medoids clustering could therefore in principle be performed in two steps where medoids are determined first and data are assigned to them second. However, contrary to k-means clustering, the combinatorial nature of k-medoids clustering, which selects prototypes from a discrete set of data points, prevents the use of continuous MacQueen-type updates of prototypes. Indeed, we are not aware of prior work on how to find local medoids without having to compute clusters. The quadratic unconstrained binary optimization formulation of the problem which we present in the next section, however, can accomplish this.
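For reference, Algorithm 1 translates almost directly into NumPy. The following sketch is our own reading of the pseudocode and operates purely on a precomputed distance matrix; the random initialization, the stopping rule, and the fact that empty clusters are not handled are implementation choices of ours.

import numpy as np

def k_medoids(D, k, rng=np.random.default_rng(0), max_iter=100):
    """Lloyd-type k-medoids clustering in the spirit of Algorithm 1.
    D : (n, n) matrix of pairwise (squared) distances, k : number of medoids."""
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)   # initialize M as a subset of I
    for _ in range(max_iter):
        # equation (4): assign every point to its closest current medoid
        clusters = np.argmin(D[:, medoids], axis=1)
        # equation (5): within each cluster, pick the point minimizing summed distances
        new_medoids = np.empty(k, dtype=int)
        for l in range(k):
            members = np.flatnonzero(clusters == l)  # empty clusters not handled here
            new_medoids[l] = members[np.argmin(D[np.ix_(members, members)].sum(axis=0))]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break                                    # clusters have stabilized
        medoids = new_medoids
    return medoids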
3 A QUBO Formulation for k-Medoids Estimation

Our idea for how to select k local medoids among n data points is based on an observation regarding classical k-means clustering. Recall that, if a set $\mathcal{X}$ of n data points is partitioned into k clusters $\mathcal{C}_l$ of $n_l$ elements whose mean is $\mu_l$, adding the within cluster scatter

\[ S_W = \sum_{l=1}^{k} \sum_{x \in \mathcal{C}_l} \lVert x - \mu_l \rVert^2 \tag{6} \]

and the between cluster scatter

\[ S_B = \frac{1}{2n} \sum_{i=1}^{k} \sum_{j=1}^{k} n_i\, n_j\, \lVert \mu_i - \mu_j \rVert^2 \tag{7} \]

yields a constant; that is, we have $S_W + S_B = c$. This follows from Fisher's analysis of variance [21,22] and is to say that the two scatter values are inverse under addition: either $S_W$ is large and $S_B$ is small, or $S_W$ is small and $S_B$ is large. Since a small within cluster scatter is usually taken to be the objective of k-means clustering, we therefore observe that "good" cluster means are close to the data points within their cluster and, at the same time, far apart from each other.

Consequently, we posit that the same should hold true for the medoids that are supposed to result from k-medoids clustering. Note, however, that (6) involves variances about local centroids. Since medoids are not necessarily central, we henceforth work with a more robust, bounded measure of dissimilarity. In particular, we consider Welsch's M-estimator

\[ \Delta_{ij} = 1 - \exp\bigl( -\tfrac{1}{2}\, D_{ij} \bigr) , \tag{8} \]

which is known from robust regression [23] and is also referred to as the correntropy loss [24,25].

3.1 A QUBO to identify far apart data points

The problem of selecting k mutually far apart objects among a total of n objects is known as the max-sum dispersion problem [26] and can be formalized as follows: given an $n \times n$ dissimilarity matrix $\Delta$ over n objects which are indexed by $\mathcal{I} = \{1, 2, \ldots, n\}$, determine a subset $\mathcal{M}^\ast \subset \mathcal{I}$ with $|\mathcal{M}^\ast| = k$ such that

\[ \mathcal{M}^\ast = \operatorname*{argmax}_{\mathcal{M} \subset \mathcal{I}} \; \tfrac{1}{2} \sum_{i \in \mathcal{M}} \sum_{j \in \mathcal{M}} \Delta_{ij} \quad \text{s.t.} \quad |\mathcal{M}| = k . \tag{9} \]

Looking at this problem, we note that, upon introducing binary indicator vectors $z \in \{0,1\}^n$ whose entries are given by

\[ z_i = \begin{cases} 1, & \text{if } i \in \mathcal{M} \\ 0, & \text{otherwise,} \end{cases} \tag{10} \]

it can also be written as a constrained binary optimization problem

\[ z^\ast = \operatorname*{argmax}_{z \in \{0,1\}^n} \; \tfrac{1}{2}\, z^\top \Delta z \quad \text{s.t.} \quad z^\top \mathbf{1} = k , \tag{11} \]

where $\mathbf{1} \in \mathbb{R}^n$ denotes the vector of all ones. If we further note that $z^\top \mathbf{1} = k \Leftrightarrow (z^\top \mathbf{1} - k)^2 = 0$, we find that (11) can equivalently be expressed as a quadratic unconstrained binary optimization problem, namely

\[ z^\ast = \operatorname*{argmax}_{z \in \{0,1\}^n} \; \tfrac{1}{2}\, z^\top \Delta z - \gamma\, (z^\top \mathbf{1} - k)^2 , \tag{12} \]

where $\gamma \in \mathbb{R}$ is a Lagrange multiplier. Treating this multiplier as a constant and expanding the expression on the right hand side, we obtain

\[ z^\ast = \operatorname*{argmax}_{z \in \{0,1\}^n} \; \tfrac{1}{2}\, z^\top \Delta z - \gamma\, z^\top \mathbf{1}\mathbf{1}^\top z + 2 \gamma k\, z^\top \mathbf{1} - \gamma k^2 \tag{13} \]

\[ = \operatorname*{argmax}_{z \in \{0,1\}^n} \; z^\top \bigl( \tfrac{1}{2} \Delta - \gamma\, \mathbf{1}\mathbf{1}^\top \bigr) z + 2 \gamma k\, z^\top \mathbf{1} - \text{const} \tag{14} \]

\[ = \operatorname*{argmin}_{z \in \{0,1\}^n} \; z^\top \bigl( \gamma\, \mathbf{1}\mathbf{1}^\top - \tfrac{1}{2} \Delta \bigr) z - 2 \gamma k\, z^\top \mathbf{1} . \tag{15} \]

3.2 A QUBO to identify central data points

Using the above notation, the problem of selecting the k most central objects among a total of n objects can be formalized as follows:

\[ \mathcal{M}^\ast = \operatorname*{argmin}_{\mathcal{M} \subset \mathcal{I}} \; \sum_{i \in \mathcal{M}} \sum_{j \in \mathcal{I}} \Delta_{ij} \quad \text{s.t.} \quad |\mathcal{M}| = k . \tag{16} \]

Introducing binary indicator vectors and a Lagrange multiplier, reasoning as above then leads to the following quadratic unconstrained binary optimization problem:

\[ z^\ast = \operatorname*{argmin}_{z \in \{0,1\}^n} \; z^\top \Delta \mathbf{1} + \gamma\, (z^\top \mathbf{1} - k)^2 \tag{17} \]

\[ = \operatorname*{argmin}_{z \in \{0,1\}^n} \; z^\top \bigl( \gamma\, \mathbf{1}\mathbf{1}^\top \bigr) z + z^\top \bigl( \Delta \mathbf{1} - 2 \gamma k\, \mathbf{1} \bigr) . \tag{18} \]

3.3 Putting both QUBOs together

In order to combine the QUBO that identifies far apart points (15) and the QUBO that identifies central points (18) into a single model that identifies rather central points that are rather far apart, we adhere to common practice [27]. That is, we introduce two additional tradeoff parameters $\alpha, \beta \in \mathbb{R}$ to weigh the contributions of either model. This way, we obtain

\[ z^\ast = \operatorname*{argmin}_{z \in \{0,1\}^n} \; z^\top \bigl( \gamma\, \mathbf{1}\mathbf{1}^\top - \alpha \tfrac{1}{2} \Delta \bigr) z + z^\top \bigl( \beta\, \Delta \mathbf{1} - 2 \gamma k\, \mathbf{1} \bigr) . \tag{19} \]
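To make the construction concrete, the dissimilarity matrix of equation (8) and the combined QUBO of equation (19) can be assembled in a few lines of NumPy. The following sketch is ours (the function names welsch_delta and kmedoids_qubo are illustrative only); it returns the matrix Q and vector q that cast (19) in the generic form (1). The choice of the parameters alpha, beta, and gamma is discussed in the next subsection.

import numpy as np

def welsch_delta(D):
    """Equation (8): map squared distances D_ij to Delta_ij = 1 - exp(-D_ij / 2)."""
    return 1.0 - np.exp(-0.5 * D)

def kmedoids_qubo(Delta, k, alpha, beta, gamma):
    """Assemble the combined QUBO (19) in the generic form argmin_z z^T Q z + z^T q."""
    n = Delta.shape[0]
    ones = np.ones((n, n))
    Q = gamma * ones - 0.5 * alpha * Delta           # quadratic term of (19)
    q = beta * Delta.sum(axis=1) - 2.0 * gamma * k   # linear term of (19): beta*Delta*1 - 2*gamma*k*1
    return Q, q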
3.4 Choosing α, β, and γ

Working with Welsch's function in (8) has the additional benefit that it maps squared distances such as $\lVert x_i - x_j \rVert^2$ into the interval [0, 1]. For the weighted, model specific contributions that occur in (19), we therefore have the following two upper bounds:

\[ \alpha\, \tfrac{1}{2}\, z^\top \Delta z \le \alpha\, \tfrac{1}{2}\, k^2 \cdot 1 \tag{20} \]

\[ \beta\, z^\top \Delta \mathbf{1} \le \beta\, n\, k \cdot 1 . \tag{21} \]

This suggests setting the weighting parameters to α = 1/k and β = 1/n so as to normalize the two contributions to about the same range. For the Lagrange multiplier which enforces solutions $z^\ast$ to have k entries equal to 1, we choose γ = 2 so as to prioritize this constraint.

4 Practical Examples

Next, we present two admittedly simple experiments that illustrate the behavior of the QUBOs in (15), (18), and (19). Note again that our primary goal in this paper is to investigate whether it is possible to identify k medoids among n data points without having to cluster the data; efficiency and scalability are not our main concerns. Correspondingly, we consider two samples of 2D data points that are small enough to allow for brute force estimation of the minimizers of the above QUBOs. In other words, in both experiments, we compute dissimilarity matrices $\Delta$ over $n \in \{12, 16\}$ data points, choose α, β, and γ as above, and then evaluate each of the $2^n$ possible solutions to the QUBOs in (15), (18), and (19) in order to determine the respective best one.

The data in our first experiment were deliberately chosen such that extremal, central, as well as locally medoidal data points can easily be identified by visual inspection. They can be seen in Fig. 2a, which shows n = 12 two-dimensional data points forming 4 apparent clusters. Setting k = 4 and solving the QUBOs in (15), (18), and (19) produces indicator vectors that identify the data points highlighted in Fig. 2b–2d. Looking at these results suggests that they are well in line with what human analysts would deem reasonable.

Fig. 2: Given a didactic sample of 2D data points forming 4 clusters (panel a), we solve the QUBOs in (15), (18), and (19) for k = 4. While the solution to (15) identifies the 4 most extremal data points (panel b), the solution to (18) identifies the 4 most central ones (panel c). Solving (19) trades off both these results and identifies 4 local medoids (panel d).

In our second experiment, we generated 100 sets of n = 16 data points sampled from 3 bivariate Gaussian distributions; Fig. 3a shows an example. Setting k = 3, we then solved the QUBOs in (15), (18), and (19). Exemplary results can be seen in Fig. 3b–3d. The three extremal data points in Fig. 3b make intuitive sense. The fact that the three rather central data points in Fig. 3c are all situated in the cluster to the bottom left is due to the fact that this is the largest of the three clusters; here, our use of Welsch's M-estimator has the effect that points which are central w.r.t. several nearby points are favored. The local medoids highlighted in Fig. 3d are again intuitive.

Fig. 3: Given 16 data points sampled from 3 bivariate Gaussian distributions (panel a), we solve the QUBOs in (15), (18), and (19) for k = 3. Again, the solution to (15) identifies 3 rather extremal data points (panel b) and the solution to (18) corresponds to 3 rather central ones (panel c). The solution to (19) identifies 3 local medoids (panel d).
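Since n ≤ 16 in these experiments, exhaustively scanning all $2^n$ binary vectors is feasible. The brute-force evaluator of the generic objective (1) sketched below is our own illustration (building on the hypothetical helpers shown in Section 3); it is not how one would approach larger instances.

import itertools
import numpy as np

def solve_qubo_brute_force(Q, q):
    """Evaluate z^T Q z + z^T q for all 2^n binary vectors and return the minimizer.
    Only feasible for small n (here n <= 16)."""
    n = Q.shape[0]
    best_z, best_val = None, np.inf
    for bits in itertools.product((0, 1), repeat=n):
        z = np.array(bits, dtype=float)
        val = z @ Q @ z + z @ q
        if val < best_val:
            best_z, best_val = z, val
    return best_z, best_val

# usage with the helpers sketched in Section 3 (hypothetical distance matrix D):
# Delta = welsch_delta(D)
# Q, q = kmedoids_qubo(Delta, k=3, alpha=1/3, beta=1/Delta.shape[0], gamma=2.0)
# z_star, _ = solve_qubo_brute_force(Q, q)
# medoid_indices = np.flatnonzero(z_star)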
For each of the 100 data sets considered in this experiment, we also compared the k = 3 medoids obtained from solving (19) to those produced by Algorithm 1. In each of our 100 tests, we found them to be identical. While such a perfect agreement between both methods is likely an artifact of the small sample sizes we worked with, it nevertheless corroborates the utility of our QUBO-based approach to the k-medoids problem.

5 Related Work

As of this writing, there are three major algorithmic approaches to k-means clustering, namely those due to Lloyd [19], MacQueen [20], and Hartigan and Wong [28]. Lloyd- and Hartigan-type approaches alternatingly update means and clusters and have been applied to the combinatorial problem of k-medoids clustering, too. Examples of the former can be found in [13] and [29]; examples of the latter include the algorithms PAM and CLARA as well as variations thereof [30,31,32]. However, to the best of our knowledge, MacQueen-type procedures, which determine local means without having to compute clusters, have not yet been reported for the k-medoids setting. The QUBO presented in equation (19) fills this gap. While it is not an iterative method such as MacQueen's procedure, it nevertheless suggests that medoids, too, can be estimated without clustering.

Our derivation of a QUBO for the k-medoids problem was motivated by the quest for quantum computing solutions for relational or combinatorial clustering. Here, most prior work we are aware of has focused on solutions for k = 2 [6,7,8]. Work on solutions for k > 2 can be found in [33]. However, the methods proposed there are similar in spirit to kernel k-means clustering in that they operate on binary cluster membership indicator matrices $Z \in \{0,1\}^{k \times n}$. This is to say that they perform quantum clustering in a manner that produces clusters but does not identify cluster prototypes. Our approach in this paper, however, only requires binary indicator vectors $z \in \{0,1\}^n$ and is, once again, able to identify 1 ≤ k ≤ n prototypes without having to compute clusters.

6 Summary

In this paper, we have proposed a quadratic unconstrained binary optimization (QUBO) formulation of the problem of identifying k local medoids within a sample of n data points. The basic idea is to trade off measures of central and extremal tendencies of individual data points. Like conventional approaches to k-medoids clustering, our solution works with relational data and therefore applies to a wide range of practical settings. However, to the best of our knowledge, our solution is the first that is capable of extracting k local medoids without having to compute clusters.

This capability comes at a price. While conventional k-medoids clustering as well as our QUBO formulation constitute NP-hard combinatorial problems, the former can be solved (approximately) by means of greedy heuristics. Yet, for the latter, such heuristics are hard to conceive. However, an aspect of our model that may soon turn into a viable advantage is that it can easily be solved on quantum computers. While this paper did not consider corresponding quantum computing implementations, they are the topic of ongoing work and results will be reported once available.

References

1. Kochenberger, G., Glover, F.: A Unified Framework for Modeling and Solving Combinatorial Optimization Problems: A Tutorial. In Hager, W., Huang, S.J., Pardalos, P., Prokopyev, O., eds.: Multiscale Optimization Methods and Applications. Volume 82 of NOIA. Springer (2006)
2. Kochenberger, G., Hao, J.K., Glover, F., Lewis, M., Lü, Z., Wang, H., Wang, Y.: The Unconstrained Binary Quadratic Programming Problem: A Survey. J. of Combinatorial Optimization 28(1) (2014)
3. Johnson, M., et al.: Quantum Annealing with Manufactured Spins. Nature 473(7346) (2011)
4. D-Wave press release: D-Wave Announces D-Wave 2000Q Quantum Computer and First System Order (Jan 2017)
5. Farhi, E., Goldstone, J., Gutmann, S., Sipser, M.: Quantum Computation by Adiabatic Evolution. arXiv:quant-ph/0001106 (2000)
6. Bauckhage, C., Brito, E., Cvejoski, K., Ojeda, C., Sifa, R., Wrobel, S.: Ising Models for Binary Clustering via Adiabatic Quantum Computing. In: Proc. EMMCVPR. Volume 10746 of LNCS. Springer (2017)
7. Ushijima-Mwesigwa, H., Negre, C., Mniszewski, S.: Graph Partitioning Using Quantum Annealing on the D-Wave System. In: Proc. Int. Workshop on Post Moore's Era Supercomputing, ACM (2017)
8. Jünger, M., Lobe, E., Mutzel, P., Reinelt, G., Rendl, F., Rinaldi, G., Stollenwerk, T.: Performance of a Quantum Annealer for Ising Ground State Computations on Chimera Graphs. arXiv:1904.11965 [cs.DS] (2019)
9. Calude, C., Dinneen, M., Hua, R.: QUBO Formulations for the Graph Isomorphism Problem and Related Problems. Theoretical Computer Science 701 (2017)
10. Bauckhage, C., Ojeda, C., Sifa, R., Wrobel, S.: Adiabatic Quantum Computing for Kernel k=2 Means Clustering. In: Proc. KDML-LWDA (2018)
11. Pudenz, K., Lidar, D.: Quantum Adiabatic Machine Learning. Quantum Information Processing 12(5) (2013)
12. Adachi, S., Henderson, M.: Application of Quantum Annealing to Training of Deep Neural Networks. arXiv:1510.06356 [quant-ph] (2015)
13. Bauckhage, C.: NumPy / SciPy Recipes for Data Science: k-Medoids Clustering. researchgate.net (Feb 2015)
14. Drachen, A., Sifa, R., Thurau, C.: The Name in the Game: Patterns in Character Names and Game Tags. Entertainment Computing 5(1) (2014)
15. Bauckhage, C., Drachen, A., Sifa, R.: Clustering Game Behavior Data. IEEE Trans. on Computational Intelligence and AI in Games 7(3) (2015)
16. Caro, M., Aarva, A., Deringer, V., Csanyi, G., Laurila, T.: Reactivity of Amorphous Carbon Surfaces: Rationalizing the Role of Structural Motifs in Functionalization Using Machine Learning. Chemistry of Materials 30(21) (2018)
17. Molina, A., Vergari, A., Di Mauro, N., Natarajan, S., Esposito, F., Kersting, K.: Mixed Sum-Product Networks: A Deep Architecture for Hybrid Domains. In: Proc. AAAI (2018)
18. Johnson, M., et al.: A Universal Probe Set for Targeted Sequencing of 353 Nuclear Genes from Any Flowering Plant Designed Using k-Medoids Clustering. Systematic Biology (Dec 2018)
19. Lloyd, S.: Least Squares Quantization in PCM. IEEE Trans. Information Theory 28(2) (1982)
20. MacQueen, J.: Some Methods for Classification and Analysis of Multivariate Observations. In: Proc. Berkeley Symp. on Mathematical Statistics and Probability (1967)
21. Fisher, R.: On the "Probable Error" of a Coefficient of Correlation Deduced from a Small Sample. Metron 1 (1921)
22. Bauckhage, C.: k-Means and Fisher's Analysis of Variance. researchgate.net (May 2018)
23. Dennis, J., Welsch, R.: Techniques for Nonlinear Least Squares and Robust Regression. Communications in Statistics – Simulation and Computation 7(4) (1978)
24. Liu, W., Pokharel, P., Principe, J.: Correntropy: Properties and Applications in Non-Gaussian Signal Processing. IEEE Trans. on Signal Processing 55(11) (2007)
25. Feng, Y., Huang, X., Shi, L., Yang, Y., Suykens, J.: Learning with the Maximum Correntropy Criterion Induced Losses for Regression. J. of Machine Learning Research 16 (2015)
26. Ravi, S., Rosenkrantz, D., Tayi, G.: Heuristic and Special Case Algorithms for Dispersion Problems. Operations Research 42(2) (1994)
27. Lucas, A.: Ising Formulations of Many NP Problems. Frontiers in Physics 2(5) (2014)
28. Hartigan, J., Wong, M.: Algorithm AS 136: A k-Means Clustering Algorithm. J. of the Royal Statistical Society C 28(1) (1979)
29. Park, H.S., Jun, C.H.: A Simple and Fast Algorithm for K-Medoids Clustering. Expert Systems with Applications 36(2) (2009)
30. Kaufman, L., Rousseeuw, P.: Partitioning Around Medoids (Program PAM). In: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons (1990)
31. Kaufman, L., Rousseeuw, P.: Clustering Large Applications (Program CLARA). In: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons (1990)
32. Schubert, E., Rousseeuw, P.: Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms. arXiv:1810.05691 [cs.LG] (2018)
33. Kumar, V., Bass, G., Tomlin, C., Dulny III, J.: Quantum Annealing for Combinatorial Clustering. Quantum Information Processing 17(2) (2018)