Efficient Clustering for Large-Scale, Sparse, Discrete Data with Low Fundamental Resolution

Veronika Strnadová-Neeley, supervised by John R. Gilbert
University of California, Santa Barbara
veronika@cs.ucsb.edu

ABSTRACT
Scalable algorithm design has become central in the era of large-scale data analysis. My contribution to this line of research is the design of new algorithms for scalable clustering and data reduction, which exploit inherent low-dimensional structure in the input data to overcome the challenges posed by significant amounts of missing entries. I demonstrate that, by focusing on a property of the data that we call its fundamental resolution, we can improve the efficiency of clustering methods on sparse, discrete-valued data sets.

Figure 1: Fundamental resolution of an image: the more pixels we use, the clearer it gets, but only up to a point.

1. INTRODUCTION AND BACKGROUND
The necessity for efficient algorithms in large-scale data analysis has become clear in recent years, as unprecedented scaling of information has sprung up in a variety of domains, from bioinformatics to social networks to signal processing. In many cases, it is no longer sufficient to use even quadratic-time algorithms for such data, and much recent research has focused on developing efficient methods to analyze vast amounts of information.

Here we focus on scalable clustering algorithms, a form of unsupervised learning that is invaluable in exploratory data analysis [16]. Many successes in the effort to design such algorithms have come from leveraging an inherent structure in the data, and this structure may be best expressed in various ways: the data may lie in an inherently low-dimensional Euclidean space or along a low-dimensional manifold, or it may have certain self-repeating, or fractal, properties. All of these structural properties have been explored to some degree in order to design more efficient clustering algorithms [11, 2, 6, 7, 8, 17].

My work focuses on leveraging a property of large-scale, discrete-valued data that I call its fundamental resolution, a concept that can be explained by comparing the images in Figure 1. More pixels produce a clearer image, but only up to a point – we cannot distinguish the leftmost image from the one in the middle, even though more pixels are used to render the leftmost image. Similarly, in many large, discrete-valued data sets, the fundamental resolution, rather than the number of data points, determines the extent to which we can distinguish points from one another before the data becomes redundant. If we know that a large data set with many missing values has a fundamental resolution, we can more easily single out data points that are noise and fill in missing entries.

In the following, I present efficient algorithms for clustering large-scale, discrete-valued data sets with missing values by leveraging the fundamental resolution of the data. Previously, my collaborators and I demonstrated that the underlying fundamental resolution of binary-valued genetic mapping data can be used to quickly cluster large genetic mapping data sets. I am now generalizing this clustering approach to large-scale, discrete-valued data, such as that found in the recommender systems domain. Genetic mapping and recommender systems present similar challenges to clustering algorithms, due to the large degree of sparsity and the sheer scale of the input data in these domains.
2. RELATED WORK
Much attention has been paid to the intuition that many large-scale data sets lie in an inherently low-dimensional space, which explains the popularity of matrix factorization methods for large-scale data analysis. Methods such as principal component analysis rely on the SVD to project a high-dimensional data set into a lower-dimensional space [10, 9]. Spectral clustering is another such example, and has been modified in recent years to improve its running time [11]. More recently, the CUR decomposition [12] has gained popularity as a sparse matrix factorization method that is both fast and, in some cases, more interpretable than a decomposition based on eigenvectors. With matrix factorization approaches, clustering the projected data in the lower-dimensional space often results in better clustering performance. However, my work focuses on data that does not necessarily lie in a low-dimensional Euclidean subspace – many dimensions of the input may be relevant in data with a low fundamental resolution. In addition, there is no clear answer on how to deal with noise and missing entries when factorizing a large data matrix, whereas my work takes these issues into account.

Other forms of low-dimensional inherent structure have also been explored to speed up the clustering of large-scale data. A data set's fractal dimension has been exploited for clustering [8], but this method is not scalable to large data sets. An approach based on exploiting the low fractal dimension and entropy of a data set has been successfully applied to quickly search massive biological data sets [17]; however, here we focus on efficient clustering, not efficient search. Older, popular methods, such as the well-known DBSCAN algorithm [6], seek to preserve the shape of the data, but rely on the input lying in a metric space. In addition, these methods typically require at least quadratic time when the input data lies in three or more dimensions [7], and again do not account for missing values. Popular nonlinear dimensionality reduction methods, such as Laplacian eigenmaps [2], also do not account for missing data and noise, and many such approaches do not scale well.

3. EXPLOITING THE FUNDAMENTAL RESOLUTION OF GENETIC MAP DATA
Genetic map data for a homozygous mapping population can be represented as a binary matrix X, composed of rows x_u, where each entry x_ui can take on one of two values, a or b, or can be missing [4]. In this application domain, errors occur when an entry is erroneously recorded during sequencing – that is, it is flipped from a to b or from b to a – and errors typically occur at a fixed rate ε. The goal of genetic mapping is to produce a map of the genome, which shows the correct clustering and ordering of the input vectors x_u. Such maps have applications in health, agriculture, and the study of biodiversity.

Producing a genetic map typically requires three stages: 1) clustering the vectors x_u into linkage groups, 2) ordering the vectors within each linkage group, and 3) determining the correct genetic distance between the ordered vectors. In previous work [13, 14], we showed that by exploiting the fundamental resolution of genetic map data, we can quickly and accurately cluster the input vectors and reduce them from a large-scale, noisy, and incomplete data set into a small set of bins that more accurately represent the genome.

3.1 Scalable Clustering
We have shown that, using the well-known LOD score similarity that is ubiquitous in genetics, we can design a fast and accurate algorithm to cluster input data for genetic mapping [13]. The LOD score is a logarithm of odds, comparing the likelihood that two vectors belong to the same cluster to the likelihood that their agreement arose by pure chance:

    LOD(x_u, x_v) = log_10 [ P(data | x_u and x_v in same linkage group) / P(data | pure chance) ]

Because we have binary data, the denominator in the LOD fraction is simply (1/2)^η, where η is the number of entries that are non-missing in both x_u and x_v. For example, if x_u = (a, b, b, −, b) and x_v = (a, a, −, −, a), then the denominator is (1/2)^3, because the first, second, and last entries are non-missing in both x_u and x_v. The numerator is a function of the estimated recombination fraction of the genetic data, and is explained in more detail in our previous work [13]. The LOD score does not obey the triangle inequality, which, together with the presence of errors (flipped entries) and missing values, rules out accurately clustering the data with existing efficient algorithms.
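To make the score concrete, here is a minimal Python sketch of an LOD-style computation for two binary vectors with missing entries. The pure-chance term follows the (1/2)^η definition above; the same-cluster term substitutes a simple fixed-error-rate model for the recombination-fraction likelihood detailed in [13], so the numerator here is an illustrative assumption, not the exact score used in our work.

import math

MISSING = None  # marker for a missing entry

def lod_score(xu, xv, eps=0.01):
    # Entries observed in both vectors (eta of them).
    shared = [(a, b) for a, b in zip(xu, xv)
              if a is not MISSING and b is not MISSING]
    eta = len(shared)
    if eta == 0:
        return 0.0  # no overlapping entries: no evidence either way
    matches = sum(1 for a, b in shared if a == b)
    mismatches = eta - matches
    # Same-cluster model (assumption): entries agree unless exactly one
    # of the two copies was flipped by an error occurring at rate eps.
    p_mismatch = 2 * eps * (1 - eps)
    p_same = (1 - p_mismatch) ** matches * p_mismatch ** mismatches
    # Pure-chance model: each shared entry matches with probability 1/2,
    # so the denominator is (1/2)^eta as in the text.
    p_chance = 0.5 ** eta
    return math.log10(p_same / p_chance)

# The example from the text: the vectors overlap in their 1st, 2nd, and
# 5th entries, so the chance term is (1/2)^3.
xu = ['a', 'b', 'b', MISSING, 'b']
xv = ['a', 'a', MISSING, MISSING, 'a']
print(lod_score(xu, xv))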
Our algorithm, BubbleCluster, which resembles the DBSCAN method [6], utilizes the LOD score to efficiently cluster the data. First, we build a skeleton of the clustering by linking together points that exceed a high LOD threshold, which only occurs for vectors with many matching binary values. Then, points with more missing values are linked to the point in the skeleton attaining the highest LOD score (a sketch of this two-phase procedure follows Table 1). The fundamental resolution limits the number of unique input vectors, and thus, as the data size grows, it becomes more likely that enough high-quality points exist to build the skeleton and accurately place the remaining points.

BubbleCluster allows for efficient clustering of genetic map input data into linkage groups. As Table 1 shows, our method achieved both high precision and recall (expressed as the F-score) on real genetic data. It was also the first method to successfully cluster genetic map data at large scales, including the grand-challenge hexaploid bread wheat genome [3], and it outperformed state-of-the-art mapping tools in terms of clustering performance on simulated data [13].

Dataset     | Input Size | F-score | Time
Barley      | 64K        | 0.9993  | 15 sec
Switchgrass | 113K       | 0.9745  | 8.9 min
Switchgrass | 548K       | 0.9894  | 1.9 hrs
Wheat       | 1.582M     | N/A     | 1.22 hrs

Table 1: Clustering performance of BubbleCluster on Barley, Switchgrass, and Wheat data sets from the Joint Genome Institute. State-of-the-art mapping tools are unable to cluster data sets at this scale. (Table originally appeared in [13].)
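The skeleton-and-attach strategy can be sketched as follows, reusing the lod_score function and MISSING marker from the previous sketch. The union-find bookkeeping, the quality heuristic (fraction of non-missing entries), and the naive all-pairs skeleton phase are illustrative simplifications; the actual BubbleCluster algorithm [13] avoids comparing all pairs of points.

def bubble_cluster(points, hi_lod, lo_lod, min_present=0.8):
    n = len(points)
    def present(x):  # fraction of non-missing entries in a vector
        return sum(v is not MISSING for v in x) / len(x)
    skeleton = [i for i in range(n) if present(points[i]) >= min_present]

    # Union-find over point indices.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Phase 1: link high-quality points whose LOD exceeds the high
    # threshold. (Naively quadratic here; the real algorithm is not.)
    for a, i in enumerate(skeleton):
        for j in skeleton[a + 1:]:
            if lod_score(points[i], points[j]) >= hi_lod:
                parent[find(i)] = find(j)

    # Phase 2: attach each remaining point to the skeleton point
    # attaining the highest LOD score, if it exceeds the low threshold.
    skeleton_set = set(skeleton)
    labels = {}
    for i in range(n):
        if i in skeleton_set:
            labels[i] = find(i)
            continue
        best, best_lod = None, lo_lod
        for j in skeleton:
            s = lod_score(points[i], points[j])
            if s >= best_lod:
                best, best_lod = j, s
        labels[i] = find(best) if best is not None else i  # unplaced: singleton
    return labels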
To further aid the efficiency and accuracy of the genetic mapping process, we introduced a fast data reduction method that quickly converts the large-scale, noisy, and incomplete input data into a small-scale, more accurate, and more complete set of points that more clearly represents the genetic map. I will next describe this data reduction method, which is based on the fundamental resolution of the genetic mapping input data.

3.2 Efficient Data Reduction
Genetic map data has a fundamental resolution that is linear in the dimensionality of the input vectors. We can leverage this property of the data to efficiently reduce it to a much smaller and more accurate set of vectors we call bins, which represent positions along the genetic map, as illustrated in Figure 2. The data reduction process uses a recursive bisection method to quickly reduce the input vectors within each linkage group to bins [14].

The binary nature of the data limits the number of possible unique input points to 2^n, where n is the dimensionality of the data. However, the fundamental resolution of the data limits this number much further – the fundamental resolution of a genetic map is equivalent to the number of possible unique positions on the map. With a homozygous mapping population (binary data), this number is O(kn), where k is the number of linkage groups (clusters) [14]. Therefore, when the number of input points is much larger than the fundamental resolution, many points must be identical, which helps us filter out errors. Furthermore, the large data set size allows us to fill in missing data if we can cluster together points that belong to the same unique map position.

If we know two points belong to the same position on the genetic map, we can infer which values are errors and what the missing data should be. In Figure 2, for example, x3 and x6 belong to the same position, so we can fill in the missing data in both x3 and x6 based on each other's values. A similar idea applies to errors – the more points that belong to the same position, the clearer it becomes which values are errors, as long as ε is fairly low.

Figure 2: The fundamental resolution of genetic map data allows us to reduce the large-scale, incomplete, and noisy input to a more complete set of representative vectors that more clearly describe positions along the genetic map.

We have designed an algorithm that uses recursive bisection to quickly cluster together points in the same genetic map position. At each step, we use a maximum a posteriori (MAP) estimate of ε in order to find the best dimension along which to split the point set. The algorithm returns both an estimated error rate and a set of bins that represent unique positions on the genetic map. We showed that the number of bins and the error rate are consistent with existing real-world maps of wheat and barley [14].
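The sketch below illustrates the recursive bisection idea under stated assumptions: the split dimension is chosen to maximize observed disagreement, a crude proxy for the MAP estimate of ε used in [14], and each bin is summarized by a per-dimension majority vote, which both fills in missing values and outvotes erroneous entries. It reuses the MISSING marker from the LOD sketch.

from collections import Counter

def reduce_to_bins(points, max_disagreement=0.02):
    dims = range(len(points[0]))

    def disagreement(d):
        # Fraction of observed values at dimension d held by the minority.
        vals = [p[d] for p in points if p[d] is not MISSING]
        if len(set(vals)) < 2:
            return 0.0
        return min(Counter(vals).values()) / len(vals)

    # Crude stand-in for the MAP-based choice of split dimension in [14]:
    # split where the observed disagreement is largest.
    d_star = max(dims, key=disagreement)
    if disagreement(d_star) <= max_disagreement:
        # Points agree up to the error rate: emit one consensus bin,
        # filling each entry by majority vote over the observed values.
        consensus = []
        for d in dims:
            vals = [p[d] for p in points if p[d] is not MISSING]
            consensus.append(Counter(vals).most_common(1)[0][0] if vals else MISSING)
        return [consensus]
    left = [p for p in points if p[d_star] == 'a']
    right = [p for p in points if p[d_star] == 'b']
    rest = [p for p in points if p[d_star] is MISSING]
    # Simplification: points missing the split entry go to the larger side.
    (left if len(left) >= len(right) else right).extend(rest)
    return (reduce_to_bins(left, max_disagreement) +
            reduce_to_bins(right, max_disagreement))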
We also simulated genetic map data with realistic error rates and a variety of missing data rates. Our algorithm scaled linearly with data set size for all tested missing rates and error rates in synthetic data. As Figure 3 shows, the bin recall, or the fraction of bins we can recover perfectly, improves at each missing rate with more data. Note that although an error rate of ε = 0.5% seems low, it is actually much higher than encountered in practice.

Figure 3: Bin recall (percentage of bins found perfectly) at ε = 0.5%, for data set sizes from 12.5K to 1.6M at missing data rates of 15%, 35%, and 50%. As the data size grows, more map position vectors (bins) are recovered perfectly, while lower missing rates allow our algorithm to recover the bins with less data.

4. GENERALIZING TO DISCRETE-VALUED DATA WITH MISSING VALUES
Next, I will describe the last portion of my thesis work: clustering large discrete-valued data sets with missing values. One example of such data is found in the Recommender Systems (RS) domain, and much of the experimentation with these clustering methods will be on RS data.

My hypothesis is that a fundamental resolution exists in many discrete-valued data sets, and that it can be exploited to efficiently cluster these data at large scales. In the RS domain, we often have a discrete-valued input matrix with many missing values, as shown in Figure 4, where an entry X_ui represents the rating that user u gave to item i, with ratings typically taking values on a discrete scale. The methods for clustering and reducing the binary genetic map data can be adapted to this more general setting.

Figure 4: Fundamental resolution in Recommender Systems data equates to finding the vectors that best represent user sub-groups with low disagreement rates.

The fundamental resolution in RS data can be expressed as the number of unique user sub-groups that rate items very similarly. Recently, Christakopoulou et al. [5] have shown that utilizing user clusters to improve the prediction of rating values and to recommend better items is an extremely effective approach. An efficient, accurate clustering method for RS data thus has the potential to enhance such approaches to rating prediction, as well as to the top-n recommendation problem [1].

4.1 The LiRa Similarity Score for Recommender Systems
We have developed a statistical score analogous to the LOD score for RS data, called LiRa, based on a likelihood ratio. We assume that RS data has a fundamental resolution, and thus that users (rows of the input matrix) can be clustered into groups, with vectors representing the rating trends of each group. LiRa compares the likelihood of observing the values in two user vectors x_u and x_v assuming the users are in the same cluster, to the likelihood of observing the same data by chance, based on the differences in their rating values:

    LiRa(x_u, x_v) = log_10 [ p(differences in x_u and x_v | same cluster) / p(differences in x_u and x_v | pure chance) ]    (1)

LiRa generalizes the LOD score by assuming that differences in two discrete-valued vectors from the same cluster follow a particular multinomial distribution, which is used to compute the numerator. The LiRa score is useful in the RS setting because it leverages more data to make a more accurate judgment of similarity.

We have shown that using the LiRa score to find nearest neighbors in a k-nearest-neighbors approach outperforms the popular and widely used Pearson and cosine similarity scores in terms of rating prediction error [15]. I am currently expanding the clustering model used to compute the likelihood of users belonging to the same cluster in the LiRa score, which will be useful for efficient data reduction in the discrete-valued setting.
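A minimal sketch of a LiRa-style computation follows. The pure-chance term is exact for two independent uniform ratings on a 1-5 scale; the same-cluster multinomial over absolute rating differences is an invented illustrative choice that concentrates mass on small differences, whereas the model LiRa actually uses is specified in [15].

import math

def lira_score(xu, xv):
    scale = 5
    # Exact chance model: |difference| of two independent ratings drawn
    # uniformly from {1..5}; d = 0..4 has probability (5, 8, 6, 4, 2)/25.
    chance = [(scale if d == 0 else 2 * (scale - d)) / scale ** 2
              for d in range(scale)]
    # Assumed same-cluster model: small disagreements dominate.
    same = [0.55, 0.30, 0.10, 0.04, 0.01]

    score, shared = 0.0, 0
    for a, b in zip(xu, xv):
        if a is None or b is None:  # skip items not rated by both users
            continue
        d = abs(a - b)
        score += math.log10(same[d] / chance[d])
        shared += 1
    return score if shared else 0.0

# Rows x1 and x7 of the input matrix in Figure 4 agree closely on their
# four co-rated items, so their score is positive.
x1 = [5, None, 1, 1, None, 1, 4, None, 2, None]
x7 = [4, None, None, 1, 4, None, 4, None, 3, None]
print(lira_score(x1, x7))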
4.2 Generalizing Efficient Data Reduction to the Discrete-Valued Domain
As noted previously, my goal is to generalize my previous work on efficient clustering and data reduction for genetic map data to the more general setting of large-scale, discrete-valued data with missing values and noise. The LiRa score from Section 4.1 is the first step in this direction, and it can be used to produce an initial clustering of the input using a thresholding scheme similar to that of the BubbleCluster algorithm. The threshold LiRa score within which points will belong to the same cluster will rely on the clustering model used to represent the data. For RS data, a working model is already presented in previous work [15].

After an initial fast clustering, I hope to generalize the data reduction stage to discrete-valued data as well. Here, future work involves more precisely defining the point at which the fundamental resolution has been reached. In RS data, the idea is to cluster together users who have very similar rating patterns. One possibility is to only cluster users together if the distribution of their rating differences follows a clustering model. For example, in Cluster 1 in Figure 4, only one pair of ratings for the same item differs significantly: x13 = 1 and x33 = 5, giving a rating difference of 4; the remaining ratings are all close together. The recursive bisection method for data reduction in genetic mapping can be modified to this more general case, by dividing user clusters with large differences in rating values, based on MAP estimates of the proportion of each rating difference.
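As a sketch of this splitting criterion, the following function compares a cluster's empirical distribution of pairwise rating differences against an assumed same-cluster model, and flags the cluster for bisection when the two deviate too much. The model vector and the total-variation tolerance are illustrative assumptions standing in for the MAP-based test described above.

from itertools import combinations

# Assumed same-cluster distribution over absolute rating differences
# 0..4 (the same illustrative model as in the LiRa sketch above).
SAME_CLUSTER_MODEL = [0.55, 0.30, 0.10, 0.04, 0.01]

def should_split(cluster, model=SAME_CLUSTER_MODEL, tol=0.05):
    counts = [0] * len(model)
    total = 0
    for u, v in combinations(cluster, 2):
        for a, b in zip(u, v):
            if a is None or b is None:
                continue
            counts[abs(a - b)] += 1
            total += 1
    if total == 0:
        return False  # no co-rated items, so no evidence for a split
    # Total variation distance between the empirical distribution of
    # rating differences and the assumed same-cluster model.
    tv = 0.5 * sum(abs(c / total - m) for c, m in zip(counts, model))
    return tv > tol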
I am currently formalizing the notion of fundamental resolution in the general case, and experimenting to find the best clustering model for RS data. As the final piece of my thesis, I hope to demonstrate that these efficient clustering and data reduction methods can be applied to more general data sets, and that they will be useful in the RS domain for rating prediction.

5. CONCLUSION
I have shown that the concept of fundamental resolution can be exploited to design efficient and accurate clustering algorithms in the genetic mapping domain, and I am extending this concept to the more general setting of discrete-valued data. The methods presented here are useful for applications with large-scale, discrete input data with many missing values, such as that found in the Recommender Systems domain. Future directions beyond my thesis work include exploring the connection between fractal dimension and fundamental resolution, as well as defining new clustering models for data sets with a low fundamental resolution.

6. ACKNOWLEDGMENTS
This work is supported by the Applied Mathematics Program of the DOE Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231 and by NSF Award CCF-1637564.

7. REFERENCES
[1] D. C. Anastasiu, E. Christakopoulou, S. Smith, M. Sharma, and G. Karypis. Big data and recommender systems. 2016.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
[3] J. A. Chapman, M. Mascher, A. Buluç, K. Barry, E. Georganas, A. Session, V. Strnadova, J. Jenkins, S. Sehgal, L. Oliker, et al. A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome. Genome Biology, 16(1):26, 2015.
[4] J. Cheema and J. Dicks. Computational approaches and software tools for genetic linkage map estimation in plants. Briefings in Bioinformatics, 10(6):595-608, 2009.
[5] E. Christakopoulou and G. Karypis. Local item-item models for top-n recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 67-74. ACM, 2016.
[6] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226-231, 1996.
[7] J. Gan and Y. Tao. DBSCAN revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 519-530. ACM, 2015.
[8] A. Gionis, A. Hinneburg, S. Papadimitriou, and P. Tsaparas. Dimension induced clustering. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 51-60. ACM, 2005.
[9] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
[10] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.
[11] F. Lin and W. W. Cohen. Power iteration clustering. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 655-662, 2010.
[12] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697-702, 2009.
[13] V. Strnadová, A. Buluç, J. Chapman, J. R. Gilbert, J. Gonzalez, S. Jegelka, D. Rokhsar, and L. Oliker. Efficient and accurate clustering for large-scale genetic mapping. In Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on, pages 3-10. IEEE, 2014.
[14] V. Strnadová-Neeley, A. Buluç, J. Chapman, J. R. Gilbert, J. Gonzalez, and L. Oliker. Efficient data reduction for large-scale genetic mapping. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, pages 126-135. ACM, 2015.
[15] V. Strnadová-Neeley, A. Buluç, J. R. Gilbert, L. Oliker, and W. Ouyang. LiRa: A new likelihood-based similarity score for collaborative filtering. arXiv preprint arXiv:1608.08646, 2016.
[16] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645-678, 2005.
[17] Y. W. Yu, N. M. Daniels, D. C. Danko, and B. Berger. Entropy-scaling search of massive biological data. Cell Systems, 1(2):130-140, 2015.