Efficient Clustering for Large-Scale, Sparse, Discrete Data with Low Fundamental Resolution

Veronika Strnadová-Neeley, supervised by John R. Gilbert
University of California, Santa Barbara
veronika@cs.ucsb.edu

ABSTRACT
Scalable algorithm design has become central in the era of large-scale data analysis. My contribution to this line of research is the design of new algorithms for scalable clustering and data reduction, which exploit inherent low-dimensional structure in the input data to overcome the challenges posed by significant amounts of missing entries. I demonstrate that, by focusing on a property of the data that we call its fundamental resolution, we can improve the efficiency of clustering methods on sparse, discrete-valued data sets.

Figure 1: Fundamental resolution of an image: the more pixels we use, the clearer it gets, but only up to a point.

1. INTRODUCTION AND BACKGROUND
The necessity for efficient algorithms in large-scale data analysis has become clear in recent years, as unprecedented scaling of information has sprung up in a variety of domains, from bioinformatics to social networks to signal processing. In many cases, it is no longer sufficient to use even quadratic-time algorithms for such data, and much recent research has focused on developing efficient methods to analyze vast amounts of information.

Here we focus on scalable clustering algorithms, a form of unsupervised learning that is invaluable in exploratory data analysis [16]. Many successes in the effort to design such algorithms have come from leveraging an inherent structure in the data, and this structure may be best expressed in various ways: the data may lie in an inherently low-dimensional Euclidean space or along a low-dimensional manifold, or it may have certain self-repeating, or fractal, properties. All of these structural properties have been explored to some degree in order to design more efficient clustering algorithms [11, 2, 6, 7, 8, 17].

My work focuses on leveraging a property of large-scale, discrete-valued data that I call its fundamental resolution, a concept that can be explained by comparing the images in Figure 1. More pixels produce a clearer image, but only up to a point – we cannot distinguish the leftmost image from the one in the middle, even though more pixels are used to render the leftmost image. Similarly, in many large, discrete-valued data sets, the fundamental resolution, rather than the number of data points, determines the extent to which we can distinguish points from one another before the data becomes redundant. If we know that a large data set with many missing values has a fundamental resolution, we can more easily single out data points that are noise and fill in missing entries.

In the following, I present efficient algorithms for clustering large-scale, discrete-valued data sets with missing values by leveraging the fundamental resolution of the data. Previously, my collaborators and I demonstrated that the underlying fundamental resolution of binary-valued genetic mapping data can be used to quickly cluster large genetic mapping data sets. I am now generalizing this clustering approach to large-scale, discrete-valued data, such as that found in the recommender systems domain. Genetic mapping and recommender systems present similar challenges to clustering algorithms, due to the large degree of sparsity and the sheer scale of the input data in these domains.
2. RELATED WORK
Much attention has been paid to the intuition that many large-scale data sets lie in an inherently low-dimensional space, which explains the popularity of matrix factorization methods for large-scale data analysis. Methods such as principal component analysis rely on the SVD to project a high-dimensional data set into a lower-dimensional space [10, 9]. Spectral clustering is another such example, and has been modified in recent years to improve its running time [11]. More recently, the CUR decomposition [12] has gained popularity as a sparse matrix factorization method that is both fast and, in some cases, more interpretable than a decomposition based on eigenvectors. With matrix factorization approaches, clustering the projected data in the lower-dimensional space often results in better clustering performance. However, my work focuses on data that does not necessarily lie in a low-dimensional Euclidean subspace – many dimensions of the input may be relevant in data with a low fundamental resolution. In addition, there is no clear answer on how to deal with noise and missing entries when factorizing a large data matrix, whereas my work takes these issues into account.

Other forms of low-dimensional inherent structure have also been explored to speed up the clustering of large-scale data. A data set's fractal dimension has been exploited for clustering [8], but this method is not scalable to large data sets. An approach based on exploiting the low fractal dimension and entropy of a data set has been successfully applied to quickly search massive biological data sets [17]; however, here we focus on efficient clustering, not efficient search. Older, popular methods, such as the well-known DBSCAN algorithm [6], seek to preserve the shape of the data, but rely on the input lying in a metric space. In addition, these methods typically require at least quadratic time when the input data lies in three or more dimensions [7], and again do not account for missing values. Popular nonlinear dimensionality reduction methods, such as Laplacian eigenmaps [2], also do not account for missing data and noise, and many such approaches do not scale well.

3. EXPLOITING THE FUNDAMENTAL RESOLUTION OF GENETIC MAP DATA
Genetic map data for a homozygous mapping population can be represented as a binary matrix X, composed of rows x_u, where each entry x_ui can take on one of two values, a or b, or can be missing [4]. In this application domain, errors occur when an entry is erroneously recorded during sequencing – that is, it is flipped from a to b or from b to a – and errors typically occur at a fixed rate ε. The goal of genetic mapping is to produce a map of the genome, which shows the correct clustering and ordering of the input vectors x_u. Such maps have applications in health, agriculture, and the study of biodiversity.

Producing a genetic map typically requires three stages: 1) clustering the vectors x_u into linkage groups, 2) ordering the vectors within each linkage group, and 3) determining the correct genetic distance between the ordered vectors. In previous work [13, 14], we showed that by exploiting the fundamental resolution of genetic map data, we can quickly and accurately cluster the input vectors and reduce them from a large-scale, noisy, and incomplete data set into a small set of bins that more accurately represent the genome.

3.1 Scalable Clustering
We have shown that, using the well-known LOD score similarity that is ubiquitous in genetics, we can design a fast and accurate algorithm to cluster input data for genetic mapping [13]. The LOD score is a logarithm of odds, comparing the likelihood that two vectors belong to the same cluster to the likelihood that their agreement arose by pure chance:

    LOD(x_u, x_v) = log_10 [ P(data | x_u and x_v in same linkage group) / P(data | pure chance) ]

Because we have binary data, the denominator in the LOD fraction is simply (1/2)^η, where η is the number of entries that are non-missing in both x_u and x_v. For example, if x_u = (a, b, b, −, b) and x_v = (a, a, −, −, a), then the denominator is (1/2)^3, because the first, second, and last entries are non-missing in both x_u and x_v. The numerator is a function of the estimated recombination fraction of the genetic data, and is explained in more detail in our previous work [13]. The LOD score does not obey the triangle inequality, which, together with the presence of errors (flipped entries) and missing values, rules out accurately clustering the data with existing efficient algorithms.
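To make the score concrete, here is a minimal Python sketch of an LOD-style computation for two binary vectors with missing entries. The pure-chance term follows the (1/2)^η definition above; the same-cluster term substitutes a simple fixed-error-rate model for the recombination-fraction likelihood detailed in [13], so the numerator here is an illustrative assumption, not the exact score used in our work.

import math

MISSING = None  # marker for a missing entry

def lod_score(xu, xv, eps=0.01):
    # Entries observed in both vectors (eta of them).
    shared = [(a, b) for a, b in zip(xu, xv)
              if a is not MISSING and b is not MISSING]
    eta = len(shared)
    if eta == 0:
        return 0.0  # no overlapping entries: no evidence either way
    matches = sum(1 for a, b in shared if a == b)
    mismatches = eta - matches
    # Same-cluster model (assumption): entries agree unless exactly one
    # of the two copies was flipped by an error occurring at rate eps.
    p_mismatch = 2 * eps * (1 - eps)
    p_same = (1 - p_mismatch) ** matches * p_mismatch ** mismatches
    # Pure-chance model: each shared entry matches with probability 1/2,
    # so the denominator is (1/2)^eta as in the text.
    p_chance = 0.5 ** eta
    return math.log10(p_same / p_chance)

# The example from the text: the vectors overlap in their 1st, 2nd, and
# 5th entries, so the chance term is (1/2)^3.
xu = ['a', 'b', 'b', MISSING, 'b']
xv = ['a', 'a', MISSING, MISSING, 'a']
print(lod_score(xu, xv))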
Our algorithm, BubbleCluster, which resembles the DBSCAN method [6], utilizes the LOD score to efficiently cluster the data. First, we build a skeleton of the clustering by linking together points that exceed a high LOD threshold, which only occurs for vectors with many matching binary values. Then, points with more missing values are linked to the point in the skeleton attaining the highest LOD score (a sketch of this two-phase procedure follows Table 1). The fundamental resolution limits the number of unique input vectors, and thus, as the data size grows, it becomes more likely that enough high-quality points exist to build the skeleton and accurately place the remaining points.

BubbleCluster allows for efficient clustering of genetic map input data into linkage groups. As Table 1 shows, our method achieved both high precision and recall (expressed as the F-score) on real genetic data. It was also the first method to successfully cluster genetic map data at large scales, including the grand-challenge hexaploid bread wheat genome [3], and it outperformed state-of-the-art mapping tools in terms of clustering performance on simulated data [13].

Dataset     | Input Size | F-score | Time
Barley      | 64K        | 0.9993  | 15 sec
Switchgrass | 113K       | 0.9745  | 8.9 min
Switchgrass | 548K       | 0.9894  | 1.9 hrs
Wheat       | 1.582M     | N/A     | 1.22 hrs

Table 1: Clustering performance of BubbleCluster on Barley, Switchgrass, and Wheat data sets from the Joint Genome Institute. State-of-the-art mapping tools are unable to cluster data sets at this scale. (Table originally appeared in [13].)
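The skeleton-and-attach strategy can be sketched as follows, reusing the lod_score function and MISSING marker from the previous sketch. The union-find bookkeeping, the quality heuristic (fraction of non-missing entries), and the naive all-pairs skeleton phase are illustrative simplifications; the actual BubbleCluster algorithm [13] avoids comparing all pairs of points.

def bubble_cluster(points, hi_lod, lo_lod, min_present=0.8):
    n = len(points)
    def present(x):  # fraction of non-missing entries in a vector
        return sum(v is not MISSING for v in x) / len(x)
    skeleton = [i for i in range(n) if present(points[i]) >= min_present]

    # Union-find over point indices.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Phase 1: link high-quality points whose LOD exceeds the high
    # threshold. (Naively quadratic here; the real algorithm is not.)
    for a, i in enumerate(skeleton):
        for j in skeleton[a + 1:]:
            if lod_score(points[i], points[j]) >= hi_lod:
                parent[find(i)] = find(j)

    # Phase 2: attach each remaining point to the skeleton point
    # attaining the highest LOD score, if it exceeds the low threshold.
    skeleton_set = set(skeleton)
    labels = {}
    for i in range(n):
        if i in skeleton_set:
            labels[i] = find(i)
            continue
        best, best_lod = None, lo_lod
        for j in skeleton:
            s = lod_score(points[i], points[j])
            if s >= best_lod:
                best, best_lod = j, s
        labels[i] = find(best) if best is not None else i  # unplaced: singleton
    return labels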
To further aid the efficiency and accuracy of the genetic mapping process, we introduced a fast data reduction method that quickly converts the large-scale, noisy, and incomplete input data into a small-scale, more accurate, and more complete set of points that more clearly represents the genetic map. I will next describe this data reduction method, which is based on the fundamental resolution of the genetic mapping input data.

3.2 Efficient Data Reduction
Genetic map data has a fundamental resolution that is linear in the dimensionality of the input vectors. We can leverage this property of the data to efficiently reduce it to a much smaller and more accurate set of vectors we call bins, which represent positions along the genetic map, as illustrated in Figure 2. The data reduction process uses a recursive bisection method to quickly reduce the input vectors within each linkage group to bins [14].

The binary nature of the data limits the number of possible unique input points to 2^n, where n is the dimensionality of the data. However, the fundamental resolution of the data limits this number much further – the fundamental resolution of a genetic map is equivalent to the number of possible unique positions on the map. With a homozygous mapping population (binary data), this number is O(kn), where k is the number of linkage groups (clusters) [14]. Therefore, when the number of input points is much larger than the fundamental resolution, many points must be identical, which helps us filter out errors. Furthermore, the large data set size allows us to fill in missing data if we can cluster together points that belong to the same unique map position.

If we know two points belong to the same position on the genetic map, we can infer which values are errors and what the missing data should be. In Figure 2, for example, x3 and x6 belong to the same position, so we can fill in the missing data in both x3 and x6 based on each other's values. A similar idea applies to errors – the more points that belong to the same position, the clearer it becomes which values are errors, as long as ε is fairly low.

Figure 2: The fundamental resolution of genetic map data allows us to reduce the large-scale, incomplete, and noisy input to a more complete set of representative vectors that more clearly describe positions along the genetic map.

We have designed an algorithm that uses recursive bisection to quickly cluster together points in the same genetic map position. At each step, we use a maximum a posteriori (MAP) estimate of ε in order to find the best dimension along which to split the point set. The algorithm returns both an estimated error rate and a set of bins that represent unique positions on the genetic map. We showed that the number of bins and the error rate are consistent with existing real-world maps of wheat and barley [14].
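The sketch below illustrates the recursive bisection idea under stated assumptions: the split dimension is chosen to maximize observed disagreement, a crude proxy for the MAP estimate of ε used in [14], and each bin is summarized by a per-dimension majority vote, which both fills in missing values and outvotes erroneous entries. It reuses the MISSING marker from the LOD sketch.

from collections import Counter

def reduce_to_bins(points, max_disagreement=0.02):
    dims = range(len(points[0]))

    def disagreement(d):
        # Fraction of observed values at dimension d held by the minority.
        vals = [p[d] for p in points if p[d] is not MISSING]
        if len(set(vals)) < 2:
            return 0.0
        return min(Counter(vals).values()) / len(vals)

    # Crude stand-in for the MAP-based choice of split dimension in [14]:
    # split where the observed disagreement is largest.
    d_star = max(dims, key=disagreement)
    if disagreement(d_star) <= max_disagreement:
        # Points agree up to the error rate: emit one consensus bin,
        # filling each entry by majority vote over the observed values.
        consensus = []
        for d in dims:
            vals = [p[d] for p in points if p[d] is not MISSING]
            consensus.append(Counter(vals).most_common(1)[0][0] if vals else MISSING)
        return [consensus]
    left = [p for p in points if p[d_star] == 'a']
    right = [p for p in points if p[d_star] == 'b']
    rest = [p for p in points if p[d_star] is MISSING]
    # Simplification: points missing the split entry go to the larger side.
    (left if len(left) >= len(right) else right).extend(rest)
    return (reduce_to_bins(left, max_disagreement) +
            reduce_to_bins(right, max_disagreement))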
We also simulated genetic map data with realistic error rates and a variety of missing data rates. Our algorithm scaled linearly with data set size for all tested missing rates and error rates in synthetic data. As Figure 3 shows, the bin recall, or the fraction of bins we can recover perfectly, improves at each missing rate with more data. Note that although an error rate of ε = 0.5% seems low, it is actually much higher than encountered in practice.

Figure 3: Bin recall (percentage of bins found perfectly) at ε = 0.5%, for data set sizes from 12.5K to 1.6M at missing data rates of 15%, 35%, and 50%. As the data size grows, more map position vectors (bins) are recovered perfectly, while lower missing rates allow our algorithm to recover the bins with less data.

4. GENERALIZING TO DISCRETE-VALUED DATA WITH MISSING VALUES
Next, I will describe the last portion of my thesis work: clustering large discrete-valued data sets with missing values. One example of such data is found in the Recommender Systems (RS) domain, and much of the experimentation with these clustering methods will be on RS data.

My hypothesis is that a fundamental resolution exists in many discrete-valued data sets, and that it can be exploited to efficiently cluster these data at large scales. In the RS domain, we often have a discrete-valued input matrix with many missing values, as shown in Figure 4, where an entry X_ui represents the rating that user u gave to item i, with ratings typically taking values on a discrete scale. The methods for clustering and reducing the binary genetic map data can be adapted to this more general setting.

Figure 4: Fundamental resolution in Recommender Systems data equates to finding the vectors that best represent user sub-groups with low disagreement rates.

The fundamental resolution in RS data can be expressed as the number of unique user sub-groups that rate items very similarly. Recently, Christakopoulou et al. [5] have shown that utilizing user clusters to improve the prediction of rating values and to recommend better items is an extremely effective approach. An efficient, accurate clustering method for RS data thus has the potential to enhance such approaches to rating prediction, as well as to the top-n recommendation problem [1].

4.1 The LiRa Similarity Score for Recommender Systems
We have developed a statistical score analogous to the LOD score for RS data, called LiRa, based on a likelihood ratio. We assume that RS data has a fundamental resolution, and thus that users (rows of the input matrix) can be clustered into groups, with vectors representing the rating trends of each group. LiRa compares the likelihood of observing the values in two user vectors x_u and x_v assuming the users are in the same cluster, to the likelihood of observing the same data by chance, based on the differences in their rating values:

    LiRa(x_u, x_v) = log_10 [ p(differences in x_u and x_v | same cluster) / p(differences in x_u and x_v | pure chance) ]    (1)

LiRa generalizes the LOD score by assuming that differences in two discrete-valued vectors from the same cluster follow a particular multinomial distribution, which is used to compute the numerator. The LiRa score is useful in the RS setting because it leverages more data to make a more accurate judgment of similarity.

We have shown that using the LiRa score to find nearest neighbors in a k-nearest-neighbors approach outperforms the popular and widely used Pearson and cosine similarity scores in terms of rating prediction error [15]. I am currently expanding the clustering model used to compute the likelihood of users belonging to the same cluster in the LiRa score, which will be useful for efficient data reduction in the discrete-valued setting.
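A minimal sketch of a LiRa-style computation follows. The pure-chance term is exact for two independent uniform ratings on a 1-5 scale; the same-cluster multinomial over absolute rating differences is an invented illustrative choice that concentrates mass on small differences, whereas the model LiRa actually uses is specified in [15].

import math

def lira_score(xu, xv):
    scale = 5
    # Exact chance model: |difference| of two independent ratings drawn
    # uniformly from {1..5}; d = 0..4 has probability (5, 8, 6, 4, 2)/25.
    chance = [(scale if d == 0 else 2 * (scale - d)) / scale ** 2
              for d in range(scale)]
    # Assumed same-cluster model: small disagreements dominate.
    same = [0.55, 0.30, 0.10, 0.04, 0.01]

    score, shared = 0.0, 0
    for a, b in zip(xu, xv):
        if a is None or b is None:  # skip items not rated by both users
            continue
        d = abs(a - b)
        score += math.log10(same[d] / chance[d])
        shared += 1
    return score if shared else 0.0

# Rows x1 and x7 of the input matrix in Figure 4 agree closely on their
# four co-rated items, so their score is positive.
x1 = [5, None, 1, 1, None, 1, 4, None, 2, None]
x7 = [4, None, None, 1, 4, None, 4, None, 3, None]
print(lira_score(x1, x7))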
4.2 Generalizing Efficient Data Reduction to the Discrete-Valued Domain
As noted previously, my goal is to generalize my previous work on efficient clustering and data reduction for genetic map data to the more general setting of large-scale, discrete-valued data with missing values and noise. The LiRa score from Section 4.1 is the first step in this direction, and it can be used to produce an initial clustering of the input using a thresholding scheme similar to that of the BubbleCluster algorithm. The threshold LiRa score within which points will belong to the same cluster will rely on the clustering model used to represent the data. For RS data, a working model is already presented in previous work [15].

After an initial fast clustering, I hope to generalize the data reduction stage to discrete-valued data as well. Here, future work involves more precisely defining the point at which the fundamental resolution has been reached. In RS data, the idea is to cluster together users who have very similar rating patterns. One possibility is to only cluster users together if the distribution of their rating differences follows a clustering model. For example, in Cluster 1 in Figure 4, only one pair of ratings for the same item differs significantly: x13 = 1 and x33 = 5, giving a rating difference of 4; the remaining ratings are all close together. The recursive bisection method for data reduction in genetic mapping can be modified to this more general case, by dividing user clusters with large differences in rating values, based on MAP estimates of the proportion of each rating difference.
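As a sketch of this splitting criterion, the following function compares a cluster's empirical distribution of pairwise rating differences against an assumed same-cluster model, and flags the cluster for bisection when the two deviate too much. The model vector and the total-variation tolerance are illustrative assumptions standing in for the MAP-based test described above.

from itertools import combinations

# Assumed same-cluster distribution over absolute rating differences
# 0..4 (the same illustrative model as in the LiRa sketch above).
SAME_CLUSTER_MODEL = [0.55, 0.30, 0.10, 0.04, 0.01]

def should_split(cluster, model=SAME_CLUSTER_MODEL, tol=0.05):
    counts = [0] * len(model)
    total = 0
    for u, v in combinations(cluster, 2):
        for a, b in zip(u, v):
            if a is None or b is None:
                continue
            counts[abs(a - b)] += 1
            total += 1
    if total == 0:
        return False  # no co-rated items, so no evidence for a split
    # Total variation distance between the empirical distribution of
    # rating differences and the assumed same-cluster model.
    tv = 0.5 * sum(abs(c / total - m) for c, m in zip(counts, model))
    return tv > tol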
I am currently formalizing the notion of fundamental resolution in the general case, and experimenting to find the best clustering model for RS data. As the final piece of my thesis, I hope to demonstrate that these efficient clustering and data reduction methods can be applied to more general data sets, and that they will be useful in the RS domain for rating prediction.

5. CONCLUSION
I have shown that the concept of fundamental resolution can be exploited to design efficient and accurate clustering algorithms in the genetic mapping domain, and I am extending this concept to the more general setting of discrete-valued data. The methods presented here are useful for applications with large-scale, discrete input data with many missing values, such as that found in the Recommender Systems domain. Future directions beyond my thesis work include exploring the connection between fractal dimension and fundamental resolution, as well as defining new clustering models for data sets with a low fundamental resolution.

6. ACKNOWLEDGMENTS
This work is supported by the Applied Mathematics Program of the DOE Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231 and by NSF Award CCF-1637564.

7. REFERENCES
[1] D. C. Anastasiu, E. Christakopoulou, S. Smith, M. Sharma, and G. Karypis. Big data and recommender systems. 2016.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
[3] J. A. Chapman, M. Mascher, A. Buluç, K. Barry, E. Georganas, A. Session, V. Strnadova, J. Jenkins, S. Sehgal, L. Oliker, et al. A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome. Genome Biology, 16(1):26, 2015.
[4] J. Cheema and J. Dicks. Computational approaches and software tools for genetic linkage map estimation in plants. Briefings in Bioinformatics, 10(6):595-608, 2009.
[5] E. Christakopoulou and G. Karypis. Local item-item models for top-n recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 67-74. ACM, 2016.
[6] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226-231, 1996.
[7] J. Gan and Y. Tao. DBSCAN revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 519-530. ACM, 2015.
[8] A. Gionis, A. Hinneburg, S. Papadimitriou, and P. Tsaparas. Dimension induced clustering. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 51-60. ACM, 2005.
[9] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
[10] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.
[11] F. Lin and W. W. Cohen. Power iteration clustering. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 655-662, 2010.
[12] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697-702, 2009.
[13] V. Strnadová, A. Buluç, J. Chapman, J. R. Gilbert, J. Gonzalez, S. Jegelka, D. Rokhsar, and L. Oliker. Efficient and accurate clustering for large-scale genetic mapping. In Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on, pages 3-10. IEEE, 2014.
[14] V. Strnadová-Neeley, A. Buluç, J. Chapman, J. R. Gilbert, J. Gonzalez, and L. Oliker. Efficient data reduction for large-scale genetic mapping. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, pages 126-135. ACM, 2015.
[15] V. Strnadová-Neeley, A. Buluç, J. R. Gilbert, L. Oliker, and W. Ouyang. LiRa: A new likelihood-based similarity score for collaborative filtering. arXiv preprint arXiv:1608.08646, 2016.
[16] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645-678, 2005.
[17] Y. W. Yu, N. M. Daniels, D. C. Danko, and B. Berger. Entropy-scaling search of massive biological data. Cell Systems, 1(2):130-140, 2015.