Retrieval of an optimal set of subspace clusters for effective similarity search in high-dimensional spaces

© Ivan Sudos
Saint-Petersburg State University, Saint-Petersburg
iv.teh.adr@gmail.com

Proceedings of the 14th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" (RCDL-2012), Pereslavl-Zalesskii, Russia, October 15-18, 2012.

Abstract

High-dimensional data is often analysed through its distribution properties in subspaces. Subspace clustering is a powerful method for eliciting the features of high-dimensional data, and its result can be an essential basis for building indexing structures and for subsequent data search. However, a large number of subspaces and data instances can conceal a large number of subspace clusters, some of which are difficult to exploit within a search algorithm. This paper presents a model of a generic indexing approach based on detected subspace clusters, and a way to find an optimal set of clusters that yields an acceptable tradeoff between search speed and relevance.

1 Introduction

Search and clustering are the two most extensive problems of data analysis in high-dimensional spaces. Since the roots of the complexity arising in this domain were established, plenty of its particular aspects have been elicited and studied. Many approaches to high-dimensional problems address only these particular complexity aspects, as it is quite sophisticated, and detached from practice, to solve the problem in general. In the last twenty years a number of solutions have been proposed for particular problems of searching and clustering in high-dimensional spaces. Clustering and indexing are strongly associated with each other in high-dimensional spaces: resolving indexing problems can require resolving clustering problems. Since we consider only search and clustering problems, we will refer to a high-dimensional vector space as a search space.

The general challenge for search and clustering in high-dimensional spaces is called the "curse of dimensionality", first stated by R. Bellman [1]. It has two key aspects. The first lies in the following fact: as the number of dimensions grows, the information under analysis in the search space becomes cumbersome. The second is the metric-related problem: in high dimensions we often cannot state whether two vectors are similar or different. In most cases we cannot build reliable algorithms and data structures (indexes) that handle exact-match search with acceptable latency, so here we consider approximate similarity search in the first place.

Most approaches to search and indexing problems can be divided into the following categories:

1. Adaptations of low-dimensional algorithms, such as those using R-trees. Here one tries to fix particular problems of search algorithms designed for low-dimensional search spaces to make them somehow feasible in a high-dimensional space. However, such methods tend to work acceptably only for a relatively low number of dimensions (not exceeding a few dozen).

2. Algorithms based on the data distribution. These algorithms take the distribution properties of the data into account. A number of them employ dimensionality reduction techniques, such as principal component analysis or subspace clustering, to fight the curse of dimensionality.

3. Algorithms based on random projections. These algorithms try to decrease the volume of information scanned in the search space by grouping its elements with a degree of randomness. Some realizations of locality-sensitive hashing [16] belong to this category.

All of these approaches try, in one way or another, to overcome the curse of dimensionality: they reduce the time complexity of search, increase relevance, or both. In this paper we consider the second category of solutions. Here we have a point of contact with the clustering problem, since analysis of the data distribution is closely related to clustering.

Clustering in high dimensions admits several approaches, and each approach can use its own cluster model. This paper considers only subspace and projection clustering approaches [2]. In accordance with [2], subspace clustering aims to find all clusters in all possible subspaces, whereas projection clustering assigns each vector to exactly one subspace cluster. For example, a set of photos can be placed into one cluster if a projection clustering algorithm finds no difference in their color histogram characteristics. At the same time, a subspace clustering algorithm will form clusters (if any exist) for all possible characteristics: histograms, shapes, gradients, etc. Although this looks less flexible than principal-component-analysis methods, which can detect arbitrary manifolds or clusters in non-axis-parallel dimensions, subspace clustering has one important advantage: the locality property [3]. It means that subspace clustering algorithms can determine a set of relevant dimensions (a relevant subspace) locally, for each part of the search space or each subset of data vectors; a small illustration of the two cluster models is sketched below.

Our goal is to understand how clustering results can be coupled with similarity search in high-dimensional spaces. This paper introduces a generic approach to utilizing detected subspace clusters within search. We state a key optimization problem that allows one to find the best tradeoff between search relevance and speed. It implies selecting the best subset of subspace clusters that provides the best relevance with a guaranteed minimal retrieval complexity.
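To make the distinction between the two cluster models concrete, the following sketch replays the photo example above. The data, the feature layout, and both cluster sets are purely hypothetical; a cluster is represented, as in Section 3 below, by a pair of a vector set and a dimension set.

```python
# Toy 5-dimensional photo descriptors: dimensions 0-2 are assumed to hold
# color histogram features, dimensions 3-4 shape features.
photos = {
    "p1": [0.9, 0.1, 0.0, 0.5, 0.5],
    "p2": [0.9, 0.1, 0.1, 0.5, 0.5],
    "p3": [0.1, 0.8, 0.1, 0.5, 0.5],
}

# Projection clustering: every vector lands in exactly one cluster,
# each cluster paired with its own locally relevant subspace.
projection_result = [
    ({"p1", "p2"}, {0, 1, 2}),  # indistinguishable color histograms
    ({"p3"},       {0, 1, 2}),
]

# Subspace clustering: clusters are reported in every subspace where the
# vectors agglomerate, so the same vector may occur in several clusters.
subspace_result = [
    ({"p1", "p2"},       {0, 1, 2}),  # cluster in the histogram subspace
    ({"p1", "p2", "p3"}, {3, 4}),     # cluster in the shape subspace
]

# "p1" belongs to two subspace clusters but to a single projection cluster:
print([dims for members, dims in subspace_result if "p1" in members])
```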
2 Related works

Many works address indexing and clustering separately. Applications of clustering to building a search index are implicitly described in works on nearest neighbours in high-dimensional spaces. Some works show how subspace clustering results can be used as the base for tree-like index structures, although a link between the parameters of the detected clusters and the efficiency of nearest-neighbour search is not presented. Several papers consider the quality of the subspace clustering process from the point of view of data redundancy.

Here we first state the optimization problem that arises when an index structure is based on detected subspace clusters. The problem links search effectiveness (speed and relevance) with the properties of the detected clusters; thereby our work aims to link subspace clustering and nearest-neighbour search. Index structures can be based on detected subspace clusters in various ways, yet existing indexing approaches do not consider the effects of the underlying clustering. Some papers that consider indexing and clustering problems in high-dimensional spaces are reviewed below.

Indexing: [5] introduces Bregman ball trees, an index structure that reconsiders ball trees using a Bregman divergence instead of a classic metric. Though it is not supposed to be used in high dimensions, it introduces a feasible concept of applying Bregman divergences, instead of metrics, to known indexing structures. With distance functions such as Bregman divergences in use, the search can become more complicated: the computational cost of such functions is higher than that of simple functions like the Euclidean metric (see the sketch at the end of this subsection). This aspect is taken into account in this paper.

The X-tree [6], a spatial tree based on hyper-rectangle partitioning of the search space, shows how the well-known low-dimensional index structure, the R-tree, can be adapted to a relatively high number of dimensions by rejecting rectangle overlap. However, the X-tree has its capability limits. It is an example of a case where taking all dimensions into account simultaneously leads to an index structure whose size is of the same order as the data itself.
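To make the cost argument above concrete, the sketch below evaluates a generic Bregman divergence, D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, for two textbook generators. Nothing here is specific to the construction in [5]; it only illustrates why such divergences are costlier to evaluate than the Euclidean metric.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """Generic Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - float(np.dot(grad_phi(y), x - y))

# phi(x) = ||x||^2 recovers the squared Euclidean distance...
sq_norm = lambda x: float(np.dot(x, x))
sq_norm_grad = lambda x: 2.0 * x

# ...while phi(x) = sum_i x_i log x_i yields the (generalized) KL divergence,
# whose logarithms make it dearer to compute than the Euclidean metric.
neg_entropy = lambda x: float(np.sum(x * np.log(x)))
neg_entropy_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.3, 0.3, 0.4])
print(bregman_divergence(sq_norm, sq_norm_grad, x, y))          # ||x - y||^2 = 0.02
print(bregman_divergence(neg_entropy, neg_entropy_grad, x, y))  # KL(x || y) for distributions
```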
Clustering: The clustering techniques used in this paper refer to projection and subspace clustering. The key survey of clustering algorithms in high dimensions was conducted in [2]; the algorithms are categorized and compared there. The main groups distinguished are axis-parallel subspace and projection clustering, arbitrarily oriented subspace clustering, and pattern-based clustering. We are also interested in descriptions of particular subspace and projection clustering algorithms and of the clustering models they use [3]. A survey [7] compares the performance of different subspace clusterings; clustering quality evaluation is regarded there along with results of the kind produced by the proposed optimization problem's solution.

Clustering within indexing: A recent work [4] shows how a tree-like indexing structure can be built in a high-dimensional space atop a known set of clusters. This work does not assume any special method of clustering and thus does not consider how the clustering influences the efficiency of search. Nevertheless, this search approach helps to evaluate clustering results experimentally.

3 Subspace clustering and similarity search

The common goal of indexing is to prune the search space by compressing it and/or by eliciting relations between parts of the data (trees and space partitioning). Approaches such as locality-sensitive hashing and VA-files are compression techniques [8] that approximate a group of nearby vectors with a single object. These techniques also suffer from the curse of dimensionality in high-dimensional spaces: locality-sensitive hashing leads to low relevance due to the indiscernibility of distances, and space partitioning is not feasible, since the number of hyper-rectangles can exceed or approach the number of data vectors, making the search no less complex than an exhaustive pass through all the data. Clustering suffers from the same indiscernibility of distances in high dimensions, as described above. In our model we consider subspace clustering as a means of data compression to be used as the base of an indexing structure. This consideration is maximally abstracted from the complete indexing structure and from the similarity search algorithm.

Thus we assume that a search algorithm and an index structure operate on a set of objects that represent grouped (clustered) data and perform retrieval over this smaller space, hence with smaller precision. This requires a function that calculates the relevance of a cluster for a query q, and this function should avoid a full scan of the cluster members, as that leads to exhaustive search. The search algorithm is therefore assumed to do the following:

• Consider subspace clusters as the primary result of data compression.

• Use some approximate, sufficiently fast distance function to calculate the relevance of a given subspace cluster with respect to a given query vector q.

• Find the most relevant cluster(s) and take them into further consideration.

The following considerations show how the features of subspace clustering affect the tradeoff between search speed and relevance.

Let a subspace cluster c = {V, S} be any subset V of the data vectors such that the maximum deviation, within the subspace S, of any v ∈ V from the rest of V is less than a given threshold h, and V contains no fewer than p elements. A cluster is thus any clot of vectors in any subspace with bounded density and a bounded number of elements; the definition translates directly into the membership test sketched below.
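The sketch below checks this definition for a given V, S, h, and p. The function name is ours, and "deviation" is read here as the Manhattan distance, within S, from a vector to the centroid of the remaining members; this is one possible interpretation of the wording above, not the only one.

```python
import numpy as np

def is_subspace_cluster(V, S, h, p):
    """Check the cluster definition of Section 3: V is a cluster in subspace S
    if |V| >= p and every v in V deviates from the rest of V by less than h
    within the dimensions of S."""
    V = np.asarray(V, dtype=float)
    S = sorted(S)
    if len(V) < p:
        return False
    proj = V[:, S]                       # restrict all vectors to subspace S
    for i in range(len(proj)):
        rest = np.delete(proj, i, axis=0)
        centroid = rest.mean(axis=0)     # the "rest of V" summarized by its centroid
        if np.abs(proj[i] - centroid).sum() >= h:
            return False
    return True

# A tight clot in dimensions {0, 2} of a 4-dimensional space:
V = [[1.0, 9.0, 2.0, -3.0],
     [1.1, 0.0, 2.1,  7.0],
     [0.9, 5.0, 1.9,  2.0]]
print(is_subspace_cluster(V, {0, 2}, h=0.5, p=3))   # True
print(is_subspace_cluster(V, {0, 1}, h=0.5, p=3))   # False: dimension 1 scatters
```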
The first consideration is the measure of data compression provided by subspace clustering. The general assumption about the query is that it is a uniformly distributed vector in the search space. We assume that the complexity of the search algorithm is a monotonic non-decreasing function f(N, q), where N is the number of objects in the search space and q is the query vector. Once subspace clustering compression is performed, N denotes the number of detected subspace clusters used by the search algorithm.

Irrespective of the retrieval algorithm, we assume the need to calculate the relevance of any detected cluster for a query q. So Q(q, c) is a function that calculates the relevance of c with respect to q; let Q ∈ O(g(|V|, dim(S))).

The second consideration is the measure of relevance of the detected clusters. Again, suppose we have a uniformly distributed query q. Let C = {c_1, ..., c_N} be the set of detected subspace clusters. Each subspace cluster c_i represents a pair {O_i, S_i} of the set of vectors it contains, V = {v_1, ..., v_k}, and a set of dimensions S = {d_1, ..., d_l} that determines a subspace. Let us introduce the relevance of a given cluster c of size k for a given query q:

\[
R(c, q) = R_{\dim}(c, q) \cdot \frac{1}{k} \sum_{v \in V} \frac{1}{\operatorname{dist}(q, v)}
\]

The right-hand side is the product of the subspace relevance function R_dim(c, q), which represents the relevance of the subspace in which the data forms the cluster c, and the average inverted distance from the query to the members v of the cluster. We adopt the Manhattan and the Euclidean distance functions here, as they were shown [9] to be the only ones suitable for distance measurement in high-dimensional spaces.

The calculation of the relevance of a cluster c with respect to a query q should avoid iterating through all of the cluster's members, so the relevance calculation function is supposed to be approximate. With this in mind we introduce the approximate cluster relevance

\[
R_{\mathrm{approx}}(c, q) = R_{\dim}(c, q) \, Q(c, q)
\]

where Q(c, q) is an approximate distance function. Let us denote its complexity as g(|V|, |S|) = g_q(|V|, |S|, V, S). An example of such a function is shown below.
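The following sketch instantiates both relevance functions under explicit assumptions of ours: R_dim is taken as the trivial constant 1, dist is the Manhattan distance restricted to the cluster's subspace, and Q(c, q) is a centroid-based surrogate that touches a single precomputed vector, so its cost g is O(|S|) rather than the O(|V|·|S|) of the exact formula. None of these specific choices is fixed by the paper; they only make the formulas above concrete.

```python
import numpy as np

def manhattan(a, b):
    return float(np.abs(a - b).sum())

class SubspaceCluster:
    def __init__(self, vectors, dims):
        self.V = np.asarray(vectors, dtype=float)        # member vectors
        self.S = sorted(dims)                            # relevant subspace
        self.centroid = self.V[:, self.S].mean(axis=0)   # precomputed for Q

def r_dim(c, q):
    return 1.0   # placeholder subspace relevance R_dim(c, q)

def relevance_exact(c, q):
    """R(c, q): R_dim times the average inverted distance to the members.
    Requires a full scan of the cluster: O(|V| * |S|)."""
    qs = np.asarray(q, dtype=float)[c.S]
    inv = [1.0 / manhattan(qs, v[c.S]) for v in c.V]     # assumes q is not a member
    return r_dim(c, q) * sum(inv) / len(c.V)

def relevance_approx(c, q):
    """R_approx(c, q) = R_dim(c, q) * Q(c, q), with a centroid-based Q
    touching one precomputed vector: g(|V|, |S|) = O(|S|)."""
    qs = np.asarray(q, dtype=float)[c.S]
    return r_dim(c, q) / manhattan(qs, c.centroid)       # assumes q is not the centroid

c = SubspaceCluster([[1.0, 9.0, 2.0], [1.2, 0.0, 2.2]], dims=[0, 2])
q = [1.05, 4.0, 2.0]
print(relevance_exact(c, q), relevance_approx(c, q))
```

A search then scores all N clusters with the cheap function, picks the argmax, and inspects only that cluster's members, which is exactly the role Q plays in the optimization problems stated next.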
The general idea of this paper is to understand how the set of detected clusters should be selected so as to obtain the required ratio of search speed and relevance. Denote the set of all possible subspace clusters for a given search space by C. A subset C* of C is the argument of optimization, and the optimization problem is to obtain a C* that produces the best result, in terms of retrieval speed and relevance, for a random query q.

The first optimization problem:

\[
\begin{cases}
E(R(c^*, q)) \to \max \\
c^* = \arg\max_{c \in C^*} R_{\mathrm{approx}}(c, q) \\
N \le N_{\max} \\
E(g_q(|V|, |S|)) \le \gamma \quad \forall q
\end{cases}
\tag{1}
\]

The goal of the first optimization problem is to maximize the mean relevance of the most suitable cluster, determined by a given relevance calculation function R_approx, for a random query q. The constraints keep the number of objects under analysis (subspace clusters) below a given bound N_max and keep the calculation complexity of R_approx within a given bound.

Let us introduce another optimization problem:

\[
\begin{cases}
\min_{q \in H} R(c^*, q) \ge R_{\min} \\
c^* = \arg\max_{c \in C^*} R_{\mathrm{approx}}(c, q) \\
N \to \min \\
E(g_q(|V|, |S|)) \to \min \quad \forall q
\end{cases}
\tag{2}
\]

The goal of the second problem is to find the simplest cluster set that keeps the relevance rate within the given bounds. The latter problem is of less interest, since it is difficult for a user to assign relevance constraints a priori. The stated problems imply the calculation of mean relevance, which is nearly impossible to perform over the whole search space. The introduction of mean relevance means the following: the goal of the optimization is to select a subset of subspace clusters such that the mean value of the real relevance of the chosen cluster for a given query is highest when its approximate relevance value is highest.

These optimization problems can be formulated for low-dimensional spaces as well. There, however, the absence of the need to take subspaces into account, together with simple mechanisms of query point classification (like MBR []), leads to a simple solution: considering only the set of clusters that satisfy a given density and size. For high dimensions the most principal difference is the number of possible subspace clusters. The number of detected clusters in all subspaces can be significantly larger than in a low-dimensional space, because a d-dimensional space has 2^d possible subspaces. If all possible [...]

[...] quality. There are three known groups of methods for the automatic evaluation of the dimensionality relevance of a cluster:

• Rate subspaces, and thus subspace clusters, higher if their dimensionality is higher [10], [2].

• Introduce a generic cost function K(O, S) that rates the relevancy of a given subspace cluster (O, S) [10], [11].

• Measure cluster separability within a given subspace [9]. This means that a subspace cluster (O, S) is rated higher if each vector in O is far enough, in the subspace S, from any vector that is not in the cluster.

The first approach is fair enough in the case of nearest-neighbour retrieval: the user is interested in matching all coordinates of a given query vector q as long as no particular subset of dimensions is specified. In that case the dimensional relevance of a cluster can be determined as R(c) = R(S, O) = |S|.

The second approach can be used to assign weights to particular dimensions or vectors. It is sensible in the case of specific origins of the data. Consider vectors in a finite-dimensional vector space obtained by orthogonal projection of time-series data from an infinite-dimensional space. The user can then be interested in a uniform sample in time; on the other hand, a sequential dense subset of time points can be more relevant. In these cases K can be denoted as [...]