      Retrieval of an optimal subspace cluster set for effective
            similarity search in high-dimensional spaces

                                                 © Ivan Sudos
                                      Saint Petersburg State University
                                                Saint Petersburg
                                             iv.teh.adr@gmail.com

 Proceedings of the 14th All-Russian Conference “Digital Libraries: Advanced Methods and
 Technologies, Digital Collections” — RCDL-2012, Pereslavl Zalesskii, Russia, October 15-18, 2012.


                      Abstract

    High-dimensional data is often analysed by means of its distribution properties in subspaces. Subspace clustering is a powerful method for eliciting the features of high-dimensional data. The result of subspace clustering can be an essential basis for building indexing structures and for subsequent data search. However, a large number of subspaces and data instances can conceal a large number of subspace clusters, some of which are difficult to analyse within a search algorithm. This paper presents a model of a generic indexing approach based on detected subspace clusters, and a way to find an optimal set of clusters that gives an acceptable tradeoff between search speed and relevance.


1   Introduction

Search and clustering are the two most extensive problems of data analysis in high-dimensional spaces. Since the roots of the complexity arising in this domain were established, plenty of its particular aspects have been elicited and studied. Many approaches to high-dimensional problems address only these particular complexity aspects, since solving the problem in general is quite sophisticated and detached from practice. In the last twenty years a number of solutions have been proposed for particular problems of searching and clustering in high-dimensional spaces. Clustering and indexing are strongly associated with each other in high-dimensional spaces: resolving indexing problems can create the need to resolve clustering problems.
    Considering only the search and clustering problems, we will refer to a high-dimensional vector space as a search space. The general challenge for search and clustering in high-dimensional spaces is called the “curse of dimensionality”, first stated by R. Bellman [1]. It has two key aspects. The first one lies in the following fact: as the number of dimensions grows, the information under analysis in the search space becomes cumbersome. The second one is the metric-related problem: in high dimensions we often cannot state whether two vectors are similar or different.
    In most cases we cannot build reliable algorithms and data structures (indexes) that handle exact-match search with acceptable latency. Therefore we consider, in the first place, approximate similarity search.
    Most of the approaches to the search and indexing problems can be divided into the following categories:

  1. Adaptations of low-dimensional algorithms, such as those that use R-trees. Here we try to fix particular problems of a search algorithm designed for low-dimensional search spaces in order to make it somehow feasible in a high-dimensional space. However, this tends to work acceptably only for a relatively low number of dimensions (not exceeding a few dozen).

  2. Algorithms based on data distribution. These algorithms take the distribution properties of the data into account. A number of them employ dimensionality reduction techniques such as principal component analysis or subspace clustering to fight the curse of dimensionality.

  3. Algorithms based on random projections. These algorithms try to decrease the volume of information scanned in the search space by grouping its elements with a degree of randomness. Some implementations of locality-sensitive hashing [16] belong to this category.

All of these approaches try, in one way or another, to overcome the curse of dimensionality. This means they try to reduce the time complexity of search, to increase relevance, or to do both. In this paper we consider the second category of solutions. Here we have a point of contact with the clustering problem. Analysis of data distribution is closely related to clustering. Clustering in high dimensions admits several approaches, and each approach can rely on its



own cluster model. This paper stays only with subspace and projection clustering approaches [2]. According to [2], subspace clustering aims to find all clusters in all possible subspaces, while projection clustering assigns each vector to exactly one subspace cluster. For example, a set of photos can be placed into one cluster if a projection clustering algorithm finds no difference in their colour histogram characteristics. At the same time, a subspace clustering algorithm will form clusters (if any exist) for all possible characteristics: histograms, shapes, gradients, etc. Although this looks less flexible than principal component analysis methods, which can detect arbitrary manifolds or clusters in non-axis-parallel dimensions, subspace clustering has one important advantage: the locality property [3]. It means that subspace clustering algorithms can determine a set of relevant dimensions (a relevant subspace) locally for each part of the search space or each subset of data vectors.
    Our goal is to understand how clustering results can be coupled with similarity search in high-dimensional spaces. This paper introduces a generic approach to utilizing detected subspace clusters within search. We state a key optimization problem that allows finding the best tradeoff between search relevance and speed. This implies selecting the subset of subspace clusters that provides the best relevance with a guaranteed minimal retrieval complexity.

2   Related works

Many works have been presented on indexing and clustering separately. Applications of clustering to building a search index are implicitly described in works on nearest neighbours in high-dimensional spaces. Some works show how subspace clustering results can be used as a basis for tree-like index structures, although a link between the parameters of the detected clusters and the efficiency of nearest-neighbour search is not presented. Several papers consider the quality of the subspace clustering process from the point of view of data redundancy.
    Here we first state the optimization problem that arises when an index structure is based on detected subspace clusters. The problem links search effectiveness (speed and relevance) with the properties of the detected clusters. Thereby our work aims to link subspace clustering and nearest-neighbour search. Index structures can be based on detected subspace clusters in various ways; however, existing indexing approaches do not consider the effects of the underlying clustering.
    Some papers that consider indexing and clustering problems in high-dimensional spaces are reviewed below.
    Indexing: [5] introduces the Bregman ball tree index structure, a reconsideration of ball trees that uses Bregman divergences instead of a classic metric. Though it is not intended for high dimensions, it introduces a feasible concept of applying Bregman divergences to known indexing structures in place of metrics. With distance functions such as Bregman divergences in use, the search can become more complicated: the computational cost of such functions is higher than that of simple functions like the Euclidean metric. This aspect is taken into account in this paper.
    The X-tree [6], a spatial tree based on hyper-rectangle partitioning of the search space, shows how the well-known low-dimensional index structure R-tree can be adapted to a relatively high number of dimensions by rejecting rectangle overlaps. However, the X-tree has its capability limits. This is an example of a case where taking all dimensions into account simultaneously leads to an index structure whose size is of the same order as the data.
    Clustering: The clustering techniques used in this paper refer to projection and subspace clustering. The key survey of clustering algorithms in high dimensions was conducted in [2], where algorithms are categorized and compared. The main groups distinguished are axis-parallel subspace and projection clustering, arbitrarily oriented subspace clustering, and pattern-based clustering. We are also interested in descriptions of particular subspace and projection clustering algorithms and the clustering models they use [3]. A survey compares the performance of different subspace clustering algorithms. Clustering quality evaluation is considered along with the results of the proposed optimization problem solution [7].
    Clustering within indexing: A recent work [4] shows how a tree-like indexing structure can be built in a high-dimensional space atop a known set of clusters. This work does not imply a special clustering method and thus does not consider how clustering influences search efficiency. Nevertheless, this search approach helps to evaluate clustering results experimentally.

3   Subspace clustering and similarity search

The common goal of indexing is to prune the search space by compression and/or to elicit relations between parts of the data (trees and space partitioning). Such approaches as locality-sensitive hashing and VA-files are compression techniques [8] that allow a group of nearby vectors to be approximated by a single object. These techniques also suffer from the curse of dimensionality in high-dimensional spaces. Locality-sensitive hashing leads to low relevance



due to distance invisibility. Space partitioning is not feasible, since the number of hyper-rectangles can exceed or become close to the number of data vectors, so that the search becomes no less complex than an exhaustive pass through all the data. Clustering suffers from distance invisibility in high dimensions, as described above. In our model we consider subspace clustering as a way to perform data compression, to be used as the basis of an indexing structure. This consideration is maximally abstracted from the complete indexing structure and from the similarity search algorithm. Thus we assume that the search algorithm and the index structure operate on a set of objects that represent the grouped (clustered) data and perform retrieval on this smaller space, hence with smaller precision. This requires a function that calculates the relevance of a cluster for a query q, and this function should avoid a full scan of the cluster members, as that would lead to an exhaustive search. So far the search algorithm is assumed to do the following (a minimal sketch follows this list):

   • Consider subspace clusters as the primary result of data compression.

   • Use some approximate, fast enough distance function to calculate the relevance of a given subspace cluster with respect to a given query vector q.

   • Find the most relevant cluster(s) and take them into further consideration.
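The following Python sketch illustrates this assumed cluster-level search loop. The helper names (approx_relevance, refine_within) and the decision to keep only the top few clusters are illustrative assumptions, not a procedure prescribed by this paper.

    from typing import Callable, List, Sequence, Tuple

    Vector = Sequence[float]
    Cluster = Tuple[List[Vector], List[int]]   # (member vectors V, subspace dimensions S)

    def cluster_level_search(q: Vector,
                             clusters: List[Cluster],
                             approx_relevance: Callable[[Cluster, Vector], float],
                             refine_within: Callable[[Cluster, Vector], Vector],
                             top_k: int = 3) -> List[Vector]:
        """Search over detected subspace clusters instead of raw vectors:
        score every cluster with a cheap approximate relevance function,
        keep the top_k most relevant clusters, and refine only inside them."""
        # Rank the clusters by their approximate relevance to the query.
        ranked = sorted(clusters, key=lambda c: approx_relevance(c, q), reverse=True)
        # Run the (more expensive) refinement only inside the selected clusters.
        return [refine_within(c, q) for c in ranked[:top_k]]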
The following considerations show how the features of subspace clustering affect the tradeoff between search speed and relevance.
    Let a subspace cluster c = {V, S} be any subset V of data vectors such that the maximum deviation in the subspace S of any v ∈ V from the rest of V is less than a given threshold h, and V contains not less than p elements. Thus a cluster is any clot of vectors in any subspace with bounded density and a bounded number of elements.
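As an illustration, the following Python check verifies this cluster condition for a candidate set of vectors. Reading the deviation as the per-coordinate distance from the cluster centroid within the subspace S is our assumption; the definition above leaves the exact deviation measure generic.

    from typing import List, Sequence

    def is_subspace_cluster(V: List[Sequence[float]], S: List[int],
                            h: float, p: int) -> bool:
        """Check that V forms a subspace cluster in the dimensions S:
        the cluster has at least p members and every member deviates
        from the centroid by less than h in each dimension of S."""
        if len(V) < p:
            return False
        # Centroid restricted to the subspace dimensions S.
        centroid = {d: sum(v[d] for v in V) / len(V) for d in S}
        # Largest per-coordinate deviation of any member within S.
        max_dev = max(abs(v[d] - centroid[d]) for v in V for d in S)
        return max_dev < h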
    The first consideration is the measure of data compression provided by subspace clustering. The general assumption about the query is that it is a uniformly distributed vector in the search space. We assume that the complexity of the search algorithm is a monotonic non-decreasing function f(N, q) of N, where N is the number of objects in the search space and q is the query vector. Since subspace clustering compression is performed, N denotes the number of detected subspace clusters used by the search algorithm. Irrespective of the retrieval algorithm, we assume the need to calculate the relevance of a detected cluster for a query q. So Q(q, c) is a function that calculates the relevance of c with respect to q. Let Q ∈ O(g(|V|, dim(S))).
    The second consideration is the measure of relevance of the detected clusters. Again, suppose we have a uniformly distributed query q. Let C = {c1, ..., cN} be the set of detected subspace clusters. Each subspace cluster ci is a pair {Oi, Si} consisting of the set of vectors it contains, V = {v1, ..., vk}, and the set of dimensions S = {d1, ..., dl} that determines its subspace. Let us introduce the relevance of a given cluster c of size k for a given query q:

        R(c, q) = Rdim(c, q) · (1/k) · Σ_{v ∈ V} 1 / dist(q, v)

The right-hand side is a product of the subspace relevance function Rdim(c, q), which represents the relevance of the subspace in which the data forms the cluster c, and the average inverted distance from the query to the members v of the cluster. We accept the Manhattan and Euclidean distance functions here, as they were proved [9] to be the only ones suitable for distance measurement in high-dimensional spaces.
    The calculation of the relevance of a cluster c with respect to a query q should avoid iterating through all cluster members, so the relevance calculation function is supposed to be approximate. With this consideration, we introduce the approximate cluster relevance as

        Rapprox(c, q) = Rdim(c, q) · Q(c, q)

where Q(c, q) is an approximate distance function. Let us denote its complexity as g(|V|, |S|) = gq(|V|, |S|, V, S). Further on we will show examples of such functions.
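To make the two quantities concrete, here is a small Python sketch of the exact relevance R(c, q) and of one possible approximate variant Rapprox(c, q). Using the centroid of the cluster, restricted to its subspace, as the single representative inside Q(c, q) is our own illustrative choice; the text above only requires Q to avoid scanning all members.

    from typing import Callable, List, Sequence, Tuple

    Vector = Sequence[float]
    Cluster = Tuple[List[Vector], List[int]]   # (members V, subspace dimensions S)

    def manhattan(q: Vector, v: Vector, dims: Sequence[int]) -> float:
        """Manhattan distance restricted to the given dimensions."""
        return sum(abs(q[d] - v[d]) for d in dims)

    def exact_relevance(c: Cluster, q: Vector,
                        r_dim: Callable[[Cluster, Vector], float]) -> float:
        """R(c, q): subspace relevance times the average inverted distance
        from q to every member of the cluster (full scan, O(|V|·|S|))."""
        V, S = c
        # The small 1e-9 term only guards against a zero distance.
        avg_inv_dist = sum(1.0 / (manhattan(q, v, S) + 1e-9) for v in V) / len(V)
        return r_dim(c, q) * avg_inv_dist

    def approx_relevance(c: Cluster, q: Vector,
                         r_dim: Callable[[Cluster, Vector], float]) -> float:
        """Rapprox(c, q): replaces the member scan with a single distance to
        the cluster centroid; in a real index the centroid would be
        precomputed when the cluster is built, making Q(c, q) cost O(|S|)."""
        V, S = c
        centroid = [sum(v[d] for v in V) / len(V) for d in S]
        q_proj = [q[d] for d in S]
        return r_dim(c, q) * (1.0 / (manhattan(q_proj, centroid, range(len(S))) + 1e-9))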
    The general idea of this paper is to understand how the set of detected clusters should be selected to obtain the required ratio of search speed and relevance. Denote the set of all possible subspace clusters for a given search space as C. The subset C* of C is the argument of optimization, and the optimization problem is to obtain a C* that produces the best result (in terms of retrieval speed and relevance) for a random query q. The first optimization problem:

        E(R(c*, q)) → max
        c* = arg max_{c ∈ C*} Rapprox(c, q)
        N ≤ Nmax                                                  (1)
        E(gq(|V|, |S|)) ≤ γ   ∀q

The goal of the first optimization problem is to maximize the mean relevance of the most suitable cluster, as determined by the given relevance calculation function Rapprox, for a random query q. The constraints are to keep the number of objects under analysis (subspace clusters) below the given bound Nmax and to keep the calculation complexity of Rapprox within the given bound.
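Problem (1) is a subset-selection problem over the candidate clusters. The Python sketch below shows one naive way to attack it: a greedy selection that estimates the expected relevance on a sample of queries and respects the Nmax budget. Both the greedy strategy and the query-sampling estimate are illustrative assumptions, and the complexity constraint on gq is omitted for brevity.

    from typing import Callable, List, Sequence, Tuple

    Vector = Sequence[float]
    Cluster = Tuple[List[Vector], List[int]]

    def greedy_cluster_selection(candidates: List[Cluster],
                                 sample_queries: List[Vector],
                                 exact_relevance: Callable[[Cluster, Vector], float],
                                 approx_relevance: Callable[[Cluster, Vector], float],
                                 n_max: int) -> List[Cluster]:
        """Greedily build C*: at each step add the candidate cluster that most
        improves the estimated E(R(c*, q)), where c* is chosen by the arg-max
        of the approximate relevance over the current selection."""

        def expected_relevance(selection: List[Cluster]) -> float:
            total = 0.0
            for q in sample_queries:
                # The search picks c* with the cheap approximate score ...
                c_star = max(selection, key=lambda c: approx_relevance(c, q))
                # ... but the objective is measured with the true relevance.
                total += exact_relevance(c_star, q)
            return total / len(sample_queries)

        selected: List[Cluster] = []
        remaining = list(candidates)
        while remaining and len(selected) < n_max:
            best = max(remaining, key=lambda c: expected_relevance(selected + [c]))
            selected.append(best)
            remaining.remove(best)
        return selected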




    Let us introduce another optimization problem:

        min_{q ∈ H} R(c*, q) ≥ Rmin
        c* = arg max_{c ∈ C*} Rapprox(c, q)
        N → min                                                   (2)
        E(gq(|V|, |S|)) → min   ∀q

The goal of the second problem is to find the simplest set of clusters that keeps the relevance rate within the given bounds. This latter problem is of less interest, as it is difficult for the user to assign relevance constraints a priori. The stated problems imply the calculation of mean relevance, though this is nearly impossible to perform for the whole search space. The introduction of mean relevance means the following:
    The goal of optimization is to select a subset of subspace clusters such that the mean value of the real relevance of the selected cluster for a given query is the highest when its approximate relevance value is the highest.
    These optimization problems can be formulated for low-dimensional spaces as well. There, however, the absence of the need to take subspaces into account, together with simple mechanisms of query point classification (like MBR []), leads to a simple solution that considers only the set of clusters satisfying a given density and size. For high dimensions the most principal difference is the number of possible subspace clusters. The number of detected clusters in all subspaces can be significantly bigger than in a low-dimensional space. That is because there are 2^d possible subspaces for a d-dimensional space. If all possible ...

... quality. There are three known groups of methods for the automatic evaluation of cluster dimensionality relevance (a small illustrative sketch follows this list):

   • Rate subspaces, and thus subspace clusters, higher if their dimensionality is higher [10], [2].

   • Introduce a generic cost function K(O, S) that rates the relevancy of a given subspace cluster (O, S) [10], [11].

   • Measure cluster separability within a given subspace [9]. This means that a subspace cluster (O, S) is rated higher if, in the subspace S, each vector in O is far enough from any vector that is not in the cluster.
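A compact Python illustration of the first and third rating schemes follows. The normalization by the total number of dimensions and the use of the minimum pairwise Manhattan distance as the separability score are our own assumptions, since the cited works define these measures in different ways.

    from typing import List, Sequence, Tuple

    Vector = Sequence[float]
    Cluster = Tuple[List[Vector], List[int]]   # (members O, subspace dimensions S)

    def rate_by_dimensionality(c: Cluster, total_dims: int) -> float:
        """First scheme: a cluster is rated higher the more dimensions its
        subspace S covers; here the score is normalized to [0, 1]."""
        _, S = c
        return len(S) / total_dims

    def rate_by_separability(c: Cluster, outside: List[Vector]) -> float:
        """Third scheme: a cluster is rated higher if, within its subspace S,
        every member is far from every vector outside the cluster; the score
        is the smallest such Manhattan distance."""
        O, S = c
        return min(sum(abs(o[d] - x[d]) for d in S) for o in O for x in outside)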
The first approach is fair enough in the case of nearest-neighbour retrieval: the user is interested in matching all the coordinates of a given query vector q unless a certain subset of dimensions is specified. In that case the dimensional clustering relevance can be determined as R(c) = R(S, O) = |S|.
    The second approach can be used to assign weights to particular dimensions or vectors. This approach makes sense when the data has a specific origin. Consider, for instance, vectors in a finite-dimensional vector space that are obtained by orthogonal projection of time-series data from an infinite-dimensional space. Then the user can be interested in a uniform sample in time; on the other hand, a sequentially dense subset of time points can be more relevant. In these cases K can be denoted as

        Σ (si)²