      Retrieval of an optimal subspace cluster set for effective
            similarity search in high-dimensional spaces

                                                 © Ivan Sudos
                                      Saint Petersburg State University
                                                Saint Petersburg
                                             iv.teh.adr@gmail.com

 Proceedings of the 14th All-Russian Conference “Digital Libraries: Advanced Methods and
 Technologies, Digital Collections” — RCDL-2012, Pereslavl Zalesskii, Russia, October 15-18, 2012.


                      Abstract

    High-dimensional data is often analysed by means of its distribution properties in subspaces. Subspace clustering is a powerful method for eliciting the features of high-dimensional data. The result of subspace clustering can be an essential basis for building indexing structures and for subsequent data search. However, a large number of subspaces and data instances can conceal a large number of subspace clusters, some of which are difficult to analyse within a search algorithm. This paper presents a model of a generic indexing approach based on detected subspace clusters, and a way to find an optimal set of clusters that gives an acceptable tradeoff between search speed and relevance.


1   Introduction

Search and clustering are the two most extensive problems of data analysis in high-dimensional spaces. Since the roots of the complexity arising in this domain were established, plenty of its particular aspects have been elicited and studied. Many approaches to high-dimensional problems address only these particular complexity aspects, since solving the problem in general is quite sophisticated and detached from practice. In the last twenty years a number of solutions have been proposed for particular problems of searching and clustering in high-dimensional spaces. Clustering and indexing are strongly associated with each other in high-dimensional spaces: resolving indexing problems can create the need to resolve clustering problems.
    Considering only the search and clustering problems, we will refer to a high-dimensional vector space as a search space. The general challenge for search and clustering in high-dimensional spaces is called the “curse of dimensionality”, first stated by R. Bellman [1]. It has two key aspects. The first one lies in the following fact: as the number of dimensions grows, the information under analysis in the search space becomes cumbersome. The second one is the metric-related problem: in high dimensions we often cannot state whether two vectors are similar or different.
    In most cases we cannot build reliable algorithms and data structures (indexes) that handle exact-match search with acceptable latency. Therefore we consider, in the first place, approximate similarity search.
    Most of the approaches to the search and indexing problems can be divided into the following categories:

  1. Adaptations of low-dimensional algorithms, such as those that use R-trees. Here we try to fix particular problems of a search algorithm designed for low-dimensional search spaces in order to make it somehow feasible in a high-dimensional space. However, this tends to work acceptably only for a relatively low number of dimensions (not exceeding a few dozen).

  2. Algorithms based on data distribution. These algorithms take the distribution properties of the data into account. A number of them employ dimensionality reduction techniques such as principal component analysis or subspace clustering to fight the curse of dimensionality.

  3. Algorithms based on random projections. These algorithms try to decrease the volume of information scanned in the search space by grouping its elements with a degree of randomness. Some implementations of locality-sensitive hashing [16] belong to this category.

All of these approaches try, in one way or another, to overcome the curse of dimensionality. This means they try to reduce the time complexity of search, to increase relevance, or to do both. In this paper we consider the second category of solutions. Here we have a point of contact with the clustering problem. Analysis of data distribution is closely related to clustering. Clustering in high dimensions admits several approaches, and each approach can rely on its



own cluster model. This paper stays only with subspace and projection clustering approaches [2]. According to [2], subspace clustering aims to find all clusters in all possible subspaces, while projection clustering assigns each vector to exactly one subspace cluster. For example, a set of photos can be placed into one cluster if a projection clustering algorithm finds no difference in their colour histogram characteristics. At the same time, a subspace clustering algorithm will form clusters (if any exist) for all possible characteristics: histograms, shapes, gradients, etc. Although this looks less flexible than principal component analysis methods, which can detect arbitrary manifolds or clusters in non-axis-parallel dimensions, subspace clustering has one important advantage: the locality property [3]. It means that subspace clustering algorithms can determine a set of relevant dimensions (a relevant subspace) locally for each part of the search space or each subset of data vectors.
    Our goal is to understand how clustering results can be coupled with similarity search in high-dimensional spaces. This paper introduces a generic approach to utilizing detected subspace clusters within search. We state a key optimization problem that allows finding the best tradeoff between search relevance and speed. This implies selecting the subset of subspace clusters that provides the best relevance with a guaranteed minimal retrieval complexity.

2   Related works

Many works have been presented on indexing and clustering separately. Applications of clustering to building a search index are implicitly described in works on nearest neighbours in high-dimensional spaces. Some works show how subspace clustering results can be used as a basis for tree-like index structures, although a link between the parameters of the detected clusters and the efficiency of nearest-neighbour search is not presented. Several papers consider the quality of the subspace clustering process from the point of view of data redundancy.
    Here we first state the optimization problem that arises when an index structure is based on detected subspace clusters. The problem links search effectiveness (speed and relevance) with the properties of the detected clusters. Thereby our work aims to link subspace clustering and nearest-neighbour search. Index structures can be based on detected subspace clusters in various ways; however, existing indexing approaches do not consider the effects of the underlying clustering.
    Some papers that consider indexing and clustering problems in high-dimensional spaces are reviewed below.
    Indexing: [5] introduces the Bregman ball tree index structure, a reconsideration of ball trees that uses Bregman divergences instead of a classic metric. Though it is not intended for high dimensions, it introduces a feasible concept of applying Bregman divergences to known indexing structures in place of metrics. With distance functions such as Bregman divergences in use, the search can become more complicated: the computational cost of such functions is higher than that of simple functions like the Euclidean metric. This aspect is taken into account in this paper.
    The X-tree [6], a spatial tree based on hyper-rectangle partitioning of the search space, shows how the well-known low-dimensional index structure R-tree can be adapted to a relatively high number of dimensions by rejecting rectangle overlaps. However, the X-tree has its capability limits. This is an example of a case where taking all dimensions into account simultaneously leads to an index structure whose size is of the same order as the data.
    Clustering: The clustering techniques used in this paper refer to projection and subspace clustering. The key survey of clustering algorithms in high dimensions was conducted in [2], where algorithms are categorized and compared. The main groups distinguished are axis-parallel subspace and projection clustering, arbitrarily oriented subspace clustering, and pattern-based clustering. We are also interested in descriptions of particular subspace and projection clustering algorithms and the clustering models they use [3]. A survey compares the performance of different subspace clustering algorithms. Clustering quality evaluation is considered along with the results of the proposed optimization problem solution [7].
    Clustering within indexing: A recent work [4] shows how a tree-like indexing structure can be built in a high-dimensional space atop a known set of clusters. This work does not imply a special clustering method and thus does not consider how clustering influences search efficiency. Nevertheless, this search approach helps to evaluate clustering results experimentally.

3   Subspace clustering and similarity search

The common goal of indexing is to prune the search space by compression and/or to elicit relations between parts of the data (trees and space partitioning). Such approaches as locality-sensitive hashing and VA-files are compression techniques [8] that allow a group of nearby vectors to be approximated by a single object. These techniques also suffer from the curse of dimensionality in high-dimensional spaces. Locality-sensitive hashing leads to low relevance



due to distance invisibility. Space partitioning is not feasible, since the number of hyper-rectangles can exceed or become close to the number of data vectors, so that the search becomes no less complex than an exhaustive pass through all the data. Clustering suffers from distance invisibility in high dimensions, as described above. In our model we consider subspace clustering as a way to perform data compression, to be used as the basis of an indexing structure. This consideration is maximally abstracted from the complete indexing structure and from the similarity search algorithm. Thus we assume that the search algorithm and the index structure operate on a set of objects that represent the grouped (clustered) data and perform retrieval on this smaller space, hence with smaller precision. This requires a function that calculates the relevance of a cluster for a query q, and this function should avoid a full scan of the cluster members, as that would lead to an exhaustive search. So far the search algorithm is assumed to do the following (a minimal sketch follows this list):

   • Consider subspace clusters as the primary result of data compression.

   • Use some approximate, fast enough distance function to calculate the relevance of a given subspace cluster with respect to a given query vector q.

   • Find the most relevant cluster(s) and take them into further consideration.
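The following Python sketch illustrates this assumed cluster-level search loop. The helper names (approx_relevance, refine_within) and the decision to keep only the top few clusters are illustrative assumptions, not a procedure prescribed by this paper.

    from typing import Callable, List, Sequence, Tuple

    Vector = Sequence[float]
    Cluster = Tuple[List[Vector], List[int]]   # (member vectors V, subspace dimensions S)

    def cluster_level_search(q: Vector,
                             clusters: List[Cluster],
                             approx_relevance: Callable[[Cluster, Vector], float],
                             refine_within: Callable[[Cluster, Vector], Vector],
                             top_k: int = 3) -> List[Vector]:
        """Search over detected subspace clusters instead of raw vectors:
        score every cluster with a cheap approximate relevance function,
        keep the top_k most relevant clusters, and refine only inside them."""
        # Rank the clusters by their approximate relevance to the query.
        ranked = sorted(clusters, key=lambda c: approx_relevance(c, q), reverse=True)
        # Run the (more expensive) refinement only inside the selected clusters.
        return [refine_within(c, q) for c in ranked[:top_k]]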
The following considerations show how the features of subspace clustering affect the tradeoff between search speed and relevance.
    Let a subspace cluster c = {V, S} be any subset V of data vectors such that the maximum deviation in the subspace S of any v ∈ V from the rest of V is less than a given threshold h, and V contains not less than p elements. Thus a cluster is any clot of vectors in any subspace with bounded density and a bounded number of elements.
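As an illustration, the following Python check verifies this cluster condition for a candidate set of vectors. Reading the deviation as the per-coordinate distance from the cluster centroid within the subspace S is our assumption; the definition above leaves the exact deviation measure generic.

    from typing import List, Sequence

    def is_subspace_cluster(V: List[Sequence[float]], S: List[int],
                            h: float, p: int) -> bool:
        """Check that V forms a subspace cluster in the dimensions S:
        the cluster has at least p members and every member deviates
        from the centroid by less than h in each dimension of S."""
        if len(V) < p:
            return False
        # Centroid restricted to the subspace dimensions S.
        centroid = {d: sum(v[d] for v in V) / len(V) for d in S}
        # Largest per-coordinate deviation of any member within S.
        max_dev = max(abs(v[d] - centroid[d]) for v in V for d in S)
        return max_dev < h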
    The first consideration is the measure of data compression provided by subspace clustering. The general assumption about the query is that it is a uniformly distributed vector in the search space. We assume that the complexity of the search algorithm is a monotonic non-decreasing function f(N, q) of N, where N is the number of objects in the search space and q is the query vector. Since subspace clustering compression is performed, N denotes the number of detected subspace clusters used by the search algorithm. Irrespective of the retrieval algorithm, we assume the need to calculate the relevance of a detected cluster for a query q. So Q(q, c) is a function that calculates the relevance of c with respect to q. Let Q ∈ O(g(|V|, dim(S))).
    The second consideration is the measure of relevance of the detected clusters. Again, suppose we have a uniformly distributed query q. Let C = {c1, ..., cN} be the set of detected subspace clusters. Each subspace cluster ci is a pair {Oi, Si} consisting of the set of vectors it contains, V = {v1, ..., vk}, and the set of dimensions S = {d1, ..., dl} that determines its subspace. Let us introduce the relevance of a given cluster c of size k for a given query q:

        R(c, q) = Rdim(c, q) · (1/k) · Σ_{v ∈ V} 1 / dist(q, v)

The right-hand side is a product of the subspace relevance function Rdim(c, q), which represents the relevance of the subspace in which the data forms the cluster c, and the average inverted distance from the query to the members v of the cluster. We accept the Manhattan and Euclidean distance functions here, as they were proved [9] to be the only ones suitable for distance measurement in high-dimensional spaces.
    The calculation of the relevance of a cluster c with respect to a query q should avoid iterating through all cluster members, so the relevance calculation function is supposed to be approximate. With this consideration, we introduce the approximate cluster relevance as

        Rapprox(c, q) = Rdim(c, q) · Q(c, q)

where Q(c, q) is an approximate distance function. Let us denote its complexity as g(|V|, |S|) = gq(|V|, |S|, V, S). Further on we will show examples of such functions.
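To make the two quantities concrete, here is a small Python sketch of the exact relevance R(c, q) and of one possible approximate variant Rapprox(c, q). Using the centroid of the cluster, restricted to its subspace, as the single representative inside Q(c, q) is our own illustrative choice; the text above only requires Q to avoid scanning all members.

    from typing import Callable, List, Sequence, Tuple

    Vector = Sequence[float]
    Cluster = Tuple[List[Vector], List[int]]   # (members V, subspace dimensions S)

    def manhattan(q: Vector, v: Vector, dims: Sequence[int]) -> float:
        """Manhattan distance restricted to the given dimensions."""
        return sum(abs(q[d] - v[d]) for d in dims)

    def exact_relevance(c: Cluster, q: Vector,
                        r_dim: Callable[[Cluster, Vector], float]) -> float:
        """R(c, q): subspace relevance times the average inverted distance
        from q to every member of the cluster (full scan, O(|V|·|S|))."""
        V, S = c
        # The small 1e-9 term only guards against a zero distance.
        avg_inv_dist = sum(1.0 / (manhattan(q, v, S) + 1e-9) for v in V) / len(V)
        return r_dim(c, q) * avg_inv_dist

    def approx_relevance(c: Cluster, q: Vector,
                         r_dim: Callable[[Cluster, Vector], float]) -> float:
        """Rapprox(c, q): replaces the member scan with a single distance to
        the cluster centroid; in a real index the centroid would be
        precomputed when the cluster is built, making Q(c, q) cost O(|S|)."""
        V, S = c
        centroid = [sum(v[d] for v in V) / len(V) for d in S]
        q_proj = [q[d] for d in S]
        return r_dim(c, q) * (1.0 / (manhattan(q_proj, centroid, range(len(S))) + 1e-9))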
    The general idea of this paper is to understand how the set of detected clusters should be selected to obtain the required ratio of search speed and relevance. Denote the set of all possible subspace clusters for a given search space as C. The subset C* of C is the argument of optimization, and the optimization problem is to obtain a C* that produces the best result (in terms of retrieval speed and relevance) for a random query q. The first optimization problem:

        E(R(c*, q)) → max
        c* = arg max_{c ∈ C*} Rapprox(c, q)
        N ≤ Nmax                                                  (1)
        E(gq(|V|, |S|)) ≤ γ   ∀q

The goal of the first optimization problem is to maximize the mean relevance of the most suitable cluster, as determined by the given relevance calculation function Rapprox, for a random query q. The constraints are to keep the number of objects under analysis (subspace clusters) below the given bound Nmax and to keep the calculation complexity of Rapprox within the given bound.
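Problem (1) is a subset-selection problem over the candidate clusters. The Python sketch below shows one naive way to attack it: a greedy selection that estimates the expected relevance on a sample of queries and respects the Nmax budget. Both the greedy strategy and the query-sampling estimate are illustrative assumptions, and the complexity constraint on gq is omitted for brevity.

    from typing import Callable, List, Sequence, Tuple

    Vector = Sequence[float]
    Cluster = Tuple[List[Vector], List[int]]

    def greedy_cluster_selection(candidates: List[Cluster],
                                 sample_queries: List[Vector],
                                 exact_relevance: Callable[[Cluster, Vector], float],
                                 approx_relevance: Callable[[Cluster, Vector], float],
                                 n_max: int) -> List[Cluster]:
        """Greedily build C*: at each step add the candidate cluster that most
        improves the estimated E(R(c*, q)), where c* is chosen by the arg-max
        of the approximate relevance over the current selection."""

        def expected_relevance(selection: List[Cluster]) -> float:
            total = 0.0
            for q in sample_queries:
                # The search picks c* with the cheap approximate score ...
                c_star = max(selection, key=lambda c: approx_relevance(c, q))
                # ... but the objective is measured with the true relevance.
                total += exact_relevance(c_star, q)
            return total / len(sample_queries)

        selected: List[Cluster] = []
        remaining = list(candidates)
        while remaining and len(selected) < n_max:
            best = max(remaining, key=lambda c: expected_relevance(selected + [c]))
            selected.append(best)
            remaining.remove(best)
        return selected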




    Let us introduce another optimization problem:

        min_{q ∈ H} R(c*, q) ≥ Rmin
        c* = arg max_{c ∈ C*} Rapprox(c, q)
        N → min                                                   (2)
        E(gq(|V|, |S|)) → min   ∀q

The goal of the second problem is to find the simplest set of clusters that keeps the relevance rate within the given bounds. This latter problem is of less interest, as it is difficult for the user to assign relevance constraints a priori. The stated problems imply the calculation of mean relevance, though this is nearly impossible to perform for the whole search space. The introduction of mean relevance means the following:
    The goal of optimization is to select a subset of subspace clusters such that the mean value of the real relevance of the selected cluster for a given query is the highest when its approximate relevance value is the highest.
    These optimization problems can be formulated for low-dimensional spaces as well. There, however, the absence of the need to take subspaces into account, together with simple mechanisms of query point classification (like MBR []), leads to a simple solution that considers only the set of clusters satisfying a given density and size. For high dimensions the most principal difference is the number of possible subspace clusters. The number of detected clusters in all subspaces can be significantly bigger than in a low-dimensional space. That is because there are 2^d possible subspaces for a d-dimensional space. If all possible ...

... quality. There are three known groups of methods for the automatic evaluation of cluster dimensionality relevance (a small illustrative sketch follows this list):

   • Rate subspaces, and thus subspace clusters, higher if their dimensionality is higher [10], [2].

   • Introduce a generic cost function K(O, S) that rates the relevancy of a given subspace cluster (O, S) [10], [11].

   • Measure cluster separability within a given subspace [9]. This means that a subspace cluster (O, S) is rated higher if, in the subspace S, each vector in O is far enough from any vector that is not in the cluster.
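A compact Python illustration of the first and third rating schemes follows. The normalization by the total number of dimensions and the use of the minimum pairwise Manhattan distance as the separability score are our own assumptions, since the cited works define these measures in different ways.

    from typing import List, Sequence, Tuple

    Vector = Sequence[float]
    Cluster = Tuple[List[Vector], List[int]]   # (members O, subspace dimensions S)

    def rate_by_dimensionality(c: Cluster, total_dims: int) -> float:
        """First scheme: a cluster is rated higher the more dimensions its
        subspace S covers; here the score is normalized to [0, 1]."""
        _, S = c
        return len(S) / total_dims

    def rate_by_separability(c: Cluster, outside: List[Vector]) -> float:
        """Third scheme: a cluster is rated higher if, within its subspace S,
        every member is far from every vector outside the cluster; the score
        is the smallest such Manhattan distance."""
        O, S = c
        return min(sum(abs(o[d] - x[d]) for d in S) for o in O for x in outside)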
The first approach is fair enough in the case of nearest-neighbour retrieval: the user is interested in matching all the coordinates of a given query vector q unless a certain subset of dimensions is specified. In that case the dimensional clustering relevance can be determined as R(c) = R(S, O) = |S|.
    The second approach can be used to assign weights to particular dimensions or vectors. This approach makes sense when the data has a specific origin. Consider, for instance, vectors in a finite-dimensional vector space that are obtained by orthogonal projection of time-series data from an infinite-dimensional space. Then the user can be interested in a uniform sample in time; on the other hand, a sequentially dense subset of time points can be more relevant. In these cases K can be denoted as

        Σ (si)²