Retrieval of an optimal set of subspace clusters for effective similarity search in high-dimensional spaces

© Ivan Sudos
Saint-Petersburg State University, Saint-Petersburg
iv.teh.adr@gmail.com

Proceedings of the 14th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" (RCDL-2012), Pereslavl-Zalesskii, Russia, October 15-18, 2012.

Abstract

High-dimensional data is often analysed through its distribution properties in subspaces. Subspace clustering is a powerful method for eliciting the features of high-dimensional data, and its result can be an essential basis for building indexing structures and for subsequent data search. However, a large number of subspaces and data instances can conceal a large number of subspace clusters, some of which are difficult to exploit within a search algorithm. This paper presents a model of a generic indexing approach based on detected subspace clusters, and a way to find an optimal set of clusters that yields an acceptable tradeoff between search speed and relevance.

1 Introduction

Search and clustering are the two most extensive problems of data analysis in high-dimensional spaces. Since the roots of the complexity arising in this domain were established, plenty of its particular aspects have been elicited and studied. Many approaches to high-dimensional problems address only these particular complexity aspects, as it is quite sophisticated, and detached from practice, to solve the problem in general. In the last twenty years a number of solutions have been proposed for particular problems of searching and clustering in high-dimensional spaces. Clustering and indexing are strongly associated with each other in high-dimensional spaces: resolving indexing problems can require resolving clustering problems. Since we consider only search and clustering problems, we will refer to a high-dimensional vector space as a search space.

The general challenge for search and clustering in high-dimensional spaces is called the "curse of dimensionality", first stated by R. Bellman [1]. It has two key aspects. The first lies in the following fact: as the number of dimensions grows, the information under analysis in the search space becomes cumbersome. The second is the metric-related problem: in high dimensions we often cannot state whether two vectors are similar or different. In most cases we cannot build reliable algorithms and data structures (indexes) that handle exact-match search with acceptable latency, so here we consider approximate similarity search in the first place.

Most approaches to search and indexing problems can be divided into the following categories:

1. Adaptations of low-dimensional algorithms, such as those using R-trees. Here one tries to fix particular problems of search algorithms designed for low-dimensional search spaces to make them somehow feasible in a high-dimensional space. However, such methods tend to work acceptably only for a relatively low number of dimensions (not exceeding a few dozen).

2. Algorithms based on the data distribution. These algorithms take the distribution properties of the data into account. A number of them employ dimensionality reduction techniques, such as principal component analysis or subspace clustering, to fight the curse of dimensionality.

3. Algorithms based on random projections. These algorithms try to decrease the volume of information scanned in the search space by grouping its elements with a degree of randomness. Some realizations of locality-sensitive hashing [16] belong to this category.

All of these approaches try, in one way or another, to overcome the curse of dimensionality: they reduce the time complexity of search, increase relevance, or both. In this paper we consider the second category of solutions. Here we have a point of contact with the clustering problem, since analysis of the data distribution is closely related to clustering.

Clustering in high dimensions admits several approaches, and each approach can use its own cluster model. This paper considers only subspace and projection clustering approaches [2]. In accordance with [2], subspace clustering aims to find all clusters in all possible subspaces, whereas projection clustering assigns each vector to exactly one subspace cluster. For example, a set of photos can be placed into one cluster if a projection clustering algorithm finds no difference in their color histogram characteristics. At the same time, a subspace clustering algorithm will form clusters (if any exist) for all possible characteristics: histograms, shapes, gradients, etc. Although this looks less flexible than principal-component-analysis methods, which can detect arbitrary manifolds or clusters in non-axis-parallel dimensions, subspace clustering has one important advantage: the locality property [3]. It means that subspace clustering algorithms can determine a set of relevant dimensions (a relevant subspace) locally, for each part of the search space or each subset of data vectors; a small illustration of the two cluster models is sketched below.

Our goal is to understand how clustering results can be coupled with similarity search in high-dimensional spaces. This paper introduces a generic approach to utilizing detected subspace clusters within search. We state a key optimization problem that allows one to find the best tradeoff between search relevance and speed. It implies selecting the best subset of subspace clusters that provides the best relevance with a guaranteed minimal retrieval complexity.
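To make the distinction between the two cluster models concrete, the following sketch replays the photo example above. The data, the feature layout, and both cluster sets are purely hypothetical; a cluster is represented, as in Section 3 below, by a pair of a vector set and a dimension set.

```python
# Toy 5-dimensional photo descriptors: dimensions 0-2 are assumed to hold
# color histogram features, dimensions 3-4 shape features.
photos = {
    "p1": [0.9, 0.1, 0.0, 0.5, 0.5],
    "p2": [0.9, 0.1, 0.1, 0.5, 0.5],
    "p3": [0.1, 0.8, 0.1, 0.5, 0.5],
}

# Projection clustering: every vector lands in exactly one cluster,
# each cluster paired with its own locally relevant subspace.
projection_result = [
    ({"p1", "p2"}, {0, 1, 2}),  # indistinguishable color histograms
    ({"p3"},       {0, 1, 2}),
]

# Subspace clustering: clusters are reported in every subspace where the
# vectors agglomerate, so the same vector may occur in several clusters.
subspace_result = [
    ({"p1", "p2"},       {0, 1, 2}),  # cluster in the histogram subspace
    ({"p1", "p2", "p3"}, {3, 4}),     # cluster in the shape subspace
]

# "p1" belongs to two subspace clusters but to a single projection cluster:
print([dims for members, dims in subspace_result if "p1" in members])
```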
2 Related works

Many works address indexing and clustering separately. Applications of clustering to building a search index are implicitly described in works on nearest neighbours in high-dimensional spaces. Some works show how subspace clustering results can be used as the base for tree-like index structures, although a link between the parameters of the detected clusters and the efficiency of nearest-neighbour search is not presented. Several papers consider the quality of the subspace clustering process from the point of view of data redundancy.

Here we first state the optimization problem that arises when an index structure is based on detected subspace clusters. The problem links search effectiveness (speed and relevance) with the properties of the detected clusters; thereby our work aims to link subspace clustering and nearest-neighbour search. Index structures can be based on detected subspace clusters in various ways, yet existing indexing approaches do not consider the effects of the underlying clustering. Some papers that consider indexing and clustering problems in high-dimensional spaces are reviewed below.

Indexing: [5] introduces Bregman ball trees, an index structure that reconsiders ball trees using a Bregman divergence instead of a classic metric. Though it is not supposed to be used in high dimensions, it introduces a feasible concept of applying Bregman divergences, instead of metrics, to known indexing structures. With distance functions such as Bregman divergences in use, the search can become more complicated: the computational cost of such functions is higher than that of simple functions like the Euclidean metric (see the sketch at the end of this subsection). This aspect is taken into account in this paper.

The X-tree [6], a spatial tree based on hyper-rectangle partitioning of the search space, shows how the well-known low-dimensional index structure, the R-tree, can be adapted to a relatively high number of dimensions by rejecting rectangle overlap. However, the X-tree has its capability limits. It is an example of a case where taking all dimensions into account simultaneously leads to an index structure whose size is of the same order as the data itself.
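To make the cost argument above concrete, the sketch below evaluates a generic Bregman divergence, D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, for two textbook generators. Nothing here is specific to the construction in [5]; it only illustrates why such divergences are costlier to evaluate than the Euclidean metric.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """Generic Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - float(np.dot(grad_phi(y), x - y))

# phi(x) = ||x||^2 recovers the squared Euclidean distance...
sq_norm = lambda x: float(np.dot(x, x))
sq_norm_grad = lambda x: 2.0 * x

# ...while phi(x) = sum_i x_i log x_i yields the (generalized) KL divergence,
# whose logarithms make it dearer to compute than the Euclidean metric.
neg_entropy = lambda x: float(np.sum(x * np.log(x)))
neg_entropy_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.3, 0.3, 0.4])
print(bregman_divergence(sq_norm, sq_norm_grad, x, y))          # ||x - y||^2 = 0.02
print(bregman_divergence(neg_entropy, neg_entropy_grad, x, y))  # KL(x || y) for distributions
```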
Clustering: The clustering techniques used in this paper refer to projection and subspace clustering. The key survey of clustering algorithms in high dimensions was conducted in [2]; the algorithms are categorized and compared there. The main groups distinguished are axis-parallel subspace and projection clustering, arbitrarily oriented subspace clustering, and pattern-based clustering. We are also interested in descriptions of particular subspace and projection clustering algorithms and of the clustering models they use [3]. A survey [7] compares the performance of different subspace clusterings; clustering quality evaluation is regarded there along with results of the kind produced by the proposed optimization problem's solution.

Clustering within indexing: A recent work [4] shows how a tree-like indexing structure can be built in a high-dimensional space atop a known set of clusters. This work does not assume any special method of clustering and thus does not consider how the clustering influences the efficiency of search. Nevertheless, this search approach helps to evaluate clustering results experimentally.

3 Subspace clustering and similarity search

The common goal of indexing is to prune the search space by compressing it and/or by eliciting relations between parts of the data (trees and space partitioning). Approaches such as locality-sensitive hashing and VA-files are compression techniques [8] that approximate a group of nearby vectors with a single object. These techniques also suffer from the curse of dimensionality in high-dimensional spaces: locality-sensitive hashing leads to low relevance due to the indiscernibility of distances, and space partitioning is not feasible, since the number of hyper-rectangles can exceed or approach the number of data vectors, making the search no less complex than an exhaustive pass through all the data. Clustering suffers from the same indiscernibility of distances in high dimensions, as described above. In our model we consider subspace clustering as a means of data compression to be used as the base of an indexing structure. This consideration is maximally abstracted from the complete indexing structure and from the similarity search algorithm.

Thus we assume that a search algorithm and an index structure operate on a set of objects that represent grouped (clustered) data and perform retrieval over this smaller space, hence with smaller precision. This requires a function that calculates the relevance of a cluster for a query q, and this function should avoid a full scan of the cluster members, as that leads to exhaustive search. The search algorithm is therefore assumed to do the following:

• Consider subspace clusters as the primary result of data compression.

• Use some approximate, sufficiently fast distance function to calculate the relevance of a given subspace cluster with respect to a given query vector q.

• Find the most relevant cluster(s) and take them into further consideration.

The following considerations show how the features of subspace clustering affect the tradeoff between search speed and relevance.

Let a subspace cluster c = {V, S} be any subset V of the data vectors such that the maximum deviation, within the subspace S, of any v ∈ V from the rest of V is less than a given threshold h, and V contains no fewer than p elements. A cluster is thus any clot of vectors in any subspace with bounded density and a bounded number of elements; the definition translates directly into the membership test sketched below.
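The sketch below checks this definition for a given V, S, h, and p. The function name is ours, and "deviation" is read here as the Manhattan distance, within S, from a vector to the centroid of the remaining members; this is one possible interpretation of the wording above, not the only one.

```python
import numpy as np

def is_subspace_cluster(V, S, h, p):
    """Check the cluster definition of Section 3: V is a cluster in subspace S
    if |V| >= p and every v in V deviates from the rest of V by less than h
    within the dimensions of S."""
    V = np.asarray(V, dtype=float)
    S = sorted(S)
    if len(V) < p:
        return False
    proj = V[:, S]                       # restrict all vectors to subspace S
    for i in range(len(proj)):
        rest = np.delete(proj, i, axis=0)
        centroid = rest.mean(axis=0)     # the "rest of V" summarized by its centroid
        if np.abs(proj[i] - centroid).sum() >= h:
            return False
    return True

# A tight clot in dimensions {0, 2} of a 4-dimensional space:
V = [[1.0, 9.0, 2.0, -3.0],
     [1.1, 0.0, 2.1,  7.0],
     [0.9, 5.0, 1.9,  2.0]]
print(is_subspace_cluster(V, {0, 2}, h=0.5, p=3))   # True
print(is_subspace_cluster(V, {0, 1}, h=0.5, p=3))   # False: dimension 1 scatters
```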
The first consideration is the measure of data compression provided by subspace clustering. The general assumption about the query is that it is a uniformly distributed vector in the search space. We assume that the complexity of the search algorithm is a monotonic non-decreasing function f(N, q), where N is the number of objects in the search space and q is the query vector. Once subspace clustering compression is performed, N denotes the number of detected subspace clusters used by the search algorithm.

Irrespective of the retrieval algorithm, we assume the need to calculate the relevance of any detected cluster for a query q. So Q(q, c) is a function that calculates the relevance of c with respect to q; let Q ∈ O(g(|V|, dim(S))).

The second consideration is the measure of relevance of the detected clusters. Again, suppose we have a uniformly distributed query q. Let C = {c_1, ..., c_N} be the set of detected subspace clusters. Each subspace cluster c_i represents a pair {O_i, S_i} of the set of vectors it contains, V = {v_1, ..., v_k}, and a set of dimensions S = {d_1, ..., d_l} that determines a subspace. Let us introduce the relevance of a given cluster c of size k for a given query q:

\[
R(c, q) = R_{\dim}(c, q) \cdot \frac{1}{k} \sum_{v \in V} \frac{1}{\operatorname{dist}(q, v)}
\]

The right-hand side is the product of the subspace relevance function R_dim(c, q), which represents the relevance of the subspace in which the data forms the cluster c, and the average inverted distance from the query to the members v of the cluster. We adopt the Manhattan and the Euclidean distance functions here, as they were shown [9] to be the only ones suitable for distance measurement in high-dimensional spaces.

The calculation of the relevance of a cluster c with respect to a query q should avoid iterating through all of the cluster's members, so the relevance calculation function is supposed to be approximate. With this in mind we introduce the approximate cluster relevance

\[
R_{\mathrm{approx}}(c, q) = R_{\dim}(c, q) \, Q(c, q)
\]

where Q(c, q) is an approximate distance function. Let us denote its complexity as g(|V|, |S|) = g_q(|V|, |S|, V, S). An example of such a function is shown below.
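The following sketch instantiates both relevance functions under explicit assumptions of ours: R_dim is taken as the trivial constant 1, dist is the Manhattan distance restricted to the cluster's subspace, and Q(c, q) is a centroid-based surrogate that touches a single precomputed vector, so its cost g is O(|S|) rather than the O(|V|·|S|) of the exact formula. None of these specific choices is fixed by the paper; they only make the formulas above concrete.

```python
import numpy as np

def manhattan(a, b):
    return float(np.abs(a - b).sum())

class SubspaceCluster:
    def __init__(self, vectors, dims):
        self.V = np.asarray(vectors, dtype=float)        # member vectors
        self.S = sorted(dims)                            # relevant subspace
        self.centroid = self.V[:, self.S].mean(axis=0)   # precomputed for Q

def r_dim(c, q):
    return 1.0   # placeholder subspace relevance R_dim(c, q)

def relevance_exact(c, q):
    """R(c, q): R_dim times the average inverted distance to the members.
    Requires a full scan of the cluster: O(|V| * |S|)."""
    qs = np.asarray(q, dtype=float)[c.S]
    inv = [1.0 / manhattan(qs, v[c.S]) for v in c.V]     # assumes q is not a member
    return r_dim(c, q) * sum(inv) / len(c.V)

def relevance_approx(c, q):
    """R_approx(c, q) = R_dim(c, q) * Q(c, q), with a centroid-based Q
    touching one precomputed vector: g(|V|, |S|) = O(|S|)."""
    qs = np.asarray(q, dtype=float)[c.S]
    return r_dim(c, q) / manhattan(qs, c.centroid)       # assumes q is not the centroid

c = SubspaceCluster([[1.0, 9.0, 2.0], [1.2, 0.0, 2.2]], dims=[0, 2])
q = [1.05, 4.0, 2.0]
print(relevance_exact(c, q), relevance_approx(c, q))
```

A search then scores all N clusters with the cheap function, picks the argmax, and inspects only that cluster's members, which is exactly the role Q plays in the optimization problems stated next.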
The general idea of this paper is to understand how the set of detected clusters should be selected so as to obtain the required ratio of search speed and relevance. Denote the set of all possible subspace clusters for a given search space by C. A subset C* of C is the argument of optimization, and the optimization problem is to obtain a C* that produces the best result, in terms of retrieval speed and relevance, for a random query q.

The first optimization problem:

\[
\begin{cases}
E(R(c^*, q)) \to \max \\
c^* = \arg\max_{c \in C^*} R_{\mathrm{approx}}(c, q) \\
N \le N_{\max} \\
E(g_q(|V|, |S|)) \le \gamma \quad \forall q
\end{cases}
\tag{1}
\]

The goal of the first optimization problem is to maximize the mean relevance of the most suitable cluster, determined by a given relevance calculation function R_approx, for a random query q. The constraints keep the number of objects under analysis (subspace clusters) below a given bound N_max and keep the calculation complexity of R_approx within a given bound.

Let us introduce another optimization problem:

\[
\begin{cases}
\min_{q \in H} R(c^*, q) \ge R_{\min} \\
c^* = \arg\max_{c \in C^*} R_{\mathrm{approx}}(c, q) \\
N \to \min \\
E(g_q(|V|, |S|)) \to \min \quad \forall q
\end{cases}
\tag{2}
\]

The goal of the second problem is to find the simplest cluster set that keeps the relevance rate within the given bounds. The latter problem is of less interest, since it is difficult for a user to assign relevance constraints a priori. The stated problems imply the calculation of mean relevance, which is nearly impossible to perform over the whole search space. The introduction of mean relevance means the following: the goal of the optimization is to select a subset of subspace clusters such that the mean value of the real relevance of the chosen cluster for a given query is highest when its approximate relevance value is highest.

These optimization problems can be formulated for low-dimensional spaces as well. There, however, the absence of the need to take subspaces into account, together with simple mechanisms of query point classification (like MBR []), leads to a simple solution: considering only the set of clusters that satisfy a given density and size. For high dimensions the most principal difference is the number of possible subspace clusters. The number of detected clusters in all subspaces can be significantly larger than in a low-dimensional space, because a d-dimensional space has 2^d possible subspaces. If all possible [...]

[...] quality. There are three known groups of methods for the automatic evaluation of the dimensionality relevance of a cluster:

• Rate subspaces, and thus subspace clusters, higher if their dimensionality is higher [10], [2].

• Introduce a generic cost function K(O, S) that rates the relevancy of a given subspace cluster (O, S) [10], [11].

• Measure cluster separability within a given subspace [9]. This means that a subspace cluster (O, S) is rated higher if each vector in O is far enough, in the subspace S, from any vector that is not in the cluster.

The first approach is fair enough in the case of nearest-neighbour retrieval: the user is interested in matching all coordinates of a given query vector q as long as no particular subset of dimensions is specified. In that case the dimensional relevance of a cluster can be determined as R(c) = R(S, O) = |S|.

The second approach can be used to assign weights to particular dimensions or vectors. It is sensible in the case of specific origins of the data. Consider vectors in a finite-dimensional vector space obtained by orthogonal projection of time-series data from an infinite-dimensional space. The user can then be interested in a uniform sample in time; on the other hand, a sequential dense subset of time points can be more relevant. In these cases K can be denoted as [...]