Similarity Multidimensional Indexing♣

© Elena Mikhaylova, Boris Novikov, Anton Volokhov
Saint Petersburg State University
egmichailova@mail.ru, borisnov@acm.org, a.v.volokhov@gmail.com

Abstract

The multidimensional k-NN (k nearest neighbors) query problem arises in a large variety of database applications, including information retrieval, natural language processing, and data mining. To solve it efficiently, a database needs an indexing structure that supports this kind of search. However, an exact solution is hardly feasible in multidimensional space. In this paper we describe and analyze an indexing technique for the approximate solution of the k-NN problem. Construction of the indexing tree is based on clustering; construction of the hash index is based on s-stable distributions. The indices are implemented on top of a high-performance industrial DBMS.

Keywords: k-NN search, multidimensional indexing, LSH

1. Introduction

Efficiency of search is critical for modern information retrieval. Commonly, search queries retrieve a relatively small portion of the stored information, and in this case indexed search is much more efficient than a full scan. Any process or stored object can be characterized by a set of features that are usually called attributes. The purpose of any index is to provide quick access to an object by the values of some of its attributes; in other words, indices provide an effective implementation of associative search.

If the attribute values belong to a linearly ordered set, searching can be implemented with one-dimensional indices. Such indexing structures (B-trees, hash files) were thoroughly studied in the 1970s and 1980s. Many real applications, however, need to search by the values of several attributes, or by attributes that cannot be naturally ordered linearly. In this case searching in multidimensional space is required. Of the numerous multidimensional indexing structures proposed in the 1980s, only R-trees [1] have passed the test of time: R-tree indexes are widely used in various applications and are implemented in industrial databases.

Multidimensional searching is primarily associated with processing spatial and spatio-temporal data. Another class of applications processing multidimensional data includes systems based on various flavors of the text vector model for information retrieval, data mining, pattern recognition, data compression, etc. Data produced by applications of this class are typically characterized by high dimensionality. Typically, the processing of multidimensional data requires searching on the basis of proximity or similarity, rather than exact equality of attributes. The most common search query is to find the K nearest neighbors of a given point in a dataset.

In this work we present and compare two approaches to similarity multidimensional search. The first idea is based on a tree-clustering structure; the second one uses the LSH method. The former is better in precision, while the latter is superior in speed of computation. The remaining part of the paper is organized as follows. Section 2 contains an overview of related work. Section 3 informally outlines the techniques used in our approach. Our algorithms and data structures are presented in section 4, followed by an analysis of experimental results in section 5.
2. Related work

Years of evolution of multidimensional searching have led to the development of various indexing algorithms. The best known of them are R-trees [8], the X-tree [2], VA-files, and many others [3]. In the past decade, considerable attention has been attracted by the scheme of locality-sensitive hashing (LSH). M-trees [4] are intended for indexing the points of a metric space. R-tree indices [8] work well for low-dimensional spatial data, but the R-tree family degrades rapidly in high dimensions due to the overlap of index pages. The problem of multidimensional search is extremely difficult for high-dimensional vectors. It is well known that the construction of universal indexing structures for large-scale high-dimensional data is impossible. This fact, known as the "curse of dimensionality," is determined by the topology of multidimensional space and does not depend on the index construction method. The practical consequence is that, starting at some dimensionality, the performance of any index structure becomes worse than a full scan. The problem can be somewhat mitigated by considering indexing structures and algorithms that give approximate results. Typically, the vectors represent the proximity of the source application objects only approximately (e.g., vectors used in information retrieval cannot accurately express the meaning of documents), so it is reasonable to use approximate search methods. For example, LSH [6] gives only a probabilistic guarantee of a correct answer to a k-NN query [9, 5].

Locality-sensitive hashing works with vector spaces. The approach was introduced in [15] and substantially developed in the last decade [16-17]. Paper [15] presented the idea of "sketching" the initial dataset with randomized hyperplanes. Then, in [16], an improved method was introduced that uses pivots drawn from an s-stable distribution to hash the vectors of the dataset. The idea of preprocessing the dataset with dimensionality reduction methods [18] allows implementing a new algorithm that appears to be a near-optimal LSH method [17]; however, it works only with the Gaussian norm in Euclidean space.

In many cases, it is computationally difficult not only to find and prepare the vectors (e.g., feature extraction from images), but even to calculate the similarity function (the distance between vectors). Search could be greatly accelerated if it were possible to store a matrix of pairwise distances between objects. Unfortunately, this solution is not scalable, because the size of such an index grows quadratically with the number of objects.

In [10] the k-NN search problem is handled with the double filtering effect of clustering and indexing. The clustering algorithm ensures that the largest cluster fits into main memory and that only the clusters closest to a query point need to be searched and hence loaded into main memory. Within each cluster the data is organized in an ordered-partition tree (OP-tree), a main-memory-resident index that is efficient for processing k-NN queries.

High-dimensional clustering is used by some content-based image retrieval systems to partition the data into groups (clusters), which are then indexed to accelerate query processing. Recently, the Cluster Pruning approach was proposed in [12] as a simple way to produce such clusters; its evaluation was performed within an image indexing context. The paper [12] discusses the parameters that affect the efficiency of the algorithm and proposes changes to the basic algorithm to improve performance.

[11] presents an adaptive Multi-level Mahalanobis-based Dimensionality Reduction (MMDR) technique for high-dimensional indexing. The MMDR technique discovers elliptical clusters for more effective dimensionality reduction; using only the low-dimensional subspaces, data points in the different axis systems are indexed with a single B+-tree. The technique is highly scalable in terms of data size and dimensionality; it is also dynamic and adaptive to insertions. An extensive performance study using both real and synthetic datasets shows that the technique not only achieves higher precision, but also enables queries to be processed efficiently.

In paper [14] a new clustering method is proposed: only points that are close to each other are grouped together, while the remaining points are stored separately, unclustered, so the resulting structure is unbalanced. To create the indexing structure of [13], the distances between a selected set of pivots and the data objects are computed and sorted, and the nearest distances are stored in separate tables; during search, the query is first compared with the pivots.
3. The approach and rationale

In this paper we describe two methods for approximate multidimensional search. The first one is a variation of a matrix tree, based on a spanning-tree clustering algorithm. The second is an implementation of locality-sensitive hashing (LSH), specifically tuned for high computational speed.

The goal of this work is to construct a sub-optimal index structure for approximate searching of multidimensional objects based on a similarity feature (a distance function). This structure should provide a reasonable response time for practical ranges of the different parameters. In order to evaluate the characteristics and performance of this structure, we have implemented the model on top of an industrial relational DBMS. This decision may affect efficiency, but it also ensures a quick implementation suitable for use in a prototype, for example, in an experimental system for content-based image retrieval; that is, the developed system allows finding images similar to a desired one by a number of characteristics.

If we have a reasonable matrix of pairwise distances, the problem of finding the K nearest neighbors can be solved by a one-dimensional index range scan, an operation that is efficiently implemented in any relational database. In order to limit the size of the index, we do not store all pairwise distances, but only the distances to the K nearest points. Usually, search engines need to solve the k-NN problem only for small values of K.

3.1 Locality-sensitive idea

A family H of hash functions is called locality-sensitive with parameters (R, cR, P_1, P_2) if the following holds:

  ||p - q|| < R   =>  Pr_H[h(p) = h(q)] > P_1,
  ||p - q|| > cR  =>  Pr_H[h(p) = h(q)] < P_2.      (1)

Intuitively this means that points with a small ||p - q|| (points that are near each other in the initial space) collide into the same hash value with higher probability than points that are far apart.

The LSH method is based on hashing the dataset with functions chosen uniformly at random from a locality-sensitive family. To satisfy an approximate nearest neighbor query, we do not need to perform a full scan: it is enough to examine the buckets into which the query point falls.

To construct an LSH family, one needs to choose the hash functions and determine the probabilities P_1 and P_2. An indexing structure built according to these parameters can answer nearest neighbor queries with the following query (2) and preprocessing (3) time:

  n^{log(1/P_1) / log(1/P_2)},        (2)

  n^{1 + log(1/P_1) / log(1/P_2)}.    (3)
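To make the bucket idea concrete, the following minimal Python sketch (illustrative only; it is not part of our DBMS implementation, and it uses the random-hyperplane sketches of [15] rather than the s-stable family defined below) hashes a toy dataset once and then answers a k-NN query by ranking, by exact L1 distance, only the points that fall into the same bucket as the query.

import numpy as np

def hyperplane_key(v, planes):
    # Sign pattern of dot products with random hyperplanes: a simple locality-sensitive sketch.
    return tuple((planes @ v > 0).astype(int))

def bucket_knn(data, query, k, num_planes=8, seed=0):
    # Hash every point once, then rank by exact L1 distance only the points in the query's bucket.
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(num_planes, data.shape[1]))
    buckets = {}
    for idx, v in enumerate(data):
        buckets.setdefault(hyperplane_key(v, planes), []).append(idx)
    candidates = buckets.get(hyperplane_key(query, planes), [])
    return sorted(candidates, key=lambda i: np.abs(data[i] - query).sum())[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=(1000, 41))              # toy dataset with 41 attributes, as in section 5
    query = data[0] + 0.01 * rng.normal(size=41)    # a slightly perturbed stored point
    print(bucket_knn(data, query, k=10))

Only the candidates sharing the query's bucket are scanned, which is exactly why the answer is approximate: near neighbors that landed in other buckets are missed with some probability.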
4. Data structures and algorithms

4.1 Algorithms description

In the experiments we rely on a high-performance relational database system: our storage structures were built on Microsoft SQL Server. The data are presented as points of a multidimensional space, i.e., we consider the objects as multidimensional vectors.

4.1.1. Spanning tree

The first storage structure is organized as follows. For a given set of vectors we construct a spanning tree of minimum length, treating the vectors as individual graph components and computing all pairwise distances between them. The metric space is defined as a pair (D, d), where D denotes the domain of objects and d: D×D → R is a total function with the well-known non-negativity, symmetry, identity and triangle-inequality properties. The distance is calculated using the L1 metric.

We create the index structure as follows:

A. Add to the tree the arc of minimal length whose endpoints belong to different components. At the same time we restrict the maximum number of arcs incident to each vertex.
B. Repeat step A until the number of components becomes equal to one.
C. Split the tree into l components by removing the l - 1 longest arcs. During splitting we monitor the number of vertices in each connected component: it must be more than Nmin and less than Nmax.
D. For every resulting component we find the central vector, the centroid. The centroids are vectors in the same space, but their quantity is much smaller than the number of source vectors, and we consider them as the vectors of a new level.
E. For the new level we repeat steps A-D, thus building the second level of our indexing structure.
F. We add levels with steps A-E until the number of allocated centroids on the top level becomes sufficiently small. The last set of centroids forms the upper level of the index tree.

Search is organized as follows:

A. Begin from the upper-level components.
B. Calculate the distances between the current level's vectors and the given search query.
C. Choose the nearest vector and the corresponding component and proceed to the next level, repeating step B; using the precalculated matrix of distances, we choose the nearest centroid and descend to the next level.
D. On the lowest level we select the vectors nearest to the query.

A new vector is added as follows:

A. Repeating steps A-D of the search, we find the lowest-level component whose center is nearest to the specified vector.
B. If the number of vectors in this component exceeds the upper limit, we do not divide it into two parts; instead, we climb one level up and re-cluster all vectors contained in this component.
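The search procedure above can be summarized by the following minimal in-memory Python sketch (illustrative; in our system the index is stored in relational tables as described in section 4.2, and the two-level example data below are hypothetical).

import numpy as np

def l1(a, b):
    return float(np.abs(np.asarray(a) - np.asarray(b)).sum())

def tree_search(levels, vectors, query, k):
    # levels is ordered from the top level down; each level maps a component id to
    # (centroid, children), where children are component ids for inner levels and
    # vector ids for the lowest level.
    candidates = list(levels[0].keys())                      # step A: upper-level components
    for depth, level in enumerate(levels):
        # Step B: distances between the current level's centroids and the query.
        best = min(candidates, key=lambda c: l1(level[c][0], query))
        children = level[best][1]                            # step C: descend into the nearest component
        if depth + 1 < len(levels):
            candidates = children
        else:
            # Step D: on the lowest level, rank the stored vectors themselves.
            return sorted(children, key=lambda v: l1(vectors[v], query))[:k]

# Hypothetical two-level index over four 2-dimensional vectors.
vectors = {0: [0, 0], 1: [1, 0], 2: [9, 9], 3: [10, 9]}
levels = [
    {"root": ([5, 4.5], ["left", "right"])},
    {"left": ([0.5, 0], [0, 1]), "right": ([9.5, 9], [2, 3])},
]
print(tree_search(levels, vectors, [9, 8], k=2))   # -> [2, 3]

Note that only one branch is followed per level, which keeps the search cheap but is also the source of approximation errors when the query lies near a component boundary.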
4.1.2. s-stable distribution LSH

This randomized algorithm uses the idea described in [16]: the scalar product with random vectors drawn from an s-stable distribution is used to "sketch" the initial high-dimensional vectors.

The locality-sensitive hashing family with parameters (R, cR, P_1, P_2), where P_1 and P_2 are the probabilities of "good" and "bad" collisions, R is the query radius and c is the approximation factor, is defined as follows:

  h(v) = floor((a·v + b) / w),      (4)

where a is a random vector drawn from an s-stable distribution and predefined in the pivots table, b is a random shift chosen uniformly at random from [0, w], and v is a vector from the initial dataset.

Due to s-stability, for any two vectors p and q the distance between their projections, a·p - a·q, is distributed as ||p - q||_s · X, where X follows an s-stable distribution. This means that if two points are near each other (||p - q|| is small), they collide with high probability, and if they are far apart, they collide with small probability.

However, if we preprocess the dataset with these functions alone, we will get plenty of false-positive answers because of the small difference between P_1 and P_2. To increase it, we construct composite hash functions g_i, i = 1..L, as concatenations of simple hash functions h_{i,j}, j = 1..K. For the new hash functions the collision probabilities become P_1^K and P_2^K. To increase the total probability of retrieving the answer, we hash the dataset with L different hash functions g_i. Equation (5) describes the probability δ of getting a correct answer:

  δ = 1 - (1 - P_1^K)^L.      (5)

To build the indexing structure, we need to precompute the parameters L, K and w. The width w can be estimated using formula (6), where the collision probabilities P_1 and P_2 are functions of w:

  w_min = argmin_w  log(1/P_1) / log(1/P_2).      (6)

The parameters K and L are chosen experimentally by constructing different data structures that satisfy equation (7):

  L = floor( log(1/δ) / (-log(1 - P_1^K)) ).      (7)
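A compact Python sketch of the hash family (4), of a composite function g built from K simple functions, and of the estimate (7) for L is given below. It is illustrative only: the parameter values are placeholders, the pivots are Gaussian (the 2-stable case; a Cauchy distribution would be the 1-stable analogue for the L1 metric), and in our system these objects live in the relational tables described below.

import numpy as np

class StableHash:
    # h(v) = floor((a . v + b) / w) with a drawn from a 2-stable (Gaussian) distribution.
    def __init__(self, dim, w, rng):
        self.a = rng.normal(size=dim)          # random pivot vector
        self.b = rng.uniform(0, w)             # random shift from [0, w]
        self.w = w

    def __call__(self, v):
        return int(np.floor((self.a @ v + self.b) / self.w))

class CompositeHash:
    # g(v) = (h_1(v), ..., h_K(v)): concatenation that sharpens P1 vs P2 to P1^K vs P2^K.
    def __init__(self, dim, w, K, rng):
        self.hs = [StableHash(dim, w, rng) for _ in range(K)]

    def __call__(self, v):
        return tuple(h(v) for h in self.hs)

def tables_needed(delta, p1, K):
    # Number of tables L from equation (7): floor(log(1/delta) / -log(1 - P1^K)).
    return int(np.floor(np.log(1 / delta) / -np.log(1 - p1 ** K)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = CompositeHash(dim=41, w=4.0, K=6, rng=rng)     # w and K are chosen arbitrarily here
    print(g(rng.normal(size=41)))
    print(tables_needed(delta=0.1, p1=0.8, K=6))       # toy P1 and target failure probability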
4.2 Indexing on spanning tree

4.2.1. Data structures

For the indexing structure we used four tables. The first table stores the coordinates of the vectors. So that the table structure does not depend on the dimensionality of the space, each vector is represented as a set of rows, one row per coordinate:

  Vector number      Int
  Coordinate number  Int
  Value              Float

The second table stores the calculated distances between all pairs of vectors. The performance of the algorithm does not depend on the distance measure; in our experiments the distance was calculated with the L1 metric, r(x, y) = Σ|x_i - y_i|, but for the method itself it does not matter. From these distances we construct a table in which each row stores the distance between two vectors:

  Vector number 1    Int
  Vector number 2    Int
  Distance           Float
  Component number   Int

We build a spanning tree of minimum length and divide it into components. For arcs not belonging to the spanning tree the component number equals zero; otherwise this field indicates the number of the component to which the arc belongs. This table is very large, but it is needed only at the stage of constructing the base index; later, during search and insertion, only some of the pairwise distances are required.

The vector table is as follows:

  Vector number            Int
  Component number         Int
  Distance                 Float
  Low-level vectors number Int

For top-level vectors the component number equals zero; otherwise this attribute indicates the number of the component that contains the vector. The Distance attribute, used during search, is the distance between the vector and the center of its component; for top-level vectors this distance is zero. For every centroid vector we also store the number of low-level vectors: it is necessary to know how many vectors its component contains.

The component table has the following structure:

  Component number   Int
  Level              Int
  Vectors amount     Int
  Centroid number    Int

Vectors amount indicates how many vectors are stored at the lowest level of this component.

4.2.2. Index structure construction

To construct a spanning tree of minimum length we use the following algorithm:

1. We consider the multidimensional vectors as the vertices of the tree and the distances between vectors as the arcs. At the lowest level each node is assumed to be a separate connected component, and all component numbers are set to zero.
2. Compute all pairwise distances and sort them in ascending order.
3. Select the next arc of minimum length with zero component number and consider the vectors at the ends of the selected arc. If these vectors belong to the same component, we skip this arc and repeat step 3 with the next arc.
4. Check the number of arcs that are incident to these vertices and already belong to some component. The classical tree-construction algorithm is modified here to avoid nodes of high degree: the number of edges adjacent to a node is restricted by a parameter (no more than Nin). If the constraint is violated for at least one end of the arc, we skip this arc (marking it by setting its component number to a negative value) and go to step 3.
5. Add the arc between the two vertices to the spanning tree.
6. The two components are connected into one.
7. Repeat steps 3-6 until all the vertices are included in one connected component.
8. Divide the constructed spanning tree into several connected components. Each component must have from Nmin to Nmax vertices; the number of components depends on the number of vectors stored in the table and is determined by Nmin and Nmax. To split the spanning tree, among the arcs belonging to it we select the arc of maximum length and exclude it, forming two connected components. We then check that the number of vectors in each resulting component is more than Nmin; if not, we go back and mark the removed arc. We iterate over the arcs of the spanning tree until we succeed in breaking a connected component into two parts, and then divide each new piece in the same way until the size of every component lies between Nmin and Nmax.
9. In each component we find the centroid. To find the value of the i-th coordinate of the centroid we average the i-th coordinates of the component's vectors. For each component we store the number of vectors in it.
10. We consider the centroids as multidimensional vectors of the next level. If their number is more than the given Nmax, we repeat steps 1-8 of the algorithm.

4.2.3. Search

1. Begin at the top level.
2. Compute the distances between the stored vectors of the current level (inside the current component) and the query vector, and order the list by distance in ascending order. If the current level is the lowest one, we have found the answer.
3. Take the first vector (the one with minimum distance) and determine its component and the number of vectors contained in it, then exclude the first element from the list. If the number of vectors in the component is less than desired, repeat step 3.
4. Determine the vectors of the selected components and go to step 2.

4.2.4. Scalability

1. Perform the search (steps 1-4) for k = 1.
2. The selected component contains the vectors nearest to the given query, and we try to insert the new vector into it. If the number of vectors in this connected component is less than the maximum allowed, the component is updated with the new vector, and we recalculate the centroid's coordinates and the number of vectors inside the components on all levels.
3. If the number of vectors exceeds the upper limit, we do not divide the component into two parts. Instead, we climb one level up, find all vectors belonging to the selected component of the current level, and re-cluster all vectors of this component: we rebuild the spanning tree for the new set of vectors and divide it into connected components. We do this at every level of the indexing structure whenever the current component becomes overcrowded.

The insertion operation obtained this way is rather labor-intensive, but we anticipate that most requests will be searches rather than updates.
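For illustration, the following Python sketch models one clustering pass of the construction above. It is a simplified model, not our SQL Server implementation: it builds the minimum spanning tree with Kruskal's algorithm, omits the node-degree bound of step 4 and the Nmin/Nmax bookkeeping of step 8, splits the tree by removing the longest arcs, and averages each component into a centroid (step 9).

import numpy as np
from itertools import combinations

def l1(a, b):
    return float(np.abs(a - b).sum())

def mst_edges(vectors):
    # Kruskal's algorithm over all pairwise L1 distances (steps 1-7, without the degree limit).
    n = len(vectors)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(combinations(range(n), 2), key=lambda e: l1(vectors[e[0]], vectors[e[1]]))
    tree = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                       # only join different components
            parent[ru] = rv
            tree.append((u, v))
    return tree

def split_and_centroids(vectors, n_components):
    # Steps 8-9: drop the longest tree arcs, collect components, average them into centroids.
    tree = mst_edges(vectors)
    tree.sort(key=lambda e: l1(vectors[e[0]], vectors[e[1]]))
    kept = tree[: len(tree) - (n_components - 1)]      # remove the n_components - 1 longest arcs
    comp = {i: {i} for i in range(len(vectors))}       # rebuild connected components from kept arcs
    for u, v in kept:
        merged = comp[u] | comp[v]
        for node in merged:
            comp[node] = merged
    components = {frozenset(c) for c in comp.values()}
    return [np.mean([vectors[i] for i in c], axis=0) for c in components]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(10, 1, (20, 5))])
    for centroid in split_and_centroids(pts, n_components=2):
        print(np.round(centroid, 2))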
4.3 LSH method

4.3.1. Data structures

To implement locality-sensitive hashing based on s-stable distributions we used the following structures.

A table containing the vector number and a nested table with its coordinates:

  Vector number   Int
  Coordinates     Nested Table of Float

A table with predefined pivots drawn from the normal distribution and a random shift for each of them:

  Pivot number    Int
  Coordinates     Nested Table of Float
  Random shift    Float

L tables for the different hash functions, all with an identical structure: K columns with the hash values of the simple hash functions and a nested table for the keys of the vectors that share these hash values. We can store all hash values in the same table because the number of simple hash functions is fixed at the preprocessing stage:

  Hash_value_1    Float
  Hash_value_2    Float
  …               …
  Hash_value_K    Float
  Vector keys     Nested Table of Int

And a table that stores the choice of pivots for each hash function:

  Number of hash function  Int
  Pivot_1                  Int
  Pivot_2                  Int
  …                        …
  Pivot_K                  Int

Here the number of pivots is also predefined during the preprocessing stage. We also have a statistics table that is needed only for preprocessing; it helps to determine the optimal parameters of the indexing structure:

  Value of K          Int
  Average query time  Float

4.3.2. Preprocessing and index structure construction

Step 1: Preprocessing. To find the optimal values of K and L, we build indexing structures for different K from 3 to 15 and the corresponding L calculated by formula (7):

A. Build the indexing structure for a certain value of K.
B. Hash all points p ∈ P with the L hash functions.
C. Run the testing set on this indexing structure.
D. Store the value of K and the average query time in the statistics table.
E. Delete the indexing structure.
F. Repeat A-E while K < 16.
G. Retrieve from the statistics table the value of K with the minimal query time.
H. Build the indexing structure with the obtained value of K and the appropriate value of L.

Step 2: Search.

I. Hash the query point q with the hash function g_i.
J. Retrieve all points from the bucket g_i(q).
K. If the retrieved points are not enough to satisfy the k-NN query, repeat steps I-J with the next hash function g_{i+1}.
L. Sort the retrieved points and return the first k elements as the answer.
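The search of Step 2 can be modelled by the following self-contained Python sketch (the parameter values are hypothetical and our real implementation keeps all tables inside the DBMS): the dataset is hashed into L tables with composite functions g_i, and a query probes the buckets g_1(q), g_2(q), … until enough candidates are collected, which are then ranked by exact L1 distance.

import numpy as np

def make_g(dim, K, w, rng):
    # One composite hash g: v -> tuple of K values floor((a . v + b) / w), see equation (4).
    A = rng.normal(size=(K, dim))            # K Gaussian pivot vectors
    b = rng.uniform(0, w, size=K)
    return lambda v: tuple(np.floor((A @ v + b) / w).astype(int))

def build_index(data, L, K, w, seed=0):
    # Hash every point into L tables, one per composite function g_i (preprocessing, step B).
    rng = np.random.default_rng(seed)
    gs = [make_g(data.shape[1], K, w, rng) for _ in range(L)]
    tables = []
    for g in gs:
        table = {}
        for idx, v in enumerate(data):
            table.setdefault(g(v), []).append(idx)
        tables.append(table)
    return gs, tables

def knn_query(data, gs, tables, q, k):
    # Steps I-L: probe buckets g_i(q) until enough candidates, then rank by exact L1 distance.
    candidates = set()
    for g, table in zip(gs, tables):
        candidates.update(table.get(g(q), []))
        if len(candidates) >= k:             # stop probing once the bucket union is large enough
            break
    return sorted(candidates, key=lambda i: np.abs(data[i] - q).sum())[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=(25000, 41))      # same scale as the dataset used in section 5
    gs, tables = build_index(data, L=8, K=6, w=4.0)
    q = data[42] + 0.01 * rng.normal(size=41)
    print(knn_query(data, gs, tables, q, k=10))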
5. Experiment and analysis

To compare the described algorithms, a dataset related to content-based image retrieval was considered. It consists of 25000 rows with 41 attributes representing different image characteristics. The methods were implemented within the Microsoft SQL Server DBMS. During the experiments, the average accuracy (the percentage of exact nearest neighbors in the retrieved approximate answer) and the relative time consumption were measured.

Results for the spanning-tree algorithm:

  1-NN query – 87%
  10-NN query – 70%
  50-NN query – 79%

The query response time was 10 times faster than that of a naive full scan.

The method based on LSH produced the following results:

  1-NN query – 62%
  10-NN query – 50%
  50-NN query – 34%

However, the hashing algorithm worked approximately 40 times faster than the previous one.

6. Conclusion

In this paper two different algorithms for multidimensional indexing were described. The algorithms were implemented and analyzed with different parameters on a real dataset. Based on the results of the experiments, we can draw the following conclusions: the first proposed method gives a sufficiently accurate result, but compared to LSH it loses heavily in time. If the k-NN search retrieves not 2k but 3-4k points, the precision increases, but the difference between the tree and hashing query response times becomes even more significant.

References

[1] Lars Arge, Mark de Berg, Herman J. Haverkort, Ke Yi: The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree.
[2] Stefan Berchtold, Daniel A. Keim, Hans-Peter Kriegel: The X-tree: An Index Structure for High-Dimensional Data. In Proceedings of the 22nd International Conference on Very Large Databases (1996), pp. 28-39.
[3] D. Bremner, E. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin, G. Toussaint: Output-sensitive algorithms for computing nearest-neighbor decision boundaries. Discrete and Computational Geometry, Vol. 33, No. 4, 2005, pp. 593-604.
[4] P. Ciaccia et al.: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997.
[5] Thomas M. Cover, Peter E. Hart: Nearest neighbor pattern classification. IEEE Transactions on Information Theory, Vol. 13 (1), 1967, pp. 21-27.
[6] A. Gionis, P. Indyk, R. Motwani: Similarity Search in High Dimensions via Hashing. Proceedings of the 25th Very Large Database (VLDB) Conference, 1999.
[7] A. Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984, pp. 47-57.
[8] Yannis Manolopoulos, Alexandros Nanopoulos, Apostolos N. Papadopoulos, Yannis Theodoridis: R-Trees: Theory and Applications. Springer, 2005. ISBN 1-85233-977-2.
[9] F. Nigsch, A. Bender, B. van Buuren, J. Tissen, E. Nigsch, J.B.O. Mitchell: Melting Point Prediction Employing k-nearest Neighbor Algorithms and Genetic Parameter Optimization. Journal of Chemical Information and Modeling 46 (6), 2006, pp. 2412-2422.
[10] Alexander Thomasian, Lijuan Zhang: Persistent clustered main memory index for accelerating k-NN queries on high dimensional datasets. Multimedia Tools and Applications, Volume 38, Issue 2, June 2008.
[11] Heng Tao Shen, Xiaofang Zhou, Aoying Zhou: An adaptive and dynamic dimensionality reduction method for high-dimensional indexing. The VLDB Journal, Volume 16, Issue 2, April 2007.
[12] Gylfi Thór Gudmundsson, Björn Thór Jónsson, Laurent Amsaleg: A large-scale performance study of cluster-based high-dimensional indexing. Proceedings of VLS-MCMR '10, International Workshop on Very-Large-Scale Multimedia Corpus, Mining and Retrieval.
[13] Stanislav Barton, Valérie Gouet-Brunet, Marta Rukoz: Large Scale Disk-Based Metric Indexing Structure for Approximate Information Retrieval by Content. Proceedings of the EDBT/ICDT 2011 Joint Conference.
[14] Stephan Günnemann, Hardy Kremer, Dominik Lenhard, Thomas Seidl: Subspace Clustering for Indexing High Dimensional Data: A Main Memory Index based on Local Reductions and Individual Multi-Representations. Proceedings of the EDBT/ICDT 2011 Joint Conference.
[15] P. Indyk, R. Motwani: Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the Symposium on Theory of Computing, 1998.
[16] M. Datar, N. Immorlica, P. Indyk, V. Mirrokni: Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the ACM Symposium on Computational Geometry, 2004.
[17] A. Andoni, P. Indyk: Near-optimal hashing algorithms for approximate nearest neighbors in high dimensions. Foundations of Computer Science, 2006.
[18] N. Ailon, B. Chazelle: Approximate nearest neighbors and the Fast Johnson-Lindenstrauss Transform. Proceedings of the Symposium on Theory of Computing, 2006.

♣ This work was supported by RFBR (project 10-07-00156-а).