Similarity Multidimensional Indexing♣

© Elena Mikhaylova, Boris Novikov, Anton Volokhov
Saint Petersburg State University
egmichailova@mail.ru, borisnov@acm.org, a.v.volokhov@gmail.com

Abstract

The multidimensional k-NN (k nearest neighbors) query problem arises in a large variety of database applications, including information retrieval, natural language processing, and data mining. To solve it efficiently, a database needs an indexing structure that supports this kind of search. However, an exact solution is hardly feasible in multidimensional space. In this paper we describe and analyze an indexing technique for the approximate solution of the k-NN problem. Construction of the indexing tree is based on clustering; construction of the hash index is based on s-stable distributions. The indices are implemented on top of a high-performance industrial DBMS.

Keywords: k-NN search, multidimensional indexing, LSH

1. Introduction

Efficiency of search is critical for modern information retrieval. Commonly, search queries retrieve a relatively small portion of the stored information, and in this case indexed search is much more efficient than a full scan. Any process or stored object can be characterized by a set of features that are usually called attributes. The purpose of any index is to provide quick access to an object by the values of some of its attributes; in other words, indices provide an effective implementation of associative search.

If the attribute values belong to a linearly ordered set, searching can be implemented with one-dimensional indices. Such indexing structures (B-trees, hash files) were thoroughly studied in the 1970s and 1980s. Many real applications, however, need to search by the values of several attributes, or by attributes that cannot be naturally ordered linearly. In this case searching in multidimensional space is required. Of the numerous multidimensional indexing structures proposed in the 1980s, only R-trees [1] have passed the test of time: R-tree indexes are widely used in various applications and are implemented in industrial databases.

Multidimensional searching is primarily associated with processing spatial and spatio-temporal data. Another class of applications processing multidimensional data includes systems based on various flavors of the text vector model for information retrieval, data mining, pattern recognition, data compression, etc. Data produced by applications of this class are typically characterized by high dimensionality. Typically, the processing of multidimensional data requires searching on the basis of proximity or similarity, rather than exact equality of attributes. The most common search query is to find the K nearest neighbors of a given point in a dataset.

In this work we present and compare two approaches to similarity multidimensional search. The first idea is based on a tree-clustering structure; the second one uses the LSH method. The former is better in precision, while the latter is superior in speed of computation. The remaining part of the paper is organized as follows. Section 2 contains an overview of related work. Section 3 informally outlines the techniques used in our approach. Our algorithms and data structures are presented in section 4, followed by an analysis of experimental results in section 5.
2. Related work

Years of evolution of multidimensional searching have led to the development of various indexing algorithms. The best known of them are R-trees [8], the X-tree [2], VA-files, and many others [3]. In the past decade, considerable attention has been attracted by the scheme of locality-sensitive hashing (LSH). M-trees [4] are intended for indexing the points of a metric space. R-tree indices [8] work well for low-dimensional spatial data, but the R-tree family degrades rapidly in high dimensions due to the overlap of index pages. The problem of multidimensional search is extremely difficult for high-dimensional vectors. It is well known that the construction of universal indexing structures for large-scale high-dimensional data is impossible. This fact, known as the "curse of dimensionality," is determined by the topology of multidimensional space and does not depend on the index construction method. The practical consequence is that, starting at some dimensionality, the performance of any index structure becomes worse than a full scan. The problem can be somewhat mitigated by considering indexing structures and algorithms that give approximate results. Typically, the vectors represent the proximity of the source application objects only approximately (e.g., vectors used in information retrieval cannot accurately express the meaning of documents), so it is reasonable to use approximate search methods. For example, LSH [6] gives only a probabilistic guarantee of a correct answer to a k-NN query [9, 5].

Locality-sensitive hashing works with vector spaces. The approach was introduced in [15] and substantially developed in the last decade [16-17]. Paper [15] presented the idea of "sketching" the initial dataset with randomized hyperplanes. Then, in [16], an improved method was introduced that uses pivots drawn from an s-stable distribution to hash the vectors of the dataset. The idea of preprocessing the dataset with dimensionality reduction methods [18] allows implementing a new algorithm that appears to be a near-optimal LSH method [17]; however, it works only with the Gaussian norm in Euclidean space.

In many cases, it is computationally difficult not only to find and prepare the vectors (e.g., feature extraction from images), but even to calculate the similarity function (the distance between vectors). Search could be greatly accelerated if it were possible to store a matrix of pairwise distances between objects. Unfortunately, this solution is not scalable, because the size of such an index grows quadratically with the number of objects.

In [10] the k-NN search problem is handled with the double filtering effect of clustering and indexing. The clustering algorithm ensures that the largest cluster fits into main memory and that only the clusters closest to a query point need to be searched and hence loaded into main memory. Within each cluster the data is organized in an ordered-partition tree (OP-tree), a main-memory-resident index that is efficient for processing k-NN queries.

High-dimensional clustering is used by some content-based image retrieval systems to partition the data into groups (clusters), which are then indexed to accelerate query processing. Recently, the Cluster Pruning approach was proposed in [12] as a simple way to produce such clusters; its evaluation was performed within an image indexing context. The paper [12] discusses the parameters that affect the efficiency of the algorithm and proposes changes to the basic algorithm to improve performance.

[11] presents an adaptive Multi-level Mahalanobis-based Dimensionality Reduction (MMDR) technique for high-dimensional indexing. The MMDR technique discovers elliptical clusters for more effective dimensionality reduction; using only the low-dimensional subspaces, data points in the different axis systems are indexed with a single B+-tree. The technique is highly scalable in terms of data size and dimensionality; it is also dynamic and adaptive to insertions. An extensive performance study using both real and synthetic datasets shows that the technique not only achieves higher precision, but also enables queries to be processed efficiently.

In paper [14] a new clustering method is proposed: only points that are close to each other are grouped together, while the remaining points are stored separately, unclustered, so the resulting structure is unbalanced. To create the indexing structure of [13], the distances between a selected set of pivots and the data objects are computed and sorted, and the nearest distances are stored in separate tables; during search, the query is first compared with the pivots.
3. The approach and rationale

In this paper we describe two methods for approximate multidimensional search. The first one is a variation of a matrix tree, based on a spanning-tree clustering algorithm. The second is an implementation of locality-sensitive hashing (LSH), specifically tuned for high computational speed.

The goal of this work is to construct a sub-optimal index structure for approximate searching of multidimensional objects based on a similarity feature (a distance function). This structure should provide a reasonable response time for practical ranges of the different parameters. In order to evaluate the characteristics and performance of this structure, we have implemented the model on top of an industrial relational DBMS. This decision may affect efficiency, but it also ensures a quick implementation suitable for use in a prototype, for example, in an experimental system for content-based image retrieval; that is, the developed system allows finding images similar to a desired one by a number of characteristics.

If we have a reasonable matrix of pairwise distances, the problem of finding the K nearest neighbors can be solved by a one-dimensional index range scan, an operation that is efficiently implemented in any relational database. In order to limit the size of the index, we do not store all pairwise distances, but only the distances to the K nearest points. Usually, search engines need to solve the k-NN problem only for small values of K.

3.1 Locality-sensitive idea

A family H of hash functions is called locality-sensitive with parameters (R, cR, P_1, P_2) if the following holds:

  ||p - q|| < R   =>  Pr_H[h(p) = h(q)] > P_1,
  ||p - q|| > cR  =>  Pr_H[h(p) = h(q)] < P_2.      (1)

Intuitively this means that points with a small ||p - q|| (points that are near each other in the initial space) collide into the same hash value with higher probability than points that are far apart.

The LSH method is based on hashing the dataset with functions chosen uniformly at random from a locality-sensitive family. To satisfy an approximate nearest neighbor query, we do not need to perform a full scan: it is enough to examine the buckets into which the query point falls.

To construct an LSH family, one needs to choose the hash functions and determine the probabilities P_1 and P_2. An indexing structure built according to these parameters can answer nearest neighbor queries with the following query (2) and preprocessing (3) time:

  n^{log(1/P_1) / log(1/P_2)},        (2)

  n^{1 + log(1/P_1) / log(1/P_2)}.    (3)
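To make the bucket idea concrete, the following minimal Python sketch (illustrative only; it is not part of our DBMS implementation, and it uses the random-hyperplane sketches of [15] rather than the s-stable family defined below) hashes a toy dataset once and then answers a k-NN query by ranking, by exact L1 distance, only the points that fall into the same bucket as the query.

import numpy as np

def hyperplane_key(v, planes):
    # Sign pattern of dot products with random hyperplanes: a simple locality-sensitive sketch.
    return tuple((planes @ v > 0).astype(int))

def bucket_knn(data, query, k, num_planes=8, seed=0):
    # Hash every point once, then rank by exact L1 distance only the points in the query's bucket.
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(num_planes, data.shape[1]))
    buckets = {}
    for idx, v in enumerate(data):
        buckets.setdefault(hyperplane_key(v, planes), []).append(idx)
    candidates = buckets.get(hyperplane_key(query, planes), [])
    return sorted(candidates, key=lambda i: np.abs(data[i] - query).sum())[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=(1000, 41))              # toy dataset with 41 attributes, as in section 5
    query = data[0] + 0.01 * rng.normal(size=41)    # a slightly perturbed stored point
    print(bucket_knn(data, query, k=10))

Only the candidates sharing the query's bucket are scanned, which is exactly why the answer is approximate: near neighbors that landed in other buckets are missed with some probability.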
4. Data structures and algorithms

4.1 Algorithms description

In the experiments we rely on a high-performance relational database system: our storage structures were built on Microsoft SQL Server. The data are presented as points of a multidimensional space, i.e., we consider the objects as multidimensional vectors.

4.1.1. Spanning tree

The first storage structure is organized as follows. For a given set of vectors we construct a spanning tree of minimum length, treating the vectors as individual graph components and computing all pairwise distances between them. The metric space is defined as a pair (D, d), where D denotes the domain of objects and d: D×D → R is a total function with the well-known non-negativity, symmetry, identity and triangle-inequality properties. The distance is calculated using the L1 metric.

We create the index structure as follows:

A. Add to the tree the arc of minimal length whose endpoints belong to different components. At the same time we restrict the maximum number of arcs incident to each vertex.
B. Repeat step A until the number of components becomes equal to one.
C. Split the tree into l components by removing the l - 1 longest arcs. During splitting we monitor the number of vertices in each connected component: it must be more than Nmin and less than Nmax.
D. For every resulting component we find the central vector, the centroid. The centroids are vectors in the same space, but their quantity is much smaller than the number of source vectors, and we consider them as the vectors of a new level.
E. For the new level we repeat steps A-D, thus building the second level of our indexing structure.
F. We add levels with steps A-E until the number of allocated centroids on the top level becomes sufficiently small. The last set of centroids forms the upper level of the index tree.

Search is organized as follows:

A. Begin from the upper-level components.
B. Calculate the distances between the current level's vectors and the given search query.
C. Choose the nearest vector and the corresponding component and proceed to the next level, repeating step B; using the precalculated matrix of distances, we choose the nearest centroid and descend to the next level.
D. On the lowest level we select the vectors nearest to the query.

A new vector is added as follows:

A. Repeating steps A-D of the search, we find the lowest-level component whose center is nearest to the specified vector.
B. If the number of vectors in this component exceeds the upper limit, we do not divide it into two parts; instead, we climb one level up and re-cluster all vectors contained in this component.
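The search procedure above can be summarized by the following minimal in-memory Python sketch (illustrative; in our system the index is stored in relational tables as described in section 4.2, and the two-level example data below are hypothetical).

import numpy as np

def l1(a, b):
    return float(np.abs(np.asarray(a) - np.asarray(b)).sum())

def tree_search(levels, vectors, query, k):
    # levels is ordered from the top level down; each level maps a component id to
    # (centroid, children), where children are component ids for inner levels and
    # vector ids for the lowest level.
    candidates = list(levels[0].keys())                      # step A: upper-level components
    for depth, level in enumerate(levels):
        # Step B: distances between the current level's centroids and the query.
        best = min(candidates, key=lambda c: l1(level[c][0], query))
        children = level[best][1]                            # step C: descend into the nearest component
        if depth + 1 < len(levels):
            candidates = children
        else:
            # Step D: on the lowest level, rank the stored vectors themselves.
            return sorted(children, key=lambda v: l1(vectors[v], query))[:k]

# Hypothetical two-level index over four 2-dimensional vectors.
vectors = {0: [0, 0], 1: [1, 0], 2: [9, 9], 3: [10, 9]}
levels = [
    {"root": ([5, 4.5], ["left", "right"])},
    {"left": ([0.5, 0], [0, 1]), "right": ([9.5, 9], [2, 3])},
]
print(tree_search(levels, vectors, [9, 8], k=2))   # -> [2, 3]

Note that only one branch is followed per level, which keeps the search cheap but is also the source of approximation errors when the query lies near a component boundary.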
4.1.2. s-stable distribution LSH

This randomized algorithm uses the idea described in [16]: the scalar product with random vectors drawn from an s-stable distribution is used to "sketch" the initial high-dimensional vectors.

The locality-sensitive hashing family with parameters (R, cR, P_1, P_2), where P_1 and P_2 are the probabilities of "good" and "bad" collisions, R is the query radius and c is the approximation factor, is defined as follows:

  h(v) = floor((a·v + b) / w),      (4)

where a is a random vector drawn from an s-stable distribution and predefined in the pivots table, b is a random shift chosen uniformly at random from [0, w], and v is a vector from the initial dataset.

Due to s-stability, for any two vectors p and q the distance between their projections, a·p - a·q, is distributed as ||p - q||_s · X, where X follows an s-stable distribution. This means that if two points are near each other (||p - q|| is small), they collide with high probability, and if they are far apart, they collide with small probability.

However, if we preprocess the dataset with these functions alone, we will get plenty of false-positive answers because of the small difference between P_1 and P_2. To increase it, we construct composite hash functions g_i, i = 1..L, as concatenations of simple hash functions h_{i,j}, j = 1..K. For the new hash functions the collision probabilities become P_1^K and P_2^K. To increase the total probability of retrieving the answer, we hash the dataset with L different hash functions g_i. Equation (5) describes the probability δ of getting a correct answer:

  δ = 1 - (1 - P_1^K)^L.      (5)

To build the indexing structure, we need to precompute the parameters L, K and w. The width w can be estimated using formula (6), where the collision probabilities P_1 and P_2 are functions of w:

  w_min = argmin_w  log(1/P_1) / log(1/P_2).      (6)

The parameters K and L are chosen experimentally by constructing different data structures that satisfy equation (7):

  L = floor( log(1/δ) / (-log(1 - P_1^K)) ).      (7)
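A compact Python sketch of the hash family (4), of a composite function g built from K simple functions, and of the estimate (7) for L is given below. It is illustrative only: the parameter values are placeholders, the pivots are Gaussian (the 2-stable case; a Cauchy distribution would be the 1-stable analogue for the L1 metric), and in our system these objects live in the relational tables described below.

import numpy as np

class StableHash:
    # h(v) = floor((a . v + b) / w) with a drawn from a 2-stable (Gaussian) distribution.
    def __init__(self, dim, w, rng):
        self.a = rng.normal(size=dim)          # random pivot vector
        self.b = rng.uniform(0, w)             # random shift from [0, w]
        self.w = w

    def __call__(self, v):
        return int(np.floor((self.a @ v + self.b) / self.w))

class CompositeHash:
    # g(v) = (h_1(v), ..., h_K(v)): concatenation that sharpens P1 vs P2 to P1^K vs P2^K.
    def __init__(self, dim, w, K, rng):
        self.hs = [StableHash(dim, w, rng) for _ in range(K)]

    def __call__(self, v):
        return tuple(h(v) for h in self.hs)

def tables_needed(delta, p1, K):
    # Number of tables L from equation (7): floor(log(1/delta) / -log(1 - P1^K)).
    return int(np.floor(np.log(1 / delta) / -np.log(1 - p1 ** K)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = CompositeHash(dim=41, w=4.0, K=6, rng=rng)     # w and K are chosen arbitrarily here
    print(g(rng.normal(size=41)))
    print(tables_needed(delta=0.1, p1=0.8, K=6))       # toy P1 and target failure probability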
4.2 Indexing on spanning tree

4.2.1. Data structures

For the indexing structure we used four tables. The first table stores the coordinates of the vectors. So that the table structure does not depend on the dimensionality of the space, each vector is represented as a set of rows, one row per coordinate:

  Vector number      Int
  Coordinate number  Int
  Value              Float

The second table stores the calculated distances between all pairs of vectors. The performance of the algorithm does not depend on the distance measure; in our experiments the distance was calculated with the L1 metric, r(x, y) = Σ|x_i - y_i|, but for the method itself it does not matter. From these distances we construct a table in which each row stores the distance between two vectors:

  Vector number 1    Int
  Vector number 2    Int
  Distance           Float
  Component number   Int

We build a spanning tree of minimum length and divide it into components. For arcs not belonging to the spanning tree the component number equals zero; otherwise this field indicates the number of the component to which the arc belongs. This table is very large, but it is needed only at the stage of constructing the base index; later, during search and insertion, only some of the pairwise distances are required.

The vector table is as follows:

  Vector number            Int
  Component number         Int
  Distance                 Float
  Low-level vectors number Int

For top-level vectors the component number equals zero; otherwise this attribute indicates the number of the component that contains the vector. The Distance attribute, used during search, is the distance between the vector and the center of its component; for top-level vectors this distance is zero. For every centroid vector we also store the number of low-level vectors: it is necessary to know how many vectors its component contains.

The component table has the following structure:

  Component number   Int
  Level              Int
  Vectors amount     Int
  Centroid number    Int

Vectors amount indicates how many vectors are stored at the lowest level of this component.

4.2.2. Index structure construction

To construct a spanning tree of minimum length we use the following algorithm:

1. We consider the multidimensional vectors as the vertices of the tree and the distances between vectors as the arcs. At the lowest level each node is assumed to be a separate connected component, and all component numbers are set to zero.
2. Compute all pairwise distances and sort them in ascending order.
3. Select the next arc of minimum length with zero component number and consider the vectors at the ends of the selected arc. If these vectors belong to the same component, we skip this arc and repeat step 3 with the next arc.
4. Check the number of arcs that are incident to these vertices and already belong to some component. The classical tree-construction algorithm is modified here to avoid nodes of high degree: the number of edges adjacent to a node is restricted by a parameter (no more than Nin). If the constraint is violated for at least one end of the arc, we skip this arc (marking it by setting its component number to a negative value) and go to step 3.
5. Add the arc between the two vertices to the spanning tree.
6. The two components are connected into one.
7. Repeat steps 3-6 until all the vertices are included in one connected component.
8. Divide the constructed spanning tree into several connected components. Each component must have from Nmin to Nmax vertices; the number of components depends on the number of vectors stored in the table and is determined by Nmin and Nmax. To split the spanning tree, among the arcs belonging to it we select the arc of maximum length and exclude it, forming two connected components. We then check that the number of vectors in each resulting component is more than Nmin; if not, we go back and mark the removed arc. We iterate over the arcs of the spanning tree until we succeed in breaking a connected component into two parts, and then divide each new piece in the same way until the size of every component lies between Nmin and Nmax.
9. In each component we find the centroid. To find the value of the i-th coordinate of the centroid we average the i-th coordinates of the component's vectors. For each component we store the number of vectors in it.
10. We consider the centroids as multidimensional vectors of the next level. If their number is more than the given Nmax, we repeat steps 1-8 of the algorithm.

4.2.3. Search

1. Begin at the top level.
2. Compute the distances between the stored vectors of the current level (inside the current component) and the query vector, and order the list by distance in ascending order. If the current level is the lowest one, we have found the answer.
3. Take the first vector (the one with minimum distance) and determine its component and the number of vectors contained in it, then exclude the first element from the list. If the number of vectors in the component is less than desired, repeat step 3.
4. Determine the vectors of the selected components and go to step 2.

4.2.4. Scalability

1. Perform the search (steps 1-4) for k = 1.
2. The selected component contains the vectors nearest to the given query, and we try to insert the new vector into it. If the number of vectors in this connected component is less than the maximum allowed, the component is updated with the new vector, and we recalculate the centroid's coordinates and the number of vectors inside the components on all levels.
3. If the number of vectors exceeds the upper limit, we do not divide the component into two parts. Instead, we climb one level up, find all vectors belonging to the selected component of the current level, and re-cluster all vectors of this component: we rebuild the spanning tree for the new set of vectors and divide it into connected components. We do this at every level of the indexing structure whenever the current component becomes overcrowded.

The insertion operation obtained this way is rather labor-intensive, but we anticipate that most requests will be searches rather than updates.
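For illustration, the following Python sketch models one clustering pass of the construction above. It is a simplified model, not our SQL Server implementation: it builds the minimum spanning tree with Kruskal's algorithm, omits the node-degree bound of step 4 and the Nmin/Nmax bookkeeping of step 8, splits the tree by removing the longest arcs, and averages each component into a centroid (step 9).

import numpy as np
from itertools import combinations

def l1(a, b):
    return float(np.abs(a - b).sum())

def mst_edges(vectors):
    # Kruskal's algorithm over all pairwise L1 distances (steps 1-7, without the degree limit).
    n = len(vectors)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(combinations(range(n), 2), key=lambda e: l1(vectors[e[0]], vectors[e[1]]))
    tree = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                       # only join different components
            parent[ru] = rv
            tree.append((u, v))
    return tree

def split_and_centroids(vectors, n_components):
    # Steps 8-9: drop the longest tree arcs, collect components, average them into centroids.
    tree = mst_edges(vectors)
    tree.sort(key=lambda e: l1(vectors[e[0]], vectors[e[1]]))
    kept = tree[: len(tree) - (n_components - 1)]      # remove the n_components - 1 longest arcs
    comp = {i: {i} for i in range(len(vectors))}       # rebuild connected components from kept arcs
    for u, v in kept:
        merged = comp[u] | comp[v]
        for node in merged:
            comp[node] = merged
    components = {frozenset(c) for c in comp.values()}
    return [np.mean([vectors[i] for i in c], axis=0) for c in components]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(10, 1, (20, 5))])
    for centroid in split_and_centroids(pts, n_components=2):
        print(np.round(centroid, 2))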
4.3 LSH method

4.3.1. Data structures

To implement locality-sensitive hashing based on s-stable distributions we used the following structures.

A table containing the vector number and a nested table with its coordinates:

  Vector number   Int
  Coordinates     Nested Table of Float

A table with predefined pivots drawn from the normal distribution and a random shift for each of them:

  Pivot number    Int
  Coordinates     Nested Table of Float
  Random shift    Float

L tables for the different hash functions, all with an identical structure: K columns with the hash values of the simple hash functions and a nested table for the keys of the vectors that share these hash values. We can store all hash values in the same table because the number of simple hash functions is fixed at the preprocessing stage:

  Hash_value_1    Float
  Hash_value_2    Float
  …               …
  Hash_value_K    Float
  Vector keys     Nested Table of Int

And a table that stores the choice of pivots for each hash function:

  Number of hash function  Int
  Pivot_1                  Int
  Pivot_2                  Int
  …                        …
  Pivot_K                  Int

Here the number of pivots is also predefined during the preprocessing stage. We also have a statistics table that is needed only for preprocessing; it helps to determine the optimal parameters of the indexing structure:

  Value of K          Int
  Average query time  Float

4.3.2. Preprocessing and index structure construction

Step 1: Preprocessing. To find the optimal values of K and L, we build indexing structures for different K from 3 to 15 and the corresponding L calculated by formula (7):

A. Build the indexing structure for a certain value of K.
B. Hash all points p ∈ P with the L hash functions.
C. Run the testing set on this indexing structure.
D. Store the value of K and the average query time in the statistics table.
E. Delete the indexing structure.
F. Repeat A-E while K < 16.
G. Retrieve from the statistics table the value of K with the minimal query time.
H. Build the indexing structure with the obtained value of K and the appropriate value of L.

Step 2: Search.

I. Hash the query point q with the hash function g_i.
J. Retrieve all points from the bucket g_i(q).
K. If the retrieved points are not enough to satisfy the k-NN query, repeat steps I-J with the next hash function g_{i+1}.
L. Sort the retrieved points and return the first k elements as the answer.
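The search of Step 2 can be modelled by the following self-contained Python sketch (the parameter values are hypothetical and our real implementation keeps all tables inside the DBMS): the dataset is hashed into L tables with composite functions g_i, and a query probes the buckets g_1(q), g_2(q), … until enough candidates are collected, which are then ranked by exact L1 distance.

import numpy as np

def make_g(dim, K, w, rng):
    # One composite hash g: v -> tuple of K values floor((a . v + b) / w), see equation (4).
    A = rng.normal(size=(K, dim))            # K Gaussian pivot vectors
    b = rng.uniform(0, w, size=K)
    return lambda v: tuple(np.floor((A @ v + b) / w).astype(int))

def build_index(data, L, K, w, seed=0):
    # Hash every point into L tables, one per composite function g_i (preprocessing, step B).
    rng = np.random.default_rng(seed)
    gs = [make_g(data.shape[1], K, w, rng) for _ in range(L)]
    tables = []
    for g in gs:
        table = {}
        for idx, v in enumerate(data):
            table.setdefault(g(v), []).append(idx)
        tables.append(table)
    return gs, tables

def knn_query(data, gs, tables, q, k):
    # Steps I-L: probe buckets g_i(q) until enough candidates, then rank by exact L1 distance.
    candidates = set()
    for g, table in zip(gs, tables):
        candidates.update(table.get(g(q), []))
        if len(candidates) >= k:             # stop probing once the bucket union is large enough
            break
    return sorted(candidates, key=lambda i: np.abs(data[i] - q).sum())[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=(25000, 41))      # same scale as the dataset used in section 5
    gs, tables = build_index(data, L=8, K=6, w=4.0)
    q = data[42] + 0.01 * rng.normal(size=41)
    print(knn_query(data, gs, tables, q, k=10))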
5. Experiment and analysis

To compare the described algorithms, a dataset related to content-based image retrieval was considered. It consists of 25000 rows with 41 attributes representing different image characteristics. The methods were implemented within the Microsoft SQL Server DBMS. During the experiments, the average accuracy (the percentage of exact nearest neighbors in the retrieved approximate answer) and the relative time consumption were measured.

Results for the spanning-tree algorithm:

  1-NN query – 87%
  10-NN query – 70%
  50-NN query – 79%

The query response time was 10 times faster than that of a naive full scan.

The method based on LSH produced the following results:

  1-NN query – 62%
  10-NN query – 50%
  50-NN query – 34%

However, the hashing algorithm worked approximately 40 times faster than the previous one.

6. Conclusion

In this paper two different algorithms for multidimensional indexing were described. The algorithms were implemented and analyzed with different parameters on a real dataset. Based on the results of the experiments, we can draw the following conclusions: the first proposed method gives a sufficiently accurate result, but compared to LSH it loses heavily in time. If the k-NN search retrieves not 2k but 3-4k points, the precision increases, but the difference between the tree and hashing query response times becomes even more significant.

References

[1] Lars Arge, Mark de Berg, Herman J. Haverkort, Ke Yi: The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree.
[2] Stefan Berchtold, Daniel A. Keim, Hans-Peter Kriegel: The X-tree: An Index Structure for High-Dimensional Data. In Proceedings of the 22nd International Conference on Very Large Databases (1996), pp. 28-39.
[3] D. Bremner, E. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin, G. Toussaint: Output-sensitive algorithms for computing nearest-neighbor decision boundaries. Discrete and Computational Geometry, Vol. 33, No. 4, 2005, pp. 593-604.
[4] P. Ciaccia et al.: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997.
[5] Thomas M. Cover, Peter E. Hart: Nearest neighbor pattern classification. IEEE Transactions on Information Theory, Vol. 13 (1), 1967, pp. 21-27.
[6] A. Gionis, P. Indyk, R. Motwani: Similarity Search in High Dimensions via Hashing. Proceedings of the 25th Very Large Database (VLDB) Conference, 1999.
[7] A. Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984, pp. 47-57.
[8] Yannis Manolopoulos, Alexandros Nanopoulos, Apostolos N. Papadopoulos, Yannis Theodoridis: R-Trees: Theory and Applications. Springer, 2005. ISBN 1-85233-977-2.
[9] F. Nigsch, A. Bender, B. van Buuren, J. Tissen, E. Nigsch, J.B.O. Mitchell: Melting Point Prediction Employing k-nearest Neighbor Algorithms and Genetic Parameter Optimization. Journal of Chemical Information and Modeling 46 (6), 2006, pp. 2412-2422.
[10] Alexander Thomasian, Lijuan Zhang: Persistent clustered main memory index for accelerating k-NN queries on high dimensional datasets. Multimedia Tools and Applications, Volume 38, Issue 2, June 2008.
[11] Heng Tao Shen, Xiaofang Zhou, Aoying Zhou: An adaptive and dynamic dimensionality reduction method for high-dimensional indexing. The VLDB Journal, Volume 16, Issue 2, April 2007.
[12] Gylfi Thór Gudmundsson, Björn Thór Jónsson, Laurent Amsaleg: A large-scale performance study of cluster-based high-dimensional indexing. Proceedings of VLS-MCMR '10, International Workshop on Very-Large-Scale Multimedia Corpus, Mining and Retrieval.
[13] Stanislav Barton, Valérie Gouet-Brunet, Marta Rukoz: Large Scale Disk-Based Metric Indexing Structure for Approximate Information Retrieval by Content. Proceedings of the EDBT/ICDT 2011 Joint Conference.
[14] Stephan Günnemann, Hardy Kremer, Dominik Lenhard, Thomas Seidl: Subspace Clustering for Indexing High Dimensional Data: A Main Memory Index based on Local Reductions and Individual Multi-Representations. Proceedings of the EDBT/ICDT 2011 Joint Conference.
[15] P. Indyk, R. Motwani: Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the Symposium on Theory of Computing, 1998.
[16] M. Datar, N. Immorlica, P. Indyk, V. Mirrokni: Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the ACM Symposium on Computational Geometry, 2004.
[17] A. Andoni, P. Indyk: Near-optimal hashing algorithms for approximate nearest neighbors in high dimensions. Foundations of Computer Science, 2006.
[18] N. Ailon, B. Chazelle: Approximate nearest neighbors and the Fast Johnson-Lindenstrauss Transform. Proceedings of the Symposium on Theory of Computing, 2006.

♣ This work was supported by RFBR (project 10-07-00156-а).