=Paper= {{Paper |id=Vol-256/paper-5 |storemode=property |title=VP-tree: Content-Based Image Indexing |pdfUrl=https://ceur-ws.org/Vol-256/submission_12.pdf |volume=Vol-256 |dblpUrl=https://dblp.org/rec/conf/syrcodis/Markov07 }} ==VP-tree: Content-Based Image Indexing== https://ceur-ws.org/Vol-256/submission_12.pdf
                          VP-tree: Content-Based Image Indexing

                                                      © Il'ya Markov♣

                                            Saint-Petersburg State University
                                                ilya.markov@gmail.com
                            Abstract

    Recently, a lot of indexing methods have been developed for multidimensional spaces, and for image feature vector spaces in particular. This paper introduces an implementation of the vantage point tree (VP-tree) method. The goal is to define the dependencies between the method's parameters and index efficiency criteria in the context of the content-based image retrieval task.

♣ I would like to thank my scientific adviser Boris Novikov and team leader Natalia Vassilieva for their help and support. This work was partially supported by RFBR (grant 07-07-00268a).
Proceedings of the Spring Young Researcher's Colloquium on Database and Information Systems SYRCoDIS, Moscow, Russia, 2007

1 Introduction

An image is just a two-dimensional array of pixels, and there are various ways of building high-level abstractions from it, for example, color distributions based on histograms. This process is usually referred to as feature extraction. Each scalar piece of such a high-level abstraction is called a feature. A set of features that comprises a coherent high-level interpretation forms a feature class. Example feature classes are color, texture and shape [4].
    Images in the database go through the feature extraction process, so every image is represented as a K-dimensional feature vector, where K is the number of features used to represent an image. The basic assumption is that two image feature vectors are similar according to a distance metric if and only if the corresponding images are similar too. So the image retrieval problem may be considered as a multidimensional nearest-neighbor search problem.

2 Related works

Recently, a lot of indexing methods have been developed for multidimensional spaces and for image feature vector spaces in particular. Among them are the R*-tree [1], k-d-tree [2], K-D-B-tree [8], SR-tree [7], SS-tree [9], VP-tree [4, 10], M-tree [5] and MVP-tree [3].
    The VP-tree was first introduced in [10], where the basic index building and search algorithms and their improvements were proposed. The improved algorithms store additional information about partition boundaries and the distances between each node and its ancestors to speed up the search. They also form buckets by collapsing subtrees near the leaf level into a flat structure in order to save space. Characteristics of the VP-tree are compared with those of the k-d-tree, with the image retrieval task taken as an example. The results show that even a simple implementation of the VP-tree is no worse than the k-d-tree.
    [4] is devoted to the content-based image retrieval task, and the VP-tree is considered as one of the solutions to the image indexing problem in that area. Additive and multiplicative optimistic adjustment mechanisms for the search threshold parameter are proposed. Also, an N-ary tree is taken as the data structure instead of a binary tree.
    [6] proposes an n-nearest neighbor search algorithm, which is shown by experiments to scale up well with the size of the dataset and the desired number of nearest neighbors. Experiments also show that searching in the VP-tree is more efficient than in the R*-tree [1] and the M-tree [5]. The same work also proposes solutions to the update problem.

3 Problem definition

As mentioned in [4], partitioning methods are generally based on absolute coordinate values of the vector space. For example, a partition in a K-dimensional hypercube is characterized by K pairs of coordinate values, each of which specifies the covering interval in the respective dimension. This type of partitioning structure is useful for queries based on absolute coordinates (such as range queries), but not so useful for nearest-neighbor search, because the search structure in general does not maintain the distance information between the points within a partition and the partition's boundaries. As it turns out, this information is critical for pruning the search space in multi-dimensional nearest-neighbor search. The VP-tree splits the search space using relative distances between a vantage point and its children; therefore, it is easy to calculate the distance between a point and the boundaries of the partition to which it belongs.
    Moreover, the VP-tree employs bounding spheres, and therefore the precision of its space partitioning should be better than that of methods based on bounding rectangles, because with the increase of the space dimension the volume of a sphere decreases as compared to the volume of the cube in which it can be inscribed.
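This sphere-to-cube volume ratio can be checked numerically. The following is a small illustrative sketch (not part of the original prototype; the function name is ours), using the standard formula for the volume of a d-dimensional ball:

```python
import math

def ball_to_cube_ratio(dim, r=1.0):
    # Volume of a dim-dimensional ball of radius r:
    # pi^(dim/2) * r^dim / Gamma(dim/2 + 1)
    ball = math.pi ** (dim / 2) * r ** dim / math.gamma(dim / 2 + 1)
    # Volume of the cube of side 2r in which the ball is inscribed.
    cube = (2 * r) ** dim
    return ball / cube

# The ratio falls rapidly with dimension:
# dim = 2  -> ~0.785 (a circle in a square)
# dim = 10 -> ~0.0025
```

So a bounding sphere covers a vanishing fraction of its circumscribing cube as the feature space dimension grows, which is exactly why sphere-based partitions are tighter.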
    In our case, the problem of image database indexing naturally appeared during the development of a content-based image retrieval system prototype. In accordance with the previous statements, the VP-tree was chosen as the index structure for that task.
    While there are some works on this structure, the problem of the dependencies between index efficiency criteria and the structure's parameters (arity, for instance) has not been investigated yet. We implement the VP-tree (based on [10] and [4]) and mark out a number of building algorithms and VP-tree parameters. Some index efficiency criteria are considered as well. Our main goal is to define the dependencies between them and give recommendations for tuning the VP-tree structure.

4 Basics

Let (S, d) be a metric space and S_D ⊂ S a finite subset representing a database. Since any metric's range may be normalized to the interval [0, 1] without affecting the nearest-neighbor relation, we can consider only such metrics without loss of generality [10].
    Let us discuss the binary partitioning case for simplicity. Assume some data point v from S_D is chosen as the vantage point. For every other point p ∈ S_D − {v} the distance from v is computed, and the median µ of these distances is chosen. Then the whole data set S_D is divided into two parts in accordance with the vantage point and the median: S≤ is the set of points whose distances from v are less than or equal to µ, and S> is the set of points whose distances from v are greater than µ.
    Suppose we need to find the nearest neighbors of some point q whose distances from q are less than a specific threshold σ. It turns out that if d(v, q) ≤ µ − σ we need to explore only the S≤ subset, and if d(v, q) > µ + σ only the S> subset has to be explored.
    This observation is based on the triangle inequality. If d(v, q) ≤ µ − σ, then for each p ∈ S> the following holds: d(q, p) ≥ |d(v, p) − d(v, q)| > |µ − (µ − σ)| = σ, i.e. d(q, p) > σ. It means that the S> subset can be excluded from the search.
    And if d(v, q) > µ + σ, then for each p ∈ S≤ we have d(q, p) ≥ |d(v, q) − d(v, p)| > |(µ + σ) − µ| = σ. So d(q, p) > σ and we can exclude the S≤ subset. The preceding is shown in Figure 1.
    So we can effectively prune one half of the search space if the following conditions are met: the cardinality of the S> subset approximately equals the cardinality of the S≤ subset, and d(v, q) does not satisfy the two-sided inequality µ − σ < d(v, q) ≤ µ + σ. Thus, the main task is to build an approximately balanced tree with the quantity of points falling between the two concentric circles being as small as possible. [10] shows that in order to meet these requirements we need to choose vantage points from the corners of the space with the maximum second moment of their distances from the other points.

    Figure 1: Data set partitioning and nearest-neighbor search. All points from the S> subset (such as p′) are too far from the data point q′ (further than σ). And points from S≤ (such as p′′) are too far from q′′.

5 Algorithms and parameters

5.1 Building the index

We use simple algorithms based on [10] to find vantage points and build the VP-tree. In accordance with [4], an N-ary tree was chosen as the data structure. It means that we need to find N − 1 borders instead of one median to split the search space into N partitions.
    It is necessary to mention that the index structure implementation is not the main problem of our image retrieval system prototype; therefore, database tables instead of disk blocks were chosen as the index data storage for simplicity.

function BuildVP_TreeLevel(S)
   if S = ∅ return ∅
   node.vp := GetVP(S)
   node.borders := GetBorders(S, node.vp, N)
   for each sequential pair lborder, uborder from node.borders
       node.children.add( BuildVP_TreeLevel(
           {s ∈ S − {node.vp} | lborder < d(node.vp, s) ≤ uborder} ) )
   return node

function GetVP(S)
   P := random sample of S
   best_spread := 0
   for each p ∈ P
       D := random sample of S
       spread := GetSecondMoment(D, p)
       if spread > best_spread
           best_spread := spread
           vp := p
   return vp
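The GetVP routine can be sketched in runnable form roughly as follows. This is an illustrative sketch only: the Euclidean metric, the function names and the default sample size are our assumptions, standing in for the samples whose sizes the CRVP and CRSM parameters control.

```python
import math
import random

def euclidean(a, b):
    # Example metric; any metric satisfying the triangle inequality works.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def second_moment(sample, p, d):
    # Second moment (variance) of the distances from p to the sample points.
    dists = [d(p, s) for s in sample]
    mean = sum(dists) / len(dists)
    return sum((x - mean) ** 2 for x in dists) / len(dists)

def get_vp(s, d=euclidean, sample_size=20):
    # Mirror of GetVP: evaluate a random sample of candidates and keep
    # the one whose distances to another random sample spread the widest.
    best_vp, best_spread = None, -1.0
    for p in random.sample(s, min(sample_size, len(s))):
        sample = random.sample(s, min(sample_size, len(s)))
        spread = second_moment(sample, p, d)
        if spread > best_spread:
            best_vp, best_spread = p, spread
    return best_vp
```

A widely spread vantage point makes the N − 1 distance shells thick, so fewer query annuli straddle a border and more subtrees can be pruned.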
function GetBorders(S, vp, N)
   P := random sample of S
   sort P in ascending order of distance from vp
   for i from 1 to N − 1
       p_low := P[i * |P| / N]
       p_up := P[i * |P| / N + 1]
       border := (d(vp, p_low) + d(vp, p_up)) / 2
       borders.add(border)
   return borders

5.2 Parameters

It is necessary to discuss the GetBorders function in more detail. This simple algorithm works as follows: the selected sample set P is sorted in ascending order by the distance from each p ∈ P to the vantage point. The sorted set is divided into N parts. The arithmetic mean of the distances of the two boundary points is treated as the border between those parts.
    Such an implementation is too straightforward and can violate the following restriction from the previous section: the quantity of points falling within the two concentric circles must be as small as possible. The restriction will be violated if the margins between the boundary points and the borders are too small (smaller than σ). Thus, the DDR (distance delta rate) parameter is used to increase the margins. In that case the border can be set not only between the boundary points of the two sets, but between the points with the maximum distance from each other within the bounds determined by the DDR. An example is shown in Figure 2.

    Figure 2: Border detection between two sets. a) The simple border detection algorithm. b) The improved algorithm: the border was moved to the right by one point (this is allowed by the current DDR value). If it were moved to the left by three points, the margins would increase, but the tree would become too unbalanced.

    The DDR parameter is a trade-off between balancing and margins. It therefore influences both conditions of the tree building process mentioned in the previous section and must be chosen carefully.
    There are three other parameters that take part in the process: the CRVP (capacity rate for vantage point), the CRSM (capacity rate for second moment) and the CRB (capacity rate for borders). Each of them defines the ratio of the sample set size to the whole database size in, respectively, the vantage point search, the second moment calculation and the border detection. Here there is a trade-off between the precision of this detection and the duration of the tree building process. The bigger the sample sets are, the more precise the vantage points and borders we get, but the more time it takes to build the tree (more points are involved in the process, and each point needs some calculations). The smaller the sample sets are, the less time the computations take and the more inaccurate the vantage points and borders become.
    The fifth important parameter is the tree arity (AR).

5.3 Searching the tree

To perform a search on the VP-tree, we generalize the searching algorithm from [10] to the N-ary tree. The following function returns all nearest neighbors of a query point q within a threshold σ. Initially, the vp parameter is the root of the VP-tree.

function GetNN(vp, q, σ, result)
   dist := d(vp, q)
   if (dist ≤ σ) and (vp ≠ q) result.add(vp)
   if vp is a leaf return
   lborder := max{b ∈ vp.borders | b < dist − σ}
   uborder := min{b ∈ vp.borders | b ≥ dist + σ}
   for each new_vp between lborder and uborder
       GetNN(new_vp, q, σ, result)

6 Theoretical basics of experiments and hypotheses

The goal of the experiments is to define the index efficiency criteria, which are discussed below, and to determine the dependencies between these criteria and the parameters mentioned in the previous section. We plan to define such dependencies for each of the parameters (leaving the other ones unchanged during each experiment) and for some combinations of parameters.

6.1 Searching time

We consider the searching time to be the most important criterion, because the main purpose of an index is to speed up the fetching process. This value can be compared with the time of a whole database scan, since without an index we need to look through the whole database to find the nearest neighbors of a point. More formally, the following ratio can be calculated to define the criterion: t_s = t_ind / t_rs, where t_ind is the nearest-neighbor searching time with index utilization and t_rs is the row scan time.
    Moreover, this criterion can be computed in terms of distance metric d(.) evaluation operations. Obviously, to find the nearest neighbors of some point without using an index, one needs to perform as many metric evaluation operations as there are images in the database. So we can denote this type of criterion as t_d and equate it to the following: t_d = Σ_ind / Σ_rs = Σ_ind / |S_D|, where Σ_ind is the quantity of distance metric evaluation operations during a nearest-neighbor search with index utilization.
    Note that t_s and t_d are not necessarily proportional, because the second representation does not take the duration of the sorting process into account, while we have to sort points to get the final result.
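In practice, Σ_ind can be measured by wrapping the metric in a counter that is passed only to the search, so build-time evaluations are not counted. Below is a minimal self-contained sketch of this measurement; it uses a binary tree (i.e. N = 2) and illustrative names, whereas the implementation described in this paper is N-ary.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class CountingMetric:
    """Wraps a metric and counts evaluation operations (Sigma_ind)."""
    def __init__(self, d):
        self.d, self.calls = d, 0
    def __call__(self, a, b):
        self.calls += 1
        return self.d(a, b)

def build(points, d):
    # Binary VP-tree: first point as vantage point, median distance as border.
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return {"vp": vp, "mu": None}
    dists = sorted(d(vp, p) for p in rest)
    mu = dists[(len(dists) - 1) // 2]
    return {"vp": vp, "mu": mu,
            "inner": build([p for p in rest if d(vp, p) <= mu], d),
            "outer": build([p for p in rest if d(vp, p) > mu], d)}

def get_nn(node, q, sigma, d, result):
    if node is None:
        return
    dist = d(node["vp"], q)
    if dist <= sigma:
        result.append(node["vp"])
    if node["mu"] is None:
        return
    if dist - sigma <= node["mu"]:   # query ball may reach the inner shell
        get_nn(node["inner"], q, sigma, d, result)
    if dist + sigma > node["mu"]:    # ... or the outer one
        get_nn(node["outer"], q, sigma, d, result)

random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
tree = build(data, euclidean)
counter = CountingMetric(euclidean)
found = []
get_nn(tree, (0.5, 0.5), 0.1, counter, found)
t_d = counter.calls / len(data)   # Sigma_ind / |S_D| for this query
```

Dividing counter.calls by |S_D| then gives the t_d criterion for a single multiple-neighbor query.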
    Two kinds of experiments can be conducted to compute these values: single-neighbor and multiple-neighbor searches with a query point q. The first one looks for the most relevant point p, i.e. the point with the minimum d(q, p) value, while the second one has to find all nearest neighbors whose distances from q are less than a specific threshold σ. In the first case, the system with an index has to examine only one subtree of a node to find the most relevant point (the one with the lowest upper border), whereas in the second case it may need to look into other sub-trees to get more points.

6.2 Index degradation

As the database size grows, new points are added to the index, so the tree gets more and more unbalanced. This process affects the criteria mentioned above and can reduce the index efficiency completely, right up to the need for a full rebuild. We plan to conduct some experiments to measure this effect, especially in the cases of complete degradation.
    The first type of experiment implies adding arbitrary points to the database and computing the above criteria to define the rules of their change. For the second purpose, i.e. defining the worst cases, we plan to add quite similar points to force the tree to become unbalanced rapidly. All types of experiments imply adding a great amount of points in order to affect the searching criteria.

6.3 Index building process duration

Based on section 5.1, the tree building process duration can be quite an important criterion in some cases. It depends on four parameters mentioned in section 5.2: CRVP, CRSM, CRB and AR. From the algorithms of section 5.1 we can derive the following formula for the mean tree building time: t_mean = k · Σ_vp · (CRVP · CRSM · |S|² + CRB · |S|), where k is some coefficient and Σ_vp is the quantity of vantage points (which strongly depends on the AR parameter). This formula arises from the following reasoning: we need k · (CRVP · |S|) · (CRSM · |S|) time to process a vantage point sample set and to calculate a second moment for each of its points, based on another sample set, while searching for a vantage point. An additional k · (CRB · |S|) amount of time is needed to find the point's borders. We plan to check this formula during the experiments.

6.4 Image feature vector space dimension

There are different image features we intend to use in our system, so the dependencies between the dimension of the image feature vectors and the index efficiency criteria mentioned above are very important to us, and we plan to conduct some experiments to define them.
    As we said before, the VP-tree is based on bounding spheres. The main advantage of a sphere is that its volume is less than the volume of the cube in which the sphere is inscribed, and this ratio decreases with the increase of the search space dimension.

7 Conclusion

We have implemented the VP-tree index structure as an alternative solution to the nearest-neighbor search problem in the context of the content-based image retrieval task. Different criteria were proposed to estimate index efficiency. We plan to conduct a number of experiments to compute those criteria and to determine the dependencies between them and the algorithms' parameters stated in this work.

References

[1] N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. In ACM SIGMOD Record, volume 19(2), pages 322–331, 1990.
[2] J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. In Communications of the ACM, volume 18(9), pages 509–517, 1975.
[3] T. Bozkaya and M. Ozsoyoglu. Distance-based Indexing for High-Dimensional Metric Spaces. In ACM SIGMOD Record, volume 26, pages 357–368, 1997.
[4] T. C. Chiueh. Content-based Image Indexing. In Proceedings of the 20th VLDB Conference, pages 582–593, 1994.
[5] P. Ciaccia, M. Patella, P. Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In Proceedings of the 23rd International Conference on VLDB, Athens, Greece, pages 426–435, 1997.
[6] A. W.-C. Fu, P. M.-S. Chan, Y.-L. Cheung, Y. S. Moon. Dynamic VP-tree Indexing for n-Nearest Neighbor Search Given Pair-wise Distances. VLDB Journal, volume 9(2), pages 154–173, 2000.
[7] N. Katayama, S. Satoh. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, 1997.
[8] J. T. Robinson. The K-D-B-tree: A Search Structure for Large Multidimensional Dynamic Indexes. In Proceedings of the ACM SIGMOD, Ann Arbor, USA, pages 10–18, 1981.
[9] D. A. White, R. Jain. Similarity Indexing with the SS-tree. In Proceedings of the 12th International Conference on Data Engineering, New Orleans, USA, pages 516–523, 1996.
[10] P. Yianilos. Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. In Proceedings of the 3rd Annual ACM-SIAM Symposium on Discrete Algorithms, Orlando, pages 311–322, 1992.