=Paper=
{{Paper
|id=Vol-256/paper-5
|storemode=property
|title=VP-tree: Content-Based Image Indexing
|pdfUrl=https://ceur-ws.org/Vol-256/submission_12.pdf
|volume=Vol-256
|dblpUrl=https://dblp.org/rec/conf/syrcodis/Markov07
}}
==VP-tree: Content-Based Image Indexing==
© Il'ya Markov ♣
Saint-Petersburg State University
ilya.markov@gmail.com
♣ I would like to thank my scientific adviser Boris Novikov and team leader Natalia Vassilieva for their help and support. This work was partially supported by RFBR (grant 07-07-00268a).

Proceedings of the Spring Young Researcher's Colloquium On Database and Information Systems SYRCoDIS, Moscow, Russia, 2007

Abstract

Recently, many indexing methods have been developed for multidimensional spaces, and for image feature vector spaces in particular. This paper introduces an implementation of the vantage point tree (VP-tree) method. The goal is to define dependencies between the method's parameters and index efficiency criteria in the context of the content-based image retrieval task.

1 Introduction

An image is just a two-dimensional array of pixels, and there are various ways of building high-level abstractions from it, for example, color distributions based on histograms. This process is usually referred to as feature extraction. Each scalar piece of such a high-level abstraction is called a feature. A set of features that comprises a coherent high-level interpretation forms a feature class. Example feature classes are color, texture and shape [4].

Images in the database go through the feature extraction process, so every image is represented as a K-dimensional feature vector, where K is the number of features used to represent an image. The basic assumption is that two image feature vectors are similar according to a distance metric if and only if the corresponding images are similar too. So the image retrieval problem may be treated as a multidimensional nearest-neighbor search problem.

2 Related works

Recently, many indexing methods have been developed for multidimensional spaces, and for image feature vector spaces in particular. The R*-tree [1], k-d-tree [2], K-D-B-tree [8], SR-tree [7], SS-tree [9], VP-tree [4, 10], M-tree [5] and MVP-tree [3] are among them.

The VP-tree was first introduced in [10], where basic index building and search algorithms and their improvements were proposed. The improved algorithms store additional information about partition boundaries and the distances between each node and its ancestors to speed up the search. They also form buckets by collapsing subtrees near the leaf level into a flat structure in order to save space. The characteristics of the VP-tree are compared with those of the k-d-tree, with the image retrieval task taken as an example. The results show that even a simple implementation of the VP-tree is no worse than the k-d-tree.

[4] is devoted to the content-based image retrieval task, and the VP-tree is considered one of the solutions to the image indexing problem in that area. Additive and multiplicative optimistic adjustment mechanisms for the search threshold parameter are proposed. Also, an N-ary tree is taken as the data structure instead of a binary tree.

[6] proposes an n-nearest-neighbor search algorithm, which is shown by experiments to scale well with the size of the dataset and the desired number of nearest neighbors. Experiments also show that searching in the VP-tree is more efficient than in the R*-tree [1] and the M-tree [5]. In the same work, solutions for the update problem are proposed.

3 Problem definition

As mentioned in [4], partitioning methods are generally based on absolute coordinate values of the vector space. For example, a partition of a K-dimensional hypercube is characterized by K pairs of coordinate values, each of which specifies the covering interval in the respective dimension. This type of partitioning structure is useful for queries based on absolute coordinates (such as range queries), but not so useful for nearest-neighbor search, because the search structure in general does not maintain distance information between the points within a partition and the partition's boundaries. As it turns out, this information is critical for pruning the search space in multidimensional nearest-neighbor search. The VP-tree splits the search space using relative distances between a vantage point and its children; therefore, it is easy to calculate the distance between a point and the boundaries of the partition to which it belongs.

Moreover, the VP-tree employs bounding spheres, so the precision of its space partitioning should be better than that of structures based on bounding rectangles, because as the space dimension increases, the volume of a sphere decreases relative to the volume of the cube in which it is inscribed.

In our case the problem of image database indexing naturally appeared during the development of a content-based image retrieval system prototype. In accordance with the previous statements, the VP-tree was chosen as the index structure for that task.

While there are some works on this structure, the problem of the dependencies between index efficiency criteria and the structure's parameters (arity, for instance) has not been investigated yet. We implement the VP-tree (based on [10] and [4]) and mark out a number of building algorithms and VP-tree parameters. Some index efficiency criteria are also considered. Our main goal is to define the dependencies between them and give recommendations for tuning the VP-tree structure.
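The claim above about bounding spheres can be checked numerically: the volume of the unit n-ball is π^(n/2) / Γ(n/2 + 1), and its ratio to the volume of the side-2 cube it is inscribed in shrinks quickly with the dimension. The following Python snippet is our own illustration of this point, not part of the paper's implementation:

```python
from math import gamma, pi

def ball_to_cube_ratio(n: int) -> float:
    """Ratio of the unit n-ball volume, pi**(n/2) / Gamma(n/2 + 1),
    to the volume 2**n of the side-2 cube it is inscribed in."""
    return pi ** (n / 2) / gamma(n / 2 + 1) / 2 ** n

for n in (2, 3, 5, 10):
    print(n, round(ball_to_cube_ratio(n), 5))
```

The ratio is π/4 ≈ 0.785 in two dimensions, about 0.52 in three, and falls below 0.003 by ten dimensions, which is why a sphere bounds the data far more tightly than the corresponding rectangle in high-dimensional feature spaces.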
4 Basics

We are given a metric space (S, d) and a finite subset S_D ⊂ S representing a database. Since any metric's range may be normalized to the interval [0, 1] without affecting the nearest-neighbor relation, we can consider only such metrics without loss of generality [10].

Let us discuss the binary partitioning case for simplicity. Assume some data point v from S_D is chosen as the vantage point. For every other point p ∈ S_D − {v} the distance from v is computed, and the median µ of these distances is chosen. Then the whole data set S_D is divided into two parts in accordance with the vantage point and the median: S_≤ is the set of points whose distances from v are less than or equal to µ, and S_> is the set of points whose distances from v are greater than µ.

Suppose we need to find the nearest neighbors of some point q whose distances from q are less than a specific threshold σ. It turns out that if d(v, q) ≤ µ − σ we need to explore only the S_≤ subset, and if d(v, q) > µ + σ only the S_> subset has to be explored. This observation is based on the triangle inequality: if d(v, q) ≤ µ − σ then for each p ∈ S_> the following is true: d(q, p) ≥ |d(v, p) − d(v, q)| > |µ − (µ − σ)| = σ, i.e. d(q, p) > σ. It means that the subset can be excluded from the search. And if d(v, q) > µ + σ then for each p ∈ S_≤ we have d(q, p) ≥ |d(v, q) − d(v, p)| > |(µ + σ) − µ| = σ. So d(q, p) > σ and we can exclude the S_≤ subset. The preceding is shown in Figure 1.

So we can effectively prune one half of the search space if the following conditions are met: the cardinality of the S_> subset approximately equals the cardinality of the S_≤ subset, and d(v, q) does not satisfy the two-sided inequality µ − σ < d(v, q) ≤ µ + σ. Thus, the main task is to build an approximately balanced tree with the quantity of points falling within the two concentric circles being as small as possible. [10] shows that in order to meet these requirements we need to choose vantage points from the corners of the space with the maximum second moment of their distances from the other points.

Figure 1: Data set partitioning and nearest-neighbor search. All points from the S_> subset (such as p′) are too far from the data point q′ (further than σ). And points from S_≤ (such as p″) are too far from q″.

5 Algorithms and parameters

5.1 Building the index

We use simple algorithms based on [10] to find vantage points and build the VP-tree. In accordance with [4], an N-ary tree was chosen as the data structure. This means that we need to find N − 1 borders instead of one median to split the search space into N partitions.

It is necessary to mention that the index structure implementation is not the main problem of our image retrieval system prototype; therefore, database tables instead of disk blocks were chosen as the index data storage for simplicity.

function BuildVP_TreeLevel(S)
  if S = ∅ return ∅
  node.vp := GetVP(S)
  node.borders := GetBorders(S, node.vp, N)
  for each sequential pair lborder, uborder from node.borders
    node.children.add( BuildVP_TreeLevel(
      {s ∈ S − {node.vp} | lborder < d(node.vp, s) ≤ uborder} ) )
  return node

function GetVP(S)
  P := random sample of S
  best_spread := 0
  for each p ∈ P
    D := random sample of S
    spread := GetSecondMoment(D, p)
    if spread > best_spread
      best_spread := spread
      vp := p
  return vp
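The binary (N = 2) variant of BuildVP_TreeLevel and GetVP can be sketched in runnable form. The following Python is our own minimal sketch, not the paper's implementation: it assumes a Euclidean metric, picks the vantage point by the sampled second moment as in GetVP, splits at the median distance (so there is a single border and no DDR correction), and prunes subtrees with the triangle-inequality rule from section 4:

```python
import random

class Node:
    def __init__(self, vp, mu, inside=None, outside=None):
        self.vp = vp            # vantage point
        self.mu = mu            # median distance, the single border for N = 2
        self.inside = inside    # subtree with d(vp, p) <= mu
        self.outside = outside  # subtree with d(vp, p) > mu

def build(points, dist, sample_size=10):
    if not points:
        return None
    # GetVP: among sampled candidates, take the one whose sampled
    # distances to other points have the largest second moment
    candidates = random.sample(points, min(sample_size, len(points)))
    def spread(c):
        sample = random.sample(points, min(sample_size, len(points)))
        ds = [dist(c, p) for p in sample]
        mean = sum(ds) / len(ds)
        return sum((x - mean) ** 2 for x in ds) / len(ds)
    vp = max(candidates, key=spread)
    rest = [p for p in points if p is not vp]
    if not rest:
        return Node(vp, 0.0)
    ds = [dist(vp, p) for p in rest]
    mu = sorted(ds)[len(ds) // 2]
    inside = [p for p, d in zip(rest, ds) if d <= mu]
    outside = [p for p, d in zip(rest, ds) if d > mu]
    return Node(vp, mu,
                build(inside, dist, sample_size),
                build(outside, dist, sample_size))

def search(node, q, sigma, dist, result):
    """Collect every point p with dist(q, p) <= sigma."""
    if node is None:
        return
    d = dist(node.vp, q)
    if d <= sigma:
        result.append(node.vp)
    if d <= node.mu + sigma:   # otherwise every inside point is too far
        search(node.inside, q, sigma, dist, result)
    if d > node.mu - sigma:    # otherwise every outside point is too far
        search(node.outside, q, sigma, dist, result)
```

Because the pruning rule only discards points that the triangle inequality proves are beyond the threshold, the result matches a linear scan exactly; only the number of distance evaluations differs.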
function GetBorders(S, vp, N)
  P := random sample of S
  sort P in ascending order of distance from vp
  for i from 1 to N − 1
    p_low := P[i * |P| / N]
    p_up := P[i * |P| / N + 1]
    border := (d(vp, p_low) + d(vp, p_up)) / 2
    borders.add(border)
  return borders

5.2 Parameters

It is necessary to discuss the GetBorders function in more detail. This simple algorithm works as follows: the selected sample set is sorted in ascending order of the distance from each p ∈ P to the vantage point. The sorted set is divided into N parts. The arithmetic mean of the distances of the two boundary points is treated as the border between those parts.

Such an implementation is too straightforward and can violate the following restriction from the previous section: the quantity of points falling within the two concentric circles must be as small as possible. The restriction will be violated if the margins between the boundary points and the borders are too small (smaller than σ). Thus, the DDR (distance delta rate) parameter is used to increase the margins. In that case the border can be set not only between the boundary points of the two sets, but between the points with the maximum distance from each other within the bounds determined by the DDR. An example is shown in Figure 2.

Figure 2: Border detection between two sets. a) The simple border detection algorithm. b) The improved algorithm: the border was moved to the right by one point (which is allowed by the current DDR value). If we moved it to the left by three points, the margins would increase, but the tree would be too unbalanced.

The DDR parameter is a trade-off between balancing and margins. So it influences both conditions of the tree building process mentioned in the previous section and must be chosen carefully.

There are three other parameters that take part in the process: the CRVP (capacity rate for vantage point), the CRSM (capacity rate for second moment) and the CRB (capacity rate for borders). Each of them defines the ratio of the sample set size to the whole database size, for the vantage point search, the second moment calculation and the border detection, respectively. Here there is a trade-off between the precision of that detection and the duration of the tree building process. The bigger the sample sets are, the more precise vantage points and borders we get, but the more time it takes to build the tree (more points are involved in the process, while each point needs some calculations). The smaller the sample sets are, the less time the computations take and the more inaccurate the vantage points and borders are.

The fifth important parameter is the tree arity (AR).

5.3 Searching the tree

To perform a search on the VP-tree we generalize the searching algorithm from [10] to the N-ary tree. The following function returns all nearest neighbors of a query point q within a threshold σ. The vp parameter is the root of the VP-tree at first.

function GetNN(vp, q, σ, result)
  dist := d(vp, q)
  if (dist ≤ σ) and (vp ≠ q) result.add(vp)
  if vp is a leaf return
  lborder := max{b ∈ vp.borders | b < dist − σ}
  uborder := min{b ∈ vp.borders | b ≥ dist + σ}
  for each new_vp between lborder and uborder
    GetNN(new_vp, q, σ, result)

6 Theoretical basics of experiments and hypotheses

The goal of the experiments is to define the index efficiency criteria, which are discussed below, and to determine the dependencies between these criteria and the parameters mentioned in the previous section. We plan to define such dependencies for each of the parameters (leaving the other ones unchanged during each experiment) and for some combinations of parameters.

6.1 Searching time

We consider the searching time to be the most important criterion, because the main purpose of an index is to speed up the fetching process. This value can be compared with the time of a whole database scan, since without any index we need to look through the whole database to find the nearest neighbors of some point. More formally, the following ratio can be calculated to define the criterion: t_s = t_ind / t_rs, where t_ind is the nearest-neighbor searching time with index utilization and t_rs is the row scan time.

Moreover, this criterion can be computed in terms of distance metric d(.) evaluation operations. Obviously, to find the nearest neighbors of some point without using an index, one needs to perform as many metric evaluation operations as there are images in the database. So we can denote this type of criterion as t_d and equate it to the following: t_d = Σ_ind / Σ_rs = Σ_ind / |S_D|, where Σ_ind is the quantity of distance metric evaluation operations during a nearest-neighbor search with index utilization.

t_s and t_d are not necessarily proportional, because the second representation does not take the duration of the sorting process into account, while we have to sort the points to get the final result.
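The Σ_ind / |S_D| criterion is straightforward to measure by wrapping the metric in a call counter. The sketch below is our own illustration, not part of the paper's prototype; it uses a single-level vantage-point partition (one vantage point, one median border), a Euclidean metric, and hypothetical names CountingDist and range_query:

```python
import random

class CountingDist:
    """Wraps a metric and counts its evaluations, so that
    t_d = counted calls / |S_D| can be computed afterwards."""
    def __init__(self, metric):
        self.metric = metric
        self.calls = 0
    def __call__(self, a, b):
        self.calls += 1
        return self.metric(a, b)

def euclid(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

random.seed(1)
db = [(random.random(), random.random()) for _ in range(1000)]

# A single-level vantage-point partition (the N = 2 case).
vp = db[0]
sorted_dists = sorted(euclid(vp, p) for p in db[1:])
mu = sorted_dists[len(sorted_dists) // 2]
inside = [p for p in db[1:] if euclid(vp, p) <= mu]
outside = [p for p in db[1:] if euclid(vp, p) > mu]

def range_query(q, sigma, d):
    out = []
    dq = d(vp, q)
    if dq <= sigma:
        out.append(vp)
    if dq <= mu + sigma:          # the inside ball may hold matches
        out += [p for p in inside if d(p, q) <= sigma]
    if dq > mu - sigma:           # the outside shell may hold matches
        out += [p for p in outside if d(p, q) <= sigma]
    return out

d = CountingDist(euclid)
result = range_query((0.15, 0.8), 0.05, d)
t_d = d.calls / len(db)           # < 1 whenever one partition is pruned
```

With a full tree the pruning compounds level by level, so Σ_ind drops further, while t_s additionally reflects the sorting of the result set.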
Two kinds of experiments can be conducted to compute these values: single-neighbor and multiple-neighbor searches with a query point q. The first one looks for the most relevant point p, i.e. the point with the minimum d(q, p) value, while the second one has to find the nearest neighbors whose distances from q are less than a specific threshold σ. In the first case, a system with an index has to examine only one subset of the node containing q (the one with the lowest upper border) to find the most relevant point, whereas in the second case it may need to look into other subtrees to get more points.

6.2 Index degradation

As the database size grows, new points are added to the index, so the tree gets more and more unbalanced. This process affects the criteria mentioned above and can reduce the index efficiency completely, right up to the need for a full rebuild. We plan to hold some experiments to measure this effect, and especially the cases with complete degradation.

The first type of experiment implies adding arbitrary points to the database and computing the above criteria to define the rules of their change. For the second purpose, i.e. defining the worst cases, we plan to add quite similar points to force the tree to become unbalanced rapidly. All types of experiments imply adding a great number of points in order to affect the searching criteria.

6.3 Index building process duration

Based on section 5.1, the tree building process duration can be quite an important criterion in some cases. It depends on four parameters mentioned in section 5.2: CRVP, CRSM, CRB and AR. From the algorithms of that section we can derive the following formula for the mean tree building time: t_mean = k · Σ_vp · (CRVP · CRSM · |S|² + CRB · |S|), where k is some coefficient and Σ_vp is the quantity of vantage points (which strongly depends on the AR parameter). This formula arises from the following reasoning: we need k · (CRVP · |S|) · (CRSM · |S|) time to process a vantage point sample set and to calculate a second moment for each of its points, based on another sample set, while searching for a vantage point. An additional k · (CRB · |S|) amount of time is needed to find a point's borders. We plan to check this formula during the experiments.

6.4 Image feature vector space dimension

There are different image features we intend to use in our system, so the dependencies between the dimension of the image feature vectors and the index efficiency criteria mentioned above are very important to us, and we plan to hold some experiments to define them. As we said before, the VP-tree is based on bounding spheres, and the main advantage of a sphere is that its volume is less than the volume of the cube in which it is inscribed, a ratio that decreases as the search space dimension increases.

7 Conclusion

We have implemented the VP-tree index structure as an alternative solution to the nearest-neighbor search problem in the context of the content-based image retrieval task. Different criteria were proposed to estimate index efficiency. We plan to hold a number of experiments to define those criteria and to determine the dependencies between them and the algorithms' parameters stated in this work.

References

[1] N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. In ACM SIGMOD Record, volume 19(2), pages 322–331, 1990.

[2] J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. In Communications of the ACM, volume 18(9), pages 509–517, 1975.

[3] T. Bozkaya and M. Ozsoyoglu. Distance-based Indexing for High-Dimensional Metric Spaces. In ACM SIGMOD Record, volume 26, pages 357–368, 1997.

[4] T. C. Chiueh. Content-based Image Indexing. In Proceedings of the 20th VLDB Conference, pages 582–593, 1994.

[5] P. Ciaccia, M. Patella, P. Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In Proceedings of the 23rd International Conference on VLDB, Athens, Greece, pages 426–435, 1997.

[6] A. W.-C. Fu, P. M.-S. Chan, Y.-L. Cheung, Y. S. Moon. Dynamic VP-tree Indexing for N-nearest Neighbor Search Given Pair-wise Distances. VLDB Journal, volume 9(2), pages 154–173, 2000.

[7] N. Katayama, S. Satoh. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona.

[8] J. T. Robinson. The K-D-B-tree: A Search Structure for Large Multidimensional Dynamic Indexes. In Proceedings of the ACM SIGMOD, Ann Arbor, USA, pages 10–18, 1981.

[9] D. A. White, R. Jain. Similarity Indexing with the SS-tree. In Proceedings of the 12th International Conference on Data Engineering, New Orleans, USA, pages 516–523, 1996.

[10] P. Yianilos. Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. In Proceedings of the 3rd Annual ACM-SIAM Symposium on Discrete Algorithms, Orlando, pages 311–322, 1992.