<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <aff>ISTI-CNR, Pisa, Italy</aff>
          <email>andrea.esuli@isti.cnr.it</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We present the Permutation Prefix Index (PP-Index), an index data structure for efficient approximate similarity search. The PP-Index belongs to the family of permutation-based indexes, which represent any indexed object with "its view of the surrounding world", i.e., a list of the elements of a set of reference objects sorted by their distance order with respect to the indexed object. In its basic formulation, the PP-Index is strongly biased toward efficiency, treating effectiveness as a secondary aspect. We show how effectiveness can easily reach optimal levels just by adopting two "boosting" strategies: multiple index search and multiple query search. Such strategies have nice parallelization properties that allow the search process to be distributed in order to keep efficiency high. We study both the efficiency and the effectiveness properties of the PP-Index. We report experiments on collections of up to one hundred million images, represented in a very high-dimensional similarity space based on the combination of five MPEG-7 visual descriptors.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>E.1 [Data]: Data Structures; H.3.3 [Information
Systems]: Information Storage and Retrieval: Search process.
Keywords: approximate similarity search, metric space, scalability.</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>
        The similarity search model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a search model in
which, given a query q and a collection of objects D, all
belonging to a domain O, the objects in D have to be sorted
by their similarity to the query, according to a given distance
function d : O × O → R+ (i.e., the closer two objects are,
the more similar they are considered). Typically only the
top-k ranked objects are returned (k-NN query), or those
within a maximum distance value r (range query).</p>
      <p>LSDS-IR Workshop. July 2009. Boston, USA.
Copyright © 2009 for the individual papers by the papers’ authors.
Copying permitted for private and academic purposes. Re-publication of material
from this volume requires permission by the copyright owners. This volume
is published by its editors.</p>
      <p>One of the main research topics on similarity search is the
study of the scalability of similarity search methods when
applied to high-dimensional similarity spaces.</p>
      <p>
        The well-known "curse of dimensionality" [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is one of the
hardest obstacles that researchers have to deal with when
working on this topic. Over the years, this obstacle has
been attacked by many proposals, using many different
approaches. The earliest and most direct approach consisted
in trying to improve the data structures used to perform
exact similarity search. Research then moved toward the
exploration of approximate similarity search methods, mainly
proposing variants of exact methods in which some of the
constraints that guarantee the exactness of the results are
relaxed, trading effectiveness for efficiency.
      </p>
      <p>
        Approximate methods [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] that are not derived from
exact methods have also been proposed. In this field, the
recent research on permutation-based indexes (PBIs) [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ] has
shown a promising direction toward scalable data structures
for similarity search.
      </p>
      <p>In this work we present the Permutation Prefix Index
(PP-Index), an approximate similarity search structure belonging
to the family of the permutation-based indexes. We describe
the PP-Index data structures and algorithms, and test it on
data sets containing up to 100 million objects, distributed
in a very high-dimensional similarity space. Experiments
show that the PP-Index is a very efficient and scalable data
structure both at index time and at search time, and it also
obtains very good effectiveness values. The PP-Index also
has nice parallelization properties that allow both the index
and the search process to be distributed in order to
further improve efficiency.</p>
    </sec>
    <sec id="sec-3">
      <title>2. RELATED WORKS</title>
      <p>
        The PP-Index belongs to the family of the
permutation-based indexes, a recent family of data structures for
approximate similarity search, which has been independently
introduced by Amato and Savino [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Chavez et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The PP-Index has, however, a key difference with respect to
previously presented PBIs: as we detail in this section, such
PBIs use permutations in order to estimate the real distance
order of the indexed objects with respect to a query. The
PP-Index instead uses permutation prefixes in order to
quickly retrieve a reasonably-sized set of candidate objects,
which are likely to be close to the query object,
then leaving to the original distance function the selection
of the best elements among the candidates.</p>
      <p>
        For a more detailed review of the most relevant methods
for similarity search in metric spaces we point the reader to
the book by Zezula et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The recent work of Patella
and Ciaccia [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] more specifically analyzes and classifies the
characteristics of many approximate search methods.
      </p>
      <p>
        Chavez et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] present an approximate similarity search
method based on the intuition of "predicting the closeness
between elements according to how they order their distances
towards a distinguished set of anchor objects".
      </p>
      <p>A set of reference objects R = {r_0, …, r_{|R|−1}} ⊂ O is
defined by randomly selecting |R| objects from D. Every
object o_i ∈ D is then represented by Π_{o_i}, consisting of the
list of identifiers of the reference objects, sorted by their
distance with respect to the object o_i. More formally, Π_{o_i} is
a permutation of ⟨0, …, |R|−1⟩ so that, for 0 &lt; j &lt; |R|,
it holds either (i) d(o_i, r_{Π_{o_i}(j−1)}) &lt; d(o_i, r_{Π_{o_i}(j)}), or (ii)
d(o_i, r_{Π_{o_i}(j−1)}) = d(o_i, r_{Π_{o_i}(j)}) and Π_{o_i}(j−1) &lt; Π_{o_i}(j),
where Π_{o_i}(j) returns the j-th value of Π_{o_i}.</p>
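The permutation representation defined above can be sketched as follows (a minimal Python sketch with a toy 1-D distance; the paper's implementation is in C#, and all names here are illustrative):

```python
def permutation(obj, refs, dist):
    """Return the permutation of reference-object identifiers for `obj`,
    sorted by increasing distance, with ties broken by identifier,
    as in conditions (i) and (ii) above."""
    return [i for i, _ in sorted(enumerate(refs),
                                 key=lambda p: (dist(obj, p[1]), p[0]))]

# Toy 1-D example: four reference objects, absolute-difference distance.
refs = [0.0, 5.0, 2.0, 9.0]
dist = lambda a, b: abs(a - b)
print(permutation(1.0, refs, dist))  # [0, 2, 1, 3]: r_0 and r_2 tie at
                                     # distance 1.0, so r_0 comes first
```

The tie-breaking tuple key `(distance, identifier)` directly mirrors condition (ii) of the definition.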
      <p>
        All the permutations for the indexed objects are stored in
main memory. Given a query q, all the indexed permutations
are sorted by their similarity with the permutation of q, using a similarity
measure defined on permutations. The real distance d between
the query and the objects in the data set is then computed
by selecting the objects from the data set following the order
of similarity of their permutations, until the requested
number of objects is retrieved. An example of similarity measure
on permutations is the Spearman Footrule Distance [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]:
SFD(o_x, o_y) = Σ_{r∈R} |P(Π_{o_x}, r) − P(Π_{o_y}, r)|  (1)
where P(Π_{o_x}, r) returns the position of the reference object
r in the permutation assigned to o_x.
      </p>
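The Spearman Footrule Distance of Equation 1 can be sketched as follows (an illustrative Python sketch, not the authors' code):

```python
def sfd(perm_x, perm_y):
    """Spearman Footrule Distance between two full-length permutations:
    sum over all reference ids of the absolute difference of their
    positions in the two permutations."""
    pos_x = {r: i for i, r in enumerate(perm_x)}  # P(Pi_ox, r)
    pos_y = {r: i for i, r in enumerate(perm_y)}  # P(Pi_oy, r)
    return sum(abs(pos_x[r] - pos_y[r]) for r in pos_x)

print(sfd([0, 1, 2, 3], [0, 1, 2, 3]))  # 0: identical views of the world
print(sfd([0, 1, 2, 3], [3, 2, 1, 0]))  # 8: maximally different views
```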
      <p>Chavez et al. do not discuss the applicability of their
method to very large data sets, i.e., when the permutations
cannot be all kept in main memory.</p>
      <p>
        Amato and Savino [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], independently of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], propose an
approximate similarity search method based on the intuition
of representing the objects in the search space with "their
view of the surrounding world".
      </p>
      <p>
        For each object o_i ∈ D, they compute the permutation
Π_{o_i} in the same manner as [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. All the permutations are
used to build a set of inverted lists, one for each reference
object. The inverted list for a reference object r_i stores the
position of that reference object in each of the indexed
permutations. The inverted lists are used to rank the indexed
objects by their SFD value (Equation 1) with respect to a
query object q, similarly to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In fact, if full-length
permutations are used to represent the indexed objects and the
query, the search process is perfectly equivalent to the one
of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the authors propose two optimizations that
improve the efficiency of the search process without affecting
the accuracy of the produced ranking. Both optimizations
are based on the intuition that the information about the
order of the closest reference objects is more relevant than
the information about the distant ones.
      </p>
      <p>One optimization consists in inserting into the inverted
lists only the information related to Π^{k_i}_{o_i}, i.e., the part of
Π_{o_i} including only the first k_i elements of the permutation,
thus reducing by a factor |R|/k_i the size of the index. For
example, given |R| = 500, a value of k_i = 100 reduces by
five times the number of disk accesses with respect to using
full-length permutations, with a negligible loss in accuracy.</p>
      <p>Similarly, a value k_s is adopted for the query, in order to
select only the first k_s elements of Π_q. Given |R| = 500, a
value of k_s = 50 reduces by ten times the number of disk
accesses, also slightly improving the accuracy.</p>
    </sec>
    <sec id="sec-4">
      <title>3. THE PP-Index</title>
      <p>The PP-Index represents each indexed object with a very
short permutation prefix.</p>
      <p>The PP-Index data structures consist of a prefix tree, kept
in main memory, indexing the permutation prefixes, and a
data storage, kept on disk, from which objects are retrieved
by sequential disk accesses.</p>
      <p>
        This configuration of data structures is interestingly
similar to the one used by Bawa et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; however, it is relevant
to note that our work and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] are based on completely
different approaches to the problem. The latter proposes the
LSH-Forest, an improvement to the LSH-Index [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] that is
based on using hash keys of variable length. These are used
to identify a set of candidate objects with hash keys that
have a prefix match with the hash key of the query. Thus
the LSH-Forest, like the other LSH-based methods, is based
only on probabilistic considerations, while the PP-Index, like
the other PBIs, relies on geometrical considerations.
      </p>
      <p>More generally, a key difference between the PBI model
and the LSH model is that the hash functions of the LSH
model are solely derived from the similarity measures in use,
independently of the way the indexed objects are distributed
in the similarity space, while in the PBI model the reference
objects provide information about this aspect.</p>
      <p>The PP-Index is designed to allow very efficient indexing
by performing bulk processing of all the indexed objects.
This bulk processing model is based on the intuitive
assumption that the data, in the very large collections the PP-Index
is designed for, have a relatively static nature. However, it is
easy to provide the PP-Index with update capabilities (see
Section 3.6).</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Data structures</title>
      <p>Given a collection of objects D to be indexed, and the
similarity measure d, a PP-Index is built by specifying a set
of reference objects R, and a permutation prefix length l.</p>
      <p>Any object o_i ∈ D is represented by a permutation prefix
w_{o_i} consisting of the first l elements of the permutation Π_{o_i},
i.e., w_{o_i} = Π^l_{o_i}. Any object o_i ∈ D is also associated with
a data block. A data block b_{o_i} contains (i) the information
required to univocally identify the object o_i and (ii) the
essential data used by the function d in order to compute the
similarity between the object o_i and any other object in O.</p>
      <p>The prefix tree of the PP-Index is built on all the
permutation prefixes generated for the indexed objects. The leaf
at the end of a path relative to a permutation prefix w keeps
the information required to retrieve, from the data storage, the
data blocks relative to the objects represented by w.</p>
      <p>The data storage consists of a file in which all the data
blocks are sequentially stored. The order of objects
(represented by data blocks) in the data storage is the same as the
one produced by performing an ordered visit of the prefix
tree. This is a key property of the PP-Index, which allows
the prefix tree to be used to efficiently access the data storage.</p>
      <p>Specifically, the leaf of the prefix tree relative to a
permutation prefix w contains two counter values, h_w^start and h_w^end,
and two pointer values, p_w^start and p_w^end. The counter values
indicate the ordinal positions, in the sequence of data blocks
in the data storage, of the first and the last data blocks
relative to the permutation prefix w. The pointer values indicate
instead the byte offsets in the data storage where the data
blocks are effectively serialized. In case the data blocks have
a fixed size s, the p values can be omitted and computed
when necessary as p_w = h_w · s.</p>
      <p>Figure 1: the BuildIndex function.
BuildIndex(D, d, R, l)
 1  prefixTree ← EmptyPrefixTree()
 2  dataStorage ← EmptyDataStorage()
 3  for i ← 0 to |D| − 1
 4    do o_i ← GetObject(D, i)
 5       dataBlock_{o_i} ← GetDataBlock(o_i)
 6       p_{o_i} ← Append(dataBlock_{o_i}, dataStorage)
 7       w_{o_i} ← ComputePrefix(o_i, R, d, l)
 8       h_{o_i} ← i
 9       Insert(w_{o_i}, h_{o_i}, p_{o_i}, prefixTree)
10  L ← ListPointersByOrderedVisit(prefixTree)
11  P ← CreateInvertedList(L)
12  ReorderStorage(dataStorage, P)
13  CorrectLeafValues(prefixTree, dataStorage)
14  index ← NewIndex(d, R, l, prefixTree, dataStorage)
15  return index</p>
      <p>The data storage implementation must allow, given any
two pointers p′ and p″, the sequential retrieval of all the data
blocks included between them.</p>
    </sec>
    <sec id="sec-6">
      <title>3.2 Building the index</title>
      <p>Figure 1 shows a pseudo-code description of the indexing
function for the PP-Index.</p>
      <p>The indexing algorithm first initializes an empty prefix
tree in main memory, and an empty file on disk, to be used
as the data storage.</p>
      <p>The algorithm takes in input an object o_i ∈ D, for i from
0 to |D| − 1, and appends its data block at the current
end position p_end of the data storage file. Then the
algorithm computes, for the object o_i, the permutation prefix
w_{o_i} (ComputePrefix), and inserts w_{o_i} into the prefix tree.
The values h_{o_i} = i and p_{o_i} = p_end are stored in the leaf
of the prefix tree corresponding to the permutation prefix w_{o_i}.
When more than one value has to be stored in a leaf, a list
is created.</p>
      <p>Figure 2 shows an example list of permutation prefixes
generated for a set of objects and the data structures
resulting from the first phase of data indexing discussed above.</p>
      <p>The successive phase (ReorderStorage) consists in
reordering the data blocks in the data storage to satisfy the
order constraints described in the previous section. An
ordered visit of the prefix tree is made in order to produce
a list L of the h_{o_i} values stored in the leaves. Thus, the
h_{o_i} values in the list L are sorted in alphabetical order of
the permutation prefixes their relative objects are associated
with.</p>
      <p>Data blocks in the data storage are reordered following
the order of appearance of h_{o_i} values in list L. For example,
given a list L = ⟨0, 4, 8, 6, 1, 3, 5, 9, 2, 7⟩, the data block
relative to object o_7, identified in the list by the value h_{o_7} =
7, has to be moved to the last position in the data storage,
since h_{o_7} appears in the last position of the list L (see the
values in the leaves of the prefix tree in Figure 2).</p>
      <p>To efficiently perform the reordering, the list L is
inverted into a list P. The i-th position of the list P
indicates the new position to which the i-th element of the
data storage has to be moved. For the above example,
P = ⟨0, 4, 8, 5, 1, 6, 3, 9, 2, 7⟩, and the reordered data storage
contains o_0, o_4, o_8, o_6, o_1, o_3, o_5, o_9, o_2, o_7.</p>
      <p>Figure 3: The final index data structure.</p>
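The inversion of L into P, and the resulting reordering, can be sketched as follows (an in-memory Python sketch using the paper's example values; the real ReorderStorage works on disk):

```python
def invert(L):
    """Invert the visit-order list L into P, where P[i] is the new
    position of the i-th data block in the reordered storage."""
    P = [0] * len(L)
    for new_pos, h in enumerate(L):
        P[h] = new_pos
    return P

L = [0, 4, 8, 6, 1, 3, 5, 9, 2, 7]   # h values from the ordered visit
P = invert(L)
print(P)  # [0, 4, 8, 5, 1, 6, 3, 9, 2, 7], as in the example above

# Reordering in memory (the paper instead uses an on-disk m-way
# merge sort, which only needs sequential accesses):
blocks = [f"b{i}" for i in range(10)]
reordered = [None] * len(blocks)
for i, b in enumerate(blocks):
    reordered[P[i]] = b
print(reordered[-1])  # 'b7': object o7 ends up in the last position
```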
      <p>
        Once P is generated, the data storage is reordered
accordingly, using an m-way merge sorting method [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The
advantages of using this method are that it involves only
sequential disk accesses, and that it has a small (and
configurable) main memory space occupation.
      </p>
      <p>In order to obtain the final index structure, the values in
the leaves of the prefix tree have to be updated according
to the new data storage (CorrectLeafValues). This is
obtained by performing an ordered visit of the prefix tree,
the same performed when building the list L, synchronized
with a sequential scan of the reordered data storage. The
number of elements in the list of a leaf determines the two
h_start and h_end values that replace such list, and also the
number of data blocks to be sequentially read from the data
storage, in order to determine the p_start and p_end values.</p>
      <p>Figure 3 shows the final index data structure.</p>
    </sec>
    <sec id="sec-7">
      <title>3.3 Search function</title>
      <p>Figure 2: example index with |D| = 10, |R| = 6, l = 3, and
permutation prefixes
w_{o_0} = ⟨1, 3, 2⟩, w_{o_1} = ⟨2, 3, 0⟩,
w_{o_2} = ⟨5, 2, 3⟩, w_{o_3} = ⟨4, 1, 3⟩,
w_{o_4} = ⟨1, 3, 2⟩, w_{o_5} = ⟨4, 1, 3⟩,
w_{o_6} = ⟨1, 3, 4⟩, w_{o_7} = ⟨5, 2, 3⟩,
w_{o_8} = ⟨1, 3, 2⟩, w_{o_9} = ⟨4, 3, 5⟩.</p>
      <p>The search function is designed to use the index in order to
efficiently answer k nearest neighbor queries. The search
strategy consists in searching for a subtree of the prefix tree
that identifies a specified number of candidate objects, all
represented by permutation prefixes having a prefix match
with the permutation prefix representing the query.</p>
      <p>A k-NN query is composed of the query object q, the k
value, and the z value, indicating the minimum number of
candidate objects among which the k nearest neighbors have
to be selected.</p>
      <p>The FindCandidates function determines the smallest
subtree of the prefix tree that has a prefix match with the
permutation prefix w_q, i.e., the permutation prefix relative
to the query q, and that retrieves at least z objects. The
function returns two pointers, p_start and p_end, to the positions in
the data storage of the data blocks of the first and the last
candidate objects.</p>
      <p>The distance of each candidate object from the query is
computed, using the distance function d. A heap is used to
keep track of the k objects closest to the query.</p>
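The final refinement step over the candidates can be sketched as follows (an illustrative Python sketch: a bounded max-heap keeps the k closest objects while scanning the candidate set; names and the toy 1-D distance are not from the paper):

```python
import heapq

def knn_from_candidates(query, candidates, dist, k):
    """Select the k candidates closest to `query`, using a size-k
    heap; distances are negated so the farthest kept object is on
    top and can be evicted cheaply."""
    heap = []  # entries: (-distance, object)
    for obj in candidates:
        d = dist(query, obj)
        if len(heap) < k:
            heapq.heappush(heap, (-d, obj))
        elif -heap[0][0] > d:              # closer than current worst
            heapq.heapreplace(heap, (-d, obj))
    # Return (distance, object) pairs sorted by increasing distance.
    return sorted((-nd, obj) for nd, obj in heap)

cands = [3.0, 0.5, 9.0, 1.5, 4.0]
print(knn_from_candidates(2.0, cands, lambda a, b: abs(a - b), k=3))
# [(0.5, 1.5), (1.0, 3.0), (1.5, 0.5)]
```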
    </sec>
    <sec id="sec-8">
      <title>3.4 Prefix tree optimizations</title>
      <p>In order to reduce the main memory occupation of the
prefix tree it is possible to simplify its structure without
affecting the quality of results. These are search-specific
optimizations, and a non-optimized version of the prefix tree should
be saved for other operations (e.g., update and merge).</p>
      <p>
        A first optimization consists in pruning any path reaching
a leaf which is composed of only-children, given that this kind
of path does not add relevant information to distinguish
between different existing groups in the index. Another
optimization consists in compressing any path of the prefix tree
composed entirely of only-children into a single label [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
thus saving the memory space required to keep the chain of
nodes composing the path.
      </p>
      <p>A PP-Index-specific optimization, applicable when the z
value is hardcoded into the search function, consists in
reducing to a single leaf any subtree of the prefix tree that
points to fewer than z objects, given that no such
subtree will ever be selected by the search function.</p>
    </sec>
    <sec id="sec-9">
      <title>3.5 Improving the search effectiveness</title>
      <p>The "basic" search function described in Section 3.3 is
strongly biased toward efficiency, treating effectiveness as
a secondary aspect. The PP-Index allows one to easily tune
the effectiveness/efficiency trade-off, and effectiveness can
easily reach optimal levels just by adopting the two following
"boosting" strategies:</p>
      <p>Multiple index: t indexes are built, based on different sets
R_1 … R_t of reference objects. This is based on the
intuition that different reference object sets produce many
differently shaped partitions of the similarity space, resulting
in a more complete coverage of the area around queries.</p>
      <p>A search process using the multiple index strategy can be
parallelized by distributing the indexes over multiple
machines, or just over different processes/CPUs on the same
machine, maintaining almost the same performance of the basic
search function, with a negligible overhead for merging the t
k-NN results, as long as there are enough hardware resources
to support the number of indexes involved in the process.</p>
      <p>Multiple query: at search time, p additional permutation
prefixes are generated from the query permutation prefix w_q,
by swapping the position of some of its elements. The
geometric rationale is that a permutation prefix w′ differing
from another permutation prefix w″ by the swap of two
adjacent/near elements identifies an area V_{w′} of the
similarity space adjacent/near to V_{w″}, allowing the search
process to be extended to areas of the search space that are likely
to contain relevant objects.</p>
      <p>The heuristic we adopt in our experiments for swapping
permutation prefix elements consists in sorting all the
reference object pairs appearing in the permutation prefix by
their difference of distance with respect to the query object.
The swapped permutation prefixes are then generated by
first selecting for swap those pairs of identifiers that have
the smallest distance difference.</p>
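The swap heuristic described above can be sketched as follows (a Python sketch; restricting the candidate swaps to adjacent pairs of the prefix, and the distance values used, are assumptions of this illustration, not details given by the paper):

```python
def multi_query_prefixes(prefix, dists, p):
    """Generate p extra query prefixes by swapping element pairs,
    starting from the pairs whose distances to the query differ
    least. `dists[r]` is the query-to-reference distance for
    reference identifier r."""
    # Rank adjacent positions by the distance difference of the
    # two reference objects they hold (assumption: adjacent pairs).
    pairs = sorted(range(len(prefix) - 1),
                   key=lambda i: abs(dists[prefix[i]] - dists[prefix[i + 1]]))
    out = []
    for i in pairs[:p]:
        q = list(prefix)
        q[i], q[i + 1] = q[i + 1], q[i]    # swap the near-tied pair
        out.append(q)
    return out

prefix = [4, 1, 3]                       # w_q for some query
dists = {1: 0.42, 3: 0.45, 4: 0.40}      # hypothetical distances
print(multi_query_prefixes(prefix, dists, p=2))
# [[1, 4, 3], [4, 3, 1]]: the 4/1 pair is the closest tie, so it
# is swapped first
```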
      <p>The multiple query strategy can be parallelized by
distributing the queries over different processes/CPUs on the
machine handling the index structure.</p>
    </sec>
    <sec id="sec-10">
      <title>3.6 Update and merge, distributed indexing</title>
      <p>The PP-Index data structures allow indexes built using the
same parameters to be very efficiently merged into a single
index. The merge functionality supports three operations:</p>
      <p>Supporting update operations: it is easy to add update
capabilities to an index by maintaining a few additional
data structures. Deleted objects are managed using a vector
of their identifiers. Newly inserted or modified objects are
stored in an all-in-memory secondary PP-Index used in
conjunction with the main index structure. A periodic merge
procedure is used when the secondary index reaches a given
memory occupation limit.</p>
      <p>Indexing very large collections: the main memory
occupation of a prefix tree reaches its maximum during the
indexing process, when it has to be entirely kept in memory,
while during search, thanks to the optimization methods
described in Section 3.4, its size can be reduced by orders of
magnitude. This issue is solved by building a number of smaller
partial indexes and then merging them into the final index.</p>
      <p>Distributing the indexing process: the indexing of the
smaller indexes can be distributed over different machines,
given that the information contained in any smaller index is
completely independent of the one contained in the others.
Also the merge process can be distributed, if it is performed
in a number of steps that involve the creation of intermediate
indexes.</p>
      <p>The merge process consists in merging the prefix trees
of the source indexes into a single prefix tree, i.e., by
enumerating, in alphabetical order, all the permutation prefixes
contained in the source indexes.</p>
      <p>Such an enumeration can be easily produced by performing
a parallel ordered visit of all the prefix trees being merged.
If the prefix trees of the source indexes are saved to the
storage using a depth-first visit, the merge process requires
only a single read of the serialized prefix trees. Obviously,
the new prefix tree is directly serialized to disk during the
permutation prefix enumeration process.</p>
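The parallel ordered visit amounts to an m-way merge of already-sorted prefix enumerations, which can be sketched as follows (a Python sketch over in-memory lists; the example prefixes and h values are hypothetical, and the real process streams serialized trees from disk):

```python
import heapq

# Each source index yields its (prefix, h) pairs in alphabetical
# (lexicographic) order of the permutation prefixes; heapq.merge
# performs the parallel ordered visit in a single pass per source.
src_a = [((1, 3, 2), 0), ((4, 1, 3), 3)]
src_b = [((1, 3, 4), 6), ((5, 2, 3), 2)]
merged = list(heapq.merge(src_a, src_b, key=lambda e: e[0]))
print([p for p, _ in merged])
# prefixes come out in global lexicographic order:
# [(1, 3, 2), (1, 3, 4), (4, 1, 3), (5, 2, 3)]
```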
      <p>In the case of an update process, the identifiers of the
deleted objects are used to skip them during the
merge process.</p>
      <p>The data storages are merged during the permutation
prefix enumeration.</p>
    </sec>
    <sec id="sec-11">
      <title>4. EXPERIMENTS</title>
    </sec>
    <sec id="sec-12">
      <title>4.1 The CoPhIR data set</title>
      <p>
        The CoPhIR [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] data set has been recently developed
within the SAPIR project, and it is currently the largest
multimedia metadata collection available for research
purposes. It consists of a crawl of 106 million images from the
Flickr photo sharing website.
      </p>
      <p>
        The information relative to five MPEG-7 visual
descriptors [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] has been extracted from each image, resulting in
more than 240 gigabytes of XML description data.
      </p>
      <p>We have randomly selected 100 images from the
collection as queries and we have run experiments using the first
million (1M), ten million (10M), and 100 million (100M)
images from the data set.</p>
      <p>
        We have run experiments on a linear combination of the
five distance functions for the five descriptors, using the
weights proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The distance types used for the five descriptors are:
Scalable Color (L1), Color Structure (L1), Color Layout (sum
of L2), Edge Histogram (L1), Homogeneous Texture (L1).</p>
      <p>We have tested a basic configuration based on the use of
a single index and the search function described in Section
3.3, i.e., an efficiency-aimed configuration.</p>
      <p>We have then tested the use of multiple indexes, on
configurations using 2, 4, and 8 indexes, and also the multiple
query search strategy, using a total of 2, 4, and 8
multiple queries. We have also tested the combination of the two
strategies.</p>
      <p>We have applied the index optimization strategies
described in Section 3.4 to all the generated indexes.</p>
      <p>On the held-out data we have tested various z
values; in this paper we report the results obtained for z =
1,000, which has produced a good effectiveness/efficiency
trade-off.</p>
      <p>The experiments have been run on a desktop machine
running Windows XP Professional, equipped with an Intel
Pentium Core 2 Quad 2.4 GHz CPU, a single 1 TB Seagate
Barracuda 7,200 rpm SATA disk (with 32 MB cache), and 4
GB RAM. The PP-Index has been implemented in C#. All
the experiments have been run in a single-threaded
application, with a completely sequential execution of the multiple
index/query searches.</p>
      <p>
        We have evaluated the effectiveness of the PP-Index by
adopting a ranking-based measure and a distance-based
measure [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], recall and relative distance error (RDE), defined as:
Recall(k) = |D_q^k ∩ P_q^k| / k  (2)
RDE(k) = (1/k) Σ_{i=1..k} ( d(q, P_q^k(i)) / d(q, D_q^k(i)) − 1 )  (3)
where D_q is the list of the elements of D sorted by their
distance with respect to q, D_q^k is the list of the k closest
elements, P_q^k is the list returned by the algorithm, and,
for a list L, L(i) returns the i-th element of the list.
      </p>
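The two evaluation measures can be sketched as follows (an illustrative Python sketch on toy 1-D data; it assumes, as in the experiments, that no true neighbor lies at distance zero from the query):

```python
def recall(true_k, returned_k):
    """Recall(k): fraction of the true k nearest neighbors that the
    approximate answer retrieved (Equation 2)."""
    return len(set(true_k) & set(returned_k)) / len(true_k)

def rde(q, true_k, returned_k, dist):
    """Relative distance error, averaged over the k result
    positions (Equation 3)."""
    k = len(true_k)
    return sum(dist(q, returned_k[i]) / dist(q, true_k[i]) - 1
               for i in range(k)) / k

true_k = [1.0, 2.0, 3.0]   # exact 3-NN of the query q = 0.0
approx = [1.0, 2.0, 5.0]   # approximate answer: last neighbor is wrong
d = lambda a, b: abs(a - b)
print(recall(true_k, approx))        # 2 of 3 true neighbors found
print(rde(0.0, true_k, approx, d))   # only position 3 contributes error
```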
    </sec>
    <sec id="sec-13">
      <title>4.3 Results</title>
      <p>Table 2: indexing time, prefix tree size (full and compressed),
data storage size, and average leaf depth l′, for |R| = 100.
|D| = 1M: 419 sec indexing time, 7.7 MB / 91 kB prefix tree, 349 MB data storage, l′ = 2.1.
|D| = 10M: 4,385 sec, 53.8 MB / 848 kB, 3.4 GB, l′ = 2.7.
|D| = 100M: 45,664 sec, 354.5 MB / 6.5 MB, 34 GB, l′ = 3.5.</p>
      <p>Table 2 reports the indexing times for the various data set
sizes (|R| = 100), showing the almost perfectly linear
proportion between indexing time and data set size. With respect
to the indexing times we note that: (i) the twelve hours
required to build the 100M index for |R| = 100 are
in line with the fourteen hours we have measured to build
a text search index on the descriptions and the comments
associated with the indexed images; (ii) this time refers to
a completely sequential indexing process, not leveraging the
parallelization possibilities described in Section 3.6; (iii)
we have not explored the possibility of using a similarity
search data structure in order to answer the l-NN queries on the
R set necessary to build the permutation prefixes.</p>
      <p>The table also shows the resulting memory occupation
of the prefix tree before and after the application of the
compression strategies described in Section 3.4. The
values show how such strategies allow the main memory
requirement of the PP-Index to be reduced by orders of
magnitude (at least by a factor of fifty in our case) without
affecting the quality of the results.</p>
      <p>As expected, the disk occupation is perfectly linear with
respect to the data set size, given that the disk data storage
contains only a sequential serialization of data blocks (375
bytes each one).</p>
      <p>The last column of Table 2 reports the average depth of
the leaves of the prefix tree, after the compression. The l′
values show that the l value is not crucial in the definition
of a PP-Index, given that the only requirement is to choose
an l value large enough to perform a sufficient
differentiation of the indexed objects.</p>
      <p>The graph of Figure 5 plots the search time with respect
to the size of R and the data set size, for k = 100 (single
index, single query). For the worst case, with |R| = 100,
we have measured an average 0.239 seconds search time on
the 100M index, with an average of less than eight
thousand candidates retrieved from the data storage (see Table
3). The search time decreases in direct proportion to the
decrease of the z′ value, which follows from the more detailed
partitioning of objects into the permutation prefix space
determined by the increase of |R|. Even though the z′ value
increases as D gets larger, the increase of z′ is largely
subproportional to the growth of D: when D grows by a factor
of 100, z′ increases at worst by a factor of 6.6.</p>
      <p>A possible issue with the z′ value is that it is not so close
to the z value, and it has potentially no upper limit. However,
during the experiments the z′ value has never been a critical
factor with respect to efficiency: we have not observed any
extreme case in which z′ &gt;&gt; z, and the z′ value has never
required more than a single read operation from disk to
retrieve all the candidate objects (e.g., retrieving 10,000 data
blocks from the data storage involves reading only 3.7 MB
from disk). We leave to future work an investigation of the
relations of the z value with the other parameters.</p>
      <p>Figure 6 shows the effectiveness of the PP-Index with
respect to the size of R and the data set size, using a
single-index/single-query configuration, for k = 100.</p>
      <p>Effectiveness values improve with the increase of |R| for
the 10M and 100M data sets, while the 1M data set shows
the inverse tendency. This confirms the intuition that larger
data sets require a richer permutation prefix space
(generated by a larger set R) to better distribute their elements,
until a limit is reached, objects become too sparse in the
permutation prefix space, and the effectiveness worsens.</p>
      <p>The maximum-efficiency configuration of the PP-Index
(0.210 seconds answer time) has obtained an 18.3% recall and
an 8.1% RDE on the 100M data set, for k = 100.</p>
      <p>Figures 7 and 8 show, respectively, the effects on
effectiveness of the multiple index and multiple query strategies, for
three k values.</p>
      <p>With respect to the multiple index strategy, we have
measured a great improvement on both measures, reaching a 74%
recall (four times better than the single-index case) and a
0.7% RDE (eleven times better) for the eight-index case.</p>
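      <p>A multiple index search has to merge the per-index k-NN lists into one global top-k; since the data is replicated across the indexes, the same object can be returned by more than one index and must be counted only once. A hedged sketch of such a merge, where the (distance, object_id) layout and the function name are our assumptions:</p>

```python
import heapq

def merge_knn(results_per_index, k):
    """Merge k-NN result lists from several indexes into one global top-k.

    Each list holds (distance, object_id) pairs; duplicates (the same object
    found by more than one index) are kept only once, at their best distance.
    """
    seen = set()
    merged = []
    for dist, oid in heapq.merge(*[sorted(r) for r in results_per_index]):
        if oid not in seen:
            seen.add(oid)
            merged.append((dist, oid))
            if len(merged) == k:
                break
    return merged

# Two toy per-index results; object 'y' is found by both indexes.
a = [(0.1, 'x'), (0.4, 'y'), (0.9, 'z')]
b = [(0.2, 'w'), (0.4, 'y'), (0.8, 'v')]
print(merge_knn([a, b], k=3))  # [(0.1, 'x'), (0.2, 'w'), (0.4, 'y')]
```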
      <p>For the above-mentioned eight-index configuration we have
measured an average 1.72 seconds search time, for a
completely sequential search process. The four-index configuration
allows reaching a 52% recall (67% for k = 10) and just
a 2.2% RDE with a sub-second answer time.</p>
      <p>It is relevant to note that, given the small memory
occupation of the compressed prefix tree, we have been able to
simultaneously load eight 100M indexes into memory,
thus practically performing search on an 800-million-object
index, though with replicated data, on a single computer.</p>
      <p>The multiple query strategy also shows relevant
improvements, though smaller than those of the multiple index
strategy. This is in part explained by the fact that many of the
queries, generated by permuting the elements of the original
query permutation prefix, actually resulted in retrieving the
same candidates as other queries (such candidates are read
only once from the data storage). On the 100M index, for
|R| = 1,000, on average only 1.92 distinct queries were
effective in retrieving candidates in the two-query configuration,
3.18 queries in the four-query configuration, and 5.25 queries
in the eight-query configuration.</p>
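      <p>The extra queries of the multiple query strategy are derived from the original permutation prefix by swapping some of its elements; as noted above, distinct permutations may still fetch the same candidates, so deduplication matters. A rough sketch, where the specific swap policy (adjacent transpositions) is our illustrative assumption, not necessarily the one used in the experiments:</p>

```python
def alternative_queries(prefix, num_queries):
    """Generate up to num_queries distinct query prefixes: the original one
    plus variants obtained by swapping adjacent elements (assumed policy)."""
    queries = [tuple(prefix)]
    for i in range(len(prefix) - 1):
        if len(queries) == num_queries:
            break
        q = list(prefix)
        q[i], q[i + 1] = q[i + 1], q[i]  # swap one adjacent pair
        queries.append(tuple(q))
    return queries

print(alternative_queries((3, 7, 1, 9), 3))
# [(3, 7, 1, 9), (7, 3, 1, 9), (3, 1, 7, 9)]
```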
      <p>Figure 9 shows the effectiveness of the combined
multiple query and multiple index search strategies, using eight
queries and eight indexes, for |R| = 1,000. We have
measured an average search time of 12.45 seconds, for a fully
sequential search process. This setup produces almost exact
results, with a recall &gt; 97% and an RDE &lt; 0.01%.</p>
      <p>
        We have measured, on average, a total of 370,000 data
blocks retrieved from the data storage, across the average
44.5 queries effectively used to access the data storages
for each original query. Although this z′ value is relatively
high, it represents just 0.3% of the whole collection. This
is a very low value considering, for example, that Lv et al.
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], proposing a multiple query strategy for the LSH-Index,
have measured a percentage of distance computations with
respect to the data set size, in order to obtain a 96% recall,
of 4.4% on a 1.3 million objects data set and of 6.3% on a
2.6 million objects data set.
      </p>
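      <p>For concreteness, the two effectiveness measures used in this section can be sketched as follows; note that our formulation of RDE is one common variant and may differ in detail from the exact definition used in the experiments:</p>

```python
def recall(approx_ids, exact_ids):
    """Fraction of the exact k-NN result that the approximate search found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def relative_distance_error(approx_dists, exact_dists):
    """Average relative error of the approximate result distances against the
    exact ones (one common formulation, assumed here for illustration)."""
    return sum(a / e - 1.0 for a, e in zip(approx_dists, exact_dists)) / len(exact_dists)

print(recall(['a', 'b', 'd'], ['a', 'b', 'c']))         # ≈ 0.667
print(relative_distance_error([1.0, 2.2], [1.0, 2.0]))  # ≈ 0.05
```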
    </sec>
    <sec id="sec-13">
      <title>Comparison experiments</title>
      <p>It is a hard task to run comparative experiments on novel
and very large data sets, such as CoPhIR, due to many
reasons: (i) lack of previous results on the same data set; (ii)
lack of a publicly available implementation for many of the
methods involved in the comparison; (iii) when an
implementation is available, it is typically not designed to scale
to very large data sets, being just a proof of concept; (iv)
moreover, the implementation is usually designed to take in
input only a specific data type/format, which makes it harder
to port the application to different data types.</p>
      <p>
        For these reasons we have currently limited our comparison
to replicating two experiments that, among the others, are
most closely related to our work: Batko et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], who have
run experiments using an early release of the CoPhIR data
set, and Amato and Savino [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], whose proposal is the most
closely related to our own.
      </p>
      <p>
        Batko et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have run experiments on the CoPhIR data
set, with data set sizes of 10 and 50 million images (we
suppose they use the same linear combination of visual
descriptors as in our experiments, given that two of the authors
are also authors of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], from which we take our weights).
They report an 80% recall level for 100-NN queries on both
collections. For the 50M data set they test both an 80-CPU
infrastructure with 250 GB of RAM, keeping all the index
data in memory, and a 32-CPU infrastructure with 32 disks
storing the index data, obtaining a 0.44 seconds answer time
for the memory-based solution and 1.4 seconds for the
disk-based one.
      </p>
      <p>
        The PP-Index can achieve a better performance than
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] by distributing the search process, yet using far fewer
resources than [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We have simulated the distribution of the
eight indexes on eight distinct computers, each one using two
processes executing four queries each, measuring the query
answer time as the time of the slowest of the 16 processes
plus the time to merge the 16 100-NN intermediate results.
We have measured an average 1.02 seconds answer time to
obtain &gt; 95% recall on the 50M data set.
      </p>
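      <p>The simulated answer time above is simply the time of the slowest process plus the time to merge the intermediate results; a one-line sketch, with the per-process timings invented purely for illustration:</p>

```python
def distributed_answer_time(process_times, merge_time):
    """Answer time of a simulated distributed search: wait for the slowest
    process, then merge the intermediate k-NN results."""
    return max(process_times) + merge_time

times = [0.81, 0.95, 0.78, 0.90]  # hypothetical per-process times, in seconds
print(distributed_answer_time(times, merge_time=0.05))  # ≈ 1.0
```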
      <p>
        Amato and Savino [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] test their method on the Corel data
set (http://kdd.ics.uci.edu/databases/CorelFeatures/
CorelFeatures.html). The data set consists of 50,000
32-dimensional color HSV histograms extracted from the images.
The distance function used to compare the histograms is L1.
      </p>
      <p>
        Replicating [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we have selected 50 random objects as
queries, and indexed the rest of the collection. Given the
small size of the data set, we have set |R| = 50. The time
required for generating the PP-Index is 4.9 seconds, with a
disk occupation of 13 MB and a memory occupation of 450
kB. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the index structure generated by setting ki = 100
is estimated to require 20 MB. This value does not include
the HSV histograms, which are required to reorder the
retrieved objects by the true similarity.
      </p>
      <p>
        The maximum recall level obtained in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for k = 50 is
about 54%, requiring reading 2.4 MB of data from disk (600
blocks of 4 kB size). The PP-Index, in a
single-index/four-query configuration (z = 500), obtains an 89.6% recall,
requiring reading just 900 kB of data from disk, in four
sequential reads. The single-index/single-query configuration
obtains a 66% recall.
      </p>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSIONS</title>
      <p>We have introduced the PP-Index, an approximate
similarity search data structure based on the use of short
permutation prefixes. We have described the PP-Index data
structures and algorithms, including a number of optimization
methods and search strategies aimed at improving the
scalability of the index, its efficiency, and its effectiveness.</p>
      <p>The PP-Index has been designed to take advantage of the
relatively static nature one could expect from very large
collections. However, as we have described, it is easy to support
fast update operations.</p>
      <p>We have evaluated the PP-Index on a very large and
high-dimensional data set. Results show that it is both efficient
and effective in performing similarity search, and that it
scales well to very large data sets.</p>
      <p>We have shown how a limited-resources configuration
obtains good effectiveness results in less than a second, and
how almost exact results are produced in a relatively short
amount of time. The parallel processing capabilities of the
PP-Index allow distributing the search process in order to
further improve its efficiency.</p>
      <p>
        The comparison with experimental results published for
two closely related methods, which are among the top
performers on the task, shows that the PP-Index outperforms
the compared methods, both in efficiency and effectiveness.
Only one [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] of the works we compare with uses a
data set of a size comparable to our largest one. We plan to
extend the comparison with some of the competing methods,
by porting them to the larger data set sizes.
      </p>
      <p>
        The PP-Index has already been used to build a performing
similarity search system [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>There are many aspects of our proposal that are worth
further investigation. For example, the R set is a crucial
element of the PP-Index. We plan to study element selection
policies alternative to the random policy, e.g., selecting
centroids of clusters of D, or using the most frequent queries
from a query log.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amato</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Savino</surname>
          </string-name>
          .
          <article-title>Approximate similarity search in metric spaces using inverted files</article-title>
          .
          <source>In INFOSCALE '08: Proceedings of the 3rd International ICST Conference on Scalable Information Systems</source>
          , pages
          <fpage>1</fpage>–<lpage>10</lpage>
          , Vico Equense, Italy,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Batko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Falchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lucchese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Novak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rabitti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sedmidubsky</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Zezula</surname>
          </string-name>
          .
          <article-title>Crawling, indexing, and similarity searching images on the web</article-title>
          .
          <source>In Proceedings of SEBD '08, the 16th Italian Symposium on Advanced Database Systems</source>
          , pages
          <fpage>382</fpage>–<lpage>389</lpage>
          , Mondello, Italy,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Batko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohoutkova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Zezula</surname>
          </string-name>
          .
          <article-title>Combining metric features in large collections</article-title>
          .
          <source>SISAP '08, 1st International Workshop on Similarity Search and Applications</source>
          , pages
          <fpage>79</fpage>–<lpage>86</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Condie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ganesan</surname>
          </string-name>
          .
          <article-title>LSH forest: self-tuning indexes for similarity search</article-title>
          .
          <source>In WWW '05: Proceedings of the 14th international conference on World Wide Web</source>
          , pages
          <fpage>651</fpage>–<lpage>660</lpage>
          ,
          <string-name>
            <surname>Chiba</surname>
          </string-name>
          , Japan,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bolettieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Falchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lucchese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Piccioli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Rabitti</surname>
          </string-name>
          .
          <article-title>CoPhIR: a test collection for content-based image retrieval</article-title>
          .
          <source>CoRR, abs/0905.4627</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Chavez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Figueroa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Navarro</surname>
          </string-name>
          .
          <article-title>Effective proximity retrieval by ordering permutations</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)</source>
          ,
          <volume>30</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1647</fpage>–<lpage>1658</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Chavez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Navarro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Marroquin</surname>
          </string-name>
          .
          <article-title>Searching in metric spaces</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>33</volume>
          (
          <issue>3</issue>
          ):
          <fpage>273</fpage>–<lpage>321</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciaccia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Patella</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Zezula</surname>
          </string-name>
          .
          <article-title>M-tree: An efficient access method for similarity search in metric spaces</article-title>
          .
          <source>In VLDB '97: Proceedings of the 23rd International Conference on Very Large Data Bases</source>
          , pages
          <fpage>426</fpage>–<lpage>435</lpage>
          , Athens, Greece,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Diaconis</surname>
          </string-name>
          .
          <article-title>Group representation in probability and statistics</article-title>
          .
          <source>IMS Lecture Series</source>
          ,
          <volume>11</volume>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          .
          <article-title>MiPai: using the PP-Index to build an efficient and scalable similarity search system</article-title>
          .
          <source>In SISAP '09, Proceedings of the 2nd International Workshop on Similarity Search and Applications</source>
          , Prague, CZ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Indyk</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          .
          <article-title>Approximate nearest neighbors: towards removing the curse of dimensionality</article-title>
          .
          <source>In STOC '98: Proceedings of the 30th ACM symposium on Theory of computing</source>
          , pages
          <fpage>604</fpage>–<lpage>613</lpage>
          , Dallas, USA,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Mendelzon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Milo</surname>
          </string-name>
          .
          <article-title>Similarity-based queries</article-title>
          .
          <source>In PODS '95: Proceedings of the 14th ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems</source>
          , pages
          <fpage>36</fpage>–<lpage>45</lpage>
          , San Jose, US,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Knuth</surname>
          </string-name>
          .
          <source>The Art of Computer Programming</source>
          , volume
          <volume>3</volume>
          : Sorting and Searching, chapter 5.4: External Sorting, pages
          <fpage>248</fpage>–<lpage>379</lpage>
          . Second edition,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Josephson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Charikar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Multi-probe LSH: efficient indexing for high-dimensional similarity search</article-title>
          .
          <source>In VLDB '07: Proceedings of the 33rd international conference on Very large data bases</source>
          , pages
          <fpage>950</fpage>–<lpage>961</lpage>
          , Vienna, Austria,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Morrison</surname>
          </string-name>
          .
          <article-title>PATRICIA - practical algorithm to retrieve information coded in alphanumeric</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>15</volume>
          (
          <issue>4</issue>
          ):
          <fpage>514</fpage>–<lpage>534</lpage>
          ,
          <year>1968</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <source>MPEG-7: Multimedia content description interfaces, Part 3: Visual</source>
          ,
          <year>2002</year>
          . ISO/IEC 15938-3:2002
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Patella</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciaccia</surname>
          </string-name>
          .
          <article-title>The many facets of approximate similarity search</article-title>
          .
          <source>SISAP '08, 1st International Workshop on Similarity Search and Applications</source>
          , pages
          <fpage>10</fpage>–<lpage>21</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zezula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Amato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dohnal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Batko</surname>
          </string-name>
          .
          <article-title>Similarity Search: The Metric Space Approach (Advances in Database Systems)</article-title>
          .
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>