=Paper= {{Paper |id=Vol-480/paper-2 |storemode=property |title=PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search |pdfUrl=https://ceur-ws.org/Vol-480/paper2.pdf |volume=Vol-480 |dblpUrl=https://dblp.org/rec/conf/sigir/Esuli09 }} ==PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search== https://ceur-ws.org/Vol-480/paper2.pdf
      PP-Index: Using Permutation Prefixes for Efficient and
             Scalable Approximate Similarity Search

                                                                    Andrea Esuli
                   Istituto di Scienza e Tecnologie dell’Informazione- Consiglio Nazionale delle Ricerche
                                        via Giuseppe Moruzzi, 1- 56124, Pisa - ITALY
                                                           andrea.esuli@isti.cnr.it


ABSTRACT                                                                      by their similarity to the query, according to a given distance
We present the Permutation Prefix Index (PP-Index), an in-                    function d : O × O → R+ (i.e., the closer two objects are,
dex data structure that allows to perform efficient approxi-                  the most similar they are considered). Typically only the
mate similarity search.                                                       k-top ranked objects are returned (k-NN query), or those
   The PP-Index belongs to the family of the permutation-                     within a maximum distance value r (range query).
based indexes, which are based on representing any indexed                       One of the main research topics on similarity search is the
object with “its view of the surrounding world”, i.e., a list                 study of the scalability of similarity search methods when
of the elements of a set of reference objects sorted by their                 applied to high-dimensional similarity spaces.
distance order with respect to the indexed object.                               The well known “curse of dimensionality” [7] is one of the
   In its basic formulation, the PP-Index is strongly biased                  hardest obstacles that researchers have to deal with when
toward efficiency, treating effectiveness as a secondary as-                  working on this topic. Along the years, such obstacle has
pect. We show how the effectiveness can easily reach opti-                    been attacked by many proposals, using many different ap-
mal levels just by adopting two “boosting” strategies: multi-                 proaches. The earliest and most direct approach consisted
ple index search and multiple query search. Such strategies                   in trying to improve the data structures used to perform ex-
have nice parallelization properties that allow to distribute                 act similarity search. Research moved then toward the ex-
the search process in order to keep high efficiency levels.                   ploration of approximate similarity search methods, mainly
   We study both the efficiency and the effectiveness proper-                 proposing variants of exact methods in which some of the
ties of the PP-Index. We report experiments on collections                    constraints that guarantee the exactness of the results are
of sizes up to one hundred million images, represented in a                   relaxed, trading effectiveness for efficiency.
very high-dimensional similarity space based on the combi-                       Approximate methods [17] that are not derived from ex-
nation of five MPEG-7 visual descriptors.                                     act methods have been also proposed. On this field, the re-
                                                                              cent research on permutation-based indexes (PBI) [1, 6] has
                                                                              shown a promising direction toward scalable data structures
Categories and Subject Descriptors                                            for similarity search.
E.1 [Data]: Data Structures; H.3.3 [Information Sys-                             In this work we present the Permutation Prefix Index (PP-
tems]: Information Storage and Retrieval—Search process                       Index), an approximate similarity search structure belonging
                                                                              to the family of the permutation-based indexes. We describe
                                                                              the PP-Index data structures and algorithms, and test it on
General Terms                                                                 data sets containing up to 100 million objects, distributed
Algorithms,Experimentation,Performance                                        on a very high-dimensional similarity space. Experiments
                                                                              show that the PP-Index is a very efficient and scalable data
Keywords                                                                      structure both at index time and at search time, and it also
                                                                              allows to obtain very good effectiveness values. The PP-
approximate similarity search, metric space, scalability                      Index has also nice parallelization properties that allow to
                                                                              distribute both the index and the search process in order to
1.    INTRODUCTION                                                            further improve efficiency.
  The similarity search model [12] is a search model in
which, given a query q and a collection of objects D, all                     2.   RELATED WORKS
belonging to a domain O, the objects in D have to be sorted
                                                                                The PP-Index belongs to the family of the permutation-
                                                                              based indexes, a recent family of data structure for approxi-
                                                                              mate similarity search, which has been independently intro-
                                                                              duced by Amato and Savino [1] and Chavez et al. [6].
                                                                                The PP-Index has however a key difference with respect to
                                                                              previously presented PBIs: as we detail in this section, such
Copyright c 2009 for the individual papers by the papers’ authors. Copy-      PBIs use permutations in order to estimate the real distance
ing permitted for private and academic purposes. Re-publication of material   order of the indexed objects with respect to a query. The
from this volume requires permission by the copyright owners. This volume
is published by its editors.                                                  PP-Index instead uses the permutation prefixes in order to
LSDS-IR Workshop. July 2009. Boston, USA.                                     quickly retrieve a reasonably-sized set of candidate objects,
which are likely to be at close distance to the query object,          full-length permutations, with a negligible loss in accuracy.
then leaving to the original distance function the selection             Similarly, a value ks is adopted for the query, in order to
of the best elements among the candidates.                             select only the first ks elements of Πq . Given |R| = 500 a
   For a more detailed review of the most relevant methods             value of ks = 50 reduces by ten times the number of disk
for similarity search in metric spaces we point the reader to          accesses, also slightly improving the accuracy.
the book of Zezula et al. [18]. The recent work of Patella
and Ciaccia [8] more specifically analyzes and classifies the          3.    THE PP-Index
characteristics of many approximate search methods.                       The PP-Index represents each indexed object with a very
   Chávez et al. [6] present an approximate similarity search         short permutation prefix.
method based on the intuition of “predicting the closeness                The PP-Index data structures consists of a prefix tree kept
between elements according to how they order their distances           in main memory, indexing the permutation prefixes, and a
towards a distinguished set of anchor objects”.                        data storage kept on disk, from which objects are retrieved
   A set of reference objects R = {r0 , . . . , r|R|−1 } ⊂ O is        by sequential disk accesses.
defined by randomly selecting |R| objects from D. Every                   This configuration of data structures is interestingly simi-
object oi ∈ D is then represented by Πoi , consisting of the           lar to the one used by Bawa et al. [4], however, it is relevant
list of identifiers of reference objects, sorted by their dis-         to note that our work and [4] are based on completely dif-
tance with respect to the object oi . More formally, Πoi is            ferent approaches to the problem. The latter proposes the
a permutation of h0, . . . , |R| − 1i so that, for 0 < i < |R|         LSH-Forest, an improvement to the LSH-Index [11] that is
it holds either (i) d(oi , rΠox (i−1) ) < d(oi , rΠox (i) ), or (ii)   based on using hash keys of variable lenght. These are used
d(oi , rΠox (i−1) ) = d(oi , rΠox (i) ) and Πox (i − 1) < Πox (i),     to identify a set of candidate objects with hash keys that
where Πox (x) returns the i-th value of Πox .                          have a prefix match with the hash key of to the query. Thus
   All the permutations for the index objects are stored in            the LSH-Forest, like the other LSH-based methods, is based
main memory. Given a query q, all the indexed permutations             only on probabilistic considerations, while the PP-Index, like
are sorted by their similarity with Πq , using a similarity mea-       the other PBIs, relies on geometrical considerations.
sure defined on permutations. The real distance d between                 More generally, a key difference between the PBI model
the query and the objects in the data set is then computed             and the LSH model is that the hash functions of the LSH
by selecting the objects from the data set following the order         model are solely derived from the similarity measures in use,
of similarity of their permutations, until the requested num-          independently of the way the indexed objects are distributed
ber of objects is retrieved. An example of similarity measure          in the similarity space, while in the SPI model the reference
on permutations is the Spearman Footrule Distance [9]:                 objects provide information about this aspect.
        SF D(ox , oy ) = Σr∈R |P (Πox , r) − P (Πoy , r)|       (1)       The PP-Index is designed to allow very efficient indexing
                                                                       by performing bulk processing of all the objects indexed.
where P (Πox , r) returns the position of the reference object         Such bulk processing model is based on the intuitive assump-
r in the permutation assigned to Πox .                                 tion that the data, in the very large collections the PP-Index
   Chávez et al. do not discuss the applicability of their            is designed for, have a relatively static nature. However, it is
method to very large data sets, i.e., when the permutations            easy to provide the PP-Index with update capabilities (see
cannot be all kept in main memory.                                     Section 3.6).
   Amato and Savino [1], independently of [6], propose an
approximate similarity search method based on the intuition            3.1    Data structures
of representing the objects in the search space with “their               Given a collection of objects D to be indexed, and the
view of the surrounding world ”.                                       similarity measure d, a PP-Index is built by specifying a set
   For each object oi ∈ D, they compute the permutation                of reference objects R, and a permutation prefix length l.
Πoi in the same manner as [6]. All the permutations are                   Any object oi ∈ D is represented by a permutation prefix
used to build a set of inverted lists, one for each reference          woi consisting of the first l elements of the permutation Πoi ,
object. The inverted list for a reference object ri stores the         i.e., woi = Πloi . Any object oi ∈ D is also associated with
position of such reference object in each of the indexed per-          a data block. A data block boi contains (i) the information
mutations. The inverted lists are used to rank the indexed             required to univocally identify the object oi and (ii) the es-
objects by their SF D value (equation 1) with respect to a             sential data used by the function d in order to compute the
query object q, similarly to [6]. In fact, if full-length per-         similarity between the object oi and any other object in O.
mutations are used to represent the indexed objects and the               The prefix tree of the PP-Index is built on all the permu-
query, the search process is perfectly equivalent to the one           tation prefixes generated for the indexed objects. The leaf
of [6]. In [1], the authors propose two optimizations that             at the end of a path relative to a permutation prefix w keeps
improve the efficiency of the search process, not affecting            the information required to retrieve the data blocks relative
the accuracy of the produced ranking. Both optimizations               to the objects represented by w from the data storage.
are based on the intuition that the information about the                 The data storage consists of a file in which all the data
order of the closest reference objects is more relevant than           blocks are sequentially stored. The order of objects (repre-
the information about distant ones.                                    sented by data blocks) in the data storage is the same as the
   One optimization consists in inserting into the inverted            one produced by performing an ordered visit of the prefix
lists only the information related to Πkoii , i.e., the part of        tree. This is a key property of the PP-Index, which allows
Πoi including only the first ki elements of the permutation,           to use the prefix tree to efficiently access the data storage.
thus reducing by a factor |R| ki
                                 the size of the index. For               Specifically, the leaf of the prefix tree relative to permu-
example, given |R| = 500 a value of ki = 100 reduces by                tation prefix w contains two counter values hstart
                                                                                                                        w     and hend
                                                                                                                                   w ,
                                                                                                   start      end
five times the number of disk accesses with respect to using           and two pointer values pw         and pw . The counter values
BuildIndex(D, d, R, l)                                                                                 Index characteristics
 1 pref ixT ree ← EmptyPrefixTree()                                                                     |D|=10, |R|=6, l=3
                                                                                                                        |




 2 dataStorage ← EmptyDataStorage()                                                                    Permutation prefixes
 3 for i ← 0 to |D − 1|                                                                             wo =<1, 3, 2> wo =<2, 3, 0>
                                                                                                         0                          1
 4 do oi ← GetObject(D, i)                                                                          wo =<5, 2, 3> wo =<4, 1, 3>     3
 5      dataBlockoi ← GetDataBlock(oi )                                                                  2

                                                                                                    wo =<1, 3, 2> wo =<4, 1, 3>
 6      poi ← Append(dataBlockoi , dataStorage)                                                          4                          5

                                                                                                    wo =<1, 3, 4> wo =<5, 2, 3>
 7      woi ← ComputePrefix(oi , R, d, l)                                                                6                          7

                                                                                                    wo =<1, 3, 2> wo9 =<4, 3, 5>
 8      hoi ← i                                                                                          8


 9      Insert(woi , hoi , poi , pref ixT ree)
10 L ← ListPointersByOrderedVisit(pref ixT ree)                                                               Prefix tree
11 P ← CreateInvertedList(L)                                                                                  root
12 ReorderStorage(dataStorage, P )                                                                             1 2 4 5
13 CorrectLeafValues(pref ixT ree, dataStorage)
14 index ← NewIndex(d, R, l, pref ixT ree, dataStorage)
15 return index                                                                                      3         3            1 3                     2
          Figure 1: The BuildIndex function.                                                  2 4              0            3                 5         3
indicate the ordinal position in the sequence of data blocks
in the data storage of the first and the last data blocks rela-                   h0 h4 h8              h     h         h3 h5                     h9    h h
                                                                                  p0 p4 p8              p66   p11       p3 p5                     p9    p22 p77
tive to the permutation prefix w. The pointer values indicate
                                                                                                                                                                      main memory
instead the byte offset in the data storage where the data                                                                                                       secondary memory
blocks are effectively serialized. In case the data blocks have
a fixed size s, the p values can be omitted and computed
when necessary as pw = hw · s.                                              o0 ... o1 ... o2... o 3 ... o4 ... o5 ... o6 ... o7 ... o8 ... o9 ...
   The data storage implementation must allow, given any                                    Data storage
two pointers p0 and p00 , to sequentially retrieve all the data       Figure 2: Sample data and partially-built index data
blocks included between them.                                         structure after the first indexing phase.
3.2    Building the index                                                                                     Prefix tree
   Figure 1 shows a pseudo-code description of the indexing                                                   root
function for the PP-Index.                                                                                        1 2 4 5
   The indexing algorithm first initializes an empty prefix
tree in main memory, and an empty file on disk, to be used                                           3            3         1 3                     2
as the data storage.
   The algorithm takes in input an object oi ∈ D, for i from                                  2 4                 0          3                5         3
0 to |D − 1|, and appends its data block at the current
                                                                                    h h
                                                                                      start
                                                                                      132
                                                                                              end
                                                                                              132   h   134   h   230   h h start
                                                                                                                            413
                                                                                                                                        end
                                                                                                                                        413   h   435   h h
                                                                                                                                                         start
                                                                                                                                                         532
                                                                                                                                                                  end
                                                                                                                                                                  532
end position pend of the data storage file. Then the algo-                          p p
                                                                                      start
                                                                                      132
                                                                                              end
                                                                                              132   p   134   p   230   p p start
                                                                                                                            413
                                                                                                                                        end
                                                                                                                                        413   p   435   p p
                                                                                                                                                         start
                                                                                                                                                         532
                                                                                                                                                                  end
                                                                                                                                                                  532

rithm computes, for the object oi , the permutation prefix
                                                                                                                                                                      main memory
woi (ComputePrefix), and inserts woi into the prefix tree.                                                                                                       secondary memory
The values hoi = i and poi = pend are stored in the leaf
of the prefix tree corresponding to permutation prefix woi .
When more that one value has to be stored in a leaf, a list                  o0 ... o4 ... o8... o 6 ... o1 ... o3 ... o5 ... o9 ... o2 ... o7 ...
is created.                                                                                Data storage
   Figure 2 shows an example list of permutation prefixes                   Figure 3: The final index data structure.
generated for a set of objects and the data structures result-
ing from the above discussed first phase of data indexing.            P = h0, 4, 8, 5, 1, 6, 3, 9, 2, 7i.
   The successive phase (ReorderStorage) consists in re-                Once P is generated the data storage is reordered accord-
ordering the data blocks in the data storage to satisfy the           ingly, using an m-way merge sorting method [13]. The ad-
order constrains described in the previous section. An or-            vantages of using this method are that it involves only se-
dered visit of the prefix tree is made in order to produce            quential disk accesses, and that it has a small (and config-
a list L of the hoi values stored in the leaves. Thus, the            urable) main memory space occupation.
hoi values in the list L are sorted in alphabetical order of            In order to obtain the final index structure, the values in
the permutation prefixes their relative objects are associated        the leaves of the prefix tree have to be updated accordingly
with.                                                                 to the new data storage (CorrectLeafValues). This is
   Data blocks in the data storage are reordered following            obtained by performing an ordered visit to the prefix tree,
the order of appearance of hoi values in list L. For example,         the same performed when building the list L, synchronized
given a list for L = h0, 4, 8, 6, 1, 3, 5, 9, 2, 7i, the data block   with a sequential scan of the reordered data storage. The
relative to object o7 , identified in the list by the value ho7 =     number of elements in the list of a leaf determines the two
7, has to be moved to the last position in the data storage,          hstart and hend values that replace such list, and also the
since ho7 appears in the last position of the list L (see the         number of data blocks to be sequentially read from the data
values in the leaves of the prefix tree in Figure 2).                 storage, in ordered to determine the pstart and pend values.
   To efficiently perform the reordering, the list L is in-             Figure 3 shows the final index data structure.
verted into a list P . The i-th position of the list P in-
dicates the new position in which the i-th element of the             3.3   Search function
data storage has to be moved. For the above example,                    The search function is designed to use the index in order to
Search(q, k, z, index)                                              ducing to a single leaf the subtrees of the prefix tree that
  1 (pstart , pend ) ← FindCandidates(q, index.pref ixT ree,
  2 index.R, index.d, index.l, z)                                   points to less than z objects, given that none of such sub-
  3 resultsHeap ← EmptyHeap()                                       trees will be ever selected by the search function.
  4 cursor ← pstart
  5 while cursor ≤ pend                                             3.5    Improving the search effectiveness
  6 do dataBlock ← Read(cursor, index.dataStorage)
  7     AdvanceCursor(cursor)                                          The “basic” search function described is Section 3.3 is
  8     distance ← index.d(q, dataBlock.data)                       strongly biased toward efficiency, treating effectiveness as
  9     if resultsHeap.size < k                                     a secondary aspect. The PP-Index allows to easily tune
 10        then Insert(resultsHeap, distance, dataBlock.id)         effectiveness/efficiency trade-off, and effectiveness can eas-
 11        else if distance < resultsHeap.top.distance              ily reach optimal levels just by adopting the two following
 12                  then ReplaceTop(resultsHeap, distance,         “boosting” strategies:
 13                       dataBlock.id)
 14 Sort(resultsHeap)                                                  Multiple index : t indexes are built, based on different
 15 return resultsHeap                                              R1 . . . Rt sets of reference objects. This is based on the intu-
                                                                    ition that different reference object sets produce many dif-
FindCandidates(q, pref ixT ree, R, d, l, z)                         ferently shaped partitions of the similarity space, resulting
  1 wq ← ComputePrefix(q, R, d, l)                                  in a more complete coverage of the area around queries.
  2 for i ← l to 1                                                     A search process the using multiple index strategy can be
  3 do wqi ← SubPrefix(wq , i)
                                                                    parallelized by distributing the indexes over multiple ma-
  4     node ← SearchPath(wqi , pref ixT ree)
  5     if node 6= nil
                                                                    chines, or just on different processes/CPUs on the same ma-
  6        then minLeaf ← GetMin(node, pref ixT ree)                chine, maintaing almost the same performance of the basic
  7              maxLeaf ← GetMax(node, pref ixT ree)               search function, with a negligible overhead for merging the t
  8              if (maxLeaf.hend − minLeaf.hstart + 1) ≥ z         k-NN results, as far as there are enough hardware resources
  9                 ∨i = 1                                          to support the number of indexes involved in the process.
10                  then return (minLeaf.pstart ,                      Multiple query: at search time, p additional permutation
11                        maxLeaf.pend )
12 return (0, 0)
                                                                    prefixes from the query permutation prefix wq are gener-
                                                                    ated, by swapping the position of some of its elements. The
             Figure 4: The Search function.                         geometric rationale is that a permutation prefix w0 differ-
efficiently answer to k nearest neighbor queries. The search        ing from another permutation prefix w00 for the swap of two
strategy consists in searching a subtree of the prefix tree         adjacent/near elements identifies an area Vw0 of the similar-
that identifies a specified number of candidate objects all         ity space adjacent/near to Vw00 allowing to extend the search
represented by permutation prefixes having a prefix match           process to areas of the search space that are likely to contain
with the permutation prefix representing the query.                 relevant objects.
   A k-NN query is composed of the query object q, the k               The heuristic we adopt in our experiments for swapping
value, and the z value, indicating the minimum number of            permutation prefix elements consists in sorting all the ref-
candidate objects among which the k nearest neighbors have          erence objects pairs appearing in the permutation prefix by
to be selected.                                                     their difference of distance with respect to the query object.
   The FindCandidates function determines the smallest              Then the swapped permutation prefixes are generated by
subtree of the prefix tree having a prefix match with the           first selecting for swap those pairs of identifiers that have
permutation prefix wq , i.e., the permutation prefix relative       the smallest distance difference.
to the query q, and retrieving at least z objects. The func-           The multiple query strategy can be parallelized by dis-
tion returns two pointers pstart and pend to the positions in       tributing the queries over different processes/CPUs on the
the data storage of the data blocks of the first and the last       machine handling the index structure.
candidate objects.
   The distance of each candidate object with the query is          3.6    Update and merge, distributed indexing
computed, using the distance function d. A heap is used to            The PP-Index data structures allows to very efficiently
keep track of the k objects closest to the query.                   merge indexes built using the same parameters into a single
                                                                    index. The merge functionality supports three operations:
3.4    Prefix tree optimizations                                      Supporting update operations: it is easy to add update
   In order to reduce the main memory occupation of the pre-        capabilities to an index by maintaining a few additional
fix tree it is possible to simplify its structure without affect-   data structures. Deleted objects are managed using a vector
ing the quality of results. These are search-specific optimiza-     of their identifiers. Newly inserted or modified objects are
tions, and a non-optimized version of the prefix tree should        stored in an all-in-memory secondary PP-Index used in con-
be saved for other operations (e.g., update and merge).             junction with the main index structure. A periodic merge
   A first optimization consists in pruning any path reaching       procedure is used when the secondary index reaches a given
a leaf which is composed of only-child, given that this kind        memory occupation limit.
of path does not add relevant information to distinguish be-          Indexing very large collection: the main memory occu-
tween different existing groups in the index. Another opti-         pation of a prefix tree reaches its maximum during the in-
mization consists in compressing any path of the prefix tree        dexing process, when it has to be entirely kept in memory,
composed entirely of only-children into a single label [15],        while during search, thanks to the optimization methods de-
thus saving the memory space required to keep the chain of          scribed in Section 3.4, its size can be reduced by orders of
nodes composing the path.                                           magnitude. This issue is solved building a number of smaller
   A PP-Index-specific optimization, applicable when the z          partial indexes and then merging them into the final index.
value is hardcoded into the search function, consists in re-          Distributing the indexing process: the indexing process
                                                                                         0.300
of smaller indexes can be distributed of different machines,
                                                                                         0.250
given that the information contained in any smaller index is




                                                                       Search time (s)
completely independent of the one contained in the others.                               0.200

Also the merge process can be distributed, if it is performed                            0.150
in a number of steps that involve the creation of intermediate
                                                                                         0.100          100M
indexes.                                                                                                10M
                                                                                         0.050
   The merge process consists in merging the prefix trees                                               1M

of the source indexes into a single prefix tree, i.e., by enu-                           0.000
                                                                                                        100        200         500         1000
merating, in alphabetical order, all the permutation prefixes                                                            |R|
contained in the source indexes.
                                                                 Figure 5: Search time w.r.t. to the size of R, for
   Such enumeration can be easily produced by performing
                                                                 z = 1, 000 and k = 100 (single index, single query).
a parallel ordered visit of all the prefix trees being merged.
If the prefix trees of the source indexes are saved to the          We have tested a basic configuration based on the use of
storage using a depth first visit, the merge process requires    a single index and the search function described in Section
only a single read of the serialized prefix trees. Obviously,    3.3, i.e., an efficiency-aimed configuration.
the new prefix tree is directly serialized on disk during the       We have then tested the use of multiple indexes , on con-
permutation prefix enumeration process.                          figurations using 2, 4 and 8 indexes, and also the multiple
   In the case of an update process, the identifiers of the      query search strategy by using a total of 2, 4, and 8 multi-
deleted objects are used to skip deleted objects during the      ple queries. We have also tested the combination of the two
merge process.                                                   strategies.
   The data storages are merged during the permutation pre-         We have applied the index optimization strategies de-
fix enumeration.                                                 scribed in Section 3.4 to all the generated indexes.
                                                                    On the held-out data we have tested various z val-
                                                                 ues, in this paper we report the results obtained for z =
4.     EXPERIMENTS                                               1, 000, which has produced a good trade-off in effective-
                                                                 ness/efficiency.
4.1     The CoPhIR data set                                         The experiments have been run on a desktop machine run-
   The CoPhIR1 [5] data set has been recently developed          ning Windows XP Professional, equipped with a Intel Pen-
within the SAPIR project, and it is currently the largest        tium Core 2 Quad 2.4 GHz CPU, a single 1 TB Seagate
multimedia metadata collection available for research pur-       Barracuda 7,200 rpm SATA disk (with 32 MB cache), and 4
poses. It consists of a crawl of 106 millions images from the    GB RAM. The PP-Index has been implemented in c#. All
Flickrphoto sharing website.                                     the experiments have been run in a single-threaded applica-
   The information relative to five MPEG-7 visual descrip-       tion, with a completely sequential execution of the multiple
tors [16] have been extracted from each image, resulting in      index/query searches.
more than 240 gigabytes of XML description data.                    We have evaluated the effectiveness of the PP-Index by
   We have randomly selected 100 images from the collec-         adopting a ranking-based measure and a distance-based
tion as queries and we have run experiments using the first      measure [17], recall and relative distance error, defined as:
million (1M), ten millions (10M), and 100 millions (100M)
                                                                                                                   |Dqk ∩ Pqk |
images from the data set.                                                                   Recall(k)         =                                         (2)
   We have run experiments on a linear combination of the                                                                k
                                                                                                                       k
five distance functions for the five descriptors, using the                                                        1 X     d(q, Pqk (i))
weights proposed in [3].                                                                     RDE(k)           =                            −1           (3)
                                                                                                                   k i=1 d(q, Dqk (i))

         Descriptor         Type      Dimensions    Weight       where Dq is the list of the elements of D sorted by their
       Scalable Color        L1          64           2          distance with respect to q, Dqk is the list of the k closest
      Color Structure        L1          64           3          elements, Pqk is the list returned by the algorithm, and Lkq (i)
       Color Layout       sum of L2      80           2          returns the i-th element of the list L.
      Edge Histogram         L1          62           4
    Homogeneous Texture      L1          12          0.5         4.3            Results
Table 1: Details on the five MPEG-7 visual descrip-                 |D|                    indexing           prefix tree size         data       l0
tors used in CoPhIR.                                                                       time (sec)         full          comp.      storage
                                                                    1M                     419                7.7 MB        91 kB      349 MB     2.1
                                                                    10M                    4385               53.8 MB       848 kB     3.4 GB     2.7
4.2     Configurations and evaluation measures                      100M                   45664              354.5 MB 6.5 MB          34 GB      3.5

  We have explored the effect of using different sized R sets,   Table 2: Indexing times (with |R| = 100), resulting
by running the experiments using three R set sizes consist-      index sizes, and average prefix tree depth l0 (after
ing of 100, 200, 500, and 1, 000 reference objects. We have      prefix tree compression with z = 1, 000).
adopted a random selection policy of objects from D for the
generation of the various R sets, following [6], which reports      Table 2 reports the indexing times for the various data set
the random selection policy as a good performer.                 sizes (|R| = 100), showing the almost perfect linear propor-
  In all the experiments we have used a fixed value of l = 6.    tion between indexing time and data set size. With respect
                                                                 to the indexing times we note that: (i) the twelve hours
1
    http://cophir.isti.cnr.it/                                   time, required to build the 100M index for the |R| = 100, is
                 0.20                                                                    0.10

                                                                 100M                                                           k=100
                                                                 10M                     0.08                                   k=10
                 0.15
     RDE(k)                                                      1M                                                             k=1




                                                                             RDE(k)
                                                                                         0.06
                 0.10
                                                                                         0.04

                 0.05
                                                                                         0.02

                 0.00                                                                    0.00
                          100          200           500       1000                             1           2               4   8
                                             |R|                                                                |indexes|
                  1.0                                                                     1.0

                              100M                                                                  k=100
                  0.8         10M                                                         0.8       k=10
                              1M                                                                    k=1
     Recall(k)




                                                                             Recall(k)
                  0.6                                                                     0.6

                  0.4                                                                     0.4

                  0.2                                                                     0.2

                  0.0                                                                     0.0
                          100          200           500       1000                             1           2               4   8
                                             |R|                                                                |indexes|
Figure 6: Effectiveness with respect of the size of R                   Figure 7: Multiple index search strategy on the
set, for k = 100 and z = 1, 000 (single index, single                   100M index, using |R| = 1, 000 and z = 1, 000.
query).
                        |R|                   |D|                       partitioning of objects into the permutation prefix space, de-
                                     1M      10M       100M             termined by the increase of |R|. Even though the z 0 value
                        100          4,075   5,817     7,941            increases as D gets larger, the increase of z 0 is largely sub-
                        200          3,320   5,571     7,302            proportional to the growth of D: when D grows by a factor
                        500          1,803   5,065     6,853
                                                                        100, z 0 increases at worst by a factor 6.6.
                        1,000        1,091   4,748     6,644
                                                                           A possible issue with the z 0 value is that it is not so close
           Table 3: Average z 0 value, for z = 1, 000.                  the z value, and it has potentially no limits. However, dur-
                                                                        ing the experiments the z 0 value has never been a critical
in line with the fourteen hours we have measured to build               factor with respect to efficiency: we have not observed any
a text search index on the descriptions and the comments                extreme case in which z 0 >> z, and the z 0 value has never
associated with the indexed images; (ii) this time refers to            required more than a single read operation from disk to re-
a completely sequential indexing process, not leveraging on             trieve all the candidate objects (e.g., retrieving 10, 000 data
the parallelization possibilities described in Section 3.6; (iii)       blocks from the data storage involves reading only 3.7 MB
we have not explored the possibility of using a similarity              from disk). We leave to future works an investigation of the
search data structure in order to answer l-NN query on the              relations of the z value with the other parameters.
R set necessary to build the permutation prefix.                           Figure 6 shows the effectiveness of the PP-Index with re-
   The table also shows the resulting memory occupation                 spect to the size of the R and the data set size, using a
of the prefix tree before and after the application of the              single-index/single-query configuration, for k = 100.
compression strategies described in Section 3.4. The val-                  Effectiveness values improve with the increase of |R| for
ues shows how such strategies allows to reduce by orders of             the 10M and 100M data sets, while the 1M data set shows
magnitude the main memory requirement of the PP-Index                   the inverse tendency. This confirms the intuition that larger
(at least by a factor fifty in our case) without affecting the          data sets requires a richer permutation prefix space (gener-
quality of the results.                                                 ated by a larger set R) to better distribute their elements,
   As expected, the disk occupation is perfectly linear with            until a limit is reached and objects became too sparse in the
respect to the data set size, given that the disk data storage          permutation prefix space and the effectiveness worsen.
contains only a sequential serialization of data blocks (375               The maximum-efficiency (0.210 seconds answer time) con-
bytes each one).                                                        figuration of PP-Index has obtained a 18.3% recall and 8.1%
   The last column of Table 2 reports the average depth of              RDE on the 100M data set, for k = 100.
the leaves of the prefix tree, after the compression. The l0               Figures 7 and 8 show respectively the effects on effective-
values show that the l value is not crucial in the definition           ness of the multiple index and multiple query strategies, for
of a PP-Index, given that the only requirement is to choose             three k values.
a l value large enough in order to perform a sufficient differ-            With respect to the multiple index strategy we have mea-
entiation of the indexed objects.                                       sured a great improvement on both measures reaching a 74%
   The graph of Figure 5 plots the search time with respect             recall (four times better than the single-index case) and a
to the size of R and the data set size, for k = 100 (single             0.7% RDE (eleven times better) for the eight index case.
index, single query). For the worst case, with |R| = 100,                  For the above mentioned eight index configuration we have
we have measured an average 0.239 seconds search time on                measured an average 1.72 seconds search time, for a com-
the 100M index, with an average of less than eight thou-                pletely sequential search process. The four index configura-
sands candidates retrieved from the data storage (see Table             tion allows to reach a 52% recall (67% for k = 10) and just
3). The search time decreases in a direct proportion the de-            a 2.2% RDE with a sub-second answer time.
crease of the z 0 value, which follows from the more detailed              It is relevant to note that, given the small memory occu-
                    0.10                                                            1e-3

                                                           k=100                               100M
                    0.08                                   k=10                     8e-4       10M
        RDE(k)                                             k=1                                 1M
                    0.06




                                                                        RDE(k)
                                                                                    6e-4

                    0.04                                                            4e-4

                    0.02                                                            2e-4

                    0.00                                                              0
                           1           2               4   8                               1     2    5   10   20   50   100
                                           |queries|                                                      k
                     1.0                                                            1.00

                               k=100
                     0.8       k=10                                                 0.99
                               k=1
        Recall(k)




                                                                        Recall(k)
                     0.6                                                            0.98

                     0.4                                                            0.97
                                                                                               100M
                     0.2                                                            0.96       10M
                                                                                               1M
                     0.0                                                            0.95
                           1           2               4   8                               1     2    5   10   20   50   100
                                           |queries|                                                      k
Figure 8: Multiple query search strategy on the                    Figure 9: Combined multiple query and multiple in-
100M index, using |R| = 1, 000 and z = 1, 000.                     dex strategies, using eight queries and eight indexes,
                                                                   using |R| = 1, 000 and z = 1, 000.
pation of the compressed prefix tree, we have been able to
simultaneously load eight 100M indexes into the memory,            to very large data sets, but just a proof of concept; (iv)
thus practically performing search on an 800 million objects       moreover, the implementation is usually designed to take in
index, though with replicated data, on a single computer.          input only a specific data type/format, which makes harder
   The multiple query strategy also shows relevant improve-        to port the application to different data types.
ments, though of minor entity with respect to the multiple            For this reasons we have currently limited our comparison
index strategy. This is in part motivated by the fact that         to replicating two experiments that, among the others, are
many of the queries, generated by permuting the elements           most closely related to our work: Batko et al. [2], which have
of the original query permutation prefix, actually resulted        run experiments using an early release of the CoPhIR data
in retrieving the same candidates of other queries2 . On the       set, and Amato and Savino [1], whose proposal is the most
100M index, for |R| = 1, 000, on the average, 1.92 distinct        closely related to our own.
queries to be effective in retrieving candidates for the two          Batko et al. [2] have run experiments on the CoPhIR data
queries configuration, 3.18 queries for the four queries config-   set, with data sets size of 10 and 50 millions images3 . They
uration, and 5.25 queries for the eight queries configuration.     reports an 80% recall level for 100-NN queries on both col-
   Figure 9 shows the effectiveness of the combined multi-         lection. For the 50M they test both a 80-CPU infrastructure
ple query and multiple index search strategies, using eight        with 250 GB RAM, keeping all the index data in memory,
queries and eight indexes, for |R| = 1, 000. We have mea-          and a 32-CPU infrastructure with 32 disks storing the index
sured an average search time of 12.45 seconds, for a fully         data, obtaining a 0.44 seconds answer time for the memory-
sequential search process. This setup produces almost exact        based solution and 1.4 seconds for the disk-based one.
results, with a recall > 97% and a RDE < 0.01%.                       The PP-Index could achieve a better performance than
   We have measured, on the average, a total of 370, 000 data      [2] by distributing the search process, yet using much less
blocks retrieved from the data storage among the average           resources than [2]. We have simulated the distribution of the
44.5 queries being effectively used to access the data storages    eight indexes on eight distinct computers, each one using two
for each original query. Although this z 0 value is relatively     processes executing four queries each, measuring the query
high, it just represents the 0.3% of the whole collection. This    answer time as the time of the slowest of the 16 processes
is a very low value considering, for example, that Lv et al.       plus the time to merge the 16 100-NN intermediate results.
[14], proposing a multiple query strategy for the LSH-Index,       We have measured an average 1.02 second answer time to
have measured a percentage of distance computations with           obtain > 95% recall on the 50M data set.
respect to the data set size, in order to obtain a 96% recall,        Amato and Savino [1] test their method on the Corel data
of 4.4% on a 1.3 million objects data set and of 6.3% on a         set4 . The data set consists of 50, 000 32-dimensions color
2.6 million objects data set.                                      HSV histograms extracted from the images. The distance
                                                                   function used to compare the histograms is L1 .
4.4          Comparison experiments                                   Replicating [1], we have selected 50 random objects as
  It is a hard task to run comparative experiments on novel        queries, and indexed the rest of the collection. Given the
and very large data sets, such as CoPhIR due to many rea-          small size of the data set, we have set |R| = 50. The time
sons: (i) lack of previous results on the same data set; (ii)      required for generating the PP-Index is 4.9 second, with a
lack of a publicly available implementation for many of the        3
                                                                     We suppose they use the same linear combination of visual
methods involved in the comparison; (iii) when an imple-           descriptors of our experiments, given that two authors are
mentation is available, it is typically not designed to scale      also the authors of [3], from which we take our weights.
                                                                   4
                                                                     http://kdd.ics.uci.edu/databases/CorelFeatures/
2
    Such candidates are read only once from the data storage.      CorelFeatures.html
disk occupation of 13 MB and a memory occupation of 450             [3] M. Batko, P. Kohoutkova, and P. Zezula. Combining
kB. In [1] the index structure generated by setting ki = 100            metric features in large collections. SISAP ’08, 1st
is estimated to require 20 MB. This value does not include              International Workshop on Similarity Search and
the HSV histograms, which are required to reorder the re-               Applications, pages 79–86, 2008.
trieved objects by the true similarity.                             [4] M. Bawa, T. Condie, and P. Ganesan. Lsh forest:
   The maximum recall level obtained in [1] for k = 50 is               self-tuning indexes for similarity search. In WWW ’05:
about 54%, requiring to read 2.4 MB of data from disk (600              Proceedings of the 14th international conference on
blocks of 4 kB size). The PP-Index, in a single-index/four-             World Wide Web, pages 651–660, Chiba, Japan, 2005.
query configuration (z = 500), obtains a 89.6% recall, re-          [5] P. Bolettieri, A. Esuli, F. Falchi, C. Lucchese,
quiring to read just 900 kB of data from disk, in four se-              R. Perego, T. Piccioli, and F. Rabitti. CoPhIR: a test
quential reads. The single-index/single-query configuration             collection for content-based image retrieval. CoRR,
obtains a 66% recall.                                                   abs/0905.4627, 2009.
5.      CONCLUSIONS                                                 [6] E. Chávez, K. Figueroa, and G. Navarro. Effective
   We have introduced the PP-Index, an approximate simi-                proximity retrieval by ordering permutations. IEEE
larity search data structure based on the use of short per-             Transactions on Pattern Analysis and Machine
mutation prefixes. We have described the PP-Index data                  Intelligence (TPAMI), 30(9):1647–1658, 2008.
structures and algorithms, including a number of optimiza-          [7] E. Chavez, G. Navarro, R. Baeza-Yates, and
tion methods and search strategies aimed at improving the               J. Marroquı́n. Proximity searching in metrix spaces.
scalability of the index, its efficiency, and its effectiveness.        ACM Computing Surveys, 33(3):273–321, 2001.
   The PP-Index has been designed to take advantage of the          [8] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An
relatively static nature one could expect from very large col-          efficient access method for similarity search in metric
lections. However, as we have described, it is easy to support          spaces. In VLDB ’97: Proceedings of the 23rd
fast update operations.                                                 International Conference on Very Large Data Bases,
   We have evaluated the PP-Index on a very large and high-             pages 426–435, Athens, Greece, 1997.
dimensional data set. Results show that it is both efficient        [9] P. Diaconis. Group representation in probability and
and effective in performing similarity search, and it scales            statistics. IMS Lecture Series, 11, 1988.
well to very large data sets.                                      [10] A. Esuli. MiPai: using the PP-Index to build an
   We have shown how a limited-resources configuration ob-              efficient and scalable similarity search system. In
tains good effectiveness results in less than a second, and             SISAP ’09, Proceedings of the 2nd International
how almost exact results are produced in a relatively short             Workshop on Similarity Search and Applications,
amount of time. The parallel processing capabilities of the             Prague, CZ, 2009.
PP-Index allow to distribute the search process in order to        [11] P. Indyk and R. Motwani. Approximate nearest
further improve its efficiency.                                         neighbors: towards removing the curse of
   The comparison with experimental results published for               dimensionality. In STOC ’98: Proceedings of the 30th
two closely related method, which are among the top-                    ACM symposium on Theory of computing, pages
performers on the task, shows that the PP-Index outper-                 604–613, Dallas, USA, 1998.
forms the compared methods, both in efficiency and effec-          [12] H. V. Jagadish, A. O. Mendelzon, and T. Milo.
tiveness. Only one [2] of the works we compare with uses a              Similarity-based queries. In PODS ’95: Proceedings of
data set of a size comparable to our largest one. We plan to            the 14th ACM SIGACT-SIGMOD-SIGART
extend the comparison with some of the competing methods,               symposium on Principles of database systems, pages
by porting them on the larger data set sizes.                           36–45, San Jose, US, 1995.
   The PP-Index has been already used to build a performing        [13] D. Knuth. The Art of Computer Programming, volume
similarity search system5 [10].                                         3: Sorting and Searching, chapter 5.4: External
   There are many aspect of our proposal that are worth to              Sorting, pages 248–379. second edition, 1998.
be further investigated. For example, the R set is a crucial       [14] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and
element of the PP-Index. We plan to study element selec-                K. Li. Multi-probe lsh: efficient indexing for
tion policies alternative to the random policy, e.g., selecting         high-dimensional similarity search. In VLDB ’07:
centroids of clusters of D, or the most frequent queries from           Proceedings of the 33rd international conference on
a query log.                                                            Very large data bases, pages 950–961, Vienna, Austria,
6.      REFERENCES                                                      2007.
    [1] G. Amato and P. Savino. Approximate similarity             [15] D. R. Morrison. Patricia—practical algorithm to
        search in metric spaces using inverted files. In                retrieve information coded in alphanumeric. J. ACM,
        INFOSCALE ’08: Proceeding of the 3rd International              15(4):514–534, 1968.
        ICST Conference on Scalable Information Systems,           [16] MPEG-7. Mpeg-7. multimedia content description
        pages 1–10, Vico Equense, Italy, 2008.                          interfaces, part3: Visual, 2002. ISO/IEC 15938-3:2002.
    [2] M. Batko, F. Falchi, C. Lucchese, D. Novak,                [17] M. Patella and P. Ciaccia. The many facets of
        R. Perego, F. Rabitti, J. Sedmidubský, and P. Zezula.          approximate similarity search. SISAP ’08, 1st
        Crawling, indexing, and similarity searching images on          International Workshop on Similarity Search and
        the web. In Proceedings of SEDB ’08, the 16th Italian           Applications, pages 10–21, 2008.
        Symposium on Advanced Database Systems, pages              [18] P. Zezula, G. Amato, V. Dohnal, and M. Batko.
        382–389, Mondello, Italy, 2008.                                 Similarity Search: The Metric Space Approach
5
    http://mipai.esuli.it/                                              (Advances in Database Systems). 2005.