Content Based Retrieval on Very Large Visual Document Archives

                                              © Giuseppe Amato
                                                 ISTI-CNR,
                                                  Pisa, Italy
                                        firstname.lastname@isti.cnr.it


Proceedings of the 14th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections", RCDL-2012, Pereslavl-Zalesskii, Russia, October 15-18, 2012.

This work was partially supported by the VISITO Tuscany project, co-funded by the Tuscany Region under the POR CREO FESR program, and by the ASSETS project, co-funded by the European Commission within the ICT Policy Support Programme (ICT PSP).

This tutorial discusses the issues related to content-based retrieval in very large datasets of visual documents. Content-based retrieval is typically not performed on the visual content itself; rather, visual features are extracted and retrieval is performed by similarity search on the extracted features. Similarity search is a difficult task because the efficient techniques used to process database or text queries cannot be applied here. Therefore, in the last decades researchers have investigated techniques for executing similarity search efficiently and in a scalable way. One popular way to compare visual documents is to use global visual features and to measure their similarity (or dissimilarity) with a similarity (or distance) function. Various indexing strategies and search algorithms based on distance functions were defined during the last decade. A relevant research direction has been that of tree-based access methods, which allow search algorithms to inspect just a small portion of the dataset.

Limitations of tree-based techniques were addressed by defining techniques for approximate similarity search, where a significant performance improvement is obtained at the expense of some minor imprecision in the search result. Recently, permutation-based methods, where documents are represented as permutations of a set of reference objects, have been defined. In these methods, similarity between documents is approximated by comparing permutations. Permutation-based indexes allow retrieval to be executed very efficiently in datasets containing hundreds of millions of images. A new, very active field of research concerns the use of local visual features to compare and retrieve images. Local features offer much higher retrieval quality; however, the efficiency issue is orders of magnitude more difficult. Currently, most techniques are based on quantizing local features into a bag-of-words representation and on the use of inverted files (as in, for instance, Lucene). However, the association of words with local features is still a difficult task. Recent challenging research directions also include the study of techniques for answering keyword-based queries using just the visual content of documents. For instance, suppose you want to retrieve pictures of the "Pisa Leaning Tower" without using any metadata. In this case the problem is twofold: techniques that offer high accuracy and efficiency at the same time should be investigated. A very similar problem is that of automatically annotating pictures. For instance, consider the scenario where pictures taken with mobile phones are automatically annotated as soon as they are acquired.

The tutorial will offer an overview of the state of the art in this topic and will discuss open research directions.
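
As an illustration of the permutation-based idea mentioned in this abstract, the following minimal sketch represents each object by the order in which it sees a fixed set of reference objects, and approximates similarity by comparing those orderings. The 2-D points, the Euclidean distance, the function names, and the choice of the Spearman footrule to compare permutations are all assumptions made for this example; the abstract does not specify which features, distance, or permutation measure are used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def permutation(obj, references, dist=euclidean):
    """Return, for each reference object, its rank when the references
    are sorted by increasing distance from obj (0 = closest)."""
    order = sorted(range(len(references)),
                   key=lambda i: dist(obj, references[i]))
    ranks = [0] * len(references)
    for position, ref_index in enumerate(order):
        ranks[ref_index] = position
    return ranks


def spearman_footrule(ranks_a, ranks_b):
    """Sum of absolute rank differences; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(ranks_a, ranks_b))


# Hypothetical reference objects and data points (illustrative only).
references = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
query = (1.0, 2.0)
near = (2.0, 3.0)
far = (9.0, 8.0)

query_perm = permutation(query, references)
near_score = spearman_footrule(query_perm, permutation(near, references))
far_score = spearman_footrule(query_perm, permutation(far, references))
print(query_perm, near_score, far_score)
```

In this toy setting the object close to the query sees the reference objects in the same order as the query, while the distant object sees them in a very different order, so comparing permutations ranks the two candidates the same way the original distance would. The approximation is only as good as the choice and number of reference objects.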

