Content Based Retrieval on Very Large Visual Document Archives

© Giuseppe Amato
ISTI-CNR, Pisa, Italy
firstname.lastname@isti.cnr.it

Proceedings of the 14th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" - RCDL-2012, Pereslavl-Zalesskii, Russia, October 15-18, 2012.

This work was partially supported by the VISITO Tuscany project, co-funded by the Tuscany Region under the POR CREO FESR program, and by the ASSETS project, co-funded by the European Commission within the ICT Policy Support Programme (ICT PSP).

This tutorial discusses the issues related to content-based retrieval in very large datasets of visual documents. Content-based retrieval is typically not performed on the visual content itself; rather, visual features are extracted, and retrieval is performed by similarity search on the extracted features. Similarity search is a difficult task because the efficient techniques used to process database or text queries cannot be applied here. Therefore, in the last decades researchers have investigated techniques for executing similarity search efficiently and in a scalable way. One popular way to compare visual documents is to use global visual features and to measure their similarity (or dissimilarity) with a similarity (or distance) function. Various indexing strategies and search algorithms based on distance functions were defined during the last decade. A relevant research direction has been that of tree-based access methods, which allow search algorithms to inspect just a small portion of the dataset.

Limitations of tree-based techniques were addressed by defining techniques for approximate similarity search, where a significant improvement in efficiency is obtained at the expense of some minor imprecision in the search result. Recently, permutation-based methods, where documents are represented as permutations of a set of reference objects, have been defined. In these methods, similarity between documents is approximated by comparing permutations. Permutation-based indexes allow retrieval to be executed very efficiently in datasets containing hundreds of millions of images.

A new, very active field of research is the use of local visual features to compare and retrieve images. Local features offer much higher retrieval quality; however, the efficiency issue is orders of magnitude more difficult. Currently, most techniques are based on quantizing local features into a bag-of-words representation and on the use of inverted files (as, for instance, in Lucene). However, the association of words with local features is still a difficult task.

Recent challenging research directions also include the study of techniques for answering keyword-based queries using just the visual content of documents. For instance, suppose you want to retrieve pictures of the "Pisa Leaning Tower" without using any metadata. In this case the problem is twofold. Techniques that offer high accuracy and efficiency at the same time should be investigated. A very similar problem is that of automatically annotating pictures. For instance, consider the scenario where pictures taken with mobile phones are automatically annotated as soon as they are acquired.

The tutorial will offer an overview of the state of the art in this topic and will discuss open research directions.
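
To make the permutation-based idea mentioned in this abstract more concrete, the following is a minimal sketch and not the author's implementation. It assumes that features are plain numeric vectors compared with Euclidean distance and that permutations are compared with the Spearman footrule distance. The names `permutation`, `footrule`, `search`, `refs`, and `dataset`, as well as the toy data, are hypothetical. For brevity it scans all stored permutations linearly, whereas a real index would organize the permutations (for example in an inverted file) to avoid such a scan.

```python
import math

def permutation(obj, refs):
    """Return the reference-object indices ordered from closest to farthest from obj."""
    return sorted(range(len(refs)), key=lambda i: math.dist(obj, refs[i]))

def footrule(perm_a, perm_b):
    """Spearman footrule: sum of the position differences of each reference object."""
    pos_b = {ref: rank for rank, ref in enumerate(perm_b)}
    return sum(abs(rank - pos_b[ref]) for rank, ref in enumerate(perm_a))

# Hypothetical reference objects and dataset (2-D vectors stand in for visual features).
refs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
dataset = {"a": (1.0, 2.0), "b": (9.0, 1.0), "c": (1.0, 8.0)}

# The "index" stores only the permutation of each document, not its features.
index = {name: permutation(vec, refs) for name, vec in dataset.items()}

def search(query, k=1):
    """Approximate k-nearest-neighbour search by comparing permutations only."""
    q_perm = permutation(query, refs)
    return sorted(index, key=lambda name: footrule(q_perm, index[name]))[:k]

print(search((2.0, 3.0), k=2))
```

With so few reference objects many documents share the same permutation, so the ranking is coarse; the approximation typically becomes finer as the number of reference objects grows.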