Visual Features Selection

Giuseppe Amato, Fabrizio Falchi, and Claudio Gennaro
ISTI-CNR, via G. Moruzzi 1, 56124 Pisa, Italy
{giuseppe.amato,fabrizio.falchi,claudio.gennaro}@isti.cnr.it

Abstract. The state-of-the-art algorithms for large-scale visual content recognition and content based similarity search today use the "Bag of Features" (BoF) or "Bag of Words" (BoW) approach. The idea, borrowed from text retrieval, enables the use of inverted files. A well known issue with the BoF approach is that query images, as well as the stored data, are described by thousands of words. This poses obvious efficiency problems when using inverted files to perform image matching. In this paper, we propose and compare various techniques for reducing the number of words describing an image in order to improve efficiency.

Keywords: bag of features, bag of words, local features, content based image retrieval, landmark recognition

1 INTRODUCTION

During the last decade, the use of local features, such as SIFT [Lowe, 2004], has gained increasing appreciation for its good performance in tasks of image matching, object recognition, landmark recognition, and image classification. The total number of local features extracted from an image depends on its visual content and size; however, the average number of features extracted from an image is in the order of thousands. The BoF approach [Sivic and Zisserman, 2003] quantizes the local features extracted from an image, representing each of them with the closest element of a fixed visual vocabulary of local features (the visual words). Matching of images represented with the BoF approach is then performed with traditional text retrieval techniques.

However, a query image is associated with thousands of visual words. Therefore, the search algorithm on inverted files has to access thousands of different posting lists. As mentioned in [Zhang et al., 2009], "a fundamental difference between an image query (e.g. 1500 visual terms) [and a text query] is largely ignored in existing index design. This difference makes the inverted list inappropriate to index images." From the very beginning [Sivic and Zisserman, 2003], word reduction techniques were used (e.g., removing the 10% most frequent visual words).

To improve efficiency, many different approaches have been considered, including GIST descriptors [Douze et al., 2009], the Fisher Kernel [Zhang et al., 2009], and the Vector of Locally Aggregated Descriptors (VLAD) [Jégou et al., 2010]. However, their usage does not allow the use of a traditional text search engine, which has been one of the main benefits of the BoF approach.

In order to mitigate the above problems, this paper proposes, discusses, and evaluates some methods to reduce the number of visual words assigned to images. This paper is a summary of a longer paper that will be presented at VISAPP 2013 [Amato et al., 2013].

2 PROPOSED APPROACH

The goal of the BoF approach is to substitute each description of the region around an interest point (i.e., each local feature) of an image with a visual word from a predefined vocabulary, in order to apply traditional text retrieval techniques to content-based image retrieval. At the end of the process, each image is described as a set of visual words. The retrieval phase is then performed using text retrieval techniques, treating a query image as a disjunctive text query. Typically, the cosine similarity measure, in conjunction with a term weighting scheme, is adopted for evaluating the similarity between any two images.
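As an illustration, the following minimal Python sketch shows how such a BoF representation can be built and compared. It assumes the SIFT descriptors of each image are already available as NumPy arrays; the function names (build_vocabulary, quantize, bof_vector, tfidf, cosine) are ours, not from the paper, and a production system with a vocabulary of a million words would use approximate or hierarchical k-means rather than the plain k-means used here.

```python
# A minimal sketch of the BoF representation described above, assuming the
# SIFT descriptors of each image are already available as NumPy arrays.
# All function names are illustrative, not taken from the paper.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(sample_descriptors: np.ndarray, n_words: int) -> KMeans:
    # Cluster a sample of local features; the centroids act as visual words.
    return KMeans(n_clusters=n_words, n_init=1).fit(sample_descriptors)

def quantize(descriptors: np.ndarray, vocabulary: KMeans) -> np.ndarray:
    # Replace each local feature with the ID of its nearest visual word.
    return vocabulary.predict(descriptors)

def bof_vector(word_ids: np.ndarray, n_words: int) -> np.ndarray:
    # Term-frequency histogram of visual words for one image.
    return np.bincount(word_ids, minlength=n_words).astype(float)

def tfidf(tf_matrix: np.ndarray) -> np.ndarray:
    # tf*idf weighting of a dataset matrix of raw term frequencies
    # (one row per image, one column per visual word).
    df = (tf_matrix > 0).sum(axis=0)                  # document frequency
    idf = np.log(len(tf_matrix) / np.maximum(df, 1))  # inverse document frequency
    return tf_matrix * idf

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity between two weighted BoF vectors.
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0
```

A query image is then answered by quantizing its features, weighting its term-frequency vector consistently with the dataset matrix, and ranking the dataset images by cosine similarity.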
The proposed word reduction criteria are: random, scale, tf, idf, and tf*idf. Each criterion is based on the definition of a score that assigns to each local feature or word describing an image an estimate of its importance. Local features or words can thus be ordered, and only the most important ones retained. The percentage of information to discard is configurable through a score threshold, allowing a trade-off between efficiency and effectiveness. The random criterion is used as a baseline: it assigns a random score to each feature. The scale criterion is based on the size of the region from which the local feature was extracted: the larger the region, the higher the score. A code sketch of these criteria is given after the pipeline description below.

The retrieval engine used in the experiments is built as follows:

1. For each image in the dataset, the SIFT local features are extracted from the regions identified around interest points.
2. A vocabulary of words is selected among all the local features using the k-means algorithm.
3. The random or scale reduction technique is applied (if requested).
4. Each image is described following the BoF approach, i.e., each local feature is replaced by the ID of the nearest word in the vocabulary.
5. The tf, idf, or tf*idf reduction technique is applied (if requested).
6. Each image of the test set is used as a query for searching in the training set. The similarity measure adopted for comparing two images is the cosine between the query vector and the image vectors corresponding to the sets of words assigned to the images. The weights assigned to the words of the vectors are calculated using the tf*idf measure.
7. If the system is requested to identify the content of the image, the landmark of the most similar (labeled) image in the dataset is assigned to the query image.
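The sketch below illustrates one possible implementation of the five selection criteria. It assumes each local feature carries the scale of its region (as SIFT keypoints do) and that images have already been quantized into word IDs; it also assumes that higher tf, idf, and tf*idf values denote higher importance, consistent with the idea of retaining the most important words. All names (score_random, score_scale, score_words, keep_top) are ours.

```python
# A minimal sketch of the five selection criteria, under the assumptions
# stated above. `idf` is a vector of inverse document frequencies over the
# whole vocabulary; `word_ids` holds the (possibly repeated) word IDs of
# one image, aligned with `scales`, the region sizes of its features.
import numpy as np

rng = np.random.default_rng(0)

def score_random(n_features: int) -> np.ndarray:
    # Baseline: importance is assigned at random.
    return rng.random(n_features)

def score_scale(scales: np.ndarray) -> np.ndarray:
    # The larger the region a feature was extracted from, the higher its score.
    return scales

def score_words(word_ids: np.ndarray, idf: np.ndarray,
                use_tf: bool = True, use_idf: bool = True) -> np.ndarray:
    # tf, idf, or tf*idf score for every word occurrence in one image.
    tf = np.bincount(word_ids, minlength=len(idf))[word_ids]
    score = np.ones(len(word_ids))
    if use_tf:
        score = score * tf
    if use_idf:
        score = score * idf[word_ids]
    return score

def keep_top(items: np.ndarray, scores: np.ndarray, fraction: float) -> np.ndarray:
    # Retain only the given fraction of items with the highest scores.
    k = max(1, int(round(fraction * len(items))))
    return items[np.argsort(scores)[::-1][:k]]
```

For example, keep_top(word_ids, score_words(word_ids, idf), 0.25) would keep the 25% of word occurrences judged most important, which is the operating point discussed in the experiments below.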
3 EXPERIMENTAL RESULTS

The quality of the retrieved images is typically evaluated by means of precision and recall. As in many other papers, we combine these measures into the mean Average Precision (mAP), which corresponds to the area under the precision-recall curve. For evaluating the performance of the various reduction techniques, we use the Oxford Buildings dataset, presented in [Philbin et al., 2007] and used in many other papers. The dataset consists of 5,062 images of 11 buildings in Oxford. The ground truth consists of 55 queries with related sets of results divided into best, correct, ambiguous, and not relevant. The vocabulary used has one million words.

[Fig. 1. Mean average precision of the various selection criteria obtained on the Oxford Buildings 5k dataset. mAP is plotted against the average number of words per image (logarithmic x-axis); curves: tf*idf, idf, tf, scale, random.]

We first report the results obtained in a content-based image retrieval scenario on the Oxford Buildings dataset, using the ground truth given by the authors [Philbin et al., 2007]. In Figure 1 we report the mAP obtained by each criterion; on the x-axis we report the average number of words per image after the reduction (note that the x-axis is logarithmic). All the reduction techniques significantly outperform the naive random approach, and both the idf and scale approaches achieve very good mAP results (about 0.5) while reducing the average number of words per image from 3,200 to 800. Thus, keeping just the 25% most relevant words preserves about 80% of the effectiveness. The comparison between the idf and scale approaches reveals that scale is preferable for reductions down to 500 words per image; discarding features extracted from small regions of interest thus appears to be particularly effective.

While the average number of words describes the length of the image description, it is the number of distinct words per image that has the greater impact on the efficiency of searching with an inverted index. Thus, in Figure 2 we report the mAP with respect to the average number of distinct words per image.

[Fig. 2. Mean average precision of the various selection criteria obtained on the Oxford Buildings 5k dataset. mAP is plotted against the average number of distinct words per image (logarithmic x-axis); curves: tf*idf, idf, tf, scale, random.]

In this case, the results obtained by tf*idf and tf are very similar to those obtained by idf. In fact, considering tf in the reduction results in a smaller average number of distinct words per image for the same values of the average number of words.

References

[Amato et al., 2013] Amato, G., Falchi, F., and Gennaro, C. (2013). On reducing the number of visual words in the bag-of-features representation. In VISAPP 2013 - Proceedings of the International Conference on Computer Vision Theory and Applications.

[Douze et al., 2009] Douze, M., Jégou, H., Sandhawalia, H., Amsaleg, L., and Schmid, C. (2009). Evaluation of GIST descriptors for web-scale image search. In Proceedings of the ACM International Conference on Image and Video Retrieval, CIVR '09, pages 19:1-19:8, New York, NY, USA. ACM.

[Jégou et al., 2010] Jégou, H., Douze, M., and Schmid, C. (2010). Improving bag-of-features for large scale image search. International Journal of Computer Vision, 87:316-336.

[Lowe, 2004] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110.

[Philbin et al., 2007] Philbin, J., Chum, O., Isard, M., Sivic, J., and Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[Sivic and Zisserman, 2003] Sivic, J. and Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2, ICCV '03, pages 1470-, Washington, DC, USA. IEEE Computer Society.

[Zhang et al., 2009] Zhang, X., Li, Z., Zhang, L., Ma, W.-Y., and Shum, H.-Y. (2009). Efficient indexing for large scale visual search. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1103-1110.