The University of Amsterdam’s Concept Detection System at ImageCLEF 2009

Koen E. A. van de Sande, Theo Gevers and Arnold W. M. Smeulders
Intelligent Systems Lab Amsterdam (ISLA), University of Amsterdam
ksande@uva.nl

Abstract

Our group within the University of Amsterdam participated in the large-scale visual concept detection task of ImageCLEF 2009. Our experiments focus on increasing the robustness of the individual concept detectors based on the bag-of-words approach, and less on the hierarchical nature of the concept set used. To increase the robustness of individual concept detectors, our experiments emphasize in particular the role of visual sampling, the value of color invariant features, the influence of codebook construction, and the effectiveness of kernel-based learning parameters. The participation in ImageCLEF 2009 has been successful, resulting in the top ranking for the large-scale visual concept detection task in terms of both EER and AUC. For 40 out of 53 individual concepts, we obtain the best performance of all submissions to this task. For the hierarchical evaluation, which considers the whole hierarchy of concepts instead of single detectors, using the concept likelihoods estimated by our detectors directly works better than scaling these likelihoods based on the class priors.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.4 Systems and Software; I.4.7 [Image Processing and Computer Vision]: Feature Measurement

General Terms

Performance, Measurement, Experimentation

Keywords

Color, Invariance, Concept Detection, Object and Scene Recognition, Bag-of-Words, Photo Annotation, Spatial Pyramid

1 Introduction

Robust image retrieval is highly relevant in a world that is adapting swiftly to visual communication. Online services like Flickr show that the sheer number of photos available online is too much for any human to grasp. Many people place their entire photo album on the internet. Most commercial image search engines provide access to photos based on text or other metadata, as this is still the easiest way for a user to describe an information need. The indices of these search engines are based on the filename, associated text or (social) tagging. This results in disappointing retrieval performance when the visual content is not mentioned, or not properly reflected, in the associated text. In addition, when the photos originate from non-English speaking countries, such as China or the Netherlands, querying the content becomes much harder.

Figure 1: University of Amsterdam’s ImageCLEF 2009 concept detection scheme, using the conventions shown on the right. The scheme serves as the blueprint for the organization of Section 2.

To cater for robust image retrieval, the promising solutions from the literature are in the majority concept-based [16], where detectors are related to objects, like a telephone, scenes, like a kitchen, and people, like a big group. Each of these brings an understanding of the current content. The elements in such a lexicon offer users a semantic entry by allowing them to query on the presence or absence of visual content elements.

The Large-Scale Visual Concept Detection Task [12] evaluates 53 visual concept detectors.
The concepts used are from the personal photo album domain: beach holidays, snow, plants, indoor, mountains, still-life, small group of people, portrait. For more information on the dataset and concepts used, see the overview paper [12].

Based on our previous work on concept detection [19, 15], we have focused on improving the robustness of the visual features used in our concept detectors. Systems with the best performance in image retrieval [11, 19] and video retrieval [22, 15] use combinations of multiple features for concept detection. The basis for these combinations is formed by good color features and multiple point sampling strategies.

This paper is organized as follows. Section 2 defines our concept detection system. Section 3 details our experiments and results. Finally, conclusions are drawn in Section 4.

2 Concept Detection System

We perceive concept detection as a combined computer vision and machine learning problem. Given an n-dimensional visual feature vector xi, the aim is to obtain a measure which indicates whether semantic concept ωj is present in photo i. We may choose from various visual feature extraction methods to obtain xi, and from a variety of supervised machine learning approaches to learn the relation between ωj and xi. The supervised machine learning process is composed of two phases: training and testing. In the first phase, the optimal configuration of features is learned from the training data. In the second phase, the classifier assigns a probability p(ωj|xi) to each input feature vector for each semantic concept.

2.1 Sampling Strategy

The visual appearance of a concept has a strong dependency on the viewpoint under which it is recorded. Salient point methods [17] introduce robustness against viewpoint changes by selecting points that can be recovered under different perspectives. Another solution is to simply use many points, which is achieved by dense sampling. We summarize our sampling approach in Figure 2.

Harris-Laplace point detector
In order to determine salient points, Harris-Laplace relies on a Harris corner detector. By applying it at multiple scales, it is possible to select the characteristic scale of a local corner using the Laplacian operator [17]. Hence, for each corner the Harris-Laplace detector selects a scale-invariant point if the local image structure under a Laplacian operator has a stable maximum.

Figure 2: General scheme for sampling of image regions, including Harris-Laplace and dense point selection, and a spatial pyramid. Detail of Figure 1, using the same conventions.

Dense point detector
For concepts with many homogeneous areas, like scenes, corners are often rare. Hence, for these concepts relying on a Harris-Laplace detector can be suboptimal. To counter this shortcoming of Harris-Laplace, random and dense sampling strategies have been proposed [4, 6]. We employ dense sampling, which samples an image grid in a uniform fashion using a fixed pixel interval between regions. In our experiments we use an interval distance of 6 pixels and sample at multiple scales.
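To make the dense sampling step concrete, the following sketch (in Python with NumPy, a choice made for illustration; our actual implementation differs) generates a uniform grid of sample points at a fixed 6-pixel interval for several scales. The scale values and the function name are illustrative assumptions, not the exact settings of our system.

```python
import numpy as np

def dense_sample_points(height, width, step=6, scales=(1.2, 2.0, 3.2)):
    """Return (x, y, scale) sample points on a uniform grid.

    A minimal sketch of dense sampling: one point every `step` pixels,
    repeated for each sampling scale. The scale values are illustrative.
    """
    ys, xs = np.mgrid[step // 2:height:step, step // 2:width:step]
    points = []
    for scale in scales:
        for x, y in zip(xs.ravel(), ys.ravel()):
            points.append((x, y, scale))
    return points

# Example: a 500x375 Flickr-sized image yields a few thousand regions per scale.
pts = dense_sample_points(375, 500)
print(len(pts))
```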
Spatial pyramid weighting
Both Harris-Laplace and dense sampling give an equal weight to all keypoints, irrespective of their spatial location in the image frame. To overcome this limitation, Lazebnik et al. [7] suggest repeatedly sampling fixed subregions of an image, e.g. 1x1, 2x2, 4x4, etc., and aggregating the different resolutions into a so-called spatial pyramid, which allows for region-specific weighting. Since every region is an image in itself, the spatial pyramid can be used in combination with both the Harris-Laplace point detector and dense point sampling [18]. Reported concept detection results are not yet conclusive on the ideal spatial pyramid configuration: some claim 2x2 is sufficient [7], while others suggest also including 1x3 [11]. We use a spatial pyramid of 1x1, 2x2, and 1x3 regions in our experiments.

2.2 Visual Feature Extraction

In the previous section, we addressed the dependency of the visual appearance of semantic concepts on the viewpoint under which they are recorded. However, the lighting conditions during photography also play an important role. We [19] analyzed the properties of color features under classes of illumination changes within the diagonal model of illumination change, specifically for data sets consisting of Flickr images. In ImageCLEF, the images used also originate from Flickr. Here we summarize the main findings. We present an overview of the visual features used in Figure 3. The features are computed around salient points obtained from the Harris-Laplace detector and dense sampling.

Figure 3: General scheme of the visual feature extraction methods used in our ImageCLEF 2009 experiments.

SIFT
The SIFT feature proposed by Lowe [10] describes the local shape of a region using edge orientation histograms. The gradient of an image is shift-invariant: taking the derivative cancels out offsets [19]. Under light intensity changes, i.e. a scaling of the intensity channel, the gradient direction and the relative gradient magnitude remain the same. Because the SIFT feature is normalized, gradient magnitude changes have no effect on the final feature. To compute SIFT features, we use the version described by Lowe [10].

OpponentSIFT
OpponentSIFT describes all the channels in the opponent color space using SIFT features. The information in the O3 channel is equal to the intensity information, while the other channels describe the color information in the image. The feature normalization, as effective in SIFT, cancels out any local changes in light intensity.

C-SIFT
The C-SIFT feature uses the C invariant [5], which can be intuitively seen as the gradient (or derivative) of the normalized opponent color space O1/I and O2/I. The I intensity channel remains unchanged. C-SIFT is known to be scale-invariant with respect to light intensity. See [1, 19] for a detailed evaluation.

RGB-SIFT
For RGB-SIFT, the SIFT feature is computed for each RGB channel independently. Due to the normalizations performed within SIFT, it is equal to transformed color SIFT [19]. The feature is scale-invariant, shift-invariant, and invariant to light color changes and shift.
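For reference, the opponent color channels described by OpponentSIFT, and the normalized chromatic channels underlying the C invariant, follow the standard definitions used in [19]. Below is a minimal NumPy sketch; the small epsilon guarding the division is our own addition for numerical safety.

```python
import numpy as np

def opponent_channels(rgb):
    """Convert an RGB image (H x W x 3, float) to the opponent color channels.

    O1 and O2 carry the color information and O3 the intensity; computing SIFT
    on (O1, O2, O3) gives OpponentSIFT.
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    O1 = (R - G) / np.sqrt(2.0)
    O2 = (R + G - 2.0 * B) / np.sqrt(6.0)
    O3 = (R + G + B) / np.sqrt(3.0)
    return O1, O2, O3

def c_invariant_channels(rgb, eps=1e-6):
    """Normalized opponent channels O1/I and O2/I (with I taken as O3), on which
    the gradient-based C invariant underlying C-SIFT is computed."""
    O1, O2, O3 = opponent_channels(rgb)
    return O1 / (O3 + eps), O2 / (O3 + eps), O3
```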
2.3 Codebook Transform

To avoid using all visual features in an image, while incorporating translation invariance and robustness to noise, we follow the well-known codebook approach, see e.g. [8, 6, 23, 20, 19]. First, we assign visual features to discrete codewords predefined in a codebook. Then, we use the frequency distribution of the codewords as a compact feature vector representing an image frame. Two important variables in the codebook representation are codebook construction and codeword assignment. An extensive comparison of codebook representation variables is presented by Van Gemert et al. in [20]. Here we detail codebook construction and codeword assignment using hard and soft variants, following the scheme in Figure 4.

Codebook construction
We employ k-means clustering. K-means partitions the visual feature space by minimizing the within-cluster variance for a predefined number of k clusters. The advantage of the k-means algorithm is its simplicity. A disadvantage of k-means is its emphasis on clusters in dense areas of feature space. Hence, k-means does not spread clusters evenly throughout feature space. We fix the visual codebook to a maximum of 4000 codewords.

Hard-assignment
Given a codebook of codewords, obtained from clustering, the traditional codebook approach describes each feature by the single best representative codeword in the codebook, i.e. hard-assignment. An image is then represented by a histogram of codeword frequencies describing the probability density over codewords.

Figure 4: General scheme for transforming visual features into a codebook, where we distinguish between codebook construction using clustering and codeword assignment using soft and hard variants. We combine various codeword frequency distributions into a codebook library. This then forms the input to an SVM classifier.

Soft-assignment
In a recent paper [20], it is shown that the traditional codebook approach may be improved by using soft-assignment through kernel codebooks. A kernel codebook uses a kernel function to smooth the hard-assignment of image features to codewords. Out of the various forms of kernel codebooks, we selected codeword uncertainty based on its empirical performance [20].

Codebook library
Each combination of a sampling method from Section 2.1, a visual feature extraction method from Section 2.2, a clustering method, and an assignment approach results in a separate visual codebook. An example is a codebook based on dense sampling of RGB-SIFT features in combination with hard-assignment. We collect all possible codebook combinations in a visual codebook library. Naturally, the codebooks can be combined using various configurations. For simplicity, we employ equal weights in our experiments when combining codebooks to form a library.

2.4 Kernel-based Learning

Learning robust concept detectors from large-scale visual codebooks is typically achieved by kernel-based learning methods. From all kernel-based learning approaches on offer, the support vector machine is commonly regarded as a solid choice. An overview is given together with the codebook transformations in Figure 4.

Support vector machine
We use the support vector machine framework [21] for supervised learning of concepts. Here we use the LIBSVM implementation [2] with probabilistic output [13, 9]. The parameter of the support vector machine we optimize is the slack parameter C. To handle the imbalance in the number of positive versus negative training examples, we fix the weights of the positive and negative class by estimation from the class priors on the training data. It was shown by Zhang et al. [23] that in a codebook approach to concept detection the earth mover’s distance and the χ2 kernel are to be preferred. We employ the χ2 kernel, as it is less expensive in terms of computation.
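To make the pipeline of Figure 4 concrete, the sketch below illustrates hard and soft codeword assignment followed by a χ2-kernel SVM with prior-based class weights. It is a simplified stand-in, not our actual implementation: the scikit-learn calls, the soft-assignment smoothing parameter sigma, and the exact class-weighting scheme are assumptions made for this example (our system uses LIBSVM and the codeword uncertainty formulation of [20]).

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def build_codebook(train_descriptors, k=4000):
    """Cluster local descriptors into k codewords with k-means."""
    return KMeans(n_clusters=k, n_init=1, random_state=0).fit(train_descriptors).cluster_centers_

def hard_assign(descriptors, codebook):
    """Histogram of the single nearest codeword per descriptor (hard-assignment)."""
    nearest = cdist(descriptors, codebook).argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def soft_assign(descriptors, codebook, sigma=90.0):
    """Codeword uncertainty: each descriptor votes for all codewords with a
    Gaussian kernel on its distance to the codeword, normalized per descriptor.
    The value of sigma here is an illustrative assumption."""
    d = cdist(descriptors, codebook)
    w = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)
    hist = w.sum(axis=0)
    return hist / hist.sum()

def train_concept_detector(histograms, labels, C=1.0):
    """Chi-square kernel SVM with probabilistic output and class weights set
    from the class priors (inverse-prior weighting is an assumption here)."""
    labels = np.asarray(labels)
    K = chi2_kernel(histograms)  # precomputed chi-square kernel between training images
    weights = {0: labels.mean(), 1: 1.0 - labels.mean()}
    clf = SVC(kernel="precomputed", C=C, class_weight=weights, probability=True)
    clf.fit(K, labels)
    return clf
```

At test time, the kernel between test and training histograms, chi2_kernel(test_hists, histograms), would be passed to clf.predict_proba to obtain a concept likelihood per image.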
3 Concept Detection Experiments

3.1 Submitted Runs

We have submitted five different runs. All runs use both Harris-Laplace and dense sampling with the SVM classifier. We do not use the EXIF metadata provided for the photos. Our system has been developed based on the PASCAL VOC [3] and TRECVID Sound and Vision datasets [14]. For ImageCLEF, we have learned new concept models based on the provided annotations. The only parameter specifically optimized for this dataset is the slack parameter C of the SVM. All other parameter settings are the same as in our PASCAL VOC 2008 system [19]. Extracting features, training models and applying those models on the test set was finished within 72 hours.

• OpponentSIFT: single color descriptor with hard assignment.
• 2-SIFT: two color descriptors (OpponentSIFT and SIFT) with hard assignment.
• 4-SIFT: four color descriptors (OpponentSIFT, C-SIFT, RGB-SIFT and SIFT) with hard assignment.
• Rescaled 4-SIFT: the same ordering of images as 4-SIFT, but with all concept detector outputs linearly scaled so that the number of images with a score > 0.5 is equal to the concept prior probability in the training set.
• Soft 4-SIFT: four color descriptors (OpponentSIFT, C-SIFT, RGB-SIFT and SIFT) with soft assignment. The soft assignment parameters have been taken from our PASCAL VOC 2008 system [19].

3.2 Evaluation Per Concept

In Table 1, the overall scores for the evaluation of concept detectors are shown. Since, for the evaluation of single detectors, only the ranking of the images within a single concept matters, the rescaled version of 4-SIFT achieves exactly the same performance as 4-SIFT. We note that the 4-SIFT run with hard assignment achieves the highest performance not only amongst our runs, but also amongst all other runs submitted to the Large-Scale Visual Concept Detection task.

Run name          Codebook           Average EER    Average AUC
4-SIFT            Hard-assignment    0.2345         0.8387
Rescaled 4-SIFT   Hard-assignment    0.2345         0.8387
Soft 4-SIFT       Soft-assignment    0.2355         0.8375
2-SIFT            Hard-assignment    0.2435         0.8300
OpponentSIFT      Hard-assignment    0.2530         0.8217

Table 1: Overall results of the University of Amsterdam runs, evaluated over all concepts in the Large-Scale Visual Concept Detection Task using the equal error rate (EER) and the area under the curve (AUC).
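For clarity, the two measures reported in Table 1 can be computed per concept from the detector scores roughly as follows; this sketch uses scikit-learn's ROC utilities as a stand-in for the official evaluation tool, so it is an approximation rather than the exact evaluation procedure.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def eer_and_auc(y_true, scores):
    """Equal error rate (EER) and area under the ROC curve (AUC) for one concept.

    The EER is the point on the ROC curve where the false positive rate equals
    the false negative rate (1 - true positive rate); here it is approximated by
    the curve point that minimizes their absolute difference.
    """
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    eer = fpr[np.argmin(np.abs(fpr - fnr))]
    auc = roc_auc_score(y_true, scores)
    return eer, auc
```

The averages in Table 1 are then the means of these per-concept values over all 53 concepts.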
In Table 2, the Area Under the Curve scores are split out per concept. We observe that the three aesthetic concepts have the lowest scores. This comes as no surprise, because these concepts are highly subjective: even human annotators only agree with each other around 80% of the time. For virtually all concepts besides the aesthetic ones, either the Soft 4-SIFT or the Hard 4-SIFT run is the best run. This confirms our belief that these (color) descriptors are not redundant when used in combination. Therefore, we recommend using these four descriptors instead of one or two. The difference in overall performance between the Soft 4-SIFT and the Hard 4-SIFT runs is quite small. Because the soft codebook assignment smoothing parameter was taken directly from a different dataset, we expect that the soft assignment run could be improved if this parameter were selected with cross-validation on the training set. Together, our runs obtain the highest Area Under the Curve scores for 40 out of 53 concepts in the Photo Annotation task (20 for Soft 4-SIFT, 17 for 4-SIFT and 3 for the other runs).

Concept                 4-SIFT   Soft 4-SIFT   2-SIFT   OpponentSIFT
Clouds                  0.958    0.958         0.951    0.945
Sunset-Sunrise          0.953    0.954         0.947    0.946
Sky                     0.945    0.948         0.935    0.930
Landscape-Nature        0.944    0.942         0.940    0.936
Sea                     0.935    0.930         0.932    0.926
Mountains               0.934    0.931         0.930    0.922
Lake                    0.911    0.903         0.912    0.900
Beach-Holidays          0.906    0.907         0.898    0.884
Trees                   0.903    0.902         0.892    0.881
Water                   0.901    0.903         0.892    0.886
Night                   0.898    0.895         0.895    0.892
River                   0.897    0.889         0.891    0.883
Outdoor                 0.890    0.896         0.879    0.871
Food                    0.895    0.895         0.881    0.877
Desert                  0.891    0.865         0.891    0.884
Building-Sights         0.880    0.882         0.873    0.861
Big-Group               0.881    0.877         0.870    0.858
Plants                  0.877    0.881         0.853    0.839
Flowers                 0.868    0.875         0.846    0.836
Autumn                  0.870    0.866         0.863    0.849
Portrait                0.865    0.864         0.857    0.846
Underexposed            0.858    0.859         0.857    0.854
No-Persons              0.850    0.858         0.837    0.826
Partly-Blurred          0.852    0.852         0.845    0.830
Winter                  0.843    0.846         0.832    0.828
Snow                    0.846    0.845         0.829    0.825
Day                     0.841    0.845         0.831    0.824
No-Blur                 0.843    0.845         0.836    0.823
No-Visual-Time          0.833    0.835         0.822    0.815
Indoor                  0.830    0.835         0.823    0.810
Family-Friends          0.834    0.834         0.822    0.813
Partylife               0.834    0.834         0.831    0.819
Vehicle                 0.832    0.832         0.832    0.822
Animals                 0.818    0.828         0.811    0.797
Citylife                0.826    0.826         0.819    0.813
Still-Life              0.824    0.825         0.808    0.795
Spring                  0.822    0.801         0.812    0.791
Canvas                  0.817    0.810         0.803    0.790
Summer                  0.813    0.813         0.791    0.782
Macro                   0.812    0.791         0.805    0.795
No-Visual-Season        0.805    0.806         0.794    0.782
Small-Group             0.792    0.795         0.784    0.776
Single-Person           0.792    0.795         0.780    0.769
Out-of-focus            0.792    0.781         0.784    0.774
No-Visual-Place         0.789    0.786         0.781    0.779
Overexposed             0.788    0.782         0.777    0.771
Neutral-Illumination    0.778    0.783         0.775    0.774
Sunny                   0.763    0.765         0.744    0.741
Motion-Blur             0.744    0.747         0.725    0.710
Sports                  0.695    0.695         0.679    0.673
Aesthetic-Impression    0.658    0.662         0.657    0.657
Overall-Quality         0.656    0.656         0.653    0.658
Fancy                   0.565    0.559         0.580    0.583
Average                 0.8387   0.8375        0.8300   0.8217

Table 2: Results per concept for our runs in the Large-Scale Visual Concept Detection Task using the Area Under the Curve. The concepts are ordered by the highest score obtained per concept.

This analysis has shown us that our system falls behind for concepts that correspond to conditions against which we have built in invariance. Our method is designed to be robust to unsharp images, so for Out-of-focus, Partly-Blurred and No-Blur there are better approaches possible. For the concepts Overexposed, Underexposed, Neutral-Illumination, Night and Sunny, recognizing how the scene is illuminated is very important. Because we are using invariant color descriptors, a lot of the discriminative lighting information is no longer present in the descriptors. Again, there should be better approaches possible for these concepts, such as estimating the color temperature and the overall light intensity.
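As a purely hypothetical illustration of the non-invariant cues mentioned above, the sketch below computes two crude global statistics, the mean intensity and a blue-to-red channel ratio as a rough color temperature proxy, that could complement invariant descriptors for the exposure- and illumination-related concepts. These features are not part of any submitted run.

```python
import numpy as np

def global_illumination_features(rgb):
    """Two crude global cues (illustrative only, not part of our runs):
    mean intensity for over-/under-exposure, and the ratio of the blue and
    red channel means as a very rough proxy for color temperature."""
    intensity = rgb.mean()
    blue_red_ratio = rgb[..., 2].mean() / (rgb[..., 0].mean() + 1e-6)
    return np.array([intensity, blue_red_ratio])
```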
Our system was developed on other datasets, and only the concept models were specifically learned for the Photo Annotation dataset. Its good performance on this dataset, without changing the parameter settings, shows that it is generic and generalizes to multiple datasets. But our system only performs well on this dataset because the train and test sets come from the same source and have been obtained at the same time. Generalization across the boundary of multiple datasets is still an unsolved problem: for photos downloaded from Flickr in a different season, or for general web images, the performance will be significantly worse. However, all systems participating in the Photo Annotation task are ‘overtrained’ in this sense, and the models they learned are too specific. An interesting avenue for future editions is to have a second test set with photos from a different source or moment in time, so that this problem can be investigated further.

3.3 Evaluation Per Image

For the hierarchical evaluation, overall results are shown in Table 3. When compared to the evaluation per concept, the Soft 4-SIFT run is now slightly better than the normal 4-SIFT run. Our attempt to improve performance for the hierarchical evaluation measure using a linear rescaling of the concept likelihoods has had the opposite effect: the normal 4-SIFT run is better than the Rescaled 4-SIFT run. Therefore, further investigation into building a cascade of concept classifiers is needed, as simply using the individual concept classifiers with their class priors does not work.

                                     Average Annotation Score
Run name          Codebook           with agreement   without agreement
Soft 4-SIFT       Soft-assignment    0.7831           0.7598
4-SIFT            Hard-assignment    0.7812           0.7578
2-SIFT            Hard-assignment    0.7780           0.7544
OpponentSIFT      Hard-assignment    0.7705           0.7464
Rescaled 4-SIFT   Hard-assignment    0.7503           0.7312

Table 3: Results using the hierarchical evaluation measures for our runs in the Large-Scale Visual Concept Detection Task.
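For completeness, one possible reading of the linear rescaling used in the Rescaled 4-SIFT run is sketched below: a rank-preserving, piecewise-linear remapping of the concept likelihoods such that roughly a prior-sized fraction of images scores above 0.5. The exact mapping used in our run may differ; this is an illustrative assumption.

```python
import numpy as np

def rescale_to_prior(scores, prior):
    """Piecewise-linear, rank-preserving remapping of concept likelihoods.

    Scores above the (1 - prior) quantile are mapped into (0.5, 1], the rest
    into [0, 0.5], so that roughly a `prior` fraction of images ends up with a
    score above 0.5. One plausible interpretation of the Rescaled 4-SIFT run.
    """
    scores = np.asarray(scores, dtype=float)
    t = np.quantile(scores, 1.0 - prior)  # threshold placed at the class prior
    lo, hi = scores.min(), scores.max()
    return np.where(
        scores <= t,
        0.5 * (scores - lo) / max(t - lo, 1e-12),
        0.5 + 0.5 * (scores - t) / max(hi - t, 1e-12),
    )
```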
4 Conclusion

Our focus on invariant visual features for concept detection in ImageCLEF 2009 has been successful. It has resulted in the top ranking for the large-scale visual concept detection task in terms of both EER and AUC. For 40 out of 53 individual concepts, we obtain the best performance of all submissions to the task. For the hierarchical evaluation, using the concept likelihoods estimated by our detectors directly works better than scaling these likelihoods based on the class priors.

Acknowledgements

This work was supported by the EC-FP6 VIDI-Video project.

References

[1] G. J. Burghouts and J. M. Geusebroek. Performance evaluation of local color invariants. Computer Vision and Image Understanding, 113:48–62, 2009.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[3] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results.
[4] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 524–531, San Diego, USA, 2005.
[5] J. M. Geusebroek, R. van den Boomgaard, A. W. M. Smeulders, and H. Geerts. Color invariance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12):1338–1350, 2001.
[6] F. Jurie and B. Triggs. Creating efficient codebooks for visual recognition. In IEEE International Conference on Computer Vision, pages 604–610, Beijing, China, 2005.
[7] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 2169–2178, New York, USA, 2006.
[8] T. K. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, 43(1):29–44, 2001.
[9] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt’s probabilistic outputs for support vector machines. Machine Learning, 68(3):267–276, 2007.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[11] M. Marszalek, C. Schmid, H. Harzallah, and J. van de Weijer. Learning object representations for visual object class recognition, 2007. Visual Recognition Challenge workshop, in conjunction with the IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil.
[12] S. Nowak and P. Dunker. Overview of the CLEF 2009 large-scale visual concept detection and annotation task. In CLEF Working Notes 2009, Corfu, Greece, 2009.
[13] J. C. Platt. Probabilities for SV machines. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. MIT Press, 2000.
[14] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In ACM International Workshop on Multimedia Information Retrieval, pages 321–330, Santa Barbara, USA, 2006.
[15] C. G. M. Snoek, K. E. A. van de Sande, O. de Rooij, B. Huurnink, J. C. van Gemert, J. R. R. Uijlings, et al. The MediaMill TRECVID 2008 semantic video search engine. In Proceedings of the 6th TRECVID Workshop, Gaithersburg, USA, November 2008.
[16] C. G. M. Snoek and M. Worring. Concept-based video retrieval. Foundations and Trends in Information Retrieval, 4(2):215–322, 2009.
[17] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: A survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2008.
[18] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. A comparison of color features for visual concept classification. In ACM International Conference on Image and Video Retrieval, pages 141–150, 2008.
[19] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, (in press), 2010.
[20] J. C. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J. M. Geusebroek. Visual word ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence, (in press), 2010.
[21] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, USA, 2nd edition, 2000.
[22] D. Wang, X. Liu, L. Luo, J. Li, and B. Zhang. Video diver: generic video indexing with diverse features. In ACM International Workshop on Multimedia Information Retrieval, pages 61–70, Augsburg, Germany, 2007.
[23] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2):213–238, 2007.