I2R ImageCLEF Photo Annotation 2009 Working Notes

Jiquan Ngiam and Hanlin Goh
Institute for Infocomm Research, Singapore, 1 Fusionopolis Way, Singapore 138632
{jqngiam, hlgoh}@i2r.a-star.edu.sg

Abstract

This paper describes the method used for our two submission runs for the ImageCLEF Photo Annotation task.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Photo annotation, Global and local features, Support Vector Machines (SVM), feature selection

1 Introduction

The ImageCLEF Photo Annotation 2009 task involved 53 concepts spanning from abstract concepts (aesthetics, blur) to visual elements (trees, people). Although an ontology was provided, our method did not rely on it heavily, except for handling disjoint concepts. Our method follows the framework in [9], involving Support Vector Machines and extended Gaussian kernels over the χ² distance. We use a variety of global features, novel local region selectors and simple greedy feature selection.

2 Image Features

2.1 Global Features

The following global features were each computed over the entire image. Each feature essentially provides a histogram for the image. In features where a quantized HSV space was used, the following quantization parameters were employed: 12 hue bins, 3 saturation bins and 3 value bins, with bins of equal width in each dimension. This results in a total of 108 bins. The choice of these parameters was motivated by [6].

2.1.1 HSV Histogram - 108 dim

The quantized HSV histogram (as above) was used as a feature vector.

2.1.2 Color Auto Correlogram (CAC) - 432 dim

We computed the CAC over the quantized HSV space with four distances (1, 3, 5, 7), giving a feature vector of 108 × 4 = 432 dimensions. For each color and distance pair (c, d), we computed the probability of finding the same color at exactly distance d away. Refer to [6] for details.

2.1.3 Color Coherence Vector (CCV) - 216 dim

We computed the CCV over the quantized HSV space. Since each pixel is in one of two states (coherent or incoherent), this gives a feature vector of 108 × 2 = 216 dimensions. We set the τ parameter to 1% of the image size. Refer to [7] for details.

2.1.4 Census Transform (CT) - 256 dim

The CT histogram is a simple transformation of each pixel into an 8-bit value based on its 8 surrounding neighbors (two states per neighbor: ≥ or < the center pixel). This provides a feature vector of 256 dimensions (a histogram of the CT values). Refer to [8] for details.

2.1.5 Edge Orientation Histogram - 37 dim

We used the LTI-Lib Canny edge detector to compute the edge orientation histogram. Each pixel is assigned to either an edge (with orientation) or a non-edge. Orientations are quantized into 5-degree angle bins, giving a total of 36 bins for 180 degrees. An additional bin is concatenated for non-edges, giving a final vector of 37 dimensions.

2.1.6 Interest Point Based SIFT - 500 dim

We used the SIFT binary provided by David Lowe [5] to compute SIFT descriptors. The descriptors were quantized into 500 visual words using a dictionary computed by k-means clustering.

2.1.7 Densely Sampled SIFT - 1500 dim

We densely sampled SIFT points at a 10-pixel spacing, 4 scales (4, 8, 12, 16 px radius) and 1 orientation. The points are similarly quantized into 1500 visual words by k-means. This scheme follows that of [1].
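To make the HSV quantization of Section 2.1.1 concrete, the sketch below computes the 108-bin histogram with NumPy. It is a minimal illustration rather than our actual code; the function name is ours, and it assumes hue in [0, 360) and saturation/value in [0, 1].

    import numpy as np

    def hsv_histogram(hsv, h_bins=12, s_bins=3, v_bins=3):
        # hsv: (H, W, 3) float array with hue in [0, 360), sat/value in [0, 1]
        h = np.minimum((hsv[..., 0] / 360.0 * h_bins).astype(int), h_bins - 1)
        s = np.minimum((hsv[..., 1] * s_bins).astype(int), s_bins - 1)
        v = np.minimum((hsv[..., 2] * v_bins).astype(int), v_bins - 1)
        idx = (h * s_bins + s) * v_bins + v   # joint bin index in [0, 108)
        hist = np.bincount(idx.ravel(), minlength=h_bins * s_bins * v_bins)
        return hist / hist.sum()              # L1-normalized histogram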
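Similarly, a minimal sketch of the Census Transform histogram of Section 2.1.4, reusing the NumPy import above. The bit ordering over the 8 neighbors is a convention we chose for illustration; see [8] for the exact formulation.

    def census_transform_histogram(gray):
        # gray: (H, W) intensity image; each interior pixel is encoded as an
        # 8-bit value, one bit per neighbor (1 if center >= neighbor)
        g = np.asarray(gray, dtype=np.int32)
        H, W = g.shape
        center = g[1:-1, 1:-1]
        code = np.zeros_like(center)
        bit = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                neighbor = g[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
                code |= (center >= neighbor).astype(np.int32) << bit
                bit += 1
        hist = np.bincount(code.ravel(), minlength=256)
        return hist / hist.sum()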
2.2 Local Region Features

For a number of concepts, the classification problem can be framed in the Multiple Instance Learning (MIL) framework. In essence, a concept (e.g. Mountain) exists if and only if a region within the image demonstrates the concept. Hence, one is motivated to consider whether performance can be improved by finding appropriate region(s) in an image.

We define a local region to be a bounding box. Given a bounding box in an image, one can compute image features similar to the global features. Hence, we define a local region feature to be a feature vector extracted from only the region within a bounding box. The problem of finding local region features is therefore reduced to one of finding good bounding boxes for each image.

2.2.1 Local Region Selectors

To find good local region selectors (bounding boxes), we frame the problem in a MIL setting. In this setting, each image is a bag of regions; a region is true iff it contains the target concept, and a bag is true iff it contains a true region. We used EM-Diverse Density [10] together with Efficient Subwindow Search (ESS) [4] to search for a target concept with good diverse density. Since ESS is able to consider all possible rectangular subwindows, the algorithm essentially considers all possible bounding boxes. However, multiple restarts are required since the algorithm is susceptible to local minima.

For each concept, we learned a local region selector based on the densely sampled SIFT features. From each local region, we extract only the HSV, interest point SIFT and densely sampled SIFT histograms. These three histograms form the (concept-specific) local features for each image.

3 Learning and Feature Selection

3.1 Support Vector Machines

For the final classification, we used LIBSVM [2] with probability estimates (provided with the software). Each concept was treated as an individual classification task. Following [9], we used extended Gaussian kernels with the χ² distance:

K(S_i, S_j) = \exp\Big( -\sum_{f \in \text{features}} \frac{1}{\mu_f}\, \chi^2\big( f(S_i), f(S_j) \big) \Big)

Both local and global features were treated in the same manner. µ_f is the average χ² distance for a particular feature; we used it to normalize the distances across the different features. A code sketch of this kernel combination is given at the end of Section 3.2. We also experimented with cost-sensitive SVMs, but the results did not differ significantly.

3.2 Feature Selection

Unsurprisingly, different features work well for different concepts. To combine different features, one could incorporate weighting into the kernel function, as adopted by the INRIA group in their VOC2007 submission [3]. However, learning these weights is non-trivial, and one often resorts to ad-hoc methods such as genetic algorithms. We chose a simpler greedy algorithm for feature selection; furthermore, we do not employ any partial weighting scheme for the features. Using the equal error rate (EER) as our performance measure, the algorithm is as follows:

Greedy Algorithm
1. F = all global features
2. For each feature f ∈ F: compute the error rate if f is removed
3. Remove the feature which results in the best improvement
4. Repeat (2-3) until removing any feature results in worse performance
5. Consider each feature f ∈ AllFeatures − F: compute the error rate if f is added
6. Consider each feature f ∈ F: compute the error rate if f is removed
7. Add or remove the feature which gives the best improvement
8. Repeat (5-7) until a local optimum is reached
9. Return F
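The sketch below illustrates the kernel of Section 3.1, using scikit-learn's SVC with a precomputed kernel as a stand-in for LIBSVM. The function names are ours, and estimating µ_f as the mean pairwise χ² distance over the training set is our reading of the normalization described above.

    import numpy as np
    from sklearn.svm import SVC  # stand-in for LIBSVM with probability estimates

    def chi2_distance(x, y, eps=1e-10):
        # symmetric chi-square distance between two L1-normalized histograms
        return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

    def combined_kernel(feature_sets):
        # feature_sets: list of (n_images, dim_f) arrays, one per feature type f
        n = feature_sets[0].shape[0]
        D = np.zeros((n, n))
        for F in feature_sets:
            Df = np.array([[chi2_distance(F[i], F[j]) for j in range(n)]
                           for i in range(n)])
            D += Df / Df.mean()   # divide by mu_f, the average chi-square distance
        return np.exp(-D)         # extended Gaussian kernel

    # usage sketch:
    # K_train = combined_kernel(train_feature_sets)
    # clf = SVC(kernel='precomputed', probability=True).fit(K_train, labels)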
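The greedy procedure can be expressed compactly as below. Here evaluate_eer is a hypothetical helper that trains the SVM on a candidate feature subset and returns the validation equal error rate; everything else follows the numbered steps above.

    def greedy_feature_selection(all_features, global_features, evaluate_eer):
        # Phase 1 (steps 1-4): backward elimination over the global features
        F = set(global_features)
        best = evaluate_eer(F)
        while True:
            scored = [(evaluate_eer(F - {f}), F - {f}) for f in F]
            eer, S = min(scored, key=lambda m: m[0])
            if eer >= best:
                break
            best, F = eer, S
        # Phase 2 (steps 5-8): best single addition or removal, to a local optimum
        while True:
            candidates = [F | {f} for f in set(all_features) - F]
            candidates += [F - {f} for f in F]
            scored = [(evaluate_eer(S), S) for S in candidates]
            eer, S = min(scored, key=lambda m: m[0])
            if eer >= best:
                return F  # step 9
            best, F = eer, S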
The appendix contains a list of concepts and the features that were selected for each of them.

3.3 Hierarchical Evaluation

While a new hierarchical measure was introduced for the task, we did not specifically optimize for it. The lack of the annotator agreement values also made it difficult to optimize for this new measure. However, for the classes that were specified as disjoint, we performed simple post-processing on the probability estimates to ensure that exactly one of the concepts is ≥ 0.5. This was achieved by simply moving the probability estimates to 0.5 ± ε.
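A minimal sketch of this post-processing for one group of disjoint concepts. The function name and the value of ε are ours, and pushing the largest estimate above the threshold is our reading of the "exactly one ≥ 0.5" constraint.

    import numpy as np

    def enforce_disjoint(probs, eps=0.01):
        # probs: SVM probability estimates for one group of disjoint concepts;
        # if the "exactly one >= 0.5" constraint is violated, move the largest
        # estimate to 0.5 + eps and all the others to 0.5 - eps
        probs = np.asarray(probs, dtype=float)
        if np.sum(probs >= 0.5) != 1:
            winner = np.argmax(probs)
            probs = np.full_like(probs, 0.5 - eps)
            probs[winner] = 0.5 + eps
        return probs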
3.4 Observations

We noticed that the SVM was robust to optional concepts and class imbalance. For optional concepts like Sunny, the SVM was able to correctly classify many "unlabelled" data points (these were classified as negative in a validation set).

The local features were useful for the following concepts: Partylife, Snow, Spring, Autumn, No Visual Season, Trees, Mountains, Macro, Portrait, Small Group, Animals, Vehicle, Overall Quality, Fancy.

4 Submissions and Results

Two final submissions were made for this task: one used all the features available, while the other used only the global features. The official results of the two runs in terms of Average Equal Error Rate (Avg. EER), Average Area Under Curve (Avg. AUC), Average Annotation Score with Annotator Agreement (Avg. AS with AA) and Average Annotation Score without Annotator Agreement (Avg. AS without AA) are reported in Table 1.

Based on the Avg. EER measure, our run with all features was ranked 6th out of 74 submitted runs, while the run using only global features was ranked 11th. Comparing the best runs from each group, we were ranked third among the twenty participating groups. Evaluated with the Average Annotation Score with and without Annotator Agreement, our runs were ranked second (global features only) and third (all features).

Submission Run                       Avg. EER   Avg. AUC   Avg. AS     Avg. AS
                                                           with AA     without AA
All Features                         0.253296   0.813893   0.82751185  0.8080815
(CVIUI2R 22 2 1244628714641.txt)
Global Features Only                 0.255945   0.811421   0.8276921   0.8082839
(CVIUI2R 22 2 1244629050173.txt)

Table 1: Results of our runs evaluated with EER, AUC and the hierarchical measures.

References

[1] Anna Bosch, Andrew Zisserman, and Xavier Muñoz. Scene classification using a hybrid generative/discriminative approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(4):712–727, 2008.

[2] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[3] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[4] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, 2008.

[5] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.

[6] T. Ojala, M. Rautiainen, E. Matinmikko, and M. Aittola. Semantic image retrieval with HSV correlograms. In 12th Scandinavian Conference on Image Analysis, pages 621–627, 2001.

[7] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In MULTIMEDIA '96: Proceedings of the fourth ACM international conference on Multimedia, pages 65–73, New York, NY, USA, 1996. ACM.

[8] Jianxin Wu and James M. Rehg. Where am I: Place instance and category recognition using spatial PACT. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, 2008.

[9] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2):213–238, 2007.

[10] Qi Zhang and Sally A. Goldman. EM-DD: An improved multiple-instance learning technique. In Advances in Neural Information Processing Systems, pages 1073–1080. MIT Press, 2001.

A Appendix: Selected Features

The features selected by the greedy algorithm for each concept (indexed 0-52) are listed below. A 'g' prefix indicates a global feature and an 'l' prefix indicates a local feature.

Concept   Features
0         gdsift1500 gsift500 gcac ldsift1500
1         gdsift1500 gsift500 gcac gccv gcthist
2         gsift500 gcac gccv gcthist
3         gdsift1500 gsift500 gcac gcthist
4         gdsift1500 gsift500 gcac ldsift1500
5         gdsift1500 gsift500 gcac ghsv
6         gdsift1500 gsift500 gcac gcthist
7         gdsift1500 gsift500 gcac
8         gdsift1500 gcac gccv ghsv gcthist
9         gdsift1500 gsift500 gcac ghsv gcthist ldsift1500
10        gsift500 gcac gccv ghsv gcthist
11        gdsift1500 gsift500 gcac gccv ghsv lsift500 ldsift1500 gedgehist
12        gdsift1500 gsift500 ghsv gcthist
13        gdsift1500 gsift500 ghsv ldsift1500
14        gdsift1500 gsift500 ghsv gcthist
15        gdsift1500 gsift500 gccv
16        gdsift1500 gsift500 gccv gcthist
17        gdsift1500 gsift500 gcac
18        gdsift1500 gsift500 gcac ghsv gedgehist
19        gdsift1500 gsift500 ldsift1500
20        gdsift1500 gcac
21        gdsift1500 gsift500 gccv
22        gdsift1500 gsift500 gccv gcthist gedgehist
23        gdsift1500 gsift500 gcac gccv
24        gsift500 gcac ghsv gcthist
25        gdsift1500 gcac gccv ghsv gcthist
26        gdsift1500 lsift500 ldsift1500 lhsv
27        gdsift1500 gsift500 gcac
28        gsift500 gcac gccv ghsv gcthist
29        gdsift1500 gsift500 gcac
30        gdsift1500 gsift500 gcac gccv gcthist
31        gdsift1500 gsift500 gcac gccv
32        gdsift1500 gsift500 gcthist
33        gdsift1500 gsift500 gcac gcthist
34        gdsift1500 gccv gcthist lsift500
35        gdsift1500 gsift500 gcac lhsv gedgehist
36        gdsift1500 gsift500 gccv ghsv
37        gdsift1500 gsift500 gccv ghsv
38        gdsift1500 gsift500 gccv
39        gdsift1500 gsift500 gcac
40        gdsift1500 gsift500 gccv ghsv gcthist
41        gdsift1500 gsift500 gcthist
42        gdsift1500 gsift500 gcthist
43        gdsift1500 gsift500 gcac gccv
44        gdsift1500 gsift500 gcthist gedgehist lsift500 lhsv
45        gdsift1500 gsift500 gccv gcthist
46        gdsift1500 gsift500 gcac
47        gdsift1500 gsift500 gcac ldsift1500 gedgehist
48        gdsift1500 gsift500 gcac gccv gcthist gedgehist
49        gdsift1500 gsift500 gcthist ldsift1500
50        gdsift1500 gsift500 gcac gcthist
51        gdsift1500 gsift500 gcac gccv ldsift1500 lsift500
52        gdsift1500 gsift500 gcac gccv ghsv ldsift1500 gedgehist