SZTAKI @ ImageCLEF 2011∗

Bálint Daróczy, Róbert Pethes, András A. Benczúr

Data Mining and Web Search Research Group, Informatics Laboratory
Computer and Automation Research Institute of the Hungarian Academy of Sciences
{daroczyb, rpethes, benczur}@ilab.sztaki.hu

∗ This work was supported by the EU FP7 Project LAWA (Large-Scale Longitudinal Web Analytics).

Abstract

We participated in the ImageCLEF 2011 Photo Annotation and Wikipedia Image Retrieval Tasks. Our approach to the ImageCLEF 2011 Photo Annotation Task is based on a kernel weighting procedure that combines visual Fisher kernels with a Flickr-tag kernel built on the Jensen-Shannon divergence. We trained a Gaussian Mixture Model (GMM) to define a generative model over the feature vectors extracted from the image patches. To represent each image with high-level descriptors, we calculated Fisher vectors from different visual features of the images. These features were sampled at various scales and partitions, such as Harris-Laplace detected patches and scale and spatial pyramids. We calculated distance matrices from the descriptors of the training images in order to combine the different high-level descriptors with the tag-based similarity matrix. This uniform representation made it possible to learn natural per-category weights over the different types of descriptors; the re-weighting resulted in a 0.01838 MAP increase over the average kernel results. We used the weighted kernels to learn linear SVM models for each of the 99 concepts independently. For the Wikipedia Image Retrieval Task we used the search engine of the Hungarian Academy of Sciences, which is based on Okapi BM25 ranking, as our information retrieval system. We calculated light Fisher vectors to represent the content of the images and performed nearest-neighbor search on them.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

SIFT, Color moments, Gaussian mixtures, Fisher vector, kernel methods, SVM

1 Introduction

In this paper we describe our approach to the ImageCLEF 2011 Photo Annotation and Wikipedia Image Retrieval evaluation campaigns [6, 7]. Our image classification is based on a combination of visual and textual (Flickr tag) information defining uniform kernel matrices. We measured the similarity between the sets of image tags using the Jensen-Shannon divergence. We extracted normalized Fisher vectors to calculate the distance kernel for the content of the images. In the Wikipedia Image Retrieval Task we used the search engine of the Hungarian Academy of Sciences [2], which is based on Okapi BM25 [8], as our information retrieval system. We extracted light Fisher vectors for each image to measure image similarity.

2 Photo Annotation Task

To reduce the effect of noise on codebook generation and bag-of-words modeling, we smoothed the images. Although the images are not all of the same resolution, we did not rescale them, partly to avoid adding noise; moreover, with properly normalized bag-of-words modeling we did not need a fixed number of samples per image. We described the visual content of each image by approximately 13,000 descriptors per modality, sampled with a dense multi-scale grid and Harris-Laplace point detection.
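To illustrate the sampling step, the sketch below extracts SIFT descriptors on a dense multi-scale grid with OpenCV. The grid steps and patch sizes are placeholder values rather than the parameters of our system, which relies on our own feature extractor and additionally samples Harris-Laplace interest points and RGB color descriptors.

```python
import cv2

def dense_multiscale_sift(gray, steps=(8, 16, 24), sizes=(16, 24, 32)):
    """SIFT descriptors on a dense multi-scale grid; the step/size pairs
    are illustrative placeholders, one keypoint grid per scale."""
    h, w = gray.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                 for step, size in zip(steps, sizes)
                 for y in range(step // 2, h, step)
                 for x in range(step // 2, w, step)]
    # Harris-Laplace interest points would be appended to the list here
    keypoints, descriptors = cv2.SIFT_create().compute(gray, keypoints)
    return descriptors
```

The input is a smoothed grayscale image (e.g., cv2.GaussianBlur applied before conversion), matching the noise-reduction step described above.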
We calculated SIFT (Scale Invariant Feature Transform [5]) and RGB color descriptors for each patch. For each type of low-level descriptor we trained a Gaussian Mixture Model (GMM) with 256 Gaussians. Training the GMM models on about 3 million training points took 20 minutes per descriptor type with our open-source CUDA GMM implementation [3]. For the SIFT descriptors, training was performed after reducing the dimension of the descriptors to 80 by PCA. For the color moments, dimension reduction resulted in a performance loss, so we did not apply PCA to them. The normalized Fisher gradient vector computed from a GMM of SIFT descriptors is a well-known technique for representing an image with a single vector per pooling (we used 1x1, 1x3 and 2x2 spatial pyramids [9]). We also calculated Fisher vectors on the descriptors of the Harris-Laplace detected corners. Our overall procedure is shown in Fig. 1.

Figure 1: Feature extraction and classification procedure (Photo Annotation)

Our GMM procedure is based on the standard expectation maximization (EM) algorithm with a non-hierarchical structure. We resolved the well-known vulnerability of the EM algorithm to underflow, which arises especially when computing the conditional probabilities with large (50+) codebooks in fp32/fp64 precision. This is a limitation of GPGPU cards; in addition, they are usually more than twice as slow in fp64 as in fp32. Since working with logarithms instead of raw values is not sufficient by itself, we implemented a magnitude summation algorithm; the details can be found in our paper [3]. Our source code, along with preprocessed GMM models and code for Fisher vector calculation, is available free for research use at http://datamining.sztaki.hu/?q=en/GPU-GMM.

The Fisher vectors can also be computed in parallel as k · D independent calculations, where D is the dimension of the low-level features and k is the number of Gaussians. If neither the dimension nor the number of clusters exceeds the maximal number of threads per GPU block, the computation time depends only on the number of low-level features. Our implementation, which calculates the gradients for all sampling points, is seven times faster than a well-tuned locally optimal Fisher vector implementation on a fast CPU, and 44 times faster than the same algorithm run on the CPU [3]. The calculation of Fisher vectors took about 1.5 hours per modality. We extracted nine Fisher vectors per image for each modality: one Fisher vector for all the patches, one for the Harris-Laplace detected points, three for the 1x3 and four for the 2x2 spatial partitions. We normalized the resulting vectors with the Fisher information matrix, followed by power and L2 normalization. It is worth mentioning that, since our feature extraction produces a high number of descriptors for each part of the image and our Fisher vector calculation does not cut off low probabilities, the resulting Fisher vectors were highly non-sparse even without normalization.
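To make the Fisher vector step concrete, here is a minimal numpy sketch of the mean-gradient component under a diagonal GMM, with posteriors computed in the log domain to sidestep the underflow issue discussed above. The 1/sqrt(w_k) and 1/sigma factors implement the usual closed-form approximation of the Fisher information matrix normalization, followed by power and L2 normalization. This is a sketch under those assumptions, not our CUDA implementation [3], and it omits the weight and variance gradients.

```python
import numpy as np
from scipy.special import logsumexp

def fisher_vector_means(X, w, mu, sigma2):
    """Mean-gradient Fisher vector of patch descriptors X (T x D) under a
    diagonal GMM with weights w (K,), means mu (K x D), variances sigma2 (K x D)."""
    T = X.shape[0]
    # log N(x_t | mu_k, sigma2_k), evaluated entirely in the log domain
    log_prob = -0.5 * (np.log(2.0 * np.pi * sigma2).sum(axis=1)
                       + (((X[:, None, :] - mu) ** 2) / sigma2).sum(axis=2))
    log_post = np.log(w) + log_prob                    # (T, K)
    log_post -= logsumexp(log_post, axis=1, keepdims=True)
    gamma = np.exp(log_post)                           # patch-to-Gaussian posteriors
    # gradient with respect to the means: one D-dimensional block per Gaussian
    diff = (X[:, None, :] - mu) / np.sqrt(sigma2)      # (T, K, D)
    G = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    fv = G.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / np.linalg.norm(fv)                     # L2 normalization
```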
2.1 Pre-calculated kernel combination for linear SVM

Learning linear SVM models on bag-of-words representations is a widely used technique [10, 1]. One of the main problems is to define the kernel function between the training instances. We used pre-computed normalized distance kernels instead of Fisher kernels because our preliminary experiments showed better annotation performance with different training sets. Beside the gain in classification performance, the dimension of the kernels is smaller than the dimension of a regular Fisher vector (dimension of the kernel = number of training images ≈ 8k, vs. dimension of the Fisher vector = 40k–49k). In addition, with pre-computed kernels we were able to combine textual and visual kernels without increasing the dimension of the learning problem. We averaged the corresponding basic kernels of the spatial pyramids, which resulted in four kernels per low-level descriptor per image (Harris-Laplace, full image, 1x3 and 2x2 spatial pyramid). Additionally, we measured the distance between two images according to their tags with the Jensen-Shannon divergence. Our choice was inspired by the favorable similarity properties of the Jensen-Shannon divergence compared to the Kullback-Leibler divergence.

To learn the weights of the different basic kernels, we split the training set into two parts on account of the sparse annotation. We trained SVM models per category ([4]) on each part of the data and evaluated them on the other half of the training set. We used the independently evaluated predictions for the basic kernels and determined a combination of them for each category. We used the existing Flickr tags to create probability distributions. If an image had no tags, we set its similarity to zero when measured against itself and to one when measured against the other images. We assumed that the weights determined this way correspond to the ideal combination of the basic kernels. Our final weighted kernel for category c between images X and Y was

K(X, Y)_c = \frac{1}{|K|} \sum_{t=1}^{T} \sum_{k=1}^{K} \alpha_{ck} \, K_k(X, I_t) \cdot K_k(Y, I_t),    (1)

where K_k(X, I_t) denotes the basic kernels, I_t is the t-th training image and T is the number of training images in the collection. For the visual Fisher vectors we measured the similarity between two vectors with the Manhattan distance; for the Flickr-tag probability distributions we adopted the Jensen-Shannon divergence as the similarity measure:

K_k^{visual}(X, I_t) = \frac{dist_{Manhattan}(F_k(X), F_k(I_t))}{\max_X \max_t K_k(X, I_t)},    (2)

K_k^{textual}(X, I_t) = dist_{Jensen-Shannon}(X, I_t),    (3)

where F_k(X) denotes the Fisher vector of X for the k-th pooling. Since the output of our classifier was the sum of the weighted dot products between the support vectors and the test instance, we used the sigmoid function to map the output of the SVM classifier to a floating point prediction between zero and one:

Prediction_{float} = \frac{1}{1 + \exp(-svm_{output})}.    (4)

Our method gained a 1.8% increase in MAP over the average sum of the basic kernels when we also included a tag-based kernel beside the visual kernels. As seen in Table 1, when we applied the kernel weighting only to the basic visual kernels, we measured a 0.3% increase.

For the example-based evaluation we needed to define a mapping from the floating point predictions to a binary annotation. We applied two strategies. In the first method we shifted the borderline between the positive and the negative samples until the annotation of the training set reached the highest precision and recall. In the second method we assumed that the relative occurrence of a category is similar in the training and the test set, and shifted the borderline accordingly. The latter had a considerably higher F-score (0.545341 vs. 0.593088) and a higher Semantic R-Precision (0.70853 vs. 0.71928). It is worth mentioning that among our submissions the averaged visual kernel had the highest Semantic R-Precision (0.72902450).
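As a minimal numpy illustration of Eqs. (1) and (4) (the function and variable names are ours, introduced for this sketch), the per-category kernel can be assembled from the basic kernel matrices and passed to an SVM trained with a precomputed kernel:

```python
import numpy as np

def combined_kernel(basic_kernels, alpha_c):
    """Eq. (1): each entry of basic_kernels is an (n, T) matrix of the
    values K_k(X, I_t) between n images and the T training images;
    alpha_c holds the learned weights of category c."""
    K = len(basic_kernels)
    n = basic_kernels[0].shape[0]
    G = np.zeros((n, n))
    for a_ck, K_k in zip(alpha_c, basic_kernels):
        G += a_ck * (K_k @ K_k.T)   # the matrix product realizes the sum over I_t
    return G / K

def to_prediction(svm_output):
    """Eq. (4): sigmoid mapping of the raw SVM output to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-svm_output))
```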
Table 1: Photo Annotation results

Run                                Kernel aggregation   MAP        EER        AUC
visual + textual run3              weighted             0.438744   0.243574   0.827621
visual + textual run2 (we cssj)    weighted             0.436294   0.241691   0.827747
visual + textual run1 (avg cssj)   average              0.420406   0.243885   0.828322
visual run2                        weighted             0.369688   0.263449   0.806691
visual run1 (avg cns)              average              0.367054   0.264328   0.805142
textual run1 (jensen)              only one             0.345616   0.338127   0.717966

3 The Wikipedia Image Retrieval Task

We used the Hungarian Academy of Sciences search engine [2], based on Okapi BM25 ranking [8], as our information retrieval system. We indexed the English, German and French annotations and articles independently, making no distinction between the title and the body of an annotation. Since file names often contain relevant keywords, often only as substrings, we gave a score proportional to the length of the matching substring. Since indexing all substrings is infeasible, we performed this step only for those documents that already matched at least one query term in their body.

For the WikipediaMM task we also deployed query expansion using an online WordNet (http://wordnet.princeton.edu/). We added groups of synonyms with reduced weight, so that only the scores of the first few best-performing synonyms were added to the final score, in order to avoid overscoring long lists of synonyms.

Nearest-neighbor search was performed over light Fisher vectors. We call them light Fisher vectors because we calculated descriptors only on a dense grid (with a 16-pixel sampling step) and did not extract different poolings. We used the same Gaussian Mixture Model for the RGB color moment descriptors as in the Photo Annotation Task.

Our relevance feedback method used the first 10 results of the aggregated score over the three language-specific query results. We calculated the Jensen-Shannon distance from the first 10 hits to the lower ranked hits, weighted according to their rank. For the documents ranked lower (i > 9) the new score was

score_i = \frac{1}{i + 1} \sum_{j=0}^{9} \left(1 - dist_{Jensen-Shannon}(D_i, D_j)\right),    (5)

where D_i and D_j denote the probability distributions of the i-th and j-th documents. We found the text-based score more accurate; we therefore applied our relevance feedback procedure, which uses visual similarity, only to re-rank the first hundred hits. This resulted in a 0.3% increase in MAP (0.2167 vs. 0.2136).
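The re-scoring step of Eq. (5) is straightforward to state in code. In the sketch below, documents are represented by the discrete probability distributions described above; the use of base-2 logarithms inside the divergence is an assumption of the sketch.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions p and q."""
    m = 0.5 * (p + q)
    def kl(a, b):
        nz = a > 0                   # skip zero-probability terms
        return float(np.sum(a[nz] * np.log2(a[nz] / b[nz])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rescore(i, p_i, top10):
    """Eq. (5) for a document at zero-based rank i > 9: similarity to the
    top-10 distributions D_0..D_9, damped by the document's rank."""
    return sum(1.0 - js_divergence(p_i, p_j) for p_j in top10) / (i + 1)
```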
4 Conclusions

For the Photo Annotation Task, we successfully applied visual and textual kernel matrices as instance matrices for SVM learning. We used our own implementations for low-level feature extraction, GMM training and Fisher vector calculation. Our kernel weighting procedure resulted in a 1.8% improvement over the average combination of the textual and visual kernels. We plan to include new pre-computed kernels in the system and to apply generative models to improve the determination of weights for kernel aggregation. For the image retrieval task, our relevance feedback method improved the baseline Okapi BM25 based textual system. We also plan to strengthen our retrieval results by using more sophisticated methods for text and image retrieval fusion.

References

[1] J. Ah-Pine, S. Clinchant, G. Csurka, and Y. Liu. XRCE's participation in ImageCLEF 2009. 2009.

[2] András A. Benczúr, Károly Csalogány, Eszter Friedman, Dániel Fogaras, Tamás Sarlós, Máté Uher, and Eszter Windhager. Searching a small national domain—preliminary report. In Proceedings of the 12th World Wide Web Conference (WWW), Budapest, Hungary, 2003.

[3] E. Bodzsár, B. Daróczy, I. Petrás, and András A. Benczúr. GMM based Fisher vector calculation on GPGPU. http://datamining.sztaki.hu/?q=en/GPU-GMM.

[4] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[5] D. G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, volume 2, pages 1150–1157, 1999.

[6] S. Nowak and M. Huiskes. New strategies for image annotation: Overview of the photo annotation task at ImageCLEF 2010. Working Notes of CLEF 2010, 2010.

[7] A. Popescu, T. Tsikrika, and J. Kludas. Overview of the Wikipedia retrieval task at ImageCLEF 2010. Working Notes of CLEF 2010, 2010.

[8] Stephen E. Robertson and Karen Sparck Jones. Relevance weighting of search terms. In Document Retrieval Systems, pages 143–160. Taylor Graham Publishing, London, UK, 1988.

[9] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, June 2006.

[10] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1582–1596, 2010.