-

and B. Sch¨olkopf. Large scale multiple kernel learning. Journal of Machine Learning Research

Fraunhofer FIRST's Submission to ImageCLEF2009 Photo Annotation Task: Non-sparse Multiple Kernel Learning

Alexander Binder

alexander.binder@first.fraunhofer.de 0 1

Motoaki Kawanabe

motoaki.kawanabe@first.fraunhofer.de 0 1 0 Fraunhofer Institute FIRST , Kekul ́estr. 7, 12489 Berlin , Germany 1 Measurement , Performance, Experimentation

2006

In order to achieve good performance in image annotation tasks, it is necessary to combine information from various image features. In our submission, we applied the nonsparse multiple kernel learning for feature combination proposed by Kloft et al.(2009) to the ImageCLEF2009 photo annotation data. Since some of the concepts of the ImageCLEF task are rather abstract, we conjectured that color histograms are informative for some categories such as sky and snow. Therefore we tried pyramid histograms of pixel colors. Since the images are not aligned, we sorted histograms at different places, when computing similarity of two images. Short description of our methods will be presented and obtained results will be discussed in this manuscript.

Recent research results show that combining information from various image features is inevitable to achieve good performance in image annotation tasks. With the support vector machine (SVM), this is implemented by mixing kernels (similarities between images) constructed from different image descriptors with appropriate weights. For instance, the average kernel with uniform weights or the optimal kernel trained by multiple kernel learning (MKL) have been used so far. Recently, Kloft et al.(2009) proposed the non-sparse MKL with Lp-regularizer, which bridges the abovementioned two extremes, i.e. the average kernel SVM and the standard MKL. The non-sparse MKL is successfully applied to object classification tasks; it could outperform the two baseline methods by optimizing the tuning parameter p ≥ 1 through cross validation. In our submission, we applied the novel technique for feature combination to ImageCLEF2009 photo annotation data based on bag-of-words [ 2 ] over SIFT features which is similar to a subset of the Pascal VOC 2008 winners [ 10, 3 ] submission. Furthermore, we tested sorted pyramid kernel with color histograms which will be explained in Section 2.1.1. We computed SIFT features [ 6, 7 ] on a dense grid of step size six over the color channels red, green, blue, normalized red, normalized green, normalized blue, opponent color 1, opponent color2, normalized opponent color 1,normalized opponent color 2, grey. We used radii 4,8,12,16 for the SIFT feature supports.

For the bag of words models we combined the color channels into the following five sets: grey, red-green-blue, normalized red-green-blue, opponent color 1 - opponent color2. The bag of words prototypes have been obtained using kmeans clustering with 800000 SIFT features randomly sampled from 3000 files. The number of prototypes was fixed to 4000 for each of the five channel combinations. Finally we computed the bag of words for image tilings 1x1, 2x2 and 3x1 in the sense of spatial pyrmaid levels [ 5, 1 ]. This results in 12 features of which we applied in each learning setting at most eleven due to 32 bit memory restrictions. Furthermore we employed pyramid histograms over color intensities features based on the color channels grey, opponent color 1,opponent color 2, green and blue. We chose the number of bins per pyramid cell to be ten. 2.1.1

Modifications of pyramid histograms over color intensities In order to improve the information content of the spatial pyramids for pyramid histograms over color intensities (PHoCol), we sorted the cells within a pyramid level according to the slant of the histogram. This reflects the notion that an image patch having high intensities in a certain colour channel results in an intensity histogram slanted towards higher intensities. The slant was determined via the entropy of the accumulated histogram of a histogram: ea(h)i =

X k≤i hk, a(h)i = P ea(h)i k ea(h)k , sl(h) = X −a(h)i ln(a(h)i) i In addition to that we included the difference between sort costs in the kernel. Algebraically a sort cost is just a mapping of the sort permutation of a pyramid level into the real numbers. In our specific case we resorted to sc(π) = C X max(v(π(k)) − v(k), 0)

k where v(i) is the height of the cell with index i with C being chosen so that the sort cost is upperbounded by one and lies at a similar scale compared to the χ2-distance. The intuition behind such a sort cost is to punish large changes of the vertical position of an image patch associated to a pyramid cell caused by sorting compared to the unsorted pyramid. This sort cost is obviously invariant to changes in horizontal position of pyramid cells.

The sort cost modified kernel is a product of the χ2-Kernel and a gaussian kernel over the difference of the sortcosts between two features

k(x, y) = exp(−σ(dχ2 (hx, hy) + (scx − scy)2)) . While it would be straightforward to replace this fixed weight kernel modification by an linearized MKL-approach, we omitted it due to 32 bit memory restrictions. We also experimented with an alternative formulation

k(x, y) = exp(−σ(dχ2 (hx, hy)(1 + (scx − scy)2)) for which the kernel property is not proven which however turned out to be nonnegative definite.

In the end we computed these features using tilings 4x4, no sorting, 3x1, sorted with sort cost, 16x16, sorted without sort cost over color channels noted above. While we do not regard such features as good standalone features, we realized their speed and potential value for contributing to some classifiers of the 53 classes in a proper MKL-procedure. We had low structured concepts with high variance in mind like sky, water, portrait, daytime. 2.2

Kernels 2.3 Experiments

We used the χ2-Kernel with the exception of pyramid histograms over color intensities. All kernels have been normalized.

We used support vector machines with average kernels, sparse L1-MKL, and the recently developed general non-sparse Lp-MKL [ 4 ]. The regularization parameter was fixed to one.

We submitted one single kernel result, namely phows over red-green-blue with 1x1 tiling as an early fallback, furthermore we submitted for each binary classifier the single best support vector learning algorithm measured by AP/AUC and FRR rate score over ten-fold crossvalidation (Lp-joint) using which generates three additional submissions

A) eleven bag of words kernels, namely red-green-blue, normalized red-green-blue and grey, all tilings, and opponent color 1 - opponent color2, 1x1 and 3x1 tiling.

B) nine bag of words kernels, namely red-green-blue, normalized red-green-blue and grey, all tilings, C) 10 kernels, namely bag of words red-green-blue, all tilings, grey 1x1 tiling, PHoCol over green and blue channels 4x4,3x1,16x16 tilings (various sort strategies) D) eleven kernels, namely bag of words red-green-blue and grey, all tilings, opponent color 1 opponent color2 and normalized opponent color 1 - opponent color2, both with 1x1 tiling, PHoCol over grey channels 4x4,3x1,16x16 tilings (various sort strategies) E) 10 kernels, namely bag of words red-green-blue, all tilings, grey 1x1 tiling, PHoCol over opponent color 1 and 2 channels 4x4,3x1,16x16 tilings (various sort strategies) We used for the standard SVM and the Lp-MKL code from the shogun toolbox [9, 8].

The experiments were conducted with C++-code on an Opteron-Cluster running on a 32 bit Linux which implied a hard limit for memory usage of 3GB. On average we used 20 CPUs. 2.4

Conclusion

On average combinations of only Bag of words features perform best. There exists classes for which adding the low level color descriptors improves performance. Since each binary classifier can be selected separately based on crossvalidation error adding the color features improves the final submission. By comparing the results from A and B versus C, D, and E one can clearly observe overfitting. In our point of view it comes from noisy feature extraction in connection with small sample sizes for many of the smaller concept classes. We doubt that the applied features permit a Bayes Error close to zero. In that sense there is still a need for better feature extraction despite the success story of Bag of Words representations.

Below are AP and AUC scores over 5-fold crossvalidation, the result of the best method is marked bold. For multiclass subproblems we computed for each method the average score over the subproblem and marked the subproblem in total bold. This work was supported in part by Federal Ministry of Economics and Technology of Germany under the project THESEUS (01MQ07018).

Gool, PASCAL [10] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, (in press), 2010.

[1]

Anna

Bosch ,

Andrew

Zisserman , and Xavier Mun˜oz. Representing shape with a spatial pyramid kernel . In Proceedings of the 6th ACM international conference on Image and video retrieval (CIVR '07) , pages 401 - 408 , 2007 .

[2]

Csurka ,

Bray ,

Dance , and

Fan . Visual categorization with bags of keypoints . In Workshop on Statistical Learning in Computer Vision , ECCV, pages 1 - 22 , Prague, Czech Republic, May 2004 .

[3]

Everingham , L. Van A. Zisserman. The (VOC2008) Results . voc2008/workshop/index.html, 2008 . C. K. I. Williams , J. Winn , and Visual Object Classes Challenge 2008 http://www.pascal-network.org/challenges/VOC/

[4]

Marius

Kloft , Ulf Brefeld, Pavel Laskov, and S¨oren Sonnenburg. Non-sparse multiple kernel learning , 2008. Abstract of the NIPS workshop on Kernel Learning: Automatic Selection of Optimal Kernels.

[5]

Svetlana

Lazebnik , Cordelia Schmid, and

Jean

Ponce . Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories . In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06) , volume 2 , pages 2169 - 2178 , New York, USA, 2006 .

[6]

David G.

Lowe . Distinctive image features from scale-invariant keypoints . International Journal of Computer Vision , 60 ( 2 ): 91 - 110 , 2004 .