LSIS Scaled Photo Annotations: Discriminant Features SVM versus Visual Dictionary based on Image Frequency

Zhong-Qiu ZHAO

zhongqiuzhao@gmail.com 0 1 2

Herve GLOTIN

glotin@univ-tln.fr 0 2

Emilie DUMONT

emilie.r.dumont@gmail.com 0 2

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Cross-Language Retrieval in Image Collections (ImageCLEF)|ImageCLEFphoto

Keywords: LDA, Visual Dictionary, Generalized Descriptor of Fourier, Profile Entropy Feature, SVM

1 School of Computer & Information, Hefei University of Technology, China

2 University of Sud Toulon-Var, USTV, France

In this paper, we use only visual information to address the ImageCLEF 2009 Photo Annotation Task. First, we extract various visual features: HSV and EDGE histograms, Gabor, and the recent Descriptor of Fourier and Profile Entropy Features. Then, for each concept and feature set, we compute a Linear Discriminant Analysis (LDA) to reduce the impact of the high dimension. Finally, we train support vector machines (SVMs), whose outputs are taken as the confidences with which the samples belong to the concept. We also propose a second model, an improved version of a Visual Dictionary (VD), which is built from visual words extracted from frequency templates in the training set. We describe the results of these two models topic by topic, and we give perspectives for our VD method, which is faster than SVM and better than SVM for some topics. Our SVM(LDA) run attains an AUC score of 0.721, which places it 8th by AUC among the 19 teams involved in this campaign, while our VD model would occupy the 10th rank.

Introduction

We use the various features described in the next section: PEF [2, 3], HSV and EDGE histograms, the new Descriptor of Fourier [7], and Gabor. Then we use an LDA to reduce their dimension. Finally, we use the Least Squares support vector machine (LS-SVM) to produce concept similarity. Another original method, called Visual Dictionary, is proposed and implemented in Section 5 (Visual Dictionary Method).

Visual Features

We use a new feature, the pixel 'profile' entropy (PEF) [2], which gives the entropy of pixel profiles in the horizontal and vertical directions. The advantage of PEF is that it combines raw shape and texture representations at a low CPU cost, and it has already given good performance (second best rank in the official ImagEval 2006 campaign; see www.imageval.org).

Here we use the extended PEF [3], which also uses the harmonic mean of the pixels of each row or column. The idea is that the object or pixel region distribution, which is lost in the arithmetic mean projection, could be partly represented by the harmonic mean. These two projections are then expected to give complementary and/or concept-dependent information. PEF are computed on three equal horizontal and vertical image slices, yielding a total of 150 dimensions.
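To make the idea concrete, here is a minimal sketch of a PEF-like descriptor. It is not the implementation of [2, 3]: the histogram bin count, the `eps` guard in the harmonic mean, the way slices are enumerated, and the function names `profile_entropy`, `harmonic_mean` and `pef_sketch` are all assumptions made for illustration, and the sketch produces 24 values rather than the 150 dimensions used in our runs.

```python
import numpy as np

def profile_entropy(profile, bins=10):
    """Shannon entropy (in bits) of the histogram of a 1-D profile."""
    hist, _ = np.histogram(profile, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def harmonic_mean(a, axis, eps=1e-6):
    """Harmonic mean along an axis; eps (an assumption) avoids division by zero."""
    return a.shape[axis] / (1.0 / (a + eps)).sum(axis=axis)

def pef_sketch(gray, bins=10):
    """gray: 2-D float array in [0, 1]. Returns a small PEF-like vector."""
    # Three horizontal and three vertical slices (assumed enumeration).
    slices = np.array_split(gray, 3, axis=0) + np.array_split(gray, 3, axis=1)
    feats = []
    for region in slices:
        for mean_fn in (np.mean, harmonic_mean):
            # axis=0 gives one value per column, axis=1 one value per row.
            feats.append(profile_entropy(mean_fn(region, axis=0), bins))
            feats.append(profile_entropy(mean_fn(region, axis=1), bins))
    return np.array(feats)
```

A perfectly uniform image gives zero entropy for every profile, while textured regions spread the profile values over several bins and raise the entropy.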

We also use classical features: HSV and EDGE histograms, Gabor, and the recent Descriptor of Fourier, which is robust to rotation [7]. We train our two models on these features, which represent a total of 400 dimensions. We use LDA to reduce the feature dimensions, as described in the next section.

Linear Discriminant Analysis

In general, LDA [11] is used to find an optimal subspace for classification in which the ratio of the between-class scatter to the within-class scatter is maximized. Let the between-class scatter matrix be defined as

$$S_B = \sum_{i=1}^{c} n_i (\bar{X}^i - \bar{X})(\bar{X}^i - \bar{X})^T$$

and the within-class scatter matrix be defined as

$$S_W = \sum_{i=1}^{c} \sum_{X_k \in C_i} (X_k - \bar{X}^i)(X_k - \bar{X}^i)^T$$

where $\bar{X} = (\sum_{j=1}^{n} X_j)/n$ is the mean image of the ensemble, $\bar{X}^i = (\sum_{j=1}^{n_i} X_j^i)/n_i$ is the mean image of the $i$th class, $n_i$ is the number of samples in the $i$th class, $c$ is the number of classes, and $C_i$ is the $i$th class. As a result, the optimal subspace $E_{optimal}$ given by the LDA can be determined as follows:

$$E_{optimal} = \arg\max_{E} \frac{|E^T S_B E|}{|E^T S_W E|} = [E_1, E_2, \ldots, E_{c-1}]$$

where $[E_1, E_2, \ldots, E_{c-1}]$ is the set of generalized eigenvectors of $S_B$ and $S_W$ corresponding to the $c-1$ largest generalized eigenvalues $\lambda_i$, $i = 1, 2, \ldots, c-1$, i.e.,

$$S_B E_i = \lambda_i S_W E_i, \quad i = 1, 2, \ldots, c-1$$

Thus, the feature vector $P$ for any query image $X$, in the most discriminant sense, can be calculated as follows:

$$P = E_{optimal}^T U^T X$$

In our image retrieval task, LDA outputs only 1 dimension, since the classification problem for each concept is 2-class ($c - 1 = 1$).
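For the two-class case, the single discriminant direction has a closed form: the between-class scatter has rank one, so the generalized eigenvector is proportional to the inverse within-class scatter applied to the difference of the class means. The sketch below illustrates this with NumPy. The function name `fisher_direction` and the small `ridge` term (added so that the within-class scatter is always invertible) are assumptions for illustration and are not taken from our runs.

```python
import numpy as np

def fisher_direction(X, y, ridge=1e-6):
    """Two-class LDA direction for samples X (n x d) and labels y in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the scatter matrices of both classes.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    Sw += ridge * np.eye(X.shape[1])  # assumed regularization
    # For c = 2 the generalized eigenvector is proportional to Sw^{-1}(m1 - m0).
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

# Usage: project the features of one concept onto one dimension.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(1, 1, (50, 5))])
y = np.r_[np.zeros(50), np.ones(50)]
w = fisher_direction(X, y)
projected = X @ w  # 1-D input for the classifier
```

The projected value is the one-dimensional feature that is passed on to the classifier for that concept.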

Fast classification using Least Squares Support Vector Machines

In order to design fast image retrieval systems, we use the Least Squares Support Vector Machine (LS-SVM). The SVM [1] first maps the data into a higher-dimensional feature space by means of a kernel function, and then learns a separating hyperplane that maximizes the margin. Because of its good generalization capability, this technique has been widely applied in many areas such as face detection and image retrieval [4, 5]. The SVM is typically based on an ε-insensitive cost function, meaning that approximation errors smaller than ε do not increase the cost function value. This results in a convex quadratic optimization problem. Instead of an ε-insensitive cost function, a quadratic cost function can be used. The least squares support vector machines (LS-SVM) [6] are reformulations of the standard SVMs that lead to solving linear KKT systems instead, which is computationally attractive. Thus, in all our experiments, we use LS-SVMlab1.5 (http://www.esat.kuleuven.ac.be/sista/lssvmlab/).

In our experiments, the RBF kernel

$$K(x_1, x_2) = \exp(-|x_1 - x_2|^2 / \sigma^2)$$

is selected as the kernel function of our LS-SVM. There is thus a corresponding parameter, $\sigma^2$, to be tuned. A large value of $\sigma^2$ indicates stronger smoothing. Moreover, there is another parameter, $\gamma$, that needs tuning to find the tradeoff between minimizing the complexity of the model and fitting the training data points well.

We tune these two parameters over a grid of values, with $\sigma^2$ = [4, 25, 100, 400, 600, 800, 1000, 2000]. Many SVMs were thus constructed for each SVM model, and we then selected the best SVM using the validation set.
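The following NumPy sketch shows what training an LS-SVM classifier amounts to: one linear KKT system per parameter pair, following the formulation of [6]. It is an illustration only and does not reproduce LS-SVMlab. The function names, the use of validation accuracy as the selection criterion, and the values in `gam_grid` are assumptions; only the σ² values come from our experiments.

```python
import numpy as np

def rbf_kernel(A, B, sig2):
    """K(a, b) = exp(-|a - b|^2 / sig2) for all rows of A and B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / sig2)

def lssvm_train(X, y, gam, sig2):
    """Solve the LS-SVM KKT system; y must contain -1 / +1 labels."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = np.outer(y, y) * rbf_kernel(X, X, sig2) + np.eye(n) / gam
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # bias b, multipliers alpha

def lssvm_decision(X_train, y_train, alpha, b, X_new, sig2):
    """Real-valued output, usable as a confidence score."""
    return rbf_kernel(X_new, X_train, sig2) @ (alpha * y_train) + b

def select_best(X_tr, y_tr, X_val, y_val, gam_grid, sig2_grid):
    """Keep the (gam, sig2) pair with the best validation accuracy (assumed criterion)."""
    best = None
    for gam in gam_grid:
        for sig2 in sig2_grid:
            b, alpha = lssvm_train(X_tr, y_tr, gam, sig2)
            scores = lssvm_decision(X_tr, y_tr, alpha, b, X_val, sig2)
            acc = float(np.mean(np.sign(scores) == y_val))
            if best is None or acc > best[0]:
                best = (acc, gam, sig2)
    return best

# Usage on 1-D inputs, as produced by the LDA projection.
X_tr = np.array([[-3.0], [-2.5], [-2.0], [2.0], [2.5], [3.0]])
y_tr = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
b, alpha = lssvm_train(X_tr, y_tr, gam=10.0, sig2=4.0)
scores = lssvm_decision(X_tr, y_tr, alpha, b, np.array([[-2.2], [2.2]]), sig2=4.0)
sig2_grid = [4, 25, 100, 400, 600, 800, 1000, 2000]  # values from our experiments
gam_grid = [1.0, 10.0, 100.0]                        # hypothetical values
best = select_best(X_tr, y_tr, X_tr, y_tr, gam_grid, sig2_grid)
```

Because each candidate model only requires one linear solve, exploring the whole grid stays cheap when the input is a single LDA dimension.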

Visual Dictionary Method

The visual dictionary is an original method for annotating images, which improves on the method proposed in [8]. We construct a Concept Visual Dictionary composed of visual words intended to represent a semantic concept. The construction consists of five steps:

Visual elements. Images are decomposed into visual elements where a visual element is an image area, i.e. images are split into a regular grid.

Representation of visual elements. We use the most classical and intuitive approach consisting in representing a visual word by usual features HSV, GABOR, EDGE, and also PEF [ 3 ], and DF [ 7 ].

Global Visual Dictionary. For each feature, we cluster visual elements using the K-Means algorithm with a predefined number of clusters and the Euclidean distance, in order to group visual elements and to smooth some visual artifacts. Then, for each cluster, we select the medoid to be a visual word, and these words compose the visual dictionary of a feature.

Image transcription. Based on the Global Visual Dictionary, we replace visual elements by the nearest visual word in the visual dictionary. The image representation is then based on the frequency of the visual words within the image for each feature.

Concept Visual Dictionary. We select the most discriminative visual words for a given concept to compose a Concept Visual Dictionary. To filter the words, we use an entropy-based reduction, which is developed from the work carried out in [10].
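The dictionary construction and image transcription steps above can be sketched as follows, for one feature type. This is a simplified illustration and not our implementation: the plain Lloyd iteration, the fixed number of iterations, and the function names `kmeans_labels`, `build_dictionary` and `transcribe` are assumptions, and the entropy-based word filtering of the last step is left out.

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    """Plain K-Means with Euclidean distance; returns a cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)

def build_dictionary(X, k):
    """One visual word per non-empty cluster: the medoid of the cluster."""
    labels = kmeans_labels(X, k)
    words = []
    for j in range(k):
        members = X[labels == j]
        if len(members) == 0:
            continue
        d = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=-1)
        # The medoid minimizes the summed distance to the other members.
        words.append(members[d.sum(axis=1).argmin()])
    return np.array(words)

def transcribe(elements, words):
    """Replace each visual element by its nearest word; return word frequencies."""
    d = np.linalg.norm(elements[:, None, :] - words[None, :, :], axis=-1)
    counts = np.bincount(d.argmin(axis=1), minlength=len(words))
    return counts / counts.sum()

# Usage: rows of X stand for feature vectors of grid cells from training images.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (40, 3)), rng.normal(5, 0.1, (40, 3))])
words = build_dictionary(X, k=4)
hist = transcribe(X[:10], words)  # representation of one image
```

Using the medoid rather than the cluster mean means every visual word is an actual visual element from the training set.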

In a second step, we propose an adaptation of the common text-based paradigm to annotate images. We use the tf-idf weighting scheme [9] in the vector space model, together with cosine similarity, to determine the similarity between a visual document and a concept. To use this scheme, we represent an image by the frequency of the visual words within the image for different features: HSV, GABOR, PEF, EDGE and DF.
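As an illustration of this scoring step, the sketch below applies a standard tf-idf weighting to visual-word counts and compares images to a concept with cosine similarity. The logarithmic idf variant, the representation of a concept as the mean tf-idf vector of its positive training images, and the function names are assumptions made for the example; they are not specified in this paper.

```python
import numpy as np

def tfidf(counts):
    """counts: (n_images, n_words) visual-word counts. Returns weights and idf."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)  # number of images containing each word
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf, idf

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is null."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b / (na * nb))

def concept_scores(weights, positive_idx):
    """Score every image against the mean vector of the positive images (assumption)."""
    concept = weights[positive_idx].mean(axis=0)
    return np.array([cosine(w, concept) for w in weights])

# Usage: three images described by counts over three visual words.
counts = np.array([[2, 0, 1],
                   [0, 3, 1],
                   [1, 1, 1]])
weights, idf = tfidf(counts)
scores = concept_scores(weights, positive_idx=[0])
```

A visual word that occurs in every image receives a zero idf, so it does not contribute to the similarity.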

Experimental Results

The SVM-based model used for image retrieval in this task is shown in Figure 1 and consists of the following steps:

Step 1) Split the VCDT labeled image dataset into 2 sets, namely training image dataset and validation set.

Step 2) Extract the visual features from the training image data using our extraction method; learn and perform LDA reduction on these features; train and generate many SVMs with different parameters.

Step 3) Use the validation set to select the best model.

Step 4) Extract the visual features from the VCDT test image database using our extraction method; perform LDA reduction on these features; and then use the best model to find the best discriminant feature.

Step 5) Sort the test images by their distances from the positive training images and produce the final rank result.

The same training and development sets were used for VD and SVM training. We submitted five runs to the official evaluation, of which the two best are described here:

Run SVM(LDA). It consists in performing SVM on the LDA of the [PEF150 + HSV + EDGE + DF] features to reduce the impact of the curse of dimensionality. Testing on 10K images and 50 topics took 2 minutes on a standard Pentium IV PC.

Run VD. It is a vector search system that uses small icons from the images. The visual features are HSV, edge, and Gabor. This model needs only 2 hours of training on a Pentium IV at 3 GHz with 4 GB of RAM, and testing is faster than with SVM.

The Area Under the Curve (AUC, the integral of the Receiver Operating Characteristic) for each topic and method is shown in Figure 2.
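For reference, the AUC of one topic can be computed directly from the confidence scores, as the fraction of (positive, negative) image pairs that are ranked in the right order, with ties counted as one half. The sketch below is a generic illustration, not the official evaluation tool of the campaign; the function name `auc` is an assumption.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve from scores and 0/1 labels (pairwise form)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Usage: three of the four (positive, negative) pairs are correctly ordered.
value = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
```

An AUC of 0.5 corresponds to a random ranking, and 1.0 to a ranking that places every positive image above every negative one.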

We show in Table 2 that SVM(LDA([PEF150+HSV+EDGE+DF])) is better than VD, with AUC = 0.72. It is our best run and occupies the 8th rank among the 19 participating teams. The same SVM(LDA) strategy was applied to another set of features (the AVEIR group features described in [12]), but it yields AUC = 0.50.

Conclusion

The SVM model has a higher average AUC than VD (0.722 against 0.682), but VD is lighter than SVM and is better for some topics. Table 3 lists the worst and best topics for VD compared to SVM. The worst are, for example, "Snow, Winter, Sky, Desert, Beach ...", which may be topics with one clear visual representation; for example, we can imagine a dominant color and texture for snow or sky. On the contrary, VD is better than SVM for "Flower, Vehicle, Food, Autumn, ...", which may be concepts with higher visual variation in color and texture. This suggests that the statistics of simpler visual concepts may be better modeled by SVM, while more complex visual concepts may be better represented by our Visual Dictionary model. The respective performances of these two models should also be related to the number of training samples. We are currently investigating this promising improved VD, and we propose an optimal fusion with SVM in order to benefit from the properties of both.

[Figure: SVM (−) and Visual Dico (o−−)]

Acknowledgment

This work was supported by the French National Agency of Research (ANR-06-MDCA-002) and the Research Fund for the Doctoral Program of Higher Education of China (200803591024).

References

[1] Vapnik, V.: Statistical learning theory. John Wiley, New York (1998)

[2] Glotin, H.: Information retrieval and robust perception for a scaled multi-structuration. Thesis for habilitation of research direction, University Sud Toulon-Var, Toulon (2007)

[3] Glotin, H., Zhao, Z.Q., Ayache, S.: Efficient Image Concept Indexing by Harmonic & Arithmetic Profiles Entropy. IEEE International Conference on Image Processing, Cairo, Egypt, November 7-11 (2009)

[4] Waring, C.A., Liu, X.: Face detection using spectral histograms and SVMs. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(3), 467-476 (2005)

[5] Tong, S., Chang, E.: Support Vector Machine active learning for image retrieval. In Proceedings of the ninth ACM international conference on Multimedia, Canada, pp. 107-118 (2001)

[6] Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9, 293-300 (1999)

[7] Smach, F., Lemaitre, C., Gauthier, J.P., Miteran, J., Atri, M.: Generalized Fourier Descriptors with Applications to Objects Recognition in SVM Context. J. Math Imaging Vis 30, 43-71 (2008)

[8] Dumont, Merialdo: Video search using a visual dictionary. In CBMI 2007, 5th International Workshop on Content-Based Multimedia Indexing, June 25-27, 2007, Bordeaux, France (2007)

[9] Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA (1986)

[10] Jensen and Shen: Fuzzy-Rough Data Reduction with Ant Colony Optimization. In Fuzzy Sets and Systems (2004)

[11] Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces versus Fisherfaces. IEEE Trans. Pattern Anal. Machine Intell. 19, 711-720 (1997)

[12] Glotin et al.: Comparison of Various AVEIR Visual Concept Detectors with an Index of Carefulness. In ImageClef09 proceedings (2009)

[13] Nowak, Dunker: Overview of the CLEF 2009 Large Scale - Visual Concept Detection and Annotation Task. CLEF working notes 2009, Corfu, Greece (2009)