LSIS Scaled Photo Annotations: Discriminant Features SVM versus Visual Dictionary based on Image Frequency

Zhong-Qiu ZHAO1,2, Herve GLOTIN1, Emilie DUMONT1
1 University of Sud Toulon-Var, USTV, France, Lab. Sciences de l'Information et des Systemes LSIS, UMR CNRS 6168, La Garde, France
2 School of Computer & Information, Hefei University of Technology, China
glotin@univ-tln.fr; zhongqiuzhao@gmail.com; emilie.r.dumont@gmail.com

Abstract

In this paper, we use only visual information to address the ImageCLEF 2009 Photo Annotation Task. First, we extract various visual features: HSV and EDGE histograms, Gabor features, and the recent Descriptor of Fourier and Profile Entropy Features. Then, for each concept and feature set, we apply Linear Discriminant Analysis (LDA) to reduce the impact of the high dimensionality. Finally, we train support vector machines (SVMs), whose outputs are taken as the confidences with which the samples belong to the concept. We also propose a second model, an improved version of a Visual Dictionary (VD), built from visual words extracted from the training set according to their image frequency. We report the results of these two models topic by topic and give perspectives for our VD method, which is much faster than SVM and outperforms it on some topics. Among the 19 teams involved in this campaign, our SVM(LDA) run reaches an AUC score of 0.721 and ranks 8th by AUC, while our VD model would rank 10th.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Cross-Language Retrieval in Image Collections (ImageCLEF)—ImageCLEFphoto

Keywords

LDA, Visual Dictionary, Generalized Descriptor of Fourier, Profile Entropy Feature, SVM.

1 Introduction

ImageCLEF (http://www.imageclef.org/ImageCLEF2009) is the cross-language image retrieval track run as part of the Cross Language Evaluation Forum (CLEF) campaign. This track evaluates the retrieval of images described by text captions based on queries in a different language; both text and image matching techniques are potentially exploitable. The task provides a training database of 5000 images labeled with 53 concepts organized in a hierarchy. Only these data are used to train the retrieval models. The test database consists of 13000 images, for each of which participating groups are required to determine the presence or absence of each concept. More details on the dataset can be found in [13]. We use various features, described in the next section: PEF [2, 3], HSV and EDGE histograms, the new Descriptor of Fourier [7], and Gabor features. Then we use LDA to reduce their dimension. Finally, we use the Least Squares Support Vector Machine (LS-SVM) to produce concept similarities. Another original method, called Visual Dictionary, is proposed and implemented in Section 5.

2 Visual Features

We use a new feature, the pixel 'profile' entropy (PEF) [2], which measures the entropy of pixel profiles in the horizontal and vertical directions. The advantage of PEF is to combine raw shape and texture representations at a low CPU cost; it already gave good performances (second best rank in the official ImagEval 2006 campaign, see www.imageval.org). Here we use the extended PEF [3], which additionally uses the harmonic mean of the pixels of each row or column, as sketched below.
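As an illustration, here is a minimal NumPy sketch of the profile-entropy idea. This is our simplified reading of [2, 3]: the full PEF also includes profile statistics and reaches 150 dimensions, and all function names here are illustrative.

```python
import numpy as np

def profile_entropy(block, axis, mean="arithmetic", eps=1e-12):
    """Shannon entropy of the pixel profile of `block` along `axis`.
    The profile is the per-row (axis=1) or per-column (axis=0) mean of
    pixel intensities, normalised into a probability distribution."""
    x = block.astype(np.float64) + eps
    if mean == "arithmetic":
        profile = x.mean(axis=axis)
    else:                          # harmonic mean, sensitive to dark regions
        profile = x.shape[axis] / (1.0 / x).sum(axis=axis)
    p = profile / profile.sum()
    return float(-(p * np.log2(p)).sum())

def pef(img, n_slices=3):
    """Profile-entropy features on equal horizontal and vertical slices of
    a 2-D grayscale image (a small subset of the full 150-D PEF)."""
    h, w = img.shape
    feats = []
    for mean in ("arithmetic", "harmonic"):
        for i in range(n_slices):
            rows = img[i * h // n_slices:(i + 1) * h // n_slices, :]
            cols = img[:, i * w // n_slices:(i + 1) * w // n_slices]
            feats.append(profile_entropy(rows, axis=1, mean=mean))
            feats.append(profile_entropy(cols, axis=0, mean=mean))
    return np.array(feats)
```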
The rationale for the harmonic projection is that the object or pixel region distribution, which is lost in the arithmetic mean projection, can be partly recovered by the harmonic mean. These two projections are thus expected to give complementary and/or concept-dependent information. PEF are computed on three equal horizontal and three equal vertical image slices, yielding a total of 150 dimensions. We also use classical features: HSV and EDGE histograms, Gabor features, and the recent Descriptor of Fourier, which is robust to rotation [7]. We train our two models on these features, which represent a total of 400 dimensions. We use LDA to reduce the feature dimensions, as described in the next section.

3 Linear Discriminant Analysis

In general, LDA [11] is used to find an optimal subspace for classification, in which the ratio of the between-class scatter to the within-class scatter is maximized. Let the between-class scatter matrix be defined as

S_B = \sum_{i=1}^{c} n_i (\overline{X}^i - \overline{X})(\overline{X}^i - \overline{X})^T

and the within-class scatter matrix be defined as

S_W = \sum_{i=1}^{c} \sum_{X_k \in C_i} (X_k - \overline{X}^i)(X_k - \overline{X}^i)^T

where \overline{X} = (\sum_{j=1}^{n} X_j)/n is the mean image of the ensemble, \overline{X}^i = (\sum_{j=1}^{n_i} X_j^i)/n_i is the mean image of the i-th class, n_i is the number of samples in the i-th class, c is the number of classes, and C_i is the i-th class. The optimal subspace E_{optimal} is then determined by LDA as

E_{optimal} = \arg\max_E \frac{|E^T S_B E|}{|E^T S_W E|} = [c_1, c_2, ..., c_{c-1}]

where [c_1, c_2, ..., c_{c-1}] is the set of generalized eigenvectors of S_B and S_W corresponding to the c-1 largest generalized eigenvalues \lambda_i, i = 1, 2, ..., c-1, i.e.,

S_B E_i = \lambda_i S_W E_i,  i = 1, 2, ..., c-1.

Thus, the most discriminant feature vector P for any query image X can be calculated as

P = E_{optimal}^T U^T X

where U is a preliminary projection (e.g., the PCA matrix of the Fisherface method [11]). In our image retrieval task, LDA outputs only one dimension, since the classification problem for each concept is two-class.

4 Fast classification using Least Squares Support Vector Machines

In order to design fast image retrieval systems, we use the Least Squares Support Vector Machine (LS-SVM). The SVM [1] first maps the data into a higher-dimensional space through some kernel function, and then learns a separating hyperplane that maximizes the margin. Thanks to its good generalization capability, this technique has been widely applied in areas such as face detection and image retrieval [4, 5]. The standard SVM is based on an ε-insensitive cost function, meaning that approximation errors smaller than ε do not increase the cost; this leads to a quadratic convex optimization problem. If a quadratic cost function is used instead of the ε-insensitive one, we obtain the least squares SVM (LS-SVM) [6], a reformulation of the standard SVM in which training amounts to solving a linear KKT system, which is computationally attractive. In all our experiments, we use LS-SVMlab1.5 (http://www.esat.kuleuven.ac.be/sista/lssvmlab/). We select the RBF kernel K(x_1, x_2) = \exp(-\|x_1 - x_2\|^2 / \sigma^2) as the kernel function of our LS-SVM, so a corresponding parameter, σ^2, has to be tuned; a large value of σ^2 yields a stronger smoothing. Another parameter, γ, must also be tuned to trade off model complexity against the fit to the training data. We search these two parameters over σ^2 ∈ {4, 25, 100, 400, 600, 800, 1000, 2000} and γ ∈ {4, 8, 16, 32, 64, 128, 256, 512}.
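For concreteness, here is a minimal NumPy sketch of LS-SVM training in its regression form (targets in {-1, +1}), solving the linear KKT system of [6], together with the grid search described above. This is illustrative only, not the actual LS-SVMlab implementation used in the experiments.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """K(x1, x2) = exp(-||x1 - x2||^2 / sigma^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def lssvm_train(X, y, gamma, sigma2):
    """LS-SVM [6], regression form: solve the linear KKT system
    [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    K = rbf_kernel(X, X, sigma2)
    A = np.block([[np.zeros((1, 1)), np.ones((1, n))],
                  [np.ones((n, 1)),  K + np.eye(n) / gamma]])
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                 # bias b, dual weights alpha

def lssvm_score(Xtest, X, b, alpha, sigma2):
    """Confidence scores f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(Xtest, X, sigma2) @ alpha + b

def select_best(Xtr, ytr, Xval, yval, sigma2_grid, gamma_grid, metric):
    """Exhaustively search the 8 x 8 parameter grid and keep the pair
    whose model scores best on the validation set (metric: e.g. AUC)."""
    def val_score(s2, g):
        b, alpha = lssvm_train(Xtr, ytr, g, s2)
        return metric(yval, lssvm_score(Xval, Xtr, b, alpha, s2))
    return max(((s2, g) for s2 in sigma2_grid for g in gamma_grid),
               key=lambda p: val_score(*p))
```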
A total of 64 SVMs (one per (σ^2, γ) pair) were thus constructed for each concept model, and the best SVM was then selected on the validation set.

5 Visual Dictionary Method

The visual dictionary is an original method to annotate images, improving on the method proposed in [8]. We construct a Concept Visual Dictionary composed of visual words intended to represent a semantic concept. It is built in five steps:

• Visual elements. Images are decomposed into visual elements, where a visual element is an image area, i.e. images are split into a regular grid.

• Representation of visual elements. We use the most classical and intuitive approach, representing a visual element by the usual features HSV, GABOR and EDGE, together with PEF [3] and DF [7].

• Global Visual Dictionary. For each feature, we cluster the visual elements with the K-Means algorithm, using a predefined number of clusters and the Euclidean distance, in order to group visual elements and smooth out some visual artifacts. Then, for each cluster, we select the medoid as a visual word; these words compose the visual dictionary of a feature.

• Image transcription. Based on the Global Visual Dictionary, we replace each visual element by the nearest visual word in the dictionary. The image representation is then the frequency of the visual words within the image, for each feature.

• Concept Visual Dictionary. We select the most discriminative visual words for a given concept to compose its Concept Visual Dictionary. To filter the words, we use an entropy-based reduction, developed from the work carried out in [10].

Figure 1: The framework for image retrieval of each topic

In a second step, we propose an adaptation of the common text-based paradigm to annotate images. We use the tf-idf weighting scheme [9] in the vector space model, together with the cosine similarity, to determine the similarity between a visual document and a concept. To use this scheme, we represent an image by the frequency of the visual words within the image for the different features: HSV, GABOR, PEF, EDGE and DF. A sketch of this scoring is given after the run descriptions below.

6 Experimental Results

The SVM-based model used to implement image retrieval in the task is shown in Figure 1 and consists of the following steps:

Step 1) Split the VCDT labeled image dataset into two sets, namely a training set and a validation set.
Step 2) Extract the visual features from the training images using our extraction method; learn and perform the LDA reduction on these features; train SVMs with the different parameter settings.
Step 3) Use the validation set to select the best model.
Step 4) Extract the visual features from the VCDT test image database using our extraction method; perform the LDA reduction on these features; then use the best model to compute the most discriminant feature.
Step 5) Sort the test images by their distances to the positive training images and produce the final ranking.

The same training and development sets were used for the VD and SVM training. We submitted five runs to the official evaluation, of which the two best are described here:

• Run SVM(LDA). It consists of performing SVM on the LDA of the [PEF150 + HSV + EDGE + DF] features, to reduce the impact of the curse of dimensionality. Testing the 10K images on the 50 topics took 2 minutes on a standard Pentium IV PC.

• Run VD. It is a vector-space search system using small icons extracted from the images. The visual features are HSV, EDGE and Gabor. This model needs only 2 hours of training on a Pentium IV 3 GHz with 4 GB of RAM, and testing is faster than with the SVM.
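The following minimal sketch illustrates the tf-idf/cosine scoring behind the VD run, assuming images have already been transcribed into visual-word counts. The way the concept prototype is built here (mean tf-idf vector of the concept's positive training images, restricted to the Concept Visual Dictionary words) is our illustrative assumption.

```python
import numpy as np

def tfidf(counts, idf):
    """tf-idf weighting [9]: `counts` is an (n_images, n_words) matrix of
    visual-word frequencies (each grid cell mapped to its nearest word)."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return tf * idf

def concept_scores(test_counts, train_counts, positive, word_mask):
    """Cosine similarity between test images and a concept prototype.
    `positive` indexes the concept's positive training images; `word_mask`
    is a 0/1 vector keeping only the Concept Visual Dictionary words."""
    df = np.maximum((train_counts > 0).sum(axis=0), 1)
    idf = np.log(len(train_counts) / df)       # idf learned on training set
    proto = tfidf(train_counts[positive], idf).mean(axis=0) * word_mask
    X = tfidf(test_counts, idf) * word_mask
    num = X @ proto
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(proto) + 1e-12
    return num / den                           # higher = more similar
```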
The Area Under the Curve (the integral of the Receiver Operating Characteristic) for each topic and each method is shown in Figure 2.

Table 1: Our two best submitted runs of the ImageCLEF2009 Photo Annotation Task

RUN        LAB    Method                            Average EER    AUC
SVM(LDA)   LSIS   SVM(LDA([PEF150+HSV+EDGE+DF]))    0.3308         0.7209
VD         LSIS   Visual Dictionary                 0.3607         0.6855

Table 2: The best run of each team in the ImageCLEF2009 Photo Annotation Task (from the official evaluations)

RANK   LAB                   Average EER   AUC
1      ISIS                  0.234476      0.838699
2      LEAR                  0.249469      0.823105
3      FIR2                  0.253566      0.817159
4      CVIUI2R               0.253296      0.813893
5      XRCE                  0.267301      0.802704
6      bpacad                0.291718      0.773133
7      MMIS                  0.312366      0.744231
*8     LSIS (SVM(LDA))       0.330819      0.720931
9      IAM                   0.330401      0.714825
10     LIP6                  0.372169      0.673089
11     MRIM                  0.383630      0.643450
12     AVEIR                 0.440589      0.550866
-      Random                0.500280      0.499307
13     CEA                   0.500495      0.469035
14     TELECOM ParisTech     0.526302      0.459922
15     Wroclaw               0.446024      0.220957
16     KameyamaLab           0.452374      0.164048
17     UAIC                  0.479700      0.105589
18     INAOE                 0.484685      0.099306
19     apexlab               0.482693      0.070400

Table 2 shows that SVM(LDA([PEF150+HSV+EDGE+DF])), with AUC = 0.72, is better than VD. It is our best run, and it occupies the 8th rank among the 19 participating teams. The same SVM(LDA) strategy was applied to another set of features (the AVEIR group features described in [12]), but it only yields AUC = 0.50.

7 Conclusion

The SVM model has a higher average AUC than VD (0.721 against 0.686), but VD is lighter than SVM and, for some topics, better. Table 3 lists the worst and best topics for VD compared to SVM. The worst are, for example, "Snow, Winter, Sky, Desert, Beach, ...", which are perhaps topics with one clear visual representation; for instance, we can imagine a dominant color and texture for snow or sky. Conversely, VD is better than SVM for "Flowers, Vehicle, Food, Autumn, ...", which are perhaps concepts with higher visual variations in color and texture. This suggests that the statistics of simpler visual concepts may be better modeled by SVM, while more complex visual concepts may be better represented by our Visual Dictionary model. The respective performances of these two models should also be related to the number of training samples. We are currently pursuing research on this promising improved VD, and we propose an optimal fusion with SVM in order to benefit from the properties of both.

Figure 2: Area Under the Curve (AUC) evaluations for each topic and method. The number of each topic is the one given by the organizers. The topics are sorted according to the relative performance of VD against SVM. The absolute AUC for SVM (-) and for VD (o--) are given in the right panel, while their relative difference in percent is given in the left panel. The 10 worst and best topics are listed in Table 3.
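For reference, the per-topic AUC and Equal Error Rate reported above can be computed from ranked confidence scores, e.g. with scikit-learn. This is a standard recipe, not the organizers' official evaluation script.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def topic_auc_eer(y_true, scores):
    """AUC and Equal Error Rate for one topic, given binary ground-truth
    labels and the classifier's confidence scores for that topic."""
    auc = roc_auc_score(y_true, scores)
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    eer = fpr[np.argmin(np.abs(fpr - fnr))]   # operating point where FPR ~ FNR
    return auc, eer
```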
Table 3: Lists of the ten worst and ten best topics for the VD method compared to SVM

Ten worst topics for VD    Ten best topics for VD
Snow                       Flowers
Desert                     Partly-Blurred
Day                        Partly-Blurred
Sky                        Motion-Blur
Single-Person              Vehicle
Overall-Quality            Macro
No-Visual-Time             No-Blur
Beach-Holidays             Food
Out-of-focus               Autumn
Winter                     Citylife

Acknowledgment

This work was supported by the French National Agency of Research (ANR-06-MDCA-002) and by the Research Fund for the Doctoral Program of Higher Education of China (200803591024).

References

[1] Vapnik, V.: Statistical Learning Theory. John Wiley, New York (1998)
[2] Glotin, H.: Information Retrieval and Robust Perception for a Scaled Multi-Structuration. Thesis for habilitation of research direction, University Sud Toulon-Var, Toulon (2007)
[3] Glotin, H., Zhao, Z.Q., Ayache, S.: Efficient Image Concept Indexing by Harmonic & Arithmetic Profiles Entropy. In: IEEE International Conference on Image Processing, Cairo, Egypt, November 7-11 (2009)
[4] Waring, C.A., Liu, X.: Face Detection Using Spectral Histograms and SVMs. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(3), 467–476 (2005)
[5] Tong, S., Chang, E.: Support Vector Machine Active Learning for Image Retrieval. In: Proceedings of the Ninth ACM International Conference on Multimedia, Canada, pp. 107–118 (2001)
[6] Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9, 293–300 (1999)
[7] Smach, F., Lemaitre, C., Gauthier, J.P., Miteran, J., Atri, M.: Generalized Fourier Descriptors with Applications to Objects Recognition in SVM Context. J. Math. Imaging Vis. 30, 43–71 (2008)
[8] Dumont, E., Merialdo, B.: Video Search Using a Visual Dictionary. In: CBMI 2007, 5th International Workshop on Content-Based Multimedia Indexing, June 25-27, 2007, Bordeaux, France (2007)
[9] Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA (1986)
[10] Jensen, R., Shen, Q.: Fuzzy-Rough Data Reduction with Ant Colony Optimization. Fuzzy Sets and Systems (2004)
[11] Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. Pattern Anal. Machine Intell. 19, 711–720 (1997)
[12] Glotin, H. et al.: Comparison of Various AVEIR Visual Concept Detectors with an Index of Carefulness. In: ImageCLEF09 Proceedings (2009)
[13] Nowak, S., Dunker, P.: Overview of the CLEF 2009 Large Scale Visual Concept Detection and Annotation Task. CLEF Working Notes 2009, Corfu, Greece (2009)