<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linear SVM for new Pyramidal Multi-Level Visual only Concept Detection in CLEF 2010 Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sébastien PARIS</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hervé GLOTIN</string-name>
          <email>glotin@univ-tln.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LSIS DYNI, Univ du Sud-Toulon-Var</institution>,
          <addr-line>av de l'Université - BP 20132, 83957 LA GARDE</addr-line>,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LSIS DYNI, Univ Paul Cézanne</institution>,
          <addr-line>av Escadrille Normandie-Niemen, MARSEILLE CEDEX</addr-line>,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>For the Visual Concept Detection Task of the CLEF 2010 Challenge, using only visual information, we propose a novel multi-level spatial pyramid (sp) feature: the spELBP (Extended Local Binary Pattern). In this paper we first present this feature and a few similar ones: the spELBOP (Extended Local Binary Orientation Pattern) and the spHOEE (Histogram of Oriented Edge Energy). We then discuss why our features are well suited to feeding state-of-the-art linear SVM algorithms for concept detection. Over the 15 participating teams, our scores rank 8th according to the F-measure evaluation and 9th according to the MAP evaluation. We compare each topic score to the best system, and we finally discuss further extensions of our approach.<sup>1</sup></p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The VCDT 2010 challenge (see [NH10]) consists in the detection of visual concepts in images.</p>
      <p>Basically, this challenge can be considered as a supervised classification problem, more precisely as training models on efficient features with a ”one-against-all” approach. In recent years in computer vision, in order to reduce the semantic gap in the object categorization problem, two popular approaches have emerged offering efficient performances. The first one, a.k.a. ”Bag of Words” (BoW) (see [ZZY+,YYGH]), consists in building a dictionary of visual words given a large pool of feature vectors, usually SIFT descriptors [Low]. SIFT descriptors can be computed over a regular spatial grid or at the interest points output by specific detectors (corners, edges, blobs, . . . ) such as the Harris or Lowe detectors [HS88,Low]. A dictionary learning step follows, usually done by vector quantization of the whole pool of feature vectors, typically with a K-means or GMM algorithm [YYGH]. More efficient dictionaries can be retrieved with sparse learning tools [WYKY+].</p>
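      <p>As a minimal sketch of this BoW pipeline (Python with scikit-learn assumed; descriptor extraction is taken as given and the dictionary size is illustrative, not the system used in our runs):</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def build_bow_histograms(descriptors_per_image, n_words=1000, seed=0):
    """Quantize local descriptors into visual words, one histogram per image.

    descriptors_per_image: list of (n_i, 128) arrays of SIFT descriptors.
    Returns an (n_images, n_words) array of L1-normalized BoW histograms.
    """
    pool = np.vstack(descriptors_per_image)        # pool all descriptors
    kmeans = KMeans(n_clusters=n_words, random_state=seed).fit(pool)
    hists = []
    for desc in descriptors_per_image:
        words = kmeans.predict(desc)               # nearest visual word
        h = np.bincount(words, minlength=n_words).astype(float)
        hists.append(h / max(h.sum(), 1.0))        # L1 normalization
    return np.array(hists)
      </preformat>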
      <p>The second approach is based on Local Binary Pattern (LBP) descriptors (a.k.a. CENTRIST in [WR]). The feature vector is defined by the occurrences of each of the 256 patterns encoding the neighborhood relation to a central pixel.
<sup>1</sup> This work has been supported by the AVEIR project, ANR-06-MDCA-002 (ANR).</p>
      <p>For both approaches, adding a multi-level pyramidal architecture considerably improves performance. This technique divides the image into sub-windows and adequately weights each corresponding feature vector before concatenation (see [LSP]). The price of this kind of architecture is having to deal with much larger vectors as input to the classifiers. Large-scale binary supervised classification problems arise naturally with these descriptors.</p>
      <p>The next section describes more precisely the descriptors we developed for the challenge, especially the novel spELBOP descriptor. The third section overviews the large-scale binary classifier we use: the linear SVM classifier trained with TRON (L2 regularized with an L2 loss function).</p>
    </sec>
    <sec id="sec-2">
      <title>Pyramidal Multi-Level Features</title>
      <p>For each of the three following descriptors, a spatial pyramid architecture is used to divide the entire image $I$ into $N_s$ possibly overlapping sub-windows. More precisely, an $L$-level pyramid is defined for $l = 1, \dots, L$, where the image $I$ of size $N_y \times N_x$ is divided into possibly overlapping sub-windows of size $h_l \times w_l$. Histograms are computed for each sub-window and weighted by
$$c_l = \frac{h_l}{\max_{j=1,\dots,L}\{h_j\}} \cdot \frac{w_l}{\max_{j=1,\dots,L}\{w_j\}}.$$
Finally, the concatenation of the $N_s$ weighted histograms defines the global feature vector. In our implementation, $h_l = \lfloor N_y \cdot r_{y,l} \rfloor$ and $w_l = \lfloor N_x \cdot r_{x,l} \rfloor$, where $r_{y,l}$ and $r_{x,l}$ are elements of the vectors $r_y$ and $r_x$. Shifts along the y and x axes are defined by the integers $\delta_{y,l} = \lfloor N_y \cdot d_{y,l} \rfloor$ and $\delta_{x,l} = \lfloor N_x \cdot d_{x,l} \rfloor$, where $d_{y,l}$ and $d_{x,l}$ are elements of the vectors $d_y$ and $d_x$ respectively. Overlapping windows are obtained if $d_{y,l} \leq r_{y,l}$ and/or $d_{x,l} \leq r_{x,l}$. The total number of sub-windows is equal to
$$N_s = \sum_{l=1}^{L} \left\lfloor \frac{1 - r_{y,l}}{d_{y,l}} + 1 \right\rfloor \cdot \left\lfloor \frac{1 - r_{x,l}}{d_{x,l}} + 1 \right\rfloor.$$</p>
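      <p>As a minimal sketch of this layout (Python/NumPy assumed; variable names mirror the notation above and the ratio vectors are assumed strictly positive):</p>
      <preformat>
import numpy as np

def pyramid_layout(Ny, Nx, ry, rx, dy, dx):
    """Window sizes, weights c_l and sub-window count N_s per level."""
    hmax = max(int(np.floor(Ny * r)) for r in ry)   # max_j h_j
    wmax = max(int(np.floor(Nx * r)) for r in rx)   # max_j w_j
    levels, Ns = [], 0
    for ryl, rxl, dyl, dxl in zip(ry, rx, dy, dx):
        hl = int(np.floor(Ny * ryl))                # window height h_l
        wl = int(np.floor(Nx * rxl))                # window width  w_l
        cl = (hl / hmax) * (wl / wmax)              # level weight  c_l
        ny = int(np.floor((1 - ryl) / dyl + 1))     # window positions along y
        nx = int(np.floor((1 - rxl) / dxl + 1))     # window positions along x
        Ns += ny * nx
        levels.append((hl, wl, cl, ny * nx))
    return levels, Ns
      </preformat>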
      <sec id="sec-2-1">
        <title>The spHOEE Feature</title>
        <p>Following [DT,MBM], a histogram of the L1-normalized orientation edge energy filter responses is constructed for $N_o$ different orientations. These responses are obtained by convolving the gray image with two odd elongated oriented filters (horizontal and vertical gradients) at scale $\sigma$. L1-normalized magnitudes over blocks of size $h_n \times w_n$ and signed angles are computed from these gradients. Each of the $N_s$ sub-window histograms is computed efficiently thanks to the integral histogram method. The total dimension of the feature vector is $d = N_s \cdot N_o$. The spHOEE feature (a.k.a. spHOG in [MBM,MB]) offers state-of-the-art performance on databases such as CALTECH 101 or INRIA pedestrians.</p>
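        <p>A simplified sketch of such a histogram (Python with SciPy assumed; plain Sobel gradients stand in for the elongated oriented filters, and a single global L1 normalization replaces the $h_n \times w_n$ block normalization):</p>
        <preformat>
import numpy as np
from scipy.ndimage import sobel

def hoee_histogram(gray, n_orient=9, eps=1e-8):
    """Histogram of L1-normalized edge-energy responses over signed angles."""
    gy, gx = sobel(gray, axis=0), sobel(gray, axis=1)
    mag = np.abs(gx) + np.abs(gy)
    mag = mag / (mag.sum() + eps)                  # L1 normalization (global here)
    ang = np.arctan2(gy, gx)                       # signed angle in [-pi, pi]
    bins = np.floor((ang + np.pi) / (2 * np.pi) * n_orient).astype(int)
    bins = np.clip(bins, 0, n_orient - 1)
    hist = np.zeros(n_orient)
    np.add.at(hist, bins.ravel(), mag.ravel())     # magnitude-weighted votes
    return hist
        </preformat>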
      </sec>
      <sec id="sec-2-2">
        <title>The novel spELBP Feature</title>
        <p>Local Binary Patterns (LBP) are powerful parametric descriptors encoding the relation between the intensity of a central pixel and the intensities of its 8 adjacent neighbors (see [LZL+07]). Widely used in face recognition (see [SGM09]), LBP have also shown their efficiency in scene categorization ([WR]) compared to BoW with SIFT descriptors. In [LZL+07], a multi-scale extension (MSLBP) encodes the relation of a central block of pixels of size $s \times s$ with its 8 neighboring blocks, capturing more global details. The sum over each block area is computed with the help of the integral image. We propose here a spatial pyramid architecture for the MSLBP, so-called spELBP. This novel descriptor captures details at scale $s$ at a given sub-window location. Let $S$ be the number of scales; the total dimension of the spELBP descriptor is $d = 256 \cdot N_s \cdot S$.</p>
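        <p>A simplified sketch of the block LBP histogram underlying spELBP (Python/NumPy assumed; block means stand in for the integral-image computation, borders are handled naively, and the bit order of the 8 neighbors is an arbitrary but fixed convention):</p>
        <preformat>
import numpy as np

def block_lbp_histogram(gray, s=3):
    """256-bin histogram of block LBP codes at scale s (s x s block means)."""
    H, W = gray.shape
    by, bx = H // s, W // s                        # number of blocks per axis
    blocks = gray[:by * s, :bx * s].reshape(by, s, bx, s).mean(axis=(1, 3))
    center = blocks[1:-1, 1:-1]                    # central blocks
    code = np.zeros(center.shape, dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), # 8 neighbors, one bit each
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (oy, ox) in enumerate(offsets):
        nb = blocks[1 + oy:by - 1 + oy, 1 + ox:bx - 1 + ox]
        code += (nb >= center).astype(np.uint8) * np.uint8(2 ** bit)
    return np.bincount(code.ravel(), minlength=256)
        </preformat>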
      </sec>
      <sec id="sec-2-3">
        <title>The spELBOP Feature</title>
        <p>This novel descriptor is derived from the two previous ones. Here, instead of encoding the raw pixel values of a block of size $s$, we propose to encode the orientations of the block. As with the spHOEE feature, orientations are retrieved by i) convolving with the two odd elongated oriented filters at scale $\sigma$ and ii) computing the signed angles. The total dimension of the spELBOP descriptor is the same as the latter, i.e. $d = 256 \cdot N_s \cdot S$.</p>
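        <p>A corresponding sketch, reusing the two functions above: the block LBP encoding is simply applied to the signed orientation map instead of the raw intensities (again with Sobel gradients standing in for the oriented filters):</p>
        <preformat>
import numpy as np
from scipy.ndimage import sobel

def elbop_histogram(gray, s=3):
    """Block LBP histogram computed on the signed orientation map."""
    gy, gx = sobel(gray, axis=0), sobel(gray, axis=1)
    ang = np.arctan2(gy, gx)                       # signed orientations
    return block_lbp_histogram(ang, s=s)           # LBP over orientations
        </preformat>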
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Large Scale SVM</title>
      <p>Learning a topic with the one-against-all approach is equivalent to a binary supervised classification task. We deal with a training set $D = \{x_i, y_i\}$, $i = 1, \dots, N$, where $x_i \in \mathbb{R}^d$ represents a feature vector and $y_i \in \{-1, 1\}$ its corresponding label. Max-margin classifiers like SVM are known to offer state-of-the-art performance. However, with high-dimensional feature vectors and numerous examples, training an SVM can be too computationally expensive ($\sim O(dN^3)$). For large-scale problems, one alternative is to use a max-margin linear classifier, which often offers the same level of performance as the non-linear version [LWK,WYKY+]. The linear SVM used here consists in finding the hyperplane parameter $w$ minimizing the sum of an L2 loss function and an L2 regularization term:</p>
      <p>$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \max\left(1 - y_i w^T x_i,\ 0\right)^2. \qquad (1)$$</p>
      <p>In [LWK], this problem is solved with a Trust Region Newton algorithm (TRON). We use the modified version of TRON proposed by [MBM], which handles dense features.</p>
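      <p>For illustration, Eq. (1) is the L2-regularized, squared-hinge-loss linear SVM also implemented in LIBLINEAR; a minimal sketch with scikit-learn's LinearSVC (placeholder data, not our TRON implementation):</p>
      <preformat>
import numpy as np
from sklearn.svm import LinearSVC

# X: (N, d) dense feature matrix, y: labels in {-1, +1} for one topic
# (one-against-all). Placeholder data for illustration only.
rng = np.random.default_rng(0)
X = rng.random((200, 10752))
y = np.where(rng.random(200) > 0.5, 1, -1)

# penalty='l2' with loss='squared_hinge' matches Eq. (1)
clf = LinearSVC(penalty='l2', loss='squared_hinge', C=1.0)
clf.fit(X, y)
scores = clf.decision_function(X)                  # w^T x for each image
      </preformat>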
    </sec>
    <sec id="sec-4">
      <title>Results at VCDT 2010</title>
      <p>In the CLEF VCDT 2010 task, preliminary tests on the development set resulted in selecting the spELBP feature for our final system. We chose an $L = 3$ level pyramid with $r_x = r_y = d_x = d_y = \left[1, \frac{1}{2}, \frac{1}{4}\right]^T$, leading to $N_s = 42$ sub-windows and a total of $d = 10752$ dimensions for this feature.</p>
      <p>For each topic, the hyperparameters $C$ or $\lambda$ of the classifiers are tuned with a 5-fold cross-validation by minimizing the Balanced Error Rate (BER). Then the models are learned on the entire training set.</p>
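      <p>A sketch of this tuning loop for $C$ (scikit-learn assumed; the grid of candidate values is an assumption):</p>
      <preformat>
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

def ber(y_true, y_pred):
    """Balanced Error Rate: mean of the per-class error rates."""
    pos, neg = y_true == 1, y_true == -1
    fnr = np.mean(y_pred[pos] != 1)                # miss rate on positives
    fpr = np.mean(y_pred[neg] != -1)               # error rate on negatives
    return 0.5 * (fnr + fpr)

def tune_and_fit(X, y, grid=(0.01, 0.1, 1.0, 10.0)):
    """Pick C by 5-fold CV minimizing BER, then refit on the full set."""
    best_C = min(grid, key=lambda C: ber(
        np.asarray(y), cross_val_predict(LinearSVC(C=C), X, y, cv=5)))
    return LinearSVC(C=best_C).fit(X, y), best_C
      </preformat>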
      <p>All our runs are visual-only runs (we only use the image pixels to detect the topics).</p>
      <p>Over the 15 participating teams, and for the example-based evaluation applying the F-measure, our system ranks 8th. For the second evaluation, our system ranks 9th according to the concept-based Mean Average Precision (MAP) (this measure showed better characteristics than the EER and AUC in a recent study).</p>
      <p>Score details are given at http://www.imageclef.org/2010/PhotoAnnotationExampleEvaluationResults. We give below the list of each team's best run, sorted by rank, with the F-measure, team and run ID:
1 0.680070 ISIS 1276866836402 uva-isis-mkl-mixed-mixed.txt binary.txt
2 0.639441 XRCE 1277144880578 xrce SVM EF Visual.txt binary.txt
3 0.634064 HHI 1277376108118 ic 10 test s eiq space.txt binary.txt
4 0.596141 IJS 1277145320629 final ijs feit run1.txt binary.txt
5 0.581652 LEAR 1277147818075 lear TP Visual CVF 0 D6.txt binary.txt
6 0.558674 MEIJI 1276044391982 AUTO VisualOnly BagOfVisualWords of meiji.txt binary.txt
7 0.530841 Romania 1276777329876 run CR+lapl2inv.txt binary.txt
8 0.530317 LSIS 1277226430977 DYNI LSIS RUN2 COPIE.txt binary.txt
9 0.482390 WROCLAW 1277754391699 imageClef2010-grid20x20-xy rgb dev hes.quick matrix...
10 0.476983 LIG 1277153756343 clefResults.txt binary.txt
11 0.450934 CEALIST 1277046611397 cealist fastSB 6600.submission.txt binary.txt
12 0.427661 BPACAD 1277129525816 bp acad hoggmm.txt binary.txt
13 0.224987 MLKD 1277149221968 Visual2.txt binary.txt
14 0.208564 INSUNHIT 1277043267771 finalresult50.txt binary.txt
15 0.174392 UPMC 1277139769194 output multiviewFINAL.txt binary.txt</p>
      <p>The best MAP run is also an ISIS run, with 0.407, while the best LSIS run reaches a MAP of 0.234.</p>
      <p>Figure 1 gives the Average Precision (AP) for each topic for the best LSIS run. In order to compare our scores to the state of the art, we plot in figure 2 the AP of the best run of the challenge versus ours. We clearly see that the scores are well correlated, and that the LSIS AP are below (or for some topics equal to) the ISIS AP. This loss of precision for the LSIS run is shown for each topic in figure 3. We see that the difference is about 0.15 points of AP on average.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and perspectives</title>
      <p>In this paper, the VCDT 2010 challenge is addressed by a large-scale classifier trained on multi-level descriptors, ranked 8th for the F-measure evaluation over the 15 teams. Training these classifiers is extremely fast compared to the classical SVM counterpart. The Average Precision of our system is below the best run of the challenge with a nearly constant decrease of around 0.15. We can then assume that our approach could be globally improved, as it seems to generalize well to all topics. Further improvements can be expected by adding dense SIFT descriptors with sparse learning and spatial pooling (see [WYKY+]), and by using Multiple Kernel Learning methods for fusing features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="DT"><label>[DT]</label><mixed-citation>Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR'05.</mixed-citation></ref>
      <ref id="HS88"><label>[HS88]</label><mixed-citation>C. Harris and M. Stephens. A combined corner and edge detector. In Proc 4th Alvey Vision Conf, 1988.</mixed-citation></ref>
      <ref id="Low"><label>[Low]</label><mixed-citation>David G. Lowe. Object recognition from local scale-invariant features. In ICCV'99.</mixed-citation></ref>
      <ref id="LSP"><label>[LSP]</label><mixed-citation>Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR'06.</mixed-citation></ref>
      <ref id="LWK"><label>[LWK]</label><mixed-citation>Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region Newton method for logistic regression. J. Mach. Learn. Res., 9.</mixed-citation></ref>
      <ref id="LZL07"><label>[LZL+07]</label><mixed-citation>ShengCai Liao, XiangXin Zhu, Zhen Lei, Lun Zhang, and Stan Z. Li. Learning multi-scale block local binary patterns for face recognition. In ICB, 2007.</mixed-citation></ref>
      <ref id="MB"><label>[MB]</label><mixed-citation>S. Maji and A.C. Berg. Max-margin additive classifiers for detection. In ICCV'09.</mixed-citation></ref>
      <ref id="MBM"><label>[MBM]</label><mixed-citation>S. Maji, A.C. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR'08, June.</mixed-citation></ref>
      <ref id="NH10"><label>[NH10]</label><mixed-citation>S. Nowak and M. Huiskes. New strategies for image annotation: Overview of the photo annotation task at ImageCLEF 2010. In Working notes of CLEF 2010, 2010.</mixed-citation></ref>
      <ref id="SGM09"><label>[SGM09]</label><mixed-citation>Caifeng Shan, Shaogang Gong, and Peter W. McOwan. Facial expression recognition based on local binary patterns: A comprehensive study. Image Vision Comput., 27(6), 2009.</mixed-citation></ref>
      <ref id="WR"><label>[WR]</label><mixed-citation>Jianxin Wu and James M. Rehg. Where am I: Place instance and category recognition using spatial PACT. In CVPR'08.</mixed-citation></ref>
      <ref id="WYKY"><label>[WYKY+]</label><mixed-citation>Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In CVPR'10.</mixed-citation></ref>
      <ref id="YYGH"><label>[YYGH]</label><mixed-citation>Jianchao Yang, Kai Yu, Yihong Gong, and Thomas S. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR'09.</mixed-citation></ref>
      <ref id="ZZY"><label>[ZZY+]</label><mixed-citation>Xi Zhou, Xiaodan Zhuang, Shuicheng Yan, Shih-Fu Chang, Mark Hasegawa-Johnson, and Thomas S. Huang. SIFT-bag kernel for video event analysis. In MM'08. ACM.</mixed-citation></ref>
    </ref-list>
  </back>
</article>