=Paper= {{Paper |id=Vol-1176/CLEF2010wn-ImageCLEF-ParisEt2010 |storemode=property |title=Linear SVM for new Pyramidal Multi-Level Visual only Concept Detection in CLEF 2010 Challenge |pdfUrl=https://ceur-ws.org/Vol-1176/CLEF2010wn-ImageCLEF-ParisEt2010.pdf |volume=Vol-1176 }} ==Linear SVM for new Pyramidal Multi-Level Visual only Concept Detection in CLEF 2010 Challenge== https://ceur-ws.org/Vol-1176/CLEF2010wn-ImageCLEF-ParisEt2010.pdf
    Linear SVM for new Pyramidal Multi-Level
    Visual only Concept Detection in CLEF 2010
                     Challenge

                   Sébastien PARIS (1), Hervé GLOTIN (2)
    (1) LSIS DYNI, Univ Paul Cézanne, av Escadrille Normandie-Niemen, 13397
                        MARSEILLE CEDEX 20
    (2) LSIS DYNI, Univ du Sud-Toulon-Var, av de l’Université - BP20132, 83957
                               LA GARDE
                sebastien.paris@lsis.org, glotin@univ-tln.fr


         Abstract. For the Visual Concept Detection task of the CLEF 2010
         Challenge, using only visual information, we propose novel multi-level
         spatial pyramidal (sp) features: the spELBP (Extended Local Binary
         Pattern). In this paper we first present these features and a few
         similar ones: the spELBOP (Extended Local Binary Orientation Pattern)
         and the spHOEE (Histogram of Oriented Edge Energy). Then we discuss
         why we feed these features to state-of-the-art linear SVM algorithms
         for concept detection. Over the 15 participating teams, our scores are
         ranked 8th according to the F-measure evaluation and 9th according to
         the MAP evaluation. We compare each topic score to the best system,
         and we finally discuss further extensions of our approach (see
         footnote 1).


1       Introduction
The VCDT 2010 challenge (see [NH10]) consists in detecting visual concepts.
    Basically, this challenge can be considered as a supervised classification
problem, solved by training models on efficient features with a
"one-against-all" approach. In recent years, two popular approaches offering
good performance have emerged in computer vision to reduce the semantic gap in
object categorization. The first one, a.k.a. "Bag of Words" (BoW) (see
[ZZY+, YYGH]), consists in building a dictionary of visual words from a large
pool of feature vectors, usually SIFT descriptors [Low]. SIFT descriptors can
be computed over a regular spatial grid or at interest points output by
specific detectors (corners, edges, blobs, ...) such as the Harris or Lowe
detectors [HS88, Low]. A dictionary learning step follows, usually done by
vector quantization of the whole set of feature vectors, typically with the
K-means or GMM algorithms [YYGH]. More efficient dictionaries can be obtained
with sparse learning tools [WYKY+].
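
To illustrate the vector quantization step, the following minimal Python sketch builds a dictionary with a plain K-means loop and encodes a set of local descriptors as a normalized histogram of visual words. The function names `kmeans_dictionary` and `bow_histogram`, the fixed iteration count and the random initialization are assumptions made for this illustration; they are not taken from the systems cited above.

```python
import numpy as np

def kmeans_dictionary(features, k, iters=20, seed=0):
    """Learn k visual words from an (n, dim) array of local descriptors."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct descriptors drawn at random.
    centers = features[rng.choice(len(features), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Squared Euclidean distance from every descriptor to every center.
        d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):            # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(features, centers):
    """Quantize descriptors to their nearest word and return an L1-normalized histogram."""
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

In practice the dictionary would be learned once on descriptors pooled over the training images, and `bow_histogram` would then be applied to each image (or each pyramid sub-window) separately.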
    The second approach is based on Local Binary Pattern (LBP) descriptors
(a.k.a. CENTRIST in [WR]). The feature vector is defined by the number of
occurrences of each of the 256 patterns encoding the relation between a
central pixel and its neighborhood.

    (Footnote 1: This work has been supported by ANR-06-MDCA-002 AVEIR ANR.)

    For both approaches, adding a multi-level pyramidal architecture
considerably improves performance. This technique divides the image into
sub-windows and weights each corresponding feature vector appropriately before
concatenation (see [LSP]). The price of this kind of architecture is that the
classifiers must handle much larger input vectors. Large-scale binary
supervised classification problems arise naturally with these descriptors.
    Section 2 describes the descriptors we developed for the challenge,
especially the novel descriptor spELBOP. Section 3 gives an overview of the
large-scale binary classifier we use: the linear SVM classifier TRON
(L2-regularized with an L2 loss function).


2      Pyramidal Multi-Level Features
For each of the three following descriptors, a spatial pyramid architecture is
used to divide the entire image I into $N_s$ possibly overlapping sub-windows.
More precisely, an L-level pyramid is defined for $l = 1, \dots, L$, where the
image I of size $N_y \times N_x$ is divided into possibly overlapping
sub-windows of size $h_l \times w_l$. Histograms are computed for each
sub-window and weighted by

\[
c_l = \frac{\max_{j=1,\dots,L} \{h_j\}}{h_l} \cdot \frac{\max_{j=1,\dots,L} \{w_j\}}{w_l}.
\]

Finally, the concatenation of the $N_s$ weighted histograms defines the global
feature vector. In our implementation, $h_l = \lfloor N_y r_{y,l} \rfloor$ and
$w_l = \lfloor N_x r_{x,l} \rfloor$, where $r_{y,l}$ and $r_{x,l}$ are elements
of the vectors $r_y$ and $r_x$. Shifts along the x and y axes are defined by
the integers $\delta_{y,l} = \lfloor N_y d_{y,l} \rfloor$ and
$\delta_{x,l} = \lfloor N_x d_{x,l} \rfloor$, where $d_{y,l}$ and $d_{x,l}$ are
elements of the vectors $d_y$ and $d_x$ respectively. Overlapping windows are
obtained if $d_{y,l} < r_{y,l}$ and/or $d_{x,l} < r_{x,l}$. The total number of
sub-windows is equal to

\[
N_s = \sum_{l=1}^{L} \left\lfloor \frac{1 - r_{y,l}}{d_{y,l}} + 1 \right\rfloor \cdot \left\lfloor \frac{1 - r_{x,l}}{d_{x,l}} + 1 \right\rfloor.
\]
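
The pyramid layout can be sketched in a few lines of Python. This is only an illustration under two assumptions: the number of windows per level and axis is read as floor((1 - r) / d + 1), and the level weight is read as c_l = (max_j h_j / h_l) * (max_j w_j / w_l). The function name `pyramid_windows` and the small epsilon guarding against floating-point rounding are hypothetical choices.

```python
import math

def pyramid_windows(ny, nx, ry, rx, dy, dx):
    """Enumerate (y0, x0, h, w, weight) for every sub-window of every level."""
    eps = 1e-9                                   # guards floor() against rounding
    hs = [math.floor(ny * r) for r in ry]        # window heights h_l
    ws = [math.floor(nx * r) for r in rx]        # window widths  w_l
    windows = []
    for l in range(len(ry)):
        h, w = hs[l], ws[l]
        step_y = max(1, math.floor(ny * dy[l]))  # shift delta_{y,l}
        step_x = max(1, math.floor(nx * dx[l]))  # shift delta_{x,l}
        n_y = math.floor((1 - ry[l]) / dy[l] + 1 + eps)
        n_x = math.floor((1 - rx[l]) / dx[l] + 1 + eps)
        # Assumed reading of the level weight c_l.
        c = (max(hs) / h) * (max(ws) / w)
        for i in range(n_y):
            for j in range(n_x):
                windows.append((i * step_y, j * step_x, h, w, c))
    return windows

r = [1.0, 0.5, 0.25]
wins = pyramid_windows(256, 256, r, r, r, r)
print(len(wins))   # number of sub-windows N_s for this pyramid
```

For r = d = (1, 1/2, 1/4) this sketch enumerates 1 + 4 + 16 = 21 windows on a 256 × 256 image. Under this reading, a 256-bin descriptor computed at two scales would have 256 · 21 · 2 = 10752 dimensions; the number of scales is not stated in the text, so S = 2 is only a guess.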


2.1      The spHOEE Feature
Following [DT, MBM], a histogram of the L1-normalized oriented edge energy
filter responses is constructed for the $N_o$ different orientations. These
responses are obtained by convolving the gray image with two odd elongated
oriented filters (horizontal and vertical gradients) at scale σ. Magnitudes,
L1-normalized over a block of size $h_n \times w_n$, and signed angles are
computed from these gradients. The histogram of each of the $N_s$ sub-windows
is computed efficiently thanks to the integral histogram method. The total
dimension of the feature vector is $d = N_s N_o$. The spHOEE feature (a.k.a.
spHOG in [MBM, MB]) offers state-of-the-art performance on databases such as
CALTECH 101 or INRIA pedestrians.
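
The integral histogram idea can be sketched as follows in Python. This is a simplified illustration, not the filters described above: gradients come from `np.gradient` (plain finite differences) rather than odd elongated oriented filters at scale σ, and the L1 normalization is applied per sub-window rather than over h_n × w_n blocks. The names `integral_orientation_histogram` and `window_histogram` are hypothetical.

```python
import numpy as np

def integral_orientation_histogram(gray, n_orient=8):
    """Return an (H+1, W+1, n_orient) integral histogram of edge energy."""
    gy, gx = np.gradient(gray.astype(float))     # finite-difference gradients
    mag = np.hypot(gx, gy)                       # edge energy
    ang = np.arctan2(gy, gx)                     # signed angle in [-pi, pi]
    bins = np.floor((ang + np.pi) / (2 * np.pi) * n_orient).astype(int) % n_orient
    H, W = gray.shape
    votes = np.zeros((H, W, n_orient))
    # Each pixel votes with its magnitude in its own orientation bin.
    votes[np.arange(H)[:, None], np.arange(W)[None, :], bins] = mag
    ih = np.zeros((H + 1, W + 1, n_orient))
    ih[1:, 1:] = votes.cumsum(axis=0).cumsum(axis=1)
    return ih

def window_histogram(ih, y0, x0, h, w):
    """L1-normalized orientation histogram of a window, from four lookups."""
    hist = ih[y0 + h, x0 + w] - ih[y0, x0 + w] - ih[y0 + h, x0] + ih[y0, x0]
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Once the integral histogram is built, the cost of each sub-window histogram no longer depends on the window size, which is what makes a pyramid with many overlapping windows affordable.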

2.2      The novel spELBP Feature
Local Binary Patterns (LBP) are powerful parametric descriptors encoding the
relation between the intensity of a central pixel and the intensities of its 8
adjacent neighbors (see [LZL+07]). Widely used in face recognition (see
[SGM09]), LBP have also shown their efficiency in scene categorization ([WR])
compared to BoW with SIFT descriptors. In [LZL+07], a multi-scale extension
(MSLBP) encodes the relation of a central block of pixels of size s × s with
its 8 neighboring blocks, capturing more global details. Each block area is
computed with the help of the integral image. We propose here a spatial
pyramid architecture for the MSLBP, called spELBP. This novel descriptor
captures details at scale s at given sub-window locations. Let S be the number
of scales; the total dimension of the spELBP descriptor is $d = 256 N_s S$.
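
A minimal Python sketch of the block-based LBP coding is given here for a single window covering the whole image. It assumes that a neighbor bit is set when the neighboring block sum is greater than or equal to the central block sum and that bits are ordered clockwise from the top-left neighbor; both conventions, as well as the names `block_lbp_codes` and `multiscale_lbp_histogram`, are illustrative assumptions.

```python
import numpy as np

def block_lbp_codes(gray, s):
    """8-bit codes comparing each s x s block sum with its 8 neighboring blocks."""
    g = gray.astype(float)
    ii = np.zeros((g.shape[0] + 1, g.shape[1] + 1))
    ii[1:, 1:] = g.cumsum(axis=0).cumsum(axis=1)          # integral image
    # sums[y, x] = sum of the s x s block whose top-left corner is (y, x).
    sums = ii[s:, s:] - ii[:-s, s:] - ii[s:, :-s] + ii[:-s, :-s]
    H, W = sums.shape
    center = sums[s:H - s, s:W - s]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = sums[s + dy * s:H - s + dy * s, s + dx * s:W - s + dx * s]
        codes |= (neigh >= center).astype(np.uint8) * np.uint8(1 << bit)
    return codes

def multiscale_lbp_histogram(gray, scales=(1, 2)):
    """Concatenate one 256-bin histogram per scale (dimension 256 * S)."""
    hists = [np.bincount(block_lbp_codes(gray, s).ravel(), minlength=256)
             for s in scales]
    return np.concatenate(hists)
```

With s = 1 the code reduces to the usual 8-neighbor LBP on single pixels. A pyramidal version would compute one such histogram per sub-window and per scale, weight it, and concatenate the results.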

2.3   The spELBOP Feature
This novel descriptor is derived from the previous two. Here, instead of
encoding the raw pixel values of a block of size s, we propose to encode the
orientations of the block. As with the spHOEE features, orientations are
retrieved by i) convolving the image with the two odd elongated oriented
filters at scale σ and ii) computing the signed angles. The total dimension of
the spELBOP descriptor is the same as that of the spELBP, i.e.
$d = 256 N_s S$.


3     Large Scale SVM
Learning a topic (i.e. a visual concept) with the one-against-all approach is
equivalent to a binary supervised classification task. We deal with a training
set $D = \{(x_i, y_i)\}$, $i = 1, \dots, N$, where $x_i \in \mathbb{R}^d$
represents a feature vector and $y_i \in \{-1, 1\}$ its corresponding label.
Max-margin classifiers like SVM are known to offer state-of-the-art
performance. However, with high-dimensional feature vectors and numerous
examples, training an SVM can be too computationally expensive
($\sim O(dN^3)$). For large-scale problems, one alternative is to use a
max-margin linear classifier, which often offers performance comparable to the
non-linear version [LWK, WYKY+]. The linear SVM used here consists in finding
the hyperplane parameter w minimizing the sum of an L2 loss function and an L2
regularization term:
\[
\min_{w} \left\{ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \max\left(1 - y_i w^T x_i,\; 0\right)^2 \right\}. \tag{1}
\]

In [LWK], the problem is solved with a Trust Region Newton algorithm (TRON).
We use the modified version of TRON proposed in [MBM], which handles dense
features.
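
The objective of equation (1) and its gradient are easy to write down in Python. The sketch below minimizes it with plain gradient descent using a fixed step derived from a Lipschitz bound; this is a deliberately simple stand-in and not the TRON solver used in our system. The function names and the iteration count are assumptions for illustration.

```python
import numpy as np

def l2svm_objective(w, X, y, C):
    """0.5 * w'w + C * sum_i max(1 - y_i w'x_i, 0)^2, as in equation (1)."""
    margins = np.maximum(1.0 - y * (X @ w), 0.0)
    return 0.5 * (w @ w) + C * np.sum(margins ** 2)

def l2svm_gradient(w, X, y, C):
    """Gradient of the objective; the squared hinge loss is differentiable."""
    margins = np.maximum(1.0 - y * (X @ w), 0.0)
    return w - 2.0 * C * (X.T @ (margins * y))

def train_l2svm(X, y, C=1.0, iters=500):
    """Gradient descent with step 1/L, where L = 1 + 2C * ||X||_2^2."""
    step = 1.0 / (1.0 + 2.0 * C * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= step * l2svm_gradient(w, X, y, C)
    return w

# Tiny linearly separable example.
X = np.array([[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_l2svm(X, y, C=1.0)
print(np.sign(X @ w))
```

A trust region Newton method exploits the same gradient together with generalized Hessian-vector products, which is why it needs far fewer iterations than this first-order loop on high-dimensional data.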


4     Results at VCDT 2010
In the CLEF VCDT 2010 task, preliminary tests on the development set led us to
select the spELBP feature for our final system. We chose an L = 3 level
pyramid with $r_x = r_y = d_x = d_y = [1, 1/2, 1/4]^T$, leading to $N_s = 42$
sub-windows and a total of d = 10752 dimensions for this feature.
    For each topic, the hyperparameters C or λ of the classifiers are tuned
with 5-fold cross-validation by minimizing the Balanced Error Rate (BER). Then
the models are learned on the entire training sets.
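
The tuning loop can be sketched as follows in Python. The BER is taken here as the mean of the false negative and false positive rates; `train_fn` is a hypothetical callback returning a weight vector for given data and C, and the fold construction by random permutation is an assumption, not a description of our actual splits.

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Mean of the error rates on the positive and on the negative class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = y_true == 1, y_true == -1
    fnr = np.mean(y_pred[pos] != 1) if pos.any() else 0.0
    fpr = np.mean(y_pred[neg] != -1) if neg.any() else 0.0
    return 0.5 * (fnr + fpr)

def select_C(X, y, candidates, train_fn, k=5, seed=0):
    """Return the C with the lowest mean BER over k cross-validation folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    best_C, best_ber = None, np.inf
    for C in candidates:
        bers = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            w = train_fn(X[trn], y[trn], C)          # hypothetical trainer
            pred = np.where(X[val] @ w >= 0, 1, -1)
            bers.append(balanced_error_rate(y[val], pred))
        if np.mean(bers) < best_ber:
            best_C, best_ber = C, float(np.mean(bers))
    return best_C, best_ber
```

The BER is preferred to the plain error rate here because one-against-all problems are strongly unbalanced: a classifier that always answers "absent" would have a low error rate but a BER of 0.5.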
    All our runs are visual-only runs (we only use the image pixels to detect
the topics).
    Over the 15 participating teams, our system ranks 8th for the
example-based evaluation using the F-measure. For the second evaluation, our
system ranks 9th according to the concept-based Mean Average Precision (MAP)
(this measure showed better characteristics than the EER and AUC in a recent
study).




         Fig. 1. Average Precision of the best LSIS run for all the topics.



    Score details are given at:
http://www.imageclef.org/2010/PhotoAnnotationExampleEvaluationResults
Below we list the best run of each team, sorted by rank, with the F-measure,
the team and the run ID:

Rank  F-measure  Team      Run ID
1     0.680070   ISIS      1276866836402 uva-isis-mkl-mixed-mixed.txt binary.txt
2     0.639441   XRCE      1277144880578 xrce SVM EF Visual.txt binary.txt
3     0.634064   HHI       1277376108118 ic 10 test s eiq space.txt binary.txt
4     0.596141   IJS       1277145320629 final ijs feit run1.txt binary.txt
5     0.581652   LEAR      1277147818075 lear TP Visual CVF 0 D6.txt binary.txt
6     0.558674   MEIJI     1276044391982 AUTO VisualOnly BagOfVisualWords of meiji.txt binary.txt
7     0.530841   Romania   1276777329876 run CR+lapl2inv.txt binary.txt
8     0.530317   LSIS      1277226430977 DYNI LSIS RUN2 COPIE.txt binary.txt
9     0.482390   WROCLAW   1277754391699 imageClef2010-grid20x20-xy rgb dev hes.quick matrix...
10    0.476983   LIG       1277153756343 clefResults.txt binary.txt
11    0.450934   CEALIST   1277046611397 cealist fastSB 6600.submission.txt binary.txt
12    0.427661   BPACAD    1277129525816 bp acad hoggmm.txt binary.txt
13    0.224987   MLKD      1277149221968 Visual2.txt binary.txt
14    0.208564   INSUNHIT  1277043267771 finalresult50.txt binary.txt
15    0.174392   UPMC      1277139769194 output multiviewFINAL.txt binary.txt

Fig. 2. Average Precision of the best ISIS run versus best LSIS run for all the topics.

   The best MAP run is also an ISIS run, with 0.407, while the MAP of the best
LSIS run equals 0.234.
   Figure 1 gives the Average Precision (AP) of the best LSIS run for each
topic. In order to compare our scores to the state of the art, we plot in
Figure 2 the AP of the best run of the challenge versus ours. We clearly see
that the scores are well correlated, and that the LSIS AP is below (or, for
some topics, equal to) the ISIS AP. This loss of precision for the LSIS run is
shown for each topic in Figure 3. The difference is about 0.15 points of AP on
average.
Fig. 3. Difference, for each topic, of the Average Precision between the best LSIS run
and the best run of VCDT2010 (from ISIS).



5      Conclusion and perspectives
In this paper, the VCDT 2010 challenge is addressed with a large-scale
classifier trained on multi-level descriptors; our system is ranked 8th for
the F-measure evaluation over the 15 teams. Training these classifiers is
extremely fast compared to the classical SVM counterpart. The Average
Precision of our system is below that of the best run of the challenge, with a
nearly constant gap of around 0.15. Since this gap is similar across topics,
our approach seems to generalize well to all topics and could be improved
globally. Further improvements can be expected by adding dense SIFT
descriptors with sparse learning and spatial pooling (see [WYKY+]), and by
using Multiple Kernel Learning to fuse features.


References
[DT]      Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human
          detection. In CVPR’05.
[HS88]    C. Harris and M. Stephens. A combined corner and edge detector. In Proc
          4th Alvey Vision Conf, 1988.
[Low]     David G. Lowe. Object recognition from local scale-invariant features. In
          ICCV’99.
[LSP]     Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of
          features: Spatial pyramid matching for recognizing natural scene
          categories. In CVPR’06.
[LWK]     Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region newton
          method for logistic regression. J. Mach. Learn. Res., 9.
[LZL+07]  ShengCai Liao, XiangXin Zhu, Zhen Lei, Lun Zhang, and Stan Z. Li.
          Learning multi-scale block local binary patterns for face
          recognition. In ICB, 2007.
[MB]      S. Maji and A.C. Berg. Max-margin additive classifiers for detection.
          ICCV’09.
[MBM]     S. Maji, A.C. Berg, and J. Malik. Classification using intersection kernel
          support vector machines is efficient. CVPR’08, June.
[NH10]    S. Nowak and M. Huiskes. New strategies for image annotation: Overview
          of the photo annotation task at imageclef 2010. In Working notes of CLEF
          2010, 2010.
[SGM09] Caifeng Shan, Shaogang Gong, and Peter W. McOwan. Facial expression
          recognition based on local binary patterns: A comprehensive study. Image
          Vision Comput., 27(6), 2009.
[WR]      Jianxin Wu and James M. Rehg. Where am I: Place instance and category
          recognition using spatial PACT. CVPR’08.
[WYKY+]   Jinjun Wang, Jianchao Yang, Fengjun Lv, Kai Yu, Thomas Huang, and
          Yihong Gong. Locality-constrained linear coding for image
          classification. CVPR’10.
[YYGH] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas S. Huang. Linear spatial
          pyramid matching using sparse coding for image classification. In CVPR’09.
[ZZY+]    Xi Zhou, Xiaodan Zhuang, Shuicheng Yan, Shih-Fu Chang, Mark
          Hasegawa-Johnson, and Thomas S. Huang. Sift-bag kernel for video
          event analysis. In MM’08. ACM.