<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linear SVM for new Pyramidal Multi-Level Visual only Concept Detection in CLEF 2010 Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sébastien PARIS</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hervé GLOTIN</string-name>
          <email>glotin@univ-tln.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LSIS DYNI, Univ du Sud-Toulon-Var</institution>,
          <addr-line>av de l'Université - BP 20132, 83957 LA GARDE</addr-line>,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LSIS DYNI, Univ Paul Cézanne</institution>,
          <addr-line>av Escadrille Normandie-Niemen, MARSEILLE CEDEX</addr-line>,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>For the Visual Concept Detection Task of the CLEF 2010 Challenge, using only visual information, we propose a novel multi-level spatial pyramid (sp) feature: the spELBP (Extended Local Binary Pattern). In this paper we first present this feature and a few similar ones: the spELBOP (Extended Local Binary Orientation Pattern) and the spHOEE (Histogram of Oriented Edge Energy). We then discuss why our features are well suited to feeding state-of-the-art linear SVM algorithms for concept detection. Over the 15 participating teams, our scores rank 8th according to the F-measure evaluation and 9th according to the MAP evaluation. We compare each topic score to the best system, and we finally discuss further extensions of our approach.<sup>1</sup></p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The VCDT 2010 challenge (see [NH10]) consists in the detection of visual concepts in images.</p>
      <p>Basically, this challenge can be considered as a supervised classification problem, more precisely as training models on efficient features with a ”one-against-all” approach. In recent years in computer vision, in order to reduce the semantic gap in the object categorization problem, two popular approaches have emerged offering efficient performances. The first one, a.k.a. ”Bag of Words” (BoW) (see [ZZY+,YYGH]), consists in building a dictionary of visual words given a large pool of feature vectors, usually SIFT descriptors [Low]. SIFT descriptors can be computed over a regular spatial grid or at the interest points output by specific detectors (corners, edges, blobs, . . . ) such as the Harris or Lowe detectors [HS88,Low]. A dictionary learning step follows, usually done by vector quantization of the whole pool of feature vectors, typically with a K-means or GMM algorithm [YYGH]. More efficient dictionaries can be retrieved with sparse learning tools [WYKY+].</p>
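      <p>As a minimal sketch of this BoW pipeline (Python with scikit-learn assumed; descriptor extraction is taken as given and the dictionary size is illustrative, not the system used in our runs):</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def build_bow_histograms(descriptors_per_image, n_words=1000, seed=0):
    """Quantize local descriptors into visual words, one histogram per image.

    descriptors_per_image: list of (n_i, 128) arrays of SIFT descriptors.
    Returns an (n_images, n_words) array of L1-normalized BoW histograms.
    """
    pool = np.vstack(descriptors_per_image)        # pool all descriptors
    kmeans = KMeans(n_clusters=n_words, random_state=seed).fit(pool)
    hists = []
    for desc in descriptors_per_image:
        words = kmeans.predict(desc)               # nearest visual word
        h = np.bincount(words, minlength=n_words).astype(float)
        hists.append(h / max(h.sum(), 1.0))        # L1 normalization
    return np.array(hists)
      </preformat>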
      <p>The second approach is based on Local Binary Pattern (LBP) descriptors (a.k.a. CENTRIST in [WR]). The feature vector is defined by the occurrences of each of the 256 patterns encoding the neighborhood relation to a central pixel.
<sup>1</sup> This work has been supported by the AVEIR project, ANR-06-MDCA-002 (ANR).</p>
      <p>For both approaches, adding a multi-level pyramidal architecture considerably improves performance. This technique divides the image into sub-windows and adequately weights each corresponding feature vector before concatenation (see [LSP]). The price of this kind of architecture is having to deal with much larger vectors as input to the classifiers. Large-scale binary supervised classification problems arise naturally with these descriptors.</p>
      <p>The next section describes more precisely the descriptors we developed for the challenge, especially the novel spELBOP descriptor. The third section overviews the large-scale binary classifier we use: the linear SVM classifier trained with TRON (L2 regularized with an L2 loss function).</p>
    </sec>
    <sec id="sec-2">
      <title>Pyramidal Multi-Level Features</title>
      <p>For each of the three following descriptors, a spatial pyramid architecture is used to divide the entire image $I$ into $N_s$ possibly overlapping sub-windows. More precisely, an $L$-level pyramid is defined for $l = 1, \dots, L$, where the image $I$ of size $N_y \times N_x$ is divided into possibly overlapping sub-windows of size $h_l \times w_l$. Histograms are computed for each sub-window and weighted by
$$c_l = \frac{h_l}{\max_{j=1,\dots,L}\{h_j\}} \cdot \frac{w_l}{\max_{j=1,\dots,L}\{w_j\}}.$$
Finally, the concatenation of the $N_s$ weighted histograms defines the global feature vector. In our implementation, $h_l = \lfloor N_y \cdot r_{y,l} \rfloor$ and $w_l = \lfloor N_x \cdot r_{x,l} \rfloor$, where $r_{y,l}$ and $r_{x,l}$ are elements of the vectors $r_y$ and $r_x$. Shifts along the y and x axes are defined by the integers $\delta_{y,l} = \lfloor N_y \cdot d_{y,l} \rfloor$ and $\delta_{x,l} = \lfloor N_x \cdot d_{x,l} \rfloor$, where $d_{y,l}$ and $d_{x,l}$ are elements of the vectors $d_y$ and $d_x$ respectively. Overlapping windows are obtained if $d_{y,l} \leq r_{y,l}$ and/or $d_{x,l} \leq r_{x,l}$. The total number of sub-windows is equal to
$$N_s = \sum_{l=1}^{L} \left\lfloor \frac{1 - r_{y,l}}{d_{y,l}} + 1 \right\rfloor \cdot \left\lfloor \frac{1 - r_{x,l}}{d_{x,l}} + 1 \right\rfloor.$$</p>
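      <p>As a minimal sketch of this layout (Python/NumPy assumed; variable names mirror the notation above and the ratio vectors are assumed strictly positive):</p>
      <preformat>
import numpy as np

def pyramid_layout(Ny, Nx, ry, rx, dy, dx):
    """Window sizes, weights c_l and sub-window count N_s per level."""
    hmax = max(int(np.floor(Ny * r)) for r in ry)   # max_j h_j
    wmax = max(int(np.floor(Nx * r)) for r in rx)   # max_j w_j
    levels, Ns = [], 0
    for ryl, rxl, dyl, dxl in zip(ry, rx, dy, dx):
        hl = int(np.floor(Ny * ryl))                # window height h_l
        wl = int(np.floor(Nx * rxl))                # window width  w_l
        cl = (hl / hmax) * (wl / wmax)              # level weight  c_l
        ny = int(np.floor((1 - ryl) / dyl + 1))     # window positions along y
        nx = int(np.floor((1 - rxl) / dxl + 1))     # window positions along x
        Ns += ny * nx
        levels.append((hl, wl, cl, ny * nx))
    return levels, Ns
      </preformat>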
      <sec id="sec-2-1">
        <title>The spHOEE Feature</title>
        <p>Following [DT,MBM], a histogram of the L1-normalized orientation edge energy filter responses is constructed for $N_o$ different orientations. These responses are obtained by convolving the gray image with two odd elongated oriented filters (horizontal and vertical gradients) at scale $\sigma$. L1-normalized magnitudes over blocks of size $h_n \times w_n$ and signed angles are computed from these gradients. Each of the $N_s$ sub-window histograms is computed efficiently thanks to the integral histogram method. The total dimension of the feature vector is $d = N_s \cdot N_o$. The spHOEE feature (a.k.a. spHOG in [MBM,MB]) offers state-of-the-art performance on databases such as CALTECH 101 or INRIA pedestrians.</p>
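        <p>A simplified sketch of such a histogram (Python with SciPy assumed; plain Sobel gradients stand in for the elongated oriented filters, and a single global L1 normalization replaces the $h_n \times w_n$ block normalization):</p>
        <preformat>
import numpy as np
from scipy.ndimage import sobel

def hoee_histogram(gray, n_orient=9, eps=1e-8):
    """Histogram of L1-normalized edge-energy responses over signed angles."""
    gy, gx = sobel(gray, axis=0), sobel(gray, axis=1)
    mag = np.abs(gx) + np.abs(gy)
    mag = mag / (mag.sum() + eps)                  # L1 normalization (global here)
    ang = np.arctan2(gy, gx)                       # signed angle in [-pi, pi]
    bins = np.floor((ang + np.pi) / (2 * np.pi) * n_orient).astype(int)
    bins = np.clip(bins, 0, n_orient - 1)
    hist = np.zeros(n_orient)
    np.add.at(hist, bins.ravel(), mag.ravel())     # magnitude-weighted votes
    return hist
        </preformat>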
      </sec>
      <sec id="sec-2-2">
        <title>The novel spELBP Feature</title>
        <p>Local Binary Patterns (LBP) are powerful parametric descriptors encoding the relation between the intensity of a central pixel and the intensities of its 8 adjacent neighbors (see [LZL+07]). Widely used in face recognition (see [SGM09]), LBP have also shown their efficiency in scene categorization ([WR]) compared to BoW with SIFT descriptors. In [LZL+07], a multi-scale extension (MSLBP) encodes the relation of a central block of pixels of size $s \times s$ with its 8 neighboring blocks, capturing more global details. The sum over each block area is computed with the help of the integral image. We propose here a spatial pyramid architecture for the MSLBP, so-called spELBP. This novel descriptor captures details at scale $s$ at a given sub-window location. Let $S$ be the number of scales; the total dimension of the spELBP descriptor is $d = 256 \cdot N_s \cdot S$.</p>
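        <p>A simplified sketch of the block LBP histogram underlying spELBP (Python/NumPy assumed; block means stand in for the integral-image computation, borders are handled naively, and the bit order of the 8 neighbors is an arbitrary but fixed convention):</p>
        <preformat>
import numpy as np

def block_lbp_histogram(gray, s=3):
    """256-bin histogram of block LBP codes at scale s (s x s block means)."""
    H, W = gray.shape
    by, bx = H // s, W // s                        # number of blocks per axis
    blocks = gray[:by * s, :bx * s].reshape(by, s, bx, s).mean(axis=(1, 3))
    center = blocks[1:-1, 1:-1]                    # central blocks
    code = np.zeros(center.shape, dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), # 8 neighbors, one bit each
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (oy, ox) in enumerate(offsets):
        nb = blocks[1 + oy:by - 1 + oy, 1 + ox:bx - 1 + ox]
        code += (nb >= center).astype(np.uint8) * np.uint8(2 ** bit)
    return np.bincount(code.ravel(), minlength=256)
        </preformat>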
      </sec>
      <sec id="sec-2-3">
        <title>The spELBOP Feature</title>
        <p>This novel descriptor is derived from the two previous ones. Here, instead of encoding the raw pixel values of a block of size $s$, we propose to encode the orientations of the block. As with the spHOEE feature, orientations are retrieved by i) convolving with the two odd elongated oriented filters at scale $\sigma$ and ii) computing the signed angles. The total dimension of the spELBOP descriptor is the same as the latter, i.e. $d = 256 \cdot N_s \cdot S$.</p>
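        <p>A corresponding sketch, reusing the two functions above: the block LBP encoding is simply applied to the signed orientation map instead of the raw intensities (again with Sobel gradients standing in for the oriented filters):</p>
        <preformat>
import numpy as np
from scipy.ndimage import sobel

def elbop_histogram(gray, s=3):
    """Block LBP histogram computed on the signed orientation map."""
    gy, gx = sobel(gray, axis=0), sobel(gray, axis=1)
    ang = np.arctan2(gy, gx)                       # signed orientations
    return block_lbp_histogram(ang, s=s)           # LBP over orientations
        </preformat>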
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Large Scale SVM</title>
      <p>Learning a topic with the one-against-all approach is equivalent to a binary supervised classification task. We deal with a training set $D = \{x_i, y_i\}$, $i = 1, \dots, N$, where $x_i \in \mathbb{R}^d$ represents a feature vector and $y_i \in \{-1, 1\}$ its corresponding label. Max-margin classifiers like SVM are known to offer state-of-the-art performance. However, with high-dimensional feature vectors and numerous examples, training an SVM can be too computationally expensive ($\sim O(dN^3)$). For large-scale problems, one alternative is to use a max-margin linear classifier, which often offers the same level of performance as the non-linear version [LWK,WYKY+]. The linear SVM used here consists in finding the hyperplane parameter $w$ minimizing the sum of an L2 loss function and an L2 regularization term:</p>
      <p>$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \max\left(1 - y_i w^T x_i,\ 0\right)^2. \qquad (1)$$</p>
      <p>In [LWK], this problem is solved with a Trust Region Newton algorithm (TRON). We use the modified version of TRON proposed by [MBM], which handles dense features.</p>
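      <p>For illustration, Eq. (1) is the L2-regularized, squared-hinge-loss linear SVM also implemented in LIBLINEAR; a minimal sketch with scikit-learn's LinearSVC (placeholder data, not our TRON implementation):</p>
      <preformat>
import numpy as np
from sklearn.svm import LinearSVC

# X: (N, d) dense feature matrix, y: labels in {-1, +1} for one topic
# (one-against-all). Placeholder data for illustration only.
rng = np.random.default_rng(0)
X = rng.random((200, 10752))
y = np.where(rng.random(200) > 0.5, 1, -1)

# penalty='l2' with loss='squared_hinge' matches Eq. (1)
clf = LinearSVC(penalty='l2', loss='squared_hinge', C=1.0)
clf.fit(X, y)
scores = clf.decision_function(X)                  # w^T x for each image
      </preformat>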
    </sec>
    <sec id="sec-4">
      <title>Results at VCDT 2010</title>
      <p>In the CLEF VCDT 2010 task, preliminary tests on the development set resulted in selecting the spELBP feature for our final system. We chose an $L = 3$ level pyramid with $r_x = r_y = d_x = d_y = \left[1, \frac{1}{2}, \frac{1}{4}\right]^T$, leading to $N_s = 42$ sub-windows and a total of $d = 10752$ dimensions for this feature.</p>
      <p>For each topic, the hyperparameters $C$ or $\lambda$ of the classifiers are tuned with a 5-fold cross-validation by minimizing the Balanced Error Rate (BER). Then the models are learned on the entire training set.</p>
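      <p>A sketch of this tuning loop for $C$ (scikit-learn assumed; the grid of candidate values is an assumption):</p>
      <preformat>
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

def ber(y_true, y_pred):
    """Balanced Error Rate: mean of the per-class error rates."""
    pos, neg = y_true == 1, y_true == -1
    fnr = np.mean(y_pred[pos] != 1)                # miss rate on positives
    fpr = np.mean(y_pred[neg] != -1)               # error rate on negatives
    return 0.5 * (fnr + fpr)

def tune_and_fit(X, y, grid=(0.01, 0.1, 1.0, 10.0)):
    """Pick C by 5-fold CV minimizing BER, then refit on the full set."""
    best_C = min(grid, key=lambda C: ber(
        np.asarray(y), cross_val_predict(LinearSVC(C=C), X, y, cv=5)))
    return LinearSVC(C=best_C).fit(X, y), best_C
      </preformat>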
      <p>All our runs are visual-only runs (we only use the image pixels to detect the topics).</p>
      <p>Over the 15 participating teams, and for the example-based evaluation applying the F-measure, our system ranks 8th. For the second evaluation, our system ranks 9th according to the concept-based Mean Average Precision (MAP) (this measure showed better characteristics than the EER and AUC in a recent study).</p>
      <p>Score details are given at http://www.imageclef.org/2010/PhotoAnnotationExampleEvaluationResults. We give below the list of each team's best run, sorted by rank, with the F-measure, team and run ID:
1 0.680070 ISIS 1276866836402 uva-isis-mkl-mixed-mixed.txt binary.txt
2 0.639441 XRCE 1277144880578 xrce SVM EF Visual.txt binary.txt
3 0.634064 HHI 1277376108118 ic 10 test s eiq space.txt binary.txt
4 0.596141 IJS 1277145320629 final ijs feit run1.txt binary.txt
5 0.581652 LEAR 1277147818075 lear TP Visual CVF 0 D6.txt binary.txt
6 0.558674 MEIJI 1276044391982 AUTO VisualOnly BagOfVisualWords of meiji.txt binary.txt
7 0.530841 Romania 1276777329876 run CR+lapl2inv.txt binary.txt
8 0.530317 LSIS 1277226430977 DYNI LSIS RUN2 COPIE.txt binary.txt
9 0.482390 WROCLAW 1277754391699 imageClef2010-grid20x20-xy rgb dev hes.quick matrix...
10 0.476983 LIG 1277153756343 clefResults.txt binary.txt
11 0.450934 CEALIST 1277046611397 cealist fastSB 6600.submission.txt binary.txt
12 0.427661 BPACAD 1277129525816 bp acad hoggmm.txt binary.txt
13 0.224987 MLKD 1277149221968 Visual2.txt binary.txt
14 0.208564 INSUNHIT 1277043267771 finalresult50.txt binary.txt
15 0.174392 UPMC 1277139769194 output multiviewFINAL.txt binary.txt</p>
      <p>The best MAP run is also an ISIS run, with 0.407, while the best LSIS run reaches a MAP of 0.234.</p>
      <p>Figure 1 gives the Average Precision (AP) for each topic for the best LSIS run. In order to compare our scores to the state of the art, we plot in figure 2 the AP of the best run of the challenge versus ours. We clearly see that the scores are well correlated, and that the LSIS AP are below (or for some topics equal to) the ISIS AP. This loss of precision for the LSIS run is shown for each topic in figure 3. We see that the difference is about 0.15 points of AP on average.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and perspectives</title>
      <p>In this paper, the VCDT 2010 challenge is addressed by a large-scale classifier trained on multi-level descriptors, ranked 8th for the F-measure evaluation over the 15 teams. Training these classifiers is extremely fast compared to the classical SVM counterpart. The Average Precision of our system is below the best run of the challenge with a nearly constant decrease of around 0.15. We can then assume that our approach could be globally improved, as it seems to generalize well to all topics. Further improvements can be expected by adding dense SIFT descriptors with sparse learning and spatial pooling (see [WYKY+]), and by using Multiple Kernel Learning methods for fusing features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="DT"><label>[DT]</label><mixed-citation>Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR'05.</mixed-citation></ref>
      <ref id="HS88"><label>[HS88]</label><mixed-citation>C. Harris and M. Stephens. A combined corner and edge detector. In Proc 4th Alvey Vision Conf, 1988.</mixed-citation></ref>
      <ref id="Low"><label>[Low]</label><mixed-citation>David G. Lowe. Object recognition from local scale-invariant features. In ICCV'99.</mixed-citation></ref>
      <ref id="LSP"><label>[LSP]</label><mixed-citation>Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR'06.</mixed-citation></ref>
      <ref id="LWK"><label>[LWK]</label><mixed-citation>Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region Newton method for logistic regression. J. Mach. Learn. Res., 9.</mixed-citation></ref>
      <ref id="LZL07"><label>[LZL+07]</label><mixed-citation>ShengCai Liao, XiangXin Zhu, Zhen Lei, Lun Zhang, and Stan Z. Li. Learning multi-scale block local binary patterns for face recognition. In ICB, 2007.</mixed-citation></ref>
      <ref id="MB"><label>[MB]</label><mixed-citation>S. Maji and A.C. Berg. Max-margin additive classifiers for detection. In ICCV'09.</mixed-citation></ref>
      <ref id="MBM"><label>[MBM]</label><mixed-citation>S. Maji, A.C. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR'08, June.</mixed-citation></ref>
      <ref id="NH10"><label>[NH10]</label><mixed-citation>S. Nowak and M. Huiskes. New strategies for image annotation: Overview of the photo annotation task at ImageCLEF 2010. In Working notes of CLEF 2010, 2010.</mixed-citation></ref>
      <ref id="SGM09"><label>[SGM09]</label><mixed-citation>Caifeng Shan, Shaogang Gong, and Peter W. McOwan. Facial expression recognition based on local binary patterns: A comprehensive study. Image Vision Comput., 27(6), 2009.</mixed-citation></ref>
      <ref id="WR"><label>[WR]</label><mixed-citation>Jianxin Wu and James M. Rehg. Where am I: Place instance and category recognition using spatial PACT. In CVPR'08.</mixed-citation></ref>
      <ref id="WYKY"><label>[WYKY+]</label><mixed-citation>Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In CVPR'10.</mixed-citation></ref>
      <ref id="YYGH"><label>[YYGH]</label><mixed-citation>Jianchao Yang, Kai Yu, Yihong Gong, and Thomas S. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR'09.</mixed-citation></ref>
      <ref id="ZZY"><label>[ZZY+]</label><mixed-citation>Xi Zhou, Xiaodan Zhuang, Shuicheng Yan, Shih-Fu Chang, Mark Hasegawa-Johnson, and Thomas S. Huang. SIFT-bag kernel for video event analysis. In MM'08. ACM.</mixed-citation></ref>
    </ref-list>
  </back>
</article>