=Paper= {{Paper |id=Vol-1458/E26_CRC35_RABOUY |storemode=property |title=New Graph Regularized Sparse Coding Improving Automatic Image Annotation |pdfUrl=https://ceur-ws.org/Vol-1458/E26_CRC35_RABOUY.pdf |volume=Vol-1458 |dblpUrl=https://dblp.org/rec/conf/lwa/RabouyPG15 }} ==New Graph Regularized Sparse Coding Improving Automatic Image Annotation== https://ceur-ws.org/Vol-1458/E26_CRC35_RABOUY.pdf
      New Graph regularized Sparse Coding Improving
              Automatic Image Annotation

            Céline RABOUY1,2 , Sébastien PARIS1,2 and Hervé GLOTIN1,2,3
     1 Aix-Marseille Université, CNRS, ENSAM, LSIS UMR 7296, 13397 Marseille, France
           2 Université de Toulon, CNRS, LSIS UMR 7296, 83957 La Garde, France
                     3 Institut Universitaire de France, 75005 Paris, France

                      {celine.rabouy,sebastien.paris}@lsis.org
                                 glotin@univ-tln.fr



        Abstract. Typical image classification pipeline for shallow architecture can be
        summarized by the following three main steps: i) a projection in high dimen-
        sional space of local features, ii) sparse constraints for the encoding scheme and
        iii) a pooling operation to obtain a global representation invariant to common
        transformation. Sparse Coding (SC) framework is one particular example of this
        general approach. The main problem raised by it is the local feature encoding
        which is done independently, loosing correlation of the input space. In this work
        we propose to simultaneously encode sparse codes to tackle this problem with
        Joint Sparse Coding (JSC) inspired by Graph regularized Sparse Coding (GSC).
        We experiment SC, GSC and JSC on UIUCsports and scenes15 database. We
        will show that results obtained, for UIUCsports, with SC (87.27 ± 1.33), JSC
        (84.17 ± 1.57) and the State-of-the-Art (88.47 ± 2.32 [23]) are tackled by a sim-
        ple fusion (95.37 ± 1.29). Several assumptions will be advanced to explain this
        phenomenon which can’t be generalized.

        Keywords: Scenes categorization, Sparse Coding, Graph regularized Sparse Cod-
        ing, Dictionary Learning, Scale Invariant Feature Transform, Spatial Pyramid
        Matching, Joint Sparse Coding.


1     Introduction
In the field of computer vision and signal processing, significant progress has been made
since the 2000s with more general methods such as Bag of Words (BoW) [19]. We have
at our disposal a significant number of databases as, for example, UIUCsportss [11],
scenes from 15 databases [8], where the goal is to label images into a finite number of
classes. The first way could be to evaluate the metric distance between two images. Un-
fortunately, due to the high dimensionality of this input space, most of these distances
are concentrated into a sub-manifold whatever the image class, making the discrimi-
nation by direct distances not robust. To overcome this problem, a solution has to be
    Copyright c 2015 by the papers authors. Copying permitted only for private and academic
    purposes. In: R. Bergmann, S. Gorg, G. Muller (Eds.): Proceedings of the LWA 2015 Work-
    shops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9. October 2015, published at
    http://ceur-ws.org




                                            193
designed to find a general application Ψ j (.; µ j ) with parameter µ j which characterizes
the class C j satisfying:

                dist(Ψ j (I1 ; µ j ), Ψ j (I2 ; µ j )) → 0 if I1 ∈ C j and I2 ∈ C j
              
                                                                                    (1)
                dist(Ψ j (I1 ; µ j ), Ψ j (I2 ; µ j )) → ∞ if I1 ∈ C j and I2 ∈
                                                                              / C j,

where I1 and I2 are two images. The choice of Ψ j represents a trade-off between its
representation capacity versus the µ j optimization difficulty. In general, in order to es-
timate/optimize µ j , we have to start from a local representation (patches) x ∈ Rd to
obtain the global representation Ψ j (.; µ j ). From Ψ j associated to BoW, Sparse Coding
(SC) [21], up to ConvNet [3, 9] follow the three main procedures: i) high dimension
local feature projection, ii) sparsity constraints into the representation model and iii)
non-linearity operation and pooling to obtain a global invariant representation.
    In this article, we will focus on a new formulation of encoding method, which corre-
sponds more specifically to procedure ii), inspired by SC and more generally by Graph
regularized Sparse Coding (GSC) [25]. This new formulation allows to encode simul-
taneously testing patches as with the GSC model which has good properties. Although
we will only work on a single layer, we will show that a simple fusion will allow to
improve considerably the classification accuracy and that our results will be close to
CNN (convolutional neural nets) [6, 18] initialized on Image Net as shown in [3]. This
article is divided into five parts. The first part focuses on SC models and its derivatives
(GSC especially). The second part presents our modeling Joint Sparse Coding (JSC).
The third part presents Graph regularized Sparse Coding (GSC) dictionary inspired by
[13]. A fourth part presents results we obtained on UIUCsports and scenes15 databases
and in the last part, we conclude on our contribution.


2    Related Works

In this part, we will focus on the encoding step using linear coding to reconstruct
inputs. An approximation of any patches x ∈ Rd can be given by xi = Dαi , where
D , [d1 , . . . , dK ] ∈ Rd×K is a given/trained dictionary where ∀k = 1, . . . , K, kdTk dk k22 = 1
      j
and dk ≥ 0. A patch is a vector extracted from an image. A dictionary is a matrix of
“words” allowing the patch reconstruction. In many encoding methods, three common
steps can be found: i) a projection into a higher dimension space with (K >> d) ii)
sparse constraints and iii) a non-linear operation procedure. If α∗i is obtained with Or-
dinary Least Square (OLS), the solution is full dense (all elements are non zero). One
way to get around this problem is the use of the `1 -norm constraint which corresponds
to Lasso problem [21] or Basis Pursuit [4]:

                                                 1
                      LSC (αi |xi ; D) = minK kxi − Dαi k22 + λkαi k1 ,                         (2)
                                         αi ∈R   2

with λ the regularization parameter associated to the SC formulation. This parameter
controls the sparsity level as is shown in [15]. Thus, the more λ is large, the more α∗i
(solution of eq.2) will be sparse.




                                             194
    Usually in SC framework, if we take two neighbor patches xi and x j (with a strong
correlation between them), their respective sparse codes, αi and α j , can lose this strong
correlation, especially indexes of non-zero inputs can completely mismatch. It means
they are involving different atoms for their patches’ reconstructions. An atom is an ele-
ment of the vector patch. There exist some SC variations which have been introduced to
tackle this behaviour. Principles of this improvement can be divided into two categories:
one plays on adding of proximity constraint into the loss directly while the second adds
some extra terms into the regularization term. To illustrate the first category, we can
cite two approaches: Local Constrained linear Coding (LCC) [24] and the Local Sparse
Coding (LSC) [20]. In the second category, we can mention GSC [25].
    We will define the set of pre-computed sparse codes of Xtrain , {xtrain1   , . . . , xtrain
                                                                                          N train
                                                                                                  }
by A  train     train        train
            , {α1 , . . . , αNtrain } where N train designates the number of local features
sampled from the training set. Indeed, this adds a spatial constraint in the regularization
term. Its equation is:
    LGSC (αi |xi , Atrain ; D, λ, β) = minK kxi − Dαi k22 + λkαi k1 + βLii αTi αi + 2βαTi hi , (3)
                                        αi ∈R

                 N train
where hi = ∑ Li j αtrain
                              
                    j    , L = Li j i, j=1,...,Ntrain is a Laplacian matrix and β a regu-
                   j6=i
larization parameter. The matrix L is defined by L = S − W, where W is a weight
                                        kxi −xtrain k2
matrix with and Wi, j = exp{−    σ2
                                   j   2
                                         } if xtrain
                                                j    ∈ V (xi ) (where V (xi ) is the set of
neighborhood of xi excluding xi itself), Wi, j = 0 else. The matrix S is diagonal and
       N train
Si,i = ∑ Wi, j . We propose to improve SC by simultaneously encoding all the test local
         j=1
patches (for example associated with a test image). This new modeling will be inspired
from the GSC.

3     Joint Sparse Coding - JSC
JSC principle is to jointly encode all local features Xtest = {xtest         test
                                                                1 , . . . , xN test } simultane-
                                                                     k
ously to overcome the decorrelation problem. We also enforce αi ≥ 0 in the previous
optimization problem. This additional constraint improves pooling performances, thus
avoiding to pool simultaneously on positive and negative sparse code values and de-
creasing as a consequence the final size vector by a factor by two. The equation of our
modeling is very similar to GSC:
LJSC (αi |xi , Atest ; D, λ) = minK kxi − Dαi k22 + λkαi k1 + βLii αTi αi + 2βαTi hi , s.t. αki ≥ 0,
                                αi ∈R
                                                                                         (4)
                 N test
where hi = ∑ Li j αtest
                           
                    j , L = Li j i, j=1,...,N test is a Laplacian matrix, β a regularization
                  j6=i
                                                          kxi −xtest 2
                                                                 j k2
parameter. Here, L = S − W, where Wi, j = exp{−                σ2
                                                                       } if xtest
                                                                              j ∈ V (xi ), Wi, j = 0
                      N test
else and Si,i = ∑ Wi, j . Here, Atest , {αtest         test
                                          1 , . . . , αN test } are computed and stacked ini-
                          j=1
tially. In practice N test << N train , so we need to store only a sparse K × N test matrix.




                                                   195
Our Laplacian matrix (N test × N test ) is very sparse. If we don’t need to compute the
full matrix, one way is to only calculate the non-zero elements ((v + 1) × N test ) with
the previous formulation. Each column of this ((v + 1) × N test ) matrix is denoted by
Li . To realize this, we use a fast NN-search technical (FLANN) [14] which speeds up
the computation considerably. Thus, the solution of eq.4 is given by a modified Fea-
ture Sign Search (FSS) algorithm [10] by adding a) a positivity constraint on sparse
codes and b) integrating the two right terms (in β) of eq.4 in the gradient formula-
tion used during the FSS algorithm. JSC is given by the algorithm 1. To illustrate the


Algorithm 1 Joint Sparse Coding
  Inputs: D, λ, β, Xtest , σ and v
  for i = 1 : N test do
    [Vi , disti ] = v-nn search of xtest
                                       i   into Xtest
    Vi are indexes of xi neighbors in Xtest
    Compute Li from disti and σ
  end for
  Atest = lasso(Xtest ; D, λ)
  for i = 1 : N test do
    αi = JSC(xtest i ,A
                        test , D, L , V , λ, β)
                                   i i
  end for
  Output: Atest



correlation problem, viewed with SC, we compare the normalized correlation computed
between two inputs vectors with the normalized correlation computed with their respec-
tive output vectors. In this example, 300 different pairs, extracted from UIUCsports lo-
cal features, are chosen to realize this. The normalized correlation formulation between
                                      Ty
x and y is given by ρ(x, y) = kxkx kyk       ∈ [0, 1]. We also introduce the scalar value
                                           2      2
   2             300 j