Visual Scenes Clustering Using Variational Incremental Learning of Infinite Generalized Dirichlet Mixture Models

Wentao Fan
Electrical and Computer Engineering
Concordia University, Canada
wenta_fa@encs.concordia.ca

Nizar Bouguila
Institute for Information Systems Engineering
Concordia University, Canada
nizar.bouguila@concordia.ca

Abstract

In this paper, we develop a clustering approach based on variational incremental learning of a Dirichlet process of generalized Dirichlet (GD) distributions. Our approach is built on nonparametric Bayesian analysis, where the determination of the complexity of the mixture model (i.e. the number of components) is sidestepped by assuming an infinite number of mixture components. By leveraging an incremental variational inference algorithm, the model complexity and all of the involved model parameters are estimated simultaneously and effectively in a single optimization framework. Moreover, thanks to its incremental nature and Bayesian roots, the proposed framework avoids over- and under-fitting problems and offers good generalization capabilities. The effectiveness of the proposed approach is tested on a challenging application involving visual scenes clustering.

1 Introduction

Incremental clustering plays a crucial role in many data mining and computer vision applications [Opelt et al., 2006; Sheikh et al., 2007; Li et al., 2007]. Incremental clustering is particularly efficient in the following scenarios: when data points are obtained sequentially, when the available memory is limited, or when we have large-scale data sets to deal with. Bayesian approaches have been widely used to develop powerful clustering techniques. Bayesian approaches applied to incremental clustering fall basically into two categories, parametric and nonparametric, and allow us to mimic the human learning process, which is based on the iterative accumulation of knowledge. As opposed to parametric approaches, in which a fixed number of parameters is considered, Bayesian nonparametric approaches use an infinite-dimensional parameter space and allow the complexity of the model to grow with the data size. The consideration of an infinite-dimensional parameter space allows the appropriate model complexity to be determined, which is normally referred to as the problem of model selection or model adaptation. This is a crucial issue in clustering since it permits capturing the underlying data structure more precisely and avoiding over- and under-fitting problems. This paper focuses on the nonparametric category since it is better adapted to modern data mining applications (i.e. modern applications generally involve dynamic data sets).

Nowadays, the most popular Bayesian nonparametric formalism is the Dirichlet process (DP) [Neal, 2000; Teh et al., 2004], generally translated into a mixture model with a countably infinite number of components, in which the difficulty of selecting the appropriate number of clusters, which usually occurs in the finite case, is avoided. A common way to learn Dirichlet process models is through Markov chain Monte Carlo (MCMC) techniques. Nevertheless, MCMC approaches have several drawbacks, such as their high computational cost and the difficulty of monitoring convergence. These shortcomings of MCMC approaches can be overcome by adopting an alternative, namely variational inference (or variational Bayes) [Attias, 1999], a deterministic approximation technique that requires only a modest amount of computational power. Variational inference has provided promising performance in many applications involving mixture models [Corduneanu and Bishop, 2001; Constantinopoulos et al., 2006; Fan et al., 2012; 2013]. In our work, we employ an incremental version of variational inference proposed by [Gomes et al., 2008] to learn infinite generalized Dirichlet (GD) mixtures in a setting where data points are supposed to arrive sequentially. The consideration of the GD distribution is motivated by its promising performance when handling non-Gaussian data, and in particular proportional data (which are subject to two restrictions: nonnegativity and unit sum), which arise naturally in several data mining, machine learning, computer vision, and bioinformatics applications [Bouguila and Ziou, 2006; 2007; Boutemedjet et al., 2009]. Examples of applications include the clustering of textual documents (or images), where a given document (or image) is described as a normalized histogram of word (or visual word) frequencies.

The main contributions of this paper are the following: 1) we develop an incremental variational learning algorithm for the infinite GD mixture model, which is much more efficient than the corresponding batch approach when dealing with massive and sequential data; 2) we apply the proposed approach to tackle a challenging real-world problem, namely visual scenes clustering. The effectiveness and merits of our approach are illustrated through extensive simulations. The rest of this paper is organized as follows. Section 2 presents the infinite GD mixture model. The incremental variational inference framework for model learning is described in Section 3. Section 4 is devoted to the experimental results. Finally, the conclusion follows in Section 5.
2 The Infinite GD Mixture Model

Let $\vec{Y} = (Y_1, \ldots, Y_D)$ be a $D$-dimensional random vector drawn from an infinite mixture of GD distributions:

    p(\vec{Y} | \vec{\pi}, \vec{\alpha}, \vec{\beta}) = \sum_{j=1}^{\infty} \pi_j \, \mathrm{GD}(\vec{Y} | \vec{\alpha}_j, \vec{\beta}_j)    (1)

where $\vec{\pi}$ represents the mixing weights, which are positive and sum to one. $\vec{\alpha}_j = (\alpha_{j1}, \ldots, \alpha_{jD})$ and $\vec{\beta}_j = (\beta_{j1}, \ldots, \beta_{jD})$ are the positive parameters of the GD distribution associated with component $j$, while $\mathrm{GD}(\vec{Y} | \vec{\alpha}_j, \vec{\beta}_j)$ is defined as

    \mathrm{GD}(\vec{Y} | \vec{\alpha}_j, \vec{\beta}_j) = \prod_{l=1}^{D} \frac{\Gamma(\alpha_{jl} + \beta_{jl})}{\Gamma(\alpha_{jl}) \Gamma(\beta_{jl})} \, Y_l^{\alpha_{jl} - 1} \Big( 1 - \sum_{k=1}^{l} Y_k \Big)^{\gamma_{jl}}    (2)

where $\sum_{l=1}^{D} Y_l < 1$ and $0 < Y_l < 1$ for $l = 1, \ldots, D$, $\gamma_{jl} = \beta_{jl} - \alpha_{j,l+1} - \beta_{j,l+1}$ for $l = 1, \ldots, D-1$, and $\gamma_{jD} = \beta_{jD} - 1$. $\Gamma(\cdot)$ is the gamma function defined by $\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u} \, du$.
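As a concrete illustration of Eq. (2), here is a minimal Python sketch (ours, not part of the original paper) that evaluates the log-density of a single GD component; the function name and the assumption that the parameters arrive as NumPy arrays are our own choices.

```python
import numpy as np
from scipy.special import gammaln  # log-gamma, keeps the normalizing constant stable

def gd_logpdf(y, alpha, beta):
    """Log-density of a D-variate generalized Dirichlet distribution (Eq. 2).

    y     : shape (D,), y_l > 0 and sum(y) < 1
    alpha : shape (D,), positive shape parameters alpha_l
    beta  : shape (D,), positive shape parameters beta_l
    """
    # gamma_l = beta_l - alpha_{l+1} - beta_{l+1} for l < D, and gamma_D = beta_D - 1
    gamma = np.empty_like(alpha)
    gamma[:-1] = beta[:-1] - alpha[1:] - beta[1:]
    gamma[-1] = beta[-1] - 1.0
    log_norm = gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
    tail = 1.0 - np.cumsum(y)          # 1 - sum_{k<=l} y_k, positive under the constraint
    return np.sum(log_norm + (alpha - 1.0) * np.log(y) + gamma * np.log(tail))
```

Working in the log domain avoids the overflow that the products of gamma functions in Eq. (2) would otherwise cause for large parameter values.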
Furthermore, we exploit an interesting and convenient mathematical property of the GD distribution, thoroughly discussed in [Boutemedjet et al., 2009], to transform the original data points into another $D$-dimensional space in which the features are conditionally independent, and rewrite the infinite GD mixture model in the following form:

    p(\vec{X} | \vec{\pi}, \vec{\alpha}, \vec{\beta}) = \sum_{j=1}^{\infty} \pi_j \prod_{l=1}^{D} \mathrm{Beta}(X_l | \alpha_{jl}, \beta_{jl})    (3)

where $X_1 = Y_1$ and $X_l = Y_l / (1 - \sum_{k=1}^{l-1} Y_k)$ for $l > 1$. $\mathrm{Beta}(X_l | \alpha_{jl}, \beta_{jl})$ is a Beta distribution parameterized by $(\alpha_{jl}, \beta_{jl})$.

In this work, we construct the Dirichlet process through a stick-breaking representation [Sethuraman, 1994]. Therefore, the mixing weights $\pi_j$ are constructed by recursively breaking a unit-length stick into an infinite number of pieces as $\pi_j = \lambda_j \prod_{k=1}^{j-1} (1 - \lambda_k)$. $\lambda_j$ is known as the stick-breaking variable and is distributed independently according to $\lambda_j \sim \mathrm{Beta}(1, \xi)$, where $\xi > 0$ is the concentration parameter of the Dirichlet process.

For an observed data set $(\vec{X}_1, \ldots, \vec{X}_N)$, we introduce a set of mixture component assignment variables $\vec{Z} = (Z_1, \ldots, Z_N)$, one for each data point. Each element $Z_i$ of $\vec{Z}$ has an integer value $j$ specifying the component from which $\vec{X}_i$ is drawn. The marginal distribution over $\vec{Z}$ is given by

    p(\vec{Z} | \vec{\lambda}) = \prod_{i=1}^{N} \prod_{j=1}^{\infty} \Big[ \lambda_j \prod_{k=1}^{j-1} (1 - \lambda_k) \Big]^{\mathbb{1}[Z_i = j]}    (4)

where $\mathbb{1}[\cdot]$ is an indicator function that equals 1 when $Z_i = j$ and 0 otherwise. Since our model framework is Bayesian, we need to place prior distributions over the random variables $\vec{\alpha}$ and $\vec{\beta}$. Since the formal conjugate prior for the Beta distribution is intractable, we adopt Gamma priors $\mathcal{G}(\cdot)$ to approximate the conjugate priors of $\vec{\alpha}$ and $\vec{\beta}$ as $p(\vec{\alpha}) = \mathcal{G}(\vec{\alpha} | \vec{u}, \vec{v})$ and $p(\vec{\beta}) = \mathcal{G}(\vec{\beta} | \vec{s}, \vec{t})$, with the assumption that these parameters are statistically independent.
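The change of variables behind Eq. (3) and the truncated stick-breaking construction are easy to express in code. The sketch below is our own illustration (the truncation level M, the example value of xi, and the helper names are not from the paper) and assumes the inputs are NumPy arrays.

```python
import numpy as np

def gd_to_beta_space(Y):
    """Eq. (3) change of variables: X_1 = Y_1, X_l = Y_l / (1 - sum_{k<l} Y_k) for l > 1."""
    Y = np.asarray(Y, dtype=float)
    remaining = 1.0 - np.concatenate(([0.0], np.cumsum(Y)[:-1]))  # 1 - sum_{k<l} Y_k
    return Y / remaining

def stick_breaking_weights(lam):
    """pi_j = lambda_j * prod_{k<j} (1 - lambda_k), with lambda_M = 1 under truncation."""
    lam = np.asarray(lam, dtype=float).copy()
    lam[-1] = 1.0                       # truncation: forces the weights to sum to one
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - lam)[:-1]))
    return lam * leftover

# Example: draw M truncated sticks from Beta(1, xi) and build the mixing weights.
rng = np.random.default_rng(0)
xi, M = 0.1, 15
pi = stick_breaking_weights(rng.beta(1.0, xi, size=M))
assert np.isclose(pi.sum(), 1.0)
```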
3 Model Learning

In our work, we adopt the incremental learning framework proposed in [Gomes et al., 2008] to learn the proposed infinite GD mixture model through variational Bayes. In this algorithm, data points are processed sequentially in small batches, where each batch may contain one data point or a group of data points. The model learning framework involves the following two phases: 1) a model building phase, which infers the optimal mixture model from the currently observed data points; and 2) a compression phase, which estimates to which mixture component groups of data points should be assigned.

3.1 Model Building Phase

For an observed data set $\mathcal{X} = (\vec{X}_1, \ldots, \vec{X}_N)$, we define $\Theta = \{\vec{Z}, \vec{\alpha}, \vec{\beta}, \vec{\lambda}\}$ as the set of unknown random variables. The main target of variational Bayes is to estimate a proper approximation $q(\Theta)$ to the true posterior distribution $p(\Theta | \mathcal{X})$. This problem can be solved by maximizing the free energy $\mathcal{F}(\mathcal{X}, q)$, where $\mathcal{F}(\mathcal{X}, q) = \int q(\Theta) \ln[p(\mathcal{X}, \Theta) / q(\Theta)] \, d\Theta$. In our algorithm, inspired by [Blei and Jordan, 2005], we truncate the variational distribution $q(\Theta)$ at a value $M$, such that $\lambda_M = 1$, $\pi_j = 0$ when $j > M$, and $\sum_{j=1}^{M} \pi_j = 1$, where the truncation level $M$ is a variational parameter that can be freely initialized and is optimized automatically during the learning process [Blei and Jordan, 2005]. In order to achieve tractability, we also assume that the approximated posterior distribution $q(\Theta)$ can be factorized into disjoint tractable factors as $q(\Theta) = [\prod_{i=1}^{N} q(Z_i)] [\prod_{j=1}^{M} \prod_{l=1}^{D} q(\alpha_{jl}) q(\beta_{jl})] [\prod_{j=1}^{M} q(\lambda_j)]$.

By maximizing the free energy $\mathcal{F}(\mathcal{X}, q)$ with respect to each variational factor, we obtain the following update equations for these factors:

    q(\vec{Z}) = \prod_{i=1}^{N} \prod_{j=1}^{M} r_{ij}^{\mathbb{1}[Z_i = j]}, \qquad q(\vec{\alpha}) = \prod_{j=1}^{M} \prod_{l=1}^{D} \mathcal{G}(\alpha_{jl} | u_{jl}^*, v_{jl}^*)    (5)

    q(\vec{\beta}) = \prod_{j=1}^{M} \prod_{l=1}^{D} \mathcal{G}(\beta_{jl} | s_{jl}^*, t_{jl}^*), \qquad q(\vec{\lambda}) = \prod_{j=1}^{M} \mathrm{Beta}(\lambda_j | a_j, b_j)    (6)

where we have defined

    r_{ij} = \frac{\exp(\rho_{ij})}{\sum_{j=1}^{M} \exp(\rho_{ij})}    (7)

    \rho_{ij} = \sum_{l=1}^{D} \big[ \widetilde{R}_{jl} + (\bar{\alpha}_{jl} - 1) \ln X_{il} + (\bar{\beta}_{jl} - 1) \ln(1 - X_{il}) \big] + \langle \ln \lambda_j \rangle + \sum_{k=1}^{j-1} \langle \ln(1 - \lambda_k) \rangle

    u_{jl}^* = u_{jl} + \sum_{i=1}^{N} \langle \mathbb{1}[Z_i = j] \rangle \big[ \Psi(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) - \Psi(\bar{\alpha}_{jl}) + \bar{\beta}_{jl} \Psi'(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) (\langle \ln \beta_{jl} \rangle - \ln \bar{\beta}_{jl}) \big] \bar{\alpha}_{jl}

    s_{jl}^* = s_{jl} + \sum_{i=1}^{N} \langle \mathbb{1}[Z_i = j] \rangle \big[ \Psi(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) - \Psi(\bar{\beta}_{jl}) + \bar{\alpha}_{jl} \Psi'(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) (\langle \ln \alpha_{jl} \rangle - \ln \bar{\alpha}_{jl}) \big] \bar{\beta}_{jl}

    v_{jl}^* = v_{jl} - \sum_{i=1}^{N} \langle \mathbb{1}[Z_i = j] \rangle \ln X_{il}, \qquad t_{jl}^* = t_{jl} - \sum_{i=1}^{N} \langle \mathbb{1}[Z_i = j] \rangle \ln(1 - X_{il})

    a_j = 1 + \sum_{i=1}^{N} \langle \mathbb{1}[Z_i = j] \rangle, \qquad b_j = \xi_j + \sum_{i=1}^{N} \sum_{k=j+1}^{M} \langle \mathbb{1}[Z_i = k] \rangle

where $\Psi(\cdot)$ is the digamma function, $\Psi'(\cdot)$ is its derivative, and $\langle \cdot \rangle$ denotes expectation with respect to $q$. Note that $\widetilde{R}$ is the lower bound of $R = \langle \ln \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \rangle$. Since this expectation is intractable, a second-order Taylor series expansion is applied to find its lower bound. The expected values in the above formulas are given by $\langle \mathbb{1}[Z_i = j] \rangle = r_{ij}$, $\bar{\alpha}_{jl} = \langle \alpha_{jl} \rangle = u_{jl}^* / v_{jl}^*$, $\bar{\beta}_{jl} = \langle \beta_{jl} \rangle = s_{jl}^* / t_{jl}^*$, $\langle \ln \lambda_j \rangle = \Psi(a_j) - \Psi(a_j + b_j)$, $\langle \ln(1 - \lambda_j) \rangle = \Psi(b_j) - \Psi(a_j + b_j)$, $\langle \ln \alpha_{jl} \rangle = \Psi(u_{jl}^*) - \ln v_{jl}^*$, and $\langle \ln \beta_{jl} \rangle = \Psi(s_{jl}^*) - \ln t_{jl}^*$.

After convergence, the currently observed data points are clustered into $M$ groups according to the corresponding responsibilities $r_{ij}$ of Eq. (7). Following [Gomes et al., 2008], these newly formed groups of data points are denoted as "clumps". These clumps are subject to the constraint that all data points $\vec{X}_i$ in a clump $c$ share the same $q(Z_i) \equiv q(Z_c)$, which is a key factor in the following compression phase.
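To make the variational E-step concrete, the following sketch (our illustration, not the authors' code) computes the responsibilities of Eq. (7) from the current variational expectations. The Taylor bound term is assumed to be precomputed per component and dimension, and all array names are ours.

```python
import numpy as np
from scipy.special import digamma

def responsibilities(X, alpha_bar, beta_bar, a, b, R_tilde):
    """Eq. (7): r_ij proportional to exp(rho_ij), normalized over the M components.

    X         : (N, D) transformed data with entries in (0, 1)
    alpha_bar : (M, D) posterior means of alpha
    beta_bar  : (M, D) posterior means of beta
    a, b      : (M,) parameters of q(lambda_j) = Beta(a_j, b_j)
    R_tilde   : (M, D) Taylor lower bound of <ln Gamma(a+b) / (Gamma(a) Gamma(b))>
    """
    ln_lam = digamma(a) - digamma(a + b)          # <ln lambda_j>
    ln_one_minus = digamma(b) - digamma(a + b)    # <ln (1 - lambda_j)>
    stick_term = ln_lam + np.concatenate(([0.0], np.cumsum(ln_one_minus)[:-1]))
    rho = (np.log(X) @ (alpha_bar - 1.0).T        # sum_l (alpha_bar - 1) ln X_il
           + np.log1p(-X) @ (beta_bar - 1.0).T    # sum_l (beta_bar - 1) ln(1 - X_il)
           + R_tilde.sum(axis=1)
           + stick_term)                          # shape (N, M)
    rho -= rho.max(axis=1, keepdims=True)         # stabilize the softmax
    r = np.exp(rho)
    return r / r.sum(axis=1, keepdims=True)
```

The same routine would serve the compression phase below if the rows of X are replaced by the clump averages and the resulting responsibilities are weighted by the scaled clump sizes in the hyperparameter updates.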
3.2 Compression Phase

Within the compression phase, we need to estimate which clumps are likely to belong to the same mixture component while taking future arriving data into consideration. Now assume that we have already observed $N$ data points and that our aim is to make an inference at some target time $T$, where $T \geq N$. We can tackle this problem by scaling the observed data to the target size $T$, which is equivalent to using the variational posterior distribution of the $N$ observed data points as a predictive model of the future data [Gomes et al., 2008]. We then have a modified free energy for the compression phase in the following form:

    \mathcal{F} = \sum_{j=1}^{M} \sum_{l=1}^{D} \Big[ \Big\langle \ln \frac{p(\alpha_{jl} | u_{jl}, v_{jl})}{q(\alpha_{jl})} \Big\rangle + \Big\langle \ln \frac{p(\beta_{jl} | s_{jl}, t_{jl})}{q(\beta_{jl})} \Big\rangle \Big] + \sum_{j=1}^{M} \Big\langle \ln \frac{p(\lambda_j | \xi_j)}{q(\lambda_j)} \Big\rangle + \frac{T}{N} \sum_{c} |n_c| \ln \sum_{j=1}^{M} \exp(\rho_{cj})    (8)

where $|n_c|$ represents the number of data points in clump $c$ and $T/N$ is the data magnification factor. The corresponding update equations for maximizing this free energy function are

    r_{cj} = \frac{\exp(\rho_{cj})}{\sum_{j=1}^{M} \exp(\rho_{cj})}    (9)

    \rho_{cj} = \sum_{l=1}^{D} \big[ \widetilde{R}_{jl} + (\bar{\alpha}_{jl} - 1) \ln \langle X_{cl} \rangle + (\bar{\beta}_{jl} - 1) \ln(1 - \langle X_{cl} \rangle) \big] + \langle \ln \lambda_j \rangle + \sum_{k=1}^{j-1} \langle \ln(1 - \lambda_k) \rangle

    u_{jl}^* = u_{jl} + \frac{T}{N} \sum_{c} |n_c| \, r_{cj} \big[ \Psi(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) - \Psi(\bar{\alpha}_{jl}) + \bar{\beta}_{jl} \Psi'(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) (\langle \ln \beta_{jl} \rangle - \ln \bar{\beta}_{jl}) \big] \bar{\alpha}_{jl}

    s_{jl}^* = s_{jl} + \frac{T}{N} \sum_{c} |n_c| \, r_{cj} \big[ \Psi(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) - \Psi(\bar{\beta}_{jl}) + \bar{\alpha}_{jl} \Psi'(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) (\langle \ln \alpha_{jl} \rangle - \ln \bar{\alpha}_{jl}) \big] \bar{\beta}_{jl}

    v_{jl}^* = v_{jl} - \frac{T}{N} \sum_{c} |n_c| \, r_{cj} \ln \langle X_{cl} \rangle, \qquad t_{jl}^* = t_{jl} - \frac{T}{N} \sum_{c} |n_c| \, r_{cj} \ln(1 - \langle X_{cl} \rangle)

    a_j = 1 + \frac{T}{N} \sum_{c} |n_c| \langle \mathbb{1}[Z_c = j] \rangle, \qquad b_j = \xi_j + \frac{T}{N} \sum_{c} |n_c| \sum_{k=j+1}^{M} \langle \mathbb{1}[Z_c = k] \rangle

where $\langle X_{cl} \rangle$ denotes the average over all data points contained in clump $c$.

The first step of the compression phase is to assign each clump or data point to the component with the highest responsibility $r_{cj}$ calculated from the model building phase as

    I_c = \arg\max_j r_{cj}    (10)

where $\{I_c\}$ denotes to which component the clump (or data point) $c$ belongs in the compression phase. Next, we cycle through each component and split it along its principal component into two subcomponents. Each split is refined by updating Eqs. (9), and after the refinement converges the clumps are hard-assigned to one of the two candidate components. Among all the potential splits, we select the one that results in the largest change in the free energy (Eq. (8)). The splitting process repeats itself until a stopping criterion is met. According to [Gomes et al., 2008], the stopping criterion for the splitting process can be expressed as a limit on the amount of memory required to store the components. In our case, the component memory cost for the mixture model is $M_C = 2 D N_c$, where $2D$ is the number of parameters contained in a $D$-variate GD component and $N_c$ is the number of components. Accordingly, we can define an upper limit $C$ on the component memory cost, and the compression phase stops when $M_C \geq C$. As a result, the computational time and the space requirement are bounded in each learning round. After the compression phase, the currently observed data points are discarded, while the resulting components are treated in the same way as data points in the next round of learning. Our incremental variational inference algorithm for the infinite GD mixture model is summarized in Algorithm 1.

Algorithm 1: Incremental variational learning of the infinite GD mixture model
1: Choose the initial truncation level M.
2: Initialize the values of the hyper-parameters u_jl, v_jl, s_jl, t_jl and xi_j.
3: Initialize the values of r_ij by the K-Means algorithm.
4: while more data are to be observed do
5:   Perform the model building phase through Eqs. (5) and (6).
6:   Initialize the compression phase using Eq. (10).
7:   while M_C < C do
8:     for j = 1 to M do
9:       if evaluated(j) = false then
10:        Split component j and refine this split using Eqs. (9).
11:        Delta F(j) = change in Eq. (8).
12:        evaluated(j) = true.
13:      end if
14:    end for
15:    Split the component j with the largest value of Delta F(j).
16:    M = M + 1.
17:  end while
18:  Discard the currently observed data points.
19:  Save the resulting components for the next learning round.
20: end while
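For completeness, here is a small sketch (again ours, under the notation above) of the clump assignment in Eq. (10) and of the memory-based stopping test that ends the splitting loop; the split and refinement step itself is omitted.

```python
import numpy as np

def assign_clumps(r_cj):
    """Eq. (10): hard-assign each clump to the component with the highest responsibility."""
    return np.argmax(r_cj, axis=1)

def memory_cost(num_components, D):
    """M_C = 2 * D * N_c: each D-variate GD component stores 2D parameters."""
    return 2 * D * num_components

def keep_splitting(num_components, D, C):
    """The splitting loop continues while M_C < C and stops once M_C >= C."""
    return memory_cost(num_components, D) < C
```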
4 Visual Scenes Clustering

In this section, the effectiveness of the proposed incremental infinite GD mixture model (InGDMM) is tested on a challenging real-world application, namely visual scenes clustering. The problem is important since images are being produced at exponentially increasing rates, and it is very challenging due to the difficulty of capturing the variability in appearance and shape of the diverse objects belonging to the same scene while avoiding confusion with objects from different scenes. In our experiments, we initialize the truncation level $M$ to 15. The initial values of the hyperparameters are set as $(u_{jl}, v_{jl}, s_{jl}, t_{jl}, \xi_j) = (1, 0.01, 1, 0.01, 0.1)$, which have been found to be reasonable choices according to our experimental results.

4.1 Database and Experimental Design

In this paper, we test our approach on a challenging and publicly available database known as the OT database, which was introduced by Oliva and Torralba [Oliva and Torralba, 2001] and is available at http://cvcl.mit.edu/database.htm. This database contains 2,688 images of size 256 x 256 pixels and is composed of eight urban and natural scene categories: coast (360 images), forest (328 images), highway (260 images), inside-city (308 images), mountain (374 images), open country (410 images), street (292 images), and tall building (356 images). Figure 1 shows some sample images from the different categories of the OT database.

[Figure 1: Sample images from the OT data set, one per category: coast, forest, highway, inside-city, mountain, open country, street, tall building.]

Our methodology is based on the proposed incremental infinite GD mixture model in conjunction with a bag-of-visual-words representation, and can be summarized as follows. Firstly, we use the Difference-of-Gaussians (DoG) interest point detector to extract 128-dimensional Scale-Invariant Feature Transform (SIFT) descriptors [Lowe, 2004] from each image. Secondly, the K-Means algorithm is adopted to construct a visual vocabulary by quantizing these SIFT vectors into visual words. As a result, each image is represented as a frequency histogram over the visual words. We have tested different sizes of the visual vocabulary in the range $|W| \in [100, 1000]$, and the optimal performance was obtained for $|W| = 750$ according to our experimental results. Then, the Probabilistic Latent Semantic Analysis (pLSA) model [Hofmann, 2001] is applied to the obtained histograms to represent each image by a 55-dimensional proportional vector, where 55 is the number of latent aspects. Finally, the proposed InGDMM is deployed to cluster the images, which are supposed to arrive in a sequential way.
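To illustrate the quantization step of the pipeline above, here is a hedged Python sketch: it assumes SIFT descriptors have already been extracted for each image with a DoG detector, builds a K-Means vocabulary, and produces the normalized (proportional) histograms. The function names, the default vocabulary size, and the use of scikit-learn are our illustrative choices, and the pLSA step is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, vocab_size=750, seed=0):
    """Quantize pooled 128-D SIFT descriptors into a visual vocabulary."""
    pooled = np.vstack(descriptor_sets)            # stack descriptors from all training images
    return KMeans(n_clusters=vocab_size, n_init=4, random_state=seed).fit(pooled)

def image_histogram(descriptors, vocabulary):
    """Represent one image as a normalized histogram over the visual words."""
    words = vocabulary.predict(descriptors)        # nearest visual word for each descriptor
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()                       # unit-sum proportional vector
```

Each histogram, after the pLSA reduction to 55 latent aspects, is a proportional vector of exactly the kind the GD mixture is designed to model.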
4.2 Experimental Results

In our experiments, we randomly divided the OT database into two halves: one for constructing the visual vocabulary and the other for testing. Since our approach is unsupervised, the class labels are not involved in our experiments, except for the evaluation of the clustering results. The entire methodology was repeated 30 times to evaluate the performance. For comparison, we have also applied three other mixture-modeling approaches: the finite GD mixture model (FiGDMM), the infinite Gaussian mixture model (InGMM), and the finite Gaussian mixture model (FiGMM). To make a fair comparison, all of the aforementioned approaches are learned through incremental variational inference. Table 1 shows the average confusion matrix for the OT database obtained by the proposed InGDMM. Table 2 illustrates the average categorization performance of the different approaches on the OT database. As we can see from this table, our approach (InGDMM) provides the best performance, with the highest categorization rate (77.47%) among all the tested approaches. In addition, we can observe that better performance is obtained by the approaches that adopt infinite mixtures (InGDMM and InGMM) than by the corresponding finite mixtures (FiGDMM and FiGMM), which demonstrates the advantage of using infinite mixture models over finite ones. Moreover, according to Table 2, the GD mixture outperforms the Gaussian mixture, which verifies that the GD mixture model has better modeling capability than the Gaussian one for proportional data clustering.

Table 1: Average rounded confusion matrix for the OT database calculated by InGDMM.

                       C     F     H     I     M     O     S     T
  Coast (C)          127    10     4     2     3    31     2     1
  Forest (F)           2   155     1     2     1     3     0     0
  Highway (H)          0     0   122     1     0     3     3     1
  Inside-city (I)      2     4     2   119     3     2    15     7
  Mountain (M)         6    21     4     5   139     9     1     2
  Open country (O)     2    22    19    15     9   131     3     4
  Street (S)           0     1     4     8     5     5   122     1
  Tall building (T)    4     9     7    23     3    19     3   110

Table 2: The average classification accuracy rate (Acc) (%) obtained over 30 runs using different approaches.

  Method    InGDMM   FiGDMM   InGMM   FiGMM
  Acc (%)    77.47    74.25   72.54   70.19

5 Conclusion

In this work, we have presented an incremental nonparametric Bayesian approach for clustering. The proposed approach is based on an infinite GD mixture model within a Dirichlet process framework and is learned using an incremental variational inference framework. Within this framework, the model parameters and the number of mixture components are determined simultaneously. The effectiveness of the proposed approach has been evaluated on a challenging application, namely visual scenes clustering. Future work could be devoted to the application of the proposed algorithm to other data mining tasks involving continually changing or growing volumes of proportional data.

References

[Attias, 1999] H. Attias. A variational Bayes framework for graphical models. In Proc. of Advances in Neural Information Processing Systems (NIPS), pages 209–215, 1999.

[Blei and Jordan, 2005] D.M. Blei and M.I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1:121–144, 2005.

[Bouguila and Ziou, 2006] N. Bouguila and D. Ziou. A hybrid SEM algorithm for high-dimensional unsupervised learning using a finite generalized Dirichlet mixture. IEEE Transactions on Image Processing, 15(9):2657–2668, 2006.

[Bouguila and Ziou, 2007] N. Bouguila and D. Ziou. High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:1716–1731, 2007.

[Boutemedjet et al., 2009] S. Boutemedjet, N. Bouguila, and D. Ziou. A hybrid feature extraction selection approach for high-dimensional non-Gaussian data clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(8):1429–1443, 2009.

[Constantinopoulos et al., 2006] C. Constantinopoulos, M.K. Titsias, and A. Likas. Bayesian feature and model selection for Gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):1013–1018, 2006.

[Corduneanu and Bishop, 2001] A. Corduneanu and C.M. Bishop. Variational Bayesian model selection for mixture distributions. In Proc. of the 8th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 27–34, 2001.

[Fan et al., 2012] W. Fan, N. Bouguila, and D. Ziou. Variational learning for finite Dirichlet mixture models and applications. IEEE Transactions on Neural Networks and Learning Systems, 23(5):762–774, 2012.

[Fan et al., 2013] W. Fan, N. Bouguila, and D. Ziou. Unsupervised hybrid feature extraction selection for high-dimensional non-Gaussian data clustering with variational inference. IEEE Transactions on Knowledge and Data Engineering, 25(7):1670–1685, 2013.

[Gomes et al., 2008] R. Gomes, M. Welling, and P. Perona. Incremental learning of nonparametric Bayesian mixture models. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.

[Hofmann, 2001] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1/2):177–196, 2001.

[Li et al., 2007] L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: Automatic online picture collection via incremental model learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.

[Lowe, 2004] D.G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[Neal, 2000] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

[Oliva and Torralba, 2001] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.

[Opelt et al., 2006] A. Opelt, A. Pinz, and A. Zisserman. Incremental learning of object detectors using a visual shape alphabet. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 3–10, 2006.

[Sethuraman, 1994] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

[Sheikh et al., 2007] Y.A. Sheikh, E.A. Khan, and T. Kanade. Mode-seeking by medoidshifts. In Proc. of the IEEE 11th International Conference on Computer Vision (ICCV), pages 1–8, 2007.

[Teh et al., 2004] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:705–711, 2004.