Visual Scenes Clustering Using Variational Incremental Learning of Infinite Generalized Dirichlet Mixture Models

Wentao Fan
Electrical and Computer Engineering
Concordia University, Canada
wenta_fa@encs.concordia.ca

Nizar Bouguila
Institute for Information Systems Engineering
Concordia University, Canada
nizar.bouguila@concordia.ca

Abstract

In this paper, we develop a clustering approach based on variational incremental learning of a Dirichlet process of generalized Dirichlet (GD) distributions. Our approach is built on nonparametric Bayesian analysis, where the determination of the complexity of the mixture model (i.e. the number of components) is sidestepped by assuming an infinite number of mixture components. By leveraging an incremental variational inference algorithm, the model complexity and all of the involved model parameters are estimated simultaneously and effectively in a single optimization framework. Moreover, thanks to its incremental nature and Bayesian roots, the proposed framework avoids over- and under-fitting problems and offers good generalization capabilities. The effectiveness of the proposed approach is tested on a challenging application involving visual scenes clustering.

1 Introduction

Incremental clustering plays a crucial role in many data mining and computer vision applications [Opelt et al., 2006; Sheikh et al., 2007; Li et al., 2007]. Incremental clustering is particularly efficient in the following scenarios: when data points are obtained sequentially, when the available memory is limited, or when we have large-scale data sets to deal with. Bayesian approaches have been widely used to develop powerful clustering techniques. Bayesian approaches applied to incremental clustering fall basically into two categories, parametric and nonparametric, and allow us to mimic the human learning process, which is based on the iterative accumulation of knowledge. As opposed to parametric approaches, in which a fixed number of parameters is considered, Bayesian nonparametric approaches use an infinite-dimensional parameter space and allow the complexity of the model to grow with the data size. The consideration of an infinite-dimensional parameter space allows the appropriate model complexity to be determined, which is normally referred to as the problem of model selection or model adaptation. This is a crucial issue in clustering since it permits capturing the underlying data structure more precisely and avoiding over- and under-fitting problems. This paper focuses on the nonparametric category since it is better adapted to modern data mining applications (i.e. modern applications generally involve dynamic data sets).

Nowadays, the most popular Bayesian nonparametric formalism is the Dirichlet process (DP) [Neal, 2000; Teh et al., 2004], generally translated into a mixture model with a countably infinite number of components, in which the difficulty of selecting the appropriate number of clusters, which usually occurs in the finite case, is avoided. A common way to learn Dirichlet process models is through Markov chain Monte Carlo (MCMC) techniques. Nevertheless, MCMC approaches have several drawbacks, such as their high computational cost and the difficulty of monitoring convergence. These shortcomings of MCMC approaches can be overcome by adopting an alternative, namely variational inference (or variational Bayes) [Attias, 1999], a deterministic approximation technique that requires only a modest amount of computational power. Variational inference has provided promising performance in many applications involving mixture models [Corduneanu and Bishop, 2001; Constantinopoulos et al., 2006; Fan et al., 2012; 2013]. In our work, we employ an incremental version of variational inference proposed by [Gomes et al., 2008] to learn infinite generalized Dirichlet (GD) mixtures in a setting where data points are supposed to arrive sequentially. The consideration of the GD distribution is motivated by its promising performance when handling non-Gaussian data, and in particular proportional data (which are subject to two restrictions: nonnegativity and unit sum), which arise naturally in several data mining, machine learning, computer vision, and bioinformatics applications [Bouguila and Ziou, 2006; 2007; Boutemedjet et al., 2009]. Examples of applications include the clustering of textual documents (or images), where a given document (or image) is described as a normalized histogram of word (or visual word) frequencies.

The main contributions of this paper are the following: 1) we develop an incremental variational learning algorithm for the infinite GD mixture model, which is much more efficient than the corresponding batch approach when dealing with massive and sequential data; 2) we apply the proposed approach to tackle a challenging real-world problem, namely visual scenes clustering. The effectiveness and merits of our approach are illustrated through extensive simulations. The rest of this paper is organized as follows. Section 2 presents the infinite GD mixture model. The incremental variational inference framework for model learning is described in Section 3. Section 4 is devoted to the experimental results. Finally, the conclusion follows in Section 5.
2 The Infinite GD Mixture Model

Let $\vec{Y} = (Y_1, \ldots, Y_D)$ be a $D$-dimensional random vector drawn from an infinite mixture of GD distributions:

    p(\vec{Y} | \vec{\pi}, \vec{\alpha}, \vec{\beta}) = \sum_{j=1}^{\infty} \pi_j \, \mathrm{GD}(\vec{Y} | \vec{\alpha}_j, \vec{\beta}_j)    (1)

where $\vec{\pi}$ represents the mixing weights, which are positive and sum to one. $\vec{\alpha}_j = (\alpha_{j1}, \ldots, \alpha_{jD})$ and $\vec{\beta}_j = (\beta_{j1}, \ldots, \beta_{jD})$ are the positive parameters of the GD distribution associated with component $j$, while $\mathrm{GD}(\vec{Y} | \vec{\alpha}_j, \vec{\beta}_j)$ is defined as

    \mathrm{GD}(\vec{Y} | \vec{\alpha}_j, \vec{\beta}_j) = \prod_{l=1}^{D} \frac{\Gamma(\alpha_{jl} + \beta_{jl})}{\Gamma(\alpha_{jl}) \Gamma(\beta_{jl})} \, Y_l^{\alpha_{jl} - 1} \Big( 1 - \sum_{k=1}^{l} Y_k \Big)^{\gamma_{jl}}    (2)

where $\sum_{l=1}^{D} Y_l < 1$ and $0 < Y_l < 1$ for $l = 1, \ldots, D$, $\gamma_{jl} = \beta_{jl} - \alpha_{j,l+1} - \beta_{j,l+1}$ for $l = 1, \ldots, D-1$, and $\gamma_{jD} = \beta_{jD} - 1$. $\Gamma(\cdot)$ is the gamma function defined by $\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u} \, du$.
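As a concrete illustration of Eq. (2), here is a minimal Python sketch (ours, not part of the original paper) that evaluates the log-density of a single GD component; the function name and the assumption that the parameters arrive as NumPy arrays are our own choices.

```python
import numpy as np
from scipy.special import gammaln  # log-gamma, keeps the normalizing constant stable

def gd_logpdf(y, alpha, beta):
    """Log-density of a D-variate generalized Dirichlet distribution (Eq. 2).

    y     : shape (D,), y_l > 0 and sum(y) < 1
    alpha : shape (D,), positive shape parameters alpha_l
    beta  : shape (D,), positive shape parameters beta_l
    """
    # gamma_l = beta_l - alpha_{l+1} - beta_{l+1} for l < D, and gamma_D = beta_D - 1
    gamma = np.empty_like(alpha)
    gamma[:-1] = beta[:-1] - alpha[1:] - beta[1:]
    gamma[-1] = beta[-1] - 1.0
    log_norm = gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
    tail = 1.0 - np.cumsum(y)          # 1 - sum_{k<=l} y_k, positive under the constraint
    return np.sum(log_norm + (alpha - 1.0) * np.log(y) + gamma * np.log(tail))
```

Working in the log domain avoids the overflow that the products of gamma functions in Eq. (2) would otherwise cause for large parameter values.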
Furthermore, we exploit an interesting and convenient mathematical property of the GD distribution, thoroughly discussed in [Boutemedjet et al., 2009], to transform the original data points into another $D$-dimensional space in which the features are conditionally independent, and rewrite the infinite GD mixture model in the following form:

    p(\vec{X} | \vec{\pi}, \vec{\alpha}, \vec{\beta}) = \sum_{j=1}^{\infty} \pi_j \prod_{l=1}^{D} \mathrm{Beta}(X_l | \alpha_{jl}, \beta_{jl})    (3)

where $X_1 = Y_1$ and $X_l = Y_l / (1 - \sum_{k=1}^{l-1} Y_k)$ for $l > 1$. $\mathrm{Beta}(X_l | \alpha_{jl}, \beta_{jl})$ is a Beta distribution parameterized by $(\alpha_{jl}, \beta_{jl})$.

In this work, we construct the Dirichlet process through a stick-breaking representation [Sethuraman, 1994]. Therefore, the mixing weights $\pi_j$ are constructed by recursively breaking a unit-length stick into an infinite number of pieces as $\pi_j = \lambda_j \prod_{k=1}^{j-1} (1 - \lambda_k)$. $\lambda_j$ is known as the stick-breaking variable and is distributed independently according to $\lambda_j \sim \mathrm{Beta}(1, \xi)$, where $\xi > 0$ is the concentration parameter of the Dirichlet process.

For an observed data set $(\vec{X}_1, \ldots, \vec{X}_N)$, we introduce a set of mixture component assignment variables $\vec{Z} = (Z_1, \ldots, Z_N)$, one for each data point. Each element $Z_i$ of $\vec{Z}$ has an integer value $j$ specifying the component from which $\vec{X}_i$ is drawn. The marginal distribution over $\vec{Z}$ is given by

    p(\vec{Z} | \vec{\lambda}) = \prod_{i=1}^{N} \prod_{j=1}^{\infty} \Big[ \lambda_j \prod_{k=1}^{j-1} (1 - \lambda_k) \Big]^{\mathbb{1}[Z_i = j]}    (4)

where $\mathbb{1}[\cdot]$ is an indicator function that equals 1 when $Z_i = j$ and 0 otherwise. Since our model framework is Bayesian, we need to place prior distributions over the random variables $\vec{\alpha}$ and $\vec{\beta}$. Since the formal conjugate prior for the Beta distribution is intractable, we adopt Gamma priors $\mathcal{G}(\cdot)$ to approximate the conjugate priors of $\vec{\alpha}$ and $\vec{\beta}$ as $p(\vec{\alpha}) = \mathcal{G}(\vec{\alpha} | \vec{u}, \vec{v})$ and $p(\vec{\beta}) = \mathcal{G}(\vec{\beta} | \vec{s}, \vec{t})$, with the assumption that these parameters are statistically independent.
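The change of variables behind Eq. (3) and the truncated stick-breaking construction are easy to express in code. The sketch below is our own illustration (the truncation level M, the example value of xi, and the helper names are not from the paper) and assumes the inputs are NumPy arrays.

```python
import numpy as np

def gd_to_beta_space(Y):
    """Eq. (3) change of variables: X_1 = Y_1, X_l = Y_l / (1 - sum_{k<l} Y_k) for l > 1."""
    Y = np.asarray(Y, dtype=float)
    remaining = 1.0 - np.concatenate(([0.0], np.cumsum(Y)[:-1]))  # 1 - sum_{k<l} Y_k
    return Y / remaining

def stick_breaking_weights(lam):
    """pi_j = lambda_j * prod_{k<j} (1 - lambda_k), with lambda_M = 1 under truncation."""
    lam = np.asarray(lam, dtype=float).copy()
    lam[-1] = 1.0                       # truncation: forces the weights to sum to one
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - lam)[:-1]))
    return lam * leftover

# Example: draw M truncated sticks from Beta(1, xi) and build the mixing weights.
rng = np.random.default_rng(0)
xi, M = 0.1, 15
pi = stick_breaking_weights(rng.beta(1.0, xi, size=M))
assert np.isclose(pi.sum(), 1.0)
```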
3 Model Learning

In our work, we adopt the incremental learning framework proposed in [Gomes et al., 2008] to learn the proposed infinite GD mixture model through variational Bayes. In this algorithm, data points are processed sequentially in small batches, where each batch may contain one data point or a group of data points. The model learning framework involves the following two phases: 1) a model building phase, which infers the optimal mixture model from the currently observed data points; and 2) a compression phase, which estimates to which mixture component groups of data points should be assigned.

3.1 Model Building Phase

For an observed data set $\mathcal{X} = (\vec{X}_1, \ldots, \vec{X}_N)$, we define $\Theta = \{\vec{Z}, \vec{\alpha}, \vec{\beta}, \vec{\lambda}\}$ as the set of unknown random variables. The main target of variational Bayes is to estimate a proper approximation $q(\Theta)$ to the true posterior distribution $p(\Theta | \mathcal{X})$. This problem can be solved by maximizing the free energy $\mathcal{F}(\mathcal{X}, q)$, where $\mathcal{F}(\mathcal{X}, q) = \int q(\Theta) \ln[p(\mathcal{X}, \Theta) / q(\Theta)] \, d\Theta$. In our algorithm, inspired by [Blei and Jordan, 2005], we truncate the variational distribution $q(\Theta)$ at a value $M$, such that $\lambda_M = 1$, $\pi_j = 0$ when $j > M$, and $\sum_{j=1}^{M} \pi_j = 1$, where the truncation level $M$ is a variational parameter that can be freely initialized and is optimized automatically during the learning process [Blei and Jordan, 2005]. In order to achieve tractability, we also assume that the approximated posterior distribution $q(\Theta)$ can be factorized into disjoint tractable factors as $q(\Theta) = [\prod_{i=1}^{N} q(Z_i)] [\prod_{j=1}^{M} \prod_{l=1}^{D} q(\alpha_{jl}) q(\beta_{jl})] [\prod_{j=1}^{M} q(\lambda_j)]$.

By maximizing the free energy $\mathcal{F}(\mathcal{X}, q)$ with respect to each variational factor, we obtain the following update equations for these factors:

    q(\vec{Z}) = \prod_{i=1}^{N} \prod_{j=1}^{M} r_{ij}^{\mathbb{1}[Z_i = j]}, \qquad q(\vec{\alpha}) = \prod_{j=1}^{M} \prod_{l=1}^{D} \mathcal{G}(\alpha_{jl} | u_{jl}^*, v_{jl}^*)    (5)

    q(\vec{\beta}) = \prod_{j=1}^{M} \prod_{l=1}^{D} \mathcal{G}(\beta_{jl} | s_{jl}^*, t_{jl}^*), \qquad q(\vec{\lambda}) = \prod_{j=1}^{M} \mathrm{Beta}(\lambda_j | a_j, b_j)    (6)

where we have defined

    r_{ij} = \frac{\exp(\rho_{ij})}{\sum_{j=1}^{M} \exp(\rho_{ij})}    (7)

    \rho_{ij} = \sum_{l=1}^{D} \big[ \widetilde{R}_{jl} + (\bar{\alpha}_{jl} - 1) \ln X_{il} + (\bar{\beta}_{jl} - 1) \ln(1 - X_{il}) \big] + \langle \ln \lambda_j \rangle + \sum_{k=1}^{j-1} \langle \ln(1 - \lambda_k) \rangle

    u_{jl}^* = u_{jl} + \sum_{i=1}^{N} \langle \mathbb{1}[Z_i = j] \rangle \big[ \Psi(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) - \Psi(\bar{\alpha}_{jl}) + \bar{\beta}_{jl} \Psi'(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) (\langle \ln \beta_{jl} \rangle - \ln \bar{\beta}_{jl}) \big] \bar{\alpha}_{jl}

    s_{jl}^* = s_{jl} + \sum_{i=1}^{N} \langle \mathbb{1}[Z_i = j] \rangle \big[ \Psi(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) - \Psi(\bar{\beta}_{jl}) + \bar{\alpha}_{jl} \Psi'(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) (\langle \ln \alpha_{jl} \rangle - \ln \bar{\alpha}_{jl}) \big] \bar{\beta}_{jl}

    v_{jl}^* = v_{jl} - \sum_{i=1}^{N} \langle \mathbb{1}[Z_i = j] \rangle \ln X_{il}, \qquad t_{jl}^* = t_{jl} - \sum_{i=1}^{N} \langle \mathbb{1}[Z_i = j] \rangle \ln(1 - X_{il})

    a_j = 1 + \sum_{i=1}^{N} \langle \mathbb{1}[Z_i = j] \rangle, \qquad b_j = \xi_j + \sum_{i=1}^{N} \sum_{k=j+1}^{M} \langle \mathbb{1}[Z_i = k] \rangle

where $\Psi(\cdot)$ is the digamma function, $\Psi'(\cdot)$ is its derivative, and $\langle \cdot \rangle$ denotes expectation with respect to $q$. Note that $\widetilde{R}$ is the lower bound of $R = \langle \ln \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \rangle$. Since this expectation is intractable, a second-order Taylor series expansion is applied to find its lower bound. The expected values in the above formulas are given by $\langle \mathbb{1}[Z_i = j] \rangle = r_{ij}$, $\bar{\alpha}_{jl} = \langle \alpha_{jl} \rangle = u_{jl}^* / v_{jl}^*$, $\bar{\beta}_{jl} = \langle \beta_{jl} \rangle = s_{jl}^* / t_{jl}^*$, $\langle \ln \lambda_j \rangle = \Psi(a_j) - \Psi(a_j + b_j)$, $\langle \ln(1 - \lambda_j) \rangle = \Psi(b_j) - \Psi(a_j + b_j)$, $\langle \ln \alpha_{jl} \rangle = \Psi(u_{jl}^*) - \ln v_{jl}^*$, and $\langle \ln \beta_{jl} \rangle = \Psi(s_{jl}^*) - \ln t_{jl}^*$.

After convergence, the currently observed data points are clustered into $M$ groups according to the corresponding responsibilities $r_{ij}$ of Eq. (7). Following [Gomes et al., 2008], these newly formed groups of data points are denoted as "clumps". These clumps are subject to the constraint that all data points $\vec{X}_i$ in a clump $c$ share the same $q(Z_i) \equiv q(Z_c)$, which is a key factor in the following compression phase.
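To make the variational E-step concrete, the following sketch (our illustration, not the authors' code) computes the responsibilities of Eq. (7) from the current variational expectations. The Taylor bound term is assumed to be precomputed per component and dimension, and all array names are ours.

```python
import numpy as np
from scipy.special import digamma

def responsibilities(X, alpha_bar, beta_bar, a, b, R_tilde):
    """Eq. (7): r_ij proportional to exp(rho_ij), normalized over the M components.

    X         : (N, D) transformed data with entries in (0, 1)
    alpha_bar : (M, D) posterior means of alpha
    beta_bar  : (M, D) posterior means of beta
    a, b      : (M,) parameters of q(lambda_j) = Beta(a_j, b_j)
    R_tilde   : (M, D) Taylor lower bound of <ln Gamma(a+b) / (Gamma(a) Gamma(b))>
    """
    ln_lam = digamma(a) - digamma(a + b)          # <ln lambda_j>
    ln_one_minus = digamma(b) - digamma(a + b)    # <ln (1 - lambda_j)>
    stick_term = ln_lam + np.concatenate(([0.0], np.cumsum(ln_one_minus)[:-1]))
    rho = (np.log(X) @ (alpha_bar - 1.0).T        # sum_l (alpha_bar - 1) ln X_il
           + np.log1p(-X) @ (beta_bar - 1.0).T    # sum_l (beta_bar - 1) ln(1 - X_il)
           + R_tilde.sum(axis=1)
           + stick_term)                          # shape (N, M)
    rho -= rho.max(axis=1, keepdims=True)         # stabilize the softmax
    r = np.exp(rho)
    return r / r.sum(axis=1, keepdims=True)
```

The same routine would serve the compression phase below if the rows of X are replaced by the clump averages and the resulting responsibilities are weighted by the scaled clump sizes in the hyperparameter updates.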
3.2 Compression Phase

Within the compression phase, we need to estimate which clumps are likely to belong to the same mixture component while taking future arriving data into consideration. Now assume that we have already observed $N$ data points and that our aim is to make an inference at some target time $T$, where $T \geq N$. We can tackle this problem by scaling the observed data to the target size $T$, which is equivalent to using the variational posterior distribution of the $N$ observed data points as a predictive model of the future data [Gomes et al., 2008]. We then have a modified free energy for the compression phase in the following form:

    \mathcal{F} = \sum_{j=1}^{M} \sum_{l=1}^{D} \Big[ \Big\langle \ln \frac{p(\alpha_{jl} | u_{jl}, v_{jl})}{q(\alpha_{jl})} \Big\rangle + \Big\langle \ln \frac{p(\beta_{jl} | s_{jl}, t_{jl})}{q(\beta_{jl})} \Big\rangle \Big] + \sum_{j=1}^{M} \Big\langle \ln \frac{p(\lambda_j | \xi_j)}{q(\lambda_j)} \Big\rangle + \frac{T}{N} \sum_{c} |n_c| \ln \sum_{j=1}^{M} \exp(\rho_{cj})    (8)

where $|n_c|$ represents the number of data points in clump $c$ and $T/N$ is the data magnification factor. The corresponding update equations for maximizing this free energy function are

    r_{cj} = \frac{\exp(\rho_{cj})}{\sum_{j=1}^{M} \exp(\rho_{cj})}    (9)

    \rho_{cj} = \sum_{l=1}^{D} \big[ \widetilde{R}_{jl} + (\bar{\alpha}_{jl} - 1) \ln \langle X_{cl} \rangle + (\bar{\beta}_{jl} - 1) \ln(1 - \langle X_{cl} \rangle) \big] + \langle \ln \lambda_j \rangle + \sum_{k=1}^{j-1} \langle \ln(1 - \lambda_k) \rangle

    u_{jl}^* = u_{jl} + \frac{T}{N} \sum_{c} |n_c| \, r_{cj} \big[ \Psi(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) - \Psi(\bar{\alpha}_{jl}) + \bar{\beta}_{jl} \Psi'(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) (\langle \ln \beta_{jl} \rangle - \ln \bar{\beta}_{jl}) \big] \bar{\alpha}_{jl}

    s_{jl}^* = s_{jl} + \frac{T}{N} \sum_{c} |n_c| \, r_{cj} \big[ \Psi(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) - \Psi(\bar{\beta}_{jl}) + \bar{\alpha}_{jl} \Psi'(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) (\langle \ln \alpha_{jl} \rangle - \ln \bar{\alpha}_{jl}) \big] \bar{\beta}_{jl}

    v_{jl}^* = v_{jl} - \frac{T}{N} \sum_{c} |n_c| \, r_{cj} \ln \langle X_{cl} \rangle, \qquad t_{jl}^* = t_{jl} - \frac{T}{N} \sum_{c} |n_c| \, r_{cj} \ln(1 - \langle X_{cl} \rangle)

    a_j = 1 + \frac{T}{N} \sum_{c} |n_c| \langle \mathbb{1}[Z_c = j] \rangle, \qquad b_j = \xi_j + \frac{T}{N} \sum_{c} |n_c| \sum_{k=j+1}^{M} \langle \mathbb{1}[Z_c = k] \rangle

where $\langle X_{cl} \rangle$ denotes the average over all data points contained in clump $c$.

The first step of the compression phase is to assign each clump or data point to the component with the highest responsibility $r_{cj}$ calculated from the model building phase as

    I_c = \arg\max_j r_{cj}    (10)

where $\{I_c\}$ denotes to which component the clump (or data point) $c$ belongs in the compression phase. Next, we cycle through each component and split it along its principal component into two subcomponents. Each split is refined by updating Eqs. (9), and after the refinement converges the clumps are hard-assigned to one of the two candidate components. Among all the potential splits, we select the one that results in the largest change in the free energy (Eq. (8)). The splitting process repeats itself until a stopping criterion is met. According to [Gomes et al., 2008], the stopping criterion for the splitting process can be expressed as a limit on the amount of memory required to store the components. In our case, the component memory cost for the mixture model is $M_C = 2 D N_c$, where $2D$ is the number of parameters contained in a $D$-variate GD component and $N_c$ is the number of components. Accordingly, we can define an upper limit $C$ on the component memory cost, and the compression phase stops when $M_C \geq C$. As a result, the computational time and the space requirement are bounded in each learning round. After the compression phase, the currently observed data points are discarded, while the resulting components are treated in the same way as data points in the next round of learning. Our incremental variational inference algorithm for the infinite GD mixture model is summarized in Algorithm 1.

Algorithm 1: Incremental variational learning of the infinite GD mixture model
1: Choose the initial truncation level M.
2: Initialize the values of the hyper-parameters u_jl, v_jl, s_jl, t_jl and xi_j.
3: Initialize the values of r_ij by the K-Means algorithm.
4: while more data are to be observed do
5:   Perform the model building phase through Eqs. (5) and (6).
6:   Initialize the compression phase using Eq. (10).
7:   while M_C < C do
8:     for j = 1 to M do
9:       if evaluated(j) = false then
10:        Split component j and refine this split using Eqs. (9).
11:        Delta F(j) = change in Eq. (8).
12:        evaluated(j) = true.
13:      end if
14:    end for
15:    Split the component j with the largest value of Delta F(j).
16:    M = M + 1.
17:  end while
18:  Discard the currently observed data points.
19:  Save the resulting components for the next learning round.
20: end while
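For completeness, here is a small sketch (again ours, under the notation above) of the clump assignment in Eq. (10) and of the memory-based stopping test that ends the splitting loop; the split and refinement step itself is omitted.

```python
import numpy as np

def assign_clumps(r_cj):
    """Eq. (10): hard-assign each clump to the component with the highest responsibility."""
    return np.argmax(r_cj, axis=1)

def memory_cost(num_components, D):
    """M_C = 2 * D * N_c: each D-variate GD component stores 2D parameters."""
    return 2 * D * num_components

def keep_splitting(num_components, D, C):
    """The splitting loop continues while M_C < C and stops once M_C >= C."""
    return memory_cost(num_components, D) < C
```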
4 Visual Scenes Clustering

In this section, the effectiveness of the proposed incremental infinite GD mixture model (InGDMM) is tested on a challenging real-world application, namely visual scenes clustering. The problem is important since images are being produced at exponentially increasing rates, and it is very challenging due to the difficulty of capturing the variability in appearance and shape of the diverse objects belonging to the same scene while avoiding confusion with objects from different scenes. In our experiments, we initialize the truncation level $M$ to 15. The initial values of the hyperparameters are set as $(u_{jl}, v_{jl}, s_{jl}, t_{jl}, \xi_j) = (1, 0.01, 1, 0.01, 0.1)$, which have been found to be reasonable choices according to our experimental results.

4.1 Database and Experimental Design

In this paper, we test our approach on a challenging and publicly available database known as the OT database, which was introduced by Oliva and Torralba [Oliva and Torralba, 2001] and is available at http://cvcl.mit.edu/database.htm. This database contains 2,688 images of size 256 x 256 pixels and is composed of eight urban and natural scene categories: coast (360 images), forest (328 images), highway (260 images), inside-city (308 images), mountain (374 images), open country (410 images), street (292 images), and tall building (356 images). Figure 1 shows some sample images from the different categories of the OT database.

[Figure 1: Sample images from the OT data set, one per category: coast, forest, highway, inside-city, mountain, open country, street, tall building.]

Our methodology is based on the proposed incremental infinite GD mixture model in conjunction with a bag-of-visual-words representation, and can be summarized as follows. Firstly, we use the Difference-of-Gaussians (DoG) interest point detector to extract 128-dimensional Scale-Invariant Feature Transform (SIFT) descriptors [Lowe, 2004] from each image. Secondly, the K-Means algorithm is adopted to construct a visual vocabulary by quantizing these SIFT vectors into visual words. As a result, each image is represented as a frequency histogram over the visual words. We have tested different sizes of the visual vocabulary in the range $|W| \in [100, 1000]$, and the optimal performance was obtained for $|W| = 750$ according to our experimental results. Then, the Probabilistic Latent Semantic Analysis (pLSA) model [Hofmann, 2001] is applied to the obtained histograms to represent each image by a 55-dimensional proportional vector, where 55 is the number of latent aspects. Finally, the proposed InGDMM is deployed to cluster the images, which are supposed to arrive in a sequential way.
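To illustrate the quantization step of the pipeline above, here is a hedged Python sketch: it assumes SIFT descriptors have already been extracted for each image with a DoG detector, builds a K-Means vocabulary, and produces the normalized (proportional) histograms. The function names, the default vocabulary size, and the use of scikit-learn are our illustrative choices, and the pLSA step is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, vocab_size=750, seed=0):
    """Quantize pooled 128-D SIFT descriptors into a visual vocabulary."""
    pooled = np.vstack(descriptor_sets)            # stack descriptors from all training images
    return KMeans(n_clusters=vocab_size, n_init=4, random_state=seed).fit(pooled)

def image_histogram(descriptors, vocabulary):
    """Represent one image as a normalized histogram over the visual words."""
    words = vocabulary.predict(descriptors)        # nearest visual word for each descriptor
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()                       # unit-sum proportional vector
```

Each histogram, after the pLSA reduction to 55 latent aspects, is a proportional vector of exactly the kind the GD mixture is designed to model.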
4.2 Experimental Results

In our experiments, we randomly divided the OT database into two halves: one for constructing the visual vocabulary and the other for testing. Since our approach is unsupervised, the class labels are not involved in our experiments, except for the evaluation of the clustering results. The entire methodology was repeated 30 times to evaluate the performance. For comparison, we have also applied three other mixture-modeling approaches: the finite GD mixture model (FiGDMM), the infinite Gaussian mixture model (InGMM), and the finite Gaussian mixture model (FiGMM). To make a fair comparison, all of the aforementioned approaches are learned through incremental variational inference. Table 1 shows the average confusion matrix for the OT database obtained by the proposed InGDMM. Table 2 illustrates the average categorization performance of the different approaches on the OT database. As we can see from this table, our approach (InGDMM) provides the best performance, with the highest categorization rate (77.47%) among all the tested approaches. In addition, we can observe that better performance is obtained by the approaches that adopt infinite mixtures (InGDMM and InGMM) than by the corresponding finite mixtures (FiGDMM and FiGMM), which demonstrates the advantage of using infinite mixture models over finite ones. Moreover, according to Table 2, the GD mixture outperforms the Gaussian mixture, which verifies that the GD mixture model has better modeling capability than the Gaussian one for proportional data clustering.

Table 1: Average rounded confusion matrix for the OT database calculated by InGDMM.

                       C     F     H     I     M     O     S     T
  Coast (C)          127    10     4     2     3    31     2     1
  Forest (F)           2   155     1     2     1     3     0     0
  Highway (H)          0     0   122     1     0     3     3     1
  Inside-city (I)      2     4     2   119     3     2    15     7
  Mountain (M)         6    21     4     5   139     9     1     2
  Open country (O)     2    22    19    15     9   131     3     4
  Street (S)           0     1     4     8     5     5   122     1
  Tall building (T)    4     9     7    23     3    19     3   110

Table 2: The average classification accuracy rate (Acc) (%) obtained over 30 runs using different approaches.

  Method    InGDMM   FiGDMM   InGMM   FiGMM
  Acc (%)    77.47    74.25   72.54   70.19

5 Conclusion

In this work, we have presented an incremental nonparametric Bayesian approach for clustering. The proposed approach is based on an infinite GD mixture model within a Dirichlet process framework and is learned using an incremental variational inference framework. Within this framework, the model parameters and the number of mixture components are determined simultaneously. The effectiveness of the proposed approach has been evaluated on a challenging application, namely visual scenes clustering. Future work could be devoted to the application of the proposed algorithm to other data mining tasks involving continually changing or growing volumes of proportional data.

References

[Attias, 1999] H. Attias. A variational Bayes framework for graphical models. In Proc. of Advances in Neural Information Processing Systems (NIPS), pages 209–215, 1999.

[Blei and Jordan, 2005] D.M. Blei and M.I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1:121–144, 2005.

[Bouguila and Ziou, 2006] N. Bouguila and D. Ziou. A hybrid SEM algorithm for high-dimensional unsupervised learning using a finite generalized Dirichlet mixture. IEEE Transactions on Image Processing, 15(9):2657–2668, 2006.

[Bouguila and Ziou, 2007] N. Bouguila and D. Ziou. High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:1716–1731, 2007.

[Boutemedjet et al., 2009] S. Boutemedjet, N. Bouguila, and D. Ziou. A hybrid feature extraction selection approach for high-dimensional non-Gaussian data clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(8):1429–1443, 2009.

[Constantinopoulos et al., 2006] C. Constantinopoulos, M.K. Titsias, and A. Likas. Bayesian feature and model selection for Gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):1013–1018, 2006.

[Corduneanu and Bishop, 2001] A. Corduneanu and C.M. Bishop. Variational Bayesian model selection for mixture distributions. In Proc. of the 8th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 27–34, 2001.

[Fan et al., 2012] W. Fan, N. Bouguila, and D. Ziou. Variational learning for finite Dirichlet mixture models and applications. IEEE Transactions on Neural Networks and Learning Systems, 23(5):762–774, 2012.

[Fan et al., 2013] W. Fan, N. Bouguila, and D. Ziou. Unsupervised hybrid feature extraction selection for high-dimensional non-Gaussian data clustering with variational inference. IEEE Transactions on Knowledge and Data Engineering, 25(7):1670–1685, 2013.

[Gomes et al., 2008] R. Gomes, M. Welling, and P. Perona. Incremental learning of nonparametric Bayesian mixture models. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.

[Hofmann, 2001] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1/2):177–196, 2001.

[Li et al., 2007] L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: Automatic online picture collection via incremental model learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.

[Lowe, 2004] D.G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[Neal, 2000] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

[Oliva and Torralba, 2001] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.

[Opelt et al., 2006] A. Opelt, A. Pinz, and A. Zisserman. Incremental learning of object detectors using a visual shape alphabet. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 3–10, 2006.

[Sethuraman, 1994] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

[Sheikh et al., 2007] Y.A. Sheikh, E.A. Khan, and T. Kanade. Mode-seeking by medoidshifts. In Proc. of the IEEE 11th International Conference on Computer Vision (ICCV), pages 1–8, 2007.

[Teh et al., 2004] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:705–711, 2004.