<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Topic models for automatic image annotation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marouane Ben Haj Ayech</string-name>
          <email>benzartif@yahoo.fr</email>
          <email>marouane.ayech@yahoo.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amiri Hamid</string-name>
          <email>hamidlamiri@yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LSTS laboratory ENIT Tunis</institution>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modern image retrieval systems, which allow users to issue textual queries and perform content-based image retrieval (CBIR), depend greatly on automatic image annotation. Several models, notably unsupervised topic models, have been applied successfully in text analysis and are showing encouraging results in automatic image annotation. In this work, we first describe the basic topic models: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Since these models assume that documents are represented as a “bag of words” in text analysis, we then describe BOV-based image representation, an analogous representation adapted to image annotation. Based on the SIFT technique followed by vector quantization, this representation allows an image to be treated as a “bag of visterms” (BOV). Finally, we describe some advanced topic models used in image annotation: GM-PLSA, GM-LDA and CORR-LDA.</p>
      </abstract>
      <kwd-group>
        <kwd>automatic image annotation</kwd>
        <kwd>topic models</kwd>
        <kwd>CBIR</kwd>
        <kwd>SIFT</kwd>
        <kwd>BOV</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Automatic image annotation, which means the association
of words to whole images [1], has become a crucial part of
information retrieval systems, especially content-based image
retrieval ones. Indeed, users prefer using textual queries when
searching for images. However, retrieval within CBIR relies on
low-level visual features of images, so we run into the
semantic gap problem. To address this problem, many
researchers have tried to implement efficient image annotation
techniques. These approaches must annotate images
automatically in order to handle large image databases.</p>
      <p>Many approaches have been proposed for semantic image
annotation and retrieval and are roughly classified into two
categories: supervised versus unsupervised approaches [9].</p>
      <p>The first class treats the annotation problem as supervised
classification and the words as independent classes. An
important principle of these methods is to measure similarity
at the visual low level and annotate unseen images by
propagating the corresponding words. The most important
works found in the literature are Chang et al., 2003 [5] and
Carneiro et al., 2007 [6].</p>
      <p>The second class treats images and texts as equivalent data.
Thus, these methods apply unsupervised learning over the data in order to
discover correlations between visual features and textual
words. The annotation is then posed as statistical inference, which
treats the words and features of an image as generated by
latent or hidden variables. Various models have adopted this
idea. Mori et al. propose a model that uses co-occurrence
between words and features to predict words for annotating
unseen images. Duygulu et al. propose a translation model
between two languages, one for blobs and another for words,
that translates blobs into words, i.e., it attaches words to new
image regions. Lavrenko et al. propose the continuous-space
relevance model (CRM), in which word probabilities are
estimated using a multinomial distribution and the blob features
using a non-parametric kernel density estimate [9]. Some other
models that associate latent aspects or topics with images are
called topic models, such as LSA, PLSA and LDA, and are
successfully used in text analysis. So, in this work, we describe
them and their extensions designed for image
annotation.</p>
      <p>II. BASIC TOPIC MODELS</p>
      <p>Topic models require that the data representation be based on the
Bag-of-Words model, which implies that spatial relationships
between words are ignored. Thus, the common way is
to represent the data as an observation matrix, noted A, which we
explain in detail later.</p>
      <p>The idea behind these models is to add levels of latent
variables to model aspects or topics. Since they were designed
to be applied in text analysis, we prefer to keep the terminology
used in text analysis, and we then present the analogy with
image annotation in section III. In this section, we are interested
in the following simple models: LSA, PLSA and LDA.</p>
      <p>LSA is a linear algebra-based model and exploits the data, i.e.
the matrix A, from an algebraic perspective, while PLSA and LDA
are statistical models and belong to the class of probabilistic
generative models.</p>
      <p>A. Bag of Words (BOW) model</p>
      <p>This model simplifies data representation: word order is ignored, so the term “home made” has the same probability as “made home”. When applied to a corpus of text, it gives a simple data representation.</p>
      <p>Suppose we have a corpus C that is a collection of documents C = {d1, …, dN}. Each document di consists of a set of words. C is represented by a term-by-document matrix A ∈ ℝ^(N×M), A = (a_ij), i = 1..N, j = 1..M, with a_ij = n(di, wj), where N is the number of documents, M is the vocabulary size and n(di, wj) is the number of occurrences of word wj in document di. The vocabulary V is the set of all possible words in the corpus, V = {w1, …, wM}.</p>
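      <p>As a small illustration, the term-by-document count matrix A can be built as follows (the mini-corpus is a made-up example, not from the paper):</p>
      <preformat>
```python
from collections import Counter

# Hypothetical mini-corpus: each document is treated as a bag of words.
corpus = [
    "home made bread",
    "made bread home bread",
    "image annotation",
]

docs = [d.split() for d in corpus]
vocab = sorted({w for d in docs for w in d})   # vocabulary V = {w1, ..., wM}
index = {w: j for j, w in enumerate(vocab)}

# A[i][j] = n(d_i, w_j): number of occurrences of word w_j in document d_i.
N, M = len(docs), len(vocab)
A = [[0] * M for _ in range(N)]
for i, d in enumerate(docs):
    for w, c in Counter(d).items():
        A[i][index[w]] = c

print(vocab)   # ['annotation', 'bread', 'home', 'image', 'made']
print(A)
```
      </preformat>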
      <p>B. Latent Semantic Analysis (LSA)</p>
      <p>LSA is a Linear Algebra-based model which consists in
decomposing the term-by-document matrix using Singular
value decomposition (SVD) [10].</p>
      <p>A ≅ Uk Sk Vkᵀ (1)</p>
      <p>where Uk ∈ ℝ^(N×k), Sk ∈ ℝ^(k×k), Vk ∈ ℝ^(M×k) and k is the number of retained singular values.</p>
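      <p>A minimal sketch of the rank-k truncation with NumPy (the random matrix stands in for A; the choice of k is an assumption):</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 6, 8, 2                  # documents, vocabulary size, latent dimension
A = rng.random((N, M))             # stand-in for the term-by-document matrix

# Full SVD: A = U S V^T, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation keeps the k largest singular values: A_k = U_k S_k V_k^T.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the Frobenius norm.
print(np.linalg.norm(A - A_k))
```
      </preformat>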
      <p>LSA consists in projecting the original space onto a space of
reduced dimensionality, which allows hidden similarity between
terms to be captured. Unfortunately, LSA lacks a
probabilistic interpretation [1].</p>
      <p>C. Probabilistic Latent Semantic Analysis (PLSA)</p>
      <p>In this model, a document is not represented as a bag of
words but is modeled as a mixture of aspects or topics. Each
topic is represented as a multinomial distribution over words [11].</p>
    </sec>
    <sec id="sec-2">
      <p>This model is based on a conditional independence
assumption: a word wj is independent of the document di it
belongs to, given a latent variable zk.</p>
      <p>P(di, wj) = P(di) P(wj | di) (2)</p>
      <p>P(wj | di) = Σ_{k=1..K} P(wj | zk) P(zk | di) (3)</p>
      <p>Since the number of latent variables is smaller than the
number of words or documents, the zk behave as a bottleneck in
predicting words.</p>
      <p>Model fitting is performed using the EM algorithm, which alternates two steps to ensure maximum likelihood estimation:</p>
      <p>E-step: The conditional distribution P(zk | di, wj) is computed from the previous estimate of the parameters:</p>
      <p>P(zk | di, wj) = P(wj | zk) P(zk | di) / Σ_{l=1..K} P(wj | zl) P(zl | di) (4)</p>
      <p>M-step: The parameters P(wj | zk) and P(zk | di) are updated with the new expected values P(zk | di, wj):</p>
      <p>P(wj | zk) = Σ_{i=1..N} n(di, wj) P(zk | di, wj) / Σ_{m=1..M} Σ_{i=1..N} n(di, wm) P(zk | di, wm) (5)</p>
      <p>P(zk | di) = Σ_{j=1..M} n(di, wj) P(zk | di, wj) / Σ_{j=1..M} n(di, wj) (6)</p>
      <p>D. Latent Dirichlet Allocation (LDA)</p>
      <p>Since LDA is based on the combination of two conjugate distributions (Dirichlet and multinomial), it normally has fewer parameters than the PLSA model [4] and consequently reduces overfitting.</p>
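      <p>A compact sketch of the EM iteration (4)–(6) on a random count matrix; the dimensions, initialization and small smoothing constants below are assumptions, not from the paper:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 4, 6, 2                                    # documents, words, latent aspects
n = rng.integers(0, 5, size=(N, M)).astype(float)    # counts n(d_i, w_j)

# Random normalized initialization of the parameters.
P_w_z = rng.random((K, M)); P_w_z /= P_w_z.sum(1, keepdims=True)   # P(w|z)
P_z_d = rng.random((N, K)); P_z_d /= P_z_d.sum(1, keepdims=True)   # P(z|d)

for _ in range(50):
    # E-step (4): P(z|d,w) proportional to P(w|z) P(z|d), normalized over z.
    post = P_z_d[:, :, None] * P_w_z[None, :, :]     # shape N x K x M
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step (5): P(w|z) proportional to sum_i n(d_i, w) P(z|d_i, w).
    P_w_z = (n[:, None, :] * post).sum(axis=0) + 1e-9
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    # M-step (6): P(z|d_i) proportional to sum_j n(d_i, w_j) P(z|d_i, w_j).
    P_z_d = (n[:, None, :] * post).sum(axis=2) + 1e-9
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)

# After fitting, both parameter sets are properly normalized distributions.
```
      </preformat>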
      <p>LDA assumes the following generative process:</p>
      <p>1. Choose N ~ Poisson(ξ)</p>
      <p>2. Choose θ ~ Dir(α)</p>
    </sec>
    <sec id="sec-3">
      <p>p(θ | α) = Γ(Σ_{i=1..K} αi) / (Π_{i=1..K} Γ(αi)) · Π_{i=1..K} θi^(αi − 1)</p>
      <p>3. For each of the N words wn:</p>
      <p>(a) Choose a topic zn ~ Multinomial(θ): p(zn^i = 1 | θ) = θi</p>
      <p>(b) Choose a word wn from p(wn | zn; β), a multinomial probability conditioned on the topic zn: p(wn^j = 1 | zn^i = 1, β) = βij</p>
      <p>where N is the number of words contained in a document and w = (w1, …, wn, …, wN) is the document representation.</p>
      <p>We note that:</p>
      <p>• zn = (zn^1, …, zn^K) is the representation of the nth topic in z, where zn^i = 1 and zn^l = 0 if l ≠ i</p>
      <p>• wn = (wn^1, …, wn^M) is the representation of the nth word in w, where wn^j = 1 and wn^l = 0 if l ≠ j</p>
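      <p>The LDA generative process above can be simulated directly; ξ, α, β and the vocabulary size below are assumed toy values:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 10                 # topics, vocabulary size
alpha = np.ones(K)           # symmetric Dirichlet prior (toy choice)
beta = rng.random((K, M)); beta /= beta.sum(1, keepdims=True)  # p(w|z), one row per topic

N = rng.poisson(8) + 1            # 1. document length N ~ Poisson(xi)
theta = rng.dirichlet(alpha)      # 2. topic proportions theta ~ Dir(alpha)

words = []
for _ in range(N):                # 3. for each word position
    z = rng.choice(K, p=theta)    #    (a) topic z_n ~ Multinomial(theta)
    w = rng.choice(M, p=beta[z])  #    (b) word w_n ~ p(w | z_n; beta)
    words.append(w)

print(words)                      # the generated "document" as word indices
```
      </preformat>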
      <p>• The three-level hierarchical structure of LDA allows many topics to be associated with one document.</p>
      <p>• The words are generated by topics.</p>
      <p>III. BOV-BASED IMAGE REPRESENTATION</p>
      <p>We present now the analogy of this terminology with the image annotation domain.</p>
      <p>Thus, images represent documents and words correspond to
visterms. In fact, an image is divided into regions (visterms).</p>
      <p>Now, we briefly describe the procedure used for image
representation, which allows an image to be represented as a set of
visterms. First, we apply an interest-point detector to each image to
extract characteristic points. Many techniques exist in the
literature, such as the DoG (difference of Gaussians) point
detector. The chosen technique must detect points that are invariant to
some geometric and photometric transformations. Then, we
apply SIFT (Scale Invariant Feature Transform) to obtain local
descriptors from each image. These descriptors are computed
on the regions around each interest point identified by the
detector. Once the descriptors are obtained, and in order to get
a fixed image representation, we quantize all descriptors into a
discrete set of visterms using, for example, k-means. Each
cluster obtained represents a visterm in the image.</p>
      <p>A. Scale Invariant Feature Transform (SIFT)</p>
      <p>SIFT is a method for extracting distinctive and invariant
features from images that can be used to perform reliable
matching between different views of an object or scene [3].
The stages of computation used to generate the set of image
features are:</p>
      <p>• Scale-space extrema detection: the first stage uses
an interest-point detector, the difference of Gaussians
(DoG), to identify potential
interest points that are invariant to scale and
orientation.</p>
      <p>• Keypoint localization: at each candidate location,
a detailed model is fit to determine location and
scale. Keypoints are selected based on measures of
their stability.</p>
      <p>• Orientation assignment: one or more orientations
are assigned to each keypoint location based on
local image gradient directions.</p>
      <p>• Keypoint descriptor: the local image gradients are
measured at the selected scale in the region around each
keypoint. These are transformed into a
representation that allows for significant levels of
local shape distortion and change in illumination.</p>
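      <p>As a rough illustration of the first stage only, a difference-of-Gaussians response can be computed with a separable Gaussian blur in plain NumPy (the scale values are assumptions; this is not Lowe's full scale-space pyramid):</p>
      <preformat>
```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D normalized Gaussian kernel with a 3-sigma radius."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur: convolve rows, then columns."""
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

rng = np.random.default_rng(0)
img = rng.random((64, 64))          # stand-in for a grayscale image

# DoG response: difference of the image blurred at two nearby scales.
sigma = 1.6
dog = blur(img, sigma * 1.26) - blur(img, sigma)
# Candidate interest points are local extrema of this response across scales.
```
      </preformat>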
      <sec id="sec-3-1">
        <title>B. Vectorial Quantization</title>
        <p>Once SIFT has been applied to the collection of images,
we obtain a set of feature vectors corresponding to image
regions.</p>
        <p>We use k-means, an unsupervised quantization technique, and we apply it over the whole feature space. The k-means algorithm yields a number of centroids (their number must be fixed before applying k-means).</p>
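        <p>The quantization step can be sketched with a small k-means (Lloyd's algorithm) in NumPy; the descriptor data and the value of k are made up for illustration:</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
descriptors = rng.random((200, 128))   # stand-in for 128-D SIFT descriptors
k = 8                                   # number of visterms, fixed in advance

# Lloyd's algorithm: alternate nearest-centroid assignment and centroid update.
centroids = descriptors[rng.choice(len(descriptors), k, replace=False)]
for _ in range(20):
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)                 # nearest centroid per descriptor
    for j in range(k):
        if np.any(labels == j):
            centroids[j] = descriptors[labels == j].mean(axis=0)

# Each centroid is one visterm; `labels` maps every descriptor to its visterm.
```
        </preformat>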
      </sec>
    </sec>
    <sec id="sec-4">
      <title/>
      <p>Each centroid is a vector whose length equals the feature space dimension and is representative of a subset of the feature space. These centroids will be the visterms required by the BOV model.</p>
      <sec id="sec-4-1">
        <title>C. Bag-of-visterms</title>
        <p>An image is modeled using the Bag-of-visterms model,
which is a simple model that represents a document as an
orderless set of terms. In the case of images, an image is
therefore represented as an orderless sequence of visual terms,
called visterms.</p>
        <p>Given a collection of images, the first task to perform is to identify the set of all visterms used at least once in at least one image. This set is called the vocabulary. Although the image is a set, we fix an arbitrary ordering for it so we can refer to visterm 1 through visterm M, where M is the size of the vocabulary. Once the vocabulary has been fixed, each image is represented as a vector with integer entries of length M. If this vector is d, then its jth component dj is the number of appearances of visterm j in the image. The length of the image is the sum of these components.</p>
        <p>As seen above, the collection of images is represented as an N-by-M matrix, where each row describes an image and each column corresponds to a visterm.</p>
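        <p>Given the visterm labels of one image's descriptors, its BOV vector d is just a histogram; a sketch with toy labels (the label values are made up):</p>
        <preformat>
```python
import numpy as np

M = 5                                  # vocabulary size (number of visterms)
# Hypothetical visterm labels of one image's descriptors after quantization.
labels = np.array([0, 2, 2, 4, 0, 2])

# d_j = number of appearances of visterm j in the image.
d = np.bincount(labels, minlength=M)
print(d)   # [2 0 3 0 1]

# Stacking one such row per image yields the N-by-M observation matrix.
```
        </preformat>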
        <p>IV. TOPIC MODELS FOR IMAGE ANNOTATION</p>
        <p>In this section, we describe three advanced topic models:
Gaussian-Multinomial PLSA, Gaussian-Multinomial LDA and
Correspondence LDA. These three models are based on the basic
topic models seen above, PLSA and LDA. Given that these
basic models are suitable only for modeling one type of data,
the advanced topic models are designed to fit multi-type data.</p>
        <p>A. Gaussian-Multinomial PLSA (GM-PLSA)</p>
        <p>GM-PLSA is a combination of two PLSA models: a
standard PLSA to model textual words and a continuous
PLSA to model visual features. These two models share a
common distribution over latent variable z noted P(z|d).</p>
        <p>The whole model, which is represented in figure 3, assumes the following generative process:</p>
        <p>1. Select a document di with probability P(di)</p>
        <p>2. Choose a latent aspect zk with probability P(zk | di) from a multinomial distribution conditioned on document di</p>
        <p>3. For each of the words, sample wj from a multinomial distribution P(wj | zk) conditioned on the latent aspect zk</p>
        <p>4. For each of the feature vectors, sample fv from a multivariate Gaussian distribution p(f | μk, Σk) conditioned on the latent aspect zk</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <p>B. Gaussian-Multinomial LDA (GM-LDA)</p>
      <p>GM-LDA is a combination of two LDA models: a standard
LDA to model textual words and a continuous LDA to model
visual features. This model, represented in figure 4, shows that
the words wm and regions rn of an image can come from different
topics, which means that the whole document can contain
multiple topics. Furthermore, we can view θ as a high-level
representation of the whole document (image features +
words). The joint distribution of the image regions r, the set of
associated words w and the latent variables θ, z and v is
p(r, w, θ, z, v). GM-LDA assumes the following generative process:</p>
      <p>1. Sample θ ~ Dir(α)</p>
      <p>2. For each of the N image regions:</p>
      <p>a. Sample zn ~ Mult(θ)</p>
      <p>b. Sample a region description rn conditional on zn</p>
      <p>3. For each of the M words:</p>
      <p>a. Sample vm ~ Mult(θ)</p>
      <p>b. Sample a word wm conditional on vm</p>
      <p>C. Correspondence LDA (CORR-LDA)</p>
      <p>In CORR-LDA, the joint distribution of the image regions r, the set of associated words w and the latent variables θ, z and y is p(r, w, θ, z, y). The CORR-LDA model assumes the following generative process:</p>
      <p>1. Sample θ ~ Dir(α)</p>
      <p>2. For each of the N image regions:</p>
      <p>a. Sample zn ~ Mult(θ)</p>
      <p>b. Sample a region description rn from a multivariate Gaussian distribution conditional on zn: rn ~ p(r | zn, μ, σ)</p>
      <p>3. For each of the M words:</p>
      <p>a. Sample ym ~ Unif(1, …, N)</p>
      <p>b. Sample a word wm from a multinomial distribution conditional on ym and z</p>
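      <p>The CORR-LDA generative process can be simulated with toy parameters; every dimension and parameter value below is an assumption for illustration:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D = 3, 12, 2           # topics, word-vocabulary size, region-feature dimension
alpha = np.ones(K)
mu = rng.random((K, D))                                         # per-topic Gaussian means
beta = rng.random((K, V)); beta /= beta.sum(1, keepdims=True)   # p(w|z)

theta = rng.dirichlet(alpha)                                # 1. theta ~ Dir(alpha)

Nr, Mw = 5, 4                                               # image regions, caption words
z = np.array([rng.choice(K, p=theta) for _ in range(Nr)])   # 2a. z_n ~ Mult(theta)
regions = np.array([rng.normal(mu[zn], 0.1) for zn in z])   # 2b. r_n ~ Gaussian(mu_zn, sigma)

words = []
for _ in range(Mw):
    y = rng.integers(Nr)                        # 3a. y_m ~ Unif over the N regions
    words.append(rng.choice(V, p=beta[z[y]]))   # 3b. w_m from the topic of region y_m
```
      </preformat>
      <p>The key design point visible here is the correspondence: each caption word is forced to share its topic with one of the image regions, unlike GM-LDA where words and regions draw topics independently.</p>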
      <p>V. CONCLUSION</p>
      <p>In this paper, we have described topic models. The basic models, PLSA and LDA, are designed for one-type data, while the advanced models are designed for modeling multi-type data, especially for image modeling and annotation.</p>
      <p>REFERENCES</p>
      <p>[5] Chang et al., “CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines,” IEEE Trans. Circ. Systems Video Technol., 13(1), pp. 26–38, 2003.</p>
      <p>[6] Carneiro et al., “Supervised learning of semantic classes for image annotation and retrieval,” IEEE Trans. Pattern Anal. Machine Intell., 29(3), pp. 394–410, 2007.</p>
      <p>[7] S. Tollari, “Image indexing and retrieval by combining textual and …”</p>
      <p>T. Hofmann, “Probabilistic latent semantic indexing,” SIGIR.</p>
      <p>Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134, ACM Press, 2003.</p>
      <p>Marouane Ben Haj Ayech received a degree from ENIT in 2007 and the M.Sc. degree from ENIT in 2009, and is now a PhD student at ENIT. His research interests include machine learning.</p>
      <p>Hamid Amiri received the Diploma of Electrotechnics, Information Technique in 1978 and, in 1983, a further degree at the TU Braunschweig, Germany. He obtained the Doctorates Sciences in 1993. He was a Professor at the National School of Engineer of Tunis (ENIT). Up to 2009, he was at the Riyadh College of Telecom and Information. Currently, he is again at ENIT. His research is focused on:</p>
      <p>• Image Processing</p>
      <p>• Speech Processing</p>
      <p>• Document Processing</p>
    </sec>
  </body>
  <back>
  </back>
</article>