=Paper=
{{Paper
|id=None
|storemode=property
|title=Topic Models for Automatic Image Annotation
|pdfUrl=https://ceur-ws.org/Vol-845/paper-9.pdf
|volume=Vol-845
}}
==Topic Models for Automatic Image Annotation==
Marouane Ben Haj Ayech, Faouzi Benzarti, Hamid Amiri
LSTS laboratory, ENIT, Tunis, Tunisia
marouane.ayech@yahoo.fr, benzartif@yahoo.fr, hamidlamiri@yahoo.com
Abstract— Modern image retrieval systems, which allow users to submit textual queries and perform content-based image retrieval (CBIR), depend greatly on automatic image annotation. Many models, notably unsupervised topic models, have been applied successfully in text analysis and are showing encouraging results in automatic image annotation. In this work, we first describe the basic topic models: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Since these models assume that documents are represented as a "bag of words" in text analysis, we then describe BOV-based image representation, an analogous representation adapted to image annotation. Based on the SIFT technique followed by vector quantization, this representation allows an image to be treated as a "bag of visterms" (BOV). Finally, we describe some advanced topic models used in image annotation: GM-PLSA, GM-LDA and CORR-LDA.

Keywords— automatic image annotation; topic models; CBIR; SIFT; BOV.

SIDOP’12 : 2nd Workshop on Signal and Document Processing

I. INTRODUCTION

Automatic image annotation, which means the association of words with whole images [1], has become a crucial part of information retrieval systems, especially content-based image retrieval systems. Indeed, users prefer textual queries when searching for images. However, retrieval within CBIR relies on low-level visual features of images, so we run into the semantic gap problem. To address this problem, researchers have tried to implement efficient image annotation techniques; these approaches must annotate images automatically in order to handle large image databases.

Many approaches have been proposed for semantic image annotation and retrieval; they are roughly classified into two categories: supervised versus unsupervised approaches [9].

The first class treats the annotation problem as a supervised classification and the words as independent classes. An important principle of these methods is to measure similarity at the visual low level and to annotate unseen images by propagating the corresponding words. The most important works found in the literature are Chang et al., 2003 [5] and Carneiro et al., 2007 [6].

The second class treats images and texts as equivalent data and applies unsupervised learning over the data in order to discover the correlation between visual features and textual words. Annotation is thus posed as statistical inference, treating an image as a bag of words and features that are both generated by latent (hidden) variables. Various models have adopted this idea. Mori et al. propose a model that uses the co-occurrence between words and features to predict words for annotating unseen images. Duygulu et al. propose a translation model between two languages, one for blobs and another for words, that translates blobs into words, i.e., it attaches words to new image regions. Lavrenko et al. propose the continuous-space relevance model (CRM), in which word probabilities are estimated using a multinomial distribution and the blob features using a non-parametric kernel density estimate [9]. Some other models, which associate latent aspects or topics with images, are called topic models; LSA, PLSA and LDA, for example, have been used successfully in text analysis. In this work, we describe these models and their extensions designed for image annotation.

II. RELATED WORK

Topic models require a data representation based on the Bag-of-Words model, which implies that spatial relationships between words are ignored. Thus, the common approach is to represent the data as an observation matrix, noted A, that we will explain in detail later.

The idea behind these models is to add levels of latent variables to model aspects or topics. Since they were designed to be applied in text analysis, we prefer to keep the terminology used in text analysis; we present the analogy with image annotation in section III. In this section, we are interested in the following simple models: LSA, PLSA and LDA.

LSA is a Linear Algebra-based model and exploits the data, i.e. matrix A, from an algebraic perspective, while PLSA and LDA
are statistical models and belong to the class of probabilistic generative models.

A. Bag-of-Words (BOW) model

This model is also called "Orderless Document Representation". Indeed, it assumes that the order of words has no significance; i.e., the term "home made" has the same probability as "made home". The BOW model simplifies data representation and, when applied to a corpus of text, gives a simple data representation.

Suppose we have a corpus C that is a collection of documents C = {d_1, …, d_N}, where each document d_i consists of a set of words. C is represented by a term-by-document matrix A = (a_{i,j}) ∈ R^{N×M} with a_{i,j} = n(d_i, w_j), where N is the number of documents, M is the vocabulary size and n(d_i, w_j) is the number of occurrences of w_j in document d_i. The vocabulary V is the set of all possible words in the corpus, V = {w_1, …, w_M}.

B. Latent Semantic Analysis (LSA)

LSA is a Linear Algebra-based model which consists in decomposing the term-by-document matrix using singular value decomposition (SVD) [10]:

A ≅ U S V^T    (1)

where A ∈ R^{N×M}, U ∈ R^{N×K}, S ∈ R^{K×K} and V^T ∈ R^{K×M}.

LSA projects the original space onto a space of reduced dimensionality, which allows hidden similarities between terms to be captured. Unfortunately, LSA lacks a probabilistic interpretation [1].

C. Probabilistic Latent Semantic Analysis (PLSA)

In this model, a document is not represented as a bag of words but is modeled as a mixture of aspects or topics, where each topic is represented as a multinomial distribution over words [11]. The model is based on a conditional independence assumption: each observed word w_j is conditionally independent of the document d_i it belongs to, given a latent variable z_k:

P(w_j, d_i) = P(d_i) P(w_j | d_i)    (2)

P(w_j | d_i) = Σ_{k=1}^{K} P(w_j | z_k) P(z_k | d_i)    (3)

Since the number of latent variables is smaller than the number of words or documents, the z_k behave as a bottleneck in predicting words.

Model fitting is performed using the EM algorithm, which alternates two steps to reach the maximum likelihood estimate.

E-step: the conditional distribution P(z_k | d_i, w_j) is computed from the previous estimate of the parameters:

P(z_k | d_i, w_j) = P(z_k | d_i) P(w_j | z_k) / Σ_{l=1}^{K} P(z_l | d_i) P(w_j | z_l)    (4)

M-step: the parameters P(w_j | z_k) and P(z_k | d_i) are updated with the new expected values P(z_k | d_i, w_j):

P(w_j | z_k) = Σ_{i=1}^{N} n(d_i, w_j) P(z_k | d_i, w_j) / Σ_{m=1}^{M} Σ_{i=1}^{N} n(d_i, w_m) P(z_k | d_i, w_m)    (5)

P(z_k | d_i) = Σ_{j=1}^{M} n(d_i, w_j) P(z_k | d_i, w_j) / Σ_{j=1}^{M} n(d_i, w_j)    (6)

Figure 1. The graphical model of PLSA. Nodes inside a given box indicate that they are replicated the number of times indicated in the bottom left corner. Filled circles indicate observed random variables; unfilled circles are unobserved.

The maximum likelihood estimate is obtained by maximizing the objective function:

L = ∏_{i=1}^{N} ∏_{j=1}^{M} P(w_j | d_i)^{n(d_i, w_j)}    (7)

where P(w_j | d_i) is given by (3).

D. Latent Dirichlet Allocation (LDA)

In contrast to PLSA, LDA treats the multinomial weights over topics as latent random variables. Those weights are sampled from a Dirichlet distribution, which is the conjugate prior of the multinomial distribution [8]. Since LDA is a combination of two conjugate distributions, it normally has fewer parameters than the PLSA model [4] and consequently reduces overfitting.

LDA assumes the following generative process:

1. Choose N ~ Poisson(ξ)

2. Choose θ ~ Dir(α):

P(θ | α) = (Γ(Σ_{i=1}^{K} α_i) / ∏_{i=1}^{K} Γ(α_i)) ∏_{i=1}^{K} θ_i^{α_i − 1}

3. For each of the N words w_n:

(a) Choose a topic z_n ~ Multinomial(θ):

P(z_n | θ) = ∏_{l=1}^{K} θ_l^{z_n^l} = θ_i, where z_n^l = 1 if l = i and z_n^l = 0 if l ≠ i

(b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n:

P(w_n | z_n, β) = P(w_n^j = 1 | z_n^i = 1) = β_{ij}

where N is the number of words contained in a document and w = (w_1, …, w_n, …, w_N) is the document representation.
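As a minimal illustrative sketch of the generative process above (not part of the original paper: the topic count K, vocabulary size V, Dirichlet parameters and the topic-word matrix β are arbitrary assumptions), one document can be sampled as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 8                               # topics and vocabulary size (illustrative)
alpha = np.full(K, 0.5)                   # Dirichlet prior over topic weights
beta = rng.dirichlet(np.ones(V), size=K)  # beta[i, j] = P(w^j = 1 | z^i = 1)

def generate_document(xi=20):
    """Sample one document following the LDA generative process."""
    N = max(1, rng.poisson(xi))           # 1. length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)          # 2. topic weights theta ~ Dir(alpha)
    words = []
    for _ in range(N):                    # 3. for each of the N words
        z = rng.choice(K, p=theta)        #    (a) topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])      #    (b) word w_n ~ Multinomial(beta[z_n])
        words.append(w)
    return words

doc = generate_document()
```

Each run first draws the document length and the per-document topic weights θ, then a topic and a word per position, mirroring steps 1 to 3.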
w_n = (w_n^1, …, w_n^j, …, w_n^V) is the representation of the nth word in w, where w_n^j = 1 and w_n^l = 0 if l ≠ j.

z_n = (z_n^1, …, z_n^i, …, z_n^K) is the representation of the nth topic assignment in z, where z_n^i = 1 and z_n^l = 0 if l ≠ i.

We note that:

• The three-level hierarchical structure of LDA allows many concepts to be associated with one document.

• The words are generated by concepts.

Figure 2. The graphical model of LDA

To estimate the parameters of LDA, we must apply inference and estimation procedures to the model. Thus, in the inference step, we need to compute the posterior distribution:

P(θ, z | w, α, β) = P(θ, z, w | α, β) / P(w | α, β)    (8)

where

P(θ, z, w | α, β) = P(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

P(w | α, β) = ∫ P(θ | α) (∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β)) dθ

The posterior distribution is intractable, and the solution is to use variational estimation methods [8].

III. BOV-BASED IMAGE REPRESENTATION

Since aspect models were first used in text analysis, we have used text terminology (document, word, …) so far. We now present the analogy between this terminology and the image annotation domain: images correspond to documents and visterms correspond to words. In fact, an image is divided into regions (visterms). We briefly describe the procedure used for image representation, which represents an image as a set of visterms.

First, we apply an interest-point detector to each image to extract characteristic points. Many techniques exist in the literature, such as the DoG (difference of Gaussians) point detector. The chosen technique must detect points that are invariant to some geometric and photometric transformations. Then, we apply SIFT (Scale Invariant Feature Transform) to obtain local descriptors from each image. These descriptors are computed on the regions around each interest point identified by the detector. Once the descriptors are obtained, and in order to get a fixed-size image representation, we quantize all descriptors into a discrete set of visterms using, for example, k-means. Each cluster obtained represents a visterm in the image.

A. Scale Invariant Feature Transform (SIFT)

SIFT is a method for extracting distinctive and invariant features from images that can be used to perform reliable matching between different views of an object or scene [3]. The stages of computation used to generate the set of image features are:

• Scale-space extrema detection: the first stage uses an interest-point detector, the difference of Gaussians (DoG), to identify potential interest points that are invariant to scale and orientation.

• Keypoint localization: at each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.

• Orientation assignment: one or more orientations are assigned to each keypoint location based on local image gradient directions.

• Keypoint descriptor: the local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

B. Vectorial Quantization

Once SIFT has been applied to the collection of images, we obtain a set of feature vectors corresponding to image regions. We use k-means, an unsupervised quantization technique, and apply it over the whole feature space. The k-means algorithm yields a number of centroids (their number must be fixed before applying k-means). Each centroid is a vector whose length equals the feature space dimension and which is representative of a subset of the feature space. These centroids will be the visterms required by the BOV model.

C. Bag-of-visterms

An image is modeled using the Bag-of-visterms model, a simple model that represents a document as an orderless set of terms. In the case of images, an image is therefore represented as an orderless sequence of visual terms, called visterms.
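The quantization and counting steps just described can be sketched as follows. This is a hedged illustration, not the paper's implementation: random vectors stand in for real SIFT descriptors, and the vocabulary size M, the iteration count and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, M, n_iter=20):
    """Plain k-means: returns M centroids (the visterms) for descriptors X."""
    centroids = X[rng.choice(len(X), size=M, replace=False)]
    for _ in range(n_iter):
        # assign each descriptor to its nearest centroid
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for m in range(M):
            if np.any(labels == m):
                centroids[m] = X[labels == m].mean(axis=0)
    return centroids

def bov_histogram(descriptors, centroids):
    """Quantize one image's descriptors and count visterm occurrences."""
    labels = np.linalg.norm(
        descriptors[:, None] - centroids[None], axis=2).argmin(axis=1)
    d = np.bincount(labels, minlength=len(centroids))
    return d  # d[j] = appearances of visterm j; d.sum() = image length n

# toy stand-in for SIFT output: 200 descriptors of dimension 128
all_descriptors = rng.normal(size=(200, 128))
vocab = kmeans(all_descriptors, M=10)                 # visterm vocabulary of size M
image_vector = bov_histogram(all_descriptors[:60], vocab)
```

The resulting integer vector is exactly the length-M bag-of-visterms representation d described above.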
Given a collection of images, the first task to perform is to identify the set of all visterms used at least once in at least one image. This set is called the vocabulary. Although it is a set, we fix an arbitrary ordering for it so we can refer to visterm 1 through visterm M, where M is the size of the vocabulary. Once the vocabulary has been fixed, each image is represented as a vector of length M with integer entries. If this vector is d, then its jth component d_j is the number of appearances of visterm j in the image. The length of the image is n = Σ_{j=1}^{M} d_j.

As seen above, the collection of images is represented as an N-by-M matrix, where each row describes an image and each column corresponds to a visterm.

IV. TOPIC MODELS FOR IMAGE ANNOTATION

In this section, we describe three advanced topic models: Gaussian-Multinomial PLSA, Gaussian-Multinomial LDA and Correspondence LDA. These three models are based on the basic topic models seen above, PLSA and LDA. Given that these basic models are suitable only for modeling one type of data, the advanced topic models are designed to fit multi-type data.

A. Gaussian-Multinomial PLSA (GM-PLSA)

GM-PLSA is a combination of two PLSA models: a standard PLSA to model textual words and a continuous PLSA to model visual features. These two models share a common distribution over the latent variable z, noted P(z|d).

The whole model, represented in figure 3, assumes the following generative process:

1. Select a document d_i with probability P(d_i)
2. Choose a latent aspect z_k with probability P(z_k|d_i) from a multinomial distribution conditioned on the document d_i
3. For each of the words, sample w_m from a multinomial distribution Mult(θ_k) conditioned on the latent aspect z_k
4. For each of the feature vectors, sample r_n from a multivariate Gaussian distribution N(x|μ_k, σ_k) conditioned on the latent aspect z_k

Figure 3. The graphical model of GM-PLSA

B. Gaussian-Multinomial LDA (GM-LDA)

GM-LDA is a combination of two LDA models: a standard LDA to model textual words and a continuous LDA to model visual features. This model, represented in figure 4, shows that the words w_m and the regions r_n of an image can come from different topics, which means that the whole document can contain multiple topics. Furthermore, we can view θ as a high-level representation of the whole document (image features + words).

Figure 4. The graphical model of GM-LDA

The joint probability distribution of the set of image features r, the set of associated words w and the latent variables θ, z and v is given as follows:

p(r, w, θ, z, v) = p(θ|α) (∏_{n=1}^{N} p(z_n|θ) p(r_n|z_n, μ, σ)) (∏_{m=1}^{M} p(v_m|θ) p(w_m|v_m, β))

The GM-LDA model assumes the following generative process:

1. Sample a Dirichlet random variable θ
2. For each of the N image regions:
   a. Sample z_n ~ Mult(θ)
   b. Sample a region description r_n conditional on z_n
3. For each of the M words:
   a. Sample v_m ~ Mult(θ)
   b. Sample a word w_m conditional on v_m

C. Correspondence LDA (CORR-LDA)

In the CORR-LDA model, image features are generated first and words are generated subsequently. Indeed, N region features are generated. Then, for each of the M words, one of the regions is selected from the image and a corresponding word is drawn conditioned on the topic that generated the selected region (Figure 5).

The joint probability distribution of the set of image features r, the set of associated words w and the latent variables θ, z and y is given as follows:
p(r, w, θ, z, y) = p(θ|α) (∏_{n=1}^{N} p(z_n|θ) p(r_n|z_n, μ, σ)) (∏_{m=1}^{M} p(y_m|N) p(w_m|y_m, z, β))

The CORR-LDA model assumes the following generative process:

1. Sample θ ~ Dir(θ|α)
2. For each of the N image regions:
   a. Sample z_n ~ Mult(θ)
   b. Sample a region description r_n from a multivariate Gaussian distribution conditional on z_n: r_n ~ P(r|z_n, μ, σ)
3. For each of the M words:
   a. Sample y_m ~ Unif(1, …, N)
   b. Sample a word w_m from a multinomial distribution conditional on y_m and z

Figure 5. The graphical model of CORR-LDA

V. CONCLUSION

In this paper, we have described topic models. The basic models, PLSA and LDA, are designed for one-type data, while the advanced models are designed for modeling multi-type data, especially for image modeling and annotation.

REFERENCES

[1] F. Monay and D. Gatica-Perez, "On Image Auto-Annotation with Latent Space Models", ACM Multimedia, 2003, pp. 275-278.

[2] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, "Matching words and pictures", Journal of Machine Learning Research, 3:1107-1135, 2003.

[3] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, 2004.

[4] D. M. Blei, M. I. Jordan, "Modeling annotated data", in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 127-134, ACM Press, 2003.

[5] E. Chang, K. Goh, G. Sychay and G. Wu, "CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines", IEEE Trans. Circuits and Systems for Video Technology, 13(1):26-38, 2003.

[6] G. Carneiro, A. B. Chan, P. J. Moreno and N. Vasconcelos, "Supervised learning of semantic classes for image annotation and retrieval", IEEE Trans. Pattern Analysis and Machine Intelligence, 29(3):394-410, 2007.

[7] S. Tollari, "Image indexing and retrieval by combining textual and visual informations", PhD thesis, Université du Sud Toulon-Var, 2006.

[8] D. M. Blei, A. Y. Ng, M. I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, 3:993-1022, 2003.

[9] Z. Li, Z. Shi, X. Liu, Z. Shi, "Modeling continuous visual features for semantic image annotation and retrieval", 2011.

[10] S. Deerwester, S. Dumais, T. Landauer, G. Furnas and R. Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, 41(6):391-407, 1990.

[11] T. Hofmann, "Probabilistic latent semantic indexing", in Proceedings of the SIGIR Conference, 1999.

BIOGRAPHY

Marouane Ben Haj Ayech received an engineering degree from ENIT in 2007 and the M.Sc. degree from ENIT in 2009. He is now a PhD student at ENIT. His research interests include information retrieval, machine learning and computer vision.

Hamid Amiri received the Diploma of Electrotechnics, Information Technique in 1978 and the PhD degree in 1983 at the TU Braunschweig, Germany. He obtained the Doctorate of Sciences in 1993. He was a Professor at the National School of Engineers of Tunis (ENIT), Tunisia, from 1987 to 2001. From 2001 to 2009 he was at the Riyadh College of Telecom and Information. Currently, he is again at ENIT. His research is focused on:
• Image Processing
• Speech Processing
• Document Processing
• Natural Language Processing