<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Topic models for automatic image annotation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marouane Ben Haj Ayech</string-name>
          <email>benzartif@yahoo.fr</email>
          <email>marouane.ayech@yahoo.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amiri Hamid</string-name>
          <email>hamidlamiri@yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LSTS laboratory ENIT Tunis</institution>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modern image retrieval systems, which allow users to issue textual queries and perform content-based image retrieval (CBIR), depend greatly on automatic image annotation. Several models, notably unsupervised topic models, have been applied successfully in text analysis and are showing encouraging results in automatic image annotation. In this work, we first describe the basic topic models: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Since these models assume that documents are represented as a “bag of words” in text analysis, we then describe BOV-based image representation, an analogous representation adapted to image annotation. Based on the SIFT technique followed by vector quantization, this representation allows an image to be treated as a “bag of visterms” (BOV). Finally, we describe some advanced topic models used in image annotation: GM-PLSA, GM-LDA and CORR-LDA.</p>
      </abstract>
      <kwd-group>
        <kwd>automatic image annotation</kwd>
        <kwd>topic models</kwd>
        <kwd>CBIR</kwd>
        <kwd>SIFT</kwd>
        <kwd>BOV</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Automatic image annotation, which means the association
of words to whole images [1], has become a crucial part of
information retrieval systems, especially content-based image
retrieval ones. Indeed, users prefer using textual queries when
searching for images. However, retrieval within CBIR relies on
low-level visual features of images, so we run into the
semantic gap problem. To address this problem, many
researchers have tried to implement efficient image annotation
techniques. These approaches must annotate images
automatically in order to handle large image databases.</p>
      <p>Many approaches have been proposed for semantic image
annotation and retrieval and are roughly classified into two
categories: supervised versus unsupervised approaches [9].</p>
      <p>The first class treats the annotation problem as supervised
classification and the words as independent classes. An
important principle of these methods is to measure similarity
at the visual low level and annotate unseen images by
propagating the corresponding words. The most important
works found in the literature are Chang et al., 2003 [5] and
Carneiro et al., 2007 [6].</p>
      <p>The second class treats images and texts as equivalent data.
Thus, these methods apply unsupervised learning over the data in order to
discover correlations between visual features and textual
words. The annotation is then posed as statistical inference, which
treats the words and features of an image as generated by
latent or hidden variables. Various models have adopted this
idea. Mori et al. propose a model that uses co-occurrence
between words and features to predict words for annotating
unseen images. Duygulu et al. propose a translation model
between two languages, one for blobs and another for words,
that translates blobs into words, i.e., it attaches words to new
image regions. Lavrenko et al. propose the continuous-space
relevance model (CRM), in which word probabilities are
estimated using a multinomial distribution and the blob features
using a non-parametric kernel density estimate [9]. Some other
models that associate latent aspects or topics with images are
called topic models, such as LSA, PLSA and LDA, and are
successfully used in text analysis. So, in this work, we describe
them and their extensions designed for image
annotation.</p>
      <p>II. BASIC TOPIC MODELS</p>
      <p>Topic models require that the data representation be based on the
Bag-of-Words model, which implies that spatial relationships
between words are ignored. Thus, the common way is
to represent the data as an observation matrix, noted A, which we
explain in detail later.</p>
      <p>The idea behind these models is to add levels of latent
variables to model aspects or topics. Since they were designed
to be applied in text analysis, we prefer to keep the terminology
used in text analysis, and we then present the analogy with
image annotation in section III. In this section, we are interested
in the following simple models: LSA, PLSA and LDA.</p>
      <p>LSA is a linear algebra-based model and exploits the data, i.e.
the matrix A, from an algebraic perspective, while PLSA and LDA
are statistical models and belong to the class of probabilistic
generative models.</p>
      <p>A. Bag of Words (BOW) model</p>
      <p>This model simplifies data representation: word order is ignored, so the term “home made” has the same probability as “made home”. When applied to a corpus of text, it gives a simple data representation.</p>
      <p>Suppose we have a corpus C that is a collection of documents C = {d1, …, dN}. Each document di consists of a set of words. C is represented by a term-by-document matrix A ∈ ℝ^(N×M), A = (a_ij), i = 1..N, j = 1..M, with a_ij = n(di, wj), where N is the number of documents, M is the vocabulary size and n(di, wj) is the number of occurrences of word wj in document di. The vocabulary V is the set of all possible words in the corpus, V = {w1, …, wM}.</p>
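      <p>As a small illustration, the term-by-document count matrix A can be built as follows (the mini-corpus is a made-up example, not from the paper):</p>
      <preformat>
```python
from collections import Counter

# Hypothetical mini-corpus: each document is treated as a bag of words.
corpus = [
    "home made bread",
    "made bread home bread",
    "image annotation",
]

docs = [d.split() for d in corpus]
vocab = sorted({w for d in docs for w in d})   # vocabulary V = {w1, ..., wM}
index = {w: j for j, w in enumerate(vocab)}

# A[i][j] = n(d_i, w_j): number of occurrences of word w_j in document d_i.
N, M = len(docs), len(vocab)
A = [[0] * M for _ in range(N)]
for i, d in enumerate(docs):
    for w, c in Counter(d).items():
        A[i][index[w]] = c

print(vocab)   # ['annotation', 'bread', 'home', 'image', 'made']
print(A)
```
      </preformat>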
      <p>B. Latent Semantic Analysis (LSA)</p>
      <p>LSA is a Linear Algebra-based model which consists in
decomposing the term-by-document matrix using Singular
value decomposition (SVD) [10].</p>
      <p>A ≅ Uk Sk Vkᵀ (1)</p>
      <p>where Uk ∈ ℝ^(N×k), Sk ∈ ℝ^(k×k), Vk ∈ ℝ^(M×k) and k is the number of retained singular values.</p>
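      <p>A minimal sketch of the rank-k truncation with NumPy (the random matrix stands in for A; the choice of k is an assumption):</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 6, 8, 2                  # documents, vocabulary size, latent dimension
A = rng.random((N, M))             # stand-in for the term-by-document matrix

# Full SVD: A = U S V^T, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation keeps the k largest singular values: A_k = U_k S_k V_k^T.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the Frobenius norm.
print(np.linalg.norm(A - A_k))
```
      </preformat>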
      <p>LSA consists in projecting the original space onto a space of
reduced dimensionality, which allows hidden similarity between
terms to be captured. Unfortunately, LSA lacks a
probabilistic interpretation [1].</p>
      <p>C. Probabilistic Latent Semantic Analysis (PLSA)</p>
      <p>In this model, a document is not represented as a bag of
words but is modeled as a mixture of aspects or topics. Each
topic is represented as a multinomial distribution over words [11].</p>
    </sec>
    <sec id="sec-2">
      <p>This model is based on a conditional independence
assumption: a word wj is independent of the document di it
belongs to, given a latent variable zk.</p>
      <p>P(di, wj) = P(di) P(wj | di) (2)</p>
      <p>P(wj | di) = Σ_{k=1..K} P(wj | zk) P(zk | di) (3)</p>
      <p>Since the number of latent variables is smaller than the
number of words or documents, the zk behave as a bottleneck in
predicting words.</p>
      <p>Model fitting is performed using the EM algorithm, which alternates two steps to ensure maximum likelihood estimation:</p>
      <p>E-step: The conditional distribution P(zk | di, wj) is computed from the previous estimate of the parameters:</p>
      <p>P(zk | di, wj) = P(wj | zk) P(zk | di) / Σ_{l=1..K} P(wj | zl) P(zl | di) (4)</p>
      <p>M-step: The parameters P(wj | zk) and P(zk | di) are updated with the new expected values P(zk | di, wj):</p>
      <p>P(wj | zk) = Σ_{i=1..N} n(di, wj) P(zk | di, wj) / Σ_{m=1..M} Σ_{i=1..N} n(di, wm) P(zk | di, wm) (5)</p>
      <p>P(zk | di) = Σ_{j=1..M} n(di, wj) P(zk | di, wj) / Σ_{j=1..M} n(di, wj) (6)</p>
      <p>D. Latent Dirichlet Allocation (LDA)</p>
      <p>Since LDA is based on the combination of two conjugate distributions (Dirichlet and multinomial), it normally has fewer parameters than the PLSA model [4] and consequently reduces overfitting.</p>
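      <p>A compact sketch of the EM iteration (4)–(6) on a random count matrix; the dimensions, initialization and small smoothing constants below are assumptions, not from the paper:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 4, 6, 2                                    # documents, words, latent aspects
n = rng.integers(0, 5, size=(N, M)).astype(float)    # counts n(d_i, w_j)

# Random normalized initialization of the parameters.
P_w_z = rng.random((K, M)); P_w_z /= P_w_z.sum(1, keepdims=True)   # P(w|z)
P_z_d = rng.random((N, K)); P_z_d /= P_z_d.sum(1, keepdims=True)   # P(z|d)

for _ in range(50):
    # E-step (4): P(z|d,w) proportional to P(w|z) P(z|d), normalized over z.
    post = P_z_d[:, :, None] * P_w_z[None, :, :]     # shape N x K x M
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step (5): P(w|z) proportional to sum_i n(d_i, w) P(z|d_i, w).
    P_w_z = (n[:, None, :] * post).sum(axis=0) + 1e-9
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    # M-step (6): P(z|d_i) proportional to sum_j n(d_i, w_j) P(z|d_i, w_j).
    P_z_d = (n[:, None, :] * post).sum(axis=2) + 1e-9
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)

# After fitting, both parameter sets are properly normalized distributions.
```
      </preformat>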
      <p>LDA assumes the following generative process:</p>
      <p>1. Choose N ~ Poisson(ξ)</p>
      <p>2. Choose θ ~ Dir(α)</p>
    </sec>
    <sec id="sec-3">
      <p>p(θ | α) = Γ(Σ_{i=1..K} αi) / (Π_{i=1..K} Γ(αi)) · Π_{i=1..K} θi^(αi − 1)</p>
      <p>3. For each of the N words wn:</p>
      <p>(a) Choose a topic zn ~ Multinomial(θ): p(zn^i = 1 | θ) = θi</p>
      <p>(b) Choose a word wn from p(wn | zn; β), a multinomial probability conditioned on the topic zn: p(wn^j = 1 | zn^i = 1, β) = βij</p>
      <p>where N is the number of words contained in a document and w = (w1, …, wn, …, wN) is the document representation.</p>
      <p>We note that:</p>
      <p>• zn = (zn^1, …, zn^K) is the representation of the nth topic in z, where zn^i = 1 and zn^l = 0 if l ≠ i</p>
      <p>• wn = (wn^1, …, wn^M) is the representation of the nth word in w, where wn^j = 1 and wn^l = 0 if l ≠ j</p>
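      <p>The LDA generative process above can be simulated directly; ξ, α, β and the vocabulary size below are assumed toy values:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 10                 # topics, vocabulary size
alpha = np.ones(K)           # symmetric Dirichlet prior (toy choice)
beta = rng.random((K, M)); beta /= beta.sum(1, keepdims=True)  # p(w|z), one row per topic

N = rng.poisson(8) + 1            # 1. document length N ~ Poisson(xi)
theta = rng.dirichlet(alpha)      # 2. topic proportions theta ~ Dir(alpha)

words = []
for _ in range(N):                # 3. for each word position
    z = rng.choice(K, p=theta)    #    (a) topic z_n ~ Multinomial(theta)
    w = rng.choice(M, p=beta[z])  #    (b) word w_n ~ p(w | z_n; beta)
    words.append(w)

print(words)                      # the generated "document" as word indices
```
      </preformat>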
      <p>• The three-level hierarchical structure of LDA allows many topics to be associated with one document.</p>
      <p>• The words are generated by topics.</p>
      <p>III. BOV-BASED IMAGE REPRESENTATION</p>
      <p>We present now the analogy of this terminology with the image annotation domain.</p>
      <p>Thus, images represent documents and words correspond to
visterms. In fact, an image is divided into regions (visterms).</p>
      <p>Now, we briefly describe the procedure used for image
representation, which allows an image to be represented as a set of
visterms. First, we apply an interest-point detector to each image to
extract characteristic points. Many techniques exist in the
literature, such as the DoG (difference of Gaussians) point
detector. The chosen technique must detect points that are invariant to
some geometric and photometric transformations. Then, we
apply SIFT (Scale Invariant Feature Transform) to obtain local
descriptors from each image. These descriptors are computed
on the regions around each interest point identified by the
detector. Once the descriptors are obtained, and in order to get
a fixed image representation, we quantize all descriptors into a
discrete set of visterms using, for example, k-means. Each
cluster obtained represents a visterm in the image.</p>
      <p>A. Scale Invariant Feature Transform (SIFT)</p>
      <p>SIFT is a method for extracting distinctive and invariant
features from images that can be used to perform reliable
matching between different views of an object or scene [3].
The stages of computation used to generate the set of image
features are:</p>
      <p>• Scale-space extrema detection: the first stage uses
an interest-point detector, the difference of Gaussians
(DoG), to identify potential
interest points that are invariant to scale and
orientation.</p>
      <p>• Keypoint localization: at each candidate location,
a detailed model is fit to determine location and
scale. Keypoints are selected based on measures of
their stability.</p>
      <p>• Orientation assignment: one or more orientations
are assigned to each keypoint location based on
local image gradient directions.</p>
      <p>• Keypoint descriptor: the local image gradients are
measured at the selected scale in the region around each
keypoint. These are transformed into a
representation that allows for significant levels of
local shape distortion and change in illumination.</p>
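      <p>As a rough illustration of the first stage only, a difference-of-Gaussians response can be computed with a separable Gaussian blur in plain NumPy (the scale values are assumptions; this is not Lowe's full scale-space pyramid):</p>
      <preformat>
```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D normalized Gaussian kernel with a 3-sigma radius."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur: convolve rows, then columns."""
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

rng = np.random.default_rng(0)
img = rng.random((64, 64))          # stand-in for a grayscale image

# DoG response: difference of the image blurred at two nearby scales.
sigma = 1.6
dog = blur(img, sigma * 1.26) - blur(img, sigma)
# Candidate interest points are local extrema of this response across scales.
```
      </preformat>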
      <sec id="sec-3-1">
        <title>B. Vectorial Quantization</title>
        <p>Once SIFT has been applied to the collection of images,
we obtain a set of feature vectors corresponding to image
regions.</p>
        <p>We use k-means, an unsupervised quantization technique, and we apply it over the whole feature space. The k-means algorithm yields a number of centroids (their number must be fixed before applying k-means).</p>
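        <p>The quantization step can be sketched with a small k-means (Lloyd's algorithm) in NumPy; the descriptor data and the value of k are made up for illustration:</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
descriptors = rng.random((200, 128))   # stand-in for 128-D SIFT descriptors
k = 8                                   # number of visterms, fixed in advance

# Lloyd's algorithm: alternate nearest-centroid assignment and centroid update.
centroids = descriptors[rng.choice(len(descriptors), k, replace=False)]
for _ in range(20):
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)                 # nearest centroid per descriptor
    for j in range(k):
        if np.any(labels == j):
            centroids[j] = descriptors[labels == j].mean(axis=0)

# Each centroid is one visterm; `labels` maps every descriptor to its visterm.
```
        </preformat>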
      </sec>
    </sec>
    <sec id="sec-4">
      <title/>
      <p>Each centroid is a vector whose length equals the feature space dimension and is representative of a subset of the feature space. These centroids will be the visterms required by the BOV model.</p>
      <sec id="sec-4-1">
        <title>C. Bag-of-visterms</title>
        <p>An image is modeled using the Bag-of-visterms model,
which is a simple model that represents a document as an
orderless set of terms. In the case of images, an image is
therefore represented as an orderless sequence of visual terms,
called visterms.</p>
        <p>Given a collection of images, the first task to perform is to identify the set of all visterms used at least once in at least one image. This set is called the vocabulary. Although the image is a set, we fix an arbitrary ordering for it so we can refer to visterm 1 through visterm M, where M is the size of the vocabulary. Once the vocabulary has been fixed, each image is represented as a vector with integer entries of length M. If this vector is d, then its jth component dj is the number of appearances of visterm j in the image. The length of the image is the sum of these components.</p>
        <p>As seen above, the collection of images is represented as an N-by-M matrix, where each row describes an image and each column corresponds to a visterm.</p>
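        <p>Given the visterm labels of one image's descriptors, its BOV vector d is just a histogram; a sketch with toy labels (the label values are made up):</p>
        <preformat>
```python
import numpy as np

M = 5                                  # vocabulary size (number of visterms)
# Hypothetical visterm labels of one image's descriptors after quantization.
labels = np.array([0, 2, 2, 4, 0, 2])

# d_j = number of appearances of visterm j in the image.
d = np.bincount(labels, minlength=M)
print(d)   # [2 0 3 0 1]

# Stacking one such row per image yields the N-by-M observation matrix.
```
        </preformat>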
        <p>IV. TOPIC MODELS FOR IMAGE ANNOTATION</p>
        <p>In this section, we describe three advanced topic models:
Gaussian-Multinomial PLSA, Gaussian-Multinomial LDA and
Correspondence LDA. These three models are based on the basic
topic models seen above, PLSA and LDA. Given that these
basic models are suitable only for modeling one type of data,
the advanced topic models are designed to fit multi-type data.</p>
        <p>A. Gaussian-Multinomial PLSA (GM-PLSA)</p>
        <p>GM-PLSA is a combination of two PLSA models: a
standard PLSA to model textual words and a continuous
PLSA to model visual features. These two models share a
common distribution over latent variable z noted P(z|d).</p>
        <p>The whole model, which is represented in figure 3, assumes the following generative process:</p>
        <p>1. Select a document di with probability P(di)</p>
        <p>2. Choose a latent aspect zk with probability P(zk | di) from a multinomial distribution conditioned on document di</p>
        <p>3. For each of the words, sample wj from a multinomial distribution P(wj | zk) conditioned on the latent aspect zk</p>
        <p>4. For each of the feature vectors, sample fv from a multivariate Gaussian distribution p(f | μk, Σk) conditioned on the latent aspect zk</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <p>B. Gaussian-Multinomial LDA (GM-LDA)</p>
      <p>GM-LDA is a combination of two LDA models: a standard
LDA to model textual words and a continuous LDA to model
visual features. This model, represented in figure 4, shows that
the words wm and regions rn of an image can come from different
topics, which means that the whole document can contain
multiple topics. Furthermore, we can view θ as a high-level
representation of the whole document (image features +
words). The joint distribution of the image regions r, the set of
associated words w and the latent variables θ, z and v is
p(r, w, θ, z, v). GM-LDA assumes the following generative process:</p>
      <p>1. Sample θ ~ Dir(α)</p>
      <p>2. For each of the N image regions:</p>
      <p>a. Sample zn ~ Mult(θ)</p>
      <p>b. Sample a region description rn conditional on zn</p>
      <p>3. For each of the M words:</p>
      <p>a. Sample vm ~ Mult(θ)</p>
      <p>b. Sample a word wm conditional on vm</p>
      <p>C. Correspondence LDA (CORR-LDA)</p>
      <p>In CORR-LDA, the joint distribution of the image regions r, the set of associated words w and the latent variables θ, z and y is p(r, w, θ, z, y). The CORR-LDA model assumes the following generative process:</p>
      <p>1. Sample θ ~ Dir(α)</p>
      <p>2. For each of the N image regions:</p>
      <p>a. Sample zn ~ Mult(θ)</p>
      <p>b. Sample a region description rn from a multivariate Gaussian distribution conditional on zn: rn ~ p(r | zn, μ, σ)</p>
      <p>3. For each of the M words:</p>
      <p>a. Sample ym ~ Unif(1, …, N)</p>
      <p>b. Sample a word wm from a multinomial distribution conditional on ym and z</p>
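      <p>The CORR-LDA generative process can be simulated with toy parameters; every dimension and parameter value below is an assumption for illustration:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D = 3, 12, 2           # topics, word-vocabulary size, region-feature dimension
alpha = np.ones(K)
mu = rng.random((K, D))                                         # per-topic Gaussian means
beta = rng.random((K, V)); beta /= beta.sum(1, keepdims=True)   # p(w|z)

theta = rng.dirichlet(alpha)                                # 1. theta ~ Dir(alpha)

Nr, Mw = 5, 4                                               # image regions, caption words
z = np.array([rng.choice(K, p=theta) for _ in range(Nr)])   # 2a. z_n ~ Mult(theta)
regions = np.array([rng.normal(mu[zn], 0.1) for zn in z])   # 2b. r_n ~ Gaussian(mu_zn, sigma)

words = []
for _ in range(Mw):
    y = rng.integers(Nr)                        # 3a. y_m ~ Unif over the N regions
    words.append(rng.choice(V, p=beta[z[y]]))   # 3b. w_m from the topic of region y_m
```
      </preformat>
      <p>The key design point visible here is the correspondence: each caption word is forced to share its topic with one of the image regions, unlike GM-LDA where words and regions draw topics independently.</p>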
      <p>V. CONCLUSION</p>
      <p>In this paper, we have described topic models. The basic models, PLSA and LDA, are designed for one-type data, while the advanced models are designed for modeling multi-type data, especially for image modeling and annotation.</p>
      <p>REFERENCES</p>
      <p>[5] Chang et al., “CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines,” IEEE Trans. Circ. Systems Video Technol., 13(1), pp. 26–38, 2003.</p>
      <p>[6] Carneiro et al., “Supervised learning of semantic classes for image annotation and retrieval,” IEEE Trans. Pattern Anal. Machine Intell., 29(3), pp. 394–410, 2007.</p>
      <p>[7] S. Tollari, “Image indexing and retrieval by combining textual and …”</p>
      <p>T. Hofmann, “Probabilistic latent semantic indexing,” SIGIR.</p>
      <p>Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134, ACM Press, 2003.</p>
      <p>Marouane Ben Haj Ayech received a degree from ENIT in 2007 and the M.Sc. degree from ENIT in 2009, and is now a PhD student at ENIT. His research interests include machine learning.</p>
      <p>Hamid Amiri received the Diploma of Electrotechnics, Information Technique in 1978 and, in 1983, a further degree at the TU Braunschweig, Germany. He obtained the Doctorates Sciences in 1993. He was a Professor at the National School of Engineer of Tunis (ENIT). Up to 2009, he was at the Riyadh College of Telecom and Information. Currently, he is again at ENIT. His research is focused on:</p>
      <p>• Image Processing</p>
      <p>• Speech Processing</p>
      <p>• Document Processing</p>
    </sec>
  </body>
  <back>
  </back>
</article>