=Paper= {{Paper |id=None |storemode=property |title=Topic Models for Automatic Image Annotation |pdfUrl=https://ceur-ws.org/Vol-845/paper-9.pdf |volume=Vol-845 }} ==Topic Models for Automatic Image Annotation== https://ceur-ws.org/Vol-845/paper-9.pdf
          Topic models for automatic image annotation

    Marouane Ben Haj Ayech                               Faouzi Benzarti                                Hamid Amiri
         LSTS laboratory                                  LSTS laboratory                              LSTS laboratory
              ENIT                                             ENIT                                         ENIT
          Tunis, Tunisia                                   Tunis, Tunisia                               Tunis, Tunisia
     marouane.ayech@yahoo.fr                             benzartif@yahoo.fr                         hamidlamiri@yahoo.com



Abstract— Modern image retrieval systems, which allow users to submit textual queries and perform content-based image retrieval (CBIR), depend greatly on automatic image annotation. Many models, notably unsupervised topic models, have been applied successfully in text analysis and show encouraging results in automatic image annotation. In this work, we first describe the basic topic models: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Since these models assume that documents are represented as a "bag of words" in text analysis, we then describe the BOV-based image representation, an analogous representation adapted to image annotation. Based on the SIFT technique followed by vector quantization, this representation allows an image to be treated as a "bag of visterms" (BOV). Finally, we describe some advanced topic models used in image annotation: GM-PLSA, GM-LDA and CORR-LDA.

   Keywords- automatic image annotation; topic models; CBIR; SIFT; BOV.

                      I.    INTRODUCTION

    Automatic image annotation, i.e., the association of words with whole images [1], has become a crucial part of information retrieval systems, especially content-based image retrieval systems. Indeed, users prefer textual queries when searching for images, whereas retrieval in CBIR relies on low-level visual features of images; this mismatch is known as the semantic gap problem. To resolve this problem, researchers have tried to implement efficient image annotation techniques. These approaches must annotate images automatically in order to handle large image databases.

    Many approaches have been proposed for semantic image annotation and retrieval; they are roughly classified into two categories: supervised versus unsupervised approaches [9].

    The first class treats the annotation problem as a supervised classification task and the words as independent classes. An important principle of these methods is to measure similarity at the visual low level and to annotate unseen images by propagating the corresponding words. The most important works found in the literature are Chang et al., 2003 [5] and Carneiro et al., 2007 [6].

    The second class treats images and texts as equivalent data and applies unsupervised learning over the data in order to discover the correlation between visual features and textual words. Annotation is thus posed as statistical inference, treating an image as a bag of words and features that are both generated by latent (hidden) variables. Various models have adopted this idea. Mori et al. propose a model that uses the co-occurrence of words and features to predict words for annotating unseen images. Duygulu et al. propose a translation model between two languages, one for blobs and one for words, which translates blobs into words, i.e., attaches words to new image regions. Lavrenko et al. propose the continuous-space relevance model (CRM), in which word probabilities are estimated with a multinomial distribution and the blob features with a non-parametric kernel density estimate [9]. Other models, called topic models, associate latent aspects or topics with images; LSA, PLSA and LDA are such models and have been used successfully in text analysis. In this work, we describe them and their extensions designed for image annotation.

                      II.   RELATED WORK

    Topic models require a data representation based on the Bag-of-Words model, which implies that spatial relationships between words are ignored. The common approach is therefore to represent the data as an observation matrix, noted A, which we explain in detail later.

    The idea behind these models is to add levels of latent variables that model aspects or topics. Since they were designed for text analysis, we keep the text-analysis terminology here and present the analogy with image annotation in Section III. In this section, we are interested in the following basic models: LSA, PLSA and LDA.

    LSA is a linear algebra-based model that exploits the data, i.e., matrix A, from an algebraic perspective, while PLSA and LDA
                                     SIDOP’12 : 2nd Workshop on Signal and Document Processing


are statistical models and belong to the class of probabilistic generative models.

A. Bag of Words (BOW) model

    This model is also called "Orderless Document Representation": it assumes that the order of words has no significance, i.e., the term "home made" has the same probability as "made home".

    The BOW model simplifies data representation and, when applied to a corpus of text, gives a simple representation of the data.

    Suppose we have a corpus C that is a collection of documents C = {d_1, …, d_N}. Each document d_i consists of a set of words. C is represented by a term-by-document matrix A ∈ ℝ^(N×M), with entries a_(i,j) = n(d_i, w_j), where N is the number of documents, M is the vocabulary size and n(d_i, w_j) is the number of occurrences of w_j in document d_i. The vocabulary V is the set of all possible words in the corpus, V = {w_1, …, w_M}.

B. Latent Semantic Analysis (LSA)

    LSA is a linear algebra-based model which consists in decomposing the term-by-document matrix using singular value decomposition (SVD) [10]:

                A ≅ U S V^T                                (1)

where A ∈ ℝ^(N×M), U ∈ ℝ^(N×K), S ∈ ℝ^(K×K) and V^T ∈ ℝ^(K×M).

    LSA projects the original space onto a space of reduced dimensionality, which allows hidden similarities between terms to be captured. Unfortunately, LSA lacks a probabilistic interpretation [1].

C. Probabilistic Latent Semantic Analysis (PLSA)

    In this model, a document is not merely represented as a bag of words but is modeled as a mixture of aspects or topics, where each topic is represented as a multinomial distribution over words [11].

    The model is based on a conditional independence assumption: each observed word w_j is conditionally independent of the document d_i it belongs to, given a latent variable z_k:

                P(w_j, d_i) = P(d_i) P(w_j|d_i)                          (2)

                P(w_j|d_i) = ∑_(k=1)^K P(w_j|z_k) P(z_k|d_i)             (3)

    Since the number of latent variables is smaller than the number of words or documents, the z_k behave as a bottleneck in predicting words.

    Model fitting is performed with the EM algorithm, which alternates two steps to reach a maximum likelihood estimate.

E-step: the conditional distribution P(z_k|d_i, w_j) is computed from the previous estimate of the parameters:

        P(z_k|d_i, w_j) = P(z_k|d_i) P(w_j|z_k) / ∑_(l=1)^K P(z_l|d_i) P(w_j|z_l)    (4)

M-step: the parameters P(z_k|d_i) and P(w_j|z_k) are updated with the new expected values P(z_k|d_i, w_j):

        P(w_j|z_k) = [∑_(i=1)^N n(d_i, w_j) P(z_k|d_i, w_j)] / [∑_(m=1)^M ∑_(i=1)^N n(d_i, w_m) P(z_k|d_i, w_m)]    (5)

        P(z_k|d_i) = [∑_(j=1)^M n(d_i, w_j) P(z_k|d_i, w_j)] / [∑_(j=1)^M n(d_i, w_j)]    (6)

    Figure 1. The graphical model of PLSA. Nodes inside a given box indicate that they are replicated the number of times indicated in the bottom left corner. Filled circles indicate observed random variables; unfilled are unobserved.

    The maximum likelihood estimate is obtained by maximizing the objective function:

        L = ∏_(i=1)^N ∏_(j=1)^M P(w_j|d_i)^n(d_i, w_j)    (7)

where P(w_j|d_i) is given by (3).

D. Latent Dirichlet Allocation (LDA)

    In contrast to PLSA, LDA treats the multinomial weights over topics as latent random variables. These weights are sampled from a Dirichlet distribution, which is the conjugate prior of the multinomial distribution [8]. Since LDA is built from two conjugate distributions, it has fewer parameters than the PLSA model [4] and consequently reduces overfitting.

    LDA assumes the following generative process:

    1.   Choose N ~ Poisson(ξ)
    2.   Choose θ ∼ Dir(α):

              P(θ|α) = [Γ(∑_(i=1)^K α_i) / ∏_(i=1)^K Γ(α_i)] ∏_(i=1)^K θ_i^(α_i − 1)

    3.   For each of the N words w_n:
         (a) Choose a topic z_n ~ Multinomial(θ):

              P(z_n|θ) = ∏_(l=1)^K θ_l^(z_n^l) = θ_i, where z_n^l = 1 if l = i and z_n^l = 0 if l ≠ i

         (b) Choose a word w_n from p(w_n|z_n, β), a multinomial probability conditioned on the topic z_n:

              P(w_n^j = 1 | z_n^i = 1, β) = β_ij

where N is the number of words contained in a document and w = (w_1, …, w_n, …, w_N) is the document representation.
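As a minimal illustration, the generative process above can be simulated in Python. The number of topics, the vocabulary, α, β and the Poisson rate below are arbitrary values chosen for the example, not taken from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) parameters: K = 2 topics, vocabulary of V = 4 words.
alpha = np.array([1.0, 1.0])            # Dirichlet prior over topic weights
beta = np.array([[0.5, 0.5, 0.0, 0.0],  # P(w | z = 0)
                 [0.0, 0.0, 0.5, 0.5]]) # P(w | z = 1)

def generate_document(n_words, alpha, beta, rng):
    """Sample one document by following LDA's generative process."""
    theta = rng.dirichlet(alpha)                  # step 2: theta ~ Dir(alpha)
    words = []
    for _ in range(n_words):                      # step 3: for each of the N words
        z = rng.choice(len(alpha), p=theta)       # 3(a): topic z_n ~ Multinomial(theta)
        w = rng.choice(beta.shape[1], p=beta[z])  # 3(b): word w_n ~ Multinomial(beta[z_n])
        words.append(w)
    return words

n_words = rng.poisson(10)                         # step 1: N ~ Poisson(xi)
doc = generate_document(n_words, alpha, beta, rng)
```

This sketch only runs the model in the forward (generative) direction; estimating α and β from observed documents requires the inference procedures discussed next.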


w_n = (w_n^1, …, w_n^j, …, w_n^V) is the representation of the nth word in w, where w_n^j = 1 and w_n^l = 0 if l ≠ j;

z_n = (z_n^1, …, z_n^i, …, z_n^K) is the representation of the topic of the nth word, where z_n^i = 1 and z_n^l = 0 if l ≠ i.

We note that:

         •      The three-level hierarchical structure of LDA allows many concepts to be associated with one document.

         •      The words are generated by concepts.

                     Figure 2. The graphical model of LDA

    To estimate the parameters of LDA, we must apply inference and estimation procedures to the model. Thus, in the inference step we need to compute the posterior distribution:

        P(θ, z|w, α, β) = P(θ, z, w|α, β) / P(w|α, β)    (8)

where

        P(θ, z, w|α, β) = P(θ|α) ∏_(n=1)^N p(z_n|θ) p(w_n|z_n, β)

        P(w|α, β) = ∫ P(θ|α) [∏_(n=1)^N ∑_(z_n) p(z_n|θ) p(w_n|z_n, β)] dθ

The posterior distribution is intractable, and the solution is to use variational estimation methods [8].

              III.   BOV-BASED IMAGE REPRESENTATION

    Since aspect models were first used in text analysis, we have so far used text terminology (document, word, …). We now present the analogy between this terminology and the image annotation domain: images play the role of documents and visterms play the role of words. In fact, an image is divided into regions (visterms).

    We now briefly describe the procedure that allows an image to be represented as a set of visterms.

    First, we apply an interest-point detector to each image to extract characteristic points. Many techniques exist in the literature, such as the DoG (difference of Gaussians) point detector. The chosen technique must detect points that are invariant to some geometric and photometric transformations. Then, we apply SIFT (Scale Invariant Feature Transform) to obtain local descriptors from each image. These descriptors are computed on the regions around each interest point identified by the detector. Once the descriptors are obtained, and in order to get a fixed-size image representation, we quantize all descriptors into a discrete set of visterms using, for example, k-means. Each cluster obtained represents a visterm.

A. Scale Invariant Feature Transform (SIFT)

    SIFT is a method for extracting distinctive and invariant features from images that can be used to perform reliable matching between different views of an object or scene [3]. The stages of computation used to generate the set of image features are:

        •   Scale-space extrema detection: the first stage uses an interest-point detector, the difference of Gaussians (DoG), to identify potential interest points that are invariant to scale and orientation.

        •   Keypoint localization: at each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.

        •   Orientation assignment: one or more orientations are assigned to each keypoint location based on local image gradient directions.

        •   Keypoint descriptor: the local image gradients are measured at the selected scale in the region around each keypoint. They are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

B. Vectorial Quantization

    Once SIFT has been applied to the collection of images, we obtain a set of feature vectors corresponding to image regions. We use k-means, an unsupervised quantization technique, applied over the whole feature space. The k-means algorithm yields a number of centroids (this number must be fixed before applying k-means). Each centroid is a vector whose length equals the dimension of the feature space and which is representative of a subset of the feature space. These centroids are the visterms required by the BOV model.

C. Bag-of-visterms

    An image is modeled using the Bag-of-visterms model, a simple model that represents a document as an orderless set of terms. In the case of images, an image is


therefore represented as an orderless sequence of visual terms, called visterms.

    Given a collection of images, the first task is to identify the set of all visterms used at least once in at least one image. This set is called the vocabulary. Although the vocabulary is a set, we fix an arbitrary ordering for it so that we can refer to visterm 1 through visterm M, where M is the size of the vocabulary. Once the vocabulary has been fixed, each image is represented as a vector of length M with integer entries. If this vector is d, then its jth component d_j is the number of appearances of visterm j in the image. The length of the image is n = ∑_(j=1)^M d_j.

    As seen above, the collection of images is thus represented as an N-by-M matrix, where each row describes an image and each column corresponds to a visterm.

          IV.    TOPIC MODELS FOR IMAGE ANNOTATION

    In this section, we describe three advanced topic models: Gaussian-Multinomial PLSA, Gaussian-Multinomial LDA and Correspondence LDA. These three models are based on the basic topic models seen above, PLSA and LDA. Given that the basic models are suitable for modeling only one type of data, the advanced topic models are designed to fit multi-type data.

A. Gaussian-Multinomial PLSA (GM-PLSA)

    GM-PLSA is a combination of two PLSA models: a standard PLSA to model textual words and a continuous PLSA to model visual features. The two models share a common distribution over the latent variable z, noted P(z|d).

    The whole model, represented in Figure 3, assumes the following generative process:

1.   Select a document d_i with probability P(d_i)

2.   Choose a latent aspect z_k with probability P(z_k|d_i) from a multinomial distribution conditioned on the document d_i

3.   For each of the words, sample w_m from a multinomial distribution Mult(θ_k) conditioned on the latent aspect z_k

4.   For each of the feature vectors, sample r_n from a multivariate Gaussian distribution 𝒩(x|μ_k, σ_k) conditioned on the latent aspect z_k

               Figure 3. The graphical model of GM-PLSA

B. Gaussian-Multinomial LDA (GM-LDA)

    GM-LDA is a combination of two LDA models: a standard LDA to model textual words and a continuous LDA to model visual features. This model, represented in Figure 4, shows that the words w_m and the regions r_n of an image can come from different topics, which means that the whole document can contain multiple topics. Furthermore, we can view θ as a high-level representation of the whole document (image features + words).

                  Figure 4. The graphical model of GM-LDA

The joint probability distribution of the set of image features r, the set of associated words w and the latent variables θ, z and v is given as follows:

    p(r, w, θ, z, v) = p(θ|α) [∏_(n=1)^N p(z_n|θ) p(r_n|z_n, μ, σ)] [∏_(m=1)^M p(v_m|θ) p(w_m|v_m, β)]

The GM-LDA model assumes the following generative process:
    1. Sample a Dirichlet random variable θ
    2. For each of the N image regions:
           a. Sample z_n ~ Mult(θ)
           b. Sample a region description r_n conditional on z_n
    3. For each of the M words:
           a. Sample v_m ~ Mult(θ)
           b. Sample a word w_m conditional on v_m

C. Correspondence LDA (CORR-LDA)

    In the CORR-LDA model, the image features are generated first and the words are generated subsequently. Indeed, N region features are generated; then, for each of the M words, one of the regions is selected from the image and a corresponding word is drawn conditioned on the topic that generated the selected region (Figure 5).

The joint probability distribution of the set of image features r, the set of associated words w and the latent variables θ, z and y is given as follows:


    p(r, w, θ, z, y) = p(θ|α) [∏_(n=1)^N p(z_n|θ) p(r_n|z_n, μ, σ)] [∏_(m=1)^M p(y_m|N) p(w_m|y_m, z, β)]

The CORR-LDA model assumes the following generative process:
    1. Sample θ ~ Dir(α)
    2. For each of the N image regions:
           a. Sample z_n ~ Mult(θ)
           b. Sample a region description r_n from a multivariate Gaussian distribution conditional on z_n:
                    r_n ~ P(r|z_n, μ, σ)
    3. For each of the M words:
           a. Sample y_m ~ Unif(1, …, N)
           b. Sample a word w_m from a multinomial distribution conditional on y_m and z

                   Figure 5. The graphical model of CORR-LDA

                              V. CONCLUSION

    In this paper, we have described topic models. The basic models, PLSA and LDA, are designed for one-type data, while the advanced models are designed for modeling multi-type data, especially for image modeling and annotation.

                              REFERENCES

[1]   F. Monay and D. Gatica-Perez, "On Image Auto-Annotation with Latent Space Models", ACM Multimedia, pp. 275-278, 2003.

[2]   K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, "Matching words and pictures", Journal of Machine Learning Research, 3:1107-1135, 2003.

[3]   D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, 2004.

[4]   D. M. Blei, M. I. Jordan, "Modeling annotated data", in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 127-134, ACM Press, 2003.

[5]   E. Chang, K. Goh, G. Sychay and G. Wu, "CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines", IEEE Trans. Circuits and Systems for Video Technology, 13(1), pp. 26-38, 2003.

[6]   G. Carneiro, A. B. Chan, P. J. Moreno and N. Vasconcelos, "Supervised learning of semantic classes for image annotation and retrieval", IEEE Trans. Pattern Analysis and Machine Intelligence, 29(3), pp. 394-410, 2007.

[7]   S. Tollari, "Image indexing and retrieval by combining textual and visual information", thesis, Université du Sud Toulon-Var, 2006.

[8]   D. M. Blei, A. Y. Ng, M. I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, 3:993-1022, 2003.

[9]   Z. Li, Z. Shi, X. Liu, Z. Shi, "Modeling continuous visual features for semantic image annotation and retrieval", 2011.

[10]  S. Deerwester, S. Dumais, T. Landauer, G. Furnas and R. Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, 41(6):391-407, 1990.

[11]  T. Hofmann, "Probabilistic latent semantic indexing", SIGIR Conference, 1999.

                              BIBLIOGRAPHY

Marouane Ben Haj Ayech received an engineering degree from ENIT in 2007 and the M.Sc. degree from ENIT in 2009. He is now a PhD student at ENIT. His research interests include information retrieval, machine learning and computer vision.

Hamid Amiri received the Diploma of Electrotechnics, Information Technique in 1978 and the PhD degree in 1983 at the TU Braunschweig, Germany. He obtained the Doctorat ès Sciences in 1993. He was a Professor at the National School of Engineers of Tunis (ENIT), Tunisia, from 1987 to 2001. From 2001 to 2009 he was at the Riyadh College of Telecom and Information. Currently, he is again at ENIT. His research is focused on:
    •     Image Processing
    •     Speech Processing
    •     Document Processing
    •     Natural Language Processing