=Paper=
{{Paper
|id=Vol-1176/CLEF2010wn-ImageCLEF-SahbiEt2010
|storemode=property
|title=TELECOM ParisTech at ImageCLEF 2010 Photo Annotation Task: Combining Tags and Visual Features for Learning-Based Image Annotation
|pdfUrl=https://ceur-ws.org/Vol-1176/CLEF2010wn-ImageCLEF-SahbiEt2010.pdf
|volume=Vol-1176
}}
==TELECOM ParisTech at ImageCLEF 2010 Photo Annotation Task: Combining Tags and Visual Features for Learning-Based Image Annotation==
Hichem Sahbi¹ and Xi Li¹,²
¹ CNRS LTCI, UMR 5141, TELECOM ParisTech, 46 rue Barrault, 75634 Paris Cedex, France
² NLPR, CASIA, Beijing, China
hichem.sahbi@telecom-paristech.fr, lixichinanlpr@gmail.com

Abstract. In this paper, we describe the participation of TELECOM ParisTech in the ImageCLEF 2010 Photo Annotation challenge. This edition focuses on promoting the combination of visual and tag features in order to enhance photo annotation. The image collection is supplied with tags which are used both for training and testing. Our training approach consists of building SVM classifiers and kernels which take into account the similarity between visual features as well as tags. The results clearly corroborate (i) the complementarity of tags and visual descriptors and (ii) the effectiveness of SVM classifiers in photo annotation.

1 Introduction

Recent years have witnessed a rapid increase of image sharing spaces, such as Flickr, due to the spread of digital cameras and mobile devices. An urgent need is to effectively search these huge amounts of data and to exploit the structure of these sharing spaces. A possible solution is content-based image retrieval (CBIR), where images are represented using low-level visual features (color, texture, shape, etc.) and searched by analyzing and comparing those features. However, low-level visual features are usually unable to deliver satisfactory semantics, resulting in a gap between them and high-level human interpretations. To address this problem, a variety of machine learning techniques were introduced in order to discover the intrinsic correspondence between visual features and the semantics of images, and thereby predict keywords for images.

2 Related Work

Conventionally, image annotation is cast as a classification problem. Existing state-of-the-art methods (for instance [1, 2]) treat each keyword or concept as an independent class, and then train the corresponding concept-specific classifier to identify images belonging to that class, using a variety of machine learning techniques such as hidden Markov models [2], latent Dirichlet allocation [3], probabilistic latent semantic analysis [4], and support vector machines [5]. The aforementioned annotation methods may also be categorized into two branches: region-based methods [2, 12], which require a preliminary image segmentation step, and holistic methods [6, 25], which operate directly on the whole image. In both cases, training is carried out in order to learn how to attach keywords to the corresponding visual features.

The above annotation methods rely heavily on visual features. Due to the semantic gap, they are unable to fully explore the semantic information inside images. Another class of annotation methods has emerged that takes advantage of extra information (tags, context, users' feedback, ontologies, etc.) in order to capture the correlations between images and concepts. A representative work is the cross-media relevance model (CMRM) [6, 9] and its variants [7, 8], which learn the joint statistics of visual features and concepts; the model uses the keywords shared by similar images to annotate new ones. In [22], the similarity measure between images integrates contextual information for concept propagation. Semi-supervised annotation techniques were also studied and usually rely on graph inference [10–13].
The early work in [3, 26] is inspired by machine translation and considers images and keywords as two different languages; image annotation is then achieved by translating visual words into keywords. Other existing annotation methods focus on how to define an effective distance measure for exploring the semantic relationships between concepts in large-scale databases. In [19], the Normalized Google Distance (NGD) is proposed by exploiting the textual information available on the web: it is a measure of semantic correlation derived from the counts returned by Google's search engine for a given set of keywords. Following the idea of [19], the Flickr distance [20] is proposed to more precisely characterize the visual relationships between concepts. Each concept is represented by a visual language model that captures its underlying visual characteristics, and the Flickr distance between two concepts is then defined as the square root of the Jensen-Shannon (JS) divergence between the corresponding visual language models. Other techniques consider extra knowledge derived from ontologies (such as the popular WordNet [14–16]) in order to enrich annotations [21]. The method in [14] introduces a visual vocabulary in order to improve the translation model at the preprocessing stage of visual feature extraction. In [15], a directed acyclic graph is used to model the causal strength between concepts, and image annotation is performed by inference on this graph. In [17, 18], semantic ontology information is integrated in a post-processing stage in order to further refine the initial annotations.

3 Motivation and the Proposed Method at a Glance

Among the most successful annotation methods, those based on machine learning, and mainly support vector machines, are of particular interest as they perform well and are theoretically well grounded [24]. Support vector machines [23] basically require the design of similarity measures, also referred to as kernels, which should provide high values when two images share similar structures or appearances and should be as invariant as possible to linear and non-linear transformations. Kernels should also satisfy positive definiteness, which ensures, according to Vapnik's SVM theory [24], optimal generalization performance as well as the uniqueness of the SVM solution. In practice, kernels should not depend only on the intrinsic aspects of images (as images with the same semantics may have different visual and textual features), but also on different sources of knowledge, including context.

In this work, we introduce an image annotation framework based on a new similarity measure which takes high values not only when images share the same visual content but also when they share the same context. The context of an image is defined as the set of images with the same tags; it provides a better semantic description than either pure visual or pure tag-based descriptions. The issue of combining context and visual content for image retrieval is not new (see for instance [28–30]), but the novelty of this work is to (i) integrate context into a similarity measure useful for classification and annotation, and (ii) plug this similarity into support vector machines in order to benefit from their well-established generalization power [24]. This type of similarity will be referred to as context-based, while similarities relying only on the intrinsic visual or textual content will be referred to as context-free.
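To make the distinction concrete, the following is a minimal toy sketch of a context-free visual similarity versus a similarity that also compares the tag-defined contexts of two images. It is illustrative only and is not the constrained-energy formulation proposed in this paper; the RBF kernel, the mixing weight `alpha`, and the toy data are assumptions introduced here.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Context-free similarity: depends only on the visual features of x and y."""
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

def context(i, tags):
    """Toy visual context: indices of images sharing at least one tag with image i."""
    return [j for j in range(len(tags)) if j != i and tags[i] & tags[j]]

def context_based_similarity(i, j, feats, tags, gamma=1.0, alpha=0.5):
    """Illustrative context-based similarity: mixes the direct visual term with the
    average visual similarity between the tag-defined contexts of the two images."""
    k_visual = rbf_kernel(feats[i], feats[j], gamma)
    ci, cj = context(i, tags), context(j, tags)
    if not ci or not cj:                      # no shared-tag neighbours: fall back to visual term
        return k_visual
    k_context = np.mean([rbf_kernel(feats[p], feats[q], gamma) for p in ci for q in cj])
    return (1.0 - alpha) * k_visual + alpha * k_context

# Tiny example: four images with 2-D visual features and Flickr-like tag sets.
feats = np.array([[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [1.0, 0.9]])
tags = [{"sea", "sky"}, {"sea"}, {"clouds"}, {"sky", "clouds"}]
print(context_based_similarity(0, 1, feats, tags))
```

Unlike this toy mixture, the similarity actually used in our runs is obtained by solving a constrained energy function, as described below, and is designed to remain usable inside an SVM.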
Again, our proposed method goes beyond the naive use of low-level features and context-free similarities (established as the standard baseline in image retrieval) in order to design a similarity applicable to annotation and suitable for integrating the "contextual" information taken from tagged datasets. In the proposed method, two images (even with different visual content and even sharing different tags) will be declared similar if they share the same visual context, i.e., the set of images sharing the same tags. This is useful in practice, as tags may be noisy or misspelled. Furthermore, the intrinsic visual content of images might not always be relevant, especially for categories exhibiting large variations of the underlying visual aspects.

In this work, an image database is modeled as a graph where nodes are pictures and edges correspond to tags (links) shared between images. We design our similarity as the solution of a constrained energy function containing a fidelity term, which measures the visual similarity between images, and a context criterion, which captures the similarity between the underlying links.

4 Evaluation

4.1 MIR Flickr/ImageCLEF Collection

We evaluated our annotation method on the MIR Flickr dataset containing 18,000 images belonging to 93 categories (for instance "sky", "clouds", "water", "sea", "river", ...), among which 8,000 are used for training and 10,000 for testing. The whole dataset is annotated, but ground truth is provided only for the training set. The MIR Flickr collection contains 1,386 tags (provided by the Flickr users) which occur in at least 20 images, with an average of 8.94 tags per image (see Fig. 1 and [32]).

Fig. 1. Samples of images taken from the ImageCLEF 2010 Photo Annotation Task database.

4.2 Indexing and Annotation

Recent years have witnessed the great success of the bag-of-features representation in a wide range of applications, such as image retrieval, image classification, image segmentation, and object recognition. Inspired by text classification, visual feature spaces are conventionally partitioned by vector quantization (e.g., k-means) into several subspaces, each of which corresponds to a visual word; the bag-of-features representation is thus converted into a bag-of-words (BoW) representation. Since it relies on a basic histogram of orderless visual words, the BoW representation only reflects the global statistical properties of visual words and ignores their spatial layout, so it has little capability of capturing the geometric relationships among visual words. Motivated by this, we use in this evaluation campaign the same approach as in [27] in order to better capture the spatial layout of images. The algorithm is based on a spatial pyramid representation, which constructs a multi-level spatial pyramid by block division. For each block at each level, a traditional BoW representation in the SIFT feature space is computed. In this way, we obtain a set of block-specific BoW histograms at multiple levels, and the geometric relationships among visual words can be effectively captured.
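As a concrete illustration of the indexing step just described, here is a minimal sketch of a spatial-pyramid bag-of-words construction. It assumes local descriptors (e.g., SIFT) have already been quantized into visual-word indices with known keypoint coordinates; the function name, the three-level pyramid, and the per-block normalization are illustrative choices, not details taken from this paper or from [27].

```python
import numpy as np

def spatial_pyramid_bow(points, words, vocab_size, levels=3):
    """Build block-specific BoW histograms over a multi-level spatial pyramid.

    points : (N, 2) array of keypoint coordinates, normalized to [0, 1).
    words  : (N,) array of visual-word indices in [0, vocab_size).
    Returns the concatenation of one histogram per block per level.
    """
    histograms = []
    for level in range(levels):
        n_blocks = 2 ** level                          # level 0: 1x1, level 1: 2x2, level 2: 4x4
        # Map each keypoint to its block index at this level.
        bx = np.minimum((points[:, 0] * n_blocks).astype(int), n_blocks - 1)
        by = np.minimum((points[:, 1] * n_blocks).astype(int), n_blocks - 1)
        for i in range(n_blocks):
            for j in range(n_blocks):
                in_block = (bx == i) & (by == j)
                hist = np.bincount(words[in_block], minlength=vocab_size).astype(float)
                if hist.sum() > 0:
                    hist /= hist.sum()                 # per-block L1 normalization
                histograms.append(hist)
    return np.concatenate(histograms)

# Toy usage: 100 keypoints, a 50-word vocabulary, a 3-level pyramid.
rng = np.random.default_rng(0)
pts = rng.random((100, 2))
wds = rng.integers(0, 50, size=100)
print(spatial_pyramid_bow(pts, wds, vocab_size=50).shape)  # (1 + 4 + 16) * 50 = 1050
```

Note that the spatial pyramid matching of [27] additionally weights the per-level histograms and compares them with a histogram-intersection kernel; the sketch above only concatenates the raw block histograms.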
Given a test picture, the goal is to predict which categories (object classes) are present in that picture. This task is commonly known as concept detection. For this purpose, we trained "one-versus-all" SVM classifiers for each category; this training process is repeated over different folds (20 times) for each category, and the average score of the underlying SVM classifiers is taken on the test picture. This makes the classification results less sensitive to sampling and to unbalanced classes. Performances are reported using the Mean Average Precision (MAP), the Equal Error Rate (EER) and the Area Under the Curve (AUC); higher MAP and AUC and lower EER imply better performance. Figures 2, 3 and 4 show the annotation results of our best ImageCLEF run across the different classes.

Fig. 2. Average precision (AP) per class.

Fig. 3. Equal Error Rate (EER) per class.

Fig. 4. Area Under the Curve (AUC) per class.
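Below is a minimal sketch, in Python with scikit-learn, of the one-versus-all training and score-averaging scheme described above. The exact fold construction, kernels, and input features of our runs are not reproduced; random resampling of the training set and a standard RBF SVC are used purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_versus_all(X_train, Y_train, n_rounds=20, seed=0):
    """Train, for each concept, several one-versus-all SVMs on random resamples of
    the training set (a stand-in for the per-concept folds used in the paper)."""
    rng = np.random.default_rng(seed)
    n_samples, n_concepts = Y_train.shape
    classifiers = {c: [] for c in range(n_concepts)}
    for c in range(n_concepts):
        for _ in range(n_rounds):
            idx = rng.choice(n_samples, size=n_samples, replace=True)
            labels = Y_train[idx, c]
            if labels.min() == labels.max():   # skip resamples containing a single class
                continue
            clf = SVC(kernel="rbf")            # stand-in kernel; our runs combine visual and tag similarities
            clf.fit(X_train[idx], labels)
            classifiers[c].append(clf)
    return classifiers

def predict_scores(classifiers, X_test):
    """Average the decision values of the per-concept classifiers on test images."""
    scores = np.zeros((X_test.shape[0], len(classifiers)))
    for c, clfs in classifiers.items():
        if clfs:
            scores[:, c] = np.mean([clf.decision_function(X_test) for clf in clfs], axis=0)
    return scores

# Toy usage: 60 training images with 10-D features and 3 binary concept labels.
rng = np.random.default_rng(1)
X = rng.random((60, 10))
Y = (rng.random((60, 3)) > 0.5).astype(int)
models = train_one_versus_all(X, Y, n_rounds=5)
print(predict_scores(models, rng.random((4, 10))).shape)  # (4, 3)
```

Per-concept MAP, EER and AUC can then be computed from the averaged scores and the ground-truth labels, for instance with sklearn.metrics.average_precision_score and sklearn.metrics.roc_curve.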
5 Conclusion

We described in this work our participation in the ImageCLEF 2010 Photo Annotation Task. Our annotation method takes into account image features as well as their context links (taken from the tags of the MIR Flickr collection) in order to perform SVM learning and classification. Future extensions of this work include extra processing of these tags prior to SVM learning and further evaluations in the next campaigns.

Acknowledgement

This work is supported by the French National Research Agency (ANR) under the AVEIR project.

References

1. G. Carneiro and N. Vasconcelos, "Formulating semantic image annotation as a supervised learning problem," in Proc. of CVPR, 2005.
2. J. Li and J. Z. Wang, "Automatic linguistic indexing of pictures by a statistical modeling approach," IEEE Trans. on PAMI, 25(9):1075-1088, 2003.
3. K. Barnard, P. Duygulu, D. Forsyth, D. Blei, and M. Jordan, "Matching words and pictures," Journal of Machine Learning Research, 2003.
4. F. Monay and D. Gatica-Perez, "PLSA-based image auto-annotation: constraining the latent space," in Proc. of ACM MULTIMEDIA, 2004.
5. Y. Gao, J. Fan, X. Xue, and R. Jain, "Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers," in Proc. of ACM MULTIMEDIA, 2006.
6. J. Jeon, V. Lavrenko, and R. Manmatha, "Automatic image annotation and retrieval using cross-media relevance models," in Proc. of ACM SIGIR, pp. 119-126, 2003.
7. V. Lavrenko, R. Manmatha, and J. Jeon, "A model for learning the semantics of pictures," in Proc. of NIPS, 2004.
8. S. Feng, R. Manmatha, and V. Lavrenko, "Multiple Bernoulli relevance models for image and video annotation," in Proc. of ICCV, pp. 1002-1009, 2004.
9. J. Liu, B. Wang, M. Li, Z. Li, W. Ma, H. Lu, and S. Ma, "Dual cross-media relevance model for image annotation," in Proc. of ACM MULTIMEDIA, pp. 605-614, 2007.
10. X. Wan, J. Yang, and J. Xiao, "Manifold-ranking based topic-focused multi-document summarization," in Proc. of IJCAI, pp. 2903-2908, 2007.
11. D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf, "Ranking on data manifolds," in Proc. of NIPS, 2004.
12. J. Liu, M. Li, Q. Liu, H. Lu, and S. Ma, "Image annotation via graph learning," Pattern Recognition, 42(2):218-228, 2009.
13. J. Liu, M. Li, W. Ma, Q. Liu, and H. Lu, "An adaptive graph model for automatic image annotation," in Proc. of ACM International Workshop on Multimedia Information Retrieval, pp. 61-70, 2006.
14. M. Srikanth, J. Varner, M. Bowden, and D. Moldovan, "Exploiting ontologies for automatic image annotation," in Proc. of ACM SIGIR, pp. 552-558, 2005.
15. Y. Wu, E. Y. Chang, and B. L. Tseng, "Multimodal metadata fusion using causal strength," in Proc. of ACM MULTIMEDIA, pp. 872-881, 2005.
16. G. A. Miller, "WordNet: a lexical database for English," Communications of the ACM, 38(11):39-41, 1995.
17. C. Wang, F. Jing, L. Zhang, and H. J. Zhang, "Image annotation refinement using random walk with restarts," in Proc. of ACM MULTIMEDIA, pp. 647-650, 2006.
18. Y. Jin, L. Khan, L. Wang, and M. Awad, "Image annotations by combining multiple evidence & WordNet," in Proc. of ACM MULTIMEDIA, pp. 706-715, 2005.
19. R. Cilibrasi and P. M. B. Vitanyi, "The Google similarity distance," IEEE Transactions on Knowledge and Data Engineering, 2007.
20. L. Wu, X. Hua, N. Yu, W. Ma, and S. Li, "Flickr distance," in Proc. of ACM MULTIMEDIA, 2008.
21. Y. Wang and S. Gong, "Translating topics to words for image annotation," in Proc. of ACM CIKM, 2007.
22. Z. Lu, H. H. S. Ip, and Q. He, "Context-based multi-label image annotation," in Proc. of ACM CIVR, 2009.
23. B. Boser, I. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. of the Fifth Annual ACM Workshop on Computational Learning Theory, Pittsburgh, 1992.
24. V. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.
25. C. Wang, S. Yan, L. Zhang, and H. Zhang, "Multi-label sparse coding for automatic image annotation," in Proc. of CVPR, 2009.
26. P. Duygulu, K. Barnard, J. de Freitas, and D. Forsyth, "Object recognition as machine translation: learning a lexicon for a fixed image vocabulary," in Proc. of ECCV, 2002.
27. S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proc. of CVPR, 2006.
28. A. C. Gallagher, C. G. Neustaedter, L. Cao, J. Luo, and T. Chen, "Image annotation using personal calendars as context," in Proc. of ACM MULTIMEDIA, 2008.
29. L. Cao, J. Luo, and T. S. Huang, "Annotating photo collections by label propagation according to multiple similarity cues," in Proc. of ACM MULTIMEDIA, 2008.
30. Y. H. Yang, P. T. Wu, C. W. Lee, K. H. Lin, W. H. Hsu, and H. Chen, "ContextSeer: context search and recommendation at query time for shared consumer photos," in Proc. of ACM MULTIMEDIA, 2008.
31. D. Haussler, "Convolution kernels on discrete structures," Technical Report UCSC-CRL-99-10, University of California at Santa Cruz, Computer Science Department, July 1999.
32. S. Nowak and M. Huiskes, "New strategies for image annotation: overview of the photo annotation task at ImageCLEF 2010," in the Working Notes of CLEF 2010, Padova, Italy, 2010.