IAM@ImageCLEFPhotoAnnotation 2009: Naïve application of a linear-algebraic semantic space

Jonathon S. Hare, Paul H. Lewis
Intelligence Agents Multimedia Group
School of Electronics and Computer Science, University of Southampton, Southampton, UK
{jsh2|phl}@ecs.soton.ac.uk

Abstract

This paper describes Southampton's submissions to the 2009 ImageCLEF photo annotation task. For the task we used an annotation system, developed previously at Southampton, based on the idea of constructing semantic spaces. To represent the image content, we used a combination of different SIFT and Colour-SIFT features detected using the difference-of-Gaussian and MSER techniques. These features were converted into a visual-term representation by vector quantisation against a codebook learnt using hierarchical k-means clustering. In terms of EER and AUC the annotator performs reasonably well; however, it struggles when evaluated using the hierarchical measure proposed for the task, due to the way the annotation confidences are thresholded.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance Evaluation; I.4.9 [Artificial Intelligence]: Applications; I.2.6 [Artificial Intelligence]: Learning

General Terms

Automatic Annotation, Performance, Experimentation

Keywords

Image Content Analysis, Data Fusion, Semantic Space

1 Introduction

The ImageCLEF 2009 photo annotation task set the challenge of automatically annotating 13000 images with 53 annotation concepts. The allowable training data was limited to a set of 5000 images pre-labelled with the concepts. The images themselves were drawn from the MIR Flickr 25000 image collection [4]. Southampton's submissions to the task used a previously developed annotation system, with a combination of visual-term features created from local descriptors of salient interest regions.

2 Methodology

As with many automatic annotation approaches, the methodology applied to this task involved extracting feature vectors for each of the images, and then feeding the features of a training set, together with their annotations, into a machine learning system. The machine learning system attempts to learn low-level relationships between the features and the annotations. Once the training phase is complete, features from un-annotated images can be fed into the system, which uses the learnt relations to predict annotations.

2.1 Visual Features

The images were represented by vectors of visual-term occurrences [11]. The visual terms were created by finding interest points, extracting local feature descriptors, and quantising the descriptors against a pre-determined codebook. For the experiments in this task we used a combination of multi-scale difference-of-Gaussian interest regions with SIFT features [6], MSER regions [7] with SIFT features, and MSER regions with colour-SIFT features [1]. Each of the three region/feature combinations had its own 3125-term codebook created by applying hierarchical k-means [8] (5 levels with 5 clusters per node). The codebook size was not optimised in any way; it was chosen as a best guess based on previous experience with these feature morphologies and the machine learning technique described in the next subsection.

The final image representation was created by appending the term-occurrence vectors from each of the three region/feature representations to create a single vector with 9375 dimensions.
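As a concrete illustration of the codebook and quantisation steps, the following is a minimal sketch of a 5-way, 5-level vocabulary tree and the mapping from an image's local descriptors to a term-occurrence histogram. It is not the system's actual code (which is written in C/C++); the use of scikit-learn, and all function names, are assumptions made purely for the sake of a small runnable example.

# Illustrative sketch only: a 5x5 vocabulary tree (hierarchical k-means) and the
# quantisation of one image's local descriptors into a visual-term histogram.
# Library and function names here are assumptions, not the authors' implementation.
import numpy as np
from sklearn.cluster import KMeans

BRANCHING = 5   # clusters per node (as in the paper)
DEPTH = 5       # levels, giving 5**5 = 3125 leaf visual terms per feature type


def build_tree(descriptors, depth=DEPTH, k=BRANCHING):
    """Recursively cluster descriptors; returns a nested tree of k-means models."""
    if depth == 0 or len(descriptors) < k:
        return None
    km = KMeans(n_clusters=k, n_init=4).fit(descriptors)
    children = [build_tree(descriptors[km.labels_ == i], depth - 1, k)
                for i in range(k)]
    return {"kmeans": km, "children": children}


def leaf_index(tree, descriptor, depth=DEPTH, k=BRANCHING):
    """Walk one descriptor down the tree; return its visual-term id in [0, k**depth)."""
    index, node = 0, tree
    for level in range(depth):
        if node is None:                      # branch terminated early during training
            return index * k ** (depth - level)
        c = int(node["kmeans"].predict(descriptor[None, :])[0])
        index = index * k + c
        node = node["children"][c]
    return index


def term_histogram(tree, descriptors, vocab_size=BRANCHING ** DEPTH):
    """Quantise all descriptors of one image into a term-occurrence vector."""
    hist = np.zeros(vocab_size)
    for d in descriptors:
        hist[leaf_index(tree, d)] += 1
    return hist

In this sketch one tree would be learnt per region/feature combination, and the three 3125-bin histograms concatenated to give the 9375-dimensional image vector described above.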
2.2 Machine Learning

The machine-learning component is based on a linear-algebraic semantic space [3, 2], which is a development and generalisation of a text indexing technique called Cross-Language Latent Semantic Indexing [5]. This technique produces a vector space into which both visual terms and keyword terms are mapped along with the images. Un-annotated images can then be projected into this space; annotation was performed by projecting the test images into the space and ranking the possible annotations based on their cosine similarity to each image.

The use of the cosine similarity measure gives each possible annotation a score between -1 and 1; however, these scores are not themselves all that informative. A higher score does mean more confidence in an annotation, but only when considered against all the other annotations for the same image. For the purposes of evaluating the annotator using the EER and AUC measures this is not too much of a problem: we can simply scale the scores to the 0..1 range (i.e. add one and divide by two). However, as will be discussed in the next section, the hierarchical scoring measure [10] thresholds the annotation confidences at 0.5 to produce a binary indication of present/not-present. For the semantic space this poses a big problem, as the position of the threshold should ideally be set differently for each image, based on the confidences of all the predicted annotations.
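As a rough sketch of the kind of construction involved, the code below follows the cross-language LSI formulation with a dense SVD; the actual system uses a sparse SVD (via SVDLIBC, see Section 3.2), and its exact weighting and normalisation may well differ, so this should be read as an assumption-laden illustration rather than the implementation. Visual-term and keyword-term occurrence vectors for the training images are stacked into a single matrix, a truncated SVD defines the space, un-annotated images are folded in with their keyword rows set to zero, and the concepts are ranked by cosine similarity.

# Rough, illustrative sketch of a cross-language-LSI-style semantic space.
# The real system uses a sparse SVD (SVDLIBC); the weighting here is an assumption.
import numpy as np

RANK = 100  # dimensionality of the semantic space (illustrative choice)


def train_space(visual_terms, keyword_terms, rank=RANK):
    """visual_terms: (n_vterms, n_train), keyword_terms: (n_keywords, n_train).
    Stack both term types over the same training images; a truncated SVD gives the space."""
    A = np.vstack([visual_terms, keyword_terms]).astype(float)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :rank], s[:rank]


def annotate(U, s, n_keywords, test_visual_terms):
    """Fold an un-annotated image into the space (its keyword rows are zero) and
    score every keyword term by cosine similarity, scaled to the 0..1 range."""
    n_vterms = U.shape[0] - n_keywords
    d = np.concatenate([test_visual_terms, np.zeros(n_keywords)])
    img = (U.T @ d) / s                        # fold-in: Sigma^-1 U^T d
    kw = U[n_vterms:] * s                      # keyword-term coordinates: U_k Sigma_k
    cos = (kw @ img) / (np.linalg.norm(kw, axis=1) * np.linalg.norm(img) + 1e-12)
    return (cos + 1.0) / 2.0                   # the (cos + 1)/2 scaling discussed above

For this task the keyword block would hold the 53 concepts beneath the 9375 visual-term rows, and annotate would return one confidence per concept for a single test image.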
3 Experiments, Results and Discussion

We submitted three different runs to the task organisers. The first run was trained on the raw annotations. The second included a partial expansion of the annotation hierarchy [9] provided by the organisers, based on the non-abstract nodes (i.e. if Lake=true then Water=true). The third included a full expansion of the hierarchy. The hierarchical expansion just means that a few extra annotation terms are fed into the machine learning component together with the leaf-node annotations already present for the image in question. The run titles and settings are shown in Table 1.

Run Title                                 Description
IAM Southampton 30 2 1245438072355.txt    Raw annotations
IAM Southampton 30 2 1245519187248.txt    Partial hierarchical expansion
IAM Southampton 30 2 1245519327555.txt    Full hierarchical expansion

Table 1: Description of submitted runs

Technique                                 EER     AUC
Mean                                      0.373   0.553
Median                                    0.372   0.673
Min                                       0.234   0.070
Max                                       0.526   0.839
Random                                    0.500   0.499
IAM Southampton 30 2 1245438072355.txt    0.330   0.715

Table 2: Summary of averaged EER and AUC scores over all annotation terms. The summary statistics were calculated using only the best run of each participant.

3.1 Preliminary Results Analysis

The two runs that included the hierarchical information did not perform as well (based on average EER and AUC) as the one based on the raw annotations. Looking at the EER scores for each annotation term, the hierarchical methods were consistently worse performing. For these reasons the results for these runs will not be discussed further.

The EER and AUC scores are summarised in Table 2. Our scores are better than the averages of the other participants; however, they are still a fair way off the top scores. From past experience with the semantic space approach to annotation, we expect a large diversity in the performance for different terms. This is due to differences in the amount of training data for each term, and also to the amount of visual diversity that might be associated with a term; for example, visually specific annotation terms require less training data than diverse ones.

Table 3 shows the top- and bottom-most five annotation terms by EER. It is interesting that the worst performing terms are those that are rather general and unspecific, whilst the top performing terms all have very specific visual representations.

Annotation Term         EER
Sunset-Sunrise          0.232
Landscape-Nature        0.234
Night                   0.237
Sea                     0.243
Mountains               0.249
Aesthetic-Impression    0.416
Overall-Quality         0.436
Neutral-Illumination    0.466
Sports                  0.470
Fancy                   0.478

Table 3: Best and worst annotations by EER

Table 4 shows the results of our annotator using the hierarchical measure [10]. Unfortunately, because this scoring measure performs a binary thresholding operation on the confidence scores, the performance of our technique measures at the lower end of the spectrum of results from the different participants. As previously discussed, the semantic space annotation approach does not really permit the global setting of such a threshold; the toy example following Table 4 illustrates the problem.

Technique                                 Hierarchical score
Mean                                      0.684
Median                                    0.752
Min                                       0.390
Max                                       0.829
Random                                    0.384
IAM Southampton 30 2 1245438072355.txt    0.41897374

Table 4: Summary of averaged hierarchical scores
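The toy example below uses invented scores (they are not taken from our runs) to show how a fixed global cut at 0.5 interacts badly with per-image shifts in the scaled confidences, even though the within-image ranking, which is what EER and AUC reflect, is unchanged.

# Toy illustration of the thresholding problem; the scores below are invented.
import numpy as np

# Scaled confidences ((cos + 1) / 2) for three concepts on two hypothetical images
image_a = {"Sea": 0.58, "Night": 0.55, "Sports": 0.52}   # scores shifted upwards
image_b = {"Sea": 0.48, "Night": 0.45, "Sports": 0.42}   # scores shifted downwards

GLOBAL_THRESHOLD = 0.5  # as applied by the hierarchical measure
print({c: s > GLOBAL_THRESHOLD for c, s in image_a.items()})  # every concept marked present
print({c: s > GLOBAL_THRESHOLD for c, s in image_b.items()})  # no concept marked present

# The within-image ranking (Sea > Night > Sports) is identical for both images,
# which is why the threshold-free EER/AUC scores look reasonable.  A per-image
# threshold, e.g. one set relative to each image's own score distribution, would
# be needed to turn the rankings into sensible binary decisions:
cut = np.mean(list(image_b.values()))                         # 0.45 for image_b
print({c: s > cut for c, s in image_b.items()})               # only "Sea" marked present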
3.2 Computational Performance and Implementation Details

The feature extraction phase was performed in parallel (4 images being processed at once) on a quad-core machine (Intel Core 2 Quad @ 2.66GHz, 8GB RAM, Red Hat Enterprise 5.3). The time for image processing varied based on both the size of the image and the image content. Timings for a typical image from the training set are shown in Table 5.

Feature                                               Time
Difference-of-Gaussian detection + SIFT extraction    ≈ 1.8s/image
MSER detection                                        ≈ 0.1s/image
SIFT extraction on MSER                               ≈ 2.7s/image
Colour-SIFT extraction on MSER                        ≈ 1.0s/image
Vector quantisation                                   < 0.1s per set of extracted features
Estimated total                                       ≈ 5.9s/image

Table 5: Approximate timings for feature extraction on a typical image from the training set.

Training the semantic space took approximately 1 hour on a dual quad-core 2.8GHz Xeon workstation running Mac OS X (the semantic space code is single-threaded, so it only uses a single core). We would estimate that no more than 1GB of RAM was used during the semantic space training phase. Projecting all the test images in bulk took under 2 minutes, and it took about 5 minutes to generate annotations for all 13000 images; so, in general, it took less than 0.05s to get from a list of visual terms to the suggested annotations for a single image.

Implementation. The semantic-space software is written in C and makes use of Doug Rohde's SVDLIBC (http://tedlab.mit.edu/~dr/SVDLIBC/) for efficiently performing the large sparse SVD. The feature detector and descriptor software is written in C and C++. The image processing components were driven by a standard UNIX make file, which enabled easy parallelisation.

4 Conclusions

For this task we applied an older technique for automatically annotating images using a semantic space. The performance of the technique in terms of EER and AUC is fairly competitive; however, the technique does not mesh well with the hierarchical scoring measure proposed for this task. The semantic space technique is reasonably computationally efficient; most of the time is spent processing the images to extract features. In our experiments, the use of the hierarchy did not lead to any improvement in the annotation quality.

Acknowledgements

The authors wish to thank the European Union, which supported this work under the Seventh Framework project LivingKnowledge (IST-FP7-231126), and the LiveMemories project, graciously funded by the Autonomous Province of Trento (Italy).

References

[1] Gertjan J. Burghouts and Jan-Mark Geusebroek. Performance evaluation of local colour invariants. Computer Vision and Image Understanding, 113(1):48–62, 2009.

[2] Jonathon S. Hare, Sina Samangooei, Paul H. Lewis, and Mark S. Nixon. Semantic spaces revisited: investigating the performance of auto-annotation and semantic retrieval using semantic spaces. In ACM CIVR '08, pages 359–368. ACM, July 2008.

[3] Jonathon S. Hare, Paul H. Lewis, Peter G. B. Enser, and Christine J. Sandom. A linear-algebraic technique with an application in semantic image retrieval. In Hari Sundaram, Milind Naphade, John R. Smith, and Yong Rui, editors, CIVR 2006, volume 4071 of LNCS, pages 31–40. Springer, 2006.

[4] Mark J. Huiskes and Michael S. Lew. The MIR Flickr retrieval evaluation. In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA, 2008. ACM.

[5] T. K. Landauer and M. L. Littman. Fully automatic cross-language document retrieval using latent semantic indexing. In Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, pages 31–38, UW Centre for the New OED and Text Research, Waterloo, Ontario, Canada, October 1990.

[6] David Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, January 2004.

[7] Jiri Matas, Ondrej Chum, Martin Urban, and Tomáš Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Paul L. Rosin and A. David Marshall, editors, BMVC. British Machine Vision Association, 2002.

[8] David Nister and Henrik Stewenius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.

[9] Stefanie Nowak and Peter Dunker. Overview of the CLEF 2009 Large-Scale Visual Concept Detection and Annotation Task. In CLEF working notes 2009, Corfu, Greece, 2009.

[10] Stefanie Nowak and Hanna Lukashevich. Multilabel classification evaluation using ontology information. In Claudia d'Amato, Nicola Fanizzi, Marko Grobelnik, Agnieszka Lawrynowicz, and Vojtech Svátek, editors, Proceedings of the First ESWC Workshop on Inductive Reasoning and Machine Learning on the Semantic Web, Heraklion, Greece, June 2009.

[11] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, October 2003.