IAM@ImageCLEFphoto 2009: Experiments on Maximising Diversity using Image Features

Jonathon S. Hare, David P. Dupplaw, Paul H. Lewis
Intelligence Agents Multimedia Group
School of Electronics and Computer Science, University of Southampton, Southampton, UK
{jsh2|dpd|phl}@ecs.soton.ac.uk

Abstract

This paper describes the diversity-enabled retrieval system constructed at Southampton for the ImageCLEFphoto 2009 task. The retrieval system used Terrier as the underlying textual indexing and retrieval system, and combined it with a technique for re-ranking the results by maximising the visual dissimilarity of retrieved images. The results show that our visual re-ranking method does indeed increase the diversity in the top results; however, at the same time it causes a slight drop in precision. The text-based approach designed for handling the ‘part 1 topics’ of the task is also shown to perform very well.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Image search, Diversity, Measurement, Performance, Experimentation

Keywords

Image Content Analysis, Data Fusion, Content-based Image Retrieval

1 Introduction

The 2009 ImageCLEF photo retrieval task [9] aimed to promote diversity in image search. The task was performed using a set of nearly 500,000 captioned images provided by the Belga photo agency. The task incorporated two separate query types. The part 1 topics were described as a main topic (e.g. ‘David Beckham’), together with a set of clusters or sub-topics (e.g. ‘Manchester United’, ‘Real Madrid’, etc.). The part 1 topics also included detailed information about what might be expected in the results of a search for each of the clusters. The part 2 topics provided a single topic with no context. Both part 1 and part 2 topics included example images which could be used for content-based search or classification.

In the 2009 ImageCLEF photo retrieval task, Southampton's baseline system used standard text retrieval techniques for the part 2 topics. The baseline handling of the part 1 topics augmented the standard text search with multiple sub-queries (one per cluster) followed by a merge phase in order to build a complete ranking for the topic. On top of the baseline system we developed a re-ranking procedure for the result lists that leveraged visual features extracted from the images and attempted to re-order each list such that the first images in the list were highly visually dissimilar.

2 Methodology

The overall methodology for tackling the task involved building a baseline retrieval system using only the textual captions, and then augmenting the search results generated by the baseline system with information extracted from the actual content of the images in order to promote a diverse spectrum of different images near the top of the ranked search results. Each of the different aspects of this methodology is described below.

2.1 Text-based Baseline System

The Terrier text retrieval system [8], developed at the University of Glasgow, was used as the underlying text search technology for our submissions. In particular, we adapted Terrier to index the image captions, and modified the search algorithm based on the particular query formulations in the task.

2.1.1 Indexing of Captions

The caption tokeniser was configured to tokenise all of the text in the input record after the document identifier. The tokeniser took any non-alphanumeric character, or run of non-alphanumeric characters, as a token separator. Each token was converted to lowercase, and tokens with more than four digits or three consecutive instances of the same letter were rejected. Tokens matching the standard Terrier list of stopwords were also discarded. For experimentation, we built two separate indices: one with the Porter Stemmer [10] feature enabled, and one without.
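As an illustration, the following Python sketch shows one way the tokenisation and filtering rules above could be implemented. It is a sketch only: the actual system used Terrier's own tokeniser machinery, the stopword set shown here is a stand-in for Terrier's standard list, and reading "more than four digits" as a count of digit characters is our interpretation.

    import re

    STOPWORDS = {"the", "a", "an", "of", "in", "and"}  # stand-in for Terrier's standard stopword list

    def tokenise_caption(caption_text):
        """Split caption text on runs of non-alphanumeric characters and filter the tokens."""
        tokens = []
        for raw in re.split(r"[^A-Za-z0-9]+", caption_text):
            token = raw.lower()
            if not token:
                continue
            if sum(c.isdigit() for c in token) > 4:   # more than four digits: reject
                continue
            if re.search(r"([a-z])\1\1", token):      # three consecutive identical letters: reject
                continue
            if token in STOPWORDS:                    # standard stopword removal
                continue
            tokens.append(token)
        return tokens

For example, tokenise_caption("David Beckham, Real Madrid - 2005") would yield ['david', 'beckham', 'real', 'madrid', '2005'].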
2.1.2 Retrieval using the Indexed Captions

For retrieval, Terrier was configured to use the standard TF-IDF weighting model (based on Robertson's definition of TF and the standard Spärck Jones IDF definition [3]). Within the photo retrieval task, there were two sets of queries or topics. Each query in the first set contained a title, together with a number of explicit clusters, which themselves included a title (“clusterTitle”), a description (“clusterDesc”) and sample images. The second set of queries contained only a single title together with three sample images relevant to that title. The retrieval methods we used were necessarily different for the two sets of query modalities; however, the search and retrieval process is driven completely automatically by the provided topic files.

Part 1 topics. This year, we only considered the cluster titles for building our search. Because the cluster titles also contained the terms used in the overall topic title, for the purposes of reporting it is assumed that the topic title is used as well. For each of the topics, a three-stage process was used to generate the results (a sketch of the merge step is given at the end of this section):

1. Convert each “clusterTitle” into a Terrier query such that all words prefixed with a ‘-’ must not appear anywhere in the captions of the returned images, and all the remaining words must appear in the captions of the result documents.

2. Query the index using each of the generated queries from the cluster titles in turn and store the results.

3. Merge the result lists in a round-robin fashion, ignoring the scores assigned by Terrier (i.e. just using ranked position). The top-ranked results from each of the sub-queries come first, followed by the second-ranked images, and so on. Duplicate images (i.e. those retrieved by more than one of the searches) are also filtered, so that only the higher-ranking instance is retained. At this point we also gave each image an arbitrary score based on its position in the sub-query search; in our implementation all of the top-ranked images from the sub-queries scored 4000, the second-ranked images scored 3999, and so on.

Part 2 topics. The second set of queries was processed in a much simpler manner; results were generated by feeding the title into a Terrier query (marking all terms in the title field as required, as in the part 1 topics). No attempt to improve diversity by further analysis of the textual information was made for these topics.
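The following Python sketch illustrates the round-robin merge, duplicate filtering and arbitrary scoring described in step 3 above. It is a sketch under our reading of that step; the function and variable names are hypothetical and do not correspond to the actual implementation.

    def merge_round_robin(sub_results, start_score=4000):
        """Interleave per-cluster result lists by rank, drop duplicates, and assign scores.

        sub_results: one ranked list of image identifiers per cluster title.
        Returns (image_id, score) pairs: every image at rank 0 of a sub-query scores
        start_score, every image at rank 1 scores start_score - 1, and so on.
        """
        merged, seen = [], set()
        longest = max((len(r) for r in sub_results), default=0)
        for rank in range(longest):              # walk down the rank positions
            for results in sub_results:          # round-robin over the sub-queries
                if rank < len(results):
                    image_id = results[rank]
                    if image_id not in seen:     # keep only the higher-ranking duplicate
                        seen.add(image_id)
                        merged.append((image_id, start_score - rank))
        return merged

For example, merge_round_robin([["a", "b"], ["c", "a"]]) yields [("a", 4000), ("c", 4000), ("b", 3999)]; the duplicate occurrence of "a" from the second sub-query is discarded.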
2.2 Enhancing Result-set Diversity with Image Features

We hypothesise that the use of image features could give a large boost to the diversity of a result set. In particular, we developed a technique that re-ranks a list of search results by maximising the visual dissimilarity of the top-ranking images.

2.2.1 Image Features

The image feature used in the submitted experiments was a visual-term representation based on quantised SIFT features extracted from a multiscale difference-of-Gaussian pyramid [4]. The features were quantised to a vocabulary of 3125 terms [11, 2]. The codebook for the vector quantiser was learnt using a hierarchical K-means algorithm [6] (5 levels with 5 clusters per node). Due to time constraints, we used a pre-existing codebook that we had previously trained on the 5000 training images from the ImageCLEF 2009 photo annotation task [7]. We also generated similar features using the MSER algorithm [5] coupled with quantised SIFT and Colour-SIFT [1] features; however, again for time reasons, these were not included in the submitted results.

2.2.2 Re-ranking Algorithm

In order to try to improve the diversity of the result set, we propose a technique that incorporates visual information into the ranking. The proposed algorithm works by maximising the distance between the set of already re-ranked images, R, and an image from the ranked list of images, I, retrieved by the text-based search. The distance between the feature-vectors of a set of images, R, and the feature-vector of a single image, f_q, is calculated using Equation 1. The function d(., .) in Equation 1 can be any distance function that compares two vectors; in this work we chose to use the Euclidean distance.

    D(f_q, R) = \prod_{f_r \in R} d(f_q, f_r)    (1)

Algorithm 1 shows the steps taken to re-order the results from a text-based search using image feature-vectors. The output is a list containing all of the input images, but in a different order. Note that the algorithm treats the first input image as a special key, and that image will also appear at rank 1 in the output.

Algorithm 1: Re-ranking by maximising visual dissimilarity.
    input : A ranked list of n images, I = I_1 ... I_n, from the text-based search
    output: A re-ranked list of n images, R = R_1 ... R_n
    begin
        Construct an empty list R
        Add the first element of I, I_1, to R, and remove it from I
        while I is not empty do
            Find I_x from I such that D(I_x, R) (from Equation 1) is maximised
            Add I_x to R and remove it from I
        return R
    end

Application to Part 1 Topics. In our experiments, we applied the visual re-ranking procedure individually to each of the result sets formed from the sub-searches created from the cluster titles. The re-ranked sub-result-sets were then merged as in the text-based search for part 1 topics.

Application to Part 2 Topics. Re-ranking the results was simply a matter of applying the algorithm to the result set formed by the part 2 topics text search.
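To make Algorithm 1 concrete, the following Python sketch implements it over the visual-term feature vectors described in Section 2.2.1. It is a sketch only: the feature mapping, function names and the use of plain Python lists are illustrative assumptions rather than the actual implementation.

    import math

    def euclidean(a, b):
        """d(., .): Euclidean distance between two feature vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def dissimilarity(fq, R, features, d=euclidean):
        """Equation 1: D(f_q, R) is the product of d(f_q, f_r) over all f_r already in R."""
        D = 1.0
        for r in R:
            D *= d(features[fq], features[r])
        return D

    def rerank_by_dissimilarity(I, features):
        """Algorithm 1: greedily re-order I so that successive images are maximally dissimilar.

        I:        ranked list of image identifiers from the text-based search
        features: mapping from image identifier to its feature vector
        """
        I = list(I)
        R = [I.pop(0)]                  # the first input image is kept at rank 1
        while I:
            # choose the remaining image that maximises D(I_x, R)
            Ix = max(I, key=lambda img: dissimilarity(img, R, features))
            R.append(Ix)
            I.remove(Ix)
        return R

For the part 1 topics this procedure would be applied to each sub-result-set before the round-robin merge; for the part 2 topics it is applied to the single result list from the text search.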
3 Experiments, Results and Discussion

We submitted four runs to the task organisers. The run parameters were configured by controlling whether the Porter Stemmer was applied in the text indexing and retrieval stage, and whether or not the visual re-ranking was applied. The run titles and their configurations can be seen in Table 1.

    Diversity Re-ranking      Terrier Configuration
    Configuration             Porter Stemmer           No Stemming
    None                      SOTON1 T CT TXT          SOTON2 T CT TXT
    Visual Features           SOTON1 T CT TXT IMG      SOTON2 T CT TXT IMG

Table 1: Matrix showing the configurations of our submitted experimental runs.

3.1 Preliminary Results Analysis

Table 2 shows a summary of Southampton's results calculated over all the topics from both parts 1 and 2. All of the results are significantly above the mean scores calculated by averaging the results of all runs submitted by all participants. However, the MAP scores in particular are still significantly below the maximum achieved by other participants.

The results show that the application of the visual re-ranking algorithm causes a drop in precision, which is, however, traded off against an increase in cluster recall, implying higher diversity in the ranked results. Turning off the Porter stemmer gives a slight boost in precision. We hypothesise that turning off the stemmer has this effect because a number of the queries contained names of entities which are adversely affected by the stemmer, producing many irrelevant images in the result set. As an example, consider the topic which contained the term “fortis”. Using the Porter stemmer, this is indexed as “forti”. Unfortunately, the number “forty” is also indexed as “forti”, and so searching for “fortis” will include many irrelevant results from images that had nothing to do with “fortis”, but did have “forty” in their captions.

Results from only the part 1 topics are shown in Table 3. The table shows that the run entitled ‘SOTON2 T CT TXT’ achieved the highest F-measure (which measures the combined precision and cluster recall after 10 retrieved documents) and P@10 scores of all the submitted results. Again, the effect of the visual re-ranking is to increase cluster recall but decrease precision. Turning off the stemmer gives a slight precision boost (in the first few retrieved documents) and also boosts cluster recall slightly.

In the part 2 topic results shown in Table 4, the trend of increased cluster recall with the addition of the visual re-ranking holds true. The same is true of the stemming: disabling stemming increases precision slightly. One difference from the part 1 results, however, is that the F-measure statistic reports slightly better results with stemming enabled.

    Run Name               S   V   MAP     F-measure   P@10    P@20    CR@10   CR@20
    Mean                           0.294   0.585       0.655   0.455   0.547   0.623
    Median                         0.330   0.629       0.754   0.506   0.557   0.641
    Min                            0.003   0.095       0.068   0.019   0.158   0.206
    Max                            0.506   0.809       0.848   0.691   0.824   0.862
    SOTON1 T CT TXT        y   n   0.372   0.720       0.802   0.648   0.653   0.718
    SOTON1 T CT TXT IMG    y   y   0.332   0.716       0.720   0.567   0.711   0.770
    SOTON2 T CT TXT        n   n   0.379   0.729       0.824   0.649   0.654   0.711
    SOTON2 T CT TXT IMG    n   y   0.339   0.727       0.746   0.566   0.745   0.767

Table 2: Summary of results over all part 1 and part 2 queries (S: Porter stemming enabled; V: visual re-ranking applied). The mean, median, min and max are calculated over all the submitted results from each of the participants.

    Run Name               S   V   MAP     F-measure   P@10    P@20    CR@10   CR@20
    Mean                           0.297   0.600       0.677   0.448   0.558   0.640
    Median                         0.347   0.639       0.768   0.533   0.574   0.646
    Min                            0.001   0.033       0.028   0.010   0.041   0.072
    Max                            0.513   0.818       0.868   0.658   0.829   0.877
    SOTON1 T CT TXT        y   n   0.361   0.784       0.824   0.603   0.747   0.810
    SOTON1 T CT TXT IMG    y   y   0.322   0.776       0.760   0.535   0.793   0.832
    SOTON2 T CT TXT        n   n   0.371   0.818       0.868   0.601   0.773   0.820
    SOTON2 T CT TXT IMG    n   y   0.333   0.805       0.804   0.528   0.806   0.842

Table 3: Summary of results for all part 1 queries.

    Run Name               S   V   MAP     F-measure   P@10    P@20    CR@10   CR@20
    Mean                           0.288   0.569       0.632   0.460   0.542   0.614
    Median                         0.307   0.639       0.740   0.496   0.574   0.642
    Min                            0.001   0.102       0.072   0.017   0.096   0.115
    Max                            0.530   0.819       0.836   0.724   0.819   0.876
    SOTON1 T CT TXT        y   n   0.383   0.652       0.780   0.693   0.560   0.626
    SOTON1 T CT TXT IMG    y   y   0.342   0.653       0.680   0.600   0.629   0.708
    SOTON2 T CT TXT        n   n   0.387   0.635       0.780   0.696   0.594   0.602
    SOTON2 T CT TXT IMG    n   y   0.345   0.648       0.688   0.604   0.613   0.692

Table 4: Summary of results for all part 2 queries.
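The F-measure reported in Tables 2–4 combines precision and cluster recall at 10 retrieved documents. The reported values are consistent with the usual harmonic mean of P@10 and CR@10, which the small Python sketch below assumes (this is our assumption rather than a statement about the official evaluation script).

    def f_measure(p_at_10, cr_at_10):
        """Harmonic mean of precision and cluster recall at 10 retrieved documents (assumed definition)."""
        if p_at_10 + cr_at_10 == 0:
            return 0.0
        return 2 * p_at_10 * cr_at_10 / (p_at_10 + cr_at_10)

    # Example: f_measure(0.824, 0.654) ≈ 0.729, matching the SOTON2 T CT TXT row of Table 2.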
4 Discussion, Conclusions and Future Possibilities

The multiple sub-query and merge approach for the part 1 topics clearly works very well. The approach taken for the part 2 topics suffers from a lack of precision in the result sets retrieved by the text retrieval.

The visual re-ranking algorithm described in Section 2.2.2 has been shown to work as planned through the increased cluster recall scores it is able to produce; however, at the same time it causes a drop in F-measure because precision at the top end of the result list also drops. One possible remedy to this problem would be to improve the precision of the baseline text retrieval system so that fewer irrelevant images get passed into the re-ranking algorithm. Another possible approach would be to incorporate the retrieval score of the text-retrieval phase into the re-ranking so that images that were predicted to be more relevant still appear higher in the final result list (a sketch of this idea is given at the end of this section).

Turning off the Porter stemmer gives a small boost in performance. We need to do some more analysis of the results; however, we hypothesise that the reasons are attributable to the problem of stemming named entities, as described earlier. A future modification to the textual indexing and query processors might be to incorporate natural language processing (NLP) techniques to automatically detect named entities and not stem them, whilst still using stemming for other words.

In the experiments described in this paper we only used a single form of visual feature. It would be interesting to repeat the experiments in the future using a broader spectrum of visual features, and also to look at combining various features. There are also possibilities for using the sample images that were provided as part of the topic specification to help drive the search using both content-based techniques, and perhaps query expansion or automatic relevance feedback using information in the captions belonging to those images.
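As a purely hypothetical illustration of the second remedy mentioned above (we did not implement this), the next image to re-rank could be chosen by trading off textual relevance against the visual dissimilarity of Equation 1. The weighting alpha, the normalisation over the candidate set, and all names below are our own assumptions; dissimilarity() refers to the helper defined in the sketch following Section 2.2.2.

    def pick_next(candidates, R, text_scores, features, alpha=0.5):
        """Choose the next image by balancing text-retrieval score and visual dissimilarity.

        candidates:  images from the text search that have not yet been re-ranked
        text_scores: image identifier -> text-retrieval score
        features:    image identifier -> visual-term feature vector
        alpha:       weight on textual relevance (alpha = 1 recovers the text ranking,
                     alpha = 0 recovers Algorithm 1)
        """
        # normalise both quantities over the current candidate set so they are comparable
        diss = {c: dissimilarity(c, R, features) for c in candidates}
        max_d = max(diss.values()) or 1.0
        max_t = max(text_scores[c] for c in candidates) or 1.0
        return max(candidates,
                   key=lambda c: alpha * text_scores[c] / max_t
                                 + (1 - alpha) * diss[c] / max_d)

In such a scheme, pick_next would replace the "Find I_x such that D(I_x, R) is maximised" step of Algorithm 1.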
Acknowledgements

The authors wish to thank the European Union, which supported this work under the Seventh Framework project LivingKnowledge (IST-FP7-231126), and the LiveMemories project, graciously funded by the Autonomous Province of Trento (Italy).

References

[1] Gertjan J. Burghouts and Jan-Mark Geusebroek. Performance evaluation of local colour invariants. Computer Vision and Image Understanding, 113(1):48–62, 2009.

[2] Jonathon S. Hare and Paul H. Lewis. On image retrieval using salient regions with vector-spaces and latent semantics. In Wee Kheng Leow, Michael S. Lew, Tat-Seng Chua, Wei-Ying Ma, Lekha Chaisorn, and Erwin M. Bakker, editors, CIVR, volume 3568 of LNCS, pages 540–549. Springer, 2005.

[3] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21, 1972.

[4] David Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, January 2004.

[5] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Paul L. Rosin and A. David Marshall, editors, BMVC. British Machine Vision Association, 2002.

[6] David Nister and Henrik Stewenius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.

[7] Stefanie Nowak and Peter Dunker. Overview of the CLEF 2009 Large-Scale Visual Concept Detection and Annotation Task. In CLEF working notes 2009, Corfu, Greece, 2009.

[8] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in Terrier. Novatica/UPGRADE Special Issue on Web Information Access, Ricardo Baeza-Yates et al. (Eds), Invited Paper, 2007.

[9] Monica Paramita, Mark Sanderson, and Paul Clough. Diversity in photo retrieval: overview of the ImageCLEFPhoto task 2009. In CLEF working notes 2009, Corfu, Greece, 2009.

[10] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[11] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, October 2003.