MRIM-LIG at ImageCLEF 2009: Photo Retrieval
and Photo Annotation tasks
Philippe Mulhem, Jean-Pierre Chevallet, Georges Quénot, Rami Al Batal
Grenoble University, LIG
Philippe.Mulhem@imag.fr, Jean-Pierre.Chevallet@imag.fr
Georges.Quenot@imag.fr, Rami.Albatal@imag.fr
Abstract
This paper describes the different experiments conducted by the MRIM
group at the LIG in Grenoble for the ImageCLEF 2009 campaign. The group partic-
ipated in the following tasks: Photo Retrieval and Photo Annotation. For the Photo
Retrieval task, we submitted runs with both text and image features, and a diversifica-
tion process was applied. For the Photo Annotation task, we used several features and
classifiers to generate keyword annotations. For both tasks, the results obtained are
above the average of the participants.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software
General Terms
Image Retrieval, Image Indexing
Keywords
diversity algorithm, early fusion, late fusion, SVM
1 Introduction
We describe here the different experiments conducted by the MRIM group at the
LIG in Grenoble for the ImageCLEF 2009 campaign. This paper describes our participation
in the following tasks: Photo Retrieval and Photo Annotation. For the Photo Retrieval track, we
submitted runs with both text and image features, and a diversification process was applied;
for the Photo Annotation task, we used several features and classifiers to generate keyword
annotations. We also participated in the Robot Vision track, where we used a language model to
represent the image content; that work is described elsewhere in the CLEF working notes.
This paper is organized as follows. In Section 2 we present our approach and results for the Photo
Retrieval track, and in Section 3 we present our work for the Photo Annotation track. We conclude
in Section 4.
2 Photo Retrieval Task
2.1 Managing Diversity
This year again, the photo retrieval task focused on managing the diversity of the results [5].
The CLEF 2009 queries include clusters that can be used both to add and to measure diversity in
the retrieved documents. For example, the first query is only one term long: "leterme", which is
a proper noun.
Query title: leterme
Cluster 1 (yves leterme): Relevant images contain photographs of Yves Leterme. Images of
Leterme with other people are relevant if Leterme is shown in the foreground. Images of
Leterme in the background are irrelevant. Example image: belga28/05980958.jpg
Cluster 2 (leterme albert): Relevant images contain photographs of Yves Leterme and King
Albert II. Images with only one of them are considered to be irrelevant. Example image:
belga27/05960161.jpg
Cluster 3 (leterme -albert): Images which contain photographs of Leterme and which are not
part of the above categories are relevant to this cluster. Example image: belga32/06229323.jpg
Three clusters are proposed for this query: "yves leterme", "leterme albert" and "leterme -albert".
These clusters are supposed to be the different possible classes of answers. The goal is then to
retrieve a list of documents that maximizes diversity, measured using these thematic clusters.
One can also note that some terms may be negated.
In the experiment we propose here, we have used the clusters as sub-queries (which we call
"cluster queries") to categorize the answers of each main query. In this example, "leterme" is the
main query, but we also run the 3 cluster queries. We consider only the output of the main query,
which is diversified according to the cluster queries (or sub-queries). The diversification consists
in reordering the main query output list using the results of the cluster queries. In fact, we
consider that the document lists returned by the cluster queries classify the documents against
each cluster.
In practice, each cluster query is passed through the IRS (see Figure 1). The results of these
cluster queries are not fused with the main query; they are used to categorize the documents
returned by the main query. For each document answering the main query, we identify the
clusters in which this document is an answer (box "Add cluster" in Fig. 1). When several clusters
are possible, we decided to select the one with the lowest RSV value for this document; we could
also have kept all possible clusters. A special cluster (id=0) is created for documents not found
in any cluster. This adds, to each document of the original main answer list, a cluster identifier
that is used by the diversification algorithm described in the next section.

Figure 1: Diversification with cluster framework
2.2 Diversification Algorithm
We propose an algorithm to diversify a list of documents associated with cluster information (a
cluster id). We make the hypothesis that each document answering a query is associated with
a limited number of clusters. The shuffle algorithm uses only this information to bring more
diversity to the result. The idea is to find a balance between the diversity of the clusters among
the answers and the order of the documents by relevance status value (RSV). Hence, the shuffle
algorithm should not change the original order "too much".
To implement this notion, we propose to set an absolute limit on a document move: the value
r is the maximum number of ranks a document can jump to get closer to the top. We also
propose a damping parameter α which acts as a memory of the clusters previously seen. Our
shuffle algorithm implements the maximum rank constraint r by maintaining a queue of r
documents. The algorithm works in two stages (see Fig. 2): the best document is selected
from the queue, starting from the top and according to the least recently seen cluster; this is the
first document from the top of the queue with the smallest "cluster seen" value. This document
is then extracted from the queue and appended to the final ordered document list. The cluster of
the selected document is recorded in the seen-cluster structure, and the next document is pushed
into the queue. The algorithm can be formalized as:
for all d in originalDocumentList do
    push(d, queue)
    if size(queue) < r then
        continue {first fill the queue with r documents}
    end if
    ds ← topOf(queue)
    for all dq in queue do
        if clusterSeen(cluster(dq)) < clusterSeen(cluster(ds)) then
            ds ← dq
        end if
    end for
    remove(ds, queue)
    updateSeenCluster(cluster(ds))
    push(ds, newDocumentList)
end for
This algorithm has an extra step that processes all documents remaining in the queue once the
original list has been fully consumed. The "clusterSeen" structure associates with each cluster c
a "novelty" value nc between 0 and 1: the larger this value, the more recently the cluster has
been seen.
This new novelty value n′c value is updated by the updateSeenCluster() function using:
′ (nc + 1) ∗ α if pushed document belongs to cluster c
nc = (1)
nc ∗ α otherwise
The value α in ]0; 1[ acts as a damping factor that slowly reduces the cluster novelty nc when
the cluster is not seen. A high value of α (close to 1) slows down the damping, while a low value
(close to zero) produces a sharp release and tends to force the cluster to be used again more
quickly. Each cluster could have its own damping factor αc; for example, the value could be set
according to the cluster size. We decided to use the same value for all clusters because we do
not have enough information to differentiate them.

Figure 2: Diversification Algorithm
2.3 Adding diversity clusters
Only half of the queries come with identified clusters; the remaining CLEF queries consist of
a simple query only. We propose to automatically build cluster queries so that the same
diversification technique can be applied to all queries. If we want to diversify answers using
textual queries, then only terms belonging to the corpus are useful. An effective technique
therefore seems to be a diversification using terms that appear within the documents retrieved
by the original query. Several techniques could perform this term-based clustering:
• Document clustering: a classical technique to structure a corpus. Clustering is usually
performed on document vectors, in the term vector space. It can be computed at indexing
time on the whole corpus, or at query time on the retrieved documents only. The clusters
can then be used for diversification.
• Term clustering: the reverse of document clustering. The clustering is performed on terms,
in the document vector space. A similar technique is based on term collocations or
co-occurrences: collocation refers to a short term window in which two terms appear,
whereas co-occurrence is usually computed on the whole document content and is
equivalent to term clustering using document vectors.
It is difficult to predict which clustering technique is best for producing diversity. We decided
to experiment with the automatic generation of cluster sub-queries, so as to obtain a uniform
query description consistent with the first half of the queries. That is the reason why we adopted
term clustering instead of document clustering.
Finally, we did not compute real term clusters, but only term neighborhoods, because we wanted
to automatically produce cluster sub-queries with a syntax similar to those provided manually.
Hence, the k (k=100) nearest neighbors of each term were computed in the document space,
using raw frequency weighting. Figure 3 shows an example: the neighborhood of the term
"madonna" computed using the Euclidean distance.

Figure 3: Building of text query clusters

The neighbors of a term, computed on its document distribution, are terms that appear in a
similar set of documents. We make the assumption that these terms are interesting for producing
diversity. For example, we automatically built 10 cluster queries by pairing "euro" with each of
its 10 closest terms, plus an 11th cluster query that negates all of them:
Query 36: euro
    36:1   euro currenc
    36:2   euro banknot
    36:3   euro coin
    36:4   euro starter
    36:5   euro ocotob
    36:6   euro tender
    36:7   euro switchov
    36:8   euro baht
    36:9   euro changeov
    36:10  euro turnov
    36:11  euro -currenc -banknot -coin -starter -ocotob -tender -switchov -baht -changeov -turnov
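A minimal sketch of this neighborhood computation and of the cluster query construction (the
data structures are our assumptions):

import numpy as np

def term_neighbors(term, term_doc_freq, k=10):
    # term_doc_freq: dict term -> numpy array of per-document raw frequencies
    target = term_doc_freq[term]
    dist = {t: np.linalg.norm(v - target)         # Euclidean distance in document space
            for t, v in term_doc_freq.items() if t != term}
    return sorted(dist, key=dist.get)[:k]         # k closest terms

def build_cluster_queries(term, neighbors):
    # one cluster query per neighbor, plus one negating them all (as in query 36)
    queries = [f"{term} {n}" for n in neighbors]
    queries.append(term + " " + " ".join("-" + n for n in neighbors))
    return queries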
2.4 Visual queries
In this CLEF track, images are only associated with the sub-queries that represent clusters. For
image queries, we used standard color histograms (512 bins, RGB color space) and computed
the RSV using the Jeffrey divergence, similarly to what was described in [3]. This produces, for
each query, a total order on the collection. We found this computation too weak semantically to
produce accurate results, and therefore decided to filter these results using the output of the main
text queries: only images that also answer the textual queries are kept. This is a strong filter,
but according to previous work such as [7], this process leads to good results.
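As an illustration, a minimal sketch of this visual RSV, assuming the definition of the Jeffrey
divergence commonly used for histogram comparison in image retrieval (lower means more
similar):

import numpy as np

def jeffrey_divergence(h1, h2, eps=1e-10):
    # h1, h2: 512-bin RGB histograms; both are normalized before comparison
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    m = (h1 + h2) / 2.0
    return float(np.sum(h1 * np.log((h1 + eps) / (m + eps))
                        + h2 * np.log((h2 + eps) / (m + eps))))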
Hence, the visual part is used in two different ways: first as the main result of the query, with a
text filter, and second to build the cluster information used to diversify the output of the main
textual query. We explain these two approaches in the following.
2.5 Visual and textual diversity
Given a query, we have two types of cluster information: the textual sub-queries, which form
textual query clusters, and the visual sub-queries, which form visual query clusters. We consider
that these two types of cluster information have equivalent importance for the diversification,
and both cluster ids are added to each answer of the main textual query.

Figure 4: Textual and visual clusters used to diversify the main query

An adapted version of the diversification algorithm is then used ("Add Cluster" in Figure 4) that
accepts a document associated with more than one cluster: one from the text and one from the
image.
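The text above does not fully specify how the two cluster ids are combined inside the algorithm;
one plausible adaptation (our assumption, not a description of the submitted system) scores a
document by the summed novelty of all its clusters, and reinforces all of them when the document
is selected:

def seen_score(doc, clusters_of, novelty):
    # clusters_of: dict doc_id -> set of cluster ids (one textual, one visual)
    return sum(novelty.get(c, 0.0) for c in clusters_of[doc])

def update_novelty_multi(selected_clusters, novelty, alpha=0.5):
    # equation (1), applied to every cluster of the selected document (our choice)
    for c in set(novelty) | set(selected_clusters):
        base = novelty.get(c, 0.0)
        novelty[c] = (base + 1.0) * alpha if c in selected_clusters else base * alpha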
2.6 Submitted runs and official results
We submitted four runs:
LIG1 T TXT : This run uses only the title of the textual queries. It is a basic run using Divergence
from Randomness weighting (see 2.7 for indexing details), without any diversification
processing. Based on a precision at 10 documents of 0.7780 and a cluster recall at 10
documents of 0.5416, this run gets an F-measure of 0.6386 and ranks 39th out of 84
submitted runs: this basic indexing already produces a good diversification of the top
10 retrieved documents. The MAP (Mean Average Precision) is quite high (0.5063), but
MAP is not strongly correlated with the cluster recall and precision at 10 documents:
among the five best results of this year, one has a MAP of only 0.0747 but an F-measure
of 0.7580. This run is considered as the baseline for our submissions. We may however
notice that, among the runs using only the title of the queries, our run ranks fourth out
of 14: with no special treatment for diversification, using only the query title already
gives decent results. This run comes fourth among our 4 runs, which shows that the
cluster information, the diversification algorithm and the visual aspect have a positive
impact on diversification at 10 documents.
LIG2 T-CT TXT : This run is a diversification of the previous one using both the provided textual
clusters and the ones we generated. The size of the queue in the diversification algorithm
is set to 20 and the damping factor to 0.5. This run obtained a MAP (0.5064) similar to
the previous one, which is expected since only the order changes, within a maximum
window of 20 documents. There is no change in cluster recall at 10 (0.5416), and a
non-significant increase in cluster precision at 10 (0.7800). We conclude that our
diversification algorithm does not have a significant impact on the first 10 retrieved
documents; likewise, our proposal to create cluster queries by adding the top 10 similar
terms from document vectors has no impact on the results. Further studies have to be
conducted to find better parameter values for the diversification algorithm.
LIG3 T-I TXT-IMG : This is a visual run filtered by the textual output of the IRS (run 1). We
anticipated that a visual run using only color histograms would not be meaningful enough,
which is why we filter the visual output with the result of the textual queries. A visual
query is a set of cluster images, and its output is the concatenation of the answers of each
sub-query cluster. Because of this raw concatenation of cluster results, with a maximum
of 1000 answers, we had to perform the diversification with a much larger queue of 1000
(damping factor of 0.5). We still have to investigate whether this larger queue value
explains the good result on top-10 cluster diversification. This is our best run, with a
global rank of 20 (out of 84) and an F-measure of 0.6975. There is a decrease in MAP
(0.4337): first, intersecting the first 1000 image answers with the textual answers reduces
the number of retrieved documents (from 19066 for runs 1 and 2 to 18375); second, the
precision at 1000 is also reduced (from 0.3813 to 0.3750). On the first 25 queries our
approach is ranked 39th; on the second part of the query set it is ranked 6th, and 4th
when considering only the query title. We have no clear explanation for this difference
between the first and second query sets. We conclude that filtering a simple image
matching with text, combined with our diversification algorithm and a very large queue,
is our best choice. We still have to investigate the real influence of the diversification
algorithm and the mutual influence of text and images.
LIG4 T-CT-I TXT-IMG : The textual queries are diversified using both the given and the
constructed textual sub-query clusters, and also the visual sub-query clusters. This run
builds on textual run 2 with the generated textual clusters, and thus considers both text
and image results in a late fusion scheme. For the whole set of queries, this run ranks
32nd. We may notice that, even if the mean average precision (MAP) is not the official
measure of this competition, this run obtained a MAP of 0.5055, among the highest of
our runs; this approach seems to favor average precision over early precision. In
conclusion, building textual clusters using a term/term distance is a good solution to
increase the MAP (from 0.4524 to 0.5221 on the second 25 queries), but it is not effective
for cluster recall at top ranks (from 0.6365 to 0.5696 at 10 documents), even if at 1000
documents it is slightly better than the previous run (from 0.9477 to 0.9677).
2.7 Technical details and problems
The textual search engine used for this task is our system XIOTA [2]. The weighting scheme
is Divergence from Randomness [1]. Standard stop-word removal and Porter stemming were
applied. When using the textual search engine, we always limited the size of the answer to
1000 documents.
The term expansion is computed on stemmed terms, which creates an inconsistency: stemmed
terms are added to the non-stemmed terms of the original query, and hence the expanded terms
are stemmed twice. This may cause a recall problem because these terms are not consistent with
the rest of the collection.
Term negation is handled using a negative weight in the query vector.
Because of the new ordering, the original RSV values are no longer meaningful and are replaced
by fake values.
In run 3, we diversified the raw output of the visual IRS. The problem is that this output is sorted
by cluster first and by RSV second, which implies that the diversification algorithm needs a
queue larger than the largest cluster output (1000 documents in our case). A better solution
would be to first sort each visual query answer by RSV only, before performing the diversification
on clusters. We also had a last-minute problem: keeping the initial raw output of the visual
system allows an answer to be duplicated across cluster answers. This can also be solved by
sorting the visual result by RSV and keeping only the lowest RSV value when the same image
appears in several cluster answers.
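A sketch of this fix (our naming), keeping for each image only its best, i.e. lowest, divergence
value:

def dedup_by_best_rsv(visual_answers):
    # visual_answers: list of (doc_id, rsv) pairs, possibly containing duplicates
    best = {}
    for doc, rsv in visual_answers:
        if doc not in best or rsv < best[doc]:
            best[doc] = rsv                      # keep the lowest divergence per image
    return sorted(best.items(), key=lambda x: x[1])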
3 Image Annotation track
The Image Annotation track focuses on the evaluation of state-of-the-art automatic image
annotation approaches. The image set is composed of 5,000 manually annotated images as the
learning set, and 13,000 images as the test set. There are 53 annotation concepts, organized in a
(simple) hierarchy representing, for instance, the four seasons (winter, summer, spring, autumn)
in an exhaustive way. The evaluation uses the Equal Error Rate (EER) measure, as well as a
measure that considers the hierarchy of concepts to compute the quality of the systems [4]. We
now describe our approach, the submitted runs and the results obtained.
3.1 Our approach
Our idea was to test the early fusion of features and the late fusion of concept detectors. We
studied the use of color features (RGB and HSV histograms) and texture features (SIFT points).
For the concept detectors, we used the well-known support vector machine approach, in a "one
against all" (OAA) framework (all positive versus all negative samples). Because of the small
number of positive (resp. negative) samples compared to negative (resp. positive) ones, we also
used a more complex approach. Consider one concept C with pc positive samples and nc negative
samples (nc = 5000 − pc ≫ pc). We define Nc SVMs, each trained on all the positive samples and
2·pc negative samples, such that the union of the negative samples over all the SVMs covers the
nc negative samples of C. Each of these SVMs learns the probability of belonging to the
concept/non-concept classes. For one concept, we then sum up the results of the Nc SVMs. We
call this framework MOAA (Multiple One Against All); it is strongly inspired by the Inverse
Random Under Sampling approach of Tahir, Kittler, Mikolajczyk and Yan [6]. We also chose to
fuse all the obtained results (late fusion) by selecting the best detector for each concept.
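A sketch of the MOAA training, under our assumptions about the data layout and using
scikit-learn probabilistic SVMs (this is a reconstruction inspired by [6], not our actual training
code):

import numpy as np
from sklearn.svm import SVC

def train_moaa(X_pos, X_neg, seed=0):
    # Split the negatives into N_c subsets of size about 2*p_c whose union
    # covers all negatives, and train one probabilistic RBF SVM per subset.
    rng = np.random.default_rng(seed)
    p_c = len(X_pos)
    neg_idx = rng.permutation(len(X_neg))
    n_subsets = max(1, int(np.ceil(len(X_neg) / (2 * p_c))))
    models = []
    for sub in np.array_split(neg_idx, n_subsets):
        X = np.vstack([X_pos, X_neg[sub]])
        y = np.array([1] * p_c + [0] * len(sub))
        models.append(SVC(kernel="rbf", probability=True).fit(X, y))
    return models

def score_moaa(models, X):
    # Sum of the per-SVM probabilities of the positive (concept) class.
    return sum(m.predict_proba(X)[:, 1] for m in models)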
As a post-processing step, we applied a linear scaling to all our runs so as to fit the a priori
probabilities of the learning set. We then took the concept hierarchy into account in the following
way: a) when conflicts occur (for instance the tags Day and Night are both associated with one
image of the test set), we keep the tag with the largest value unchanged and decrease (linearly)
the values of all the other conflicting tags; b) we propagate the concept values bottom-up when
the value of the generic concept would be increased; otherwise we do not update the pre-existing
values.
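A minimal sketch of these two steps, assuming scores in [0, 1], an explicit list of mutually
exclusive concept groups, and a child-to-parent map (all three structures, and the linear decrease
factor, are our assumptions):

def resolve_conflicts(scores, exclusive_groups, shrink=0.5):
    for group in exclusive_groups:                 # e.g. ("Day", "Night")
        winner = max(group, key=lambda c: scores.get(c, 0.0))
        for c in group:
            if c != winner:
                scores[c] = scores.get(c, 0.0) * shrink  # linear decrease of the losers
    return scores

def propagate_up(scores, parent_of):
    # assumes children are iterated before their parents (e.g. a topological order)
    for child, parent in parent_of.items():
        if scores.get(child, 0.0) > scores.get(parent, 0.0):
            scores[parent] = scores[child]         # only increase the generic concept
    return scores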
3.2 Submitted runs
In the LIG part of the AVEIR submitted runs, we used three sets of features:
• The first one, namely RGB, is only based on color vectors represented in the RGB color space.
We split each image into three horizontal stripes (each one third of the image height, spanning
the whole image width). For each stripe we extract a simple RGB histogram (uniform
sub-sampling) of 512 bins (= 8 × 8 × 8) and normalize it. We then concatenate these three
histograms into a 1536-bin histogram (a sketch of this extraction is given after this list).
• The second set of features are SIFT (Scale Invariant Feature Transform) descriptors, extracted
using the software provided by K. van de Sande [8]. SIFT features were extracted from regions
selected by Harris-Laplace feature point detection; each feature is a 128-dimensional vector. A
visual vocabulary of 4000 words was then generated from the SIFT features of the learning set,
leading to a 4000-dimensional vector for each image.
• The third set of features, called RGBGAB, comes from an early fusion of color and texture
features. We used a 64-dimensional RGB color histogram concatenated with a 40-dimensional
vector describing Gabor filter energies (8 orientations × 5 scales), resulting in a 104-dimensional
vector.
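A sketch of the RGB feature extraction (our reconstruction of the description above):

import numpy as np

def rgb_stripe_histogram(image):
    # image: H x W x 3 uint8 array in RGB order
    h = image.shape[0]
    hists = []
    for i in range(3):                             # three horizontal stripes
        stripe = image[i * h // 3:(i + 1) * h // 3]
        levels = (stripe.astype(np.int64) // 32).reshape(-1, 3)  # 8 levels per channel
        idx = levels[:, 0] * 64 + levels[:, 1] * 8 + levels[:, 2]
        hist = np.bincount(idx, minlength=512).astype(float)     # 512-bin histogram
        hists.append(hist / hist.sum())            # normalize each stripe histogram
    return np.concatenate(hists)                   # 1536-dimensional vector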
The learning of the concepts was based on SVMs (Support Vector Machines), which have proved
their strength in the domains of learning and image annotation [ref]. For the RGB, SIFT and
RGBGAB features, we used the "one against all" approach with a Radial Basis Function (RBF)
SVM kernel; the SVM learns the probability of one sample belonging to each concept. For the
SIFT and RGBGAB features, we additionally used the MOAA SVM learning process described
previously. Using the rescaling and the hierarchy propagation, the results (average equal error
rate) achieved on the learning set are presented in Table 1. These results show that, on the
learning set, SIFT MOAA (Id=3) and RGBGAB OAA (Id=4) give the best results. The fusion
selecting the best detector for each concept leads to an EER of 0.256.
Id  Features  Learning framework  Average EER
1   RGB       OAA                 0.480
2   SIFT      OAA                 0.372
3   SIFT      MOAA                0.340
4   RGBGAB    OAA                 0.339
5   RGBGAB    MOAA                0.371
6   all       selection best      0.256

Table 1: EER results for different features and learning frameworks
3.3 Official results obtained
Table 2 presents the official results that we obtained, together with the average results over all
the runs submitted by the participants. The four runs officially submitted correspond to
SIFT OAA (run id. 50 2 1245573948330.txt), SIFT MOAA (run id. 50 2 1245574356224.txt),
RGBGAB OAA (run id. 50 2 1245574874836.txt) and Fusion (cf. id 6 in Table 1, run id.
0 2 1245578233616.txt). As expected, the EER values of the submitted runs on the test set are
greater than on the learning set, and the runs do not rank in the same order as on the learning
set with respect to EER:
• the late fusion of the runs still provides our best EER (11th site out of 18, EER of 0.384 where
the participants' average is 0.390), but it is very close to the SIFT MOAA run;
• the RGBGAB OAA run achieves an EER almost similar to the SIFT OAA run.
If we focus now on the evaluation of the recognition based on the concept hierarchy (not
available on the learning set), then RGBGAB OAA achieves our best result (recognition rate of
0.741, position 27 out of 74 submitted runs, 10th site out of 18), ahead of SIFT MOAA, SIFT OAA
and finally the Fusion run. It seems that our fusion run gives good results, but when an annotation
concept is wrong it is not close (in the hierarchy) to a relevant concept. Conversely, RGBGAB
OAA provides less accurate concepts, but when a concept is wrong it is more often close in the
hierarchy to a relevant one, compared to the fusion run.
The annotation results are given in Table 2.

official id       Features  Learning framework  Average EER (/74)  Hierarchical recognition (/74)
                                                (avg. 0.390)       (avg. 0.619)
50 2 ...8330.txt  SIFT      OAA                 0.443 (42)         0.680 (36)
50 2 ...6224.txt  SIFT      MOAA                0.392 (36)         0.714 (34)
50 2 ...4836.txt  RGBGAB    OAA                 0.440 (40)         0.741 (27)
50 2 ...3616.txt  all       selection best      0.384 (34)         0.591 (45)

Table 2: EER and hierarchical results for different features and learning frameworks
4 Conclusion
We conclude here on the two tracks in which the MRIM research group from the LIG submitted
runs.
For the Photo Retrieval task, we studied the use of a new diversification algorithm and the
automatic building of sub-queries using a term distance for diversification. The results obtained
with
our diversification algorithm are not conclusive enough, and we still have to study the behavior
of this algorithm and its effective influence on diversification. Likewise, the automatic building
of sub-queries using a distance among terms has no clear positive effect; again, we have to
further experiment with and without this extension. Our best run (rank 20) integrates textual
and visual matching: we process the image queries and keep the images that are relevant
according to the text queries. We used Divergence from Randomness weighting for the text. We
obtain overall very good results according to the MAP, but average results using the top-10
cluster ranking. We underline that the proposed evaluation measure is based only on the first 10
documents, and is largely uncorrelated with the MAP (the usual evaluation measure for
information retrieval systems), with a correlation value of 0.45. What happens in fact is that:
a) a low MAP value does not imply a low F-measure for diversity, and b) a medium or high MAP
value (for instance higher than 0.3) has good chances to come with a medium to high F-measure.
For instance, one of the top 5 runs has an F-measure of 0.76 and a MAP of 0.07, which means
that this system is not a good information retrieval system according to a standard information
retrieval evaluation measure, but has a top evaluation for this task. Considering the task from
a news context, we consider that using only the first 10 documents of the results for such an
evaluation is probably too constrained to reflect the real behavior of a system.
For the Image Annotation track, we studied different learning approaches. Early fusion used
the RGB color space and Gabor features; we also extracted SIFT features and used simple
one-against-all and multiple one-against-all learning techniques. A late fusion was defined as
the selection of the best learning configuration for each concept to be detected.
The results we obtained show that fusing several approaches (early and late fusion, simple or
multiple one-against-all learning) leads to better EER results (our best rank is 34 out of 74 runs).
For all our runs we used the concept hierarchy to avoid conflicts. However, our best result
according to the hierarchical recognition rate was obtained using early fusion and the simple
one-against-all learning technique (rank 27 out of 74). The concept hierarchy was quite simple,
but it allowed us to study in a real case how to manage such data. In the future, we will study
further the behaviour of the different learning approaches in order to increase the quality of our
results, and we will also explore different uses of the hierarchy during the learning and
recognition steps.
Acknowledgment
This work was partly supported by: a) the French National Agency of Research (ANR-06-MDCA-
002), b) the Quaero Programme, funded by OSEO, the French State agency for innovation, and
c) the Région Rhône-Alpes (LIMA project).
References
[1] Gianni Amati and Cornelis Joost van Rijsbergen. Probabilistic models of information retrieval
based on measuring the divergence from randomness. ACM Transactions on Information Sys-
tems, 20(4):357–389, October 2002.
[2] Jean-Pierre Chevallet. X-IOTA: an open XML framework for IR experimentation. Lecture Notes
in Computer Science (LNCS), AIRS'04 Conference, Beijing, 3211:263–280, 2004.
[3] Loic Maisonnasse, Philippe Mulhem, Eric Gaussier, and Jean-Pierre Chevallet. LIG at Image-
CLEF 2008. In Evaluating Systems for Multilingual and Multimodal Information Access – 9th
Workshop of the Cross-Language Evaluation Forum, 2008.
[4] Stefanie Nowak and Peter Dunker. Overview of the CLEF 2009 large scale visual concept
detection and annotation task. In CLEF working notes 2009, Corfu, Greece, September 2009.
[5] Monica Paramita, Mark Sanderson, and Paul Clough. Diversity in photo retrieval: overview of
the ImageCLEFPhoto task 2009. In CLEF working notes 2009, Corfu, Greece, September 2009.
[6] Muhammad Atif Tahir, Josef Kittler, Krystian Mikolajczyk, and Fei Yan. A multiple expert
approach to the class imbalance problem using inverse random under sampling. In Multiple
Classifier Systems, pages 82–91, Reykjavik, Iceland, June 2009.
[7] Sabrina Tollari, Philippe Mulhem, Marin Ferecatu, Hervé Glotin, Marcin Detyniecki, Patrick
Gallinari, Hichem Sahbi, and Zhong-Qiu Zhao. A comparative study of diversity methods
for hybrid text and image retrieval approaches. In Evaluating Systems for Multilingual and
Multimodal Information Access – 9th Workshop of the Cross-Language Evaluation Forum,
2008.
[8] Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Evaluation of color descrip-
tors for object and scene recognition. In IEEE Conference on Computer Vision and Pattern
Recognition, Anchorage, Alaska, USA, June 2008.