             Overview of the ImageCLEF 2007
                  Object Retrieval Task
              Thomas DeselaersRWTH , Allan HanburyPRIP , Ville ViitaniemiHUT ,
            András BenczúrBUDAC , Mátyás BrendelBUDAC , Bálint DaróczyBUDAC ,
Hugo Jair Escalante BalderasINAOE , Theo GeversISLA , Carlos Arturo Hernández GracidasINAOE ,
                Steven C. H. HoiNTU , Jorma LaaksonenHUT , Mingjing LiMSRA ,
         Heidy Marisol Marin CastroINAOE , Hermann NeyRWTH , Xiaoguang RuiMSRA ,
                      Nicu SebeISLA , Julian StöttingerPRIP , Lei WuMSRA

          RWTH
                  Computer Science Department, RWTH Aachen University, Germany
                                  deselaers@cs.rwth-aachen.de
                       PRIP
                            PRIP, Institute of Computer-Aided Automation,
                         Vienna University of Technology, Vienna, Austria
                              {julian,hanbury}@prip.tuwien.ac.at
          BUDAC
                 Data Mining and Web search Research Group, Informatics Laboratory
    Computer and Automation Research Institute of the Hungarian Academy of Sciences,
                                        Budapest, Hungary
                       {mbrendel, daroczyb, benczur}@ilab.sztaki.hu
                    INAOE
                          TIA Research Group, Computer Science Department,
       National Institute of Astrophysics, Optics and Electronics, Tonantzintla, Mexico
                         {hmarinc,hugojair,carloshg}@ccc.inaoep.mx
     ISLA
          Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands
                                 {nicu,gevers}@science.uva.nl
    HUT
        Adaptive Informatics Research Centre, Helsinki University of Technology, Finland
                                   firstname.lastname@tkk.fi
                            MSRA
                                 Microsoft Research Asia, Beijing, China
                                       mjli@microsoft.com
      NTU
           School of Computer Engineering, Nanyang Technological University, Singapore
                                         chhoi@ntu.edu.sg


                                           Abstract
    We describe the object retrieval task of ImageCLEF 2007, give an overview of the
    methods of the participating groups, and present and discuss the results.
        The task used the widely known PASCAL object recognition data to train object
     recognition methods, and the IAPR TC-12 benchmark dataset from which images containing
     objects of ten different classes (bicycles, buses, cars, motorbikes, cats, cows, dogs,
     horses, sheep, and persons) had to be retrieved.
        Seven international groups participated using a wide variety of methods. The results
    of the evaluation show that the task was very challenging and that different methods
    for relevance assessment can have a strong influence on the results of an evaluation.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database
Management]: Languages—Query Languages

General Terms
Measurement, Performance, Experimentation

Keywords
Image retrieval, image classification, performance evaluation


1      Introduction
Object class recognition, automatic image annotation, and object retrieval are strongly related
tasks. In object recognition, the aim is to identify whether a certain object is contained in an
image, in automatic image annotation, the aim is to create a textual description of a given image,
and in object retrieval, images containing certain objects or object classes have to be retrieved out
of a large set of images. Each of these techniques is important to allow for semantic retrieval
from image collections.
    Over the last years, research in these areas has grown strongly, and it has become clear
that performance evaluation is a very important step to foster progress in research. Several
initiatives create benchmark suites and databases to quantitatively compare different methods
tackling the same problem.
    In the last years, evaluation campaigns for object detection [10, 9], content-based image
retrieval [5], and image classification [24] have been developed. There is, however, no task
aiming at finding images showing a particular object in a larger database. Although this task
is very similar to the PASCAL visual object classes challenge [10, 9], it is not the same. In
the PASCAL object recognition challenge, the probability for an object to be contained in an
image is relatively high, and the images used to train and test the methods are from the same
data collection. In realistic scenarios, this is not a suitable assumption. Therefore, in the
object retrieval task described here, we use the training data that was carefully assembled
with much manual work by the PASCAL NoE, and the IAPR TC-12 database, which was created under
completely different circumstances, as the database from which relevant images are to be
retrieved.
    In this paper, we present the results of the object retrieval task that was arranged as part
of the CLEF/ImageCLEF 2007 image retrieval evaluation. This task was conceived as a purely
visual task, making it inherently cross-lingual. Once one has a model for the visual appearance of
a specific object, such as a bicycle, it can be used to find images of bicycles independently of the
language or quality of the annotation of an image.
    ImageCLEF1 [5] started within CLEF2 (Cross Language Evaluation Forum) in 2003. A
medical image retrieval task was added in 2004 to explore domain-specific multilingual
information retrieval and also multi-modal retrieval by combining visual and textual features.
Since 2005, a medical retrieval and a medical image annotation task have both been part of
ImageCLEF. In 2006, a general object recognition task was offered to see whether interest
in this area existed. Although only a few groups participated, many groups expressed their
interest and encouraged us to create an object retrieval task. In 2007, besides the object
retrieval task described here, ImageCLEF comprised a photographic retrieval task also using
the IAPR TC-12 database [14], a medical image retrieval task [26], and a medical automatic
annotation task [26].
    1 http://ir.shef.ac.uk/imageclef/
    2 http://www.clef-campaign.org/
             Figure 1: Example images from the PASCAL VOC 2006 training dataset.


2      Task Description
The task was defined as a visual object retrieval task. Training data was in the form of annotated
example images of ten object classes (PASCAL VOC 2006 data). The task was, after learning
from the provided annotated images, to find all images in the IAPR-TC12 database containing
the learned objects. The particularity of the task is that the training and test images are not from
the same set of images. This makes the task more realistic, but also more challenging.

2.1     Datasets
For this task, two datasets were available. As training data, the organisers of the PASCAL
Network of Excellence visual object classes challenge kindly agreed that we could use the
training data they assembled for their 2006 challenge.

PASCAL VOC 2006 training data The PASCAL Visual Object Classes (VOC) challenge
2006 training data is freely available on the PASCAL web-page3 and consists of approximately
2,600 images; for each image, a detailed annotation specifies which of the ten object classes
are visible in which areas of the image. Example images from this database are shown in
Figure 1 together with the corresponding annotation.

IAPR TC-12 dataset The IAPR TC-12 Benchmark database [15] consists of 20,000 still images
taken from locations around the world and comprising an assorted cross-section of images
which might, for example, be found in a personal photo collection. It includes pictures of
different sports and actions, photographs of people, animals, cities, landscapes and many
other aspects of contemporary life. Some example images are shown in Figure 2. The data is
also richly annotated with textual descriptions of the images and various meta-data; we use
only the image data for this task.

2.2     Object Retrieval Task
The ten queries correspond to the ten classes of the PASCAL VOC 2006 data: bicycles, buses,
cars, motorbikes, cats, cows, dogs, horses, sheep, and persons. For training, only the “train” and
    3 http://www.pascal-network.org/challenges/VOC/
              Figure 2: Example images from the IAPR TC-12 benchmark dataset


“val” sections of the PASCAL VOC database were to be used. For each query, participants were
asked to submit a list of 1000 images obtained by their method from the IAPR-TC12 database,
ranked in the order of best to worst satisfaction of the query.

2.3    Evaluation Measure
To evaluate the retrieval performance we use the same measures used by most retrieval
evaluations, such as the other tasks in CLEF/ImageCLEF [14, 26], TREC4 and TRECVid5 . The
average precision (AP) gives an indication of the retrieval quality for one topic, and the
mean average precision (MAP) provides a single-figure measure of quality across recall levels
averaged over all queries. To calculate these measures, it is of course necessary to judge
which images are relevant for a given query and which are not.
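    For illustration, the following is a minimal sketch of how AP and MAP can be computed from
a ranked result list and a set of relevance judgements (the function names and data layout are
our own, not the official evaluation tooling used in the campaign):

def average_precision(ranked_ids, relevant_ids):
    # Non-interpolated AP for one topic: mean of the precision values at the ranks
    # where relevant images are retrieved, normalised by the number of relevant images.
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    # MAP: AP averaged over all topics. 'runs' maps a topic id to a ranked list of
    # image ids, 'qrels' maps a topic id to the set of relevant image ids.
    return sum(average_precision(runs[t], qrels[t]) for t in qrels) / len(qrels)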

2.4    Relevance Assessments
To find relevant images, we created pools per topic [1] by keeping the top 100 results from
all submitted runs, resulting in 1,507 images to be judged per topic on average and a total of
15,007 images to be assessed. The usual relevance judgement process in information retrieval
tasks envisages that several users judge each document in question for relevance to the
particular query. In this case, judging relevance is easy enough that we can postulate that
any two judges would come to the same conclusion, and therefore each image was judged by only
one judge. Furthermore, since only 10 queries were to be judged and the concepts to find are
simple, the judges complained about the task being too easy and boring. Therefore, after
approximately 3,000 images had been judged, we allowed the judges to additionally specify
whether an image is relevant with respect to any of the other queries. The whole judgement
process was performed over a quickly created web interface, and everybody from the RWTH Aachen
University Human Language Technology and Pattern Recognition Group and from the Vienna
University of Technology PRIP group was invited to judge images. Thus, most of the judges are
computer science students and researchers with a human language technology background. Note
that in the pooling process all images that are not judged are automatically considered not
relevant.
    The web-interface is shown in Figure 3 to give an impression of the process. On each page,
10 images are shown, and the judge has to decide whether a particular object is present in these
images or not. To reduce boredom for the judges, they are allowed (and recommended) to specify
  4 http://trec.nist.gov/
  5 http://www-nlpir.nist.gov/projects/t01v/
                   class   class        relevant   relevant          relevant
                   id                   normal     w/ additional     full database
                     1     bicycle            66             254               655
                     2     bus                23              69               218
                     3     car               200             522              1268
                     4     motorbike           7              28                86
                     5     cat                 5              18                 7
                     6     cow                 7              23                49
                     7     dog                 9              22                72
                     8     horse              13              94               175
                     9     sheep               5              42                 6
                    10     person            554            3939             11248

Table 1: Results from the relevance judgement process. Column 3 shows the number of relevant
images when standard pooling is used, column 4 when the additional class information is taken
into account. Column 5 shows the results of the relevance judgement of all 20,000 images.


whether other object classes are present in the images. The judges were told to be rather positive
about the relevance of an image, e.g. to consider sheep-like animals such as llamas to be sheep
and to consider tigers and other non-domestic cats to be cats.
    Furthermore, Ville Viitaniemi from the HUTCIS group judged all 20,000 images with respect
to relevance for all of the topics, using a stricter definition of relevance.

Results from the Relevance Judgements Table 1 gives an overview of how many images were
found to be relevant for each of the given topics. It can be observed that there are far more
relevant images for the person topic than for any other topic. These numbers show that the
task at hand is really challenging for most of the classes and very similar to the proverbial
search for a needle in a haystack. In particular, for the sheep topic only 0.03% of the images
in the database are relevant, although some more images were judged to be relevant in the
additional relevance information. If only the data from the conventional pooling process is
considered, for eight of the ten classes fewer than one in a thousand images is relevant. The
discrepancy between the results with additional relevance judgements and the results with full
relevance judgement is due to different relevance criteria. The judges that created the
additional relevance information were instructed to judge images showing sheep-like animals
such as llamas as relevant and to count tigers as cats, whereas the full relevance judgement
was stricter in this respect. Table 1 also shows that the additional relevance judgements
found more cats and sheep than are actually in the database.


3     Methods
Seven international groups from academia participated in the task and submitted a total of 38
runs. The group with the highest number of submissions had 13 submissions. In the following
sections, the methods of the groups are explained (in alphabetical order) and references to further
work are given.

3.1    Budapest methods
Authors:     Mátyás Brendel, Bálint Daróczy, and András Benczúr
Affiliation: Data Mining and Web search Research Group, Informatics Laboratory
             Computer and Automation Research Institute of the Hungarian Academy of Sciences
Email:       {mbrendel, daroczyb, benczur}@ilab.sztaki.hu
                        Figure 3: The relevance judgement web-interface.


3.1.1   budapest-acad315
The task of object retrieval is to classify objects found in images, i.e. to find an object
in an image that is similar to the sample objects in the pre-classified images. There are two
problems with this task: first, how do we model objects; second, how do we measure the
similarity of objects. Our answer to the first question is to model objects with image
segments. Segment-, region- or blob-based image similarity is a common approach in
content-based image retrieval, see for example [4, 27, 2, 21]. Accordingly, the basis of our
first method is to find segments in the query image that are similar to the objects in the
pre-classified images. The image is then assigned to the class for which the most similar
segment is found in the query image.
    Image segmentation is itself a widely researched and open problem. We used an image
segmenter developed by our group to extract segments from the query images. Our method is
based on the graph-based algorithm developed by Felzenszwalb and Huttenlocher [11]. We
implemented a pre-segmentation method to reduce the computation time and use a different
smoothing technique. All images were resized to a fixed resolution. Gaussian-based smoothing
helped us cut down high-frequency noise. Because of the efficiency of the OpenCV6
implementation, we did not implement our own resizing or Gaussian smoothing algorithm. As
pre-segmentation, we built a three-level Gaussian-Laplacian pyramid to define initial pixel
groups. The original pyramid-based method, which also considers the connection between pixels
on different levels, was modified to eliminate the so-called blocking problem. We used the
brightness difference to measure the distance between pixels:

          diffY(P1 , P2 ) = 0.3 · |RP2 − RP1 | + 0.59 · |GP2 − GP1 | + 0.11 · |BP2 − BP1 |       (1)
    After pre-segmentation, we had segments of at most 16 pixels. To detect complex segments,
we modified the original graph-based method by Felzenszwalb and Huttenlocher [11] with an
adaptive threshold system using the Euclidean distance, preferring larger regions over small
regions of the image. Felzenszwalb and Huttenlocher define an undirected graph G = (V, E)
where every vi ∈ V corresponds to a pixel in the image, and the edges in E connect certain
pairs of neighbouring pixels. This graph-based representation reduces the original problem to
a graph-cutting problem. They give a very efficient linear algorithm that yields a result
close to the optimal normalized cut, which is an NP-complete graph problem [11, 29].
  6 http://www.intel.com/technology/computing/opencv/
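    For orientation, the graph-based segmentation of Felzenszwalb and Huttenlocher [11] is
available in standard libraries; the following Python sketch uses scikit-image as a stand-in
(our choice, not the group's implementation, and without the pyramid-based pre-segmentation
described above; the file name and parameter values are placeholders):

from skimage import io, transform
from skimage.segmentation import felzenszwalb

image = io.imread("query.jpg")                      # hypothetical input image
image = transform.resize(image, (256, 256))         # fixed resolution, as described above
# sigma smooths high-frequency noise; scale and min_size play the role of the thresholds
segments = felzenszwalb(image, scale=100, sigma=0.8, min_size=50)
print("number of segments:", segments.max() + 1)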
Algorithm 1 Algorithm Segmentation (Isrc , τ1 , τ2 )
τ1 and τ2 are threshold functions. Let I2 be the source image, I1 and I0 are the down-scaled
images. Let P (x, y, i) be the pixel P (x, y) in the image on level i (Ii ). Let G = (V, E) be an
undirected weighted graph where ∀vi ∈ V corresponds to a pixel P (x, y). Each edge (vi , vj ) has a
non-negative weight w(vi , vj ).

Gaussian-Laplacian Pyramid

  1. For every P (x, y, 1) Join(P (x, y, 1), P (x/2, y/2, 0)) if τ1 < dif f Y (P (x, y, 1), P (x/2, y/2, 0))

  2. For every P (x, y, 2) Join(P (x, y, 2), P (x/2, y/2, 1)) if τ1 < dif f Y (P (x, y, 2), P (x/2, y/2, 1))

Graph-based Segmentation

  1. Compute M axweight (R) = maxe∈M ST (R,E) w(e) for every coherent group of points R where
     M ST (R, E) is the minimal spanning tree

  2. Compute Co(R) = τ2 (R) + M axweight (R) as the measure of coherence between points in R
                                                                                     T
  3. Join(R1 , R2 ) if e ∈ E exists so w(e) < min(Co(R1 ), Co(R2 )) is true, where R1 R2 = ∅
     and w(e) is the weight of the border edge e between R1 and R2

  4. Repeat steps 1,2,3 for every neighboring group (R1 , R2 ) until possible to join two groups



    Algorithm 1 sometimes does not find relevant parts when low initial thresholds are used.
To find the relevant borders that would disappear in the graph-based method with high
thresholds, we calculated the Sobel gradient image to separate important edges from other
remainders.
    Similarity of complex objects is usually measured on a feature basis, i.e. the similarity
of the objects is defined by their similarity in a certain feature space.

                          dist(Si , Oj ) = d(F (Si ), F (Oj )) : Si ∈ S, Oj ∈ O                         (2)

where S is the set of segments, O is the set of objects, dist is the distance function between
objects and segments, d is a distance function in the feature space (usually one of the
conventional metrics in n-dimensional real space), and F is the function which assigns
features to objects and segments. From the segments we extracted features such as mean color,
size, shape information, and histogram information. As shape information, a 4 × 4
low-resolution variant of the segment (framed in a rectangle with background) was used. Our
histograms had 5 bins in each channel. Altogether, a 35-dimensional, real-valued feature
vector was extracted for each segment. The same features were extracted for the objects in the
pre-classified images, treating them as segments. The background and those classes that were
not requested were ignored. The features of the objects were written to a file together with
the class identifiers, which were extracted from the color coding. This way we obtained a
database of class samples containing the features of objects belonging to the classes. After
this, the comparison of the objects of the pre-classified sample images with the segments of
the query image was possible. We used the Euclidean distance to measure similarity. The
distance of the query image Q was computed as:

                          dist(Q) = min_{i,j} dist(Si , Oj ) : Si ∈ S, Oj ∈ O                    (3)

where S is the set of segments of image Q, and O is the set of the pre-classified sample
objects. Q is assigned to the class of the object that minimizes the distance. The score of an
image was computed as:
                                      score(Q) = 1000/dist(Q)                                  (4)
where Q is the query image.
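    The nearest-segment scoring described above can be sketched as follows (NumPy, with
hypothetical 35-dimensional feature arrays; this is an illustration, not the group's code):

import numpy as np

def score_image(segment_features, class_samples):
    # segment_features: (n_segments, 35) features of the query image's segments.
    # class_samples: dict mapping class id -> (n_objects, 35) features of sample objects.
    # Returns the predicted class and the score 1000 / dist used for ranking.
    best_class, best_dist = None, np.inf
    for cls, samples in class_samples.items():
        # Euclidean distances between every segment and every sample object
        d = np.linalg.norm(segment_features[:, None, :] - samples[None, :, :], axis=-1)
        if d.min() < best_dist:
            best_class, best_dist = cls, d.min()
    return best_class, 1000.0 / best_dist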
3.1.2   budapest-acad314
In our first method (see budapest-acad315) we found that our segments are much smaller than
the objects in the pre-segmented images. It would have been possible to obtain larger segments
by adjusting the segmentation algorithm, but this way we would not have obtained segments that
were really similar to the objects. We found that our segmentation algorithm could not
generate segments similar to the objects in the pre-classified images with any setting of the
parameters. Even when we ran our algorithm on the sample images and the segments were
approximately of the same size, the segments did not match the pre-classified objects. The
reason for this is that the pre-segmentation was made by humans, and algorithmic segmentation
is far from capable of the same result. For example, it is almost impossible to write an
algorithm which would segment the shape of a human being as one segment if the clothes differ.
However, people were one of the defined classes, and the sample images contained people with
the entire body as one object. Therefore we modified our method. Our second method is still
segment-based, but we also segment the sample images. We took the segmented sample images,
and if a segment was at least 80% inside the area of a pre-defined object, we took this
segment as a proper sample for that object (a sketch of this criterion is given at the end of
this section). This way a set of sample segments was created. After this, the method is
similar to the previous one; the only difference is that we have sample segments instead of
sample objects, but we treat them in the same way. The features of the segments were extracted
and written to a file with the identifier of the class, which was extracted from the color
codes. After this, the comparison of the segments of the pre-classified images and the query
image was possible. We again used the Euclidean distance to measure similarity. The closest
segment of the query image to a segment of any of the objects was searched:
                          dist(Q) = min_{i,j} dist(Si , Sj ) : Si ∈ S, Sj ∈ O                    (5)

where S is the set of segments of image Q, and O is the set of segments belonging to the
pre-classified objects. The image was classified according to the object to which the closest
segment belongs. As we expected, this modification improved the algorithm.
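    The 80% criterion for selecting sample segments can be sketched as follows (boolean masks
are assumed; an illustration, not the original implementation):

import numpy as np

def sample_segments(segment_masks, object_mask, threshold=0.8):
    # Keep a segment as a training sample for an object class if at least
    # 'threshold' of its pixels lie inside the annotated object area.
    samples = []
    for seg in segment_masks:                       # list of boolean (H, W) arrays
        inside = np.logical_and(seg, object_mask).sum() / seg.sum()
        if inside >= threshold:
            samples.append(seg)
    return samples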

3.2     HUTCIS: Conventional Supervised Learning using Fusion of Image
        Features
Authors:     Ville Viitaniemi, Jorma Laaksonen
Affiliation: Adaptive Informatics Research Centre,
             Laboratory of Computer and Information Science,
             Helsinki University of Technology, Finland
Email:       firstname.lastname@tkk.fi

All our 13 runs identified with prefix HUTCIS implement a similar general system architecture
with three system stages:

  1. Extraction of a large number of global and semi-global image features. Here we interpret
     global histograms of local descriptors as one type of global image feature.

  2. For each individual feature, conventional supervised classification of the test images using
     the VOC2006 trainval images as the training set.

  3. Fusion of the feature-wise classifier outputs.

By using this architecture, we knowingly ignored the aspect of qualitatively different training and
test data. The motivation was to provide a baseline performance level that could be achieved by
just applying a well-working implementation of the conventional supervised learning approach.
Table 2, which gives ROC AUC performances on the VOC 2006 test set, reveals that the
performance of our principal run HUTCIS SVM FULLIMG ALL is relatively close to the best
performances in last year's VOC evaluation [9]. The last row of the table indicates what the
rank of the HUTCIS SVM FULLIMG ALL run would have been among the 19 VOC 2006 participants.
    The following briefly describes the components of the architecture. For a more detailed
description, see e.g. [34].

Features For different runs, the features are chosen from a set of feature vectors, each with
several components. Table 3 lists 10 of the features. Additionally, the available feature set includes
interest point SIFT feature histograms with different histogram sizes, and concatenations of pairs,
triples and quadruples of the tabulated basic feature vectors. The SIFT histogram bins have been
selected by clustering part of the images with the self-organising map (SOM) algorithm.

Classification and fusion The classification is performed either by a C-SVC implementation
built around the LIBSVM support vector machine (SVM) library [3], or by a SOM-based classifier
[19]. The SVM classifiers (prefix HUTCIS SVM) are fused using an additional SVM layer. For
the SOM classifiers (prefix HUTCIS PICSOM), the fusion is based on a summation of normalized
classifier outputs.
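    A hedged sketch of this two-stage scheme, one probabilistic SVM per feature followed by a
fusion SVM on the stacked outputs (scikit-learn is our stand-in for the C-SVC implementation
built around LIBSVM; binary labels 0/1 and the kernel settings are assumptions):

import numpy as np
from sklearn.svm import SVC

def train_two_stage(features_per_type, labels):
    # features_per_type: list of (n_images, d_i) arrays, one array per image feature.
    base = [SVC(kernel="rbf", probability=True).fit(X, labels) for X in features_per_type]
    stacked = np.column_stack([clf.predict_proba(X)[:, 1]
                               for clf, X in zip(base, features_per_type)])
    fusion = SVC(kernel="rbf", probability=True).fit(stacked, labels)
    return base, fusion

def score_test(base, fusion, test_features_per_type):
    stacked = np.column_stack([clf.predict_proba(X)[:, 1]
                               for clf, X in zip(base, test_features_per_type)])
    return fusion.predict_proba(stacked)[:, 1]      # one ranking score per test image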

The different runs Our principal run HUTCIS SVM FULLIMG ALL realizes all three system
stages in the best way possible. The other runs use subsets of the image features, inferior
algorithms, or are otherwise predicted to be suboptimal.
    The run HUTCIS SVM FULLIMG ALL performs SVM classification with all the tabulated
features, the SIFT histograms, and twelve previously hand-picked concatenations of the
tabulated features, selected on the basis of SOM classifier performance in the VOC2006 task.
The runs HUTCIS SVM FULLIMG IP+SC and HUTCIS SVM FULLIMG IP are otherwise similar but use
just subsets of the features: SIFT histograms and a color histogram, or just SIFT histograms,
respectively.
    The runs identified by prefix HUTCIS SVM BB are half-baked attempts to account for the
different training and test image distributions. These runs are also based on SIFT histogram and
color histogram features. For the training images, the features are calculated from the bounding
boxes specified in VOC2006 annotations. For the test images, the features are calculated for whole
images. The different runs with this prefix correspond to different ways to select the images as a
basis for SIFT codebook formation.
    The run HUTCIS FULLIMG+BB is the rank-based fusion of features extracted from full
images and from bounding boxes. The runs HUTCIS PICSOM1 and HUTCIS PICSOM2 are otherwise
identical but use a different setting of a SOM classifier parameter. HUTCIS PICSOM2 smooths


                       Table 2: ROC AUC performance in VOC2006 test set
   Run id.              bic.    bus      car     cat         cow       dog    horse    mbike    person     sheep
   FULLIMG ALL         0.921   0.978    0.974   0.930       0.937     0.866   0.932    0.958     0.874     0.941
   FULLIMG IP+SC       0.922   0.977    0.974   0.924       0.934     0.851   0.928    0.953     0.865     0.941
   FULLIMG IP          0.919   0.952    0.970   0.917       0.926     0.840   0.903    0.943     0.834     0.936
   Best in VOC2006     0.948   0.984    0.977   0.937       0.940     0.876   0.927    0.969     0.863     0.956
   Rank                 7th     4th      3rd     4th         4th       3rd     1st      5th       1st       6th




                  Table 3: Some of the image features used in the HUTCIS runs
         Colour layout                                              Dominant colour
         Sobel edge histogram (4 × 4 tiling of the image)           HSV colour histogram
         Average colour (5-part tiling)                             Colour moments (5-part tiling)
         16 × 16 FFT of edge image                                  Sobel edge histogram (5-part tiling)
         Sobel edge co-occurrence matrix (5-part tiling)
the feature spaces less, so the detection is based on more local information. Both runs are
based on the full set of features mentioned above.

Results The tabulated MAP results of the HUTCIS runs are corrupted by our own slip in the
ordering of the queries, which prevented us from fully participating in the pool selection for
query topics 4–10. For the normal pooling, our mistake excluded us completely; for the
additional relevance judgments, only partly. This is expected to degrade our results
considerably for the normal pooling and somewhat less for the additional pool, as
participating in the pool selection is known to be essential for obtaining good performance
figures.
    As expected, HUTCIS SVM FULLIMG ALL turned out to be the best of our runs for the
uncorrupted query topics 1–3. The idea of extracting features only from the bounding boxes of
objects in order to reduce the effect of different backgrounds in the training and test images
usually did not seem to work as well as extracting features from full images, although the
results were not very conclusive. This could partly be due to the asymmetry in feature
extraction: the features extracted from the bounding boxes of the training objects were
compared with features of the whole test images. The results of the SOM classifier runs did
not provide information of general interest, besides confirming the previously known result
that SOM classifiers are inferior to SVMs.

3.3    INAOE’s Annotation-based object retrieval approaches
Authors:     Heidy Marisol Marin Castro, Hugo Jair Escalante Balderas,
             and Carlos Arturo Hernández Gracidas
Affiliation: TIA Research Group, Computer Science Department,
             National Institute of Astrophysics, Optics and Electronics, Tonantzintla, Mexico
Email:       {hmarinc,hugojair,carloshg}@ccc.inaoep.mx
The TIA research group at INAOE, Mexico, proposed two methods based on image labeling.
Automatic image annotation methods were used for labeling regions within segmented images, and
object retrieval was then performed based on the generated annotations. Results were not what
we expected, which may be due to the fact that the annotations were defined subjectively and
that not enough images were annotated to create the training set for the annotation method.
Two approaches were proposed: a semi-supervised classifier based on unlabeled data and a
supervised one; the latter was enhanced with a recently proposed method based on semantic
cohesion [8]. Both approaches followed these steps:
  1. Image segmentation
  2. Feature extraction
  3. Manual labeling of a small subset of the training set
  4. Training a classifier
  5. Using the classifier for labeling the test-images
  6. Using the labels assigned to regions for object retrieval
For both approaches, the full collection of images was segmented with the normalized cuts
algorithm [30]. A set of 30 features was extracted from each region; we considered color,
shape and texture attributes. We used our own tools for image segmentation, feature extraction
and manual labeling [22]. The considered annotations were the labels of the 10 objects defined
for this task. The features for each region, together with the manual annotations, were used
as the training set for the two proposed approaches. Each classifier was trained with this
dataset, and then all of the test images were annotated with the classifier. Finally, the
generated annotations were used for retrieving objects with queries. Queries were created
using the labels of the objects defined for this task, and those images with the highest
number of regions annotated with the object label were selected as relevant. Sample segmented
images with their corresponding manual annotations are shown in Figure 4. As can be seen, the
segmentation algorithm works well for some images (isolated cows, close-ups of people), but
for other objects the segmentation is poor (a bicycle, for example).
                      Figure 4: Sample images from the generated training set.




Figure 5: Left: graphical description of the improvement process of MRFI. Right:
interpretation of MRFI for a given configuration of labels and regions; solid (red) arcs model
the semantic cohesion between labels, while dashed (blue) arcs model the relevance weight of
each label according to k-NN.


KNN+MRFI, A supervised approach For the supervised approach we used a simple k-NN
classifier for automatically labeling regions, with the Euclidean distance as similarity
function. The label of the nearest neighbor (in the training set) of each test region was
assigned as the annotation for this region. This was our baseline run (INAOE-TIA-INAOE-RB-KNN).
    The next step consisted of improving the annotation performance of k-NN using an approach
called MRFI [8], which we recently proposed for improving annotation systems. This approach
consists of modeling each image (its region-annotation pairs) with a Markov random field
(MRF), introducing semantic knowledge, see Figure 5. The top-k most likely annotations for
each region are considered. Each of these annotations has a confidence weight related to the
relevance of the label being the correct annotation for that region, according to k-NN. The
MRFI approach uses the relevance weights together with semantic information to choose a unique
(the correct) label for each region. Semantic information is considered in the MRF to keep the
annotations assigned to regions within a common image coherent, while the relevance weight
takes into account the confidence of the annotation method (k-NN) in each of the labels, see
Figure 5. The (pseudo-)optimal configuration of region annotations for each image is obtained
by minimizing an energy function defined by potentials. For the optimization we used standard
simulated annealing.
    The intuitive idea of the MRFI approach is to guarantee that the labels assigned to
regions are coherent with each other, taking into account semantic knowledge and the
confidence of the annotation system. In previous work, semantic information was obtained from
co-occurrences of labels in an external corpus. In this work, however, the semantic
association between a pair of labels is given by the normalized number of relevant documents
returned by Google for queries generated using the pair of labels. This run is named
INAOE-TIA-INAOE-RB-KNN+MRFI, see [8] for details.
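    For the baseline run, the region labelling amounts to a 1-nearest-neighbour assignment
followed by counting labelled regions per image; a minimal sketch (scikit-learn as a stand-in,
not the group's code):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def label_regions(train_feats, train_labels, test_feats):
    # Assign each test region the label of its nearest training region (Euclidean distance).
    knn = KNeighborsClassifier(n_neighbors=1).fit(train_feats, train_labels)
    return knn.predict(test_feats)

def rank_images(region_labels_per_image, query_label):
    # Score an image by the number of its regions annotated with the query label.
    scores = {img: np.sum(labels == query_label)
              for img, labels in region_labels_per_image.items()}
    return sorted(scores, key=scores.get, reverse=True)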

SSAssemble: Semi-supervised Weighted AdaBoost The semi-supervised approach consists
of using a recently proposed ensemble of classifiers called WSA [22]. Our WSA ensemble uses
naive Bayes as its base classifier. A set of these classifiers is combined in a cascade based
on the AdaBoost technique [13]. Ensemble methods work by combining a set of base classifiers
in some way, such as a voting scheme, producing a combined classifier which usually
outperforms a single classifier. When training the ensemble of Bayesian classifiers, WSA
considers the unlabeled images at each stage. These are annotated based on the classifier from
the previous stage and then used to train the next classifier. The unlabeled instances are
weighted according to a confidence measure based on their predicted probability value, while
the labeled instances are weighted according to
Algorithm 2 Semi-supervised Weighted AdaBoost (WSA) algorithm.
Require: L: labeled instances, U : unlabeled instances, P : training instances, T : iterations
Ensure: Final hypothesis and probabilities: Hf = argmax Σt=1..T log(1/Bt ), P (xi )

 1: W (xi )0 = 1/NumInst(L), ∀xi ∈ L
 2: for t from 1 to T do
 3:   W (xi )t = W (xi ) / Σi=1..N W (xi ), ∀xi ∈ L
 4:   ht = C(L, W (xi )t )
 5:   et = Σi=1..N W (xi )t if ht (xi ) ≠ yi
 6:   if et ≥ 0.5 then
 7:      exit
 8:   end if
 9:   if et = 0.0 then
10:      et = 0.01
11:   end if
12:   Bt = et /(1 − et )
13:   W (xi )(t+1) = W (xi )t · Bt if ht (xi ) = yi , ∀xi ∈ L
14:   P (xi ) = C(L, U, W (xi )t )
15:   W (xi ) = P (xi ) · Bt , ∀xi ∈ U
16: end for



the classifier error, as in standard AdaBoost. Our method is based on the supervised
multi-class AdaBoost ensemble, which has been shown to be an efficient scheme for reducing the
error rate of different classifiers.
    Formally, the WSA algorithm receives a set of labeled data (L) and a set of unlabeled data
(U ). An initial classifier N B1 is built using L. The labels in L are used to evaluate the
error of N B1 . As in AdaBoost, the error is used to weight the examples, increasing the
weight of the misclassified examples and keeping the weight of the correctly classified
examples. The classifier is used to predict a class for U with a certain probability. In the
case of U , the weights are multiplied by the predicted probability of the majority class.
Unlabeled examples with a high probability for their predicted class will have more influence
on the construction of the next classifier than examples with lower probabilities. The next
classifier N B2 is built using the weights and predicted classes of L ∪ U . N B2 makes new
predictions on U , and the error of N B2 on all the examples is used to re-weight the
examples. This process continues, as in AdaBoost, for a predefined number of cycles or until a
classifier has a weighted error greater than or equal to 0.5. As in AdaBoost, new instances
are classified using a weighted sum of the predictions of all the constructed base
classifiers. WSA is described in Algorithm 2.
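    A compact Python sketch of the WSA loop with a Gaussian naive Bayes base classifier is
given below; the sample weights and the Bt update follow Algorithm 2, but the simplifications
(base classifier, weight normalisation, stopping rule) are our own assumptions, not the
authors' implementation:

import numpy as np
from sklearn.naive_bayes import GaussianNB

def wsa(XL, yL, XU, T=10):
    # Labeled set (XL, yL), unlabeled set XU, at most T boosting rounds.
    # Returns the base classifiers and their Bt weights for the final weighted vote.
    wL = np.full(len(XL), 1.0 / len(XL))            # initial weights on the labeled data
    clf = GaussianNB().fit(XL, yL, sample_weight=wL)
    classifiers, betas = [], []
    for _ in range(T):
        err = wL[clf.predict(XL) != yL].sum() / wL.sum()
        if err >= 0.5:
            break
        err = max(err, 0.01)                        # avoid Bt = 0
        beta = err / (1.0 - err)
        classifiers.append(clf)
        betas.append(beta)
        wL[clf.predict(XL) == yL] *= beta           # keep weight on the hard labeled examples
        proba = clf.predict_proba(XU)
        yU = clf.classes_[proba.argmax(axis=1)]     # predicted labels for the unlabeled data
        wU = proba.max(axis=1) * beta               # confidence-based weights for unlabeled data
        clf = GaussianNB().fit(np.vstack([XL, XU]), np.concatenate([yL, yU]),
                               sample_weight=np.concatenate([wL, wU]))
    return classifiers, betas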
We faced several problems when performing the image annotation task. The first was that the
training set and the test set were different, which caused a high classification error rate.
The second was due to the segmentation algorithm: the automatic segmentation did not perform
well for all images, leading to incorrect segmentation of the objects in the images. The last
concerns the differing criteria used for the manual labeling of the training set. Due to all
these factors we did not get good results. We hope to improve the annotation by changing part
of the labeling strategy.

3.4    MSRA: Object Retrieval
Authors:     Mingjing Li, Xiaoguang Rui, and Lei Wu
Affiliation: Microsoft Research Asia
Email:       mjli@microsoft.com

Two approaches were adopted by Microsoft Research Asia (MSRA) to perform the object retrieval
task in ImageCLEF 2007. One is based on the visual topic model (VTM); the other is the visual
language modelling (VLM) method [35]. VTM represents an image by a vector of probabilities
that the image belongs to a set of visual topics, and categorizes images using SVM classifiers. VLM
represents an image as a 2-D document consisting of visual words, trains a statistical language
model for each image category, and classifies an image to the category that generates the image
with the highest probability.

3.4.1   VTM: Visual Topic Model
Probabilistic Latent Semantic Analysis (pLSA) [12], a generative model from the text
literature, is adopted to find the visual topics in the training images. Different from
traditional pLSA, all training images of the 10 categories are put together in the training
process, and about 100 visual topics are discovered.
    The training process consists of five steps: local feature extraction, visual vocabulary
construction, visual topic construction, histogram computation, and classifier training. At
first, salient image regions are detected using scale-invariant interest point detectors such
as the Harris-Laplace and the Laplacian detectors. For each image, about 1,000 to 2,000
salient regions are extracted. These regions are described by the SIFT descriptor, which
computes a gradient orientation histogram within the support region. Next, 300 local
descriptors are randomly selected from each category and combined to build a global vocabulary
of 3,000 visual words. Based on this vocabulary, images are represented by the frequencies of
the visual words. Then, pLSA is performed to discover the visual topics in the training
images. pLSA is also applied to estimate how likely it is that an image belongs to each visual
topic. The histogram of the estimated probabilities is taken as the feature representation of
that image for classification. For the multi-class classification problem, we adopt the
one-against-one scheme and train an SVM classifier with an RBF kernel for each possible pair
of categories.
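    A rough outline of this pipeline in Python, using scikit-learn components as stand-ins
(LDA topics instead of pLSA, and k-means for the vocabulary construction; descriptor
extraction is assumed to have happened already, and all settings beyond the counts named in
the text are placeholders):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

def train_vtm(descriptors_per_image, labels, n_words=3000, n_topics=100):
    # descriptors_per_image: list of (n_i, 128) SIFT arrays, one per training image.
    vocab = KMeans(n_clusters=n_words).fit(np.vstack(descriptors_per_image))

    def bow(desc):                                  # visual-word frequency histogram
        return np.bincount(vocab.predict(desc), minlength=n_words)

    counts = np.array([bow(d) for d in descriptors_per_image])
    topics = LatentDirichletAllocation(n_components=n_topics).fit(counts)
    feats = topics.transform(counts)                # per-image topic probabilities
    clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(feats, labels)
    return vocab, topics, clf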

3.4.2   VLM: Visual Language Modeling
The approach consists of three steps: image representation, visual language model training,
and object retrieval. Each image is transformed into a matrix of visual words. First, an image
is simply segmented into 8 × 8 patches, and a texture histogram feature is extracted from each
patch. Then all patches in the training set are grouped into 256 clusters based on their
features. Next, each patch cluster is represented by an 8-bit hash code, which is defined as a
visual word. Finally, an image is represented by a matrix of visual words, which is called a
visual document.
    Visual words in a visual document are not independent of each other, but correlated with
other words. To simplify the model training, we assume that visual words are generated in
order from left to right and top to bottom, and that each word is conditionally dependent only
on its immediate top and left neighbors; we then train a trigram language model for each image
category. A test image is transformed into a matrix of visual words in the same way, and the
probability that it is generated by each category is estimated. Finally, the image categories
are ranked in descending order of these probabilities.
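    A minimal sketch of scoring one visual document under such a model, assuming each word is
conditioned on its left and top neighbours (the counts and add-one smoothing are our own
simplification, not the authors' implementation):

import numpy as np
from collections import defaultdict

def train_trigram(word_matrices):
    # word_matrices: list of 2-D integer arrays of visual words for one image category.
    counts = defaultdict(lambda: defaultdict(float))
    for M in word_matrices:
        for i in range(1, M.shape[0]):
            for j in range(1, M.shape[1]):
                counts[(M[i, j - 1], M[i - 1, j])][M[i, j]] += 1.0   # p(w | left, top)
    return counts

def log_likelihood(M, counts, vocab_size=256):
    # Log-probability of a test word matrix under one category model.
    ll = 0.0
    for i in range(1, M.shape[0]):
        for j in range(1, M.shape[1]):
            ctx = counts[(M[i, j - 1], M[i - 1, j])]
            total = sum(ctx.values())
            ll += np.log((ctx[M[i, j]] + 1.0) / (total + vocab_size))  # add-one smoothing
    return ll

At retrieval time, the categories are ranked by the log-likelihood of the test image's word
matrix under each category model.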

3.5     NTU: Solution for the Object Retrieval Task
Authors:     Steven C. H. Hoi
Affiliation: School of Computer Engineering, Nanyang Technological University, Singapore
Email:       chhoi@ntu.edu.sg
3.5.1   Introduction
Object retrieval is an interdisciplinary research problem between object recognition and
content-based image retrieval (CBIR). It is commonly expected that object retrieval can be
solved more effectively by jointly exploiting CBIR and object recognition techniques. The goal
of this paper is to study a typical CBIR solution applied to the object retrieval task
[17, 18]. We expect that the empirical study in this work will serve as a baseline for future
research applying CBIR techniques to object recognition.

3.5.2   Overview of Our Solution
We study a typical CBIR solution for the object retrieval problem. In our approach, we focus on
two key tasks. One is the feature representation, the other is the supervised learning scheme with
support vector machines.

Feature Representation In our approach, three kinds of global features are extracted to rep-
resent an image, including color, shape, and texture.
    For color, we study the Grid Color Moment (GCM) feature. Each image is split into 3 × 3
equal grids, and color moments are extracted to represent each of the 9 grids. Three color
moments are computed in each color channel (H, S, and V): color mean, color variance and color
skewness. Thus, an 81-dimensional color moment vector is adopted as the color feature for each
image.
    For shape, we employ the edge direction histogram. An input color image is first converted
into a gray image. Then a Canny edge detector is applied to obtain its edge image. Based on
the edge image, the edge direction histogram is computed and quantized into 36 bins of 10
degrees each. In addition, we use one bin to count the number of pixels without edges. Hence,
a 37-dimensional edge direction histogram is used for shape.
    For texture, we investigate the Gabor feature. Each image is first scaled to the size of 64 × 64.
Then, the Gabor wavelet transformation is applied to the scaled image at 5 scale levels and 8
orientations, which results in a total of 40 subimages for each input image. For each subimage,
we calculate three statistical moments to represent the texture, including mean, variance, and
skewness. Therefore, a 120-dimensional feature vector is used for texture.
    In total, a 238-dimensional feature vector is used to represent each image. The set of visual
features has been shown to be effective for content-based image retrieval in our previous experi-
ments [17, 18].

Supervised Learning for Object Retrieval The object retrieval task defined in ImageCLEF
2007 is similar to a relevance feedback task in CBIR, in which a number of positive and
negative labeled examples are given for learning. This can be treated as a supervised
classification task. To solve it, we employ the support vector machine (SVM) technique for
training the classifiers on the given examples [17]. In our experiment, a standard SVM package
is used to train the SVM classifier with RBF kernels. The parameters C and γ are tuned on the
VOC2006 training set, on which the training precision is 84.2% for the classification tasks.
Finally, we apply the trained classifiers to perform object retrieval by ranking the test
images according to their distance from the classifier's decision boundary.
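    A hedged sketch of the final retrieval step: train an RBF SVM on the 238-dimensional
vectors of the VOC2006 examples and rank the test images by their signed distance to the
decision boundary (scikit-learn as a stand-in for the SVM package used; the C and gamma values
are placeholders, not the tuned ones):

import numpy as np
from sklearn.svm import SVC

def retrieve(train_feats, train_labels, test_feats, test_ids, C=10.0, gamma=0.01):
    # train_labels are +1 / -1 for one object class.
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(train_feats, train_labels)
    scores = clf.decision_function(test_feats)      # signed distance to the boundary
    order = np.argsort(-scores)[:1000]              # 1000 images required per query
    return [test_ids[i] for i in order]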

3.5.3   Concluding Remarks
We described our solution for the object retrieval task of ImageCLEF 2007, which is based on
a typical CBIR approach. We found that the current solution, although it achieved good
performance on an object recognition test-bed, did not achieve promising results on the
difficult object retrieval task. In future work, several directions can be explored to improve
the performance, including local feature representation and better machine learning
techniques.
3.6    PRIP: Color Interest Points and SIFT features
Authors:     Julian Stöttinger1 , Allan Hanbury1 , Nicu Sebe2 , Theo Gevers2
Affiliation: 1 PRIP, Institute of Computer-Aided Automation,
             Vienna University of Technology, Vienna, Austria
             2
               Intelligent Systems Lab Amsterdam,
             University of Amsterdam, The Netherlands
Email:       {julian,hanbury}@prip.tuwien.ac.at, {nicu,gevers}@science.uva.nl
In the field of retrieval, detection, recognition and classification of objects, many
state-of-the-art methods use interest point detection at an early stage. This initial step
typically aims to find meaningful regions in which descriptors are calculated. Finding salient
locations in image data is crucial for these tasks. Most current methods use only the
luminance information of the images. Our approach focuses on the use of color information in
interest point detection and the gain in performance it provides. Based on the Harris corner
detector, multi-channel visual information transformed into different color spaces is the
basis for extracting the most salient interest points. To determine the characteristic scale
of an interest point, a method investigating the color information on a global scope is used.
The two different PRIP-PRIP ScIvHarris approaches differ only in the properties of these
interest points.
The method consists of the following stages:
  1. Extraction of multi-channel based interest points
  2. Local descriptions of interest points
  3. Estimating the signature of an image
  4. Classification

Extraction of multi-channel based interest points An extension of the intensity-based
Harris detector [16] is proposed in [25]. Because of common photometric variations in imaging
conditions such as shading, shadows, specularities and object reflectance, the components of
the RGB color system are correlated and therefore sensitive to illumination changes. However,
in natural images, high contrast changes may appear. Therefore, a color Harris detector in RGB
space does not dramatically change the position of the corners compared to a luminance-based
approach. Normalized rgb overcomes the correlation of RGB and favors color changes. Its main
drawback, however, is its instability in dark regions. We overcome this by using
quasi-invariant color spaces.
    The approach PRIP-PRIP HSI ScIvHarris uses the HSI color space [32], which is quasi-
invariant to shadowing and specular effects. Therefore, changes in lighting conditions in images
should not affect the positions of the interest points, resulting in more stable locations. Addition-
ally, the HSI color space discriminates between luminance and color. Therefore, much information
can be discarded, and the locations get more sparse and distinct.
    The PRIP cbOCS ScIvHarris approach follows a different idea. As proposed in [33], colors
have different occurrence probabilities and therefore different information content; rare
colors are regarded as more salient than common ones. We wish to find a boosting function such
that color vectors having equal information content have equal impact on the saliency
function. This transformation can be found by analyzing the occurrence probabilities of colors
in large image databases. With this change of focus towards rare colors, we aim to discard
many repetitive locations and obtain more stable results on rare features.
    The characteristic scale of an interest point is chosen by applying a principal component
analysis (PCA) to the image and thus finding a description of the correlation of the
multi-channel information [31]. The characteristic scale is selected where the Laplacian of
Gaussian of this projection and the Harris energy are both maximal at the same location in the
image. The final extraction of these interest points and corresponding scales is done by
preferring locations with high Harris energy and large scales. A maximum of 300 locations per
image was defined, as over-description diminishes the overall recognition ability dramatically.
Local descriptions of interest points The scale invariant feature transform (SIFT) [20]
has been shown to give the best results in a broad variety of applications [23]. We used the
areas of the extracted interest points as the basis for the description phase. SIFT
descriptors are essentially sampled and normalized gradient histograms, which can lead to
multiple descriptions per location. This occurs if more than one gradient direction is
regarded as predominant.

Estimating the signature of an image In this bag-of-visual-features approach [36], we
cluster the descriptors of one image into a fixed number of 40 clusters using k-means. The
centroids and the proportional sizes of the clusters build the signature of one image, which
has a fixed dimensionality of 40 by 129.

Classification The Earth Mover's Distance (EMD) [28] has been shown to be a suitable metric
for comparing image signatures. It takes the proportional sizes of the clusters into account,
which adds a lot of discriminative power. The classification itself is done in the most
straightforward way possible: for every object category, the smallest distance to another
signature determines the classification.
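    The signature construction and EMD comparison can be sketched as follows (k-means from
scikit-learn and the EMD solver from the POT library are our stand-ins, not the group's
implementation; only the cluster count follows the text):

import numpy as np
import ot                                           # POT: Python Optimal Transport
from sklearn.cluster import KMeans

def signature(sift_descriptors, k=40):
    # Cluster the SIFT descriptors of one image into k clusters; the signature is the
    # centroids together with the proportional cluster sizes.
    km = KMeans(n_clusters=k).fit(sift_descriptors)
    weights = np.bincount(km.labels_, minlength=k) / len(km.labels_)
    return km.cluster_centers_, weights

def emd_distance(sig_a, sig_b):
    # Earth Mover's Distance between two signatures.
    (ca, wa), (cb, wb) = sig_a, sig_b
    cost = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=-1)  # pairwise centroid distances
    return ot.emd2(wa, wb, cost)                    # minimal transport cost

For every object category, the training signature with the smallest EMD to the query signature
then determines the classification, as described above.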

3.7    RWTHi6: Patch-Histograms and Log-Linear Models
Authors:     Thomas Deselaers, Hermann Ney
Affiliation: Human Language Technology and Pattern Recognition,
             RWTH Aachen University, Aachen, Germany
Email:       surname@cs.rwth-aachen.de

The approach used by the Human Language Technology and Pattern Recognition group of the
RWTH Aachen University, Aachen, Germany, to participate in this object retrieval task consists
of four steps:

  1. patch extraction
  2. clustering
  3. creation of histograms
  4. training of a log-linear model

where the first three steps are feature extraction steps and the last is the actual classification step.
This approach was first published in [6, 7].
    The method follows the promising approach of considering objects to be constellations of
parts which offers the immediate advantages that occlusions can be handled very well, that the
geometrical relationship between parts can be modelled (or neglected), and that one can focus on
the discriminative parts of an object. That is, one can focus on the image parts that distinguish
a certain object from other objects.
    The steps of the method are briefly outlined in the following paragraphs. To model the
difference between the training and test data, the first three steps were performed for the
training and the test data individually, and the histograms were then also extracted for the
respective other set: the vocabulary was learnt once for the training data and once for the
test data, and the histograms were created for each set using both vocabularies. The results,
however, show that this does not seem to be a working approach for tackling the divergence
between training and test data.

Patch Extraction Given an image, we extract square image patches at up to 500 image points.
Additionally, 300 points from a uniform grid of 15×20 cells that is projected onto the image
are used. At each of these points a set of square image patches of varying sizes (in this case
7 × 7, 11 × 11, 21 × 21, and 31 × 31 pixels) are extracted and scaled to a common size (in this case
15 × 15 pixels).
    In contrast to the interest points from the detector, the grid points can also fall onto
very homogeneous areas of the image. This property is on the one hand important for capturing
homogeneity in objects, which is not found by the interest point detector, and on the other
hand it captures parts of the background, which usually is a good indicator for an object, as
in natural images objects are often found in a “natural” environment.
    After the patches are extracted and scaled to a common size, a PCA dimensionality
reduction is applied to reduce the large dimensionality of the data, keeping 39 coefficients
corresponding to the 40 components of largest variance but discarding the first coefficient,
which corresponds to the largest variance. The first coefficient is discarded to achieve a
partial brightness invariance; this is suitable because the first PCA coefficient usually
accounts for global brightness.

Clustering The data are then clustered using a k-means-style iterative splitting clustering
algorithm to obtain a partition of all extracted patches. First, one Gaussian density is
estimated, which is then iteratively split to obtain more densities. These densities are
re-estimated using k-means until convergence is reached, and then the next split is done. It
has been shown experimentally that results consistently improve up to 4096 clusters, but
beyond 4096 clusters the improvement is so small that it is not worth the higher computational
demands.

Creation of Histograms Once we have the cluster model, we discard all information for each patch
except the identifier of its closest cluster center. For the test data, this identifier is
determined by evaluating the Euclidean distance of each patch to all cluster centers. Thus, the
clustering assigns a cluster $c(x) \in \{1, \ldots, C\}$ to each image patch $x$ and allows us to
create histograms of cluster frequencies by counting how many of the extracted patches belong to
each of the clusters. The histogram representation $h(X)$ with $C$ bins is then determined by
counting and normalization such that
\[ h_c(X) = \frac{1}{L_X} \sum_{l=1}^{L_X} \delta(c, c(x_l)), \]
where $\delta$ denotes the Kronecker delta function, $c(x_l)$ is the closest cluster center to
$x_l$, and $x_l$ is the $l$-th of the $L_X$ image patches extracted from image $X$.
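    The histogram computation itself then reduces to a nearest-centre assignment followed by
counting; a small sketch, reusing the hypothetical codebook functions from above:

    import numpy as np

    def patch_histogram(patch_features, centres):
        """Bag-of-visual-words histogram h(X): the fraction of the L_X patches
        of image X that falls into each of the C clusters."""
        d = ((patch_features[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)                    # c(x_l) for every patch
        hist = np.bincount(labels, minlength=len(centres)).astype(float)
        return hist / len(patch_features)            # normalise by L_X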

Training & Classification Having obtained this representation of images by histograms of image
patches, we define a decision rule for the classification of images. An approach based on the
maximum likelihood of the class-conditional distributions does not take into account the
information of competing classes during training. We can use this information by maximizing the
class posterior probability
\[ \prod_{k=1}^{K} \prod_{n=1}^{N_k} p(k \mid X_{kn}) \]
instead. Assuming a Gaussian density with pooled covariances for the class-conditional
distribution, this maximization is equivalent to training the parameters of a log-linear or
maximum entropy model
\[ p(k \mid h) = \frac{1}{Z(h)} \exp\!\Big( \alpha_k + \sum_{c=1}^{C} \lambda_{kc} h_c \Big), \]
where
\[ Z(h) = \sum_{k=1}^{K} \exp\!\Big( \alpha_k + \sum_{c=1}^{C} \lambda_{kc} h_c \Big) \]
is the renormalization factor. The parameters are trained with a modified version of generalized
iterative scaling, and Bayes' decision rule is used for classification.
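    A sketch of the resulting decision rule is shown below; the training of the parameters by
generalized iterative scaling is not reproduced here, and the parameter names are ours.

    import numpy as np

    def posterior(hist, alphas, lambdas):
        """p(k|h) of the log-linear model; alphas has shape (K,),
        lambdas has shape (K, C), hist is the C-bin patch histogram."""
        scores = alphas + lambdas @ hist     # alpha_k + sum_c lambda_kc * h_c
        scores -= scores.max()               # subtract the maximum for numerical stability
        p = np.exp(scores)
        return p / p.sum()                   # division by Z(h)

    def classify(hist, alphas, lambdas):
        """Bayes' decision rule: choose the class with maximum posterior."""
        return int(np.argmax(posterior(hist, alphas, lambdas)))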


4    Results
The results for all submissions are given in Table 4 using the normal Qrels and in Table 5 using
the Qrels with the additional information. Furthermore, Table 6 presents the results using the
relevance information estimated for every image in the database for all classes. All results are
given as AP for the individual topics (first 10 columns) and MAP over all topics (last column).
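    For reference, the sketch below shows how AP and MAP are understood here, assuming the
standard uninterpolated definition used in TREC-style evaluations (precision at the rank of each
relevant image, averaged and divided by the total number of relevant images):

    import numpy as np

    def average_precision(ranked_relevance, n_relevant):
        """Uninterpolated AP for one topic: ranked_relevance is a list of
        booleans over the returned images in rank order, n_relevant is the
        total number of relevant images in the collection."""
        hits, score = 0, 0.0
        for rank, relevant in enumerate(ranked_relevance, start=1):
            if relevant:
                hits += 1
                score += hits / rank
        return score / n_relevant if n_relevant else 0.0

    def mean_average_precision(per_topic_aps):
        """MAP: the mean of the per-topic AP values."""
        return float(np.mean(per_topic_aps))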
    The tables are ordered by MAP (last column). When considering the individual topics, it can
be observed that some methods work well for some topics but fail completely for others. To
highlight this, the highest AP in each column is shown in bold. The two runs from Budapest perform
well for the bicycles (1) and motorbikes (4) topics, respectively, but have poor results for all
other topics.
Table 4: Results from the ImageCLEF 2007 object retrieval task with the normal relevance infor-
mation. All values have been multiplied by 100 to make the table more readable. The numbers in
the top row refer to the class IDs (see Table 1). The MAP over all classes is in the last column.
The highest AP per class is shown in bold.
 rank   run id                              1      2       3      4      5      6      7      8       9       10      avg
    1   HUTCIS_SVM_FULLIMG_ALL           21.3    1.5    28.1    0.3    0.0    0.2    0.2    0.8    21.0      0.8      7.4
    2   HUTCIS_SVM_FULLIMG_IP+SC         10.2    1.2    25.8    0.2    0.0    0.2    0.1    1.4    20.3      1.6      6.1
    3   HUTCIS_SVM_FULLIMG+BB            13.0    1.5    11.4    0.1    0.0    0.4    0.1    0.6    22.4      1.1      5.1
    4   HUTCIS_SVM_FULLIMG_IP             9.3    1.3    23.6    0.1    0.0    0.1    0.1    2.6     3.0      1.2      4.1
    5   MSRA-MSRA_RuiSp                   2.5    1.9     7.9    3.5    0.9    0.0    0.3    0.7     2.1     13.7      3.4
    6   HUTCIS_SVM_BB_BAL_IP+SC           4.9    1.3     2.4    0.0    0.0    1.5    0.0    0.1    10.4      0.4      2.1
    7   HUTCIS_PICSOM1                    3.2    1.3    13.5    0.1    0.0    0.1    0.2    0.0     0.4      0.5      1.9
    8   HUTCIS_PICSOM2                    1.7    1.4    13.2    0.1    0.0    0.0    0.4    0.0     0.6      0.5      1.8
    9   HUTCIS_SVM_BB_ALL                 5.8    1.6     2.4    0.0    0.0    1.4    0.0    0.1     4.4      0.3      1.6
   10   HUTCIS_SVM_BB_BB_IP+SC            5.2    1.3     2.8    0.1    0.0    0.8    0.0    0.1     4.4      0.4      1.5
   11   HUTCIS_SVM_BB_FULL_IP+SC          8.1    1.7     1.5    0.1    0.0    1.0    0.0    0.1     2.4      0.3      1.5
   12   RWTHi6-HISTO-PASCAL               0.3    2.7     1.7    0.1    0.0    0.1    0.2    0.4     0.1      9.4      1.5
   13   HUTCIS_SVM_BB_BB_IP               4.0    0.7     1.2    0.1    0.0    0.1    0.0    0.2     2.1      0.3      0.9
   14   MSRA-MSRA-VLM_8_8_640_ful         0.6    0.2     1.2    0.3    0.0    0.0    0.0    2.0     0.1      3.6      0.8
   15   HUTCIS_SVM_BB_BAL_IP              3.9    1.6     1.2    0.1    0.0    0.2    0.0    0.1     0.5      0.3      0.8
   16   MSRA-MSRA-VLM-8-8-800-HT          0.9    0.1     1.1    0.4    0.0    0.0    0.0    0.6     0.3      3.2      0.7
   17   PRIP-PRIP_HSI_ScIvHarris          0.2    0.0     0.6    0.2    2.6    0.2    0.1    0.0     0.0      2.2      0.6
   18   budapest-acad-budapest-acad314    2.3    0.1     0.0    0.5    1.1    0.0    0.0    0.0     0.0      0.1      0.4
   19   HUTCIS_SVM_BB_FULL_IP             0.2    1.2     1.0    0.1    0.0    0.1    0.0    0.1     1.2      0.2      0.4
   20   NTU_SCE_HOI-NTU_SCE_HOI_1         2.0    0.1     1.8    0.0    0.0    0.0    0.0    0.0     0.0      0.1      0.4
   21   PRIP-PRIP_cbOCS_ScIvHarr2         0.1    0.0     0.3    0.0    0.8    0.0    0.0    0.0     1.3      1.0      0.4
   22   budapest-acad-budapest-acad315    0.3    0.0     0.0    1.5    0.0    0.1    0.3    0.0     0.1      0.6      0.3
   23   INAOE-TIA-INAOE_SSAssemble        0.5    0.1     0.1    0.1    0.0    0.3    0.0    0.0     0.0      0.6      0.2
   24   INAOE-TIA-INAOE-RB-KNN+MRFI_ok    0.1    0.1     0.0    0.0    0.0    0.0    0.0    0.0     0.1      0.2      0.1
   25   INAOE-TIA-INAOE-RB-KNN+MRFI       0.1    0.1     0.0    0.0    0.0    0.0    0.0    0.0     0.1      0.2      0.1
   26   INAOE-TIA-INAOE-RB-KNN            0.0    0.0     0.2    0.0    0.0    0.0    0.1    0.0     0.0      0.3      0.1


Table 5: Results from the ImageCLEF 2007 object retrieval task with the additional relevance
information. All values have been multiplied by 100 to make the table more readable. The numbers
in the top row refer to the class IDs (see Table 1). The MAP over all classes is in the last column.
The highest AP per topic is shown in bold.
 rank   run id                             1      2       3      4      5      6      7      8      9      10   avg
    1   HUTCIS_SVM_FULLIMG_ALL           7.2    1.3    14.9    0.4    0.0    0.4    0.1    4.1    2.7     5.8   3.7
    2   HUTCIS_SVM_FULLIMG_IP+SC         3.7    0.7    14.1    0.5    0.0    1.1    0.0    3.5    3.0     6.3   3.3
    3   HUTCIS_SVM_FULLIMG_IP            3.3    0.7    13.0    0.8    0.0    1.4    0.0    4.0    0.8     5.0   2.9
    4   HUTCIS_SVM_FULLIMG+BB            4.7    1.6     6.5    0.7    0.0    0.5    0.0    2.0    3.1     4.2   2.4
    5   HUTCIS_PICSOM1                   1.5    1.2     7.5    0.2    0.0    0.3    0.1    0.4    0.6     5.1   1.7
    6   HUTCIS_PICSOM2                   1.0    1.1     7.4    0.3    0.0    0.2    0.2    0.3    0.6     4.6   1.6
    7   MSRA-MSRA_RuiSp                  1.4    0.7     5.4    1.0    0.3    0.1    0.1    0.3    0.3     5.1   1.5
    8   HUTCIS_SVM_BB_BAL_IP+SC          1.9    2.3     1.1    0.3    0.0    0.7    0.0    0.5    1.8     3.2   1.2
    9   HUTCIS_SVM_BB_BB_IP+SC           1.9    2.2     1.3    0.5    0.1    0.5    0.0    0.6    1.0     3.2   1.1
   10   HUTCIS_SVM_BB_BB_IP              1.5    0.5     0.9    3.8    0.2    0.2    0.0    0.7    0.4     2.9   1.1
   11   HUTCIS_SVM_BB_ALL                2.2    1.3     1.1    0.5    0.0    0.7    0.0    0.7    1.1     3.1   1.1
   12   budapest-acad-budapest-acad314   9.1    0.2     0.0    0.7    0.3    0.1    0.0    0.1    0.0     0.0   1.1
   13   HUTCIS_SVM_BB_FULL_IP+SC         3.0    1.4     0.8    0.2    0.1    0.6    0.0    0.6    1.0     3.0   1.1
   14   RWTHi6-HISTO-PASCAL              0.2    1.0     1.5    0.3    0.0    0.0    0.1    0.2    0.1     5.0   0.8
   15   HUTCIS_SVM_BB_BAL_IP             1.5    0.8     0.7    0.6    0.3    0.2    0.0    0.6    0.2     2.9   0.8
   16   budapest-acad-budapest-acad315   0.2    0.0     0.0    6.2    0.0    0.0    0.1    0.0    0.2     0.1   0.7
   17   HUTCIS_SVM_BB_FULL_IP            0.3    0.8     0.6    0.4    0.1    0.1    0.0    0.8    0.4     2.7   0.6
   18   MSRA-MSRA-VLM_8_8_640_ful        0.3    0.3     1.3    0.1    0.0    0.0    0.0    0.3    0.1     3.2   0.6
   19   MSRA-MSRA-VLM-8-8-800-HT         0.4    0.1     1.1    0.1    0.0    0.0    0.0    0.3    0.1     2.5   0.5
   20   PRIP-PRIP_HSI_ScIvHarris         0.1    0.0     0.4    0.1    1.5    0.1    0.3    0.1    0.1     1.8   0.5
   21   NTU_SCE_HOI-NTU_SCE_HOI_1        1.2    0.2     2.3    0.0    0.0    0.0    0.0    0.0    0.0     0.4   0.4
   22   PRIP-PRIP_cbOCS_ScIvHarr2        0.1    0.0     0.2    0.8    0.4    0.0    0.0    0.1    0.2     1.1   0.3
   23   INAOE-TIA-INAOE_SSAssemble       0.1    0.1     0.2    0.0    0.0    0.3    0.0    0.1    0.1     1.1   0.2
   24   INAOE-TIA-INAOE-RB-KNN           0.1    0.0     0.3    0.1    0.0    0.0    0.0    0.0    0.0     1.0   0.2
   25   INAOE-TIA-INAOE-RB-KNN+MRFI_ok   0.3    0.1     0.1    0.0    0.0    0.0    0.1    0.0    0.0     0.8   0.1
   26   INAOE-TIA-INAOE-RB-KNN+MRFI      0.3    0.1     0.1    0.0    0.0    0.0    0.1    0.0    0.0     0.8   0.1
Table 6: Results from the ImageCLEF 2007 object retrieval task with complete relevance infor-
mation for the whole database. All values have been multiplied by 100 to make the table more
readable. The numbers in the top row refer to the class IDs (see Table 1). The MAP over all
classes is in the last column. The highest AP per class is shown in bold.
 rank    run id                              1     2      3      4     5     6     7     8     9    10   avg
    1    budapest-acad-budapest-acad314   28.3   0.3    0.1    1.8   0.2   0.2   0.0   0.0   0.0   0.0   3.1
    2    HUTCIS_SVM_FULLIMG_ALL            4.1   1.2   10.6    0.4   0.0   0.6   0.1   3.8   0.0   8.3   2.9
    3    HUTCIS_SVM_FULLIMG_IP+SC          2.6   1.0   11.1    1.0   0.0   1.0   0.1   3.2   0.0   8.2   2.8
    4    HUTCIS_SVM_FULLIMG_IP             2.4   1.1   10.3    1.8   0.0   1.1   0.1   3.0   0.0   8.1   2.8
    5    HUTCIS_SVM_FULLIMG+BB             3.0   1.1    4.2    0.6   0.0   0.7   0.1   2.5   0.0   8.6   2.1
    6    budapest-acad-budapest-acad315    0.6   0.0    0.1   18.5   0.0   0.1   0.1   0.0   0.6   0.1   2.0
    7    HUTCIS_SVM_BB_ALL                 1.6   0.9    0.5    0.3   0.0   0.6   0.1   1.5   0.0   8.3   1.4
    8    HUTCIS_SVM_BB_BB_IP+SC            1.4   1.0    0.7    0.3   0.0   0.5   0.1   1.1   0.0   8.4   1.4
    9    HUTCIS_SVM_BB_FULL_IP+SC          2.0   0.8    0.4    0.2   0.0   0.8   0.1   1.1   0.0   8.2   1.3
   10    HUTCIS_PICSOM1                    0.9   0.7    4.5    0.6   0.0   0.3   0.1   0.7   0.0   5.6   1.3
   11    MSRA-MSRA_RuiSp                   0.9   0.5    3.6    0.6   0.7   0.1   0.1   0.4   0.0   6.0   1.3
   12    HUTCIS_SVM_BB_BAL_IP+SC           1.3   0.8    0.5    0.2   0.0   0.5   0.1   0.8   0.0   8.4   1.3
   13    HUTCIS_PICSOM2                    0.8   0.6    4.2    0.5   0.0   0.3   0.1   0.4   0.0   5.4   1.2
   14    HUTCIS_SVM_BB_BB_IP               1.1   0.7    0.4    1.4   0.0   0.3   0.0   1.0   0.0   7.2   1.2
   15    HUTCIS_SVM_BB_BAL_IP              1.1   0.8    0.3    0.3   0.0   0.4   0.1   0.9   0.0   6.9   1.1
   16    HUTCIS_SVM_BB_FULL_IP             0.3   0.9    0.3    0.3   0.0   0.3   0.0   1.1   0.0   6.6   1.0
   17    RWTHi6-HISTO-PASCAL               0.4   0.2    1.4    0.2   0.0   0.1   0.0   0.2   0.0   5.5   0.8
   18    NTU_SCE_HOI-NTU_SCE_HOI_1         1.2   0.7    2.4    0.0   0.0   0.0   0.1   0.1   0.0   0.8   0.5
   19    MSRA-MSRA-VLM_8_8_640_ful         0.4   0.3    0.7    0.1   0.1   0.0   0.0   0.3   0.0   2.5   0.4
   20    MSRA-MSRA-VLM-8-8-800-HT          0.3   0.2    0.5    0.0   0.0   0.1   0.0   0.2   0.1   2.5   0.4
   21    INAOE-TIA-INAOE_SSAssemble        0.1   0.0    0.1    0.0   0.0   0.2   0.0   0.2   0.0   3.2   0.4
   22    PRIP-PRIP_HSI_ScIvHarris          0.1   0.0    0.3    0.1   1.4   0.1   0.0   0.0   0.0   1.5   0.4
   23    INAOE-TIA-INAOE-RB-KNN+MRFI_ok    0.5   0.1    0.6    0.0   0.0   0.2   0.0   0.0   0.0   2.2   0.4
   24    INAOE-TIA-INAOE-RB-KNN+MRFI       0.5   0.1    0.6    0.0   0.0   0.2   0.0   0.0   0.0   2.2   0.4
   25    INAOE-TIA-INAOE-RB-KNN            0.3   0.0    0.5    0.1   0.0   0.1   0.0   0.0   0.0   2.2   0.3
   26    PRIP-PRIP_cbOCS_ScIvHarr2         0.1   0.0    0.1    0.5   0.1   0.0   0.1   0.1   0.0   0.8   0.2



5       Discussion
One issue that should be taken into account when interpreting the results is that 50% of the
evaluated runs are from HUTCIS, so this group had a significant impact on the pools used for
relevance assessment. This effect was further amplified by the fact that the initial HUTCIS runs
(which were used for the pooling) had the queries 4–10 in the wrong order. As a consequence, for
the topics where the runs were in the correct order, HUTCIS had a good chance of having many of
its images in the pools, whereas for the topics where the HUTCIS runs were broken, none of its
images were considered in the pooling. Accordingly, the additional relevance information leads to
significant changes in some of the results. A particularly strong change can be seen for the
HUTCIS runs: the APs of the queries that were in the correct order during pooling are strongly
reduced, whereas those that were in the wrong order are significantly improved, which shows that
the large number of HUTCIS runs had a strong influence on the pooling results.
    The drawback of pooling is visible in the results: the MAPs of some methods increase when the
additional relevance judgements are used. This effect is particularly strong for topics with only
very few relevant images.
    The results clearly show that the task is very difficult and that most current methods are not
yet able to cope well with tasks in which training and test data do not match closely. This means
that, although other image and object retrieval and recognition tasks are very effective at
fostering advances in research, they do not yet pose a fully realistic challenge, since their
training and test data match rather well. A higher variability between test and training data
would make these tasks more realistic.
    The participating methods use a wide variety of techniques: some rely on global image
descriptors, others on local descriptors and discriminative models, which are currently the state
of the art in object recognition and detection.
    For the classes cat (5), cow (6), and sheep (9), the AP values are so small that it is not
obvious which method produces the best results. For cat and sheep this effect is even stronger,
since only 6 and 7 images, respectively, from the whole database are relevant.
    No single method obtained the highest AP values for all classes. Strong outliers in this
respect are the two runs from the Budapest group, which clearly have the best results for the
bicycle and the motorbike class, respectively, but perform very poorly for all other classes.
Since the queries are very dissimilar in terms of (1) image statistics (number of relevant images)
and (2) visual difficulty, the AP values vary strongly among the different topics, and MAP is
therefore not a very reliable measure of the overall retrieval quality here.
    Finally, although a maximum of 1000 images were to be returned for each query, the number of
relevant images exceeded 1000 for two classes (car and person). This implies that the maximum
achievable AP (using the full annotations) is 0.789 for "car" and 0.089 for "person"; the
performance of the best runs for the person query is thus nearly perfect.


6    Conclusion
We presented the object retrieval task of ImageCLEF 2007, the methods of the participating groups,
and the results. The results show that none of the methods really solves the given task, and thus
that, although large advances in object detection and recognition have been achieved over the last
years, many improvements are still necessary to solve realistic tasks at web scale with highly
varied data. It can, however, be observed that some of the methods obtain reasonable results for a
limited set of classes.
    Furthermore, the analysis of the results showed that the use of pooling techniques for
relevance assessment can be problematic if the pools are biased due to erroneous runs or due to
many strongly correlated submissions, as was the case in this evaluation.


Acknowledgements
We would like to thank the CLEF campaign for supporting the ImageCLEF initiative. This work
was partially funded by the European Commission MUSCLE NoE (FP6-507752) and the DFG
(German research foundation) under contract Ne-572/6.
   We would like to thank the PASCAL NoE for allowing us to use the training data of the
PASCAL 2006 Visual object classes challenge as training data for this task, in particular, we
would like to thank Mark Everingham for his cooperation.
   We would like to thank Paul Clough from Sheffield University for support in creating the pools
and Jan Hosang, Jens Forster, Pascal Steingrube, Christian Plahl, Tobias Gass, Daniel Stein,
Morteza Zahedi, Richard Zens, Yuqi Zhang, Markus Nußbaum, Gregor Leusch, Michael Arens,
Lech Szumilas, Jan Bungeroth, David Rybach, Peter Fritz, Arne Mauser, Saša Hasan, and Stefan
Hahn for helping to create relevance assessments for the images.
   The work of the Budapest group was supported by a Yahoo! Faculty Research Grant and by the
grants MOLINGV NKFP-2/0024/2005 and the NKFP-2004 project Language Miner.

