Overview of the ImageCLEF 2007 Object Retrieval Task Thomas DeselaersRWTH , Allan HanburyPRIP , Ville ViitaniemiHUT , András BenczúrBUDAC , Mátyás BrendelBUDAC , Bálint DaróczyBUDAC , Hugo Jair Escalante BalderasINAOE , Theo GeversISLA , Carlos Arturo Hernández GracidasINAOE , Steven C. H. HoiNTU , Jorma LaaksonenHUT , Mingjing LiMSRA , Heidy Marisol Marin CastroINAOE , Hermann NeyRWTH Xiaoguang RuiMSRA Nicu SebeISLA , Julian StöttingerPRIP , Lei WuMSRA RWTH Computer Science Department, RWTH Aachen University, Germany deselaers@cs.rwth-aachen.de PRIP PRIP, Institute of Computer-Aided Automation, Vienna University of Technology, Vienna, Austria {julian,hanbury}@prip.tuwien.ac.at BUDAC Data Mining and Web search Research Group, Informatics Laboratory Computer and Automation Research Institute of the Hungarian Academy of Sciences, Budapest, Hungary {mbrendel, daroczyb, benczur}@ilab.sztaki.hu INAOE TIA Research Group, Computer Science Department, National Institute of Astrophysics, Optics and Electronics, Tonantzintla, Mexico {hmarinc,hugojair,carloshg}@ccc.inaoep.mx ISLA Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands {nicu,gevers}@science.uva.nl HUT Adaptive Informatics Research Centre, Helsinki University of Technology, Finland firstname.lastname@tkk.fi MSRA Microsoft Research Asia, Beijing, China mjli@microsoft.com NTU School of Computer Engineering, Nanyang Technological University, Singapore chhoi@ntu.edu.sg Abstract We describe the object retrieval task of ImageCLEF 2007, give an overview of the methods of the participating groups, and present and discuss the results. The task was based on the widely used PASCAL object recognition data to train object recognition methods and on the IAPR TC-12 benchmark dataset from which images of objects of the ten different classes bicycles, buses, cars, motorbikes, cats, cows, dogs, horses, sheep, and persons had to be retrieved. Seven international groups participated using a wide variety of methods. The results of the evaluation show that the task was very challenging and that different methods for relevance assessment can have a strong influence on the results of an evaluation. Categories and Subject Descriptors H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor- mation Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages General Terms Measurement, Performance, Experimentation Keywords Image retrieval, image classification, performance evaluation 1 Introduction Object class recognition, automatic image annotation, and object retrieval are strongly related tasks. In object recognition, the aim is to identify whether a certain object is contained in an image, in automatic image annotation, the aim is to create a textual description of a given image, and in object retrieval, images containing certain objects or object classes have to be retrieved out of a large set of images. Each of these techniques are important to allow for semantic retrieval from image collections. Over the last year, research in these areas has strongly grown, and it is becoming clear that performance evaluation is a very important step to foster progress in research. Several initiatives create benchmark suites and databases to compare different methods tackling the same problem quantitatively. 
In the last years, evaluation campaign for object detection [10, 9], content-based image re- trieval [5] and image classification [24] have developed. There is however, no task aiming at finding images showing particular object from a larger database. Although this task is extremely similar to the PASCAL visual object classes challenge [10, 9], it is not the same. In the PASCAL object recognition challenge, the probability for an object to be contained in an image is relatively high and the images to train and test the methods are from the same data collection. In realis- tic scenarios, this is not a suitable assumption. Therefore, in the object retrieval task described here, we use the training data that was carefully assembled by the PASCAL NoE with much manual work, and the IAPR TC-12 database which has been created under completely different circumstances as the database from which relevant images are to be retrieved. In this paper, we present the results of the object retrieval task that was arranged as part of the CLEF/ImageCLEF 2007 image retrieval evaluation. This task was conceived as a purely visual task, making it inherently cross-lingual. Once one has a model for the visual appearance of a specific object, such as a bicycle, it can be used to find images of bicycles independently of the language or quality of the annotation of an image. ImageCLEF1 [5] has started within CLEF2 (Cross Language Evaluation Forum) in 2003. A medical image retrieval task was added in 2004 to explore domain–specific multilingual information retrieval and also multi-modal retrieval by combining visual and textual features for retrieval. Since 2005, a medical retrieval and a medical image annotation task are both presented as part of ImageCLEF. In 2006, a general object recognition task was presented to see whether interest in this area existed. Although, only a few groups participated, many groups expressed their interest and encouraged us to create an object retrieval task. In 2007, beside the here described object retrieval task, a photographic retrieval task also using the IAPR TC-12 database [14], a medical image retrieval task [26], and a medical automatic annotation task [26] were arranged in ImageCLEF 2007. 1 http://ir.shef.ac.uk/imageclef/ 2 http://www.clef-campaign.org/ Figure 1: Example images from the PASCAL VOC 2006 training dataset. 2 Task Description The task was defined as a visual object retrieval task. Training data was in the form of annotated example images of ten object classes (PASCAL VOC 2006 data). The task was, after learning from the provided annotated images, to find all images in the IAPR-TC12 database containing the learned objects. The particularity of the task is that the training and test images are not from the same set of images. This makes the task more realistic, but also more challenging. 2.1 Datasets For this task, two datasets were available. As training data, the organisers of the PASCAL Network of Excellence visual object classes challenge kindly agreed that we use the training data they assembled for their 2006 challenge. PASCAL VOC 2006 training data The PASCAL Visual Object Classes (VOC) challenge 2006 training data is freely available on the PASCAL web-page3 and consists of approximately 2600 images, where for each image a detailed description is available which of the ten object classes is visible in which area of the image. Example images from this database are shown in Figure 1 with the corresponding annotation. 
IAPR TC-12 dataset The IAPR TC-12 Benchmark database [15] consists of 20,000 still images taken at locations around the world, comprising an assorted cross-section of images that might, for example, be found in a personal photo collection. It includes pictures of different sports and actions, photographs of people, animals, cities, landscapes, and many other aspects of contemporary life. Some example images are shown in Figure 2. The data is also richly annotated with textual descriptions of the images and various meta-data; we use only the image data for this task.

2.2 Object Retrieval Task

The ten queries correspond to the ten classes of the PASCAL VOC 2006 data: bicycles, buses, cars, motorbikes, cats, cows, dogs, horses, sheep, and persons. For training, only the "train" and "val" sections of the PASCAL VOC database were to be used. For each query, participants were asked to submit a list of 1000 images obtained by their method from the IAPR TC-12 database, ranked from best to worst satisfaction of the query.

3 http://www.pascal-network.org/challenges/VOC/

Figure 2: Example images from the IAPR TC-12 benchmark dataset

2.3 Evaluation Measure

To evaluate the retrieval performance we use the same measures as most retrieval evaluations, such as the other tasks in CLEF/ImageCLEF [14, 26], TREC (http://trec.nist.gov/), and TRECVid (http://www-nlpir.nist.gov/projects/t01v/). The average precision (AP) gives an indication of the retrieval quality for one topic, and the mean average precision (MAP) provides a single-figure measure of quality across recall levels, averaged over all queries. To calculate these measures, it is of course necessary to judge which images are relevant for a given query and which are not.
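To make the measure concrete, the following minimal Python sketch (our own illustration, not the evaluation software used in the campaign) computes the non-interpolated AP for one topic from a ranked result list and a set of relevant images, and the MAP over several topics. Unjudged images are simply treated as not relevant, as in the pooling procedure described below; the topic names and image identifiers in the toy example are made up.

def average_precision(ranked_ids, relevant_ids):
    # Non-interpolated AP: mean of the precision values at the ranks of relevant images;
    # relevant images that are never retrieved contribute zero.
    relevant_ids = set(relevant_ids)
    hits, precisions = 0, []
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs, qrels):
    # runs: topic -> ranked list of image ids; qrels: topic -> set of relevant image ids.
    return sum(average_precision(runs[topic], qrels[topic]) for topic in qrels) / len(qrels)

# Toy example with two topics and five retrieved images each.
runs = {"bicycle": ["im3", "im7", "im1", "im9", "im2"],
        "bus": ["im5", "im2", "im8", "im4", "im6"]}
qrels = {"bicycle": {"im3", "im9"}, "bus": {"im8"}}
print(mean_average_precision(runs, qrels))  # ((1/1 + 2/4)/2 + 1/3) / 2, roughly 0.54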
2.4 Relevance Assessments

To find relevant images, we created pools per topic [1] by keeping the top 100 results from all submitted runs, resulting in 1,507 images to be judged per topic on average and a total of 15,007 images to be assessed. The normal relevance judgement process in information retrieval tasks envisages that several users judge each document in question for relevance, and that each image is judged for relevance with respect to the particular query. In this case, judging the relevance is easy enough that we can postulate that any two persons among the judges would come to the same conclusion, and therefore each image was judged by only one judge. Furthermore, since only 10 queries were to be judged and the concepts to find are simple, the judges complained about the task being too easy and boring. Therefore, after approximately 3000 images had been judged, we allowed the judges to additionally specify whether an image is relevant with respect to any of the other queries. The whole judgement process was performed over a web interface which was quickly created, and everybody from the RWTH Aachen University Human Language Technology and Pattern Recognition Group and from the Vienna University of Technology PRIP group was invited to judge images. Thus, most of the judges are computer science students and researchers with a Human Language Technology background. Note that in the pooling process all images that are not judged are automatically considered not relevant. The web interface is shown in Figure 3 to give an impression of the process. On each page, 10 images are shown, and the judge has to decide whether a particular object is present in these images or not. To reduce boredom for the judges, they were also allowed (and encouraged) to specify whether other object classes are present in the images. The judges were told to be rather lenient regarding the relevance of an image, e.g. to consider sheep-like animals such as llamas to be sheep and to consider tigers and other non-domestic cats to be cats. Furthermore, Ville Viitaniemi from the HUTCIS group judged all 20,000 images with respect to relevance for all of the topics, using a stricter definition of relevance.

Table 1: Results from the relevance judgement process. Column 3 shows the number of relevant images when standard pooling is used, column 4 when the additional class information is taken into account. Column 5 shows the results of the relevance judgement of all 20,000 images.

class id   class       relevant (normal)   relevant (w/ additional)   relevant (full database)
1          bicycle      66                  254                        655
2          bus          23                   69                        218
3          car         200                  522                       1268
4          motorbike     7                   28                         86
5          cat           5                   18                          7
6          cow           7                   23                         49
7          dog           9                   22                         72
8          horse        13                   94                        175
9          sheep         5                   42                          6
10         person      554                 3939                      11248

Results from the Relevance Judgements

Table 1 gives an overview of how many images were found to be relevant for each of the given topics. It can be observed that there are far more relevant images for the person topic than for any other topic. From these numbers it can be seen that the task at hand is really challenging for most of the classes and very similar to the proverbial search for a needle in a haystack. In particular, for the sheep topic only 0.03% of the images in the database are relevant, although some more images were judged to be relevant in the additional relevance information. If only the data from the conventional pooling process is considered, for eight of the ten classes less than a thousandth of the images are relevant. The discrepancy between the results with additional relevance judgements and the results with full relevance judgement is due to different relevance criteria: the judges who created the additional relevance information were instructed to judge images that show sheep-like animals such as llamas as relevant, and to judge tigers as cats, whereas the full relevance judgement was stricter in this respect. Table 1 also shows that the additional relevance judgements found more cats and sheep than are actually in the database.

3 Methods

Seven international groups from academia participated in the task and submitted a total of 38 runs. The group with the highest number of submissions had 13 submissions. In the following sections, the methods of the groups are explained (in alphabetical order) and references to further work are given.

3.1 Budapest methods

Authors: Mátyás Brendel, Bálint Daróczy, and András Benczúr
Affiliation: Data Mining and Web Search Research Group, Informatics Laboratory, Computer and Automation Research Institute of the Hungarian Academy of Sciences
Email: {mbrendel, daroczyb, benczur}@ilab.sztaki.hu

Figure 3: The relevance judgement web-interface.

3.1.1 budapest-acad315

The task of object retrieval is to classify objects found in images, i.e. to find an object in an image which is similar to the sample objects in the pre-classified images. There are two problems with this task: the first is how to model objects, the second is how to measure the similarity of objects. Our first answer to the first question is to model objects with image segments.
Segment-, region- or blob-based image similarity is a common method in content-based image retrieval, see for example [4, 27, 2, 21]. Accordingly, the basis of our first method is to find segments in the query image which are similar to the objects in the pre-classified images. The image is then classified into the class for which we find the most similar segment in the query image. Image segmentation in itself is a widely researched and still open problem. We used an image segmenter developed by our group to extract segments from the query images. Our method is based on a graph-based algorithm developed by Felzenszwalb and Huttenlocher [11]. We implemented a pre-segmentation method to reduce the computational time and use a different smoothing technique. All images were scaled to a fixed resolution, and Gaussian smoothing helped us cut down high-frequency noise; because of the efficiency of the OpenCV implementation (http://www.intel.com/technology/computing/opencv/), we did not implement our own resizing or Gaussian smoothing algorithms. As pre-segmentation we built a three-level Gaussian-Laplacian pyramid to define initial pixel groups. The original pyramid-based method, which also considers the connection between pixels on different levels, was modified to eliminate the so-called blocking problem. We used the brightness difference to measure the distance between pixels:

diffY(P_1, P_2) = 0.3 * |R_{P_2} - R_{P_1}| + 0.59 * |G_{P_2} - G_{P_1}| + 0.11 * |B_{P_2} - B_{P_1}|    (1)

After pre-segmentation, we had segments of at most 16 pixels. To detect complex segments, we modified the original graph-based method by Felzenszwalb and Huttenlocher [11] with an adaptive threshold system using Euclidean distance to prefer larger regions over small regions of the image. Felzenszwalb and Huttenlocher define an undirected graph G = (V, E) where each v_i in V corresponds to a pixel in the image, and the edges in E connect certain pairs of neighbouring pixels. This graph-based representation of the image reduces the original problem to a graph-cutting problem. They give a very efficient, linear algorithm that yields a result close to the optimal normalized cut, which is an NP-complete graph problem [11, 29].

Algorithm 1: Segmentation(I_src, τ1, τ2)
τ1 and τ2 are threshold functions. Let I_2 be the source image and I_1, I_0 the down-scaled images. Let P(x, y, i) be the pixel P(x, y) in the image on level i (I_i). Let G = (V, E) be an undirected weighted graph where each v_i in V corresponds to a pixel P(x, y) and each edge (v_i, v_j) has a non-negative weight w(v_i, v_j).

Gaussian-Laplacian pyramid
1. For every P(x, y, 1): Join(P(x, y, 1), P(x/2, y/2, 0)) if τ1 < diffY(P(x, y, 1), P(x/2, y/2, 0))
2. For every P(x, y, 2): Join(P(x, y, 2), P(x/2, y/2, 1)) if τ1 < diffY(P(x, y, 2), P(x/2, y/2, 1))

Graph-based segmentation
1. Compute Max_weight(R) = max_{e ∈ MST(R, E)} w(e) for every coherent group of points R, where MST(R, E) is the minimal spanning tree
2. Compute Co(R) = τ2(R) + Max_weight(R) as the measure of coherence between the points in R
3. Join(R1, R2) if an edge e ∈ E exists such that w(e) < min(Co(R1), Co(R2)), where R1 ∩ R2 = ∅ and w(e) is the weight of the border edge e between R1 and R2
4. Repeat steps 1-3 for every pair of neighbouring groups (R1, R2) as long as it is possible to join two groups
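To illustrate the region-merging test of the graph-based stage (steps 1-3 above), the following small Python sketch evaluates the coherence measure Co(R) and the join condition for two regions. It is our own simplified illustration: regions are assumed to be given by their size and the edge weights of their internal minimum spanning tree, and the threshold function τ2(R) = k/|R| follows the original Felzenszwalb-Huttenlocher formulation rather than the adaptive Euclidean-distance-based variant described above.

def max_internal_weight(mst_edge_weights):
    # Max_weight(R): the largest edge weight in the minimal spanning tree of region R.
    return max(mst_edge_weights) if mst_edge_weights else 0.0

def coherence(mst_edge_weights, region_size, k=300.0):
    # Co(R) = tau2(R) + Max_weight(R); here tau2(R) = k / |R|, so larger regions
    # need a cheaper border edge before they are merged further.
    return k / region_size + max_internal_weight(mst_edge_weights)

def should_join(r1_weights, r1_size, r2_weights, r2_size, border_edge_weight, k=300.0):
    # Join R1 and R2 if the border edge is cheaper than the coherence of both regions.
    return border_edge_weight < min(coherence(r1_weights, r1_size, k),
                                    coherence(r2_weights, r2_size, k))

# Two small neighbouring regions; only the weaker border (weight 12) triggers a merge.
print(should_join([3.0, 5.0], 40, [2.0, 4.0], 25, border_edge_weight=12.0))  # True
print(should_join([3.0, 5.0], 40, [2.0, 4.0], 25, border_edge_weight=30.0))  # False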
Algorithm 1 sometimes does not find relevant parts when low initial thresholds are used. To find the relevant borders that would disappear when the graph-based method is run with high thresholds, we calculated the Sobel gradient image to separate important edges from the remainder.

The similarity of complex objects is usually measured on a feature basis. This means that the similarity of the objects is defined by their similarity in a certain feature space:

dist(S_i, O_j) = d(F(S_i), F(O_j)),  S_i ∈ S, O_j ∈ O    (2)

where S is the set of segments, O is the set of objects, dist is the distance function between objects and segments, d is a distance function in the feature space (usually one of the conventional metrics in n-dimensional real space), and F is the function which assigns features to objects and segments. From each segment we extracted features such as mean colour, size, shape information, and histogram information. As shape information, a 4 × 4 low-resolution variant of the segment (framed in a rectangle with background) was used. Our histograms had 5 bins in each channel. Altogether, a 35-dimensional, real-valued feature vector was extracted for each of the segments. The same features were extracted for the objects in the pre-classified images, taking them as segments. The background and those classes that were not requested were ignored. The features of the objects were written to a file together with the class identifiers, which were extracted from the colour coding. This way we obtained a database of class samples containing the features of objects belonging to the classes. After this, the comparison of the objects of the pre-classified sample images and the segments of the query image was possible. We used the Euclidean distance to measure similarity. The distance of the query image Q was computed as:

dist(Q) = min_{i,j} dist(S_i, O_j),  S_i ∈ S, O_j ∈ O    (3)

where S is the set of segments of image Q and O is the set of pre-classified sample objects. Q is classified into the class of the object which minimizes the distance. The score of an image was computed as:

score(Q) = 1000 / dist(Q)    (4)

where Q is the query image.
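Equations (2)-(4) amount to a nearest-neighbour search between the query segments and the stored class samples in the 35-dimensional feature space. The following Python sketch is our own illustration of the scoring of one query image; the function names and the toy three-dimensional features are made up, and the real feature extraction is assumed to have happened elsewhere.

import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def score_image(query_segments, class_samples):
    # query_segments: feature vectors of the segments S of the query image Q.
    # class_samples: (class label, feature vector) pairs of the pre-classified objects O.
    # Returns the class of the nearest sample and the score 1000 / dist(Q) of Equation (4).
    best_dist, best_class = float("inf"), None
    for segment in query_segments:
        for label, sample in class_samples:
            d = euclidean(segment, sample)
            if d < best_dist:
                best_dist, best_class = d, label
    score = 1000.0 / best_dist if best_dist > 0 else float("inf")
    return best_class, score

# Toy three-dimensional features in place of the 35-dimensional vectors used in the runs.
segments = [[0.2, 0.4, 0.1], [0.8, 0.9, 0.7]]
samples = [("cow", [0.25, 0.35, 0.15]), ("bus", [0.9, 0.9, 0.9])]
print(score_image(segments, samples))  # ('cow', ...)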
3.1.2 budapest-acad314

With our first method (see budapest-acad315) we found that our segments are much smaller than the objects in the pre-segmented images. It would have been possible to obtain larger segments by adjusting the segmentation algorithm, but the resulting segments would still not have been really similar to the annotated objects. We found that our segmentation algorithm could not generate segments similar to the objects in the pre-classified images with any setting of the parameters. Even when we ran our algorithm on the sample images and the segments were approximately of the same size, the segments did not match the pre-classified objects. The reason is that the reference segmentation was made by humans, and algorithmic segmentation is far from capable of the same result. For example, it is almost impossible to write an algorithm which would segment the shape of a human being as one segment if the clothes differ in appearance. However, persons were one of the defined classes, and the sample images contained people with the entire body annotated as one object. Therefore we modified our method. Our second method is still segment-based, but we also segment the sample images. We took the segmented sample images, and if a segment lay at least 80% inside the area of a pre-defined object, we took this segment as a proper sample for that object. This way a set of sample segments was created. After this, the method is similar to the previous one; the only difference is that we have sample segments instead of sample objects, but we treat them in the same way. The features of the segments were extracted and written to a file with the class identifier, which was extracted from the colour codes. After this, the comparison of the segments of the pre-classified images and of the query image was possible. We again used the Euclidean distance to measure similarity, and searched for the segment of the query image closest to a segment of any of the objects:

dist(Q) = min_{i,j} dist(S_i, S_j),  S_i ∈ S, S_j ∈ O    (5)

where S is the set of segments of image Q and O is the set of segments belonging to the pre-classified objects. The image was classified according to the object to which the closest segment belongs. As we expected, this modification improved the algorithm.

3.2 HUTCIS: Conventional Supervised Learning using Fusion of Image Features

Authors: Ville Viitaniemi, Jorma Laaksonen
Affiliation: Adaptive Informatics Research Centre, Laboratory of Computer and Information Science, Helsinki University of Technology, Finland
Email: firstname.lastname@tkk.fi

All our 13 runs, identified by the prefix HUTCIS, implement a similar general system architecture with three system stages:

1. Extraction of a large number of global and semi-global image features. Here we interpret global histograms of local descriptors as one type of global image feature.
2. For each individual feature, conventional supervised classification of the test images using the VOC2006 trainval images as the training set.
3. Fusion of the feature-wise classifier outputs.

By using this architecture, we knowingly ignored the aspect of qualitatively different training and test data. The motivation was to provide a baseline performance level that can be achieved by simply applying a well-working implementation of the conventional supervised learning approach. Table 2, with ROC AUC performances on the VOC 2006 test set, reveals that the performance of our principal run HUTCIS_SVM_FULLIMG_ALL is relatively close to the best performances in last year's VOC evaluation [9]. The last row of the table indicates what the rank of the HUTCIS_SVM_FULLIMG_ALL run would have been among the 19 VOC 2006 participants. The following briefly describes the components of the architecture; for a more detailed description, see e.g. [34].

Features For the different runs, the features are chosen from a set of feature vectors, each with several components. Table 3 lists 10 of the features. Additionally, the available feature set includes interest point SIFT feature histograms with different histogram sizes, and concatenations of pairs, triples and quadruples of the tabulated basic feature vectors. The SIFT histogram bins have been selected by clustering part of the images with the self-organising map (SOM) algorithm.

Classification and fusion The classification is performed either by a C-SVC implementation built around the LIBSVM support vector machine (SVM) library [3], or by a SOM-based classifier [19]. The SVM classifiers (prefix HUTCIS_SVM) are fused together using an additional SVM layer. For the SOM classifiers (prefix HUTCIS_PICSOM), the fusion is based on the summation of normalized classifier outputs.

The different runs Our principal run HUTCIS_SVM_FULLIMG_ALL realizes all three system stages in the best way possible. Other runs use subsets of those image features, inferior algorithms, or are otherwise predicted to be suboptimal.
The run HUTCIS_SVM_FULLIMG_ALL performs SVM classification with all the tabulated features, the SIFT histograms, and twelve previously hand-picked concatenations of the tabulated features, selected on the basis of SOM classifier performance in the VOC2006 task. The runs HUTCIS_SVM_FULLIMG_IP+SC and HUTCIS_SVM_FULLIMG_IP are otherwise similar but use just subsets of the features: SIFT histograms and colour histogram, or just SIFT histograms, respectively. The runs identified by the prefix HUTCIS_SVM_BB are half-baked attempts to account for the different training and test image distributions. These runs are also based on SIFT histogram and colour histogram features. For the training images, the features are calculated from the bounding boxes specified in the VOC2006 annotations; for the test images, the features are calculated for whole images. The different runs with this prefix correspond to different ways of selecting the images used as the basis for SIFT codebook formation. The run HUTCIS_FULLIMG+BB is the rank-based fusion of features extracted from full images and from bounding boxes. The runs HUTCIS_PICSOM1 and HUTCIS_PICSOM2 are otherwise identical but use a different setting of a SOM classifier parameter: HUTCIS_PICSOM2 smooths the feature spaces less, so the detection is based on more local information. Both runs are based on the full set of features mentioned above.

Table 2: ROC AUC performance on the VOC2006 test set

Run id.          bic.   bus    car    cat    cow    dog    horse  mbike  person sheep
FULLIMG ALL      0.921  0.978  0.974  0.930  0.937  0.866  0.932  0.958  0.874  0.941
FULLIMG IP+SC    0.922  0.977  0.974  0.924  0.934  0.851  0.928  0.953  0.865  0.941
FULLIMG IP       0.919  0.952  0.970  0.917  0.926  0.840  0.903  0.943  0.834  0.936
Best in VOC2006  0.948  0.984  0.977  0.937  0.940  0.876  0.927  0.969  0.863  0.956
Rank             7th    4th    3rd    4th    4th    3rd    1st    5th    1st    6th

Table 3: Some of the image features used in the HUTCIS runs

Colour layout
Dominant colour
Sobel edge histogram (4 × 4 tiling of the image)
HSV colour histogram
Average colour (5-part tiling)
Colour moments (5-part tiling)
16 × 16 FFT of edge image
Sobel edge histogram (5-part tiling)
Sobel edge co-occurrence matrix (5-part tiling)

Results The tabulated MAP results of the HUTCIS runs are corrupted by our own slip in the ordering of the queries, which prevented us from fully participating in the pool selection for query topics 4-10. For the normal pooling, our mistake excluded us completely; for the additional relevance judgements, only partly. This is expected to degrade our results considerably for the normal pooling and somewhat less for the additional pool, as participating in the pool selection is known to be essential for obtaining good performance figures. As expected, HUTCIS_SVM_FULLIMG_ALL turned out to be the best of our runs for the uncorrupted query topics 1-3. The idea of extracting features only from the bounding boxes of the objects, in order to reduce the effect of the different backgrounds in the training and test images, usually seemed not to work as well as features extracted from full images, although the results were not very conclusive. This could partly be due to the asymmetry in the feature extraction: the features extracted from the bounding boxes of the training objects were compared with features of the whole test images. The results of the SOM classifier runs did not provide information of general interest, beyond confirming the previously known result that SOM classifiers are inferior to SVMs.
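For illustration, the following Python sketch outlines the two-layer SVM arrangement of the HUTCIS_SVM runs: one probabilistic RBF-kernel SVM per feature type, whose outputs are fused by an additional SVM layer, with the fused scores used to rank the test images for a topic. It is a rough sketch under our own assumptions: scikit-learn stands in for the C-SVC/LIBSVM implementation used by HUTCIS, random arrays stand in for the real features, and the fusion layer is trained on training-set outputs (in practice, held-out predictions would be preferable).

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_test = 200, 50
# Random stand-ins for two feature types of different dimensionality.
features_train = {"sift_hist": rng.random((n_train, 64)), "colour_hist": rng.random((n_train, 32))}
features_test = {"sift_hist": rng.random((n_test, 64)), "colour_hist": rng.random((n_test, 32))}
labels_train = rng.integers(0, 2, n_train)  # 1 = topic object present, 0 = absent

# Stage 2: one probabilistic RBF-kernel SVM per feature.
stacked_train, stacked_test = [], []
for name in features_train:
    clf = SVC(kernel="rbf", probability=True).fit(features_train[name], labels_train)
    stacked_train.append(clf.predict_proba(features_train[name])[:, 1])
    stacked_test.append(clf.predict_proba(features_test[name])[:, 1])

# Stage 3: fusion of the feature-wise outputs by an additional SVM layer.
fusion = SVC(kernel="rbf", probability=True).fit(np.column_stack(stacked_train), labels_train)
scores = fusion.predict_proba(np.column_stack(stacked_test))[:, 1]
print(np.argsort(-scores)[:10])  # test images ranked for this topic, best first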
3.3 INAOE’s Annotation-based object retrieval approaches Authors: Heidy Marisol Marin Castro, Hugo Jair Escalante Balderas, and Carlos Arturo Hernández Gracidas Affiliation: TIA Research Group, Computer Science Department, National Institute of Astrophysics, Optics and Electronics, Tonantzintla, Mexico Email: {hmarinc,hugojair,carloshg}@ccc.inaoep.mx The TIA research group at INAOE, Mexico proposed two methods based on image labeling. Automatic image annotation methods were used for labeling regions within segmented images, and then we performed object retrieval based on the generated annotations. Results were not what we expected, though it can be due to the fact that annotations were defined subjectively and that not enough images were annotated for creating the training set for the annotation method. Two approaches were proposed: a semi-supervised classifier based on unlabeled data and a supervised one, the last method was enhanced with a recent proposed method based on semantic cohesion [8]. Both approaches followed the following steps: 1. Image segmentation 2. Feature extraction 3. Manual labeling of a small subset of the training set 4. Training a classifier 5. Using the classifier for labeling the test-images 6. Using labels assigned to regions images for object retrieval For both approaches the full collection of images was segmented with the normalized cuts algorithm [30]. A set of 30 features were extracted from each region; we considered color, shape and texture attributes. We used our own tools for image segmentation, feature extraction and manual labeling [22]. The considered annotations were the labels of the 10 objects defined for this task. The features for each region together with the manual annotations for each region were used as training set with the two approaches proposed. Each classifier was trained with this dataset and then all of the test images were annotated with such a classifier. Finally, the generated annotations were used for retrieving objects with queries. Queries were created using the labels of the objects defined for this task; and selected as relevant those images with the highest number of regions annotated with the object label. Sample segmented images with their corresponding manual annotations are shown in Figure 4. As we can see the segmentation algorithm works well for some images (isolated cows, close-up of people), however for other objects segmentation is poor (a bicycle, for example). Figure 4: Sample images from the generated training set. . Figure 5: Left: graphical description of the improvement process of MRFI. Right: interpretation of MRFI for a given configuration of labels and regions; (red) line-arcs consider semantic cohesion between labels, while (blue) dashed-arcs consider relevance weight of each label according to k − nn . KNN+MRFI, A supervised approach For the supervised approach we used a simple knn classifier for automatically labeling regions. Euclidean distance was used as similarity function. The label of the nearest neighbor (in the training set) for each test-region was assigned as anno- tation for this region. This was our baseline run (INAOE-TIA-INAOE-RB-KNN ). The next step consisted of improving annotation performance of knn using an approach called (MRFI ) [8] which we recently proposed for improving annotation systems. This approach consists of modeling each image (region-annotations pairs) with a Markov random field (MRF ), introduc- ing semantic knowledge, see Figure 5. 
The top-k most likely annotations for each region are considered. Each of these annotations has a confidence weight related to how likely the label is to be the correct annotation for that region, according to knn. The MRFI approach uses these relevance weights together with semantic information for choosing a unique (the correct) label for each region. The semantic information is considered in the MRF to keep the annotations assigned to the regions of a common image coherent, while the relevance weight takes into account the confidence of the annotation method (knn) in each of the labels, see Figure 5. The (pseudo-)optimal configuration of region-annotation pairs for each image is obtained by minimizing an energy function defined by potentials. For the optimization we used standard simulated annealing. The intuitive idea of the MRFI approach is to guarantee that the labels assigned to the regions are coherent with one another, taking into account semantic knowledge and the confidence of the annotation system. In previous work, the semantic information was obtained from co-occurrences of labels in an external corpus. For this work, however, the semantic association between a pair of labels is given by the normalized number of relevant documents returned by Google for queries generated from the pair of labels. This run is named INAOE-TIA-INAOE-RB-KNN+MRFI, see [8] for details.

SSAssemble: Semi-supervised Weighted AdaBoost The semi-supervised approach consists of using a recently proposed ensemble of classifiers called WSA [22]. Our WSA ensemble uses naive Bayes as its base classifier. A set of these is combined in a cascade based on the AdaBoost technique [13]. Ensemble methods work by combining a set of base classifiers in some way, such as a voting scheme, producing a combined classifier which usually outperforms a single classifier. When training the ensemble of Bayesian classifiers, WSA considers the unlabeled images at each stage. These are annotated based on the classifier from the previous stage and then used to train the next classifier. The unlabeled instances are weighted according to a confidence measure based on their predicted probability value, while the labeled instances are weighted according to the classifier error, as in standard AdaBoost.

Algorithm 2: Semi-supervised Weighted AdaBoost (WSA)
Require: L: labeled instances, U: unlabeled instances, P: training instances, T: iterations
Ensure: final hypothesis H_f = argmax Σ_{t=1}^{T} log(1/B_t), and probabilities P(x_i)
1: W(x_i)_0 = 1/NumInst(L), ∀ x_i ∈ L
2: for t from 1 to T do
3:   W(x_i)_t = W(x_i) / Σ_{i=1}^{N} W(x_i), ∀ x_i ∈ L
4:   h_t = C(L, W(x_i)_t)
5:   e_t = Σ_{i=1}^{N} W(x_i)_t if h_t(x_i) ≠ y_i
6:   if e_t ≥ 0.5 then
7:     exit
8:   end if
9:   if e_t = 0.0 then
10:    e_t = 0.01
11:  end if
12:  B_t = e_t / (1 − e_t)
13:  W(x_i)_{t+1} = W(x_i)_t · B_t if h_t(x_i) = y_i, ∀ x_i ∈ L
14:  P(x_i) = C(L, U, W(x_i)_t)
15:  W(x_i) = P(x_i) · B_t, ∀ x_i ∈ U
16: end for

Our method is based on the supervised multi-class AdaBoost ensemble, which has been shown to be an efficient scheme to reduce the error rate of different classifiers. Formally, the WSA algorithm receives a set of labeled data (L) and a set of unlabeled data (U). An initial classifier NB_1 is built using L. The labels in L are used to evaluate the error of NB_1. As in AdaBoost, the error is used to weight the examples, increasing the weight of the misclassified examples and keeping the weight of the correctly classified examples unchanged. The classifier is used to predict a class for U with a certain probability.
In the case of U , the weights are multiplied by the predicted probability of the majority class. Unlabeled examples with high probability of their predicted class will have more influence in the construction of the next classifier than examples with lower probabilities. The next classifier N B2 is build using the weights and predicted class of L ∪ U . N B2 makes new predictions on U and the error of N B2 on all the examples is used to re–weight the examples. This process continues, as in AdaBoost, for a predefined number of cycles or when a classifier has a weighted error greater or equal to 0.5. As in AdaBoost, new instances are classified using a weighted sum of the predicted class of all the constructed base classifiers. WSA is described in algorithm 2. We faced several problems when performing the annotation image task. The first one was that the training set and the test set were different, so this caused a classification with high error ratio. The second one was due the segmentation algorithm. The automatic segmentation algorithm did not perform well for all images leading to have incorrect segmentation of the objects in the images. The last one concerns to the different criteria for manual labeling of the training set. Due all these facts we did not get good results. We hope improving the annotation task by changing part of the labeling strategy. 3.4 MSRA: Object Retrieval Authors: Mingjing Li, Xiaoguang Rui, and Lei Wu Affiliation: Microsoft Research Asia Email: mjli@microsoft.com Two approaches were adopted by Microsoft Research Asia (MSRA) to perform the object retrieval task in ImageCLEF 2007. One is based on the visual topic model (VTM); the other is the visual language modelling (VLM) method [35]. VTM represents an image by a vector of probabilities that the image belongs to a set of visual topics, and categorizes images using SVM classifiers. VLM represents an image as a 2-D document consisting of visual words, trains a statistical language model for each image category, and classifies an image to the category that generates the image with the highest probability. 3.4.1 VTM: Visual Topic Model Probabilistic Latent Semantic Analysis (pLSA) [12], which is a generative model from the text literature, is adopted to find out the visual topics from training images. Different from traditional pLSA, all training images of 10 categories are put together in the training process and about 100 visual topics are discovered finally. The training process consists of five steps: local feature extraction, visual vocabulary construc- tion, visual topic construction, histogram computation, and classifier training. At first, salient im- age regions are detected using scale invariant interest point detectors such as the Harris-Laplace and the Laplacian detectors. For each image, about 1,000 to 2,000 salient regions are extracted. Those regions are described by the SIFT descriptor which computes a gradient orientation his- togram within the support region. Next, 300 local descriptors are randomly selected from each category and combined together to build a global vocabulary of 3,000 visual words. Based on the vocabulary, images are represented by the frequency of visual words. Then, pLSA is performed to discover the visual topics in the training images. pLSA is also applied to estimate how likely an image belongs to each visual topic. The histogram of the estimated probabilities is taken as the feature representation of that image for classification. 
For multi-class classification problem, we adopt the one-against-one scheme, and train an SVM classifier with RBF kernel for each possible pair of categories. 3.4.2 VLM: Visual Language Modeling The approach consists of three steps: image representation, visual language model training and object retrieval. Each image is transformed into a matrix of visual words. First, an image is simply segmented into 8 x 8 patches, and the texture histogram feature is extracted from each path. Then all patches in the training set are grouped into 256 clusters based on their features. Next, each path cluster is represented using an 8-bit hash code, which is defined as the visual word. Finally, an image is represented by a matrix of visual words, which is called visual document. Visual words in a visual document are not independent to each other, but correlated with other words. To simply the model training, we assume that visual words are generated in the order from left to right, and top to bottom and each word is only conditionally dependent on its immediate top and left neighbors, and train a trigram language model for each image category. Given a test image, it is transformed into a matrix of visual words in the same way, and the probability that it is generated by each category is estimated respectively. Finally, the image categories are ranked in the descending order of these probabilities. 3.5 NTU: Solution for the Object Retrieval Task Authors: Steven C. H. Hoi Affiliation: School of Computer Engineering, Nanyang Technological University, Singapore Email: chhoi@ntu.edu.sg 3.5.1 Introduction Object retrieval is an interdisciplinary research problem between object recognition and content- based image retrieval (CBIR). It is commonly expected that object retrieval can be solved more effectively with the joint maximization of CBIR and object recognition techniques. The goal of this paper is to study a typical CBIR solution with application to the object retrieval tasks [17, 18]. We expected that the empirical study in this work will serve as a baseline for future research when applying CBIR techniques for object recognition. 3.5.2 Overview of Our Solution We study a typical CBIR solution for the object retrieval problem. In our approach, we focus on two key tasks. One is the feature representation, the other is the supervised learning scheme with support vector machines. Feature Representation In our approach, three kinds of global features are extracted to rep- resent an image, including color, shape, and texture. For color, we study the Grid Color Moment feature (GCM). For each image, we split it into 3 × 3 equal grids and extract color moments to represent each of the 9 grids. Three color moments are then computed: color mean, color variance and color skewness in each color channel (H, S, and V), respectively. Thus, an 81-dimensional color moment is adopted as the color feature for each image. For shape, we employ the edge direction histogram. First, an input color image is first converted into a gray image. Then a Canny edge detector is applied to obtain its edge image. Based on the edge images, the edge direction histogram can then be computed. Each edge direction histogram is quantized into 36 bins of 10 degrees each. In addition, we use a bin to count the number of pixels without edge. Hence, a 37-dimensional edge direction histogram is used for shape. For texture, we investigate the Gabor feature. Each image is first scaled to the size of 64 × 64. 
Then, the Gabor wavelet transformation is applied to the scaled image at 5 scale levels and 8 orientations, which results in a total of 40 subimages for each input image. For each subimage, we calculate three statistical moments to represent the texture, including mean, variance, and skewness. Therefore, a 120-dimensional feature vector is used for texture. In total, a 238-dimensional feature vector is used to represent each image. The set of visual features has been shown to be effective for content-based image retrieval in our previous experi- ments [17, 18]. Supervised Learning for Object Retrieval The object retrieval task defined in ImageCLEF 2007 is similar to a relevance feedback task in CBIR, in which a number of possible and negative labeled examples are given for learning. This can be treated as a supervised classification task. To solve it, we employ the support vector machines (SVM) technique for training the classifiers on the given examples [17]. In our experiment, a standard SVM package is used to train the SVM classifier with RBF kernels. The parameters C and γ are best tuned on the VOC2006 training set, in which the training precision is 84.2% for the classification tasks. Finally, we apply the trained classifiers to do the object retrieval by ranking the distances of the objects apart from the classifier’s decision boundary. 3.5.3 Concluding Remarks We described our solution for the object retrieval task in the ImageCLEF 2007 by a typical CBIR solution. We found that the current solution, though it was trained with good performance in an object recognition test-bed, did not achieve promising results in the tough object retrieval tasks. In our future work, several directions can be explored to improve the performance, including local feature representation and better machine learning techniques. 3.6 PRIP: Color Interest Points and SIFT features Authors: Julian Stöttinger1 , Allan Hanbury1 , Nicu Sebe2 , Theo Gevers2 Affiliation: 1 PRIP, Institute of Computer-Aided Automation, Vienna University of Technology, Vienna, Austria 2 Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands Email: {julian,hanbury}@prip.tuwien.ac.at, {nicu,gevers}@science.uva.nl In the field of retrieval, detection, recognition and classification of objects, many state of the art methods use interest point detection at an early stage. This initial step typically aims to find meaningful regions in which descriptors are calculated. Finding salient locations in image data is crucial for these tasks. Most current methods use only the luminance information of the images. This approach focuses on the use of color information in interest point detection and its gain in performance. Based on the Harris corner detector, multi-channel visual information transformed into different color spaces is the basis to extract the most salient interest points. To determine the characteristic scale of an interest point, a global method of investigating the color information on a global scope is used. The two different PRIP-PRIP ScIvHarris approaches differ in the properties of these interest points only. The method consists of the following stages: 1. Extraction of multi-channel based interest points 2. Local descriptions of interest points 3. Estimating the signature of an image 4. Classification Extraction of multi-channel based interest points An extension of the intensity-based Harris detector [16] is proposed in [25]. 
Because of common photometric variations in imaging conditions such as shading, shadows, specularities and object reflectance, the components of the RGB color system are correlated and therefore sensitive to illumination changes. However, in natural images, high contrast changes may appear. Therefore, a color Harris detector in RGB space does not dramatically change the position of the corners compared to a luminance based approach. Normalized rgb overcomes the correlation of RGB and favors color changes. The main drawback, however, is its instability in dark regions. We can overcome this by using quasi invariant color spaces. The approach PRIP-PRIP HSI ScIvHarris uses the HSI color space [32], which is quasi- invariant to shadowing and specular effects. Therefore, changes in lighting conditions in images should not affect the positions of the interest points, resulting in more stable locations. Addition- ally, the HSI color space discriminates between luminance and color. Therefore, much information can be discarded, and the locations get more sparse and distinct. The PRIP cbOCS ScIvHarris approach follows a different idea. As proposed in [33], colors have different occurrence probabilities and therefore different information content. Therefore, rare colors are regarded as more salient than common ones. We wish to find a boosting function so that color vectors having equal information content have equal impact on the saliency function. This transformation can be found analyzing the occurrence probabilities of colors in large image databases. With this change of focus towards rare colors, we aim to discard many repetitive locations and get more stable results on rare features. The characteristic scale of an interest point is chosen by applying a principal component analysis (PCA) on the image and thus finding a description for the correlation of the multi-channel information [31]. The characteristic scale is decided when the Laplacian of Gaussian function of this projection and the Harris energy is a maximum at the same location in the image. The final extraction of these interest points and corresponding scales is done by preferring locations with high Harris energy and huge scales. A maximum number of 300 locations per image has been defined, as over-description diminish the overall recognition ability dramatically. Local descriptions of interest points The scale invariant feature transform (SIFT) [20] showed to give best results in a broad variety of applications [23]. We used the areas of the extracted interest points as a basis for the description phase. SIFT are basically sampled and normalized gradient histograms, which can lead to multiple descriptions per location. This occurs if there is more than one direction of the gradients regarded as predominant. Estimating the signature of an image In this bag of visual features approach [36], we cluster the descriptions of one image to a fixed number of 40 clusters using k-means. The centroids and the proportional sizes of the clusters build the signature of one image having a fixed dimensionality of 40 by 129. Classification The Earth Mover’s Distance (EMD) [28] showed to be a suitable metric for comparing image signatures. It takes the proportional sizes of the clusters into account, which seems to gain a lot of discriminative power. The classification itself is done in the most straight forward way possible: for every object category, the smallest distances to another signature build the classification. 
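As an illustration of the signature-based classification described above, the following Python sketch builds a 40-cluster signature from the local descriptors of an image and assigns a query to the category with the smallest signature distance. It is our own simplified sketch: the descriptors are random stand-ins for real SIFT output, and a crude greedy weighted matching replaces the Earth Mover's Distance used by PRIP.

import numpy as np
from sklearn.cluster import KMeans

def image_signature(descriptors, n_clusters=40):
    # Cluster the local descriptors of one image; the signature consists of the
    # cluster centroids and the proportional cluster sizes.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(descriptors)
    weights = np.bincount(km.labels_, minlength=n_clusters) / len(descriptors)
    return km.cluster_centers_, weights

def signature_distance(sig_a, sig_b):
    # Greedy weighted matching of each centroid in A to its closest centroid in B;
    # a crude stand-in for the Earth Mover's Distance, for illustration only.
    (centroids_a, weights_a), (centroids_b, _) = sig_a, sig_b
    pairwise = np.linalg.norm(centroids_a[:, None, :] - centroids_b[None, :, :], axis=2)
    return float(np.sum(weights_a * pairwise.min(axis=1)))

rng = np.random.default_rng(1)
query = image_signature(rng.random((300, 128)))          # 300 fake 128-d descriptors
category_examples = [image_signature(rng.random((300, 128))) for _ in range(3)]
# Classification: the smallest distance from the query signature to any signature
# of a category decides the category.
print(min(signature_distance(query, sig) for sig in category_examples))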
3.7 RWTHi6: Patch-Histograms and Log-Linear Models Authors: Thomas Deselaers, Hermann Ney Affiliation: Human Language Technology and Pattern Recognition, RWTH Aachen University, Aachen, Germany Email: surname@cs.rwth-aachen.de The approach used by the Human Language Technology and Pattern Recognition group of the RWTH Aachen University, Aachen, Germany, to participate in the PASCAL Visual Object Classes Challenge consists of four steps: 1. patch extraction 2. clustering 3. creation of histograms 4. training of a log-linear model where the first three steps are feature extraction steps and the last is the actual classification step. This approach was first published in [6, 7]. The method follows the promising approach of considering objects to be constellations of parts which offers the immediate advantages that occlusions can be handled very well, that the geometrical relationship between parts can be modelled (or neglected), and that one can focus on the discriminative parts of an object. That is, one can focus on the image parts that distinguish a certain object from other objects. The steps of the method are briefly outlined in the following paragraphs. To model the differ- ence in the training and test data, the first three steps have been done for the training and test data individually, and then the according histograms have been extracted for the respective other, so that once the vocabulary was learnt for the training data and once for the test data and the histograms are created for each using both vocabularies. Results, however show that this seems not to be a working approach to tackle divergence in training and testing data. Patch Extraction Given an image, we extract square image patches at up to 500 image points. Additionally, 300 points from a uniform grid of 15×20 cells that is projected onto the image are used. At each of these points a set of square image patches of varying sizes (in this case 7 × 7, 11 × 11, 21 × 21, and 31 × 31 pixels) are extracted and scaled to a common size (in this case 15 × 15 pixels). In contrast to the interest points from the detector, the grid-points can also fall onto very homogeneous areas of the image. This property is on the one hand important for capturing homogeneity in objects which is not found by the interest point detector and on the other hand it captures parts of the background which usually is a good indicator for an object, as in natural image objects are often found in a “natural” environment. After the patches are extracted and scaled to a common size, a PCA dimensionality reduction is applied to reduce the large dimensionality of the data, keeping 39 coefficients corresponding to the 40 components of largest variance but discarding the first coefficient corresponding to the largest variance. The first coefficient is discarded to achieve a partial brightness invariance. This approach is suitable because the first PCA coefficient usually accounts for global brightness. Clustering The data are then clustered using a k-means style iterative splitting clustering al- gorithm to obtain a partition of all extracted patches. To do so, first one Gaussian density is estimated which is then iteratively split to obtain more densities. These densities are then re- estimated using k-means until convergence is reached and then the next split is done. 
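A compact Python sketch of such an iterative splitting clustering, written by us for illustration (the perturbation constant, iteration counts and toy data are arbitrary): each centroid is split into two slightly perturbed copies and the whole set is re-estimated with k-means until the desired codebook size is reached.

import numpy as np

def kmeans_iterations(data, centroids, n_iter=20):
    # Plain k-means re-estimation: assign each patch to its closest centroid, update means.
    for _ in range(n_iter):
        assign = np.argmin(((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
        for c in range(len(centroids)):
            if np.any(assign == c):
                centroids[c] = data[assign == c].mean(axis=0)
    return centroids

def splitting_clustering(data, n_clusters=8, epsilon=1e-3):
    # Start from a single mean; split every centroid into two perturbed copies and
    # re-run k-means after each split until the desired number of clusters is reached.
    centroids = data.mean(axis=0, keepdims=True)
    while len(centroids) < n_clusters:
        centroids = np.vstack([centroids - epsilon, centroids + epsilon])
        centroids = kmeans_iterations(data, centroids)
    return centroids

# Toy data: 1000 PCA-reduced 39-dimensional patch vectors.
rng = np.random.default_rng(2)
codebook = splitting_clustering(rng.random((1000, 39)), n_clusters=8)
print(codebook.shape)  # (8, 39)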
It has been shown experimentally that results consistently improve up to 4096 clusters, but beyond 4096 clusters the improvement is so small that it is not worth the higher computational demands.

Creation of Histograms Once we have the cluster model, we discard all information for each patch except the identifier of its closest cluster center. For the test data, this identifier is determined by evaluating the Euclidean distance of each patch to all cluster centers. Thus, the clustering assigns a cluster c(x) ∈ {1, ..., C} to each image patch x and allows us to create histograms of cluster frequencies by counting how many of the extracted patches belong to each of the clusters. The histogram representation h(X) with C bins is then determined by counting and normalization such that

h_c(X) = (1/L_X) Σ_{l=1}^{L_X} δ(c, c(x_l)),

where δ denotes the Kronecker delta function, c(x_l) is the closest cluster center to x_l, and x_l is the l-th image patch extracted from image X, from which a total of L_X patches are extracted.

Training & Classification Having obtained this representation by histograms of image patches, we define a decision rule for the classification of images. An approach based on the maximum likelihood of the class-conditional distributions does not take into account the information of competing classes during training. We can use this information by maximizing the class posterior probability Π_{k=1}^{K} Π_{n=1}^{N_k} p(k|X_kn) instead. Assuming a Gaussian density with pooled covariances for the class-conditional distribution, this maximization is equivalent to estimating the parameters of a log-linear or maximum entropy model

p(k|h) = (1/Z(h)) exp( α_k + Σ_{c=1}^{C} λ_kc h_c ),

where Z(h) = Σ_{k=1}^{K} exp( α_k + Σ_{c=1}^{C} λ_kc h_c ) is the renormalization factor. We use a modified version of generalized iterative scaling for training, and Bayes' decision rule for classification.

4 Results

The results for all submissions are given in Table 4 using the normal Qrels and in Table 5 using the Qrels with additional information. Furthermore, in Table 6 the results are presented using the relevance information determined for each image in the database for all classes. All results are given as AP for the individual topics (first 10 columns) and as MAP over all topics (last column). The tables are ordered by MAP. When considering the individual topics, it can be observed that some methods work well for some of the topics but completely fail for others. To highlight this, the highest AP in each column is shown in bold. The two runs from Budapest perform well for the bicycles (1) and motorbikes (4) topics, respectively, but have bad results for all other topics.

Table 4: Results from the ImageCLEF 2007 object retrieval task with the normal relevance information. All values have been multiplied by 100 to make the table more readable. The numbers in the top row refer to the class id's (see Table 1). The MAP over all classes is in the last column. The highest AP per class is shown in bold.
rank run id 1 2 3 4 5 6 7 8 9 10 avg 1 HUTCIS_SVM_FULLIMG_ALL 21.3 1.5 28.1 0.3 0.0 0.2 0.2 0.8 21.0 0.8 7.4 2 HUTCIS_SVM_FULLIMG_IP+SC 10.2 1.2 25.8 0.2 0.0 0.2 0.1 1.4 20.3 1.6 6.1 3 HUTCIS_SVM_FULLIMG+BB 13.0 1.5 11.4 0.1 0.0 0.4 0.1 0.6 22.4 1.1 5.1 4 HUTCIS_SVM_FULLIMG_IP 9.3 1.3 23.6 0.1 0.0 0.1 0.1 2.6 3.0 1.2 4.1 5 MSRA-MSRA_RuiSp 2.5 1.9 7.9 3.5 0.9 0.0 0.3 0.7 2.1 13.7 3.4 6 HUTCIS_SVM_BB_BAL_IP+SC 4.9 1.3 2.4 0.0 0.0 1.5 0.0 0.1 10.4 0.4 2.1 7 HUTCIS_PICSOM1 3.2 1.3 13.5 0.1 0.0 0.1 0.2 0.0 0.4 0.5 1.9 8 HUTCIS_PICSOM2 1.7 1.4 13.2 0.1 0.0 0.0 0.4 0.0 0.6 0.5 1.8 9 HUTCIS_SVM_BB_ALL 5.8 1.6 2.4 0.0 0.0 1.4 0.0 0.1 4.4 0.3 1.6 10 HUTCIS_SVM_BB_BB_IP+SC 5.2 1.3 2.8 0.1 0.0 0.8 0.0 0.1 4.4 0.4 1.5 11 HUTCIS_SVM_BB_FULL_IP+SC 8.1 1.7 1.5 0.1 0.0 1.0 0.0 0.1 2.4 0.3 1.5 12 RWTHi6-HISTO-PASCAL 0.3 2.7 1.7 0.1 0.0 0.1 0.2 0.4 0.1 9.4 1.5 13 HUTCIS_SVM_BB_BB_IP 4.0 0.7 1.2 0.1 0.0 0.1 0.0 0.2 2.1 0.3 0.9 14 MSRA-MSRA-VLM_8_8_640_ful 0.6 0.2 1.2 0.3 0.0 0.0 0.0 2.0 0.1 3.6 0.8 15 HUTCIS_SVM_BB_BAL_IP 3.9 1.6 1.2 0.1 0.0 0.2 0.0 0.1 0.5 0.3 0.8 16 MSRA-MSRA-VLM-8-8-800-HT 0.9 0.1 1.1 0.4 0.0 0.0 0.0 0.6 0.3 3.2 0.7 17 PRIP-PRIP_HSI_ScIvHarris 0.2 0.0 0.6 0.2 2.6 0.2 0.1 0.0 0.0 2.2 0.6 18 budapest-acad-budapest-acad314 2.3 0.1 0.0 0.5 1.1 0.0 0.0 0.0 0.0 0.1 0.4 19 HUTCIS_SVM_BB_FULL_IP 0.2 1.2 1.0 0.1 0.0 0.1 0.0 0.1 1.2 0.2 0.4 20 NTU_SCE_HOI-NTU_SCE_HOI_1 2.0 0.1 1.8 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.4 21 PRIP-PRIP_cbOCS_ScIvHarr2 0.1 0.0 0.3 0.0 0.8 0.0 0.0 0.0 1.3 1.0 0.4 22 budapest-acad-budapest-acad315 0.3 0.0 0.0 1.5 0.0 0.1 0.3 0.0 0.1 0.6 0.3 23 INAOE-TIA-INAOE_SSAssemble 0.5 0.1 0.1 0.1 0.0 0.3 0.0 0.0 0.0 0.6 0.2 24 INAOE-TIA-INAOE-RB-KNN+MRFI_ok 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.2 0.1 25 INAOE-TIA-INAOE-RB-KNN+MRFI 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.2 0.1 26 INAOE-TIA-INAOE-RB-KNN 0.0 0.0 0.2 0.0 0.0 0.0 0.1 0.0 0.0 0.3 0.1 Table 5: Results from the ImageCLEF 2007 object retrieval task with the additional relevance information. All values have been multiplied by 100 to make the table more readable. The numbers in the top row refer to the class id’s (see Table 1). The MAP over all classes is in the last column. The highest AP per topic is shown in bold. 
rank run id 1 2 3 4 5 6 7 8 9 10 avg
1 HUTCIS_SVM_FULLIMG_ALL 7.2 1.3 14.9 0.4 0.0 0.4 0.1 4.1 2.7 5.8 3.7
2 HUTCIS_SVM_FULLIMG_IP+SC 3.7 0.7 14.1 0.5 0.0 1.1 0.0 3.5 3.0 6.3 3.3
3 HUTCIS_SVM_FULLIMG_IP 3.3 0.7 13.0 0.8 0.0 1.4 0.0 4.0 0.8 5.0 2.9
4 HUTCIS_SVM_FULLIMG+BB 4.7 1.6 6.5 0.7 0.0 0.5 0.0 2.0 3.1 4.2 2.4
5 HUTCIS_PICSOM1 1.5 1.2 7.5 0.2 0.0 0.3 0.1 0.4 0.6 5.1 1.7
6 HUTCIS_PICSOM2 1.0 1.1 7.4 0.3 0.0 0.2 0.2 0.3 0.6 4.6 1.6
7 MSRA-MSRA_RuiSp 1.4 0.7 5.4 1.0 0.3 0.1 0.1 0.3 0.3 5.1 1.5
8 HUTCIS_SVM_BB_BAL_IP+SC 1.9 2.3 1.1 0.3 0.0 0.7 0.0 0.5 1.8 3.2 1.2
9 HUTCIS_SVM_BB_BB_IP+SC 1.9 2.2 1.3 0.5 0.1 0.5 0.0 0.6 1.0 3.2 1.1
10 HUTCIS_SVM_BB_BB_IP 1.5 0.5 0.9 3.8 0.2 0.2 0.0 0.7 0.4 2.9 1.1
11 HUTCIS_SVM_BB_ALL 2.2 1.3 1.1 0.5 0.0 0.7 0.0 0.7 1.1 3.1 1.1
12 budapest-acad-budapest-acad314 9.1 0.2 0.0 0.7 0.3 0.1 0.0 0.1 0.0 0.0 1.1
13 HUTCIS_SVM_BB_FULL_IP+SC 3.0 1.4 0.8 0.2 0.1 0.6 0.0 0.6 1.0 3.0 1.1
14 RWTHi6-HISTO-PASCAL 0.2 1.0 1.5 0.3 0.0 0.0 0.1 0.2 0.1 5.0 0.8
15 HUTCIS_SVM_BB_BAL_IP 1.5 0.8 0.7 0.6 0.3 0.2 0.0 0.6 0.2 2.9 0.8
16 budapest-acad-budapest-acad315 0.2 0.0 0.0 6.2 0.0 0.0 0.1 0.0 0.2 0.1 0.7
17 HUTCIS_SVM_BB_FULL_IP 0.3 0.8 0.6 0.4 0.1 0.1 0.0 0.8 0.4 2.7 0.6
18 MSRA-MSRA-VLM_8_8_640_ful 0.3 0.3 1.3 0.1 0.0 0.0 0.0 0.3 0.1 3.2 0.6
19 MSRA-MSRA-VLM-8-8-800-HT 0.4 0.1 1.1 0.1 0.0 0.0 0.0 0.3 0.1 2.5 0.5
20 PRIP-PRIP_HSI_ScIvHarris 0.1 0.0 0.4 0.1 1.5 0.1 0.3 0.1 0.1 1.8 0.5
21 NTU_SCE_HOI-NTU_SCE_HOI_1 1.2 0.2 2.3 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.4
22 PRIP-PRIP_cbOCS_ScIvHarr2 0.1 0.0 0.2 0.8 0.4 0.0 0.0 0.1 0.2 1.1 0.3
23 INAOE-TIA-INAOE_SSAssemble 0.1 0.1 0.2 0.0 0.0 0.3 0.0 0.1 0.1 1.1 0.2
24 INAOE-TIA-INAOE-RB-KNN 0.1 0.0 0.3 0.1 0.0 0.0 0.0 0.0 0.0 1.0 0.2
25 INAOE-TIA-INAOE-RB-KNN+MRFI_ok 0.3 0.1 0.1 0.0 0.0 0.0 0.1 0.0 0.0 0.8 0.1
26 INAOE-TIA-INAOE-RB-KNN+MRFI 0.3 0.1 0.1 0.0 0.0 0.0 0.1 0.0 0.0 0.8 0.1

Table 6: Results from the ImageCLEF 2007 object retrieval task with complete relevance information for the whole database. All values have been multiplied by 100 to make the table more readable. The numbers in the top row refer to the class IDs (see Table 1). The MAP over all classes is in the last column. The highest AP per class is shown in bold.
rank run id 1 2 3 4 5 6 7 8 9 10 avg
1 budapest-acad-budapest-acad314 28.3 0.3 0.1 1.8 0.2 0.2 0.0 0.0 0.0 0.0 3.1
2 HUTCIS_SVM_FULLIMG_ALL 4.1 1.2 10.6 0.4 0.0 0.6 0.1 3.8 0.0 8.3 2.9
3 HUTCIS_SVM_FULLIMG_IP+SC 2.6 1.0 11.1 1.0 0.0 1.0 0.1 3.2 0.0 8.2 2.8
4 HUTCIS_SVM_FULLIMG_IP 2.4 1.1 10.3 1.8 0.0 1.1 0.1 3.0 0.0 8.1 2.8
5 HUTCIS_SVM_FULLIMG+BB 3.0 1.1 4.2 0.6 0.0 0.7 0.1 2.5 0.0 8.6 2.1
6 budapest-acad-budapest-acad315 0.6 0.0 0.1 18.5 0.0 0.1 0.1 0.0 0.6 0.1 2.0
7 HUTCIS_SVM_BB_ALL 1.6 0.9 0.5 0.3 0.0 0.6 0.1 1.5 0.0 8.3 1.4
8 HUTCIS_SVM_BB_BB_IP+SC 1.4 1.0 0.7 0.3 0.0 0.5 0.1 1.1 0.0 8.4 1.4
9 HUTCIS_SVM_BB_FULL_IP+SC 2.0 0.8 0.4 0.2 0.0 0.8 0.1 1.1 0.0 8.2 1.3
10 HUTCIS_PICSOM1 0.9 0.7 4.5 0.6 0.0 0.3 0.1 0.7 0.0 5.6 1.3
11 MSRA-MSRA_RuiSp 0.9 0.5 3.6 0.6 0.7 0.1 0.1 0.4 0.0 6.0 1.3
12 HUTCIS_SVM_BB_BAL_IP+SC 1.3 0.8 0.5 0.2 0.0 0.5 0.1 0.8 0.0 8.4 1.3
13 HUTCIS_PICSOM2 0.8 0.6 4.2 0.5 0.0 0.3 0.1 0.4 0.0 5.4 1.2
14 HUTCIS_SVM_BB_BB_IP 1.1 0.7 0.4 1.4 0.0 0.3 0.0 1.0 0.0 7.2 1.2
15 HUTCIS_SVM_BB_BAL_IP 1.1 0.8 0.3 0.3 0.0 0.4 0.1 0.9 0.0 6.9 1.1
16 HUTCIS_SVM_BB_FULL_IP 0.3 0.9 0.3 0.3 0.0 0.3 0.0 1.1 0.0 6.6 1.0
17 RWTHi6-HISTO-PASCAL 0.4 0.2 1.4 0.2 0.0 0.1 0.0 0.2 0.0 5.5 0.8
18 NTU_SCE_HOI-NTU_SCE_HOI_1 1.2 0.7 2.4 0.0 0.0 0.0 0.1 0.1 0.0 0.8 0.5
19 MSRA-MSRA-VLM_8_8_640_ful 0.4 0.3 0.7 0.1 0.1 0.0 0.0 0.3 0.0 2.5 0.4
20 MSRA-MSRA-VLM-8-8-800-HT 0.3 0.2 0.5 0.0 0.0 0.1 0.0 0.2 0.1 2.5 0.4
21 INAOE-TIA-INAOE_SSAssemble 0.1 0.0 0.1 0.0 0.0 0.2 0.0 0.2 0.0 3.2 0.4
22 PRIP-PRIP_HSI_ScIvHarris 0.1 0.0 0.3 0.1 1.4 0.1 0.0 0.0 0.0 1.5 0.4
23 INAOE-TIA-INAOE-RB-KNN+MRFI_ok 0.5 0.1 0.6 0.0 0.0 0.2 0.0 0.0 0.0 2.2 0.4
24 INAOE-TIA-INAOE-RB-KNN+MRFI 0.5 0.1 0.6 0.0 0.0 0.2 0.0 0.0 0.0 2.2 0.4
25 INAOE-TIA-INAOE-RB-KNN 0.3 0.0 0.5 0.1 0.0 0.1 0.0 0.0 0.0 2.2 0.3
26 PRIP-PRIP_cbOCS_ScIvHarr2 0.1 0.0 0.1 0.5 0.1 0.0 0.1 0.1 0.0 0.8 0.2

5 Discussion

One issue that should be taken into account when interpreting the results is that 50% of the evaluated runs are from HUTCIS, so this group had a significant impact on the pools for relevance assessment. This effect is further amplified by the fact that the initial runs from HUTCIS (which were used for the pooling) had the queries 4–10 in the wrong order. This means that for the topics where the runs were in the correct order, HUTCIS had a good chance of having many of their images in the pools, whereas for the topics where the HUTCIS runs were broken, none of their images were considered in the pooling. Accordingly, it can be observed that the additional relevance information leads to significant changes in some of the results. A particularly strong change can be seen for the runs from HUTCIS: the APs of the queries which were in the correct order for the pooling process are strongly reduced, whereas those which were in the wrong order are significantly improved, which shows that the large number of runs from HUTCIS had a strong influence on the results of the pooling. The drawbacks of using pooling are visible in the results, as the MAPs of some methods increase when the additional relevance judgements are used. This effect is particularly strong for topics with only very few relevant images. The results clearly show that the task is a very difficult one, and that most recent methods are not yet able to fully cope with tasks where training and test data do not match well.
This means that although the other image and object retrieval and recognition tasks are very valuable for fostering progress in research, they do not yet pose a fully realistic challenge, since their training and test data match closely. A higher variability between the test and the training data would make these tasks more realistic. The participating methods use a wide variety of techniques: some use global image descriptors, others use local descriptors and discriminative models, which are currently the state of the art in object recognition and detection. For the classes cat (5), cow (6), and sheep (9), the AP values are so small that it is not obvious which method produces the best results. For the classes cat and sheep this effect is even stronger, since only 6 and 7 images, respectively, from the whole database are relevant. No single method obtained the highest AP values in all classes. A striking outlier in this respect are the two runs from the Budapest group, which clearly have the best results for the bicycle and the motorbike class, respectively, but perform very poorly for all other classes. The fact that the queries are very dissimilar in terms of (1) image statistics (number of relevant images) and (2) visual difficulty makes the AP values vary strongly among the different topics, and therefore MAP is not a very reliable measure to judge the overall retrieval quality here. Finally, although a maximum of 1000 images were to be returned for each query, the number of relevant images exceeded 1000 for two classes (car and person). This implies that the maximum possible AP (using the full annotations) is 0.789 for “car” and 0.089 for “person”; thus, the best runs for the person query come close to this upper bound.
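To make the evaluation measure and the effect of the 1000-image cutoff explicit, the following Python sketch computes uninterpolated (TREC-style) average precision and MAP as used in pooled retrieval evaluations. It is a simplified illustration, not the official scoring tool, and the relevant-image count R at the end is a hypothetical value chosen only to reproduce an upper bound of roughly 0.09.

```python
def average_precision(ranked, relevant, cutoff=1000):
    """Uninterpolated average precision of a ranked list truncated at `cutoff`;
    `relevant` is the set of all relevant image ids for the topic."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank        # precision at this relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs_per_topic, relevant_per_topic, cutoff=1000):
    """MAP = mean of the per-topic APs (the last column of the tables above)."""
    aps = [average_precision(runs_per_topic[t], relevant_per_topic[t], cutoff)
           for t in relevant_per_topic]
    return sum(aps) / len(aps)

# Upper bound: an ideal run returns only relevant images, so with R relevant
# images and a cutoff of 1000 its AP is min(1000, R) / R.
R = 11200                                        # hypothetical count with R > cutoff
ideal = list(range(R))                           # pretend ids 0..R-1 are all relevant
print(average_precision(ideal, set(ideal)))      # ~0.089 for this choice of R
```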
6 Conclusion

We presented the object retrieval task of ImageCLEF 2007, the methods of the participating groups, and the results. The results show that none of the methods really solves the assigned task, and thus that, although large advances in object detection and recognition have been achieved over the last years, many improvements are still necessary to solve realistic tasks at web scale with a high variety in the data to be analysed. It can, however, be observed that some of the methods are able to obtain reasonable results for a limited set of classes. Furthermore, the analysis of the results showed that the use of pooling techniques for relevance assessment can be problematic if the pools are biased due to erroneous runs or due to many strongly correlated submissions, as was the case in this evaluation.

Acknowledgements

We would like to thank the CLEF campaign for supporting the ImageCLEF initiative. This work was partially funded by the European Commission MUSCLE NoE (FP6-507752) and the DFG (German research foundation) under contract Ne-572/6. We would like to thank the PASCAL NoE for allowing us to use the training data of the PASCAL 2006 Visual Object Classes challenge as training data for this task; in particular, we would like to thank Mark Everingham for his cooperation. We would like to thank Paul Clough from Sheffield University for support in creating the pools, and Jan Hosang, Jens Forster, Pascal Steingrube, Christian Plahl, Tobias Gass, Daniel Stein, Morteza Zahedi, Richard Zens, Yuqi Zhang, Markus Nus̈baum, Gregor Leusch, Michael Arens, Lech Szumilas, Jan Bungeroth, David Rybach, Peter Fritz, Arne Mauser, Saša Hasan, and Stefan Hahn for helping to create relevance assessments for the images. The work of the Budapest group was supported by a Yahoo! Faculty Research Grant and by the grants MOLINGV NKFP-2/0024/2005 and the NKFP-2004 project Language Miner.

References

[1] M. Braschler and C. Peters. CLEF Methodology and Metrics. In C. Peters (Ed.), Cross-language Information Retrieval and Evaluation: Proceedings of the CLEF 2001 Workshop, Lecture Notes in Computer Science 2406, Springer Verlag, pages 394–404, 2002.
[2] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying. IEEE Trans. Pattern Anal. Mach. Intell., 24(8):1026–1038, 2002.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[4] Y. Chen and J. Z. Wang. Image Categorization by Learning and Reasoning with Regions. J. Mach. Learn. Res., 5:913–939, 2004.
[5] P. Clough, H. Müller, and M. Sanderson. Overview of the CLEF Cross-Language Image Retrieval Track (ImageCLEF) 2004. In C. Peters, P. D. Clough, G. J. F. Jones, J. Gonzalo, M. Kluck, and B. Magnini, editors, Multilingual Information Access for Text, Speech and Images: Results of the Fifth CLEF Evaluation Campaign, Lecture Notes in Computer Science, Springer-Verlag, Bath, England, 2005.
[6] T. Deselaers, D. Keysers, and H. Ney. Discriminative Training for Object Recognition using Image Patches. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, San Diego, CA, pages 157–162, June 2005.
[7] T. Deselaers, D. Keysers, and H. Ney. Improving a Discriminative Approach to Object Recognition using Image Patches. In DAGM 2005, Pattern Recognition, 26th DAGM Symposium, number 3663 in Lecture Notes in Computer Science, Vienna, Austria, pages 326–333, August 2005.
[8] H. J. Escalante, M. M. y Gómez, and L. E. Sucar. Word Co-occurrence and MRFs for Improving Automatic Image Annotation. In Proceedings of the 18th British Machine Vision Conference (BMVC 2007), Warwick, UK, September 2007.
[9] M. Everingham, A. Zisserman, C. Williams, and L. V. Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. Technical report, 2006. Available on-line at http://www.pascal-network.org/.
[10] M. Everingham, A. Zisserman, C. K. I. Williams, L. van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dorko, S. Duffner, J. Eichhorn, J. D. R. Farquhar, M. Fritz, C. Garcia, T. Griffiths, F. Jurie, D. Keysers, M. Koskela, J. Laaksonen, D. Larlus, B. Leibe, H. Meng, H. Ney, B. Schiele, C. Schmid, E. Seemann, J. Shawe-Taylor, A. Storkey, S. Szedmak, B. Triggs, I. Ulusoy, V. Viitaniemi, and J. Zhang. The 2005 PASCAL Visual Object Classes Challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment (PASCAL Workshop 05), number 3944 in Lecture Notes in Artificial Intelligence, Southampton, UK, pages 117–176, 2006.
[11] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient Graph-Based Image Segmentation. International Journal of Computer Vision, 59, 2004.
[12] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In International Conference on Computer Vision, Beijing, China, October 2005.
[13] Y. Freund and R. Schapire. Experiments with a New Boosting Algorithm. In International Conference on Machine Learning, pages 148–156, 1996.
[14] M. Grubinger, P. Clough, A. Hanbury, and H. Müller. Overview of the ImageCLEF 2007 Photographic Retrieval Task. In Working Notes of the 2007 CLEF Workshop, Budapest, Hungary, September 2007.
[15] M. Grubinger, P. Clough, H. Müller, and T. Deselaers. The IAPR Benchmark: A New Evaluation Resource for Visual Information Systems. In LREC 06 OntoImage 2006: Language Resources for Content-Based Image Retrieval, Genoa, Italy, May 2006.
[16] C. Harris and M. Stephens. A Combined Corner and Edge Detector. In 4th Alvey Vision Conference, pages 147–151, 1988.
[17] S. C. H. Hoi and M. R. Lyu. A novel log-based relevance feedback technique in content-based image retrieval. In 12th ACM International Conference on Multimedia (MM 2004), New York, NY, USA, pages 24–31, October 2004.
[18] S. C. Hoi, M. R. Lyu, and R. Jin. A unified log-based relevance feedback scheme for image retrieval. IEEE Transactions on Knowledge and Data Engineering, 18(4):509–524, 2006.
[19] J. Laaksonen, M. Koskela, and E. Oja. PicSOM—Self-Organizing Image Retrieval with MPEG-7 Content Descriptions. IEEE Transactions on Neural Networks, Special Issue on Intelligent Multimedia Processing, 13(4):841–853, July 2002.
[20] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2):91–110, 2004.
[21] Q. Lv, M. Charikar, and K. Li. Image similarity search with compact data structures. In CIKM '04: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, ACM Press, New York, NY, USA, pages 208–217, 2004.
[22] H.-M. Marin-Castro, L. E. Sucar, and E. F. Morales. Automatic image annotation using a semi-supervised ensemble of classifiers. To appear in 12th Iberoamerican Congress on Pattern Recognition (CIARP 2007), Lecture Notes in Computer Science, Springer, Viña del Mar, Valparaiso, Chile, 2007.
[23] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis & Machine Intelligence, 27(10):1615–1630, 2005.
[24] P.-A. Moellic and C. Fluhr. ImageEVAL 2006 Official Campaign. Technical report, ImagEVAL, 2006.
[25] P. Montesinos, V. Gouet, and R. Deriche. Differential Invariants for Color Images. In ICPR, page 838, 1998.
[26] H. Müller, T. Deselaers, E. Kim, J. Kalpathy-Cramer, T. M. Deserno, P. Clough, and W. Hersh. Overview of the ImageCLEFmed 2007 Medical Retrieval and Annotation Tasks. In Working Notes of the 2007 CLEF Workshop, Budapest, Hungary, September 2007.
[27] B. G. Prasad, K. K. Biswas, and S. K. Gupta. Region-based image retrieval using integrated color, shape, and location index. Comput. Vis. Image Underst., 94(1-3):193–233, 2004.
[28] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a Metric for Image Retrieval. IJCV, 40(2):99–121, 2000.
[29] J. Shi and J. Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888–905, 2000.
[30] J. Shi and J. Malik. Normalized Cuts and Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, 2000.
[31] J. Stöttinger, A. Hanbury, N. Sebe, and T. Gevers. Do Colour Interest Points Improve Image Retrieval? In ICIP, 2007. To appear.
[32] J. van de Weijer and T. Gevers. Edge and Corner Detection by Photometric Quasi-Invariants. PAMI, 27(4):625–630, 2005.
[33] J. van de Weijer, T. Gevers, and A. Bagdanov. Boosting Color Saliency in Image Feature Detection. PAMI, 28(1):150–156, 2006.
[34] V. Viitaniemi and J. Laaksonen. Improving the accuracy of global feature fusion based image categorisation. In Proceedings of the 2nd International Conference on Semantic and Digital Media Technologies (SAMT 2007), Lecture Notes in Computer Science, Springer, Genova, Italy, December 2007. Accepted.
[35] L. Wu, M. J. Li, Z. W. Li, W. Y. Ma, and N. H. Yu. Visual Language Modeling for Image Classification. In 9th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR '07), Augsburg, Germany, September 2007.
[36] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. IJCV, 73(2):213–238, 2006.