<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2009 Large-Scale Visual Concept Detection and Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefanie Nowak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Dunker</string-name>
          <email>peter.dunker@ieee.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Semantic Audio-Visual Systems, Fraunhofer IDMT</institution>
          ,
          <addr-line>Ilmenau, Germany</addr-line>
        </aff>
      </contrib-group>
      <kwd-group kwd-group-type="ccs-categories">
        <kwd>H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries</kwd>
        <kwd>H.2 [Database Management]: H.2.4 Systems - Multimedia Databases</kwd>
      </kwd-group>
      <kwd-group kwd-group-type="author-keywords">
        <kwd>Image Classification and Annotation</kwd>
        <kwd>Knowledge Structures</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
        <kwd>Benchmark</kwd>
      </kwd-group>
      <abstract>
        <p>The large-scale visual concept detection and annotation task (LS-VCDT) in ImageCLEF 2009 aims at the detection of 53 concepts in consumer photos. These concepts are structured in an ontology which implies a hierarchical ordering and which can be utilized during training and classification of the photos. The dataset consists of 18,000 Flickr photos which were manually annotated with the 53 concepts; 5,000 photos were used for training and 13,000 for testing. Altogether 19 research groups participated and submitted 73 runs. Two evaluation paradigms were applied: evaluation per concept and evaluation per photo. The evaluation per concept was performed by calculating the Equal Error Rate (EER) and the Area Under the Curve (AUC). For the evaluation per photo, a recently proposed hierarchical measure was utilized that takes the hierarchy and the relations of the ontology into account and calculates a score per photo. For the concepts, an average AUC of 84% could be achieved, including concepts with an AUC of 95%. The classification performance for each photo ranged between 69% and 100%, with an average score of 90%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Task Description, Database and Ontology</title>
      <p>
        The focus of LS-VCDT lies on the automatic detection and annotation of concepts in a large
consumer photo collection. It mainly poses two challenges:
1. Can image classifiers scale to the large amount of concepts and data?
2. Can an ontology (hierarchy and relations) help in large-scale annotation?
In this task, the MIR Flickr 25,000 image dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is utilized. This collection consists of 25,000
photos from Flickr under Creative Commons licenses. Most of them contain EXIF data, stored in a
separate text file. We used altogether 18,000 of these photos, annotated them manually with the
defined visual concepts and provided them to the participants.
      </p>
      <p>The training set consists of 5,000 and the test set of 13,000 images of the photo set. All images
have multiple annotations. Most annotations refer to holistic visual concepts and are annotated
at an image-based level. Altogether we provided the annotations for 53 concepts in RDF format
and as plain text files. The visual concepts are organized in a small ontology. Participants could
use the hierarchical order of the concepts and the relations between concepts for solving the
annotation task. The use of additional training data was not allowed, to ensure comparability
among the groups.</p>
      <p>The LS-VCDT is an extension of the former VCDT 2008 in terms of the amount of data
available and the number of concepts to be annotated. In 2008, the database was quite small, with
about 1,800 images for training, 1,000 images for testing and 17 concepts to be detected.</p>
      <p>4
10
19
28
11
20
29
5
14
23
12
21
13
22
30
31
32
6
15
24
33
37
38
39
40
41
42
7
25
16
34
43
8
17
26
35
44
46
47
48
49
50
51
52
The annotation process was realized in three steps. First the annotation of all photos was
performed by several annotators, second a validation step of these annotations was conducted and
third an agreement between different annotators for the same concepts and photos was calculated.</p>
      <p>The annotation of the 18,000 photos was performed by 43 persons from the Fraunhofer IDMT.
The number of photos annotated per person varied between 30 and 2,500 images.
All annotators were provided with a definition of the concepts and example images, with the goal
of enabling consistent annotation amongst the large number of persons. It was important that the
concepts be represented over the whole image. Some of the concepts exclude each other; others
can be depicted simultaneously. One example photo per concept is illustrated in Fig. 1, and a
complete list of all concepts can be found in Table 1. The frequency of each concept in the training
and test sets is also depicted.</p>
      <p>After this first annotation step, a validation of the annotations was performed. Due to the
number of people, the number of photos and the ambiguity of some image contents, the annotations
were not consistent throughout the database. Three persons performed a validation by screening,
for each concept X, a) the photos that were annotated with concept X and b) the photos that were
not annotated with concept X. In the first case they had to delete all annotations for concepts that
were not depicted in the photo and so were wrongly assigned. In the second case the goal was to
find the photos where an annotation for concept X was missing although the concept was visible.</p>
      <p>Additionally, a subset of 100 photos was annotated by 11 different persons. These annotations
are used to calculate an agreement between annotators for different concepts and photos. The
agreement on concepts is illustrated in Table 1. For each photo and each concept, the annotation
of the majority of annotators was regarded as correct, and the percentage of annotators that
annotated correctly is utilized as the agreement factor. This agreement is used in the Hierarchical Score
(HS) as a scaling factor (see Sec. 2.3). In case of a low agreement, the algorithm assumes that the
concept is ambiguous and therefore reduces the costs if the system wrongly assigns this concept.
Regrettably, it was not possible to have each photo annotated by two or three persons to obtain a
validation and an agreement on concepts over the whole set.</p>
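      <p>The majority-vote agreement factor described above can be sketched in a few lines. This is an illustrative sketch only; the vote counts in the example are hypothetical, not taken from the actual 100-photo subset.</p>
      <preformat>
```python
def agreement_factor(votes):
    """Agreement for one (photo, concept) pair: the majority decision is
    regarded as correct, and the fraction of annotators who agree with
    it is returned as the agreement factor (votes are booleans)."""
    yes = sum(votes)
    no = len(votes) - yes
    return max(yes, no) / len(votes)

# Hypothetical example: 8 of 11 annotators tagged the concept as present.
factor = agreement_factor([True] * 8 + [False] * 3)
```
      </preformat>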
      <sec id="sec-1-1">
        <title>Ontology</title>
        <p>In addition to the photos and their annotations, an ontology was provided in which all
concepts are structured. Fig. 2 shows a simple hierarchical organization of a part of the concepts.
The hierarchy allows one to make assumptions about the assignment of concepts to documents: e.g.,
if a photo is classified to contain trees, it also contains plants. Next to the is-a
relationships of the hierarchical organization of concepts, other relationships between concepts
determine possible label assignments. For example, the ontology restricts certain sub-nodes so that only
one of their concepts can be assigned at a time (disjoint concepts), or a special concept (like portrait)
postulates other concepts like persons or animals.</p>
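        <p>A minimal sketch of how such ontology knowledge can be applied as a post-processing step is shown below. The concept names and parent links are illustrative assumptions; the actual ontology was provided in RDF and covers all 53 concepts.</p>
        <preformat>
```python
# Illustrative fragment of an is-a hierarchy and a disjointness rule
# (assumed names, not the task's actual ontology file).
PARENT = {"Trees": "Plants", "Flowers": "Plants", "Plants": None}
DISJOINT = [{"Day", "Night"}]  # at most one concept per group may be assigned

def propagate(labels):
    """Close a label set under the is-a hierarchy: a photo classified
    to contain trees also contains plants."""
    closed = set(labels)
    for label in labels:
        parent = PARENT.get(label)
        while parent is not None:
            closed.add(parent)
            parent = PARENT.get(parent)
    return closed

def violates_disjointness(labels):
    """True if two mutually exclusive concepts are both assigned."""
    return any(len(set(labels) & group) > 1 for group in DISJOINT)
```
        </preformat>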
        <p>The ontology allows the participants to incorporate knowledge in their classification algorithms,
and to make assumptions about which concepts are probable in combination with certain labels.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Evaluation Measures</title>
        <p>
          The evaluation of submissions to LS-VCDT considers two evaluation paradigms. We are interested
in the evaluation per concept and in the evaluation per photo. For the evaluation per concept,
the EER and the AUC of the ROC curves summarize the performance of the individual runs.
The EER is defined as the point where the false acceptance rate of a system is equal to the false
rejection rate. These scores were also used in the VCDT task 2008 and allow comparing the results
of the different groups on the overlapping concepts. The evaluation per photo is assessed with a
recently proposed hierarchical measure [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. It considers partial matches between system output
and ground truth and calculates misclassification costs for each missing or wrongly annotated
concept per image. The score is based on structure information (the distance between concepts in the
hierarchy), relationships from the ontology and the agreement between annotators for a concept.
The calculation of misclassification costs favours systems that annotate a photo with concepts close
to the correct ones over systems whose annotated concepts are far away in the hierarchy
from the correct concepts (e.g., for the single-label classification case depicted in Fig. 2, system
1 incurs lower misclassification costs than system 2).
        </p>
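        <p>The per-concept EER (the operating point where false acceptance and false rejection rates coincide) can be sketched as below by scanning candidate thresholds over the confidence scores. The scores and labels in the example are toy values, not actual run data.</p>
        <preformat>
```python
def equal_error_rate(scores, labels):
    """Estimate the EER of one concept detector:
    FAR = accepted negatives / negatives,
    FRR = rejected positives / positives;
    the EER is read off where the two rates are closest."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    best_gap, eer = 1.0, 1.0
    for t in sorted(set(scores)):
        far = sum(s >= t for s in neg) / len(neg)
        frr = sum(s < t for s in pos) / len(pos)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```
        </preformat>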
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>19 participants submitted results to the LS-VCDT task in altogether 73 runs. The number of runs
was restricted to a maximum of 5 runs per group.</p>
      <p>In Table 2 the results for the evaluation per concept are illustrated. The team with the best
results (ISIS, University of Amsterdam) achieves an EER of 23% and an AUC of 84% on average for
their best run. One run with pseudo-random numbers was added by the organizers. In this case,
for each concept a random number between 0 and 1 was generated that denotes the confidence of
the annotation for the EER/AUC computation and that was rounded to 0 or 1 for the hierarchical
measure per photo. The random numbers achieve an EER and AUC of 50%.</p>
      <p>In Table 3 the results for each concept are summarized. On average the concepts could be
detected with an EER of 23% and an AUC of 84%. A large number of these concepts was
classified best by the ISIS group. It is apparent that the aesthetic concepts (Aesthetic_Impression,
Overall_Quality and Fancy) are classified worst (EER greater than 38% and AUC smaller
than 66%). This is not surprising due to the subjective nature of these concepts, which also
made the groundtruthing very difficult. The best classified concepts are Clouds (AUC: 96%),
Sunset-Sunrise (AUC: 95%), Sky (AUC: 95%) and Landscape-Nature (AUC: 94%).</p>
      <p>In Table 4 the results for the evaluation per photo are summarized. The classification
performance per photo ranges between 69% and 100% with an average of 90%. The best results in terms
of HS were achieved by the XRCE group with an 83% annotation score over all photos. It can be
seen from the table that the ranking of the groups differs from that for the EER/AUC. It seems
that some of the groups took the ontology information into account (at least in a post-processing
step) while others ignored it. Including the annotator agreements does not change the results
substantially: the scores are slightly worse because the measure is stricter, but the ranking of the
groups remains the same.</p>
      <p>This subsection gives a brief overview of the technologies submitted by the participants. Further
information about each approach can be found in the corresponding papers.</p>
      <p>
        The IAM group [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] focuses on visual terms, which are created from low-level features mainly
based on SIFT followed by a codebook quantization. For machine learning, the Cross Language
Latent Indexing method was applied, which maps the concept names and the visual terms into a
semantic space. The classification decision is made by estimating the smallest cosine distance
between concepts and visual terms. In the different runs, a successive expansion of the concept
hierarchy was tried, which resulted in no improvement.
      </p>
      <p>
        The algorithm of the TELECOM ParisTech group [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] was designed especially for the large-scale
scenario, which means low complexity of the processing and easy extension to a variety
of concepts, at the price of a decrease in precision. The algorithm utilizes global visual features and
text features generated out of the 53 visual concepts via PCA. A Canonical Correlation Analysis
is used to capture linear relationships between these different feature spaces.
      </p>
      <p>
        The UPMC/LIP6 group [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] utilizes a simple HSV histogram feature calculated in 3
horizontal segments and a linear kernel SVM for learning.
      </p>
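      <p>The three-band histogram feature can be sketched as follows. This is a hedged reconstruction: the bin count and per-histogram normalisation are assumptions, as the working notes do not specify them.</p>
      <preformat>
```python
def band_histograms(image, bands=3, bins=8):
    """Concatenate per-channel histograms over `bands` horizontal strips.
    `image` is a list of rows, each row a list of (h, s, v) tuples with
    values in [0, 1]; bin count and normalisation are illustrative
    assumptions."""
    n_rows = len(image)
    feature = []
    for b in range(bands):
        strip = image[b * n_rows // bands:(b + 1) * n_rows // bands]
        for channel in range(3):
            hist = [0] * bins
            count = 0
            for row in strip:
                for pixel in row:
                    hist[min(int(pixel[channel] * bins), bins - 1)] += 1
                    count += 1
            feature.extend(h / max(count, 1) for h in hist)
    return feature
```
      </preformat>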
      <p>
        The MRIM-LIG group [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] combines RGB histograms, SIFT and Gabor features. For the
learning phase, different SVM combinations are trained, and the best feature/SVM
setup for each concept, determined a priori, is used.
      </p>
      <p>
        The LSIS group [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] combines different features, e.g. HSV, edge, Gabor or profile entropy
features, and applies a visual dictionary with a visual-word approach.
      </p>
      <p>
        The AVEIR submissions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are from a joint group of the individual participants: Telecom
ParisTech, LSIS, MRIM-LIG and UPMC/LIP6. The AVEIR submissions seem to be equal to
the individual submissions. An efficient and reliable combination or fusion method based on a
carefulness index is discussed only theoretically in the working notes.
      </p>
      <p>
        The KameyamaLab group [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] proposed a system with joint global color and texture features
as well as local features based on saliency regions. Additionally, a gist of scene feature is used.
For the assignment of concept labels, a KNN classifier is applied.
      </p>
      <p>
        The ISIS group [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] applies a system that is based on four main steps.
      </p>
      <sec id="sec-2-1">
        <title>Submitted Approaches (continued)</title>
        <p>First, a sampling strategy is applied that combines a spatial pyramid approach and saliency points detection. Second,
SIFT features are extracted in different color spaces. To reduce the amount of visual features, a
codebook transformation is utilized in the third step and the frequency information of predefined
codewords is used as final feature. The final learning step is based on SVM with χ2 kernel. The
runs differ mainly in the number of SIFT features used and the codebook generation.</p>
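          <p>The codebook transformation in the third step can be sketched as follows; the toy codebook and descriptors are illustrative, and the codebook training itself (e.g. by clustering) is omitted.</p>
          <preformat>
```python
def quantize(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword (squared
    Euclidean distance) and return the normalised codeword frequency
    histogram used as the final image feature."""
    hist = [0] * len(codebook)
    for desc in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(desc, codebook[i])))
        hist[nearest] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]
```
          </preformat>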
        <p>
          The FIRST group [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] used SIFT features on different color channels and pyramid histograms
over color intensities. The SIFT features are combined via the bag-of-words approach. For
classification, an SVM was applied with an average kernel, sparse L1 MKL and non-sparse Lp MKL kernels.
        </p>
        <p>
          The XRCE group [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] uses a set of different features: a GMM image representation, Fisher
vectors, local RGB statistics and SIFT features. The local features are extracted on a multi-level
image grid. For classification, a Sparse Logistic Regression approach was applied. In the
post-processing, the hierarchical structure, disjoint concepts and related concepts were considered.
        </p>
        <p>
          The SZTAKI1 group [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] used SIFT features and a graph-based segmentation algorithm.
Based on the segments, color histograms, shape and DFT features are estimated. The SIFT
features are post-processed with a GMM, and a Fisher kernel is applied to the features derived
from the segmentation. For classification, a binary logistic regression approach is utilized. As
one of only a few groups, SZTAKI applied the connections between concepts in the provided
ontology, among other things, to estimate correlations of co-occurring concepts.
        </p>
        <p>
          The INRIA-LEAR group [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] utilizes a bag-of-features setup with global features, namely a
gist-of-scene descriptor and different color histograms computed in three horizontal regions of the
image, and local SIFT features quantized with k-means. In two runs a weighted nearest-neighbor
tag prediction method is applied, and in two runs an SVM for each concept is used. The fifth run
uses an SVM classifier trained for multi-class separation. The SVM runs performed better than the
tag prediction, whilst the tag prediction was ten times faster. No post-processing based on the ontology
rules was performed; therefore the results for the HS are worse than for the EER.
        </p>
        <p>
          The TIA-INAOE group [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] provided an algorithm based on global features, e.g. color and
edge histograms. As a baseline run, a KNN classifier is used, and the most often appearing concepts
among the top nearest-neighbor training images were assigned. A further label refinement process
concentrates on co-occurrence statistics of the disjoint concepts in the training set.
(Footnote 1: SZTAKI equals the bpacad submissions.)
        </p>
        <p>
          The CVIU I2R group [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] provides a system that utilizes various global and local features,
e.g. color and edge histograms, a color coherence vector, the census transform and different SIFT
features. Furthermore, a local region search algorithm is used to preselect a relevant
bounding box for each concept. In combination with an SVM with χ2 kernel, a feature selection
process is applied to choose the most relevant features for each concept. For the disjoint concepts,
the probabilities were manipulated in order to have a single concept above 0.5.
        </p>
        <p>
          The UAIC group [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] utilizes four modules. First, face detection software was used to
estimate the number of faces in images. Unfortunately, this module breaks the rules of the
LS-VCDT, because the face detector is based on a Viola-Jones detector which was trained with
data that was not provided in this task. The second module concentrates on clustering of training
images, whereby most concepts were set to a score of 0.5 if no decision could be made. The same
process was applied in an EXIF data processing module. The last module sets default values for
disjoint concepts depending on their occurrence in the training data.
        </p>
        <p>
          The MMIS submissions [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] utilize global color histogram, Tamura texture and Gabor
features. The selected features are estimated in nine subregions and concatenated into the overall feature
vector. A non-parametric density estimation is applied, and a baseline approach using global feature
weighting was submitted. The other four submissions use different parameter combinations for
mapping word correlations to semantic similarity. The runs differ in the source of the semantic similarity
space, which was estimated from the training data, Google Web search, WordNet or a Wikipedia
measure. The submission based on the training data achieved the best results.
        </p>
        <p>Summarizing the approaches, some conclusions can be drawn. The groups that used local features like
SIFT achieved better results than the groups relying solely on global features. Most groups that
investigated the concept hierarchy and analyzed, e.g., the correlations between the concepts, could
achieve better ranks when evaluated with the hierarchical measure than with the EER. The information
about computational performance is difficult to compare, because the reported figures range from
72 hours for the complete process to 1 second for training and testing. In the 2010 task, a more
detailed specification for this information is needed.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>This paper summarises the ImageCLEF 2009 LS-VCDT. Its aim was to automatically annotate
photos with 53 concepts in a multilabel scenario. An additionally provided ontology could be
used to enrich the classification systems. The results show that on average the task could be solved
reasonably well, with the best system achieving an AUC of 84% over all photos. Four other groups
achieved an AUC score equal to or above 80%. Evaluated on the concept basis, the concepts could be
annotated on average with an AUC of 84%. In terms of HS, the best system annotated all photos
with an average annotation rate of 83%. Three other systems were very close to these results
with 83%, 82% and 81%. Some of the groups used the ontology for post-processing or to learn
correlations of concepts. No participant integrated the ontology in a reasoning system and tried
to apply this system to the classification task. The large number of concepts and photos posed
no problem for the classification systems.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgment</title>
      <p>This work has been partly supported by grant No. 01MQ07017 of the German research program
THESEUS funded by the Ministry of Economics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ah-Pine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Csurka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>XRCE's Participation in ImageCLEF 2009</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Binder</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kawanabe</surname>
          </string-name>
          .
          <article-title>Fraunhofer FIRST's Submission to ImageCLEF2009 Photo Annotation Task: Non-sparse Multiple Kernel Learning</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Daroczy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Petras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Benczur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fekete</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nemeskey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Siklosi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Weiner</surname>
          </string-name>
          .
          <article-title>SZTAKI @ ImageCLEF 2009</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guillaumin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mensink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          .
          <article-title>INRIA-LEAR's participation to ImageCLEF 2009</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.J.</given-names>
            <surname>Escalante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.A.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.A.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montex</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.E.</given-names>
            <surname>Sucar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Villasenor</surname>
          </string-name>
          .
          <article-title>TIA-INAOE's Participation at ImageCLEF 2009</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fakeri-Tabrizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tollari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Denoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Gallinari</surname>
          </string-name>
          . UPMC/LIP6 at ImageCLEFannotation 2009:
          <article-title>Large Scale Visual Concept Detection and Annotation</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferecatu</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Sahbi</surname>
          </string-name>
          . TELECOM ParisTech at ImageClef 2009:
          <article-title>Large Scale Visual Concept Detection and Annotation Task</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fakeri-Tabrizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferecatu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tollari</surname>
          </string-name>
          , G. Quenot,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sahbi</surname>
          </string-name>
          , E. Dumont, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Gallinari</surname>
          </string-name>
          .
          <article-title>Comparison of Various AVEIR Visual Concept Detectors with an Index of Carefulness</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.S.</given-names>
            <surname>Hare</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.H.</given-names>
            <surname>Lewis</surname>
          </string-name>
          .
          <article-title>IAM@ImageCLEFPhotoAnnotation 2009: Naive application of a linear-algebraic semantic space</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Mark J.</given-names>
            <surname>Huiskes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael S.</given-names>
            <surname>Lew</surname>
          </string-name>
          .
          <article-title>The MIR Flickr Retrieval Evaluation</article-title>
          .
          In
          <source>MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval</source>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iftene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vamanu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Croitoru</surname>
          </string-name>
          .
          <article-title>UAIC at ImageCLEF 2009 Photo Annotation Task</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Llorente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Little</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rüger</surname>
          </string-name>
          .
          <article-title>MMIS at ImageCLEF 2009: Non-parametric Density Estimation Algorithms</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J-P.</given-names>
            <surname>Chevallet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Al Batal</surname>
          </string-name>
          .
          <article-title>MRIM-LIG at ImageCLEF 2009: Photo Retrieval and Photo Annotation tasks</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ngiam</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Goh</surname>
          </string-name>
          .
          <article-title>I2R ImageCLEF Photo Annotation 2009 Working Notes</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nowak</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Lukashevich</surname>
          </string-name>
          .
          <article-title>Multilabel Classification Evaluation using Ontology Information</article-title>
          .
          In
          <source>The 1st Workshop on Inductive Reasoning and Machine Learning on the Semantic Web - IRMLeS 2009, co-located with the 6th Annual European Semantic Web Conference (ESWC)</source>
          , Heraklion, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarin</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Kameyama</surname>
          </string-name>
          .
          <article-title>Joint Contribution of Global and Local Features for Image Annotation</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.E.A.</given-names>
            <surname>van de Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gevers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.W.M.</given-names>
            <surname>Smeulders</surname>
          </string-name>
          .
          <article-title>The University of Amsterdam's Concept Detection System at ImageCLEF 2009</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z-Q.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Dumont</surname>
          </string-name>
          .
          <article-title>LSIS Scale Photo Annotations: Discriminant Features SVM versus Visual Dictionary based on Image Frequency</article-title>
          .
          <source>CLEF working notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>