
Overview of the wikipediaMM task at ImageCLEF 2009

Theodora Tsikrika, CWI, Amsterdam, The Netherlands

Theodora.Tsikrika@cwi.nl

Jana Kludas, CUI, University of Geneva, Switzerland

jana.kludas@cui.unige.ch

ImageCLEF's wikipediaMM task provides a testbed for the system-oriented evaluation of multimedia information retrieval from a collection of Wikipedia images. The aim is to investigate retrieval approaches in the context of a large and heterogeneous collection of images (similar to those encountered on the Web) that are searched for by users with diverse information needs. This paper presents an overview of the resources, topics, and assessments of the wikipediaMM task at ImageCLEF 2009, summarises the retrieval approaches employed by the participating groups, and provides a first analysis of the main evaluation results.

Keywords: ImageCLEF, Wikipedia image collection, image retrieval, evaluation

1 Introduction

The wikipediaMM task is an ad-hoc image retrieval task. Its evaluation scenario is similar to the classic TREC ad-hoc retrieval task and the ImageCLEF photo retrieval task: it simulates the situation in which a system knows the set of documents to be searched, but cannot anticipate the particular topic that will be investigated (i.e., topics are not known to the system in advance). Given a multimedia query that consists of a title and one or more sample images describing a user’s multimedia information need, the aim is to find as many relevant images as possible from the (INEX MM) wikipedia image collection. A multi-modal retrieval approach should therefore be able to combine the relevance of different media types into a single ranking that is presented to the user.

The wikipediaMM task differs from other benchmarks in multimedia information retrieval, like TRECVID, in the sense that the textual modality in the wikipedia image collection contains less noise than the speech transcripts in TRECVID. This may be one of the reasons why, both in last year’s task and in INEX Multimedia 2006-2007 (where this image collection was also used), it has proven challenging to outperform the text-only approaches. This year, the aim is to bring the investigation of multi-modal approaches to the forefront of this task by providing a number of resources that support the participants in this research direction.

The paper is organised as follows. First, we introduce the task’s resources: the wikipedia image collection and additional resources, the topics, and the assessments (Sections 2–4). Section 5 presents the approaches employed by the participating groups and Section 6 summarises their main results. Section 7 concludes the paper.

2 Task resources

The resources used for the wikipediaMM task are based on Wikipedia data. The collection is the (INEX MM) wikipedia image collection, which consists of approximately 150,000 JPEG and PNG Wikipedia images provided by Wikipedia users. Each image is associated with user-generated, unstructured alphanumeric metadata in English. These metadata usually contain a brief caption or description of the image, the Wikipedia user who uploaded the image, and the copyright information. These descriptions are highly heterogeneous and of varying length. Further information about the image collection can be found in [4].

Additional resources were also provided to support the participants in their investigations of multi-modal approaches. These resources are:

Image similarity matrix: The similarity matrix for the images in the collection has been constructed by the IMEDIA group at INRIA. For each image in the collection, this matrix contains the list of the top K = 1000 most similar images in the collection together with their similarity scores. The same is given for each image in the topics. The similarity scores are based on the distance between images; therefore, the lower the score, the more similar the images. Further details on the features and distance metric used can be found in [1].

Image classification scores: For each image, the classification scores for the 101 MediaMill concepts have been provided by UvA [3]. The UvA classifier is trained on manually annotated TRECVID video data for concepts selected for the broadcast news domain.

Image features: For each image, the set of the 120D feature vectors that has been used to derive the above image classification scores [ 2 ] has also been made available. Participants can use these feature vectors to custom-build a content-based image retrieval (CBIR) system, without having to pre-process the image collection.

The additional resources are beneficial to researchers who wish to exploit visual evidence without performing image analysis. Of course, participants could also extract their own image features.
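As a minimal sketch of how the provided feature vectors could support a custom CBIR system, the following ranks collection images by their distance to a query image's vector. The Euclidean metric, the function name `rank_by_distance`, and the toy 3-dimensional vectors are assumptions made for illustration only: the task's vectors are 120-dimensional, and the metric behind the IMEDIA similarity matrix is the one described in [1]. As with the similarity matrix, lower scores indicate more similar images.

```python
import math
from typing import Dict, List, Sequence, Tuple


def rank_by_distance(
    query: Sequence[float],
    collection: Dict[str, Sequence[float]],
    k: int = 1000,
) -> List[Tuple[str, float]]:
    """Return the k collection images closest to the query vector.

    Scores are Euclidean distances (an assumed metric), so a lower
    score means a more similar image.
    """
    scored = [
        (image_id, math.dist(query, vector))
        for image_id, vector in collection.items()
    ]
    # Sort by ascending distance; break ties by image id for determinism.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:k]


# Toy 3-dimensional vectors standing in for the 120-dimensional ones.
collection = {
    "img_a": [0.0, 0.0, 0.0],
    "img_b": [1.0, 0.0, 0.0],
    "img_c": [0.0, 3.0, 4.0],
}
top = rank_by_distance([0.0, 0.0, 0.0], collection, k=2)
```

Here `top` holds the two nearest images in ascending order of distance, mirroring the "top K most similar images" layout of the similarity matrix.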

3 Topics

3.1 Topic Format

The topics are descriptions of multimedia information needs that contain textual and visual hints. These multimedia queries consist of a textual part, the query title, and a visual part, one or several example images.

<title> query by keywords
<image> query by image content (one or several)
<narrative> description of the query in which the definitive definition of relevance and irrelevance is given

3.1.1 <title>

The topic <title> simulates a user who does not have (or want to use) example images or other visual constraints. The query expressed in the topic <title> is therefore a text-only query. This profile is likely to fit most users searching digital libraries.

Upon discovering that a text-only query does not produce many relevant hits, a user might decide to add visual hints and formulate a multimedia query.

3.1.2 <image>

The visual hints are example images, which can be taken from outside or inside the wikipedia image collection and can be of any common format. Each topic has at least one example image, but it can have several, e.g., to describe the visual diversity of the topic.

3.1.3 <narrative>

A clear and precise description of the information need is required in order to unambiguously determine whether or not a given document fulfils the given information need. In a test collection this description is known as the narrative. It is the only true and accurate interpretation of a user’s needs. Precise recording of the narrative is important for scientific repeatability: there must exist, somewhere, a definitive description of what is and is not relevant to the user. To aid this, the <narrative> should explain not only what information is being sought, but also the context and motivation of the information need, i.e., why the information is being sought and what work-task it might help to solve.

These different types of information sources (textual terms and visual examples) can be used in any combination. It is up to the systems how to use, combine or ignore this information; the relevance of a result does not directly depend on these constraints, but is decided by manual assessments based on the <narrative>.
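The topic structure described above can be sketched as a small data type. This is an illustration only: the class name `Topic`, the helper `build_query`, the example image filename, and the placeholder narrative are all hypothetical and not part of the task's distributed files; only the three field names mirror the topic elements, and the title "bikes" is one of the example topics mentioned in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Topic:
    """One multimedia topic with its three parts."""
    title: str                                        # <title>: query by keywords
    images: List[str] = field(default_factory=list)   # <image>: example image files
    narrative: str = ""                               # <narrative>: basis for assessment


def build_query(
    topic: Topic, use_text: bool = True, use_images: bool = True
) -> Dict[str, object]:
    """Select which topic parts a retrieval system uses as its query.

    A system may use, combine, or ignore the textual and visual hints.
    """
    return {
        "text": topic.title if use_text else "",
        "images": list(topic.images) if use_images else [],
    }


# Hypothetical topic; the filename and narrative are placeholders.
topic = Topic(
    title="bikes",
    images=["example_1.jpg"],
    narrative="(placeholder narrative)",
)
text_only = build_query(topic, use_images=False)
multimodal = build_query(topic)
```

The two calls correspond to the text-only and multimedia query formulations discussed in this section.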

3.2 Topic Development

The topics in the ImageCLEF 2009 wikipediaMM task have been partly developed by the participants and partly by the organisers. This year the participation in the topic development process was not obligatory, so only 2 of the participating groups submitted a total of 11 candidate topics. The rest of the candidate topics were created by the organisers with the help of the log of an image search engine. After a selection process performed by the organisers, a final list of 45 topics was created.

These final topics range from simple, and thus relatively easy (e.g., “bikes”), to semantic, and hence highly difficult (e.g., “aerial photos of non-artificial landscapes”), with the latter forming the bulk of the topics. Semantic topics typically have a complex set of constraints, need world knowledge, and/or contain ambiguous terms, so they are expected to be challenging for current state-of-the-art retrieval algorithms. We encouraged the participants to use multi-modal approaches since they are more appropriate for dealing with semantic information needs. On average, the 45 topics contain 1.7 images and 2.7 words.

4 Assessments

The wikipediaMM task is an image retrieval task, where an image with its metadata is either relevant or not (binary relevance). We adopted TREC-style pooling of the retrieved images with a pool depth of 50, resulting in pools of between 299 and 802 images, with a mean and median both around 545. The evaluation was performed by the participants of the task within a period of 4 weeks after the submission of runs. The 7 groups that participated in the evaluation process used the same web-based interface as last year, which has also previously been employed in the INEX Multimedia and TREC Enterprise tracks.
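The pooling step can be illustrated with a short sketch: for each topic, the pool is the union of the top-ranked images from every submitted run, cut off at the pool depth. The function name `build_pool` and the run representation (a mapping from topic id to a ranked list of image ids) are assumptions for illustration, not the organisers' actual tooling.

```python
from typing import Dict, Iterable, List, Set


def build_pool(
    runs: Iterable[Dict[str, List[str]]], depth: int = 50
) -> Dict[str, Set[str]]:
    """Union the top-`depth` images of every run, separately per topic."""
    pools: Dict[str, Set[str]] = {}
    for run in runs:
        for topic_id, ranking in run.items():
            pools.setdefault(topic_id, set()).update(ranking[:depth])
    return pools


# Two toy runs for one topic, pooled at depth 2 instead of 50.
run_1 = {"t1": ["a", "b", "c"]}
run_2 = {"t1": ["b", "d", "e"]}
pools = build_pool([run_1, run_2], depth=2)
```

Because runs overlap in what they retrieve, the pool for a topic is usually much smaller than the number of runs times the depth, which is consistent with the pool sizes reported above.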

5 Participants

A total of 8 groups submitted 57 runs: CEA (LIC2M-CEA, Centre CEA de Saclay, France), DCU (Dublin City University, School of Computing, Ireland), DEU (Dokuz Eylul University, Department of Computer Engineering, Turkey), IIIT-Hyderabad (Search and Info Extraction Lab, India), LaHC (Laboratoire Hubert Curien, UMR CNRS, France), SZTAKI (Hungarian Academy of Science, Hungary), SINAI (Intelligent Systems, University of Jaen, Spain) and UALICANTE (Software and Computer Systems, University of Alicante, Spain).

DEU (6 runs) Their research interests focussed on 1) the expansion of native documents and queries, term phrase selection based on WordNet, word sense disambiguation (WSD), and WordNet similarity functions, and 2) a new reranking approach with Boolean retrieval and C3M-based clustering.

IIIT-H (1 run) Their system automatically ranks the most similar images to a given textual query using a combination of the Vector Space Model and the Boolean model. The system preprocesses the data set in order to remove the non-informative terms.

LaHC (13 runs) In this second participation, they extended their approach (a multimedia document model defined as a vector of textual and visual terms weighted using tf.idf) by using 1) additional information for the textual part (legend and image bounding text extracted from the original documents), 2) different image detectors and descriptors, and 3) a new text/image combination approach.

SINAI (4 runs) Their approach focussed on query and document expansion techniques based on WordNet. They used the LEMUR toolkit as their retrieval system.

SZTAKI (7 runs) They used both textual and visual features and employed image segmentation, SIFT keypoints, Okapi BM25-based text retrieval, and query expansion by an online thesaurus. They preprocessed the annotation text to remove author and copyright information and biased retrieval towards images with filenames containing relevant terms.

UALICANTE (9 runs) They used IR-n, a retrieval system based on passages, and applied two different term selection strategies for query expansion, Probabilistic Relevance Feedback and Local Context Analysis, together with their multi-modal versions. They also used the same technique for Camel Case decompounding of image filenames that they used in last year’s participation.

6 Results

Next, we analyse the evaluation results. In our analysis, we use only the top 90% of the runs to exclude noisy and buggy results. Furthermore, we excluded 3 runs that we considered to be redundant, i.e., they were produced by the same group and achieved the exact same result, so as to reduce the bias of the analysis.

6.1 Performance per modality for all topics

examined evaluation metrics (MAP, precision at 20, and precision after R (= number of relevant) documents are retrieved).

Modality groups compared: all top 90% runs (46 runs), TXT in top 90% runs (23 runs), and TXTIMG in top 90% runs (23 runs).
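For reference, the three metrics named above can be sketched as follows. These are the standard textbook definitions under binary relevance, not necessarily the exact evaluation tool used for the task; the function names are assumptions, and rankings are assumed to contain no duplicate image ids.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant image."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, image_id in enumerate(ranking, start=1):
        if image_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant images that are never retrieved contribute zero.
    return total / len(relevant)


def precision_at(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k images that are relevant (e.g., k = 20)."""
    return sum(1 for image_id in ranking[:k] if image_id in relevant) / k


def r_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Precision after R images are retrieved, R = number of relevant images."""
    r = len(relevant)
    return precision_at(ranking, relevant, r) if r else 0.0


def mean_average_precision(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """MAP: average precision averaged over all assessed topics."""
    return sum(
        average_precision(run.get(topic, []), rel) for topic, rel in qrels.items()
    ) / len(qrels)


# Toy example: 3 relevant images, 2 of them retrieved.
ranking = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}
ap = average_precision(ranking, relevant)   # (1/1 + 2/3) / 3
p2 = precision_at(ranking, relevant, 2)     # 1 of the top 2 is relevant
rp = r_precision(ranking, relevant)         # 2 of the top 3 are relevant
```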

6.2 Performance per topic and per modality

To analyse the average difficulty of the topics, we classify the topics based on the average MAP (aMAP) values per topic as follows:

easy: aMAP > 0.3
medium: 0.2 < aMAP <= 0.3
hard: 0.1 < aMAP <= 0.2
very hard: aMAP < 0.1
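The thresholds above translate directly into a small classification function. The function name `difficulty` is hypothetical. Note that the stated intervals leave the value aMAP = 0.1 unassigned; treating it as "very hard" below is an assumption made only to keep the function total.

```python
def difficulty(amap: float) -> str:
    """Map a topic's average MAP over all runs to a difficulty class."""
    if amap > 0.3:
        return "easy"
    if amap > 0.2:
        return "medium"
    if amap > 0.1:
        return "hard"
    # The text defines "very hard" as aMAP < 0.1; assigning aMAP == 0.1
    # to this class as well is an assumption.
    return "very hard"


labels = [difficulty(value) for value in (0.35, 0.3, 0.2, 0.05)]
```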

We also analysed the performance of runs that use only text (TXT) versus text and visual resources (TXTIMG). Figure 2 shows the average performance on each topic for all runs, the text-only and text-visual based ones. The text-based runs outperform the text-visual ones in 22 out of the 45 topics. Thus, slightly more than half of the topics profit from a multi-modal approach.

6.3 Visuality of topics

The “visuality” of topics can be deduced from the performance of the text-only and text-visual approaches presented in the previous section. If, for a topic, the text-visual approaches significantly improve the MAP over all runs (e.g., by diff(MAP) >= 0.01), we consider it to be a visual topic. In the same way, we define a topic as textual if the text-only approaches significantly improve the MAP over all runs for that topic. Based on this, 15 of the topics can be characterised as textual and 14 as visual. The remaining 16 topics, where no clear improvements are observed, are considered to be neutral.

Finally, we analyse the effect of applying query expansion (QE) and relevance feedback (FB) techniques. Similarly to the visuality analysis, we consider a technique to be useful for a topic if it significantly improved the MAP over all runs. Table 6 presents the top 10 best performing topics for these techniques and some statistics. Query expansion is useful for 17 topics and relevance feedback for 11. The statistics show that these techniques can help improve the retrieval results for topics defined without too much detail, e.g., topics having a short title (#words/topic) and/or a small number of example images (#images/topic).
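The visual/textual/neutral labelling described in this section can be sketched as follows. The function name `classify_topic` is hypothetical, the 0.01 threshold is the example value quoted above, and checking the visual condition before the textual one is an assumption (when all runs are either text-only or text-visual, at most one of the two groups can exceed the overall average).

```python
from statistics import mean
from typing import List


def classify_topic(
    txt_aps: List[float], txtimg_aps: List[float], threshold: float = 0.01
) -> str:
    """Label one topic from the per-run average precision values.

    txt_aps:    AP of each text-only run on this topic
    txtimg_aps: AP of each text-visual run on this topic
    """
    map_all = mean(txt_aps + txtimg_aps)
    if mean(txtimg_aps) - map_all >= threshold:
        return "visual"
    if mean(txt_aps) - map_all >= threshold:
        return "textual"
    return "neutral"


# Toy per-run values for three hypothetical topics.
visual = classify_topic([0.20, 0.22], [0.30, 0.32])
textual = classify_topic([0.30, 0.30], [0.20, 0.20])
neutral = classify_topic([0.20, 0.20], [0.20, 0.20])
```

The same comparison against the MAP over all runs could be reused for the query expansion and relevance feedback analysis by passing the runs that use the technique as one group and the remaining runs as the other.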

7 Conclusions

This year (similarly to 2008), a text-based approach performed best in the wikipediaMM task, even though highly semantic multimedia topics were developed with the aim of encouraging, and showing the potential of, multi-modal approaches. It is worth noting, though, that all of the participants that submitted both mono-media and multi-modal runs achieved their best results with their multi-modal runs. Additionally, we as organisers are glad to see that more than half of the submitted runs were multi-modal.

8 Acknowledgements

Theodora Tsikrika was supported by the European Union via the European Commission project VITALAS (contract no. 045389).

[1] Marin Ferecatu. Image retrieval with active relevance feedback using both visual and keyword-based descriptors. Ph.D. Thesis, Université de Versailles, France, 2005.

[2] Jan C. van Gemert, Jan-Mark Geusebroek, Cor J. Veenman, Cees G. M. Snoek, and Arnold W. M. Smeulders. Robust scene categorization by learning image statistics in context. In Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, page 105, Washington, DC, USA, 2006. IEEE Computer Society.

[3] Cees G. M. Snoek, Marcel Worring, Jan C. van Gemert, Jan-Mark Geusebroek, and Arnold W. M. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th annual ACM international conference on Multimedia, pages 421-430, New York, NY, USA, 2006. ACM Press.

[4] Thijs Westerveld and Roelof van Zwol. The INEX 2006 multimedia track. In Norbert Fuhr, Mounia Lalmas, and Andrew Trotman, editors, Advances in XML Information Retrieval: 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2006, Revised Selected Papers, volume 4518, pages 331-344. Springer, 2007.