=Paper=
{{Paper
|id=Vol-1176/CLEF2010wn-ImageCLEF-WanEt2010
|storemode=property
|title=I2R At ImageCLEF Wikipedia Retrieval 2010
|pdfUrl=https://ceur-ws.org/Vol-1176/CLEF2010wn-ImageCLEF-WanEt2010.pdf
|volume=Vol-1176
}}
==I2R At ImageCLEF Wikipedia Retrieval 2010==
Kong-Wah WAN, Yan-Tao ZHENG, Sujoy ROY
Computer Vision and Image Understanding, Institute for Infocomm Research, 1 Fusionopolis Way, Singapore 138632

Abstract

We report on our approaches and methods for the ImageCLEF 2010 Wikipedia image retrieval task. A distinctive feature of this year's image collection is that images are associated with unstructured and noisy textual annotations in three languages: English, French and German. Hence, besides following conventional text-based and multimodal approaches, we also devote some effort to investigating multilingual methods. We submitted a total of six runs along the following three directions: 1. augmenting basic text-based indexing with feature selection (three runs), 2. multimodal retrieval that re-ranks text-based results using visual near-duplicates (VND) (one run), and 3. multilingual fusion that combines results from the three language resources indexed separately (two runs). Our best result (i2rcviu MONOLINGUAL, MAP of 0.2126) comes from the multilingual fusion approach, indicating the promise of exploiting multilingual resources. For our multimodal re-ranking run, we adopt a pseudo-relevance-feedback approach that builds a visual prototype model of each query without the need for any labeled example images. Essentially, we assume that the top-ranked image results from a text baseline retrieval are correct, and proceed to re-rank the result list such that images that are visually similar to the top-ranked images are pushed up the ranks. This VND-based re-ranking is applied to the results of a text baseline (RUN i2rcviu I2R.baseline, MAP of 0.1847) that indexed images using all available annotations. The visual re-ranking run (i2rcviu I2R.VISUAL.NDK) achieves a MAP of 0.1984, a 7% improvement. Led by this encouraging result, we apply our VND re-ranking to the results from the multilingual run, and obtain our best retrieval result (not submitted) of 0.2338.

Keywords: Multimodal Retrieval, Visual re-ranking, Multilingual fusion

1 Introduction

We present our approach and methods for the Wikipedia-MM task of ImageCLEF 2010 [9]. This year, a key distinctive feature of the benchmark image collection is that images are annotated with unstructured and noisy text in three languages: English (EN), French (FR), and German (DE). Hence, apart from conventional text-based and multimodal (visual+text) approaches, we investigated ways to exploit the multilingual nature of the image corpus. We submitted a total of six runs, focusing our effort along the following three main directions.

1.1 Text-based Retrieval

Firstly, we explore feature selection and relevance feedback techniques to enhance text-based retrieval. Three of our six submitted runs come from this line of research. Our motivation is that text methods will continue to be the main contributor to accurate image retrieval; hence, attempting to improve text-based methods naturally forms the bulk of our effort.

1.2 Multimodal Methods

Our second focus is on a multimodal approach. Specifically, we take the returned results of a text-based baseline, make the assumption that most of these results are relevant and correct, and proceed to analyse the top images for visual near-duplicates (VND). VND images are then clustered and re-ranked so that they become closer in rank. This has the effect of improving the ranks of lower-ranked images that are similar to higher-ranked images.
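As a minimal illustration of this re-ranking idea, the sketch below re-scores a text-ranked list by its average visual similarity to the top-ranked (pseudo-relevant) images. The function, parameter names and score-combination rule are ours for illustration only; the actual keypoint-based implementation used in our submitted run is described in Section 2.2.

```python
from typing import Callable, List, Tuple

def vnd_rerank(results: List[Tuple[str, float]],
               visual_sim: Callable[[str, str], float],
               top_k: int = 20,
               alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Re-rank a text-retrieved list of (image_id, text_score) pairs, best first,
    by boosting images that are visually similar to the top-ranked ones."""
    prototype = [img for img, _ in results[:top_k]]  # pseudo-relevant prototype set
    reranked = []
    for img, text_score in results:
        peers = [p for p in prototype if p != img]
        # average visual similarity to the pseudo-relevant prototype set
        vis_score = sum(visual_sim(img, p) for p in peers) / max(1, len(peers))
        # illustrative linear combination of text and visual evidence (alpha is assumed)
        reranked.append((img, (1 - alpha) * text_score + alpha * vis_score))
    return sorted(reranked, key=lambda x: x[1], reverse=True)
```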
We note that, instead of making the weak assumption that the top images are relevant, we could have used the example images that accompany the queries. Nonetheless, we adopt our present method for the following reasons: 1. several researchers [1] have already explored the use of the example query images to build a visual prototype model for the query topic; 2. in real-world scenarios, users are unlikely to provide example query images; and 3. we aim to explore fully automatic methods. Our approach is also closer to the spirit of pseudo relevance feedback (PRF) commonly used in the text community. Owing to lack of time, we submitted only one run for this line of research. In this run, we apply the VND re-ranking method to the results of a simple text baseline. This text baseline indexes the images using whatever textual annotations are present in the XML metadata, ignoring whether those annotations are in EN, DE or FR.

1.3 Exploiting Multilingual Resources

Finally, our third focus, which we believe to be the most novel, is on exploiting multilingual cues. This year's benchmark image collection offers a unique opportunity to examine the comparative advantage of using multiple language resources for image retrieval. Specifically, we build several image indexes based on various combinations of annotation languages. For example, we build three image indexes each based on a single annotation language, i.e. EN-only, DE-only and FR-only. In this case, if an image has only one annotation language, say EN, then its DE and FR index entries will be empty, and a query issued to the DE index or FR index will not return that image. We also build three image indexes based on two annotation languages, i.e. EN-DE, EN-FR and DE-FR. In this case, if an image has only an EN annotation, then to build the EN-DE dual-lingual index we machine-translate the EN text into DE. For lack of time and compute resources, we stop short of building a triple-lingual index. (Note that our default baseline text system is not a triple-lingual index system; it merely uses whatever annotations are present. About 60% of images have annotations in only a single language, 25% have annotations in two languages, and only 10% have annotations in all three languages.)

Because all queries are described in all three languages, we issue each query to an image index using the query text in the appropriate language. For example, we issue the EN query text to the EN-only index, the DE query text to the DE-only index, and so on. Similarly, for the dual-lingual indexes, we form a new query by concatenating the query descriptions in the respective languages. For example, we issue the concatenated English and German query descriptions to the EN-DE index. Results from multiple image indexes can be fused by taking the maximum retrieval confidence. Clearly, there are configurable options for deciding which image indexes to fuse. We submitted two runs along this direction: (1) one that fuses results from the three mono-lingual image indexes, and (2) one that fuses results from the three dual-lingual image indexes. The official evaluation results show that fusing results from the mono-lingual image indexes produces better results than fusing results from the dual-lingual image indexes.

The rest of the paper is organized as follows. Section 2 provides details of our methods and runs, with results from the official evaluation.
Section 3 discusses the results of our submitted runs and presents some post-evaluation results from further experiments. Section 4 concludes the paper with an outlook on future work.

2 Our Methods and Results

2.1 Text-based Retrieval

In all our experiments, we use the Lucene toolkit [2] as the retrieval system. In parsing the XML metadata files, we removed all mark-up tags and retained the main textual data, after stopword removal, as the annotation content. To build the language-specific indexes, the annotation content is further split into its respective languages. The actual query text issued to a retrieval system is a concatenation of the query in the respective languages. Except for the run that explores query expansion, we do not otherwise manipulate the query terms.

2.1.1 Baseline Text Retrieval System – RUN: I2R.baseline

As mentioned earlier, the index for our baseline text retrieval system is built using whatever textual annotations are present in the XML metadata files, ignoring the language of the annotations. A query is composed by concatenating the query description text in all three languages. This submitted run is called i2rcviu I2R.baseline. It obtained a MAP of 0.1847. Figure 1 shows the official result for this run.

Figure 1: Results for our text baseline RUN: i2rcviu I2R.baseline

Compared with the best submitted results, this MAP value is not impressive. One likely reason for the suboptimal retrieval lies in the way we build this text baseline and issue concatenated queries in all three languages. Because only 10% of images have annotations in all three languages, for 90% of the images some terms in the concatenated queries are non-informative. In other words, for the large majority of images there is a language mismatch, which results in noisy retrieval. However, we also note that this situation may be the norm in practical real-world scenarios, where there may not be enough resources to create a separate index for each annotation language. Hence this baseline result can serve as a reference for the multilingual retrieval community.

2.1.2 Feature Selection – RUN: I2R.Feat.selection

Feature selection is a process wherein a subset of the text features (words) is selected for the final text vector representation. The main idea is to discard unimportant, non-informative words, and to retain a smaller subset of words that contribute the most to accurate retrieval [3, 4]. Our feature selection proceeds as follows. First, we build a new corpus by issuing the fully concatenated query descriptions to our baseline index and, for each query, collating the top 1000 results. This gives a new corpus of at most 70K documents. We then perform feature selection on this new corpus, applying an ad hoc combination of feature selection techniques from [3, 4]. Specifically, we use the Term Contribution and Document Frequency metrics to weight each unique term in the new corpus. We then sort the weighted terms and remove the top 15% and bottom 20% of terms. As an extra check, for the words earmarked for removal, we further compute their ESA semantic relatedness [5] with every word in the query text of all 70 queries. If the ESA value of a word is above a threshold, we retain it. The idea is to avoid removing words that turn out to be individually relevant to a particular query. Using the terms that remain after this pruning, we create a new index.
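The following is a minimal sketch of this pruning step, assuming the combined Term Contribution / Document Frequency weights and an ESA relatedness function are computed elsewhere; the threshold value shown is an assumption for illustration, not the value used in our run.

```python
from typing import Callable, Dict, Set

def select_terms(term_weights: Dict[str, float],
                 esa_relatedness: Callable[[str, Set[str]], float],
                 query_terms: Set[str],
                 top_cut: float = 0.15,
                 bottom_cut: float = 0.20,
                 esa_threshold: float = 0.3) -> Set[str]:
    """Drop the highest- and lowest-weighted terms, but rescue any dropped
    term that is still semantically related (via ESA) to some query word."""
    ranked = sorted(term_weights, key=term_weights.get, reverse=True)
    n = len(ranked)
    # terms earmarked for removal: top 15% and bottom 20% by weight
    earmarked = set(ranked[:int(n * top_cut)]) | set(ranked[n - int(n * bottom_cut):])
    kept = set(ranked) - earmarked
    for term in earmarked:
        if esa_relatedness(term, query_terms) >= esa_threshold:
            kept.add(term)  # rescue: still relevant to some query
    return kept
```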
This submitted run is called i2rcviu I2R.Feat.selection. It obtained a MAP of 0.1945. Figure 2 shows the official result for this run.

Figure 2: Results for our text feature selection RUN: i2rcviu I2R.Feat.selection

Compared with the I2R.baseline run, this is a modest improvement of about 5%, which shows the utility of feature selection methods in text retrieval.

2.1.3 Query Expansion – RUN: I2R.PRF

We next experimented with a query expansion approach, adopting Rocchio's pseudo-relevance feedback [6] as our query expansion model. By assuming the top returned results to be relevant, the query expansion model reformulates the query by augmenting the original query with new feedback words selected from the top returned results. For the sake of comparison, we use the same Term Contribution, Document Frequency and ESA metrics to weight words in the top returned documents. Amongst the top Term Contribution and Document Frequency words, we take the top K words with high ESA values, where K is the length of the original query. Hence the expanded query is of length 2K. The submitted run is called I2R.PRF. It obtained a MAP of 0.1840. Figure 3 shows the official result for this run.

Figure 3: Results for our pseudo-relevance-feedback-based query expansion RUN: i2rcviu I2R.PRF

This is a disappointing result. Not only is it worse than the feature selection run (I2R.Feat.selection), it is even worse than the baseline run (I2R.baseline). While the utility of query expansion has been proven in many information retrieval tasks, we did not see its success generalize to the present Wikipedia image retrieval task.

2.2 Multimodal Approach – RUN: I2R.VISUAL.NDK

Our multimodal strategy continues the trend of combining visual processing with the results of text analysis. Specifically, we adopt a re-ranking approach that reorders images according to their visual similarity to a set of visual prototype models constructed from the top-ranked images. The main intuition of our method is that, given a query and a visual model of that query, images that are visually closer to the visual model are likely to be more relevant to the query. Given a visual model of a query, we use a visual-near-duplicate (VND) approach to compute the visual similarity between an image and the visual model. Near-duplicate images are a group of images that depict the same scene, in whole or in part, with slightly varying visual appearance. The visual differences are due to geometric, photometric and scale changes caused by variations in camera viewpoint, lighting conditions, camera sensors or photo editing. If a group of near-duplicate images is returned for a query, there is a good probability that all of them are positive answers. Figure 4 shows a few examples of near-duplicate images that are all positive answers to the given queries.

Figure 4: Example VND matching

There are many possible ways to construct a visual model of the query. A common strategy is to learn the model from a set of relevant images returned by an image search engine. For the set of official queries in WikipediaMM 2010, each query comes with three example images that visually illustrate the query intent, and some researchers have already explored learning a visual model of the query from these images [1].
However, the downside of this approach is that, because the results of an image search can be noisy, some manual effort is needed to ascertain the relevance of the returned images. This is especially problematic for non-object queries, where the returned images tend to be even noisier. In this paper, we experimented with a pseudo relevance feedback approach that builds a visual model of the query from the top-ranked images. The implicit assumption is that these top-ranked images are likely to be relevant. We apply the visual re-ranking model to the results of the text baseline run (I2R.baseline).

We follow the image-cluster-matching approach in [7] to build a visual model of each query from a set $V_{pos}$ of images, comprising the top 20 images ranked by the text baseline run I2R.baseline. We use a local-feature-based representation for images: for each image, we first compute a set of keypoints and their descriptors, and the similarity between two images is then determined by matching their keypoint descriptors. Specifically, we use the Difference of Gaussians keypoint detector and the Scale Invariant Feature Transform (SIFT) local descriptor. After identifying the group of near-duplicate images in the retrieval list, we use a simple heuristic to rank them at the top, as follows. For each image $i$, we compute its average distance $D_i$ to the images in $V_{pos}$ as
$$D_i = \frac{1}{|V_{pos} \setminus \{i\}|} \sum_{j \in V_{pos},\, j \neq i} \mathrm{distance}(i, j).$$
We implement the distance(·) function as a variant of the keypoint-based matching method in [8]. Our submitted run for this method is called I2R.VISUAL.NDK. It obtained a MAP of 0.1984. Figure 5 shows the official result for this run.

Figure 5: Results for our visual re-ranking method RUN: i2rcviu I2R.VISUAL.NDK

With a MAP of 0.1984, this run improves on the text baseline (MAP 0.1847) by 7%. This is a healthy sign, and points to the effectiveness of our visual re-ranking model. Note that this improvement comes without the need for any training images or manual labeling effort. We report in Section 3 further experiments that confirm the ability of our visual re-ranking model to improve results from other text retrieval baselines.

2.3 Exploiting Multilingual Cues

This year, the image collection is annotated with text in three languages: English, German and French. This offers a good opportunity to explore the utility of multilingual resources for improved image retrieval. Intuitively, the additional text annotations should help retrieval, since the information they carry is highly related but not fully redundant. For example, if a "Volcano" image is annotated with the English word "Volcano", it will obviously match an English query containing the word "Volcano". If the same image has no English annotation but has an equivalent German one, it would still likely match the original English query translated to the German "Vulkan". Furthermore, if the image has both English and German annotations, then a query issued with both "Volcano" and "Vulkan" would hit this image with higher confidence. Of course, in a real-world setting, not all images come with multilingual annotations. In fact, in the 2010 WikipediaMM collection, 60% of the images have annotations in only a single language, 25% have annotations in two languages, and only 10% have annotations in all three languages.
This means that the large majority of images are annotated in a single language. In terms of the languages used, 60% of the images have English annotations, 46% have German annotations, and 30% have French annotations. In such a situation, we explore the following questions:

1. Will we do better if we create separate mono-lingual indexes and issue to each the query text in the corresponding language?

2. Will we do better if we create separate dual-lingual indexes and issue to each the appropriately concatenated query text in the corresponding languages?

Each of the above is compared against the default text baseline used in this paper, the I2R.baseline system, which builds a generic index from all the text annotations present in the XML metadata, disregarding the language of the text. We report on both questions and their results below.

2.3.1 Mono-lingual – RUN: MONOLINGUAL

The main idea is simple. Since the bulk of images have annotations in at least one language, and since all queries have equivalent translations in all three languages, we can build a new retrieval system as follows:

1. build a mono-lingual index for each language

2. issue the appropriately translated queries to the corresponding mono-lingual index

3. fuse the three result lists using maximum confidence

Note that in building the language-specific indexes, we do not require that every image has all three annotation languages. In contrast to the dual-lingual run in the next subsection, we do not perform any machine translation here. In other words, if an image has only an EN annotation, and none in DE or FR, then only the EN index contains this image; the DE index and the FR index will not. At query time, only the English query description text is issued to the EN index.

Our submitted run for this method is called MONOLINGUAL. It obtained a MAP of 0.2126. Of our six submitted runs, this run achieves the best result. Compared with the MAP of 0.1847 of the baseline run (I2R.baseline), this is a substantial improvement of 15%. Because the I2R.baseline run builds a generic index that combines all available annotations in all languages, the better result from MONOLINGUAL tells us that it is better to be language-specific. Figure 6 shows the official result for this run.

Figure 6: Results for our mono-lingual RUN: i2rcviu MONOLINGUAL

2.3.2 Dual-lingual – RUN: DUAL LINGUAL

Encouraged by the promising result of our MONOLINGUAL run, we turn to the question of whether we can do better with dual-lingual indexes. We build a new retrieval system following ideas similar to MONOLINGUAL (a sketch of the fusion step is given below):

1. build a dual-lingual index for each language pair, performing machine translation where necessary

2. issue the concatenation of the appropriately translated queries to the corresponding dual-lingual index

3. fuse the three result lists using maximum confidence

Note that in building the dual-lingual indexes, we now require every image to have annotations in all three languages. In other words, if an image has only an EN annotation, and none in DE or FR, then we machine-translate the EN text into both DE and FR. We used the Google AJAX Translation API for the machine translation.
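Both the MONOLINGUAL and DUAL LINGUAL runs share the same fusion step: keep, for each image, the maximum retrieval confidence across the per-index result lists. The sketch below illustrates this step, assuming each index returns a ranked list of (image_id, score) pairs with non-negative scores; the function name is ours for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

def fuse_max_confidence(result_lists: Iterable[List[Tuple[str, float]]]) -> List[Tuple[str, float]]:
    """Fuse ranked lists from several language-specific indexes by keeping,
    for each image, the maximum retrieval score seen across the lists."""
    best = defaultdict(float)  # assumes non-negative retrieval scores
    for results in result_lists:
        for image_id, score in results:
            best[image_id] = max(best[image_id], score)
    # final ranking by the fused (maximum) confidence
    return sorted(best.items(), key=lambda x: x[1], reverse=True)
```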
Limited by the constraints of the AJAX web service, we had to throttle our calls and translated only the headline snippets of the Wikipedia text. Our submitted run for this method is called DUAL LINGUAL. It obtained a MAP of 0.1742. Figure 7 shows the official result for this run.

Figure 7: Results for our dual-lingual RUN: i2rcviu DUAL LINGUAL

The result is disappointing. It is nowhere near the MAP of 0.2126 of the MONOLINGUAL run, and compared with the MAP of 0.1847 of the baseline run (I2R.baseline) there is even a drop of 6%. We attribute the poor result to the heavily curtailed translation effort caused by our throttled calls to the Google Translation service. It is also likely that the automatically detected headline snippets submitted for translation are not meaningful or representative of the main content.

3 Further Experiments

Of the six submitted runs, two results are promising, namely our visual re-ranking run I2R.VISUAL.NDK and the mono-lingual run MONOLINGUAL. The visual re-ranking improved retrieval over the baseline by 7%, while the mono-lingual index approach improved it by 15%. As part of our post-evaluation effort, we applied the visual re-ranking method to the best run from the official evaluation, namely the MONOLINGUAL run, and obtained a MAP of 0.2338, an improvement of about 10%. This shows that the improvement from our visual re-ranking is robust and generalizable. Had we submitted this additional run, it would have been ranked 13th amongst all submissions.

4 Conclusion

For our participation in the ImageCLEF 2010 Wikipedia retrieval task, we submitted a total of six runs, exploring the following directions: (1) feature selection and feedback strategies for text-based methods, (2) visual re-ranking without the need for any labeled images, and (3) fusion of multilingual resources. The official evaluation results show that both our visual re-ranking method and the fusion of mono-lingual resources can significantly improve retrieval. Both strategies will now form an integral part of our future effort in image retrieval.

References

[1] Debora Myoupo, Adrian Popescu, Herve Le Borgne and Pierre-Alain Moellic, "Visual Reranking for Image Retrieval over the Wikipedia Corpus", ImageCLEF 2009 working notes, 2009.

[2] Lucene, http://lucene.apache.org/java/docs/.

[3] Tao Liu, Shengping Liu and Zheng Chen, "An Evaluation on Feature Selection for Text Clustering", In Proc. ICML, pp. 488-495, 2003.

[4] Y. Yang and J. Pedersen, "A comparative study on feature selection in text categorization", In Proc. ICML, pp. 412-420, 1997.

[5] Evgeniy Gabrilovich and Shaul Markovitch, "Computing semantic relatedness using Wikipedia-based explicit semantic analysis", In Proc. International Joint Conference on Artificial Intelligence, pp. 1606-1611, 2007.

[6] J. Rocchio, "Relevance feedback in information retrieval", In Gerard Salton, editor, The SMART Retrieval System – Experiments in Automatic Document Processing, pp. 313-323, 1971.

[7] O. Boiman, E. Shechtman, and M. Irani, "In Defense of Nearest-Neighbor Based Image Classification", In Proc. CVPR, 2008.

[8] Y. Ke, R. Sukthankar, and L. Huston, "An efficient parts-based near-duplicate and sub-image retrieval system", In Proc. ACM Multimedia, pp. 869-876, 2004.

[9] Adrian Popescu, Theodora Tsikrika and Jana Kludas, "Overview of the Wikipedia Retrieval task at ImageCLEF 2010", In the Working Notes of CLEF 2010, 2010.