Combining Text and Image Queries at ImageCLEF2005

Yih-Cheng Chang1, Wen-Cheng Lin1,2 and Hsin-Hsi Chen1
1 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
2 Department of Medical Informatics, Tzu Chi University, Hualien, Taiwan
ycchang@nlg.csie.ntu.edu.tw; denislin@mail.tcu.edu.tw; hhchen@csie.ntu.edu.tw

Abstract

This paper presents our methods for the bilingual ad hoc retrieval and automatic annotation tasks of ImageCLEF 2005. In the ad hoc task, we propose a feedback method for cross-media translation in a visual run, and combine the results of the visual and textual runs to generate the final result. Experimental results show that the feedback method performs well: compared to the initial visual retrieval, average precision increases from 8% to 34% after feedback, and reaches 39% when the textual run is combined with the visual run with pseudo relevance feedback. In the automatic annotation task, we propose several methods that measure the similarity between a test image and a category and classify the test image to the most similar category. The proposed approaches perform well, but the simplest 1-NN method has the best overall performance. We analyze these results in the paper.

ACM Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval---Retrieval models, Relevance feedback

Free Keywords: Cross-language image retrieval, cross-media translation, automatic image annotation, classification

1 Introduction

With the explosive growth of digital images, cross-language image retrieval and automatic annotation have become very important. An automatic annotation system helps us annotate large amounts of images, and a cross-language image retrieval system retrieves images that are annotated in a different language.

Two types of approaches, i.e., content-based and text-based approaches, are usually adopted in image retrieval [1]. Content-based image retrieval (CBIR) uses low-level visual features to retrieve images, so it is unnecessary to annotate images or translate users' queries. However, due to the semantic gap between image visual features and high-level concepts [2], it is still hard for a CBIR system to retrieve images with the intended semantic meaning. Integrating textual information may help a CBIR system cross the semantic gap and improve retrieval performance.

Recently, many approaches have tried to combine text- and content-based methods for image retrieval. A simple approach conducts text- and content-based retrieval separately and merges the results of the two runs [3,4]. In contrast to this parallel approach, a pipeline approach uses textual or visual information to perform an initial retrieval, and then uses the other kind of feature to filter out irrelevant images [5]. In these two approaches, textual and visual queries are formulated by users and do not directly influence each other. Another approach, i.e., the transformation-based approach, mines the relations between images and text and uses the mined relations to transform textual information into visual information, and vice versa [6].

In this paper we try another method to transform visual features into textual ones. We use a feedback method to translate a visual query into a textual one: the text descriptions of the top images retrieved in an initial content-based retrieval are used as feedback to conduct a second retrieval.
The new textual information helps capture the semantic meaning of the visual query, and thus improves retrieval performance.

The correlation between images and text can also be used to annotate images. However, the training data of the automatic annotation task has no textual information, so we use only visual features to classify images. In the automatic annotation task, we try several classification methods. A nearest neighbor (1-NN) method is used as our baseline. We propose several methods that measure the similarity between a test image and a class and classify a test image to the most similar class. One method measures the similarity between an image and a class by averaging the similarity scores of the top n most similar images in the class. We also propose an approach that divides a class into several smaller classes and classifies a test image according to the similarities between the test image and the centroids of the smaller classes.

The rest of the paper is organized as follows. Sections 2 and 3 introduce the proposed approaches and experimental results of the bilingual ad hoc retrieval task and the automatic annotation task, respectively. Section 4 gives concluding remarks.

2 Bilingual Ad Hoc Retrieval Task

2.1 Feedback Method for Cross-Media Translation

For cross-media translation between visual and textual representations, several correlation-based approaches have been proposed for the automatic annotation task. These approaches model the correlation between text and visual representations and use the mined relations to translate images into text descriptions. Mori, Takahashi and Oka [7] divided images into grids and clustered the grids of all images; co-occurrence information was then used to estimate the probability of each word for each cluster. Duygulu et al. [8] used blobs to represent images: images are first segmented into regions using a segmentation algorithm such as Normalized Cuts [9], all regions are clustered and each cluster is assigned a unique label (blob token), and the EM algorithm is used to construct a probability table that links blob tokens with word tokens. Jeon, Lavrenko, and Manmatha [10] proposed a cross-media relevance model (CMRM) to learn the joint distribution of blobs and words. They further proposed a continuous-space relevance model (CRM) that learns the joint probability of words and regions rather than blobs [11].

The above approaches use the relation between text and visual representations as a bridge to translate images into text. However, it is hard to learn all relations between all visual and textual features; in the experiments mentioned above, relations are learned from only hundreds of keywords in the textual annotations. Another problem is that the relations are usually highly ambiguous. For example, the visual feature "red circle" may have many meanings, such as a sunset, a red flower, or a red ball. Similarly, the word "flower" may look very different in images, e.g., in color and shape. In this paper we translate between visual and textual features without learning correlations. We treat the retrieved images and their text descriptions as aligned documents, and a corpus-based method that uses pseudo relevance feedback is adopted to translate visual or textual features and generate a new query, as sketched below.
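The following is a minimal sketch of this feedback-based translation. The helper cbir_search and the descriptions dictionary are hypothetical stand-ins for the content-based retrieval system (VIPER in our runs) and the collection's English captions; the sketch illustrates the idea rather than our exact implementation.

```python
from collections import Counter

def translate_visual_query(example_image, cbir_search, descriptions, k=2):
    """Translate a visual query into a textual one by pseudo relevance feedback.

    example_image : the visual query (an example image)
    cbir_search   : assumed helper returning image ids ranked by visual similarity
    descriptions  : dict mapping image id -> caption text in language L1
    k             : number of top-ranked images whose captions are fed back
    """
    # Initial content-based retrieval with the example image.
    ranked_ids = cbir_search(example_image)

    # Collect the caption text of the top-k retrieved images.
    feedback_terms = Counter()
    for image_id in ranked_ids[:k]:
        for term in descriptions[image_id].lower().split():
            feedback_terms[term] += 1

    # The weighted terms form the new textual query, which is then submitted
    # to a text retrieval system (Okapi in our experiments).
    return feedback_terms
```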
In cross-language image retrieval, given a set of images I = {i1, i2, ..., im} with text descriptions TI,L1 = {t1, t2, ..., tm} in language L1, users issue a textual query QL2 in language L2 (L2 ≠ L1) and example images E = {e1, e2, ..., ep} to retrieve relevant images from I. We use a feedback method in a visual run to translate the visual query into a textual one as follows. We first use an example image ei as the initial query and use a CBIR system, i.e., VIPER [12], to retrieve images from I. The retrieved images are R = {r1, r2, ..., rn} and their text descriptions are TR,L1 = {tr1, tr2, ..., trn} in language L1. We then use the text descriptions of the top k retrieved images to construct a new textual query, which can be seen as a translation of the initial visual query. In the feedback run, we submit the new textual query to a text-based retrieval system, i.e., Okapi [13], to retrieve images from I.

In addition to the visual feedback run, we also conduct a text-based run using the textual query in the test set. We use the method we proposed last year [14] to translate the textual query QL2 into a query QL1 in language L1, and submit the translated query QL1 to the Okapi system to retrieve images. The results of the textual run and the visual feedback run can be combined: the similarity scores of images in the two runs are normalized and linearly combined with equal weights.

2.2 Experimental Results

In the experiments, the text-based retrieval system is the Okapi IR system and the content-based retrieval system is the VIPER system. For the textual index in Okapi, the caption text sections of the English captions are used for indexing, and the weighting function is BM25. Chinese queries and example images are used as our source queries. We submitted four Chinese-English cross-lingual runs, two English monolingual runs and one visual run to the ImageCLEF 2005 ad hoc track. The two monolingual runs compare using the narrative with not using it. The four cross-lingual runs compare combining with the visual run against not combining, and using the narrative against not using it. The cross-lingual runs and the visual run are described as follows.

(1) NTU-adhoc05-CE-T-W: This run uses textual queries without the narrative. We use last year's query translation method to translate Chinese queries into English and retrieve images using the textual index.
(2) NTU-adhoc05-CE-TN-W-Ponly: This run uses textual queries with the narrative. We use only the positive information in the narrative; sentences that contain the phrase "不算相關" (are not relevant) are removed.
(3) NTU-adhoc05-EX-prf: This is a visual run with pseudo relevance feedback (the query becomes textual after feedback). We use the VIPER retrieval results provided by ImageCLEF as our initial retrieval results, and use the caption text of the top 2 retrieved images to construct a new textual query in the feedback run. The textual query is submitted to the Okapi IR system to retrieve images.
(4) NTU-adhoc05-CE-T-WEprf: This run merges the results of NTU-adhoc05-CE-T-W and NTU-adhoc05-EX-prf. The similarity scores of images in the two runs are normalized and linearly combined with equal weight 0.5, as sketched below.
(5) NTU-adhoc05-CE-TN-WEprf-Ponly: This run merges the results of NTU-adhoc05-CE-TN-W-Ponly and NTU-adhoc05-EX-prf.
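The merging step in runs (4) and (5) is simple score-level fusion. A minimal sketch follows; min-max normalization is assumed here since the normalization step is not otherwise specified, and the function names are illustrative.

```python
def merge_runs(text_scores, visual_scores, weight=0.5):
    """Merge a textual run and a visual feedback run by normalized,
    equally weighted linear combination of similarity scores.

    text_scores, visual_scores : dict mapping image id -> similarity score
    weight                     : weight of the textual run (0.5 = equal weights)
    """
    def normalize(scores):
        # Min-max normalization (assumed); maps scores into [0, 1].
        if not scores:
            return {}
        hi, lo = max(scores.values()), min(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    text_norm = normalize(text_scores)
    visual_norm = normalize(visual_scores)

    merged = {}
    for doc in set(text_norm) | set(visual_norm):
        merged[doc] = (weight * text_norm.get(doc, 0.0)
                       + (1.0 - weight) * visual_norm.get(doc, 0.0))

    # Rank images by the combined score.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)
```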
From Table 1, the average precision of monolingual retrieval using the title field only is 0.3952. Compared to last year's performance (0.6304), this year's query set is much harder. After adding the narrative information, average precision increases slightly. The performance of the Chinese-English cross-lingual textual run is about 60.7% of the English monolingual run, which shows that there are still many errors in language translation.

From Table 2, the performance of the initial visual run, i.e., VIPER, is not good enough. The text-based runs, even the cross-lingual ones, perform much better than the initial visual run, which shows that semantic information is very important for this year's queries. After feedback, the performance increases dramatically from 0.0829 to 0.3452, showing that the feedback method transforms visual information into textual information well. Combining the textual and visual feedback runs further improves retrieval performance: the combined runs perform better than the individual runs. These results show that more information is needed to define users' information needs; the feedback textual query carries additional information and helps the user's textual query perform better.

Table 1. Results of official runs

Run                             Text features in query                Visual features in query       Average precision
NTU-adhoc05-CE-T-W              Chinese (Title)                       None                           0.2399
NTU-adhoc05-CE-TN-W-Ponly       Chinese (Title + Positive Narrative)  None                           0.2453
NTU-adhoc05-CE-T-WEprf          Chinese (Title)                       Example image                  0.3977
NTU-adhoc05-CE-TN-WEprf-Ponly   Chinese (Title + Positive Narrative)  Example image                  0.3993
NTU-adhoc05-EX-prf              English (feedback query)              Example image (initial query)  0.3425
NTU-adhoc05-EE-T-W              English (Title)                       None                           0.3952
NTU-adhoc05-EE-TN-W-Ponly       English (Title + Positive Narrative)  None                           0.4039

Table 2. Performances of unofficial runs (NTU-adhoc05-EE-T-WEprf merges the results of NTU-adhoc05-EE-T-W and NTU-adhoc05-EX-prf)

Run                             Text features in query                Visual features in query       Average precision
NTU-adhoc05-EE-T-WEprf          English (Title)                       Example image                  0.5053
Initial Visual Run (VIPER)      None                                  Example image                  0.0829

Figure 1. Average precision (%) of each of the 28 queries for the runs CE, EE, EX, CE+EX, and EE+EX.

Figure 1 shows the per-query performance of runs NTU-adhoc05-CE-T-W, NTU-adhoc05-EE-T-W, NTU-adhoc05-EX-prf, NTU-adhoc05-CE-T-WEprf, and NTU-adhoc05-EE-T-WEprf. For most queries, the monolingual run performs better than the visual feedback run, so there are translation errors in cross-media translation. For ten topics, however, the visual feedback run performs better than the monolingual run. This is probably because the user's information need is not described in detail in the textual query, i.e., some information is lost, and because the words used in the textual query and in the image descriptions may be inconsistent. It is therefore hard to retrieve all relevant images with the textual queries formulated by users, and additional information that is not provided directly by users can help retrieve more relevant images. The query constructed in the feedback run has additional information that comes from the example images. For example, when a user wants to find images of aircraft on land, the query "aircraft in military air base" may work better than "aircraft on the ground", because the image descriptions do not mention directly that the aircraft is on the ground, but aircraft in a military air base are very likely to be parked and thus on the ground.
The additional information "military air base" is obtained because it is mentioned in the descriptions of the images retrieved by the example images using a CBIR system. Comparing the runs with and without the visual feedback run, we find that most topics perform better after combining. This is probably because the additional information in the feedback run helps our system retrieve images more precisely, and because the queries constructed in the feedback run can recover from translation errors in a cross-lingual run.

3 Automatic Annotation Task

3.1 Classification Approaches

The automatic annotation task in ImageCLEF 2005 can be seen as a classification task, since each image is annotated with exactly one word (category). In classification, the k-nearest neighbor (k-NN) method is a commonly adopted approach [15]. The performance of k-NN for different categories usually depends on the number of training images in each category: test images tend to be classified into the categories that have many training images (we show this later). To reduce this problem, we compute several representative data points to normalize the number of training images per category. We can reduce the number of training data points to one by using a centroid to represent a category, but a single centroid is sometimes insufficient if the images in the category are very different. For example, side and frontal views of a skull look very different, so using two centroids of two smaller classes to represent the category "skull" is better than using a single centroid that lies between the side and frontal views.

In this task, we use clustering to find the representative data of each category. We assume that images belonging to the same cluster and the same category are very similar and can be represented by a centroid. The details of our method are as follows. (1) We use the k-means algorithm to cluster all training data; the images in a cluster may belong to different categories. (2) After clustering, we compute the centroid of each category within each cluster. (3) Given a test image, we compute the distances between it and each centroid, and classify the test image to the category of the nearest centroid.

The second method computes the similarities between a test image and each category and classifies the test image to the most similar category. The similarity between a test image and a class is measured by averaging the similarity values between the test image and the top 2 most similar images in the class; the test image is classified to the class with the highest similarity.

3.2 Experimental Results

In this task we submitted three runs. The three runs use the same image features and differ only in the classification method. The image features are extracted as follows. We first resize each image to 256 x 256 pixels and segment it into 32 x 32 blocks (each block is 8 x 8 pixels). We then compute the average gray value of each block to construct a vector with 1024 elements. This vector represents the image, and the similarity between two images is measured with the cosine formula. A sketch of this feature extraction is given below.
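A minimal sketch of this representation and similarity computation follows; NumPy and Pillow are our illustrative choices and the original implementation is not tied to these libraries.

```python
import numpy as np
from PIL import Image

def gray_block_vector(path, size=256, block=8):
    """Represent an image by the average gray value of each 8x8 block of the
    256x256 resized image, giving a 32x32 = 1024-element feature vector."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = np.asarray(img, dtype=np.float64)       # 256 x 256 gray values
    n = size // block                                 # 32 blocks per side
    blocks = pixels.reshape(n, block, n, block)       # split into 8x8 blocks
    return blocks.mean(axis=(1, 3)).reshape(-1)       # 1024-element vector

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0
```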
The details of each run are as follows.

(1) NTU-annotate05-1NN: This run is our baseline. It uses the 1-NN method to classify each image.
(2) NTU-annotate05-Top2: This run uses the second method described in Section 3.1. We compute the similarity between a test image and a category using the top 2 nearest images in the category, and classify the test image to the most similar category.
(3) NTU-annotate05-SC: This run uses the first method described in Section 3.1. The training data are clustered with the k-means algorithm (k = 1000), the centroid of each category within each cluster is computed, and a test image is classified to the category of the nearest centroid (see the sketch after Table 4).

The results of the official runs are shown in Table 3. They show that the 1-NN method is very effective: it has the same error rate as run NTU-annotate05-Top2, but it does not need to compute average similarities and is therefore faster than the Top2 method. The performance of run NTU-annotate05-SC is worse than that of run NTU-annotate05-1NN. Normalizing the number of training images per category involves a trade-off: it may improve the performance for categories with few training images, but degrade the performance for categories with many.

Table 4 shows the error rates of individual categories. Categories with many training images are listed in the upper part of Table 4, and categories with few training images in the lower part. The categories with many training images are classified more accurately than those with few. For the categories with many training images, the 1-NN method performs better than the normalization method (run SC); in contrast, the normalization method performs much better than 1-NN for the categories with few training images. This shows that the normalization method can reduce the bias toward classifying images into large categories. The overall performance of the normalization method is nevertheless worse than that of 1-NN because the large categories have more test images and thus more influence on the final result.

Table 3. Results of official runs

Run         NTU-annotate05-1NN   NTU-annotate05-Top2   NTU-annotate05-SC
Error rate  21.7%                21.7%                 22.5%

Table 4. Error rates of individual categories. The upper part shows the ten categories with the most training images; the lower part shows categories with few training images.

Cat.   #Training images   #Test images   Error rate (1NN)   Error rate (Top2)   Error rate (SC)
12     2563               297            0.003367           0.003367            0.016835
34     880                79             0.012658           0.012658            0.000000
6      576                67             0.194030           0.223881            0.253731
1      336                38             0.000000           0.000000            0.078947
25     284                36             0.138889           0.166667            0.194444
28     228                16             0.312500           0.312500            0.250000
5      225                25             0.080000           0.080000            0.080000
17     217                24             0.125000           0.125000            0.208333
3      215                24             0.291667           0.291667            0.250000
18     205                12             0.416667           0.500000            0.416667
Avg.   572.9              61.8           0.157478           0.171574            0.174896

51     9                  1              0.000000           0.000000            0.000000
52     9                  1              0.000000           0.000000            0.000000
55     10                 2              1.000000           1.000000            1.000000
53     15                 3              0.333333           0.000000            0.333333
15     15                 3              0.666667           0.666667            0.666667
24     17                 4              1.000000           0.750000            0.750000
35     18                 4              0.750000           1.000000            0.500000
37     22                 2              1.000000           1.000000            1.000000
16     23                 1              1.000000           1.000000            0.000000
46     30                 1              1.000000           1.000000            0.000000
Avg.   16.8               2.2            0.675000           0.641667            0.425000
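For reference, here is a minimal sketch of the two classification methods from Section 3.1 that are compared above (Top2 and SC), reusing the cosine_similarity helper and 1024-element block-gray vectors sketched in Section 3.2. scikit-learn's KMeans and the function names are our illustrative choices, and cosine similarity stands in for the distance measure of the SC run; this is a sketch, not our exact implementation.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def classify_top_n(test_vec, train_vecs, train_labels, n=2):
    """Top-n method: score each category by the mean similarity of the test
    image to its n most similar training images; with n=1 this is 1-NN."""
    scores = {}
    for label in set(train_labels):
        members = [v for v, l in zip(train_vecs, train_labels) if l == label]
        sims = sorted((cosine_similarity(test_vec, v) for v in members),
                      reverse=True)
        scores[label] = float(np.mean(sims[:n]))
    return max(scores, key=scores.get)

def build_cluster_centroids(train_vecs, train_labels, k=1000, seed=0):
    """SC method, steps (1)-(2): cluster all training images with k-means
    (k=1000 in our run), then compute one centroid per (cluster, category)."""
    X = np.vstack(train_vecs)
    clusters = KMeans(n_clusters=k, random_state=seed).fit_predict(X)
    groups = defaultdict(list)
    for vec, cluster_id, label in zip(X, clusters, train_labels):
        groups[(cluster_id, label)].append(vec)
    return [(np.mean(vecs, axis=0), label) for (_, label), vecs in groups.items()]

def classify_nearest_centroid(test_vec, centroids):
    """SC method, step (3): assign the category of the most similar centroid."""
    best = max(centroids, key=lambda c: cosine_similarity(test_vec, c[0]))
    return best[1]
```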
4 Conclusions

In the bilingual ad hoc retrieval task, we propose a simple and effective feedback method for cross-language image retrieval that transforms visual features into textual ones without learning correlations. Experimental results show that the proposed feedback approach performs well: compared to the initial visual retrieval, average precision increases from 8% to 34% after feedback. The feedback textual query carries additional information that comes from the example images and helps the user's textual query perform better. After combining the textual and visual feedback runs, average precision increases from 0.2399 to 0.3977 in the cross-lingual experiments and from 0.3952 to 0.5053 in the monolingual experiments. We will test our method on other image collections in the future.

In the automatic annotation task, we propose a method that normalizes the number of training images of each category. The normalization approach involves a trade-off: it may improve the performance of categories with few training images, but degrade the performance of categories with many. We will try our method on different collections and study when normalization is appropriate.

Acknowledgement

Research of this paper was partially supported by National Science Council, Taiwan, under the contracts NSC93-2752-E-001-001-PAE and NSC94-2752-E-001-001-PAE.

References

1. Goodrum, A.A.: Image Information Retrieval: An Overview of Current Research. Information Science, 3(2). (2000) 63-66.
2. Eidenberger, H. and Breiteneder, C.: Semantic Feature Layers in Content-Based Image Retrieval: Implementation of Human World Features. In: Proceedings of the International Conference on Control, Automation, Robotics and Vision. (2002).
3. Besançon, R., Hède, P., Moellic, P.A., and Fluhr, C.: Cross-Media Feedback Strategies: Merging Text and Image Information to Improve Image Retrieval. In: Multilingual Information Access for Text, Speech and Images: 5th Workshop of the Cross-Language Evaluation Forum, CLEF 2004, LNCS 3491. Springer-Verlag GmbH (2005) 709-717.
4. Jones, G.J.F., Groves, D., Khasin, A., Lam-Adesina, A., Mellebeek, B., and Way, A.: Dublin City University at CLEF 2004: Experiments with the ImageCLEF St. Andrew's Collection. In: Multilingual Information Access for Text, Speech and Images: 5th Workshop of the Cross-Language Evaluation Forum, CLEF 2004, LNCS 3491. Springer-Verlag GmbH (2005) 653-663.
5. Baan, J., van Ballegooij, A., Geusenbroek, J.M., den Hartog, J., Hiemstra, D., List, J., Patras, I., Raaijmakers, S., Snoek, C., Todoran, L., Vendrig, J., de Vries, A., Westerveld, T., and Worring, M.: Lazy Users and Automatic Video Retrieval Tools in (the) Lowlands. In: Proceedings of the Tenth Text REtrieval Conference (TREC 2001). National Institute of Standards and Technology (2002) 159-168.
6. Lin, W.C., Chang, Y.C. and Chen, H.H.: Integrating Textual and Visual Information for Cross-Language Image Retrieval. In: Proceedings of the Second Asia Information Retrieval Symposium (AIRS 2005). (2005).
7. Mori, Y., Takahashi, H. and Oka, R.: Image-to-Word Transformation Based on Dividing and Vector Quantizing Images with Words. In: Proceedings of the First International Workshop on Multimedia Intelligent Storage and Retrieval Management. (1999).
8. Duygulu, P., Barnard, K., de Freitas, N. and Forsyth, D.: Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary. In: Proceedings of the Seventh European Conference on Computer Vision, Vol. 4. (2002) 97-112.
9. Shi, J. and Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8). (2000) 888-905.
10. Jeon, J., Lavrenko, V. and Manmatha, R.: Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2003). ACM Press (2003) 119-126.
11. Lavrenko, V., Manmatha, R. and Jeon, J.: A Model for Learning the Semantics of Pictures. In: Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems. (2003).
12. Squire, D.M., Müller, W., Müller, H., and Raki, J.: Content-Based Query of Image Databases, Inspirations from Text Retrieval: Inverted Files, Frequency-Based Weights and Relevance Feedback. In: Proceedings of the Scandinavian Conference on Image Analysis. (1999) 143-149.
13. Robertson, S.E., Walker, S. and Beaulieu, M.: Okapi at TREC-7: Automatic Ad Hoc, Filtering, VLC and Interactive. In: Proceedings of the Seventh Text REtrieval Conference (TREC-7). National Institute of Standards and Technology (1998) 253-264.
14. Lin, W.C., Chang, Y.C. and Chen, H.H.: From Text to Image: Generating Visual Query for Image Retrieval. In: Multilingual Information Access for Text, Speech and Images: 5th Workshop of the Cross-Language Evaluation Forum, CLEF 2004, LNCS 3491. Springer-Verlag GmbH (2005) 664-675.
15. Lehmann, T.M., Güld, M.O., Deselaers, T., Keysers, D., Schubert, H., Spitzer, K., Ney, H., and Wein, B.B.: Automatic Categorization of Medical Images for Content-Based Retrieval and Data Mining. Computerized Medical Imaging and Graphics, 29(2-3). Elsevier (2005) 143-155.