<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using pseudo-relevance feedback to improve image retrieval results</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mouna Torjmen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karen Pinel-Sauvagnat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohand Boughanem</string-name>
          <email>bougha@irit.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRIT</institution>
          ,
          <addr-line>118 Route Narbonne-31062 Toulouse Cedex 4 -</addr-line>
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2007</year>
      </pub-date>
      <abstract>
        <p>In this paper, we propose a pseudo-relevance feedback method for the photographic retrieval and medical retrieval tasks of ImageCLEF 2007. The aim of our participation in ImageCLEF is to evaluate a combination method using both English textual queries and image queries to answer topics. The approach processes image queries and merges the results with those of textual queries in order to improve retrieval. Using only textual information and queries, we did not obtain good results. To process image queries, we used the FIRE system to rank similar images using low-level features, and we then used the textual information associated with the top images to construct a new textual query. Results showed the interest of low-level features for processing image queries, as performance increased compared to textual query processing alone. Finally, the best results were obtained by combining the result lists of textual query processing and image query processing with a linear function.</p>
      </abstract>
      <kwd-group>
        <kwd>Image retrieval</kwd>
        <kwd>pseudo-relevance feedback</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>1. The context of an image is all information about the image coming from sources other than
the image itself. For the time being, only textual information is used as context. The main
problem of this approach is that documents can use different words to describe the same
image, or the same words to describe different concepts. Moreover, image queries
cannot be processed.
2. Content-Based Image Retrieval (CBIR) systems use low-level image features to return images
similar to an example image. The main problem of this approach is that visual similarity
does not always correspond to semantic similarity (for example, a CBIR system can return
a picture of blue sky when the example image is a blue car).</p>
      <p>
        Most image retrieval systems nowadays combine content and context retrieval, in order
to take advantage of both methods. Indeed, it has been shown that combining text- and
content-based methods for image retrieval consistently improves performance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Images and textual information can be considered as independent, and the content and contextual
information of queries can be combined in different ways:</p>
      <p>
        Image queries and textual queries can be processed separately, and the two result lists are
then merged using a linear function [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        One can also use a pipeline approach: a first search is done using textual information or
content information, and a filtering step is then applied using the other information type
to exclude non-relevant images [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Other methods use Latent Semantic Analysis (LSA) techniques to combine visual and textual
information, but are not efficient [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Some other works propose translation-based methods, in which content and context information
are complementary. The main idea is to extract relations between images and text, and to use
them to translate textual information into visual information and vice versa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]:
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the authors translate textual queries into visual ones. The
authors of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] propose to translate image queries to textual ones, and to process them using
textual methods. Results are then merged with those obtained with textual queries. The authors
of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] also propose to expand the initial textual query with terms extracted thanks to an image
query.
      </p>
      <p>For the latter methods, the main problem in constructing a new textual query or expanding an initial
textual query is term extraction. The main solution for this is pseudo-relevance feedback. Using
pseudo-relevance feedback in context-based image retrieval to process image queries is slightly
different from classic pseudo-relevance feedback. The first step is to use a visual system to process
image queries. The images obtained as results are considered relevant, and their associated textual
information is then used to select terms in order to express a new textual query.</p>
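      <p>This feedback loop can be sketched as follows (a minimal Python sketch; visual_search and associated_text are hypothetical placeholders for a CBIR system such as FIRE and for the image-to-document mapping, and the term weighting shown is deliberately simple):</p>

```python
from collections import Counter

def image_query_to_text(image_query, visual_search, associated_text, k, l):
    """Classic PRF adapted to image queries: assume the top-k visually
    similar images are relevant, and build a textual query from the
    terms of their associated documents."""
    top_images = visual_search(image_query)[:k]
    docs = [associated_text(img) for img in top_images
            if associated_text(img) is not None]
    # Simple term weighting: overall frequency in the feedback documents.
    weights = Counter(t for d in docs for t in d)
    return [t for t, _ in weights.most_common(l)]

# Toy stand-ins for the visual system and the image-to-text mapping.
visual_search = lambda q: ["img1", "img2", "img3"]
associated_text = {"img1": ["south", "korea", "river"],
                   "img2": ["south", "korea"],
                   "img3": None}.get  # img3 has no associated text
new_query = image_query_to_text("query.jpg", visual_search,
                                associated_text, k=3, l=2)
```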
      <p>
        The work presented in this paper also proposes to combine context and content information
to answer the photographic retrieval and medical retrieval tasks. More precisely, we present a
method to transform image queries into textual ones. We use XFIRM [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], a structured information
retrieval system, to process English textual queries, and the FIRE system [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to process image queries.
The documents corresponding to the images returned by FIRE are used to extract terms that form
a new textual query.
      </p>
      <p>
        The paper is organized as follows. In Section 2, we describe textual query processing using
the XFIRM system. In Section 3, we describe image query processing, using in a first step the
FIRE system, and in a second step a pseudo-relevance feedback method. In Section 4, we present
our combination method, which uses the results of both the XFIRM and FIRE systems. Experiments
and results for the two tasks (medical retrieval and photographic retrieval [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) are presented in
Section 5. Finally, we conclude in Section 6.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Textual queries processing</title>
      <p>
        Textual information of collections used for the photographic and medical retrieval tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is
organised using XML. In the indexing phase, we decided to use only the document
elements containing positive information: description, title, notes and location.
We then used the XFIRM system [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to process queries. XFIRM (XML Flexible Information
Retrieval Model) uses a relevance propagation method to process textual queries in XML documents.
Relevance values are first computed on leaf nodes (which contain the textual information), and scores
are then propagated along the document tree to evaluate the relevance values of inner nodes.
      </p>
      <p>Let q = t1, ..., tn be a textual query composed of n terms. Relevance values of leaf nodes ln
are computed thanks to a similarity function RSV(q, ln):</p>
      <p>RSV(q, ln) = Σ_{i=1..n} w_iq · w_iln, where w_iq = tf_iq and w_iln = tf_iln · idf_i · ief_i   (1)</p>
      <p>w_iq and w_iln are the weights of term i in query q and leaf node ln respectively. tf_iq and tf_iln are the
frequencies of i in q and ln, idf_i = log(|D| / (|d_i| + 1)) + 1, with |D| the total number of documents
in the collection and |d_i| the number of documents containing i, and ief_i is the inverse element
frequency of term i, i.e. log(|N| / |nf_i| + 1) + 1, where |nf_i| is the number of leaf nodes containing
i and |N| is the total number of leaf nodes in the collection.
idf_i models the importance of term i in the collection of documents, while ief_i
models it in the collection of elements.</p>
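      <p>As an illustration, equation 1 can be sketched as follows (a minimal Python sketch with hypothetical toy data; the collection statistics idf_i and ief_i are assumed to be precomputed):</p>

```python
from collections import Counter

def rsv(query_terms, leaf_terms, idf, ief):
    """Equation 1: RSV(q, ln) = sum_i w_iq * w_iln,
    with w_iq = tf_iq and w_iln = tf_iln * idf_i * ief_i."""
    tf_q = Counter(query_terms)
    tf_ln = Counter(leaf_terms)
    score = 0.0
    for term, w_iq in tf_q.items():
        w_iln = tf_ln[term] * idf.get(term, 0.0) * ief.get(term, 0.0)
        score += w_iq * w_iln
    return score

# Hypothetical precomputed statistics for a toy collection.
idf = {"korea": 2.1, "river": 1.7, "vehicle": 1.9}
ief = {"korea": 2.8, "river": 2.3, "vehicle": 2.5}

leaf = "river boat in south korea".split()
score = rsv(["vehicle", "korea"], leaf, idf, ief)  # only "korea" matches
```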
      <p>Each node n in the document tree is then assigned a relevance score r_n, which is a function of the
relevance scores of the leaf nodes it contains and of the relevance value of the whole document:</p>
      <p>r_n = ρ · |L^r_n| · Σ_{ln_k ∈ L_n} α^(dist(n, ln_k) − 1) · RSV(q, ln_k) + (1 − ρ) · r_root   (2)</p>
      <p>dist(n, ln_k) is the distance between node n and leaf node ln_k in the document tree, i.e. the number
of arcs that are necessary to join n and ln_k, and α ∈ ]0..1] adapts the importance of the
dist parameter. In all the experiments presented in this paper, α is set to 0.6.</p>
      <p>
        L_n is the set of leaf nodes that are descendants of n, and |L^r_n| is the number of leaf nodes in L_n having
a non-zero relevance value (according to equation 1). ρ ∈ ]0..1], inspired by the work presented in
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], allows the introduction of document relevance in the evaluation of inner node relevance, and r_root
is the relevance score of the root element, i.e. the relevance score of the whole document, evaluated
with equation 2 with ρ = 1.
      </p>
      <p>Finally, the documents d_j containing relevant nodes are retrieved with the following relevance
score:</p>
      <p>r_xfirm(d_j) = max_{n ∈ d_j} r_n   (3)</p>
      <p>The images associated with the retrieved documents are lastly returned by the system to answer the retrieval
tasks.</p>
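      <p>The propagation of equations 2 and 3 can be sketched as follows (a minimal Python sketch over a hypothetical toy node, with α = 0.6 as in the paper; leaf scores and distances are made-up values):</p>

```python
def node_score(leaf_scores, dists, rho, alpha, r_root):
    """Equation 2: r_n = rho * |L^r_n| * sum_k alpha^(dist - 1) * RSV_k
                         + (1 - rho) * r_root."""
    relevant = [s for s in leaf_scores if s > 0]     # |L^r_n|
    propagated = sum(alpha ** (d - 1) * s for s, d in zip(leaf_scores, dists))
    return rho * len(relevant) * propagated + (1 - rho) * r_root

# Toy inner node with two leaf descendants at distances 1 and 2;
# the same leaves sit at distances 2 and 3 from the root.
leaf_scores = [5.88, 0.0]
# Root score: equation 2 evaluated with rho = 1.
r_root = node_score(leaf_scores, [2, 3], rho=1.0, alpha=0.6, r_root=0.0)
r_n = node_score(leaf_scores, [1, 2], rho=0.8, alpha=0.6, r_root=r_root)
# Equation 3: a document's score is the max over its nodes' scores.
r_doc = max(r_n, r_root)
```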
    </sec>
    <sec id="sec-3">
      <title>Image queries processing</title>
      <p>
        To process image queries, we used a three-step method: (1) a first step processes images using
the FIRE system [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], (2) we then use pseudo-relevance feedback to construct new textual queries,
(3) the new textual queries are processed with the XFIRM system.
      </p>
      <p>We first used the FIRE system to get the top K images similar to the image query. We then got
the N associated textual documents (with N ≤ K, because some images do not have associated
textual information) and extracted the top L terms from them. To select the top L terms, we
evaluated two formulas to express the weight wi of term ti.</p>
      <p>The first formula uses the frequency of term ti in the N documents:</p>
      <p>w_i = Σ_{j=1..N} tf_ij   (4)</p>
      <p>where tf_ij is the frequency of term t_i in document d_j.</p>
      <p>The second formula uses the term frequency in the N selected documents, the number of the N
selected documents containing the term, and a normalized idf of the term in the whole
collection:</p>
      <p>w_i = [1 + log(Σ_{j=1..N} tf_ij)] · (n_i / N) · (log(D / d_i) / log(D))   (5)</p>
      <p>where n_i is the number of documents among the N associated documents containing term t_i, D is
the total number of documents in the collection and d_i is the number of documents in the collection
containing t_i.</p>
      <p>The use of the n_i/N parameter is based on the following idea: a term occurring once in n
documents is more important, and should be considered more relevant, than a term occurring n times in one
document. The log function is applied to Σ_{j=1..N} tf_ij because without it, results with or without the
n_i/N parameter were almost the same.</p>
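      <p>Both term-weighting formulas can be sketched as follows (a minimal Python sketch; the document contents and the collection statistics D and d_i are hypothetical toy values, and d_i is held constant here only to keep the example short):</p>

```python
import math
from collections import Counter

def weight_eq4(term, docs):
    """Equation 4: w_i = sum over the N documents of tf_ij."""
    return sum(Counter(d)[term] for d in docs)

def weight_eq5(term, docs, D, d_i):
    """Equation 5: w_i = [1 + log(sum_j tf_ij)] * (n_i / N)
                         * log(D / d_i) / log(D)."""
    total_tf = sum(Counter(d)[term] for d in docs)
    if total_tf == 0:
        return 0.0
    n_i = sum(1 for d in docs if term in d)  # docs containing the term
    return ((1 + math.log(total_tf)) * (n_i / len(docs))
            * math.log(D / d_i) / math.log(D))

# N = 3 associated documents (toy contents), collection of D = 1000 docs.
docs = [["south", "korea", "river"],
        ["south", "korea", "night"],
        ["south", "south", "forklift"]]
w4 = weight_eq4("south", docs)                  # total frequency: 4
w5 = weight_eq5("south", docs, D=1000, d_i=50)
top = sorted({t for d in docs for t in d},
             key=lambda t: weight_eq5(t, docs, 1000, 50), reverse=True)
```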
      <p>We then construct a new textual query with the top L terms selected according to formula 4
or 5, and we process it using the XFIRM system (as explained in Section 2).</p>
      <p>In the photographic retrieval task, we obtained the following queries for topic Q48, with K = 5
and L ≤ 5:
Textual query using equation 4: "south korea river"
Textual query using equation 5: "south korea night forklift australia"</p>
      <p>The original textual query in English was "vehicle in South Korea". As we can see, the query
using equation 5 is more similar to the original query than the one using equation 4.</p>
    </sec>
    <sec id="sec-4">
      <title>Combination function</title>
      <p>To evaluate the interest of using both content and context information, we combined the results of
image query and textual query processing, and we evaluated new relevance scores r(d_j) for
documents d_j:</p>
      <p>r(d_j) = λ · r_xfirm(d_j) + (1 − λ) · r_PRF(d_j)   (6)</p>
      <p>where r_xfirm(d_j) is the relevance score of document d_j according to the XFIRM system (equation
3) and r_PRF(d_j) is the relevance score of d_j according to the XFIRM system after image query
processing (see Section 3).</p>
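      <p>The merging step of equation 6 can be sketched as follows (a minimal Python sketch; document identifiers and scores are hypothetical, and a document missing from one list is given score 0 in that list):</p>

```python
def combine(r_xfirm, r_prf, lam):
    """Equation 6: r(d_j) = lam * r_xfirm(d_j) + (1 - lam) * r_PRF(d_j)."""
    docs = set(r_xfirm) | set(r_prf)
    return {d: lam * r_xfirm.get(d, 0.0) + (1 - lam) * r_prf.get(d, 0.0)
            for d in docs}

# Toy result lists: document id -> relevance score.
r_xfirm = {"doc1": 0.9, "doc2": 0.4}
r_prf = {"doc2": 0.8, "doc3": 0.6}
merged = combine(r_xfirm, r_prf, lam=0.9)
ranking = sorted(merged, key=merged.get, reverse=True)
```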
      <p>In order to answer both retrieval tasks, we then return all images associated with the top-ranked
documents.</p>
      <p>Figure 1 illustrates our approach.</p>
      <p>[Figure 1: overview of the approach. Textual queries are processed on the whole collection by the XFIRM system, producing documents and their associated relevance scores. Image queries are processed by the FIRE system to obtain the top K images; their XML associated text is used to build a new textual query of L terms, which is in turn processed by the XFIRM system. The two lists of documents and scores are merged with a linear combination function to produce the final document results, and the images associated with these documents form the final image results.]</p>
    </sec>
    <sec id="sec-5">
      <title>Evaluation and results</title>
      <p>Photographic Retrieval Task</p>
      <p>Evaluation of textual queries</p>
      <p>We evaluated English textual queries using the XFIRM system with parameters α = 0.9 and
ρ = 1. Results, which are almost the same, are presented in Table 1.</p>
      <p>[Table 1: results of runs RunText0609 and RunText061.]</p>
      <p>We notice that the use of term frequency in the selected documents is not enough, and that the
importance of the term in the collection needs to be used in the term weighting function (results
are better with equation 5 than with equation 4).</p>
      <p>If we now compare table 1 and table 2, we see that processing image queries with the FIRE
system and our pseudo-relevance feedback method gives better results than using only the XFIRM
system on textual queries. This shows the importance of visual features for retrieving images.
For this task, we only evaluated the combination method described in Section 4. RunComb09 uses
equation 5 with ρ = 1, K = 15, L = 10 and λ = 0.9.</p>
      <p>RunComb05 uses equation 4 with ρ = 1, K = 6, L = 5 and λ = 0.5.</p>
      <p>Results are significantly better for run RunComb09. However, as many parameters are involved
(K, L, and the equation used to select terms), it is difficult to conclude which parameters
impact the results. Further experiments are thus needed.</p>
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <p>Increasing the amount of textual information used to construct new textual queries from
image queries improves results: the number K of images selected from the FIRE results has a great
impact on results, and increasing K improves them by introducing relevant information.
Another factor influencing results is the number L of new query terms. In our experiments,
when K and L increase, the MAP metric also increases.</p>
      <p>Moreover, processing textual queries or images separately does not yield the best results:
combining the two sources of evidence clearly improves results.</p>
      <p>Finally, we would like to conclude with the type of textual information used. In the Medical and
Photographic Retrieval Tasks, textual information is encoded in XML, and as a
consequence, we decided to use an XML-oriented information retrieval system (XFIRM) to process textual
queries. However, elements are not organized in a hierarchical way, as can be the case
in XML documents (there are no ancestor-descendant relationships between nodes), and the functions used
by the XFIRM system to evaluate node relevance may not be appropriate in that case. Other
experiments are consequently needed with a plain-text information retrieval system. Combining
the XFIRM system with the FIRE system may however be interesting for fully XML-encoded
collections.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and future work</title>
      <p>We participated in the Photographic and Medical Retrieval Tasks of ImageCLEF 2007 in order to
evaluate a method using a content- and context-based approach to answer topics. We proposed a
new pseudo-relevance feedback approach to process image queries, and we tested an XML-oriented
system to process textual queries. Results showed the interest of combining the two sources of
evidence (content and context) for image retrieval.</p>
      <p>In future work, we plan to:</p>
      <p>
        Add low-level feature results extracted from FIRE to the combination function in the
Medical Retrieval Task, as visual features are very important in the medical domain.
Sort images using concept-level features [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] instead of low-level features to construct new
textual queries in the Photographic Retrieval Task.
      </p>
      <p>Use a specific domain ontology to expand textual queries (original textual queries and queries
obtained with our pseudo-relevance feedback approach).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Susanne</given-names>
            <surname>Boll</surname>
          </string-name>
          , Wolfgang Klas, and
          <string-name>
            <given-names>Jochen</given-names>
            <surname>Wandel</surname>
          </string-name>
          .
          <article-title>A cross-media adaptation strategy for multimedia presentations</article-title>
          .
          <source>In ACM Multimedia (1)</source>
          , pages
          <fpage>37</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Yih-Chen</surname>
            <given-names>Chang</given-names>
          </string-name>
          , Wen-Cheng
          <string-name>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <surname>Hsin-Hsi Chen</surname>
          </string-name>
          .
          <article-title>A corpus-based relevance feedback approach to cross-language image retrieval</article-title>
          .
          <source>In CLEF</source>
          , pages
          <fpage>592</fpage>
          -
          <lpage>601</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Deselaers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Keysers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          .
          <article-title>FIRE - flexible image retrieval engine: ImageCLEF 2004 evaluation</article-title>
          .
          <source>In CLEF Workshop</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Deselaers</surname>
          </string-name>
          , Henning Müller, Paul Clough, Hermann Ney, and
          <string-name>
            <given-names>Thomas M.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>The clef 2005 automatic medical image annotation task</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>74</volume>
          (
          <issue>1</issue>
          ):
          <fpage>51</fpage>
          -
          <lpage>58</lpage>
          ,
          <year>August 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          , Mounia Lalmas,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kazai</surname>
          </string-name>
          .
          <source>INEX 2005 workshop proceedings</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Grubinger</surname>
          </string-name>
          , Paul Clough, Allan Hanbury, and Henning Müller.
          <article-title>Overview of the ImageCLEF 2007 photographic retrieval task</article-title>
          .
          <source>In Working Notes of the 2007 CLEF Workshop</source>
          , Budapest, Hungary,
          <year>September 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Gareth</surname>
            <given-names>J. F.</given-names>
          </string-name>
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>Michael</given-names>
          </string-name>
          <string-name>
            <surname>Burke</surname>
          </string-name>
          , John Judge, Anna Khasin,
          <string-name>
            <surname>Adenike M. Lam-Adesina</surname>
            , and
            <given-names>Joachim</given-names>
          </string-name>
          <string-name>
            <surname>Wagner</surname>
          </string-name>
          . Dublin City University at CLEF 2004:
          <article-title>Experiments in monolingual, bilingual and multilingual retrieval</article-title>
          .
          <source>In CLEF</source>
          , pages
          <fpage>207</fpage>
          -
          <lpage>220</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Wen-Cheng</surname>
            <given-names>Lin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yih-Chen Chang</surname>
          </string-name>
          , and
          <string-name>
            <surname>Hsin-Hsi Chen</surname>
          </string-name>
          .
          <article-title>Integrating textual and visual information for cross-language image retrieval</article-title>
          .
          <source>In Proceedings of the Second Asia Information Retrieval Symposium</source>
          , pages
          <fpage>454</fpage>
          -
          <lpage>466</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Wen-Cheng</surname>
            <given-names>Lin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yih-Chen Chang</surname>
          </string-name>
          , and
          <string-name>
            <surname>Hsin-Hsi Chen</surname>
          </string-name>
          .
          <article-title>Integrating textual and visual information for cross-language image retrieval: A trans-media dictionary approach</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>43</volume>
          (
          <issue>2</issue>
          ):
          <fpage>488</fpage>
          -
          <lpage>502</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Nicolas</surname>
            <given-names>Maillot</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jean-Pierre</surname>
            <given-names>Chevallet</given-names>
          </string-name>
          , Vlad Valea, and Joo Hwee Lim.
          <article-title>Ipal inter-media pseudo-relevance feedback approach to imageclef 2006 photo retrieval</article-title>
          .
          <source>In Working Notes for the CLEF 2006 Workshop</source>
          , 20-22 September, Alicante, Spain,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Yosi</given-names>
            <surname>Mass</surname>
          </string-name>
          and
          <string-name>
            <given-names>Matan</given-names>
            <surname>Mandelbrod</surname>
          </string-name>
          .
          <article-title>Experimenting various user models for XML retrieval</article-title>
          .
          <source>In [5]</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Takahashi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Oka</surname>
          </string-name>
          .
          <article-title>Image-to-word transformation based on dividing and vector quantizing images with words</article-title>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] Henning Muller, Thomas Deselaers, Eugene Kim, Jayashree Kalpathy-Cramer,
          <string-name>
            <given-names>Thomas M.</given-names>
            <surname>Deserno</surname>
          </string-name>
          , Paul Clough, and
          <string-name>
            <given-names>William</given-names>
            <surname>Hersh</surname>
          </string-name>
          .
          <article-title>Overview of the ImageCLEFmed 2007 medical retrieval and annotation tasks</article-title>
          .
          <source>In Working Notes of the 2007 CLEF Workshop</source>
          , Budapest, Hungary,
          <year>September 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Sauvagnat</surname>
          </string-name>
          .
          <article-title>Modèle flexible pour la recherche d'information dans des corpus de documents semi-structurés</article-title>
          .
          <source>PhD thesis</source>
          , Toulouse : Paul Sabatier University,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Cees G. M. Snoek</surname>
          </string-name>
          , Marcel Worring, Jan C. van Gemert,
          <string-name>
            <surname>Jan-Mark Geusebroek</surname>
          </string-name>
          , and
          <string-name>
            <surname>Arnold</surname>
            <given-names>W. M.</given-names>
          </string-name>
          <string-name>
            <surname>Smeulders</surname>
          </string-name>
          .
          <article-title>The challenge problem for automated detection of 101 semantic concepts in multimedia</article-title>
          .
          <source>In MULTIMEDIA '06: Proceedings of the 14th annual ACM international conference on Multimedia</source>
          , pages
          <fpage>421</fpage>
          -
          <lpage>430</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Thijs</given-names>
            <surname>Westerveld</surname>
          </string-name>
          .
          <article-title>Image retrieval: Content versus context</article-title>
          .
          <source>In Content-Based Multimedia Information Access, RIAO 2000 Conference Proceedings</source>
          , pages
          <fpage>276</fpage>
          -
          <lpage>284</lpage>
          ,
          <year>April 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Grosky</surname>
          </string-name>
          .
          <article-title>Narrowing the semantic gap - improved text-based web document retrieval using visual features</article-title>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>