=Paper= {{Paper |id=Vol-1276/MedIR-SIGIR2014-05 |storemode=property |title=Multi-modal Relevance Feedback for Medical Image Retrieval |pdfUrl=https://ceur-ws.org/Vol-1276/MedIR-SIGIR2014-05.pdf |volume=Vol-1276 |dblpUrl=https://dblp.org/rec/conf/sigir/MarkonisSM14 }} ==Multi-modal Relevance Feedback for Medical Image Retrieval== https://ceur-ws.org/Vol-1276/MedIR-SIGIR2014-05.pdf
Multi–modal relevance feedback for medical image retrieval

Dimitrios Markonis, Roger Schaer, Henning Müller
HES–SO, TechnoPole 3, Sierre, Switzerland
dimitrios.markonis@hevs.ch, roger.schaer@hevs.ch, henning.mueller@hevs.ch

ABSTRACT
   Medical image retrieval can assist physicians in finding information supporting their
diagnosis. Systems that allow searching for medical images need to provide tools for quick
and easy navigation and query refinement as the time for information search is often short.
   Relevance feedback is a powerful tool in information retrieval. This study evaluates
relevance feedback techniques with regard to the content they use. A novel relevance
feedback technique that uses both text and visual information of the results is proposed.
   Results show the potential of relevance feedback techniques in medical image retrieval
and the superiority of the proposed algorithm over commonly used approaches.
   Future steps include integrating semantics into relevance feedback techniques to benefit
from the structured knowledge of ontologies and experimenting on the fusion of text and
visual information.

Keywords
relevance feedback, content–based image retrieval, medical image retrieval

Copyright is held by the author/owner(s).
MedIR 2014, July 11, 2014, Gold Coast, Australia.

1. INTRODUCTION
   Searching for images is a daily task for many medical professionals, especially in
image–oriented fields such as radiology. However, the huge amount of visual data in
hospitals and the medical literature is not always easily accessible and physicians
generally have little time for information search as they are charged with many tasks.
   Therefore, medical image retrieval systems need to return information adjusted to the
knowledge level and expertise of the user in a quick and precise fashion. A well known
technique that tries to improve search results through user interaction is relevance
feedback [13]. Relevance feedback allows the user to mark results returned in a previous
search step as relevant or irrelevant to refine the initial query. The concept behind
relevance feedback is that although users may have difficulties in formulating a precise
query for a specific task, they generally see quickly whether a returned result is relevant
to the information need or not. This technique found use in image retrieval particularly
with the emergence of content–based image retrieval (CBIR) systems [18, 19, 20]. Following
the CBIR mentality, the visual content of the marked results is used to refine the initial
image query. With the result images represented as a grid of thumbnails, relevance feedback
can be applied quickly to speed up the search iterations and refine results. Recent user
tests with radiologists on a medical image search system also showed that this method is
intuitive and straightforward to learn [7].
   Depending on whether the user manually provides the feedback to the system (e.g. by
marking results) or the system obtains this information automatically (e.g. by log
analysis), relevance feedback can be categorized as explicit or implicit. Moreover, the
information obtained by relevance feedback can be used to affect the general behaviour of
the system (long–term learning). In [11] a market basket analysis algorithm is applied to
image retrieval for long–term learning. A recent review of short–term and long–term
learning relevance feedback techniques in CBIR can be found in [6]. An extensive survey of
relevance feedback in text–based retrieval systems is presented in [15] and for CBIR
in [14].
   In the medical informatics field, [1] applies CBIR with relevance feedback to
mammography retrieval. In [12], an image retrieval framework using relevance feedback, in
which support vector machines compute the refined queries, is evaluated on a dataset of
5000 medical images.
   In this paper we evaluate different explicit, short–term relevance feedback techniques
using visual content or text for medical image retrieval. We propose a technique that
combines visual and text–based relevance feedback and show that it achieves performance
competitive with state–of–the–art approaches.

2. METHODS

2.1 Rocchio algorithm
   One of the most well known relevance feedback techniques is Rocchio’s algorithm [13].
Its mathematical definition is given below:

   \[ \vec{q}_m = \alpha \vec{q}_o + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j \tag{1} \]

where $\vec{q}_m$ is the modified query,
$\vec{q}_o$ is the original query,
$D_r$ is the set of relevant images,
$D_{nr}$ is the set of non–relevant images and
$\alpha$, $\beta$ and $\gamma$ are weights.
  Typical values for the weights are α = 1, β = 0.8 and
γ = 0.2. Rocchio’s algorithm is typically used in vector
models and also for CBIR. Intuitively, the original query
vector is moved towards the relevant vectors and away from
the irrelevant ones. Weighting the positive and negative parts
separately avoids a known problem in CBIR: when there is more
negative than positive feedback, many relevant images also
disappear from the result set.
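
The Rocchio update of Equation (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, vectors are plain lists of floats, and an empty feedback set is assumed to contribute nothing to the sum.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector], dim: int) -> List[float]:
    """Component-wise mean; an empty set yields the zero vector (assumption)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[t] for v in vectors) / len(vectors) for t in range(dim)]


def rocchio(q_o: Vector,
            relevant: Sequence[Vector],
            non_relevant: Sequence[Vector],
            alpha: float = 1.0,
            beta: float = 0.8,
            gamma: float = 0.2) -> List[float]:
    """Return the modified query q_m of Equation (1)."""
    dim = len(q_o)
    pos = mean_vector(relevant, dim)      # centroid of the relevant set D_r
    neg = mean_vector(non_relevant, dim)  # centroid of the non-relevant set D_nr
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(q_o, pos, neg)]


# Toy 2-dimensional example with the typical weights quoted in the text.
q_m = rocchio([1.0, 0.0],
              relevant=[[0.0, 1.0], [0.0, 3.0]],
              non_relevant=[[2.0, 0.0]])
```

Passing an empty `non_relevant` list gives positive-only feedback, in which the query only moves towards the centroid of the relevant vectors.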

2.2 Late fusion
   Another technique that showed potential in image retrieval [5] is late fusion. Late
fusion [2] is used in information retrieval to combine result lists. It can be applied for
fusing multiple features, multiple queries and in multi–modal techniques. The concept
behind this method is to merge the result lists into a single list while boosting common
occurrences using a fusion rule.
   For example, the fusion rule of the score–based late fusion method CombMNZ [17] is
defined as:

   \[ S_{combMNZ}(i) = F(i) \cdot S_{combSUM}(i) \tag{2} \]

where $F(i)$ is the number of times an image $i$ is present in retrieved lists with a
non–zero score, and $S(i)$ is the score assigned to image $i$. CombSUM is given by

   \[ S_{combSUM}(i) = \sum_{j=1}^{N_j} S_j(i) \tag{3} \]

where $S_j(i)$ is the score assigned to image $i$ in retrieved list $j$ and $N_j$ is the
number of retrieved lists.

2.3 Multi–modal relevance feedback
   Most of the techniques use vectors either from the text or the visual models. However,
it has been shown that approaches that use both text and visual information can outperform
single–modal ones in image retrieval. We propose the use of multi–modal information for
relevance feedback to enhance the retrieval performance. This is, to the extent of our
knowledge, the first time that such a technique is proposed in image retrieval. As late
fusion is applied on result lists, it is straightforward to use for combining results from
visual and text queries.

2.4 Experimental setup
   For evaluating the relevance feedback techniques the following experimental setup was
followed: The n search iterations are initiated with a text query in iteration 0. The
relevant results from the top k results of iteration i were used in the relevance feedback
formula of the iteration i + 1 for i = 0...n − 2.
   The image dataset, topics and ground truth of the ImageCLEF 2012 medical image retrieval
task [9] were used in this evaluation. The dataset contains more than 300’000 images from
the medical open access literature.
   The image captions were accessed by the text–based runs and indexed with the Lucene text
search engine (http://lucene.apache.org/). The vector space model was used along with
tokenization, stopword removal, stemming and term frequency–inverse document frequency
weighting. The bag–of–visual–words model described in [3] and the bag–of–colors model
appearing in [4] were used for the visual modelling of the images. In multi–modal runs, the
fusion of the visual and text information is performed only for the top 1000 text results,
as the ImageCLEF evaluation only takes the top 1000 documents into account in any case.
   Five techniques were evaluated in this study:

   1. text: text–based RF using the vector space model. Word stemming, tokenization and
      stopword removal are performed in both text and multi–modal runs.

   2. visual rocchio: visual RF using Rocchio to fuse the relevant image vectors and
      CombMNZ fusion to fuse the original query’s results with the visual ones.

   3. visual lf: visual RF using late fusion (and the CombMNZ fusion rule) to fuse the
      relevant image results and the original query results with the visual ones.

   4. mixed rocchio: multimodal RF using Rocchio to fuse the relevant image vectors and
      CombMNZ fusion to fuse the original query results with the relevant caption results
      and relevant visual results.

   5. mixed lf: multimodal RF using late fusion (and the CombMNZ fusion rule) to fuse the
      relevant image results and the original query results with the captions’ results and
      relevant visual results.

3. RESULTS
   The evaluation of the five techniques was performed for k = 5, 20, 50, 100 and n = 5.
Results of the mean average precision (mAP) of each technique per iteration are shown in
Figures 1, 2, 3, 4.
   Table 1 gives the best mAP scores of each run. The numbers in parentheses are the number
of the iteration when this score was achieved. For scores that were the same in multiple
iterations of the same run, the iteration closer to the first is used.

Table 1: Best mAP scores
   Run          k = 5        k = 20       k = 50       k = 100
   text         0.197 (1)    0.2544 (4)   0.3107 (3)   0.3349 (4)
   visual lf    0.2099 (2)   0.2243 (3)   0.2405 (4)   0.2553 (3)
   visual roc   0.2096 (2)   0.2187 (2)   0.2249 (3)   0.2268 (2)
   mixed lf     0.1971 (3)   0.2606 (4)   0.3079 (4)   0.3487 (3)
   mixed roc    0.1947 (1)   0.2635 (4)   0.3207 (4)   0.3466 (4)

Figure 1: Mean average precision per search iteration for k = 5.
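
The CombSUM and CombMNZ rules of Section 2.2 (Equations 2 and 3) can be sketched as follows. This is an illustrative sketch rather than the code used in the study: the function names and image identifiers are hypothetical, each result list is assumed to be a mapping from image identifier to score, and the scores of the different lists are assumed to be already on comparable scales.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

ResultList = Dict[str, float]  # image id -> retrieval score (assumed representation)


def comb_sum(result_lists: Sequence[ResultList]) -> Dict[str, float]:
    """Equation (3): sum the scores an image receives over all lists."""
    totals: Dict[str, float] = defaultdict(float)
    for scores in result_lists:
        for image_id, score in scores.items():
            totals[image_id] += score
    return dict(totals)


def comb_mnz(result_lists: Sequence[ResultList]) -> List[Tuple[str, float]]:
    """Equation (2): CombSUM multiplied by the number of non-zero occurrences."""
    totals = comb_sum(result_lists)
    hits: Dict[str, int] = defaultdict(int)
    for scores in result_lists:
        for image_id, score in scores.items():
            if score != 0:
                hits[image_id] += 1  # F(i): lists containing i with a non-zero score
    fused = {image_id: hits[image_id] * total for image_id, total in totals.items()}
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Toy example: one text result list and one visual result list.
text_run = {"img1": 0.9, "img2": 0.5}
visual_run = {"img2": 0.6, "img3": 0.4}
ranked = comb_mnz([text_run, visual_run])
```

In this toy example `img2` appears in both lists, so its summed score is doubled and it moves ahead of `img1`, which is the boosting of common occurrences described in Section 2.2.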




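
The simulated feedback protocol of Section 2.4 and the average precision measure behind the mAP scores of Section 3 can be sketched as below. This is a simplified sketch under stated assumptions: `refine` is a hypothetical callable standing in for any of the five evaluated techniques, result lists are plain lists of document identifiers, and average precision is the standard definition computed for a single topic (mAP is its mean over all topics).

```python
from typing import Callable, List, Sequence, Set

# Hypothetical signature: (iteration-0 results, marked relevant results) -> new ranked list.
Refine = Callable[[Sequence[str], Sequence[str]], List[str]]


def average_precision(ranked: Sequence[str], relevant_ids: Set[str]) -> float:
    """Standard average precision of one ranked list for one topic."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant_ids) if relevant_ids else 0.0


def simulate_feedback(initial_results: Sequence[str], refine: Refine,
                      relevant_ids: Set[str], k: int, n: int) -> List[List[str]]:
    """Run n iterations; feedback from iteration i feeds iteration i + 1."""
    runs = [list(initial_results)]  # iteration 0: results of the text query
    for i in range(n - 1):          # i = 0 ... n - 2
        top_k = runs[i][:k]
        # Positive-only feedback: keep the ground-truth relevant results in the top k.
        marked = [doc for doc in top_k if doc in relevant_ids]
        runs.append(refine(runs[0], marked))
    return runs


def promote_marked(original: Sequence[str], marked: Sequence[str]) -> List[str]:
    """Toy stand-in for a feedback technique: move marked results to the top."""
    return list(marked) + [doc for doc in original if doc not in marked]


relevant = {"a", "b", "c"}
runs = simulate_feedback(["a", "x", "b", "y", "c"], promote_marked, relevant, k=3, n=3)
ap_per_iteration = [average_precision(run, relevant) for run in runs]
```

The toy `promote_marked` function only reorders the original list; a real technique would issue new text or visual queries from the marked results and fuse the returned lists.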
Figure 2: Mean average precision per search iteration for k = 20.

Figure 3: Mean average precision per search iteration for k = 50.

Figure 4: Mean average precision per search iteration for k = 100.

4. DISCUSSION
   All of the evaluated techniques improve retrieval after the initial search iteration.
This demonstrates the potential of relevance feedback for refining medical image search
queries.
   Relevance feedback using only visual appearance models, even though improving the
retrieval performance after the first iteration, performed worse than the text–based runs
in most cases. Visual features still suffer from the semantic gap between the
expressiveness of visual features and our human interpretation. Still, this shows their
usefulness in image datasets where no or little text meta–data are available. Moreover,
when combined with the text information in the proposed method, they improve the text–only
baseline.
   The proposed multi–modal runs provide the best results in all the cases except for case
k = 5. Surprisingly, the visual runs perform slightly better than the text and the
multi–modal approaches for this case. However, assuming independent and normally
distributed average precision values, the significance tests show that the difference is
not statistically significant.
   We consider the case k = 20 as the most realistic scenario since users do not often
inspect more than 2 pages of results. Especially for grid–like result interface views,
where each page can contain 20 to 50 results, we consider k = 20 more realistic than k = 5.
In this case the proposed methods achieve the best performance, with 0.2606 for mixed lf
and 0.2635 for mixed roc. Again, the significance tests do not find any significant
difference between the three best approaches. However, applying different fusion rules for
combining visual and text information (such as linear weighting) could further improve the
results of the mixed approaches.
   It can be noted that as k increases, the performance improvement also increases,
highlighting the added value of relevance feedback. Larger values of k were not explored as
these scenarios were judged as unrealistic.
   In the visual runs, using Rocchio for combining the visual queries performs worse than
late fusion. This is in accordance with the findings in [3]. The reason behind this could
be that the large visual diversity of relevant images in medicine and the curse of
dimensionality cause the modified vector to behave as an outlier in the high dimensional
visual feature space. In the mixed runs the difference between the two methods is not
statistically significant, with Rocchio performing slightly better than late fusion.
   Irrelevant results were ignored, as they often have little or no impact on the retrieval
performance [10, 16]. More importantly, the ground truth of the dataset used contains a
much larger portion of annotated irrelevant results than relevant ones. This was considered
to potentially simulate an unrealistic scenario, as users do not usually mark many results
as negative examples. Having too many negative examples could also cause the modified
vector to behave as an outlier. Preliminary results confirmed this hypothesis: the use of
negative results for relevance feedback can decrease performance after the first iteration.
   It should be noted that this is an automated relevance feedback experiment with
positive–only feedback and that in selective relevance feedback situations the retrieval
performance is expected to be even better. A larger number of steps could be investigated
but this might be unrealistic, given the fact that physicians have little time and stop
after a few minutes of search [8]. Often users will only test a few steps of relevance
feedback at the most.

5. CONCLUSIONS
   This paper proposes the use of multi–modal information when applying relevance feedback
to medical image retrieval. An experiment was set up to simulate the relevance feedback of
a user on a number of medicine–related topics from ImageCLEF 2012.
   In general, all the techniques evaluated in this study improve the performance, which
shows the added value of relevance feedback. Text–based relevance feedback showed
consistently good results. Visual–based techniques showed competitive performance for small
shortlist sizes, underperforming in the rest of the cases. The proposed multi–modal
approaches showed promising results, slightly outperforming the text–based one but without
statistical significance.
   More fusion techniques are going to be evaluated in the future. Comparison to manual
query refinement by users is considered in future plans, to assess relevance feedback as a
concept in medical image retrieval. The addition of semantic search is also of interest, to
take advantage of the structured knowledge of medical ontologies such as RadLex (Radiology
Lexicon) and MeSH (Medical Subject Headings).

6. ACKNOWLEDGEMENTS
   This work was supported by the EU 7th Framework Program in the context of the Khresmoi
project (grant 257528).

7. REFERENCES
 [1] C.-C. Chen, P.-J. Huang, C.-Y. Gwo, Y. Li, and C.-H. Wei. Mammogram retrieval: Image
     selection strategy of relevance feedback for locating similar lesions. International
     Journal of Digital Library Systems (IJDLS), 2(4):45–53, 2011.
 [2] A. Depeursinge and H. Müller. Fusion techniques for combining textual and visual
     information retrieval. In H. Müller, P. Clough, T. Deselaers, and B. Caputo, editors,
     ImageCLEF, volume 32 of The Springer International Series On Information Retrieval,
     pages 95–114. Springer Berlin Heidelberg, 2010.
 [3] A. García Seco de Herrera, D. Markonis, I. Eggel, and H. Müller. The medGIFT group in
     ImageCLEFmed 2012. In Working Notes of CLEF 2012, 2012.
 [4] A. García Seco de Herrera, D. Markonis, and H. Müller. Bag of colors for biomedical
     document image classification. In H. Greenspan and H. Müller, editors, Medical
     Content–based Retrieval for Clinical Decision Support, MCBR–CDS 2012, pages 110–121.
     Lecture Notes in Computer Science (LNCS), Oct. 2013.
 [5] A. García Seco de Herrera, D. Markonis, R. Schaer, I. Eggel, and H. Müller. The
     medGIFT group in ImageCLEFmed 2013. In Working Notes of CLEF 2013 (Cross Language
     Evaluation Forum), September 2013.
 [6] J. Li and N. M. Allinson. Relevance feedback in content-based image retrieval: a
     survey. In Handbook on Neural Information Processing, pages 433–469. Springer, 2013.
 [7] D. Markonis, F. Baroz, R. L. Ruiz de Castaneda, C. Boyer, and H. Müller. User tests
     for assessing a medical image retrieval system: A pilot study. In MEDINFO 2013, 2013.
 [8] D. Markonis, M. Holzer, S. Dungs, A. Vargas, G. Langs, S. Kriewel, and H. Müller. A
     survey on visual information search behavior and requirements of radiologists. Methods
     of Information in Medicine, 51(6):539–548, 2012.
 [9] H. Müller, A. García Seco de Herrera, J. Kalpathy-Cramer, D. Demner Fushman,
     S. Antani, and I. Eggel. Overview of the ImageCLEF 2012 medical image retrieval and
     classification tasks. In Working Notes of CLEF 2012 (Cross Language Evaluation Forum),
     September 2012.
[10] H. Müller, W. Müller, D. M. Squire, S. Marchand-Maillet, and T. Pun. Strategies for
     positive and negative relevance feedback in image retrieval. Technical Report 00.01,
     Computer Vision Group, Computing Centre, University of Geneva, rue Général Dufour, 24,
     CH–1211 Genève, Switzerland, Jan. 2000.
[11] H. Müller, D. M. Squire, and T. Pun. Learning from user behavior in image retrieval:
     Application of the market basket analysis. International Journal of Computer Vision,
     56(1–2):65–77, 2004. (Special Issue on Content–Based Image Retrieval).
[12] M. M. Rahman, P. Bhattacharya, and B. C. Desai. A framework for medical image
     retrieval using machine learning and statistical similarity matching techniques with
     relevance feedback. Information Technology in Biomedicine, IEEE Transactions on,
     11(1):58–69, 2007.
[13] J. J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval
     System, Experiments in Automatic Document Processing, pages 313–323. Prentice Hall,
     Englewood Cliffs, New Jersey, USA, 1971.
[14] Y. Rui, T. S. Huang, and S. Mehrotra. Relevance feedback techniques in interactive
     content–based image retrieval. In I. K. Sethi and R. C. Jain, editors, Storage and
     Retrieval for Image and Video Databases VI, volume 3312 of SPIE Proc, pages 25–36,
     Dec. 1997.
[15] I. Ruthven and M. Lalmas. A survey on the use of relevance feedback for information
     access systems. The Knowledge Engineering Review, 18(02):95–145, 2003.
[16] G. Salton and C. Buckley. Improving retrieval performance by relevance feedback.
     Readings in information retrieval, 24:5, 1997.
[17] J. A. Shaw and E. A. Fox. Combination of multiple searches. In TREC-2: The Second Text
     REtrieval Conference, pages 243–252, 1994.
[18] D. M. Squire, W. Müller, H. Müller, and T. Pun. Content–based query of image
     databases: inspirations from text retrieval. Pattern Recognition Letters (Selected
     Papers from The 11th Scandinavian Conference on Image Analysis SCIA ’99),
     21(13–14):1193–1198, 2000. B.K. Ersboll, P. Johansen, Eds.
[19] L. Taycher, M. L. Cascia, and S. Sclaroff. Image digestion and relevance feedback in
     the ImageRover WWW search engine. pages 85–94, 1997.
[20] M. E. Wood, N. W. Campbell, and B. T. Thomas. Iterative refinement by relevance
     feedback in content–based digital image retrieval. pages 13–20, 1998.