NovaSearch on medical ImageCLEF 2013

André Mourão, Flávio Martins and João Magalhães
Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia, Caparica, Portugal
a.mourao@campus.fct.unl.pt, flaviomartins@acm.org, jm.magalhaes@fct.unl.pt

Abstract. This article presents the participation of the Center of Informatics and Information Technology (CITI) group in medical ImageCLEF 2013. This is our first participation, and we submitted runs for the modality classification task, the ad-hoc image retrieval task and the case retrieval task. We are developing a system that integrates textual and visual retrieval into a single framework for multimodal retrieval. Our approach to multimodal case retrieval achieved the best global performance, using segmented (6×6 grid) Local Binary Pattern (LBP) histograms and segmented HSV histograms for images, and Lucene with query expansion (using the top 3 results) for text. In modality classification, we achieved one of the largest gains from single-modality to multimodal runs, resulting in the third best team result.

Keywords: medical retrieval, case-based retrieval, multimodal fusion, medical modality classification

1 Introduction

The ImageCLEF 2013 AMIA Medical task is the CLEF benchmark focused on the retrieval and classification of medical images and articles from PubMed. This is our first participation in ImageCLEF, and in the Medical track in particular, so we designed a system able to participate in every task (excluding compound figure separation). However, we invested most of our effort in the case-based retrieval task and in the fusion of result lists.

Our system is divided into image retrieval, textual retrieval and results fusion. For image retrieval, we focused on top-performing features from previous editions [10]: CEDD [3], FCTH [4], Local Binary Pattern histograms [1] and color histograms. For text retrieval, we used Lucene (http://lucene.apache.org/core/) from the Apache project, paired with our implementation of the BM25L [8] ranking function for Lucene. For modality classification, we combined text and image descriptors (early fusion) and performed classification using Vowpal Wabbit (https://github.com/JohnLangford/vowpal_wabbit/wiki). Our training dataset included only the images provided in the 2013 edition [7]; we did not perform any kind of dataset augmentation. For the case-based retrieval task, we also tested a variety of late fusion approaches and developed a variant of Reciprocal Rank (RR) fusion [11] that we named Inverse Square Rank (ISR). Our fusion method provided the best performance when combining our case-based image and textual runs.

2 Techniques

2.1 Text retrieval

For text retrieval, article text is indexed with Lucene and retrieved using the BM25L [8] retrieval function. We expanded the initial query with preferred and alternative terms sourced from a SKOS-formatted version of MeSH using Lucene-SKOS. This improves precision at the top ranks, so we then exploit these top documents to perform pseudo-relevance feedback.

The indexed fields depend on the task. For image retrieval, we achieved good results indexing and searching only the image captions, titles and abstracts. For case retrieval, we indexed and searched the full text (all sections, including image captions), abstract and title. We ran pseudo-relevance feedback using the top 3 results retrieved with the initial query, added a maximum of 25 new query terms to the initial query, and discarded any candidate term that did not appear in at least 5 documents.
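The pseudo-relevance feedback step can be summarized by the following Python sketch: it takes the top 3 documents returned by the initial query, keeps candidate terms that occur in at least 5 corpus documents, and appends at most 25 of them to the query. Our actual implementation runs inside Lucene; the function name and the corpus representation used here are simplifying assumptions for illustration only.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, corpus_df,
                 fb_docs=3, max_terms=25, min_df=5):
    """Pseudo-relevance feedback: pick expansion terms from the top results.

    query_terms -- tokens of the original query
    ranked_docs -- tokenized documents, ordered by the initial retrieval score
    corpus_df   -- term -> number of corpus documents containing that term
    """
    # Count candidate terms in the top feedback documents.
    candidates = Counter()
    for doc in ranked_docs[:fb_docs]:
        candidates.update(doc)

    # Drop terms already in the query and terms that are too rare in the corpus.
    for term in list(candidates):
        if term in query_terms or corpus_df.get(term, 0) < min_df:
            del candidates[term]

    # Append at most max_terms of the most frequent remaining candidates.
    expansion = [term for term, _ in candidates.most_common(max_terms)]
    return list(query_terms) + expansion
```

The document-frequency filter keeps very rare (often noisy or misspelled) terms out of the expanded query, while the cap of 25 terms keeps query execution time under control.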
2.2 Visual retrieval

For visual retrieval, we extracted a set of descriptors known to be effective in medical image retrieval: CEDD, FCTH, Local Binary Pattern (LBP) histograms and color histograms. We used these features in pairs. For LBP and color histograms, we extracted both descriptors over a 6×6 image grid and concatenated the results. For CEDD and FCTH, we extracted and concatenated both feature descriptors.

CEDD (Color and Edge Directivity Descriptor) encodes color and edge directivity information from the image in a compact descriptor (54 bytes per image). It combines "real color" histograms computed in the HSV color space (e.g. bins for Light Magenta or Dark Green) with the MPEG-7 Edge Histogram Descriptor. FCTH (Fuzzy Color and Texture Histogram) combines the same color histograms as CEDD with texture information extracted with multiple energy wavelets using fuzzy logic techniques. Color histograms are histograms of the individual components of the HSV color space. Local Binary Patterns are texture descriptors based on thresholding each pixel against its neighbors to detect texture variations; the thresholded values are accumulated into a histogram that represents image texture.

The features of all images in the corpus are stored in a FLANN [9] L2 index. The image retrieval results are sorted by similarity, with the score being the normalized inverse of the L2 distance between the query image and the indexed result image. For case-based retrieval, an additional step must be performed: each image id (IRI) must be converted into a document id (DOI) (Figure 1 (a)), and duplicate results must be merged to obtain a unique document list (Figure 1 (b)). More details are given in Section 2.3.

  Image results          (a) IRI -> DOI         (b) merged documents
  Rank  IRI  Score       Rank  DOI  Score       Rank  DOI  Score
  1     a2   1.00        1     a    1.00        1     a    1.00
  2     b1   0.93        2     b    0.93        2     b    0.93
  3     b2   0.90        3     b    0.90        3     c    0.90
  4     c3   0.90        4     c    0.90
  5     a1   0.73        5     a    0.73
  6     c2   0.48        6     c    0.48

Fig. 1. Case-based retrieval steps: (a) map each image id (IRI) to its document id (DOI); (b) combine multiple results for the same document into one (unique) document list. The example uses CombMAX fusion for ease of visualization.

2.3 Results fusion

Result fusion aims at combining ranked lists from multiple sources into a single ranked list. Consider these two use cases: combining the results of a query composed of multiple images, and combining the results of textual and visual queries. There are two main approaches to late fusion: score-based and rank-based.

Score-based approaches (CombSUM, CombMAX and CombMNZ) use the normalized scores given by the individual searches (e.g. visual and textual) as the basis to create the new ranked list. Among these variants, CombMNZ achieves the best performance [2], but rank-based fusion is gaining momentum and can outperform score-based fusion under most conditions [5, 6]. For each document i, the fused score can be computed as:

  combSUM(i) = \sum_{k=1}^{N(i)} S_k(i),                 (1)

  combMAX(i) = \max_{1 \le k \le N(i)} S_k(i),           (2)

  combMNZ(i) = N(i) \times combSUM(i),                   (3)

where S_k(i) is the score of document i in the k-th result list and N(i) is the number of result lists in which document i appears. A result list k does not contain all documents: documents with a zero score or a rank far down the list can be safely ignored. Thus, N(i) varies between 0 (document i does not appear in any list) and the total number of result lists (document i appears in all lists). For example, in our experiments there are two result lists, one for visual search and another for textual search, each limited to 1000 results.

Rank-based fusion methods use the inverse of the rank of each document in each of the individual lists as its score. Reciprocal Rank (RR) and Reciprocal Rank Fusion (RRF) are the two methods we evaluated:

  RR(i) = \sum_{k=1}^{N(i)} \frac{1}{R_k(i)},            (4)

  RRF(i) = \sum_{k=1}^{N(i)} \frac{1}{h + R_k(i)}, with h = 60,   (5)

where R_k(i) is the rank of document i in the k-th result list. After analyzing both score-based and rank-based approaches, we combined elements from both to improve precision. Inverse Square Rank (ISR) combines the inverse-rank approach of RR and RRF (using the squared rank to improve precision at the top results) with the frequency component of combMNZ (results that appear in multiple lists are boosted):

  ISR(i) = N(i) \times \sum_{k=1}^{N(i)} \frac{1}{R_k(i)^2}.      (6)
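Formulas (1)–(6) can be sketched in a few lines of Python. This is an illustration rather than our actual implementation: the dictionary/list representations and the function names are assumptions made for this example. Score-based fusion takes one dict of normalized scores per result list; rank-based fusion takes one ordered list of document ids per result list.

```python
from collections import defaultdict

def comb_sum(score_lists):
    """combSUM: sum of normalized scores over the lists where the document appears."""
    fused = defaultdict(float)
    for scores in score_lists:                 # scores: doc_id -> normalized score
        for doc, s in scores.items():
            fused[doc] += s
    return dict(fused)

def comb_max(score_lists):
    """combMAX: maximum score over the lists where the document appears."""
    fused = {}
    for scores in score_lists:
        for doc, s in scores.items():
            fused[doc] = max(fused.get(doc, 0.0), s)
    return fused

def comb_mnz(score_lists):
    """combMNZ: combSUM multiplied by the number of lists containing the document."""
    sums = comb_sum(score_lists)
    hits = defaultdict(int)
    for scores in score_lists:
        for doc in scores:
            hits[doc] += 1
    return {doc: hits[doc] * s for doc, s in sums.items()}

def rank_fuse(ranked_lists, kind="ISR", h=60):
    """Rank-based fusion: RR, RRF (with h = 60) or ISR over ordered lists of doc ids."""
    fused = defaultdict(float)
    hits = defaultdict(int)
    for ranking in ranked_lists:               # ranking: doc ids, best first
        for rank, doc in enumerate(ranking, start=1):
            hits[doc] += 1
            if kind == "RR":
                fused[doc] += 1.0 / rank
            elif kind == "RRF":
                fused[doc] += 1.0 / (h + rank)
            else:                              # ISR: squared rank in the denominator
                fused[doc] += 1.0 / rank ** 2
    if kind == "ISR":                          # boost documents present in several lists
        fused = {doc: hits[doc] * score for doc, score in fused.items()}
    return dict(fused)
```

In our multimodal case-based runs, the two inputs are the textual and the visual result lists (each limited to 1000 results), and the final ranking is obtained by sorting documents by decreasing fused score.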
3 Results

3.1 Modality classification

For the modality classification task, we used the CEDD and FCTH descriptors and the (stemmed) text from the corresponding caption and article title. For the image-only and text-only runs, we used the corresponding descriptors of all images in the provided training dataset to build a model trained with stochastic gradient descent in Vowpal Wabbit. These runs correspond to CEDD FCTH (images) and words (text) in Table 1.

The multimodal run concatenates the descriptors described above (early fusion) and trains the same stochastic gradient descent model in Vowpal Wabbit on the combined representation (CEDD, FCTH, caption and title words). This run corresponds to All in Table 1.

The dataset provided as training data for modality classification was unbalanced in the number of images per class. For example, the "Compound or multipane images" (COMP) category contains over 1,105 training images, while the "PET" (DRPE) category contains only 16. Moreover, images in the COMP category may not be visually distinctive, as they are combinations of images from the other categories. We therefore also submitted runs in which the COMP category was ignored during training and classification; these correspond to the runs ending in noComb in Table 1.

Regarding the performance of our approach, we were the 3rd best team overall with our multimodal All run, classifying 72.92% of the images correctly, only 8.76% below the best run. Our visual run classified 57.62% of the images correctly (23.17% behind the best run) and our best textual run classified 62.35% of the images correctly (1.82% behind the best run). The most interesting result is the large improvement from the single-modality runs to the multimodal run: we improved on our best single-modality run by over 10%, while the best team improved on theirs by only 0.89%. Our noComb approaches did not perform well, as the test dataset contained many COMP images.

Table 1. Modality classification performance comparison by run type (T = textual, V = visual, M = multimodal). All our runs and the best run of each type are listed; if our run is the best, the second best is listed instead.

Run Name            Type  Correctly classified
IBM modality run1   T     64.17%
words               T     62.35%
words noComb        T     32.80%
IBM modality run4   V     80.79%
CEDD FCTH           V     57.62%
CEDD FCTH NoComb    V     32.49%
IBM modality run8   M     81.68%
All                 M     72.92%
All NoComb          M     44.61%
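To illustrate the early-fusion input used with Vowpal Wabbit above, the following Python sketch formats one training example with the image descriptors and the text in separate namespaces. The namespace letters, feature names and command shown in the comment are assumptions for this example, not our exact setup.

```python
def vw_example(label, visual_features, text_tokens):
    """Format one training example in Vowpal Wabbit's text format (early fusion).

    label           -- integer class id of the modality (1..K when training with --oaa K)
    visual_features -- concatenated CEDD + FCTH values (list of floats)
    text_tokens     -- stemmed words from the image caption and article title
    """
    visual = " ".join("v%d:%g" % (i, v) for i, v in enumerate(visual_features))
    text = " ".join(text_tokens)
    # Two namespaces: 'i' for image descriptors, 't' for text features.
    return "%d |i %s |t %s" % (label, visual, text)

# Example line; a one-against-all SGD model is then trained with, e.g.:
#   vw --oaa <number of classes> train.vw -f modality.model
print(vw_example(3, [0.12, 0.0, 0.45], ["chest", "radiograph", "pneumonia"]))
```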
3.2 Case retrieval

For case retrieval, we submitted one visual run (with segmented LBP histograms and HSV color histograms) and two textual runs (one with MeSH expansion and one without). The run with MeSH expansion (with MSH in the run name) outperformed the run without expansion on all metrics, with a special emphasis on MAP (12% increase) and P@10 (11% increase).

We achieved our best result with the expanded textual run, with a MAP of 22%, GM-MAP of 11.8% and P@10 of 26%, very close to the best textual run (MAP: 24%, GM-MAP: 11.6%, P@10: 27%). The visual run achieved the best result in its class, outperforming the second best result by a factor of 10, with a MAP of 3% and P@10 of 4%. Our best multimodal run (using ISR) also achieved the best result in its class, with a MAP of 16% and P@10 of 18%. The combMNZ-based run achieved much worse results (MAP: 8%, P@10: 14%), supporting our view that rank-based fusion is better than score-based fusion in multimodal retrieval. Although these results are worse than the textual-only results, our rank-based fusion algorithm improved on the existing algorithms by a small margin.

Table 2. Case-based run performance comparison by retrieval type (T = textual, V = visual, M = multimodal). All our runs and the best run of each type are listed; if ours is the best in a modality, the second best is listed instead.

Run Name                           Type  MAP     GM-MAP  bpref   P@10    P@30
SNUMedinfo9                        T     0.2429  0.1163  0.2417  0.2657  0.1981
FCT LUCENE BM25L MSH PRF           T     0.2233  0.1177  0.2000  0.2600  0.1800
FCT LUCENE BM25L PRF               T     0.1992  0.0964  0.1874  0.2343  0.1781
FCT SEGHIST 6x6 LBP                V     0.0281  0.0009  0.0300  0.0429  0.0238
medgift visual nofilter casebased  V     0.0029  0.0001  0.0035  0.0086  0.0067
FCT CB MM rComb                    M     0.1608  0.0779  0.1400  0.1800  0.1257
medgift mixed nofilter casebased   M     0.1467  0.0883  0.1318  0.1971  0.1457
FCT CB MM MNZ                      M     0.0794  0.0035  0.0800  0.1371  0.0810

Fusion. In addition to the submitted runs, we compared the performance of the fusion algorithms using our best textual and visual runs for case-based retrieval. Performance was evaluated with trec_eval and the provided relevance judgments (Table 3). On our data, rank-based approaches outperformed score-based approaches by a factor of 2. One of the reasons is the difference in score distributions between text and images: even though both visual and textual scores are normalized to the same [0, 1] interval, the distributions of the results in the score space are different. Rank-based approaches handle multimodality better because the scores are ignored.

Regarding the differences between RR, RRF and ISR: ISR performed as well as or better than the others on most measures in our experiments (Tables 3 and 4), with a noticeable boost in P@10 for case-based retrieval. The polynomial component promotes top-ranking results to the top of the fused list, offering better precision at the top ranks (e.g. P@10).

Table 3. Fusion comparison on the medical ImageCLEF case-based queries.

Run Name         Comb     MAP     GM-MAP  bpref  P@10    P@30
FCT CB MM rComb  ISR      0.1608  0.0779  0.14   0.1800  0.1257
(not submitted)  RRF      0.1597  0.0787  0.13   0.1571  0.1248
(not submitted)  RR       0.1582  0.0779  0.14   0.1771  0.1238
(not submitted)  combSUM  0.0804  0.0039  0.09   0.1429  0.0790
FCT CB MM MNZ    combMNZ  0.0794  0.0035  0.08   0.1371  0.0810
(not submitted)  combMAX  0.0292  0.0013  0.03   0.0457  0.0248

Table 4. Ad-hoc image retrieval fusion performance.

Run Name                                       Comb     MAP     GM-MAP  bpref   P@10    P@30
nlm-se-image-based-mixed (best global result)  -        0.3196  0.1018  0.2983  0.3886  0.2686
(not submitted)                                RRF      0.1496  0.0331  0.1549  0.2200  0.1495
(not submitted)                                ISR      0.1457  0.0329  0.1504  0.2057  0.1476
(not submitted)                                RR       0.1448  0.0326  0.1492  0.2086  0.1457
(not submitted)                                combMNZ  0.0488  0.0013  0.0735  0.1457  0.0752
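As a usage note, the fused runs in Table 3 can be reproduced with the rank_fuse sketch from Section 2.3, assuming the individual runs are stored in the standard TREC run format (query-id, Q0, doc-id, rank, score, tag). The read_run helper and the file names below are illustrative, not part of our released code.

```python
from collections import defaultdict

def read_run(path):
    """Load a TREC-format run file as {query_id: [doc_id, ...]}, best rank first."""
    per_query = defaultdict(list)
    with open(path) as run_file:
        for line in run_file:
            qid, _, doc_id, rank, _score, _tag = line.split()
            per_query[qid].append((int(rank), doc_id))
    return {qid: [doc for _, doc in sorted(docs)] for qid, docs in per_query.items()}

# Hypothetical file names for our best textual and visual case-based runs.
text_run = read_run("FCT_LUCENE_BM25L_MSH_PRF.trec")
image_run = read_run("FCT_SEGHIST_6x6_LBP.trec")

fused_run = {}
for qid, text_docs in text_run.items():
    # rank_fuse is the fusion sketch from Section 2.3.
    scores = rank_fuse([text_docs, image_run.get(qid, [])], kind="ISR")
    fused_run[qid] = sorted(scores, key=scores.get, reverse=True)[:1000]
# fused_run can then be written back in TREC format and scored with trec_eval.
```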
3.3 Image retrieval

For ad-hoc image retrieval, we submitted one visual run (with segmented LBP histograms and HSV color histograms) and two textual runs (one with MeSH expansion and one without).

As with case retrieval, the run with MeSH expansion (FCT SOLR BM25L MSH) outperformed the run without expansion (FCT SOLR BM25L) on all metrics, although to a lesser degree: MAP increased 5% and P@10 increased 10%. We achieved our best result with the expanded textual run, with a MAP of 23% and P@10 of 30%, well behind the best textual (and best overall) run, which obtained a MAP of 32% and P@10 of 39%. As expected, our visual run performed worst, with a MAP of 0.7% and P@10 of 3%, about half the performance of the best visual run (MAP: 1.9%, P@10: 6%).

After the competition, we tested the same fusion algorithms as in case-based retrieval for multimodal retrieval (Table 4). Our conclusions are similar to those discussed in the case-based retrieval section: rank-based approaches are much better on our data, with ISR and RR performing best.

Table 5. Ad-hoc image-based run performance comparison by retrieval type (T = textual, V = visual). All our runs and the best run of each type are listed.

Run Name                    Type  MAP     GM-MAP  bpref   P@10    P@30
nlm-se-image-based-textual  T     0.3196  0.1018  0.2982  0.3886  0.2686
FCT SOLR BM25L MSH          T     0.2305  0.0482  0.2316  0.2971  0.2181
FCT SOLR BM25L              T     0.2200  0.0476  0.2280  0.2657  0.2114
DEMIR4                      V     0.0185  0.0005  0.0361  0.0629  0.0581
FCT SEGHIST 6x6 LBP         V     0.0072  0.0001  0.0151  0.0343  0.0267

4 Conclusions

We would like to emphasize our results in the visual and multimodal case retrieval tracks, where we achieved the best MAP and bpref. Our fusion algorithm (ISR) achieved slightly better performance than existing fusion algorithms and helped us achieve the best result in the multimodal case retrieval track.

Our results in modality classification are also noteworthy: in the multimodal run, we improved on the best single-modality result by 17%, ending with the third best team result.

We were not able to submit all the desired combinations of features and fusion algorithms due to limited time, and we hope to submit them next year. We will also focus on improving multimodal fusion with other fusion approaches (e.g. early fusion in case-based and image-based retrieval) and algorithms. Another technique we will study is the integration of modality as a feature in the retrieval tasks. Overall, we are satisfied with our first participation in ImageCLEF and hope to improve our performance in the following editions.

References

1. T. Ahonen, A. Hadid, and M. Pietikäinen. Face description with local binary patterns: application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, Dec. 2006.
2. N. J. Belkin, P. Kantor, E. A. Fox, and J. A. Shaw. Combining the evidence of multiple query representations for information retrieval. Inf. Process. Manage., 31(3):431–448, May 1995.
3. S. A. Chatzichristofis and Y. S. Boutalis. CEDD: color and edge directivity descriptor: a compact descriptor for image indexing and retrieval. In Proceedings of the 6th International Conference on Computer Vision Systems, ICVS'08, pages 312–322, Berlin, Heidelberg, 2008. Springer-Verlag.
4. S. A. Chatzichristofis, K. Zagoris, Y. S. Boutalis, and N. Papamarkos. Accurate image retrieval based on compact composite descriptors and relevance feedback information. International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 24(2):207–244, 2010.
5. G. V. Cormack, C. L. A. Clarke, and S. Buettcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In SIGIR '09, pages 758–759, New York, NY, USA, 2009. ACM.
6. D. Frank Hsu and I. Taksa. Comparing rank and score combination methods for data fusion in information retrieval. Inf. Retr., 8(3):449–480, May 2005.
7. A. García Seco de Herrera, J. Kalpathy-Cramer, D. Demner-Fushman, S. Antani, and H. Müller. Overview of the ImageCLEF 2013 medical tasks. In Working Notes of CLEF 2013, 2013.
8. Y. Lv and C. Zhai. When documents are very long, BM25 fails! In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1103–1104. ACM, 2011.
9. M. Muja and D. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In International Conference on Computer Vision Theory and Applications (VISAPP '09), volume 340, pages 331–340. INSTICC Press, 2009.
10. H. Müller, A. García Seco de Herrera, J. Kalpathy-Cramer, D. Demner-Fushman, S. Antani, and I. Eggel. Overview of the ImageCLEF 2012 medical image retrieval and classification tasks. In CLEF 2012 Working Notes, 2012.
11. M. Zhang, R. Song, C. Lin, S. Ma, Z. Jiang, Y. Jin, Y. Liu, and L. Zhao. Expansion-based technologies in finding relevant and new information: THU TREC 2002 novelty track experiments. In Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), pages 586–590, 2002.