<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>XRCE's Participation at Medical Image Modality Classification and Ad-hoc Retrieval Tasks of ImageCLEF 2011</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriela Csurka</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stéphane Clinchant</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Jacquet</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Multi-modal Information Retrieval, Medical Image Modality Classification, Ad-hoc Retrieval,
Semi-supervised learning, Fisher Vector</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIG</institution>
          ,
          <addr-line>Univ. Grenoble I, BP 53 - 38041 Grenoble cedex 9, Grenoble</addr-line>
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Xerox Research Centre Europe</institution>
          ,
          <addr-line>6 chemin de Maupertuis 38240, Meylan</addr-line>
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The aim of this document is to describe the methods we used in the Medical Image Modality Classification and Ad-hoc Image Retrieval tasks of ImageCLEF 2011. The main novelty in medical image modality classification this year was that there were more classes (18 modalities), organized in a hierarchy, and that for some categories only a few annotated examples were available. Our strategy in image categorization was therefore to use a semi-supervised approach. In our experiments, we investigated mono-modal (text and image) and mixed-modality classification. The image classification was based on Fisher Vectors built on SIFT-like local orientation histograms and local color statistics. For the text representation we used a binarized bag-of-words, where each element indicated whether or not the term appeared in the image caption. In the case of multi-modal classification, we simply averaged the text and image classification scores. For the ad-hoc retrieval task, we used the image captions for text retrieval and Fisher Vectors for visual similarity and modality detection. Our text runs were based on a late fusion of different state-of-the-art text experts and the Lexical Entailment model. The Lexical Entailment model used last year's articles to compute similarities between terms and ranked first at the previous challenge. Concerning the submitted runs, we realized that we inadvertently forgot (partly because we participated in several ImageCLEF tasks in parallel) to submit our best run from last year [3]. Nor did we submit the improvement over this run that was proposed in [6]. Overall, this explains the medium performance of our submitted runs. In this document, we show that our system from last year and its improvements would have achieved top performance. We did not tune the parameters of this system for this year's task; we simply evaluated the runs we failed to submit. Finally, we experimented with different strategies for fusing our textual expert, visual expert and image modality classification scores, which give results consistent with last year's results and with our analysis in [6].</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-modal Information Retrieval</kwd>
        <kwd>Medical Image Modality Classification</kwd>
        <kwd>Ad-hoc Retrieval</kwd>
        <kwd>Semi-supervised learning</kwd>
        <kwd>Fisher Vector</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>3 and because we participated in parallel at several ImageCLEF Task</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>
        This year the medical retrieval task of ImageCLEF 2011 used a subset of PubMed Central
containing 231,000 images [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. As clinicians indicated that modality is one of the most important filters to limit
their search, a first subtask of the Medical Challenge was Medical Image Modality Classification.
Participants were therefore provided with a training set of 1,000 images classified into one
of 18 modalities organized in a hierarchy (see Figure 1).
      </p>
      <p>The main novelty in medical image modality classification this year was that we had more
classes and less annotated data. Furthermore, for some of the categories only a few annotated
examples were available. Therefore, our main strategy in image categorization was to first
automatically augment the training set. We used two main approaches: a semi-supervised learning
approach and an approach based on a CBIR retrieval scenario (see section 2).</p>
      <p>
        In both cases, we experimented with mono-modal and multi-modal strategies. For the
visual modality, we represented images with Fisher Vectors built on SIFT-like local orientation
histograms and local color statistics (see [
        <xref ref-type="bibr" rid="ref10 ref7">10, 7</xref>
        ] for details), while text (in our case image captions)
was represented by a binarized bag-of-words, where each element indicated whether or not
the term appeared in the image caption. For multi-modal classification, we simply
averaged the text and image classification scores.
      </p>
      <p>
        Concerning the ad-hoc medical image retrieval task, the only information we used this year
was the image captions and a visual representation. Our text expert was based on a late fusion of
four textual models built on the image captions: a Dirichlet Smoothed language model (DIR), two Power
Law Information-Based Models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (SPL and LGD) and the Lexical Entailment IR Model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (AX).
The only model that used information other than the provided image captions was the Lexical
Entailment model, as it used last year's articles to compute similarities between terms. This text
expert was combined with our Fisher Vector based visual model, with modality class predictions,
or with both, using different fusion strategies (see section 3).
      </p>
    </sec>
    <sec id="sec-3">
      <title>Medical Image Modality Classification Task</title>
      <p>
        In our experiments we investigated both mono-modal and mixed-modality classification.
For the purely visual classifiers, we used the Fisher Vector [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] representation of the
images, as also described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Note that in the case of medical images we used only a single FV
per image and per feature, without the spatial pyramid. The low-level features we used were
similar to those used in our Wikipedia runs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], i.e. SIFT-like local orientation histograms
(ORH) and local RGB color statistics (COL).
      </p>
      <p>Note that in the medical corpus a large number of images were single-channel grayscale images.
To be able to compute the COL features for these images, we first transformed them into 3-channel
images where the R, G and B channels were each set equal to the luminance channel of
the grayscale image. This gave us low-level features of the same size for grayscale
and color RGB images, and hence allowed us to build a common COL visual vocabulary.</p>
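      <p>A minimal sketch of this conversion, assuming images stored as NumPy arrays, could look as follows:</p>
      <preformat>
import numpy as np

def to_three_channels(img: np.ndarray) -> np.ndarray:
    """Replicate the luminance channel of a grayscale image into R, G and B,
    so that COL features have the same size for grayscale and RGB images."""
    if img.ndim == 2:                    # single-channel grayscale
        return np.stack([img, img, img], axis=-1)
    return img                           # already a 3-channel RGB image
      </preformat>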
      <p>Concerning our text representation (TXT), we used a binarized bag-of-words representation,
where each element indicated whether or not the term appeared in the document (in our case,
the image caption). As with the Fisher Vectors, we further normalized this vector with the power
norm (α = 0.5) followed by L2 normalization.</p>
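      <p>A minimal sketch of this representation, assuming a caption already mapped to vocabulary term indices, could be:</p>
      <preformat>
import numpy as np

def binarized_bow(term_ids, vocab_size, alpha=0.5):
    """Binary bag-of-words over one caption, power-normalized then L2-normalized.
    On 0/1 entries the power norm is a no-op; it is kept here for symmetry
    with the Fisher Vector pipeline."""
    v = np.zeros(vocab_size)
    v[list(set(term_ids))] = 1.0         # 1 if the term occurs in the caption
    v = np.sign(v) * np.abs(v) ** alpha  # power norm with alpha = 0.5
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v   # L2 normalization
      </preformat>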
      <p>
        To train the classifiers, we used our own implementation of Sparse Logistic Regression
(SLR) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], i.e. logistic regression with a Laplacian prior. We trained one classifier per class
(one-versus-all) and per feature type (ORH, COL or TXT). Finally, we used late fusion to combine
them, simply averaging the scores.
      </p>
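      <p>The following sketch illustrates this training and fusion scheme; scikit-learn's L1-penalized logistic regression stands in for our own SLR implementation, and the feature-dictionary layout is an assumption made for illustration:</p>
      <preformat>
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_feature(features, labels):
    """One classifier per feature type, one-versus-all, with a Laplacian (L1) prior.
    features: {"ORH": X_orh, "COL": X_col, "TXT": X_txt}, rows aligned with labels."""
    return {name: LogisticRegression(penalty="l1", solver="liblinear").fit(X, labels)
            for name, X in features.items()}

def late_fusion_scores(classifiers, features):
    """Late fusion: simply average the per-feature class-probability scores."""
    probs = [clf.predict_proba(features[name]) for name, clf in classifiers.items()]
    return np.mean(probs, axis=0)        # (n_samples, n_classes)
      </preformat>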
      <p>However, when we tested this system on the training data using a 5-fold cross-validation
scheme, the results we obtained were rather poor (see Table 1).</p>
      <p>Analyzing our results per class, we realized that the low performance might be due to the fact
that for some of the categories only a few annotated examples were available. Therefore, we decided
to automatically increase the training set. We used two main approaches for this: a semi-supervised
learning approach and a visual-retrieval-based approach, both exploiting the image collection
from the medical retrieval task.</p>
      <p>The main idea of the semi-supervised learning approach was to use the modality classifiers
trained on the provided training data to rank the 231K images of the collection by
classification score. Hence, for each modality, the corresponding top K documents were considered
as most probably correctly classified, labeled with the given modality and added to the training
set.</p>
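      <p>A sketch of this top-K pseudo-labeling step (the variable layout is illustrative) could be:</p>
      <preformat>
import numpy as np

def pseudo_label_topk(scores, unlabeled_ids, k=25):
    """scores: (n_unlabeled, n_classes) classifier scores over the collection.
    For each modality, pseudo-label the K highest-scoring images with it."""
    new_examples = []
    for c in range(scores.shape[1]):
        top = np.argsort(-scores[:, c])[:k]
        new_examples += [(unlabeled_ids[i], c) for i in top]
    return new_examples    # appended to the training set before retraining
      </preformat>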
      <p>In the second scenario, we first built an image query from a random set of images
labeled with the given modality. The images in the collection were ranked by their average
similarity (dot product between Fisher Vectors) to the query image set. The top K documents
ranked as most similar to the query set were added to the training set, labeled with the given
modality.</p>
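      <p>This ranking step can be sketched as follows, assuming the Fisher Vectors are stacked into matrices:</p>
      <preformat>
import numpy as np

def topk_by_query_set(fv_collection, fv_queries, k=20):
    """Rank the collection by average dot-product similarity to a set of query
    Fisher Vectors and return the indices of the K most similar images."""
    sims = fv_collection @ fv_queries.T  # (n_images, n_queries) dot products
    avg_sim = sims.mean(axis=1)          # average similarity to the query set
    return np.argsort(-avg_sim)[:k]
      </preformat>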
      <p>Finally, the modality classifiers were retrained on the increased training set using the different
feature types and combined as described above. We submitted several runs (detailed below) using
either only visual information or both visual and textual information.</p>
      <sec id="sec-3-1">
        <title>Visual-only runs</title>
        <p>– V1: For this run we used the semi-supervised approach with the visual classifier using COL+ORH
features in both steps (training with the original set and training with the increased
set). After the first step we added the top 25 images for each modality classifier.
– V2 and V3: For these runs we used visual retrieval to increase the training set. To build
our queries, we used the labeled images from two of the 5 folds used in our first experiments
(one fold per run). The top 20 images for each query were added to the training set and a
COL+ORH visual categorization system was trained.</p>
        <p>Note that both strategies led to similar performance. The choice of K might be non-optimal;
a better strategy would be to learn a different K for each modality based on some confidence
measure over the classification scores.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Mixed-modality runs</title>
        <p>As Table 3 shows, all the “mixed modality” runs outperformed the “visual only” runs by
about 2-3% in accuracy (see for example V1 and M3, where the enriched training set was the
same). Finally, comparing with the results in Table 1, we can see that both strategies for increasing
the training set were extremely useful, leading to an absolute improvement of 25% over the
baseline that used only the annotated training set.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Medical Ad-hoc Image Retrieval Task</title>
      <p>Concerning the ad-hoc medical image retrieval task, the only information we used this year was
the image captions (not the full articles) and a visual representation. Using this information, we built
several mono-modal and multi-modal systems. In addition, we also integrated the modality
classification scores into these systems. In what follows, we give further details on these methods and
the corresponding runs.</p>
      <p>As visual representation, we used exactly the same representation as in the modality classification,
namely Fisher Vectors with ORH and COL features. As we use the dot product as similarity
measure, the sum of similarities between the FVs of the corresponding features (used in our case)
is equivalent to the dot product between image signatures built by concatenating the ORH FV
with the COL FV. As Table 4 shows, adding the color information slightly improves the retrieval
results; however, as expected, visual similarity alone is not sufficient to handle the semantic
queries.</p>
      <p>
        Concerning the text representation, we used only the image captions, which were first pre-processed
with tokenization, lemmatization and standard stopword removal. As lemmatization might in some
cases lead to a loss of information, when constructing the dictionary we kept both the lemmatized
and non-lemmatized version of each term. Each caption was then transformed into a
bag-of-words representation, on which we built four textual models. These models (summarized
in Table 5) were the Dirichlet Smoothed standard language model (similar to the techniques
used in our past participation in other tasks of ImageCLEF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), two Power Law Information-Based
Models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (LGD and SPL) and finally the Lexical Entailment IR Model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In the first model (DIR), we used the standard Language Model representation with Dirichlet
smoothing, which gives the following retrieval model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
      </p>
      <p>RSV(q, d) = \sum_{w \in q,\, x_w^d &gt; 0} x_w^q \log\Big(1 + \frac{x_w^d}{\mu\, p(w|C)}\Big) + l_q \log\frac{\mu}{l_d + \mu} \qquad (1)</p>
      <p>where x_w^d is the number of occurrences of word w in document d, l_d is the length of d in tokens
after lemmatization (similarly l_q for the query q), μ is the smoothing parameter and p(w|C) is the
corpus language model.</p>
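      <p>For illustration, Eq. (1) can be computed as follows; the dictionary-based layout and the value of μ are assumptions of this sketch:</p>
      <preformat>
import math

def rsv_dir(query_tf, doc_tf, doc_len, p_corpus, mu=1000.0):
    """Dirichlet-smoothed language-model score of Eq. (1); mu = 1000.0 is an
    illustrative value, not the setting used in the runs."""
    score = sum(q_tf * math.log(1.0 + doc_tf[w] / (mu * p_corpus[w]))
                for w, q_tf in query_tf.items() if doc_tf.get(w, 0) > 0)
    query_len = sum(query_tf.values())
    return score + query_len * math.log(mu / (doc_len + mu))
      </preformat>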
      <p>
        The Log-logistic model (LGD) follows the same steps as the Smoothed Power Law model (SPL)
(described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). The only difference is that the relevance score in the ranking model becomes
(see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for details):
      </p>
      <p>RSV(q, d) = \sum_{w \in q \cap d} x_w^q \Big(-\log P(Tf_w &gt; t_w^d)\Big) \qquad (2)</p>
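      <p>For illustration, Eq. (2) can be computed as below, where lam[w] is the document-frequency ratio N_w/N of the log-logistic model and t is the length-normalized term frequency of [4]; the exact normalization and c = 1.0 are assumptions of this sketch:</p>
      <preformat>
import math

def rsv_lgd(query_tf, doc_tf, doc_len, avg_len, lam, c=1.0):
    """Log-logistic (LGD) score of Eq. (2): for the log-logistic distribution,
    P(Tf_w > t) = lam[w] / (t + lam[w])."""
    score = 0.0
    for w, q_tf in query_tf.items():
        if doc_tf.get(w, 0) > 0:
            # length-normalized term frequency
            t = doc_tf[w] * math.log(1.0 + c * avg_len / doc_len)
            score += q_tf * (-math.log(lam[w] / (t + lam[w])))
    return score
      </preformat>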
      <p>
        Our last model, the Lexical Entailment based IR Model (AX), is also described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For
this model, we need to compute probabilistic similarities between terms. However, using
only the image captions gives a rather poor context for the words. As we did not process the full
articles from this year, we used the processed articles from last year's medical corpus.
      </p>
      <p>Finally, expecting the models to bring complementary information, we averaged several of them
(with equal weights) to obtain a single text expert. However, as Table 6 shows, our expectation was
rather wrong: we brought in more noise than useful information. Comparing the four models
shows that, without any external information, using only the image captions is insufficient.
However, the AX model alone (not submitted, as explained in the abstract) gives better performance
than all other runs, including the ones submitted to the challenge by other teams. Note that only T6 and T7
(corresponding to the runs XRCE RUN TXTax dir spl and XRCE RUN TXT noMOD, respectively) were
submitted (results in red).</p>
      <p>
        As multi-modal retrieval system, we used the Late Semantic Combination (LSC) proposed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and also described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The main idea of this late fusion is to first use the text expert to select
the top N = 1000 semantically relevant documents, and then to average their textual and visual
scores. Results for different text experts using equal weighting (wT = wV = 0.5) and unbalanced
weighting between the image and text scores (wT = 0.9 and wV = 0.1) are shown in Table
7. Note that we submitted only M6 (corresponding to XRCE RUN MIX SFL noMOD ax dir spl)
and M7 (corresponding to XRCE RUN MIX SFL noMOD ax dir spl lgd), both with equal weighting
(results in red in the table).
      </p>
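      <p>A sketch of LSC under these settings follows; the min-max normalization of the two score distributions is an assumption of the sketch, as the exact score scaling is not detailed here:</p>
      <preformat>
import numpy as np

def late_semantic_combination(text_scores, visual_scores, n=1000, w_t=0.5, w_v=0.5):
    """Keep the N documents the text expert ranks highest, then re-rank them by
    a weighted average of (min-max normalized) text and visual scores."""
    top = np.argsort(-text_scores)[:n]   # top-N semantically relevant documents
    def norm(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    fused = w_t * norm(text_scores[top]) + w_v * norm(visual_scores[top])
    return top[np.argsort(-fused)]       # document indices, best first
      </preformat>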
      <p>We can see first that using LSC with equal weighting decreased the MAP but
in most cases increased the P@10 value. On the contrary, by giving a much higher weight to
the text scores than to the image similarities (wT = 0.9 and wV = 0.1), we are able in some
cases to improve over our text results, such as with the AX model. As we obtain similar or lower
performance using the late fusion, we can say that for this task the image similarities did not really
help to improve the retrieval.</p>
      <sec id="sec-4-1">
        <title>Multi-modal retrieval systems with image modality prediction</title>
        <p>In this section we show the performance of our retrieval systems when combined with
the modality prediction. For each topic, each image and the query text were individually
classified by our mono-modal modality classifiers, and we retained the modality (or modalities)
obtained. Note that for topics where any type of image was allowed, we also considered
only the modalities obtained by this automatic model; hence our model was sub-optimal.</p>
        <p>We then experimented with two strategies. In the first case (FILT), for each document in
the dataset whose predicted modality corresponded to the selected modality of the topic, we boosted
its retrieval score (multiplied it by 2), while all other scores were kept unchanged. Hence if a score
was high and of the desired modality, it was significantly increased, while if the retrieval score was
low, the modality classifier had a smaller effect on it.</p>
        <p>In the second case (Mscore), for each document we added the classification score of the query
modality to the retrieval score. When several modalities were retained, we used the maximum of all
those scores. Results for both strategies, applied to different text runs, are shown in Table 8.</p>
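        <p>Both strategies can be sketched as follows (the array layouts are illustrative):</p>
        <preformat>
import numpy as np

def filt(retrieval_scores, doc_modality, query_modalities, boost=2.0):
    """FILT: double the retrieval score of documents whose predicted modality
    matches one of the query's predicted modalities."""
    match = np.isin(doc_modality, list(query_modalities))
    return np.where(match, boost * retrieval_scores, retrieval_scores)

def mscore(retrieval_scores, modality_scores, query_modalities):
    """Mscore: add, per document, the maximum classification score over the
    query's predicted modalities. modality_scores: (n_docs, n_classes)."""
    return retrieval_scores + modality_scores[:, list(query_modalities)].max(axis=1)
        </preformat>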
        <p>Unfortunately, we submitted only our equally weighted mixed runs M6 and M7 combined with
the modality classifier, instead of our text runs, shown in Table 9. While these results confirm that
the modality classification helps, their performance is significantly worse than the performance
obtained with TM4 and TM5.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this document, we described the methods we used in the Medical Image Modality Classification
and Medical Ad-hoc Retrieval tasks at ImageCLEF 2011. We have shown that, while for some classes
only a few examples were available, using a semi-supervised approach to increase the training data
can lead to a very significant improvement of the results. With this method our system again came
out as best performing in the challenge.</p>
      <p>Modality-combined runs (MAP / P@10 under the NoMod, Filt and Mscore strategies):
MM6 (TXT expert AX+DIR+SPL): 14.72 / 34.33, 15.20 / 36.33, 16.43 / 38.00;
MM7 (TXT expert AX+DIR+SPL+LGD): 14.29 / 33.67, 15.12 / 36.67, 15.45 / 38.00.</p>
      <p>
        Concerning the ad-hoc retrieval task, our strategy of averaging several text models instead of
using only the Lexical Entailment IR Model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (AX) led to a text expert with only medium performance. Further combining this with an
equally weighted Late Semantic Combination led to a retrieval system that performed even more
poorly than our text expert. While combining this expert with the modality classifier slightly
increased the performance of the system, it remained far behind the best performing systems in the
challenge. However, after evaluating our unsubmitted runs, we realized that our AX text model
alone performed better than the best performing text expert in the challenge, and when further
combined with the modality classifier, the system outperforms the best submitted run.
      </p>
      <p>Acknowledgments
We would also like to thank Florent Perronnin and Jorge Sánchez for the efficient
implementations of the Fisher Vector computation and of the Sparse Logistic Regression (SLR) we
used in our experiments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>J.</given-names>
            <surname>Ah-Pine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Csurka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          , and
          <string-name>
            <surname>J-M. Renders</surname>
          </string-name>
          .
          <article-title>Leveraging image, text and cross-media similarities for diversity-focused multimedia retrieval</article-title>
          . The Information Retrieval Series, chapter 3.4. Springer,
          <year>2010</year>
          . ISBN 978-3-642-15180-4.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Stéphane Clinchant, Julien Ah-Pine, and Gabriela Csurka.
          <article-title>Semantic combination of textual and visual information in multimedia retrieval</article-title>
          .
          <source>In ACM International Conference on Multimedia Retrieval (ICMR)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Stéphane Clinchant, Gabriela Csurka, Julien Ah-Pine, Guillaume Jacquet, Florent Perronnin, Jorge Sanchez, and Keyvan Minoukadeh.
          <article-title>XRCE's participation in Wikipedia retrieval, medical image modality classification and ad-hoc retrieval tasks of ImageCLEF 2010</article-title>
          .
          <source>In Working Notes of CLEF</source>
          <year>2010</year>
          , Padova, Italy,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Stéphane Clinchant and Éric Gaussier.
          <article-title>Information-based models for ad hoc IR</article-title>
          .
          <source>In SIGIR '10: Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Stéphane Clinchant, Cyril Goutte, and Éric Gaussier.
          <article-title>Lexical entailment for information retrieval</article-title>
          .
          <source>In Advances in Information Retrieval, 28th European Conference on IR Research</source>
          , ECIR
          <year>2006</year>
          , London, UK,
          <source>April 10-12</source>
          , pages
          <fpage>217</fpage>
          -
          <lpage>228</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Gabriela</given-names>
            <surname>Csurka</surname>
          </string-name>
          , Stéphane Clinchant, and
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Jacquet</surname>
          </string-name>
          .
          <article-title>Medical image modality classification and retrieval</article-title>
          .
          <source>In International Workshop on Content-based Multimedia Indexing</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Gabriela</given-names>
            <surname>Csurka</surname>
          </string-name>
          , Stéphane Clinchant, and
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Jacquet</surname>
          </string-name>
          .
          <article-title>XRCE's participation at Wikipedia retrieval of ImageCLEF 2011</article-title>
          .
          <source>In Working Notes of CLEF</source>
          <year>2011</year>
          , Amsterdam, The Netherlands,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>B.</given-names>
            <surname>Krishnapuram</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Hartemink</surname>
          </string-name>
          .
          <article-title>Sparse multinomial logistic regression: Fast algorithms and generalization bounds</article-title>
          .
          <source>PAMI</source>
          ,
          <volume>27</volume>
          (
          <issue>6</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Henning</given-names>
            <surname>Müller</surname>
          </string-name>
          , Jayashree Kalpathy-Cramer, and Steven Bedrick.
          <article-title>Overview of the CLEF 2011 medical image retrieval track</article-title>
          .
          <source>In Working Notes of CLEF</source>
          <year>2011</year>
          , Amsterdam, The Netherlands,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Dance</surname>
          </string-name>
          .
          <article-title>Fisher kernels on visual vocabularies for image categorization</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>A study of smoothing methods for language models applied to ad hoc information retrieval</article-title>
          .
          <source>In Proceedings of SIGIR'01</source>
          , pages
          <fpage>334</fpage>
          -
          <lpage>342</lpage>
          . ACM,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>