<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Modal Interactive Approach to ImageCLEF 2007 Photographic and Medical Retrieval Tasks by CINDI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>M. M. Rahman</string-name>
          <email>rahm@cs.concordia.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bipin C. Desai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prabir Bhattacharya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <addr-line>1455 de Maisonneuve Blvd.</addr-line>
          ,
          <addr-line>Montreal, QC, H3G 1M8</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science &amp; Software Engineering, Concordia University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the contribution of the CINDI group to the ImageCLEF 2007 ad-hoc retrieval tasks. We experiment with multi-modal (i.e., image and text) interaction and fusion approaches based on relevance feedback information for the image retrieval tasks on photographic and medical image collections. For the text-based image search, keywords are extracted from the annotation files and indexed with the vector space model of information retrieval. For the content-based image search, various global, semi-global, region-specific, and visual concept-based features are extracted at different levels of image abstraction. Based on relevance feedback information, multiple textual and visual query refinements are performed, and the user's perceived semantics are propagated from one modality to the other through query expansion. The feedback information also dynamically adjusts the intra- and inter-modality weights in a linear combination of similarity matching functions. Finally, the top-ranked images are obtained by performing both sequential and simultaneous retrieval approaches. An analysis of the results of the different runs is reported in this paper.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.7 Digital Libraries</kwd>
        <kwd>I.4.8 [Image Processing and Computer Vision]: Scene Analysis - Object Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>For the 2007 ImageCLEF competition, the CINDI research group participated in two different tasks of the ImageCLEF track: ad-hoc retrieval from a photographic collection (the IAPR data set) and ad-hoc retrieval from a medical collection (the CASImage, MIR, PathoPic, Peir, endoscopic, and myPACS data sets) [1, 2]. The goal of the ad-hoc task is, given a multilingual statement describing a user information need along with example images, to find as many relevant images as possible in the given collection. Our work exploits the advantages of both the text and the image modalities by involving users in the retrieval loop for cross-modal interaction and integration. This paper presents our multi-modal retrieval methodologies, a description of the submitted runs, and an analysis of the retrieval results.</p>
    </sec>
    <sec id="sec-2">
      <title>Text-Based Image Retrieval Approach</title>
      <p>This section describes the text-based image retrieval approach, in which a user submits a query topic using keywords to retrieve the images associated with the retrieved annotation files. For a text-based search, it is necessary to prepare the document collection, consisting of annotated XML and SGML files, in an easily accessible representation. Each annotation file in the collection is linked to its image(s) in either a one-to-one or a one-to-many relationship. To support a keyword-based search on these annotation files, we rely on the vector space model of information retrieval [3]. In this model, a document is represented as a vector of words, where each word is a dimension in a Euclidean space. Indexing is performed by extracting keywords from selected elements of the XML and SGML documents, depending on the image collection. Let T = {t_1, t_2, ..., t_N} denote the set of keywords (terms) in the collection. A document Dj is represented as a vector in an N-dimensional space as f_Dj = [w_j1, ..., w_jk, ..., w_jN]^T. The element w_jk = L_jk · G_k denotes the tf-idf weight [3] of term t_k, k ∈ {1, ..., N}, in document Dj. Here, the local weight is L_jk = log(f_jk) + 1, where f_jk is the frequency of occurrence of keyword t_k in document Dj. The global weight G_k is the inverse document frequency G_k = log(M / M_k), where M_k is the number of documents in which t_k appears and M is the total number of documents in the collection. A query Dq is likewise represented as an N-dimensional vector f_Dq = [w_q1, ..., w_qk, ..., w_qN]^T. To compare Dq and Dj, the cosine similarity measure is applied as follows:</p>
      <p>Sim_text(Dq, Dj) = (Σ_{k=1}^{N} w_qk · w_jk) / (sqrt(Σ_{k=1}^{N} (w_qk)^2) · sqrt(Σ_{k=1}^{N} (w_jk)^2)) (1)
where w_qk and w_jk are the weights of term t_k in Dq and Dj respectively.</p>
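The tf-idf weighting and the cosine measure of equation (1) can be sketched as follows (a minimal Python illustration over sparse {term: weight} dictionaries; the function names are ours, not part of the submitted system):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors: w_jk = (log(f_jk) + 1) * log(M / M_k)."""
    M = len(docs)
    # M_k: number of documents containing each term
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (math.log(f) + 1) * math.log(M / df[t])
                        for t, f in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (eq. 1)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Here a document is simply a token list; a query would be weighted with the same scheme before scoring against the collection.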
      <sec id="sec-2-1">
        <title>Textual Query Refinement by Relevance Feedback</title>
        <p>Query reformulation is a standard technique for reducing the ambiguity caused by the word mismatch problem in information retrieval [4]. In the present work, we investigate an interactive way to generate multiple query representations and to integrate them in a similarity matching function by applying various relevance feedback methods. A relevance feedback technique prompts the user for feedback on the retrieval results and then uses that information in subsequent retrievals, with the goal of increasing retrieval performance [4, 5]. We generate multiple query vectors by applying various relevance feedback methods. For the first method, we use the well-known Rocchio algorithm [6] as follows:</p>
        <p>f_Dq^m(Rocchio) = α f_Dq^o + β (1/|R|) Σ_{f_Dj ∈ R} f_Dj − γ (1/|R̂|) Σ_{f_Dj ∈ R̂} f_Dj (2)
where f_Dq^m and f_Dq^o are the modified and the original query vectors, R and R̂ are the sets of relevant and irrelevant document vectors, and α, β, and γ are weights. This algorithm generally moves the new query point toward the relevant documents and away from the irrelevant documents in the feature space [6]. For our second feedback method, we use the Ide-dec-hi formula:</p>
        <p>f_Dq^m(Ide) = α f_Dq^o + β Σ_{f_Dj ∈ R} f_Dj − γ max_{R̂}(f_Dj) (3)
where max_{R̂}(f_Dj) is the vector of the highest-ranked non-relevant document. This is a modified version of Rocchio's formula that eliminates the normalization by the number of relevant and non-relevant documents and allows limited negative feedback, from only the top-ranked non-relevant document. For the experiments, we set the weights to α = 1, β = 1, and γ = 1.
        </p>
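The two feedback updates of equations (2) and (3) can be sketched over sparse {term: weight} dictionaries (a Python illustration with our own function names; the paper's system is not reproduced here):

```python
def rocchio(q, relevant, irrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Rocchio update (eq. 2): move the query toward the centroid of the
    relevant vectors and away from the centroid of the irrelevant ones."""
    new_q = {t: alpha * w for t, w in q.items()}
    for docs, coef in ((relevant, beta), (irrelevant, -gamma)):
        denom = len(docs) or 1   # guard against an empty feedback set
        for d in docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + coef * w / denom
    return new_q

def ide_dec_hi(q, relevant, top_nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide-dec-hi (eq. 3): unnormalized sum over the relevant vectors,
    negative feedback from only the highest-ranked non-relevant one."""
    new_q = {t: alpha * w for t, w in q.items()}
    for d in relevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w
    for t, w in top_nonrel.items():
        new_q[t] = new_q.get(t, 0.0) - gamma * w
    return new_q
```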
        <p>We also perform two different query reformulations based on local analysis. Generally, local analysis considers the top k most highly ranked documents for query expansion, without any assistance from the user [12, 3]. In this work, however, we consider only the user-selected relevant images for further analysis. At first, a simple query expansion approach is applied, based on identifying the five most frequently occurring keywords in the user-selected relevant documents. After selecting the additional keywords, the query vector is reformulated as f_Dq^m(Local1) by re-weighting its keywords with the tf-idf weighting scheme, and it is re-submitted to the system as a new query. The other query reformulation approach expands the query with terms correlated to the query terms. Such correlated terms are those present in local clusters built from the relevant documents indicated by the user. There are many ways to build a local cluster before performing any query expansion [12, 3]. For this work, a correlation matrix C(|Tl| × |Tl|) = [c_{u,v}] is constructed [8], in which the rows and columns are associated with the terms of a local vocabulary Tl. The element c_{u,v} of this matrix is defined as</p>
        <p>c_{u,v} = n_{u,v} / (n_u + n_v − n_{u,v}) (4)
where n_u is the number of local documents that contain term t_u, n_v is the number of local documents that contain term t_v, and n_{u,v} is the number of local documents that contain both t_u and t_v. Hence, c_{u,v} measures the ratio between the number of local documents in which both t_u and t_v appear and the number of local documents in which either t_u or t_v appears. If t_u and t_v co-occur in many documents, the value of c_{u,v} increases and the terms are considered more correlated. Given the correlation matrix C, we use it to build the local correlation cluster. For a query term t_u ∈ Dq, we consider the u-th row of C (i.e., the row with all the correlations for the keyword t_u). From that row, we take the three largest correlation values c_{u,l}, u ≠ l, and add the corresponding terms t_l to the query. The process is repeated for each query term, and finally the query vector is reformulated as f_Dq^m(Local2) by re-weighting its keywords with the tf-idf weighting scheme.
        </p>
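The correlation-based expansion of equation (4) can be sketched as follows (a Python illustration computed only over the user-selected relevant documents, as in the text; names and the `per_term` parameter are ours):

```python
from collections import Counter
from itertools import combinations

def expand_query(query_terms, relevant_docs, per_term=3):
    """Correlated-term expansion (eq. 4): c_uv = n_uv / (n_u + n_v - n_uv),
    computed over the user-selected relevant documents only."""
    n = Counter()          # n_u: local documents containing term u
    n_pair = Counter()     # n_uv: local documents containing both u and v
    for doc in relevant_docs:
        terms = set(doc)
        n.update(terms)
        for u, v in combinations(sorted(terms), 2):
            n_pair[(u, v)] += 1

    def corr(u, v):
        uv = n_pair[tuple(sorted((u, v)))]
        return uv / (n[u] + n[v] - uv)

    expanded = set(query_terms)
    for u in query_terms:
        if u in n:
            # take the per_term strongest correlates of u (u itself excluded)
            cands = sorted((t for t in n if t != u),
                           key=lambda t: corr(u, t), reverse=True)
            expanded.update(cands[:per_term])
    return expanded
```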
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Content-based Image Retrieval Approach</title>
      <p>In content-based image retrieval (CBIR), access to the information is performed at a perceptual level, based on automatically extracted low-level features (e.g., color, texture, and shape) [13]. The performance of a CBIR system depends on the underlying image representation, usually in the form of a feature vector. To generate the feature vectors, various global, semi-global, region-specific, and visual concept-based image features are extracted at different levels of abstraction. The MPEG-7 Edge Histogram Descriptor (EHD) and Color Layout Descriptor (CLD) are extracted for the image representation at the global level [14]. To represent the EHD as a vector f_ehd, a histogram with 16 × 5 = 80 bins is obtained. The CLD represents the spatial layout of an image in a very compact form in the YCbCr color space, where Y is the luma component and Cb and Cr are the blue and red chroma components [14]. In this work, a CLD with 10 Y, 3 Cb, and 3 Cr coefficients is extracted to form a 16-dimensional feature vector f_cld. The global distance measure between the feature vectors of a query image Iq and a database image Ij is a weighted Euclidean distance measure, defined as</p>
      <p>Dis_global(Iq, Ij) = ω_cld Dis_cld(f_Iq^cld, f_Ij^cld) + ω_ehd Dis_ehd(f_Iq^ehd, f_Ij^ehd) (5)
where Dis_cld(f_Iq^cld, f_Ij^cld) and Dis_ehd(f_Iq^ehd, f_Ij^ehd) are the Euclidean distance measures for the CLD and EHD respectively, and ω_cld and ω_ehd are the weights of the feature distance measures, subject to 0 ≤ ω_cld, ω_ehd ≤ 1 and ω_cld + ω_ehd = 1, initially set to equal values ω_cld = 0.5 and ω_ehd = 0.5. For the semi-global feature vector, a simple grid-based approach is used to divide the image into five overlapping sub-images [16]. Several moment-based color and texture features are extracted from each of the sub-images, and they are later combined to form a semi-global feature vector. The mean and standard deviation of each color channel in HSV color space are extracted from each overlapping sub-region of an image Ij. Various texture moment-based features (such as energy, maximum probability, entropy, contrast, and inverse difference moment) are also extracted from the grey-level co-occurrence matrix (GLCM) [15]. The color and texture feature vectors are normalized and combined to form a joint feature vector f_r^sg for each sub-image r, and finally they are combined into the semi-global feature vector f_sg of the entire image. The semi-global distance measure between Iq and Ij is defined as</p>
      <p>Dis_s-global(Iq, Ij) = Σ_{r=1}^{5} ω_r Dis_r(f_rq^sg, f_rj^sg) (6)
where Dis_r(f_rq^sg, f_rj^sg) is the Euclidean distance measure for the feature vectors of region r, and the ω_r are the weights of the regions, which are initially set to be equal.
      </p>
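A weighted combination of per-descriptor Euclidean distances, as in equations (5) and (6), can be sketched generically (a Python illustration; the descriptor names in the example dictionaries are placeholders, not the actual 80- and 16-dimensional MPEG-7 vectors):

```python
import math

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def global_distance(q_feats, j_feats, weights=None):
    """Weighted global distance (eq. 5 pattern): a convex combination of
    per-descriptor Euclidean distances, with equal weights by default."""
    names = sorted(q_feats)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    return sum(weights[name] * euclidean(q_feats[name], j_feats[name])
               for name in names)
```

The same function covers the semi-global case of equation (6) if the dictionary keys are the five sub-image regions instead of descriptors.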
      <p>Region-based image retrieval (RBIR) aims to overcome the limitations of the global and semi-global retrieval approaches by automatically fragmenting an image into a set of homogeneous regions based on color and/or texture properties. Hence, we consider a local, region-specific feature extraction approach that automatically fragments an image into a set of homogeneous regions made up of (2 × 2) pixel blocks, based on a fast k-means clustering technique. The image-level distance between Iq and Ij is measured by integrating the properties of all the regions in the two images. Suppose there are M regions in image Iq and N regions in image Ij. The image-level distance is then defined as</p>
      <p>Dis_local(Iq, Ij) = (Σ_{i=1}^{M} w_{riq} Dis_{riq}(q, j) + Σ_{k=1}^{N} w_{rkj} Dis_{rkj}(j, q)) / 2 (7)
where w_{riq} and w_{rkj} are the weights (e.g., with the number of image blocks as the unit) of region r_iq of image Iq and region r_kj of image Ij respectively. For each region r_iq ∈ Iq, Dis_{riq}(q, j) is defined as the minimum Bhattacharyya distance [18] between this region and any region r_kj ∈ Ij, i.e., Dis_{riq}(q, j) = min(Dis(r_iq, r_1j), ..., Dis(r_iq, r_Nj)). The Bhattacharyya distance is computed from the mean color vector and the covariance matrix of the color channels in HSV color space of each region. The details of the segmentation, local feature extraction, and similarity matching schemes are described in our previous work [16].
      </p>
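The region-matching scheme of equation (7) can be sketched as follows (a Python illustration; we assume, as one plausible reading, that each direction is normalized by its total region weight, and any region-to-region distance such as the paper's Bhattacharyya measure can be plugged in):

```python
def local_distance(regions_q, regions_j, dist):
    """Image-level region distance (eq. 7 pattern): each region is matched
    to its closest counterpart in the other image, per-region distances are
    weighted by region size, and the two directions are averaged.
    regions_q / regions_j are lists of (weight, feature) pairs."""
    def directed(src, dst):
        total_w = sum(w for w, _ in src)
        return sum(w * min(dist(f, g) for _, g in dst)
                   for w, f in src) / total_w
    return (directed(regions_q, regions_j)
            + directed(regions_j, regions_q)) / 2
```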
      <p>We also extract visual concept-based image features, which are analogous to a keyword-based representation in the text retrieval domain. The visual concepts depict perceptually distinguishable color or texture patches in local image regions. For example, a predominant yellow color patch can be present either in an image of the sun or in an image of a sunflower. To generate a set of visual concepts analogous to a dictionary of keywords, we consider a fixed decomposition approach that generates a 16 × 16 grid-based partition of the images. Accordingly, sample images from a training set are equally partitioned into 256 non-overlapping smaller blocks. To represent each block as a feature vector, color and texture moment-based features are extracted as described for the semi-global feature. To generate a codebook of prototype concept vectors from the block features, we use a SOM-based clustering technique [17]. The basic structure of a SOM consists of two layers: an input layer and a competitive output layer. The input layer consists of a set of input node vectors X = {x_1, ..., x_i, ..., x_n}, x_i ∈ R^d, while the output layer consists of a set of N neurons C = {c_1, ..., c_j, ..., c_N}, where each neuron c_j is associated with a weight vector c_j ∈ R^d. After the weight vectors are determined through the learning process, each output neuron c_j represents a visual concept, with its associated weight vector c_j serving as a code vector of the codebook. To encode an image, it is likewise decomposed into an even grid-based partition, and the color and texture moment-based features are extracted from each block. For the joint color and texture moment-based feature vector of each block, the nearest output node c_k, 1 ≤ k ≤ N, is identified by applying the Euclidean distance measure, and the corresponding index k of the output node c_k is stored for that particular block of the image.</p>
      <p>
        Based on this encoding scheme, an image Ij can be represented as a vector f_Ij^V-concept = [f_1j, ..., f_ij, ..., f_Nj]^T, where each dimension corresponds to a concept index in the codebook. The element f_ij represents the frequency of occurrence of concept c_i in Ij. For this work, codebooks of size 400 (i.e., 20 × 20 units) are constructed for the photographic and medical collections by manually selecting 2% of the images of each collection as the training set. Since the concept-based feature space is closely related to the keyword-based feature space of documents, we apply the cosine measure to compare images Iq and Ij, as described in equation (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ).
      </p>
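The encoding step can be sketched as follows (a Python illustration of nearest-code-vector assignment and concept-frequency histogramming; training the SOM itself is out of scope here, so the codebook is taken as given):

```python
import math

def encode_image(block_features, codebook):
    """Visual-concept encoding: each image block is mapped to its nearest
    code vector (Euclidean distance), and the image becomes a frequency
    histogram over the codebook, analogous to term frequencies in text."""
    hist = [0] * len(codebook)
    for f in block_features:
        dists = [math.dist(f, c) for c in codebook]
        hist[dists.index(min(dists))] += 1
    return hist
```

The resulting histograms can then be compared with the same cosine measure used for the text vectors.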
      <sec id="sec-3-1">
        <title>Visual Query Refinement by Relevance Feedback</title>
        <p>This section presents the visual query refinement approach at the different levels of image representation. The query refinement is closely related to the approach in [9]. It is assumed that all the positive feedback images at a particular iteration belong to the user's perceived visual and/or semantic category and obey a Gaussian distribution, forming a cluster in the feature space. We consider the rest of the images as irrelevant; they may belong to different semantic categories. However, we do not use the irrelevant images for query refinement. The modified query vector at a particular iteration is represented as the mean of the relevant image vectors:</p>
        <p>f_Iq^m,x = (1/|R|) Σ_{f_Il ∈ R} f_Il^x (8)
where R is the set of relevant image vectors and x ∈ {global, sg, V-concept}. Next, the covariance matrix of the positive feature vectors is estimated as</p>
        <p>C_x = (1/(|R| − 1)) Σ_{l=1}^{|R|} (f_Il^x − f_Iq^m,x)(f_Il^x − f_Iq^m,x)^T (9)
However, a singularity issue arises in the covariance matrix estimation when fewer training samples or positive images are available than the feature dimension (as will be the case for user feedback images). So, we add regularization to avoid singular matrices as follows [19]:</p>
        <p>
          Ĉ_x = α C_x + (1 − α) I (10)
for some 0 ≤ α ≤ 1, where I is the identity matrix. After generating the mean vector and the covariance matrix of a feature x ∈ {global, sg, V-concept}, we adaptively replace the distance measure functions of equations (
          <xref ref-type="bibr" rid="ref5">5</xref>
          ) and (
          <xref ref-type="bibr" rid="ref6">6</xref>
          ) with the following Mahalanobis distance measure [18] for a query image Iq and a database image Ij:
        </p>
        <p>Dis_x(Iq, Ij) = (f_Iq^m,x − f_Ij^x)^T Ĉ_x^(−1) (f_Iq^m,x − f_Ij^x) (11)
The Mahalanobis distance differs from the Euclidean distance in that it takes the correlations of the data set into account and is scale-invariant, i.e., not dependent on the scale of the measurements [18]. We did not perform any query refinement for the region-specific feature, owing to its variable feature dimension for the variable number of regions in each image.</p>
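The chain of equations (8)-(11) can be sketched in a few lines of NumPy (an illustration only; the α value here is an arbitrary example, not the value used in the submitted runs):

```python
import numpy as np

def refine_and_compare(relevant, db_feature, alpha=0.7):
    """Visual query refinement sketch (eqs. 8-11): the refined query is the
    mean of the relevant feature vectors; the sample covariance is
    regularized toward the identity to avoid singularity; the comparison
    uses a Mahalanobis distance."""
    R = np.asarray(relevant, dtype=float)            # |R| x d matrix
    mean = R.mean(axis=0)                            # eq. 8
    if R.shape[0] > 1:
        cov = np.cov(R, rowvar=False)                # eq. 9, 1/(|R|-1) form
    else:
        cov = np.zeros((R.shape[1], R.shape[1]))     # single sample: no spread
    cov_reg = alpha * cov + (1 - alpha) * np.eye(R.shape[1])   # eq. 10
    diff = mean - np.asarray(db_feature, dtype=float)
    return float(diff @ np.linalg.inv(cov_reg) @ diff)         # eq. 11
```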
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Combination of Evidence by Dynamic Weight Update</title>
      <p>In recent years, the category of work known as data fusion, or multiple-evidence combination, has described a range of techniques in information retrieval whereby multiple pieces of information are combined to improve retrieval effectiveness [10, 11]. These pieces of information can take many forms, including different query representations, different document (image) representations, and different retrieval strategies used to obtain a measure of the relationship between a query and a document (image). Motivated by this paradigm, we described multiple textual query and image representation schemes in Sections 2 and 3. This section presents an adaptive linear combination approach based on relevance feedback information. One of the most commonly used approaches in data fusion is the linear combination of similarity scores. For our multi-modal retrieval purpose, let us consider q as a multi-modal query that has an image part Iq and a document part (annotation file) Dq. In a linear combination scheme, the similarity between q and a multi-modal item j, which also has two parts (e.g., image Ij and text Dj), is defined as</p>
      <p>Sim(q, j) = ω_I Sim_I(Iq, Ij) + ω_D Sim_D(Dq, Dj) (12)
where ω_I and ω_D are the inter-modality weights of the image and text modalities, subject to 0 ≤ ω_I, ω_D ≤ 1 and ω_I + ω_D = 1. The image-based similarity is in turn defined as a linear combination of the similarity measures at the different levels of image representation:</p>
      <p>Sim_I(Iq, Ij) = Σ_{IF} ω_IF Sim_I^IF(Iq, Ij) (13)
where IF ∈ {global, semi-global, region, V-concept} and the ω_IF are the weights of the different image representation schemes (i.e., the intra-modality weights). The text-based similarity is likewise defined as a linear combination of the similarity matching of the different query representation schemes:</p>
      <p>Sim_D(Dq, Dj) = Σ_{QF} ω_QF Sim_D^QF(Dq, Dj) (14)
where QF ∈ {Rocchio, Ide, Local1, Local2} and the ω_QF are the weights of the different query representation schemes.</p>
      <p>
        The effectiveness of the linear combination depends mainly on the choice of the different inter- and intra-modality weights. We use a dynamic weight updating method in the linear combination schemes that considers both the precision and the rank order of the top K retrieved images. Before any fusion, the distance scores of each representation are normalized and converted to similarity scores in the range [0, 1] as Sim(q, j) = 1 − (Dis(q, j) − min(Dis(q, j))) / (max(Dis(q, j)) − min(Dis(q, j))), where min(·) and max(·) are the minimum and maximum distance scores. In this approach, all the features and their similarity matching functions initially receive equal emphasis through their weights. The weights are then updated dynamically during the subsequent iterations by incorporating the feedback information of the previous round. To update the inter-modality weights (e.g., ω_I and ω_D), we first perform the multi-modal similarity matching based on equation (
        <xref ref-type="bibr" rid="ref12">12</xref>
        ). After the initial retrieval with a linear combination of equal weights (e.g., ω_I = 0.5 and ω_D = 0.5), the user provides feedback about the relevant images among the top K returned images. For each ranked list produced by an individual similarity matching, we also consider the top K images and measure the effectiveness of a query/image feature as
      </p>
      <p>
        E(D or I) = (Σ_{i=1}^{K} Rank(i)) / (K/2) · P(K) (15)
where Rank(i) = 0 if the image at rank position i is not relevant according to the user's feedback, and Rank(i) = (K − i)/(K − 1) for the relevant images. Here, P(K) = R_K / K is the precision at top K, where R_K is the number of relevant images among the top K retrieved results. Hence, equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ) is basically the product of two factors, rank order and precision. The raw performance scores obtained by this procedure are then normalized by the total score, as Ê(D) = ω̂_D = E(D) / (E(D) + E(I)) and Ê(I) = ω̂_I = E(I) / (E(D) + E(I)), to generate the updated text and image feature weights respectively. For the next retrieval iteration with the same query, these modified weights are used in the multi-modal similarity matching function as
      </p>
      <p>Sim(q, j) = ω̂_I Sim_I(Iq, Ij) + ω̂_D Sim_D(Dq, Dj) (16)
This weight updating process may be continued as long as the user provides relevance feedback, or until no changes are noticed because the system has converged.</p>
      <p>
        In a similar fashion, to update the intra-modality weights (e.g., the ω_QF and ω_IF), we consider the top K images of each individual result list. For the image-based similarity in equation (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ), we consider the result lists of the different image features IF ∈ {global, semi-global, region, V-concept} and measure their weights with equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ) for the next retrieval iteration. For the text-based similarity in equation (
        <xref ref-type="bibr" rid="ref14">14</xref>
        ), the top K images in the result lists of the different query features QF ∈ {Rocchio, Ide, Local1, Local2} are considered, and the text-level weights are determined in the same way by applying equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ).
      </p>
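The effectiveness measure of equation (15) and the weight normalization can be sketched as follows (a Python illustration; `relevant_flags` is our own encoding of the user's per-rank judgments):

```python
def effectiveness(relevant_flags):
    """Feature effectiveness E (eq. 15): product of a rank-order factor
    and precision at K. relevant_flags[i] is True if the image at rank
    i+1 was judged relevant by the user."""
    K = len(relevant_flags)
    # Rank(i) = (K - i)/(K - 1) for relevant items at 1-based rank i, else 0
    rank_score = sum((K - (i + 1)) / (K - 1)
                     for i, rel in enumerate(relevant_flags) if rel)
    precision = sum(relevant_flags) / K      # P(K) = R_K / K
    return rank_score / (K / 2) * precision

def update_weights(scores):
    """Normalize the raw E scores into weights that sum to 1 (the hat
    weights used in eq. 16)."""
    total = sum(scores.values()) or 1.0      # guard against all-zero scores
    return {k: v / total for k, v in scores.items()}
```

For example, `update_weights({"text": E_text, "image": E_image})` yields the ω̂_D and ω̂_I of the next iteration.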
    </sec>
    <sec id="sec-5">
      <title>Sequential approach with pre-filtering and re-ordering</title>
      <p>This section describes how to interact with both modalities sequentially, following the user's perceived semantics. Since a query can be represented by both keywords and visual features, it can be initiated either by a keyword-based search or by a visual example-image search. We consider a pre-filtering and re-ranking approach, in which the image search is performed on the filtered image set obtained beforehand by the textual search. It is more appropriate to perform the text-based search first, owing to its higher-level information content, and to use the visual-only search later to refine or re-rank the top images returned by the textual search. In this method, combining the results of the text-based and image-based retrieval amounts to re-ranking or re-ordering the images of a text-based pre-filtered result set. The steps involved in this approach are as follows:</p>
      <p>
        Step 1: Initially, for a multi-modal query q with a document part Dq, perform a textual search with the vector f_Dq and rank the images based on the ranking of the associated annotation files, by applying equation (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ).
      </p>
      <p>Step 2: Obtain the user's feedback on the relevant and irrelevant images among the top K = 30 retrieved images for the textual query refinement.</p>
      <p>Step 3: Calculate the refined textual query vectors f_Dq^m(Rocchio), f_Dq^m(Ide), f_Dq^m(Local1), and f_Dq^m(Local2).</p>
      <p>
        Step 4: Re-submit the modified query vectors to the text engine and merge the results with equal weighting in the similarity matching of equation (
        <xref ref-type="bibr" rid="ref14">14</xref>
        ).
      </p>
      <p>
        Step 5: Continue steps 2 to 4 with the weights dynamically updated based on equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ), until the user switches to the visual-only search.
      </p>
      <p>Step 6: Extract the different features f_Iq^global, f_Iq^sg, f_Iq^local, and f_Iq^V-concept for the multi-modal query q with an image part Iq.</p>
      <p>
        Step 7: Perform the visual-only search in the top L = 1000 images retrieved by the text-based search and rank them based on the similarity values obtained by applying equation (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ) with equal feature weighting.
      </p>
      <p>Step 8: Obtain the user's feedback on the relevant images among the top K = 30 retrieved images and perform the visual query refinement f_Iq^m,x, where x ∈ {global, sg, V-concept}, at that particular iteration.</p>
      <p>
        Step 9: At the next iteration, calculate the feature weights based on equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ) and apply them in equation (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ) to obtain the ranked retrieval result.
      </p>
      <p>Step 10: Continue steps 8 and 9 until the user is satisfied or the system converges.</p>
      <p>The process flow diagram of the sequential search approach is shown in Figure 1. In this approach, the text-based search with query reformulation is performed first, as shown in the left portion of the figure, and the image-based search is then performed on the filtered image set, as shown in the right portion of Figure 1.</p>
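The pre-filter and re-rank skeleton of the steps above can be sketched as follows (a Python illustration of the ranking plumbing only; the actual text and image scorers are the similarity functions of equations (1) and (13)):

```python
def sequential_retrieval(text_scores, image_score, L=1000, K=30):
    """Sequential search sketch: rank by text similarity, keep the top L
    as the pre-filtered set, then re-rank that set by visual similarity
    only and return the top K. text_scores maps an image id to its text
    similarity; image_score is a function from an image id to its visual
    similarity."""
    ranked = sorted(text_scores, key=text_scores.get, reverse=True)
    filtered = ranked[:L]                               # text pre-filtering
    reranked = sorted(filtered, key=image_score, reverse=True)
    return reranked[:K]                                 # visual re-ordering
```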
    </sec>
    <sec id="sec-6">
      <title>Simultaneous approach with linear combination</title>
      <p>This section describes our simultaneous multi-modal search approach. Here, the textual and content-based searches are performed simultaneously from the beginning, and the results are combined with the adaptive linear combination scheme described in Section 4. The steps involved in this approach are as follows:</p>
      <p>Step 1: Initially, for a multi-modal query q with a document part Dq and an image part Iq, extract the textual query vector f_Dq and the different image feature vectors f_Iq^global, f_Iq^sg, f_Iq^local, and f_Iq^V-concept.</p>
      <p>
        Step 2: Perform a multi-modal search to rank the images based on equation (
        <xref ref-type="bibr" rid="ref12">12</xref>
        ), where Sim_D(Dq, Dj) is initially computed as Sim_text(Dq, Dj) of equation (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) and Sim_I(Iq, Ij) is computed with equation (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ), with initially equal inter- and intra-modality weights.
      </p>
      <p>Step 3: Obtain the user's feedback on the relevant and irrelevant images among the top K = 30 retrieved images, both for the textual and visual query refinements and for dynamically updating the weights.</p>
      <p>
        Step 4: Based on the feedback information, calculate the refined textual query vectors f_Dq^m(Rocchio), f_Dq^m(Ide), f_Dq^m(Local1), and f_Dq^m(Local2) and the image query vectors f_Iq^m,x, where x ∈ {global, sg, V-concept}, and update the inter- and intra-modality weights based on equation (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ).
      </p>
      <p>
        Step 5: Re-submit the modified textual and image query vectors to the system and apply the multi-modal similarity matching of equation (
        <xref ref-type="bibr" rid="ref16">16</xref>
        ), where Sim_D(Dq, Dj) is computed with equation (
        <xref ref-type="bibr" rid="ref14">14</xref>
        ) and Sim_I(Iq, Ij) is computed with equation (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ).
      </p>
      <p>Step 6: Repeat steps 3 to 5 until the user is satisfied or the system converges.</p>
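      <p>The weight update of equation (15) is not reproduced here, but one plausible performance-based scheme consistent with the description is to give each similarity function weight in proportion to how strongly it separates the relevant from the irrelevant feedback images; this sketch is an assumption of ours, not the paper's exact formula.</p>
      <preformat>
```python
def update_weights(score_fns, query, relevant, irrelevant, eps=1e-6):
    """Hypothetical reweighting sketch: each similarity function earns
    weight proportional to its mean score gap between the relevant and
    irrelevant feedback images; weights are normalized to sum to 1."""
    raw = []
    for f in score_fns:
        pos = sum(f(query, d) for d in relevant) / max(len(relevant), 1)
        neg = sum(f(query, d) for d in irrelevant) / max(len(irrelevant), 1)
        raw.append(max(pos - neg, eps))  # floor keeps every weight positive
    total = sum(raw)
    return [r / total for r in raw]
```
      </preformat>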
      <p>The process flow diagram of the above multi-modal simultaneous search approach is shown in
Figure 2. For this approach, both text and image-based searches are performed simultaneously, as
shown in the left and right portions of Figure 2.</p>
    </sec>
    <sec id="sec-6-1">
      <title>Analysis of the submitted runs</title>
      <p>The types and performances of the different runs are shown in Table 1 and Table 2 for the ad-hoc
retrieval of the photographic and medical collections respectively. In all these runs, only English
is used as the source and target language without any translation for the text-based retrieval
approach. We submitted ¯ve di®erent runs for the ad-hoc retrieval of the photographic collection,
where ¯rst two runs are based on text only search and last three runs are based on mixed modality
search as shown in Table 1. For the ¯rst run \CINDI-TXT-ENG-PHOTO", we performed only a
manual text-based search without any query expansion as our base run. This run achieved a MAP
score of 0.1529 and ranked 140th out of 476 submissions (e.g., within the top 30%). Our second
run \CINDI-TXT-QE-PHOTO" achieved the best MAP score (0.2637) among all our submitted
runs and ranked 21st for this year competition. In this run, we performed two iterations of manual
feedback for textual query expansion and combination based on dynamic weight update schemes
for text only retrieval as described in Sections 2 and 4. The rest of the runs are based on
multimodal approach, where in the third run \CINDI-TXT-QE-IMG-RF-RERANK", we performed
the sequential approach with pre-¯ltering and re-ordering as described in subsection 5 with two
iterations of manual feedback in both text and image-based searches. However, the re-ordering
approach did not improve the result as a whole (e.g., ranked 32nd) in terms of MAP score (0.2336)
as compared to the only textual query expansion approach of our best run. The main reason might
be due to the fact that the majority of the query topics are more semantically oriented, where
visual search is not suitable or feasible at all. However, this run might perform well where queries
have both textual and distinct visual properties, such as query topic number 15 as \night shots
of cathedrals" or query topic number 24 as \snowcapped building in Europe". For the fourth run
\CINDI-TXTIMG-FUSION-PHOTO", we performed a simultaneous retrieval approach without
any feedback information with a linear combination of weights as !D = 0:7 and !I = 0:3 and
for the ¯fth run \CINDI-TXTIMG-RF-PHOTO", two iterations of manual relevance feedback are
performed as described in Section 6. However, these two runs did not perform well in terms of
MAP score as compared to the sequential approach due to early combination and nature of the
queries as described earlier.</p>
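      <p>The MAP scores quoted in this section follow the standard trec-eval definition; for reference, average precision for a single query, and its mean over queries, can be computed as:</p>
      <preformat>
```python
def average_precision(ranked_ids, relevant_ids):
    """Sum of precision@k over the ranks k at which a relevant image
    appears, divided by the total number of relevant images."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / k
    return total / max(len(relevant_ids), 1)

def mean_average_precision(runs):
    """MAP: the mean of per-query average precision.
    `runs` maps each query id to a (ranking, relevant-set) pair."""
    return sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)
```
      </preformat>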
      <p>For the image retrieval task in the medical collections, we submitted seven runs this year.
However, due to a few errors (such as duplicate entries and reference images listed as 0.jpg in the result
sets), three of our runs could not produce a performance report when evaluated with the trec-eval
program. This is mainly due to directly using reference images from the annotation
files instead of using the provided link XML file. We are currently fixing this problem and will
later analyze and report the results of these runs. Table 2 shows the official results of the four
runs out of our seven submitted runs. In the first run, "CINDI-IMG-FUSION", we performed
a visual-only search based on the various image feature representation schemes described
in Section 3, without any feedback information and with a linear combination of equal feature
weights. For the second run, "CINDI-IMG-FUSION-RF", we performed one iteration of
manual feedback for visual query refinement and combined the similarity matching functions
based on the dynamic weight updating scheme. For this run we achieved a MAP score of 0.0372,
which is slightly better than the score (0.0333) achieved by the first run without any relevance
feedback information. However, compared to the text-based approaches, the performances
are very low, as was observed in previous years of ImageCLEFmed. For the third run,
"CINDI-TXT-IMAGE-LINEAR", we performed a simultaneous retrieval approach without any feedback
information, with a linear combination of weights ω_D = 0.7 and ω_I = 0.3, and for the fourth
run, "CINDI-TXT-IMG-RF-LINEAR", two iterations of manual relevance feedback were performed,
similar to the last two runs of the photographic retrieval task. From Table 2, it is clear that combining
both modalities for the medical retrieval task is far better than using only a single modality (e.g.,
only image); we achieved our best MAP score for this task, 0.1483, among all our submissions.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>This paper presents the ad-hoc image retrieval approaches of the CINDI research group for
ImageCLEF 2007. We submitted several runs with different combinations of methods, features, and
parameters, and investigated cross-modal interaction and fusion approaches for the retrieval
of the photographic and medical image collections. The descriptions of the runs and the analysis of
the results are discussed in this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grubinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Overview of the ImageCLEF 2007 Photographic Retrieval Task</article-title>
          ,
          <source>Working Notes of the 2007 CLEF Workshop</source>
          , Budapest, Hungary, Sep.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          , T. Deselaers, E. Kim,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kalpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jayashree</surname>
          </string-name>
          , M. Thomas,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          , W. Hersh,
          <article-title>Overview of the ImageCLEFmed 2007 Medical Retrieval and Annotation Tasks</article-title>
          ,
          <source>Working Notes of the 2007 CLEF Workshop</source>
          , Budapest, Hungary, Sep.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          , Modern Information Retrieval, Addison Wesley,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <article-title>Improving retrieval performance by relevance feedback</article-title>
          ,
          <source>Journal of the American Society for Information Science</source>
          , vol.
          <volume>41</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>288</fpage>–<lpage>297</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval</article-title>
          ,
          <source>IEEE Trans. Circuits Syst. Video Technol.</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>644</fpage>–<lpage>655</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.J.</given-names>
            <surname>Rocchio</surname>
          </string-name>
          ,
          <article-title>Relevance feedback in information retrieval</article-title>
          .
          <source>In The SMART Retrieval System - Experiments in Automatic Document Processing</source>
          , pp.
          <fpage>313</fpage>–<lpage>323</lpage>
          , Englewood Cliffs, NJ, Prentice Hall, Inc.
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ide</surname>
          </string-name>
          ,
          <article-title>New experiments in relevance feedback</article-title>
          ,
          <source>In The SMART retrieval system - Experiments in Automatic Document Processing</source>
          , pp.
          <fpage>337</fpage>–<lpage>354</lpage>
          .
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ogawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Morita</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <article-title>A fuzzy document retrieval system using the keyword connection matrix and a learning method</article-title>
          ,
          <source>Fuzzy Sets and Systems</source>
          , vol.
          <volume>39</volume>
          , pp.
          <fpage>163</fpage>–<lpage>179</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ishikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Subramanya</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          ,
          <article-title>MindReader: Querying Databases Through Multiple Examples</article-title>
          ,
          <source>Proc. 24th Internat. Conf. on Very Large Databases</source>
          , New York, pp.
          <fpage>24</fpage>–<lpage>27</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.A.</given-names>
            <surname>Fox</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.A.</given-names>
            <surname>Shaw</surname>
          </string-name>
          , Combination of Multiple Searches,
          <source>Proc. of the 2nd Text Retrieval Conference (TREC-2)</source>
          ,
          <source>NIST Special Publication 500-215</source>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>252</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Combining Multiple Evidence from Di®erent Properties of Weighting Schemes</article-title>
          ,
          <source>Proc. of the 18th Annual ACM-SIGIR</source>
          , pp.
          <fpage>180</fpage>–<lpage>188</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Attar</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.S.</given-names>
            <surname>Fraenkel</surname>
          </string-name>
          ,
          <article-title>Local feedback in full-text retrieval systems</article-title>
          ,
          <source>Journal of ACM</source>
          , vol.
          <volume>24</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>397</fpage>–<lpage>417</lpage>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Smeulders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Content-Based Image Retrieval at the End of the Early Years</article-title>
          ,
          <source>IEEE Trans. on Pattern Anal. and Machine Intell</source>
          ., vol.
          <volume>22</volume>
          , pp.
          <fpage>1349</fpage>–<lpage>1380</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.S.</given-names>
            <surname>Manjunath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Salembier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          , (eds.),
          <source>Introduction to MPEG-7: Multimedia Content Description Interface</source>
          , John Wiley &amp; Sons Ltd., pp.
          <fpage>187</fpage>–<lpage>212</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.M.</given-names>
            <surname>Haralick</surname>
          </string-name>
          ,
          <string-name>
            <surname>Shanmugam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Dinstein</surname>
          </string-name>
          ,
          <article-title>Textural features for image classification</article-title>
          ,
          <source>IEEE Trans System, Man, Cybernetics</source>
          , vol.
          <volume>3</volume>
          , pp.
          <fpage>610</fpage>–<lpage>621</lpage>
          ,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.C.</given-names>
            <surname>Desai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <article-title>A Feature Level Fusion in Similarity Matching to Content-Based Image Retrieval</article-title>
          ,
          <source>Proc. 9th Internat Conf Information Fusion</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kohonen</surname>
          </string-name>
          ,
          <source>Self-Organizing Maps</source>
          , Springer-Verlag, Heidelberg. 2nd ed.
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukunaga</surname>
          </string-name>
          , Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <article-title>Regularized Discriminant Analysis</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          , vol.
          <volume>84</volume>
          , pp.
          <fpage>165</fpage>–<lpage>175</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>