<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CUHK Experiments with ImageCLEF 2005</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Steven C.H. Hoi, Jianke Zhu and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T.</institution>
          ,
          <country country="HK">Hong Kong</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our empirical studies of cross-language and cross-media retrieval for the ImageCLEF 2005 competition and summarizes the work of CUHK (The Chinese University of Hong Kong) at ImageCLEF 2005. This is the first participation of our group at ImageCLEF. The task we participated in this year is the "Bilingual ad hoc retrieval" task. Our participation has three major focuses and contributions. The first is the empirical evaluation of language models and smoothing strategies for cross-language image retrieval. The second is the evaluation of cross-media image retrieval, i.e., combining text and visual content for image retrieval. The last is the evaluation of bilingual image retrieval between English and Chinese. We provide an empirical analysis of the experimental results. In the official results of the Bilingual ad hoc retrieval task, we achieved the highest MAP (0.4135) for monolingual queries among all organizations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Visual information retrieval has been an active research topic for many years. Although
content-based image retrieval (CBIR) has received considerable attention in the community [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], few
benchmark image datasets have been available so far. The CLEF (Cross Language Evaluation Forum)
organization [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] began the ImageCLEF campaign in 2003 for benchmark evaluation of
cross-language image retrieval [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. ImageCLEF 2005 offers four different tasks: bilingual ad hoc retrieval,
interactive search, medical image retrieval, and automatic image annotation. This is the first
participation of our CUHK group (The Chinese University of Hong Kong) at ImageCLEF. The
task we participated in this year is "Bilingual ad hoc retrieval".
      </p>
      <p>
        In the past decade, traditional information retrieval mainly focused on document retrieval
problems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. With the growing attention to multimedia information retrieval in recent
years, cross-language and cross-media retrieval has been put forward as an important research
topic in the community [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Cross-language image retrieval tackles the multimodal
information retrieval task by unifying techniques from traditional information retrieval, natural
language processing (NLP), and traditional CBIR solutions.
      </p>
      <p>In this participation, our main contributions cover three aspects. The first is the empirical
evaluation of language models and smoothing strategies for cross-language image retrieval. The
second is the evaluation of cross-media image retrieval, i.e., combining text and visual content for
image retrieval. The last is the methodology and empirical evaluation of bilingual image
retrieval between English and Chinese.</p>
      <p>The rest of this paper is organized as follows. Section 2 introduces the TF-IDF retrieval
model and the language-model-based retrieval methods. Section 3 describes the details of our
implementation for this participation and outlines our empirical study of the cross-language and
cross-media retrieval system. Finally, Section 4 concludes our work.</p>
    </sec>
    <sec id="sec-2">
      <title>Language Models for Text Based Image Retrieval</title>
      <p>
        In this participation, we conducted extensive experiments to evaluate the performance of language
models and the influence of different smoothing strategies. More specifically, two kinds of retrieval
models are studied in our experiments: (1) the TF-IDF retrieval model and (2) the KL-divergence
language-model-based method. Three smoothing strategies for language models are evaluated in
our experiments [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]: (1) Jelinek-Mercer (JM), (2) Dirichlet prior (DIR), and (3) absolute discounting
(ABS).
      </p>
      <sec id="sec-2-1">
        <title>TF-IDF Similarity Measure for Information Retrieval</title>
        <p>
          We incorporate language models (LM) together with the TF-IDF similarity measure [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. TF-IDF is
widely used in information retrieval as a way of weighting the relevance of a query to a
document. The main idea of TF-IDF is to represent each document by a vector whose size equals that of the
overall vocabulary. Each document <inline-formula><tex-math>D_i</tex-math></inline-formula> is then represented as a vector <inline-formula><tex-math>(w_{i1}, w_{i2}, \ldots, w_{in})</tex-math></inline-formula>, where <inline-formula><tex-math>n</tex-math></inline-formula> is
the size of the vocabulary. The entry <inline-formula><tex-math>w_{ij}</tex-math></inline-formula> is calculated as:
        </p>
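        <p>
          <disp-formula id="eq1"><label>(1)</label><tex-math>w_{ij} = TF_{ij} \times \log(IDF_j)</tex-math></disp-formula>
where <inline-formula><tex-math>TF_{ij}</tex-math></inline-formula> is the term frequency of the jth word of the vocabulary in the document <inline-formula><tex-math>D_i</tex-math></inline-formula>, i.e., the
number of occurrences. <inline-formula><tex-math>IDF_j</tex-math></inline-formula> is the inverse document frequency of the jth term, given as
          <disp-formula id="eq2"><label>(2)</label><tex-math>IDF_j = \frac{\#\text{documents}}{\#\text{documents containing the } j\text{th term}}</tex-math></disp-formula>
        </p>
        <p>The similarity between two documents is then defined as the cosine of the angle between the two
vectors.</p>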
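        <p>The weighting and similarity above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in our runs: the function names <monospace>tfidf_vectors</monospace> and <monospace>cosine</monospace> and the three toy documents are assumptions made for the example, and the natural logarithm is assumed for the log in Equation (1).</p>
        <preformat>
```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors w_ij = TF_ij * log(IDF_j) for tokenized docs."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents (hypothetical, for illustration only).
docs = [
    "aircraft on the ground".split(),
    "small sailing boat".split(),
    "aircraft flying over boat".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # no shared terms
sim_02 = cosine(vecs[0], vecs[2])  # share the term "aircraft"
```
        </preformat>
        <p>In this toy example the first and second documents share no terms, so their similarity is zero, while the first and third share one term and obtain a positive similarity.</p>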
      </sec>
      <sec id="sec-2-2">
        <title>Language Modeling for Information Retrieval</title>
        <p>
          A statistical language model, or more simply a language model, is a probabilistic mechanism
for generating text. The first serious statistical language modeler was Claude Shannon [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In
exploring the application of his newly founded theory of information to human language, thought of
purely as a statistical source, Shannon measured how well simple n-gram models did at predicting,
or compressing, natural text. In the past several years there has been significant interest in the
use of language modeling methods for a variety of text retrieval and natural language processing
tasks [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
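        <p><bold>The KL-divergence Measure</bold></p>
        <p>Given two probability mass functions <inline-formula><tex-math>p(x)</tex-math></inline-formula> and <inline-formula><tex-math>q(x)</tex-math></inline-formula>, the Kullback-Leibler (KL) divergence
(or relative entropy) between p and q, written <inline-formula><tex-math>D(p \| q)</tex-math></inline-formula>, is defined as
          <disp-formula id="eq3"><label>(3)</label><tex-math>D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}</tex-math></disp-formula>
        </p>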
        <p>
          One can show that <inline-formula><tex-math>D(p \| q)</tex-math></inline-formula> is always non-negative and is zero if and only if p = q. Even though
it is not a true distance between distributions (because it is not symmetric and does not satisfy the
triangle inequality), it is still often useful to think of the KL-divergence as a "distance" between
distributions [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
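        <p>The properties just stated (non-negativity, zero for identical distributions, and asymmetry) can be checked numerically. The following is a minimal sketch; the function name <monospace>kl_divergence</monospace> and the two toy distributions are assumptions, and q(x) is assumed to be positive wherever p(x) is positive.</p>
        <preformat>
```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)), with 0 * log 0 taken as 0.

    Assumes q[x] is positive for every x where p[x] is positive.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        total += px * math.log(px / q[x])
    return total

# Toy distributions over two outcomes (hypothetical values).
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pp = kl_divergence(p, p)
```
        </preformat>
        <p>Here <monospace>d_pp</monospace> is zero, while <monospace>d_pq</monospace> and <monospace>d_qp</monospace> are both positive but differ from each other, which illustrates the lack of symmetry.</p>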
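        <p><bold>The KL-divergence based Retrieval Model</bold></p>
        <p>
          For the language modeling approach, we assume that a query q is generated by a generative model
<inline-formula><tex-math>p(q \mid \theta_Q)</tex-math></inline-formula>, where <inline-formula><tex-math>\theta_Q</tex-math></inline-formula> denotes the parameters of the query unigram language model. Similarly, we
assume that a document d is generated by a generative model <inline-formula><tex-math>p(d \mid \theta_D)</tex-math></inline-formula>, where <inline-formula><tex-math>\theta_D</tex-math></inline-formula> denotes the
parameters of the document unigram language model. Let <inline-formula><tex-math>\hat{\theta}_Q</tex-math></inline-formula> and <inline-formula><tex-math>\hat{\theta}_D</tex-math></inline-formula> be the estimated query
and document language models, respectively. The relevance value of d with respect to q can be
measured by the following negative KL-divergence function [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]:
          <disp-formula id="eq4"><label>(4)</label><tex-math>-D(\hat{\theta}_Q \| \hat{\theta}_D) = \sum_{w} p(w \mid \hat{\theta}_Q) \log p(w \mid \hat{\theta}_D) + \Big( -\sum_{w} p(w \mid \hat{\theta}_Q) \log p(w \mid \hat{\theta}_Q) \Big)</tex-math></disp-formula>
        </p>
        <p>
          In the above formula, the second term on the right-hand side is a
query-dependent constant, i.e., the entropy of the query model <inline-formula><tex-math>\hat{\theta}_Q</tex-math></inline-formula>. It can be ignored for ranking
purposes. In general, we consider the following smoothing scheme for the estimated document model:
          <disp-formula id="eq5"><label>(5)</label><tex-math><![CDATA[p(w \mid \hat{\theta}_D) = \begin{cases} p_s(w \mid d) & \text{if word } w \text{ is seen} \\ \alpha_d \, p(w \mid C) & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
where <inline-formula><tex-math>p_s(w \mid d)</tex-math></inline-formula> is the smoothed probability of a word seen in the document, <inline-formula><tex-math>p(w \mid C)</tex-math></inline-formula> is the collection
language model, and <inline-formula><tex-math>\alpha_d</tex-math></inline-formula> is a coefficient controlling the probability mass assigned to unseen words,
so that all probabilities sum to one [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In the subsequent section, we discuss several smoothing
techniques in detail.
        </p>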
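        <p>As a rough illustration of how this ranking function might be computed, the sketch below scores documents by the first (rank-relevant) term of the negative KL-divergence, using the general smoothed form of the document model. It is not the Lemur implementation; the helper names, the maximum likelihood query model, and all probability values are hypothetical and chosen only so that each toy document model sums to one.</p>
        <preformat>
```python
import math
from collections import Counter

def query_model(query_tokens):
    """Maximum likelihood estimate of the query unigram model (an assumption)."""
    counts = Counter(query_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def make_doc_prob(seen_probs, alpha_d, collection_probs):
    """General smoothed form: p_s(w|d) if w is seen, else alpha_d * p(w|C)."""
    def doc_prob(w):
        if w in seen_probs:
            return seen_probs[w]
        return alpha_d * collection_probs[w]
    return doc_prob

def rank_score(q_model, doc_prob):
    """Rank-relevant part of -D(Q || D): sum_w p(w|Q) * log p(w|D)."""
    return sum(p * math.log(doc_prob(w)) for w, p in q_model.items())

# Toy collection model and two toy documents (hypothetical numbers).
collection = {"aircraft": 0.25, "ground": 0.25, "boat": 0.5}
doc_a = make_doc_prob({"aircraft": 0.6, "ground": 0.3}, 0.2, collection)
doc_b = make_doc_prob({"boat": 0.8}, 0.4, collection)

q = query_model("aircraft ground".split())
score_a = rank_score(q, doc_a)
score_b = rank_score(q, doc_b)
```
        </preformat>
        <p>The document that contains the query words receives the higher score, so it would be ranked first; the query entropy term is omitted because it is the same for every document.</p>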
      </sec>
      <sec id="sec-2-3">
        <title>Several Smoothing Techniques</title>
        <p>A smoothing method may be as simple as adding an extra count to every word, or it may treat words of
different counts differently. In order to solve the problem efficiently, we select three
representative methods that are popular and relatively efficient. The three methods are described
below.</p>
        <p><bold>Jelinek-Mercer (JM)</bold></p>
        <p>This method involves a linear interpolation of the maximum likelihood model with the collection
model, using a coefficient <inline-formula><tex-math>\lambda</tex-math></inline-formula> to control the influence of each model:
          <disp-formula id="eq6"><label>(6)</label><tex-math>p_{\lambda}(\omega \mid d) = (1 - \lambda)\, p_{ml}(\omega \mid d) + \lambda\, p(\omega \mid C)</tex-math></disp-formula>
        </p>
        <p>Thus, this is a simple mixture model (but we preserve the name of the more general
Jelinek-Mercer method, which involves deleted-interpolation estimation of linearly interpolated n-gram
models).</p>
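        <p>A minimal sketch of Jelinek-Mercer smoothing for a single document follows. The function name <monospace>jelinek_mercer</monospace>, the toy collection model, and the value of the coefficient are assumptions made for illustration.</p>
        <preformat>
```python
from collections import Counter

def jelinek_mercer(doc_tokens, collection_probs, lam):
    """Return a function w -> (1 - lam) * p_ml(w|d) + lam * p(w|C)."""
    counts = Counter(doc_tokens)
    doc_len = sum(counts.values())
    def prob(w):
        p_ml = counts[w] / doc_len  # Counter gives 0 for unseen words
        return (1.0 - lam) * p_ml + lam * collection_probs[w]
    return prob

# Toy collection model and document (hypothetical values).
collection = {"aircraft": 0.2, "ground": 0.3, "boat": 0.5}
doc = "aircraft ground aircraft ground".split()

p_jm = jelinek_mercer(doc, collection, lam=0.5)
total_jm = sum(p_jm(w) for w in collection)
```
        </preformat>
        <p>With these toy values the unseen word "boat" receives probability 0.25, and the probabilities over the vocabulary sum to one.</p>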
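        <p><bold>Dirichlet prior (DIR)</bold></p>
        <p>A language model is a multinomial distribution, for which the conjugate prior for Bayesian analysis
is the Dirichlet distribution with parameters <inline-formula><tex-math>(\mu p(\omega_1 \mid C), \mu p(\omega_2 \mid C), \ldots, \mu p(\omega_n \mid C))</tex-math></inline-formula>. Thus, the model
is given by
          <disp-formula id="eq7"><label>(7)</label><tex-math>p_{\mu}(\omega \mid d) = \frac{c(\omega; d) + \mu\, p(\omega \mid C)}{\sum_{\omega} c(\omega; d) + \mu}</tex-math></disp-formula>
The Laplace method is a special case of this technique.</p>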
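        <p>The Dirichlet prior estimate can be sketched in the same style. Again, the function name <monospace>dirichlet_prior</monospace> and the toy numbers are assumptions for illustration only.</p>
        <preformat>
```python
from collections import Counter

def dirichlet_prior(doc_tokens, collection_probs, mu):
    """Return a function w -> (c(w; d) + mu * p(w|C)) / (|d| + mu)."""
    counts = Counter(doc_tokens)
    doc_len = sum(counts.values())
    def prob(w):
        return (counts[w] + mu * collection_probs[w]) / (doc_len + mu)
    return prob

# Toy collection model and document (hypothetical values).
collection = {"aircraft": 0.2, "ground": 0.3, "boat": 0.5}
doc = "aircraft ground aircraft ground".split()

p_dir = dirichlet_prior(doc, collection, mu=4.0)
total_dir = sum(p_dir(w) for w in collection)
```
        </preformat>
        <p>Because the pseudo-count mass is divided among words in proportion to the collection model, longer documents are smoothed less for a fixed value of the prior parameter.</p>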
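        <p><bold>Absolute discounting (ABS)</bold></p>
        <p>The idea of the absolute discounting method is to lower the probability of seen words by subtracting
a constant from their counts. It is similar to the Jelinek-Mercer method, but differs in that it
discounts the seen-word probability by subtracting a constant instead of multiplying it by <inline-formula><tex-math>1 - \lambda</tex-math></inline-formula>.
The model is given by
          <disp-formula id="eq8"><label>(8)</label><tex-math>p_{\delta}(\omega \mid d) = \frac{\max(c(\omega; d) - \delta, 0)}{\sum_{\omega} c(\omega; d)} + \sigma\, p(\omega \mid C)</tex-math></disp-formula>
where <inline-formula><tex-math>\delta \in [0, 1]</tex-math></inline-formula> is a discount constant and <inline-formula><tex-math>\sigma = \delta |d|_u / |d|</tex-math></inline-formula>, so that all probabilities sum to one.
Here <inline-formula><tex-math>|d|_u</tex-math></inline-formula> is the number of unique terms in document d, and <inline-formula><tex-math>|d|</tex-math></inline-formula> is the total count of words in the
document, so that <inline-formula><tex-math>|d| = \sum_{\omega} c(\omega; d)</tex-math></inline-formula>.</p>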
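        <p>Absolute discounting can be sketched as follows, with the interpolation weight computed from the discount, the number of unique terms, and the document length as defined above. The function name <monospace>absolute_discount</monospace> and the toy values are assumptions.</p>
        <preformat>
```python
from collections import Counter

def absolute_discount(doc_tokens, collection_probs, delta):
    """Return w -> max(c(w; d) - delta, 0) / |d| + sigma * p(w|C)."""
    counts = Counter(doc_tokens)
    doc_len = sum(counts.values())
    # sigma = delta * (number of unique terms) / (document length)
    sigma = delta * len(counts) / doc_len
    def prob(w):
        discounted = max(counts[w] - delta, 0.0) / doc_len
        return discounted + sigma * collection_probs[w]
    return prob

# Toy collection model and document (hypothetical values).
collection = {"aircraft": 0.2, "ground": 0.3, "boat": 0.5}
doc = "aircraft ground aircraft ground".split()

p_abs = absolute_discount(doc, collection, delta=0.5)
total_abs = sum(p_abs(w) for w in collection)
```
        </preformat>
        <p>The mass removed from the seen words (the discount times the number of unique terms, divided by the document length) is exactly the weight given to the collection model, which is why the probabilities still sum to one.</p>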
        <p>The three methods are summarized in Table 1 in terms of <inline-formula><tex-math>p_s(\omega \mid d)</tex-math></inline-formula> and <inline-formula><tex-math>\alpha_d</tex-math></inline-formula> in the general form.
It is easy to see that a larger parameter value means more smoothing in all cases. Retrieval using any
of the three methods can be implemented very efficiently when the smoothing parameter is given in advance;
it is as efficient as scoring using a TF-IDF model.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Cross-Language and Cross-Media Image Retrieval</title>
      <p>In this section, we describe the experimental setup and our experimental development at
ImageCLEF 2005. In addition, we analyze the results of our submission.</p>
      <sec id="sec-3-1">
        <title>Experimental Setup</title>
        <p>
          The bilingual ad hoc retrieval task is to find as many relevant images as possible for each given
topic. The St. Andrews collection is used as the benchmark dataset in the campaign. The collection
consists of 28,133 images, each associated with a textual caption written in British English
(the target language). Each caption consists of 8 fields, including title, photographer, location,
date, and one or more pre-defined categories (all manually assigned by domain experts). In the
ImageCLEF 2005 campaign, there are a total of 28 queries for each language. For each query, two
sample images are given. Figure 1 shows an example query from the campaign, with its title and
narrative text.
        </p>
        <preformat>&lt;num&gt; Number: 1 &lt;/num&gt;
&lt;title&gt; aircraft on the ground &lt;/title&gt;
&lt;narr&gt; Relevant images will show one or more
airplanes positioned on the ground. Aircraft do not
have to be the focus of the picture, although it should
be possible to make out that the picture contains
aircraft. Pictures of aircraft flying are not relevant and
pictures of any other flying object (e.g. birds) are not
relevant. &lt;/narr&gt;
&lt;/top&gt;</preformat>
        <p>
          For the Bilingual ad hoc retrieval task, we studied queries in English and Chinese
(simplified). Both text and visual information are used in our experiments. To study the language
models, we employ the Lemur toolkit [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] in our experiments. A list of standard stopwords is used
in the parsing step.
        </p>
        <p>To evaluate the influence of different schemes on performance, we produced results
using different configurations. Table 2 shows the configurations and the experimental results in
detail. In total, we submitted 36 runs with different configurations.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Analysis on the Experimental Results</title>
        <p>In this part, we empirically analyze the experimental results of our submission. The goal of our
evaluation is to check whether the language model is effective for cross-language image retrieval
and which smoothing techniques achieve better performance. Moreover, we would like to compare
the performance of the Chinese queries with that of the monolingual queries.</p>
        <p>[Figure placeholder: plot with a Recall axis, ticks from 0 to 1.]</p>
        <p>LM denotes Language Model, KL denotes Kullback-Leibler divergence based, DIR denotes
smoothing using the Dirichlet prior, ABS denotes smoothing using absolute discounting, and
JM denotes Jelinek-Mercer smoothing.</p>
        <p><bold>Empirical Analysis of Language Models</bold></p>
        <p>
          To deal with the Chinese queries for retrieving English documents, we first adopt a Chinese
segmentation tool from the Linguistic Data Consortium (LDC) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], i.e., the "LDC Chinese
segmenter"<sup>1</sup>, to extract the Chinese words from the given query sentences. The segmentation step
is important for effective query translation. Figure 4 shows the Chinese segmentation results
for some of the queries. We can see that the results can still be improved.
        </p>
        <p>
          For the bilingual query translation, the second step is to translate the extracted Chinese words
into English words using a Chinese-English dictionary. In our experiment, we employ the LDC
Chinese-to-English Wordlist [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for the translations. The final translated queries are obtained by
combining the translation results.
        </p>
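        <p>The dictionary-based translation step can be pictured with the following sketch. It does not read the LDC wordlist; the tiny dictionary, the function name <monospace>translate_query</monospace>, and the choice to skip words that have no dictionary entry are all assumptions made for illustration.</p>
        <preformat>
```python
def translate_query(segmented_words, wordlist):
    """Combine the dictionary translations of each segmented Chinese word."""
    english_terms = []
    for word in segmented_words:
        # Assumption of this sketch: words missing from the wordlist are skipped.
        english_terms.extend(wordlist.get(word, []))
    return english_terms

# Hypothetical two-entry dictionary, not taken from the LDC wordlist.
toy_wordlist = {
    "地面": ["ground", "floor"],
    "飞机": ["aircraft", "airplane"],
}

# Segmentation of the first query as shown in the segmentation results.
segmented = ["地面", "上", "的", "飞机"]
translated = translate_query(segmented, toy_wordlist)
```
        </preformat>
        <p>Keeping every candidate translation makes the translated query longer and noisier than the monolingual one, which is one plausible source of the performance gap discussed next.</p>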
        <p>
          From the experimental results shown in Table 2, we can observe that the mean average precision
of the Chinese-to-English queries is about half that of the monolingual queries. There are many
ways to improve the performance. One is to improve the Chinese segmentation algorithm. Some
post-processing tricks may also be effective for improving the performance. Moreover, the translation
results can be further refined. Better results may be obtained by adopting some natural language
processing techniques [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p><bold>Cross-Media Retrieval: Re-Ranking Scheme with Text and Visual Content</bold></p>
        <p>In this participation, we study the combination of text and visual content for cross-media image
retrieval. In our development, we propose a re-ranking scheme that combines text and
visual content. For a given query, we first rank the images using the language modeling
techniques. We then re-rank the top-ranked images by measuring their visual
similarity to the query.</p>
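        <p>A possible shape of such a re-ranking step is sketched below. The details are assumptions rather than a description of our submitted runs: the sketch uses Euclidean distance between feature vectors, takes the distance to the closest query sample image, and re-orders only the first <monospace>top_k</monospace> text results; the names <monospace>rerank</monospace> and <monospace>top_k</monospace> and the toy feature values are hypothetical.</p>
        <preformat>
```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rerank(text_ranking, features, query_features, top_k):
    """Re-order the top_k text results by visual distance to the query images.

    Images below the cut-off keep their text-based order.
    """
    head, tail = text_ranking[:top_k], text_ranking[top_k:]
    def visual_distance(image_id):
        # Distance to the closest query sample image (an assumed choice).
        return min(euclidean(features[image_id], q) for q in query_features)
    return sorted(head, key=visual_distance) + tail

# Toy text ranking and 2-dimensional features (hypothetical values).
text_ranking = ["img3", "img1", "img2", "img4"]
features = {
    "img1": [0.0, 0.0],
    "img2": [1.0, 1.0],
    "img3": [5.0, 5.0],
    "img4": [0.1, 0.1],
}
query_features = [[0.0, 0.0]]

reranked = rerank(text_ranking, features, query_features, top_k=3)
```
        </preformat>
        <p>In the toy example the last image stays in fourth place even though it is visually close to the query, because only the top three text results are re-ordered.</p>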
        <p><sup>1</sup> It can be downloaded from: http://www.ldc.upenn.edu/Projects/Chinese/seg.zip</p>
        <table-wrap id="fig4">
          <label>Figure 4</label>
          <caption><p>Chinese segmentation results for some of the queries.</p></caption>
          <table>
            <thead>
              <tr><th>No.</th><th>Chinese query</th><th>English title</th><th>Segmentation result</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>地面上的飞机</td><td>Aircraft on the ground</td><td>地面 上 的 飞机</td></tr>
              <tr><td>2</td><td>演奏台旁聚集的群众</td><td>People gathered at bandstand</td><td>演奏 台 旁 聚集 的 群众</td></tr>
              <tr><td>3</td><td>狗的坐姿</td><td>Dog in sitting position</td><td>狗 的 坐 姿</td></tr>
              <tr><td>4</td><td>靠码头的蒸汽船</td><td>Steam ship docked</td><td>靠 码头 的 蒸汽 船</td></tr>
              <tr><td>5</td><td>动物雕像</td><td>Animal statue</td><td>动物 雕像</td></tr>
              <tr><td>6</td><td>小帆船</td><td>Small sailing boat</td><td>小 帆船</td></tr>
              <tr><td>7</td><td>在船上的渔夫们</td><td>Fishermen in boat</td><td>在 船上 的 渔夫 们</td></tr>
              <tr><td>8</td><td>被雪覆盖的建筑物</td><td>Buildings covered in snow</td><td>被 雪 覆盖 的 建筑物</td></tr>
              <tr><td>9</td><td>马拉动运货车或四轮车的图片</td><td>Horse pulling cart or carriage</td><td>马拉 动 运 货车 或 四 轮 车 的 图片</td></tr>
              <tr><td>10</td><td>苏格兰的太阳</td><td>Sun pictures, Scotland</td><td>苏格兰 的 太阳</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-2-9">
          <p>In our experiment, two kinds of visual features are used: texture and color features. For
the texture feature, the discrete cosine transform (DCT) is used. Applying the DCT to an image
yields a set of coefficients, which weight the DCT basis functions and represent the texture of the image.
In our implementation, a block-DCT (block size 8x8) is
applied to the normalized input images, which generates a 256-dimensional DCT feature. For the
color feature, a 9-dimensional color moment is extracted for each image. In total, each image is
represented by a 265-dimensional feature vector.</p>
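          <p>For readers unfamiliar with color moments, the sketch below shows one common way to obtain a 9-dimensional color feature: the mean, standard deviation, and skewness (signed cube root of the third central moment) of each of three color channels. Which moments and which color space were used in our implementation is not specified here, so these choices, the function name <monospace>color_moments</monospace>, and the four-pixel toy image are assumptions.</p>
          <preformat>
```python
import math

def color_moments(pixels):
    """Return [mean, std, skewness] for each of three channels (9 values).

    pixels is a list of 3-tuples, one tuple of channel values per pixel.
    """
    n = len(pixels)
    feature = []
    for ch in range(3):
        values = [p[ch] for p in pixels]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        # Signed cube root keeps the sign of the third central moment.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        feature.extend([mean, math.sqrt(var), skew])
    return feature

# Toy image with four pixels (hypothetical channel values in [0, 1]).
pixels = [(0.0, 0.5, 1.0), (1.0, 0.5, 1.0), (0.0, 0.5, 0.0), (1.0, 0.5, 0.0)]
feature = color_moments(pixels)
```
          </preformat>
          <p>Such a vector would then be concatenated with the DCT texture feature to form the full image representation.</p>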
          <p>As shown in Table 2, the MAP of the query results using only visual information is about 6%,
which is much lower than the MAP of over 40% obtained with text information. From the experimental results, we
can observe that the re-ranking scheme produces only a marginal improvement over the text-only
approaches. Several reasons may explain these results. One is that the visual features used
are not effective enough to discriminate between the images. Another possible reason is that the ground-truth
images for a given query may not be quite different in visual content. It would be interesting to study
more effective features and learning methods for improving the performance.</p>
          <p><bold>Query Expansion for Information Retrieval</bold></p>
          <p>From the experimental results in Table 2, we observe that all the queries are improved
by adopting Query Expansion<sup>2</sup> (QE). The average improvement over all the queries is around
1.71%, which accounts for 4.12% of the maximum MAP of 41.35%. It is interesting to find that
QE is especially beneficial for the Jelinek-Mercer smoothing method: the mean gain with QE is
about 2.49%, which accounts for 6.02% of the maximum MAP of 41.35%.</p>
          <p><sup>2</sup> Query expansion refers to adding further terms to a text query (e.g., through PRF or a thesaurus) or images to
a visual query.</p>
          <p>In this paper, we reported our empirical studies of cross-language and cross-media image retrieval
at the ImageCLEF 2005 campaign. We addressed three major focuses and contributions in our
participation. The first is the empirical evaluation of language models and smoothing
strategies for cross-language image retrieval. The second is the evaluation of cross-media image
retrieval, i.e., combining text and visual content for image retrieval. The last is the evaluation
of bilingual image retrieval between English and Chinese. We conducted an empirical analysis
of the experimental results and provided an empirical summary of our participation.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] http://www.ldc.upenn.edu/projects/chinese/.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] http://www.lemurproject.org/.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>Berthier</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          .
          <article-title>Modern Information Retrieval</article-title>
          .
          <source>Addison Wesley</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mueller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          .
          <article-title>The CLEF cross language image retrieval track (ImageCLEF) 2004</article-title>
          .
          <source>In the Fifth Workshop of the Cross-Language Evaluation Forum (CLEF 2004) (LNCS)</source>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Cover</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Thomas</surname>
          </string-name>
          .
          <source>Elements of Information Theory</source>
          . Wiley
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <source>Foundations of Statistical Natural Language Processing</source>
          . The MIT Press,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Savoy</surname>
          </string-name>
          .
          <article-title>Report on CLEF-2001 experiments (Cross Language Evaluation Forum)</article-title>
          .
          <source>In LNCS 2406</source>
          , pages
          <fpage>27</fpage>
          –
          <lpage>43</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          .
          <article-title>Prediction and entropy of printed English</article-title>
          .
          <source>Bell Sys. Tech. Jour.</source>
          ,
          <volume>30</volume>
          :
          <fpage>51</fpage>
          –
          <lpage>64</lpage>
          ,
          <year>1951</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. W. M.</given-names>
            <surname>Smeulders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <article-title>Content-based image retrieval at the end of the early years</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>22</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1349</fpage>
          –
          <lpage>1380</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Chengxiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          and
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>Model-based feedback in the KL-divergence retrieval model</article-title>
          .
          <source>In Tenth International Conference on Information and Knowledge Management (CIKM2001)</source>
          , pages
          <fpage>403</fpage>
          –
          <lpage>410</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Chengxiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          and
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>A study of smoothing methods for language models applied to ad hoc information retrieval</article-title>
          .
          <source>In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01)</source>
          , pages
          <fpage>334</fpage>
          –
          <lpage>342</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>