<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Document Expansion for Text-based Image Retrieval at WikipediaMM 2010</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jinming Min</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Leveling</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gareth J. F. Jones</string-name>
          <email>gjonesg@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Next Generation Localisation School of Computing, Dublin City University Dublin 9</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2010</year>
      </pub-date>
      <abstract>
<p>We describe and analyze our participation in the WikipediaMM task at ImageCLEF 2010. Our approach is based on text-based image retrieval using information retrieval techniques on the metadata documents of the images. We submitted two English monolingual runs and one multilingual run. The monolingual runs used the query to retrieve the metadata documents with the query and document in the same language; the multilingual run used queries in one language to search the metadata provided in three languages. The main focus of our work was using the English query to retrieve images based on the English metadata. For these experiments the English metadata was expanded using an external resource, DBpedia. This study extended our application of document expansion in our previous participation in ImageCLEF 2009. In 2010 we combined document expansion with a document reduction technique which aimed to include only topically important words in the metadata. Our experiments used the Okapi feedback algorithm for document expansion and the Okapi BM25 model for retrieval. Experimental results show that combining document expansion with the document reduction method gives the best overall retrieval results.</p>
      </abstract>
      <kwd-group>
        <kwd>text-based image search</kwd>
        <kwd>metadata-based search</kwd>
        <kwd>relevance feedback</kwd>
        <kwd>document expansion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper describes our participation in the WikipediaMM task at CLEF 2010 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our approach to this task was based only on text retrieval using the metadata provided for each image. This is a challenging information retrieval (IR) task, since the image metadata usually contains fewer terms than would be found in the text documents more typically used in IR, which can lead to vocabulary mismatch between the user query and the image metadata. For our participation in CLEF 2010, we continued to explore the document expansion approach for this task which we employed at WikipediaMM 2009 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This year our document expansion method was combined with a document reduction technique.
      </p>
      <p>This paper is structured as follows: Section 2 introduces the retrieval model used in this work, Section 3 describes our document expansion and document reduction methods, Section 4 records and analyzes our experimental results, and finally Section 5 gives conclusions and directions for further work.</p>
    </sec>
    <sec id="sec-2">
      <title>Retrieval Model</title>
      <p>
        After testing different IR models on the text-based image retrieval task, we chose the tf-idf model in the Lemur toolkit1 as our baseline model for this task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The document term frequency (tf) weight used in the tf-idf model is:

tf(q_i, D) = (k1 · f(q_i, D)) / (f(q_i, D) + k1 · (1 - b + b · l_d / l_c))   (1)

where f(q_i, D) is the frequency of query term q_i in document D, l_d is the length of document D, l_c is the average document length of the collection, and k1 and b are fixed parameters set to 1.2 and 0.75 respectively for this task (the default values in the Lemur toolkit). The idf of a term is given by log(N / n_t), where N is the number of documents in the collection and n_t is the number of documents containing term t. The query term frequency function (qtf) is defined in the same way as the document tf, with a parameter representing the average query length. The score of document D against query Q is then given by:

s(D, Q) = Σ_{i=1}^{n} tf(q_i, D) · qtf(q_i, Q) · idf(q_i)²   (2)
      </p>
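<p>As an illustration, the scoring of Equations 1 and 2 can be sketched in Python as follows. This is a minimal sketch, not the Lemur implementation: the function name and the doc_freq mapping are our own assumptions, and qtf is simplified to a raw query count, omitting the query-length normalization.</p>

```python
import math

def tfidf_score(query_terms, doc_terms, n_docs, doc_freq, avg_len, k1=1.2, b=0.75):
    # Sketch of Equations 1-2: BM25-style tf weighting, squared idf.
    # doc_freq maps a term to the number of documents containing it.
    score = 0.0
    for q in set(query_terms):
        f = doc_terms.count(q)
        if f == 0:
            continue
        # Equation 1: length-normalized term frequency weight
        tf = k1 * f / (f + k1 * (1 - b + b * len(doc_terms) / avg_len))
        qtf = query_terms.count(q)          # simplified query tf
        idf = math.log(n_docs / max(1, doc_freq.get(q, 0)))
        # Equation 2: accumulate tf * qtf * idf^2
        score += tf * qtf * idf ** 2
    return score
```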
      <p>For the WikipediaMM 2010 task, we use the following data: the topics, the metadata collection, and the English DBpedia collection (version 3.5). All these collections were preprocessed for use in our work. For the topics, we selected the English title as the query; for the metadata collection, the text including the image name, description, comment, and caption was selected as the query to perform the document expansion, and all the tags were removed. The metadata was transformed into the query as follows:
1. removing punctuation from the metadata text;
2. removing URLs from the metadata text;
3. removing special HTML-encoded characters.</p>
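<p>The three cleanup steps above can be sketched as follows. This is a minimal sketch; the function name and the regular expressions are our own assumptions, and URLs are removed before punctuation so that they are not broken apart first.</p>

```python
import html
import re

def preprocess_metadata(text):
    # Step 2: remove URLs (done first so punctuation stripping
    # does not split them into spurious tokens)
    text = re.sub(r"https?://\S+", " ", text)
    # Step 3: decode special HTML-encoded characters (&amp;, &lt;, ...)
    text = html.unescape(text)
    # Step 1: remove punctuation
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace left behind by the removals
    return " ".join(text.split())
```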
      <p>The English DBpedia collection includes 2,787,499 documents, each corresponding to a brief abstract of a Wikipedia article. We selected 500 stop words by ranking the term frequencies in the English DBpedia collection, and removed all stop words before indexing it.
1 http://www.lemurproject.org/</p>
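<p>The frequency-based stop word selection described above can be sketched as follows. This is a hypothetical helper, not the implementation used in the experiments; tokenization here is naive whitespace splitting.</p>

```python
from collections import Counter

def top_stopwords(docs, n=500):
    # Rank terms by collection frequency and take the top n as stop words.
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return {term for term, _ in counts.most_common(n)}
```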
    </sec>
    <sec id="sec-3">
      <title>Document Expansion</title>
      <p>
        Our document expansion method is similar to a typical query expansion process. In the official runs, we used pseudo-relevance feedback (PRF) as our document expansion method with the Okapi feedback algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The Okapi feedback algorithm reformulates the query from two parts: the original query and feedback words from the assumed top relevant documents. In our implementation of the query expansion process, the weighting factors for original query terms and feedback terms are all set to 1 (α = 1, β = 1), a setting which has been applied successfully in previous document expansion work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For every metadata document, after preprocessing we use the remaining text as the query and retrieve the top 100 documents as the assumed relevant documents. We first remove all stop words from the returned top 100 documents, then select the top five words as the document expansion terms. The selected expansion terms are added to the metadata document and the index is rebuilt.
      </p>
      <p>In our official runs, we use Equation 3 to select the expansion terms from DBpedia. Here r(t_i) is the number of documents among the top 100 assumed relevant documents which contain term t_i, and idf is computed as in Equation 4.</p>
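<p>The term selection by Equation 3 (with idf as in Equation 4) can be sketched as follows. This is a hypothetical helper; doc_freq, mapping a term to its collection document frequency n(t), is our own assumption, and tokenization is naive whitespace splitting.</p>

```python
import math

def select_expansion_terms(query_doc_len, feedback_docs, total_docs, doc_freq):
    # r(t): number of feedback documents containing term t
    r = {}
    for doc in feedback_docs:
        for term in set(doc.lower().split()):
            r[term] = r.get(term, 0) + 1

    def idf(t):
        # Equation 4: idf(t) = log((N - n(t) + 0.5) / (n(t) + 0.5))
        n = doc_freq.get(t, 0)
        return math.log((total_docs - n + 0.5) / (n + 0.5))

    # Equation 3: S(t) = r(t) * idf(t); keep the top l_d terms,
    # where l_d is the length of the original query document
    ranked = sorted(r, key=lambda t: r[t] * idf(t), reverse=True)
    return ranked[:query_doc_len]
```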
      <p>
        S(t_i) = r(t_i) · idf(t_i)   (3)

For the number of feedback words, we select the top l_d words ranked using Equation 3, where l_d is the length of the original query document. This strategy follows the method successfully adopted in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Our best results come from the combination of document reduction, document expansion, and query expansion. Use of our document expansion techniques is designated as follows:
- DE: document expansion from an external resource
- DR: document reduction for the metadata documents
- QE: query expansion from the original metadata documents
      </p>
      <p>3.1 Document Reduction</p>
      <p>
In previous research on DE, usually all the words in the document are given the same weight as "query" terms to find relevant documents prior to expansion. Given an example document "blue flower shot by user", an obvious problem is easily identified: in this document the phrase "blue flower" is an accurate description of the image, while the noise words "shot by user", if left in the query, will not help us find good relevant documents. So our method first computes the importance of each term in a document. To do this, we compute the weight of each term as its significance using the Okapi BM25 function.</p>
      <p>For example, consider the document from the WikipediaMM collection shown in Fig. 1. After preprocessing, the document becomes "billcratty2 summary old publicity portrait of dancer choreographer bill cratty. photo by jack mitchell. licensing promotional". If we manually selected the important words from this document, we could form a new document: "old publicity portrait of dancer choreographer bill cratty". Using the reduced document as the query document is clearly better than using the original one for locating potentially useful DE terms. For automatic reduction of the document, we first compute the idf scores of all terms in the collection vocabulary, as defined in Equation 4:

idf(t_i) = log((N - n(t_i) + 0.5) / (n(t_i) + 0.5))   (4)

here t_i is the i-th term, N is the total number of documents in the collection, and n(t_i) is the number of documents which contain the term t_i. For every word t_i in document D, we can then compute its BM25 weight using Equation 5:

weight(t_i, D) = idf(t_i) · (f(t_i, D) · (k1 + 1)) / (f(t_i, D) + k1 · (1 - b + b · |D| / avgdl))   (5)

here f(t_i, D) is the frequency of word t_i in document D; k1 and b are parameters (k1 = 2.0, b = 0.75, starting parameters suggested by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]); |D| is the length of document D; and avgdl is the average length of documents in the collection. For the above example, the BM25 score of each term after removing the stopwords is shown in Table 1.</p>
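<p>The reduction procedure based on Equation 5 can be sketched as follows. This is a minimal sketch under our own assumptions: the function signature is hypothetical, idf is a precomputed mapping as in Equation 4, and avgdl is passed in rather than computed from the collection.</p>

```python
def reduce_document(doc, idf, k1=2.0, b=0.75, avgdl=10.0, rate=0.5):
    # Weight each term with BM25 (Equation 5), rank terms by weight,
    # and keep only the top rate fraction of the document length.
    terms = doc.lower().split()
    weights = {}
    for t in set(terms):
        f = terms.count(t)
        denom = f + k1 * (1 - b + b * len(terms) / avgdl)
        weights[t] = idf.get(t, 0.0) * f * (k1 + 1) / denom
    keep = max(1, int(rate * len(terms)))
    ranked = sorted(weights, key=weights.get, reverse=True)
    return " ".join(ranked[:keep])
```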
      <p>We propose to reduce documents by ranking their terms by BM25 score in decreasing order and removing all terms below a given cut-off value (given as a percentage here). If we choose 50% as the cut-off to reduce the document length, we get the new document "billcratty2 cratty choreographer dancer mitchell bill" for the above example. We call the cut-off value the document reduction rate, defined as follows: if the reduction rate is r%, we keep r% of the original length of the document, where the length of a document is its total number of terms. Using the reduced document as the query to obtain documents for expansion produces some differences in the top-ranked documents compared to the DE method without the DR process, and thus different feedback words are selected from the relevant documents.
Fig. 1. An example metadata document from the WikipediaMM collection:
&lt;?xml version="1.0"?&gt;
&lt;article&gt;
&lt;name id="23918"&gt;BillCratty2.jpg&lt;/name&gt;
&lt;text&gt;
&lt;h2&gt;Summary&lt;/h2&gt; Old publicity portrait of dancer
choreographer Bill Cratty. Photo by Jack Mitchell.
&lt;h2&gt;Licensing&lt;/h2&gt;
&lt;value&gt;Promotional&lt;/value&gt;
&lt;/text&gt;
&lt;/article&gt;
For our participation in this task we submitted three official runs, as shown in Table 2. Our best result comes from the combination of document reduction, document expansion, and query expansion. In our document reduction experiment, the document reduction rate is set to 50%. For the run dcuRunOkapi, the English metadata was expanded from the English DBpedia; for the run dcuRunOkapiAll, the index is built from the combination of the expanded English metadata and the original French and German metadata. These two runs produce the same retrieval results, since the French and German metadata do not affect the English query very much. For the run dcuRunOkapiFR, French topics were used to search the French metadata. The retrieval effectiveness was found to be relatively low compared to the English runs; the reason for this is the generally very significant lack of French metadata in the WikipediaMM image collection. To compare with our DE result, we also provide an English baseline run without document expansion, baselineEnRun. Compared with the baseline run, the DE run improves MAP by 12.96%; by a paired t-test, the two runs are significantly different (p &lt; 0.005). In Table 2, TL means topic language and AL means annotation language.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This paper presented and analyzed our system for the WikipediaMM task at CLEF 2010, focusing on document reduction and document expansion.</p>
      <p>Table 2. Summary of our runs and results.
Run             Modality  Methods    TL  AL        MAP              P@10
dcuRunOkapi     TXT       DR+DE+QE   EN  EN        0.2039 (+12.96%) 0.4271
dcuRunOkapiAll  TXT       DR+DE+QE   EN  EN+FR+DE  0.2039 (+12.96%) 0.4271
baselineEnRun   TXT       QE         EN  EN        0.1805           0.4200
dcuRunOkapiFR   TXT       QE         FR  FR        0.1192           0.3243</p>
      <p>In our past research, document expansion from external resources has been shown to be effective in the text-based image retrieval task. This year, document expansion combined with document reduction produces effective results in this task.</p>
      <p>Our main findings in this research are as follows. Document expansion can improve the retrieval performance for our text-based image retrieval task. This year, using the improved document expansion method with document reduction still gives us a good retrieval result in this task. From the overall results of this task, the combination of content-based image retrieval and text-based image retrieval methods performs better than either single method, and this will form one of our future research directions.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142)
as part of the Centre for Next Generation Localisation (CNGL) at Dublin City
University.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsikrika</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kludas</surname>
          </string-name>
          , J.:
          <article-title>Overview of the Wikipedia Retrieval task at ImageCLEF 2010</article-title>
          . In: Working Notes of CLEF 2010, Padova, Italy (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Min</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilkins</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leveling</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          : DCU at WikipediaMM 2009:
          <article-title>Document Expansion from Wikipedia Abstracts</article-title>
          .
          <source>In: Working Notes for the CLEF 2009 Workshop</source>
          , Corfu, Greece (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Sparck Jones,
          <string-name>
            <surname>K.</surname>
          </string-name>
          :
          <article-title>Simple, proven approaches to text retrieval</article-title>
          .
          <source>TR UCAM-CL-TR-356</source>
          , University of Cambridge, Computer Laboratory (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Document Expansion for Speech Retrieval</article-title>
          .
          <source>In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval</source>
          , Berkeley, California, USA (
          <year>1999</year>
          )
          <fpage>34</fpage>
          -
          <lpage>41</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>