<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DCU at WikipediaMM 2009: Document Expansion from Wikipedia Abstracts</article-title>
      </title-group>
      <contrib-group>
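        <contrib contrib-type="author">
          <string-name><given-names>Jinming</given-names> <surname>Min</surname></string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Peter</given-names> <surname>Wilkins</surname></string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Johannes</given-names> <surname>Leveling</surname></string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Gareth</given-names> <surname>Jones</surname></string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Next Generation Localisation, School of Computing, Dublin City University</institution>
          , Dublin 9,
          <country country="IE">Ireland</country>
        </aff>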
      </contrib-group>
      <abstract>
        <p>s collection DBpedia. Since the metadata is too short to be retrieved reliably by query words, we decided to expand the metadata using a typical query expansion method. In our experiments, we use the Rocchio algorithm for document expansion. Our best run ranks 26th out of 57 runs, which is below our expectation. We think the main reason is that our document expansion method uses all the words from the metadata documents, including words that are unrelated to the content of the images. Compared with our text retrieval baseline, our best document expansion run improves MAP by 11.17%. We conclude that document expansion can be an effective factor in the image metadata retrieval task. Our content-based image retrieval uses the same approach as in our participation in ImageCLEF 2008.</p>
      </abstract>
      <kwd-group>
        <kwd>Query formulation</kwd>
        <kwd>Relevance feedback</kwd>
        <kwd>Document Expansion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This is DCU's first participation in the WikipediaMM task of CLEF. The task aims to find relevant
images based on a query consisting of text and an image. In the image collection, every image is associated with
a metadata file, which usually contains the description, copyright, author, camera parameters,
date and location of the image; for the topics, the data consists of the image and the query
text. The task can therefore be performed by text retrieval, by content-based image retrieval, or by a
combination of the two. Our main research efforts are on text retrieval. Since the
useful information in the metadata is usually very short, this text retrieval task differs from
ad-hoc retrieval of news or articles, and we call it short-length document retrieval.</p>
      <p>
        Since the metadata files contain only a few words describing the content of the images,
we decided to expand the metadata documents from an external resource, the Wikipedia abstracts
collection DBpedia<sup>1</sup>. Document expansion is quite similar to query expansion in information
retrieval research, and we use the Rocchio algorithm as our document expansion method [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
We also tested combining query expansion from the external resource with document
expansion from the external resource, but the result is not as effective as document expansion alone
in this task. With proper expansion from the Wikipedia abstracts corpus, our text retrieval
experiment improves MAP by 11.17% compared with the baseline system.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Text Retrieval System Description</title>
      <p>In our approach, we use the Lemur toolkit<sup>2</sup> as the retrieval system. We tested the different
retrieval models available in the Lemur toolkit and found that the tf-idf model performs well in this task.
Our formal runs employ Lemur's tf-idf method as the retrieval model.</p>
      <p>We parse the metadata to be used for indexing and document expansion. For the text part of
the topics, we directly extract all words to form a query. The system overview is shown in Figure
1.</p>
      <p>Queries may or may not be expanded from external resources, and the two choices
lead to different results. A metadata document is expanded only when its length L is below
a threshold (L &lt; 200). We found that long metadata documents contain more
noise, which makes it harder to obtain useful relevance feedback terms.</p>
      <p>In the following parts, we describe the preprocessing, the retrieval model used in the task,
and the document expansion algorithm.</p>
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>For WikipediaMM 2009, we use the following data: the topics, the metadata collection and
DBpedia. All of these collections are preprocessed before use. For the topics, we select the
title part as the query; for the metadata collection, the text is selected as the query to perform
the document expansion. An example is shown in Figure 2.</p>
        <p><sup>1</sup> http://dbpedia.org</p>
        <p><sup>2</sup> http://www.lemurproject.org/</p>
        <sec id="sec-2-1-1">
          <title>1. removing useless punctuation in metadata;</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>2. removing special HTML encoded characters;</title>
          <p>For every metadata document, after parsing we set the remaining text as the query.</p>
          <p>For DBpedia, we index the text and remove stop words computed from the collection itself.
An example document from DBpedia is:</p>
          <preformat>&lt;DOC&gt;
&lt;DOCNO&gt;1969 Paris Open&lt;/DOCNO&gt;
&lt;TEXT&gt;The 1969 Paris Open was a professional tennis tournament played on
indoor carpet courts. It was the 2nd edition of the Paris Open (later
known as the Paris Masters).
&lt;/TEXT&gt;
&lt;/DOC&gt;</preformat>
          <p>The English DBpedia includes 2,787,499 documents, each corresponding to a brief form of a Wikipedia
article. We compute 500 stop words from DBpedia and remove them before indexing
the collection.</p>
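          <p>The stop word computation can be sketched as follows. This is a minimal illustration rather than the code used for our runs: the tokenizer and the names <monospace>tokenize</monospace> and <monospace>build_stop_list</monospace> are assumptions, and the three-sentence collection merely stands in for DBpedia.</p>
          <preformat>
```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_stop_list(documents, size=500):
    """Return the `size` most frequent terms of the collection as a set."""
    counts = Counter()
    for text in documents:
        counts.update(tokenize(text))
    return {term for term, _ in counts.most_common(size)}


# Toy stand-in for the DBpedia abstracts (assumption, for illustration only).
docs = [
    "The 1969 Paris Open was a professional tennis tournament.",
    "It was the 2nd edition of the Paris Open.",
    "The tournament was played on indoor carpet courts.",
]
stop_words = build_stop_list(docs, size=2)
print(sorted(stop_words))
```
</preformat>
          <p>On the real collection the same function would be called with <monospace>size=500</monospace>, the value stated in the text.</p>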
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Text Retrieval Model</title>
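        <p>After testing different information retrieval models on the text-based image retrieval task, we
found that the tf-idf model outperforms state-of-the-art models such as Okapi BM25 or language
modeling in the Lemur toolkit. We therefore choose the tf-idf model as our baseline retrieval model in this
task. The document term frequency (tf) weight we use in the tf-idf model is:
          <disp-formula>
            <label>(1)</label>
            <tex-math>\mathrm{tf}(q_i, D) = \frac{k_1 \, f(q_i, D)}{f(q_i, D) + k_1 \left(1 - b + b \, \frac{l_d}{l_c}\right)}</tex-math>
          </disp-formula>
where <italic>f</italic>(<italic>q<sub>i</sub></italic>, <italic>D</italic>) is the frequency of query term <italic>q<sub>i</sub></italic> in document <italic>D</italic>, <italic>l<sub>d</sub></italic> is the length of document <italic>D</italic>, <italic>l<sub>c</sub></italic> is
the average document length of the collection, and <italic>k</italic><sub>1</sub> and <italic>b</italic> are parameters set to 1.2 and 0.75
respectively. The idf of a term <italic>t</italic> is given by log(<italic>N</italic>/<italic>n<sub>t</sub></italic>), where <italic>N</italic> is the number of documents in the collection
and <italic>n<sub>t</sub></italic> is the number of documents containing term <italic>t</italic>.</p>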
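        <p>As a worked example of Equation (1), consider a term that occurs twice in a document of exactly average length. The frequency and the length ratio are illustrative assumptions, not values from our data; only the parameter values 1.2 and 0.75 come from the text.
          <disp-formula>
            <tex-math>\mathrm{tf} = \frac{1.2 \times 2}{2 + 1.2\,(1 - 0.75 + 0.75 \times 1)} = \frac{2.4}{3.2} = 0.75</tex-math>
          </disp-formula>
As the term frequency grows, the weight approaches the parameter value 1.2 and never exceeds it, so repeated occurrences of a term contribute less and less.</p>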
        <p>The query tf function (qtf) is defined similarly, with a parameter representing the average query
length. The score of document <italic>D</italic> against query <italic>Q</italic> is given by:
          <disp-formula>
            <label>(2)</label>
            <tex-math>s(D, Q) = \sum_{i=1}^{n} \mathrm{tf}(q_i, D) \cdot \mathrm{qtf}(q_i, Q) \cdot \mathrm{idf}(q_i)^2</tex-math>
          </disp-formula>
Here qtf is the tf weight of a term in the query, computed in the same way as the tf in
documents.</p>
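        <p>The scoring function can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Lemur's implementation: the function names, the use of the natural logarithm for the idf, and the toy collection statistics are all hypothetical.</p>
        <preformat>
```python
import math

K1, B = 1.2, 0.75  # parameter values given in the text


def tf_weight(freq, length, avg_length, k1=K1, b=B):
    """Length-normalised term frequency weight (Equation 1)."""
    if freq == 0:
        return 0.0
    return k1 * freq / (freq + k1 * (1 - b + b * length / avg_length))


def idf(num_docs, doc_freq):
    """log(N / n_t); the natural logarithm is an assumption."""
    return math.log(num_docs / doc_freq)


def score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, avg_query_len):
    """Sum of tf * qtf * idf^2 over the distinct query terms (Equation 2)."""
    total = 0.0
    for term in set(query_terms):
        n_t = doc_freqs.get(term, 0)
        f_d = doc_terms.count(term)
        if n_t == 0 or f_d == 0:
            continue  # the term contributes nothing to this document
        tf = tf_weight(f_d, len(doc_terms), avg_doc_len)
        qtf = tf_weight(query_terms.count(term), len(query_terms), avg_query_len)
        total += tf * qtf * idf(num_docs, n_t) ** 2
    return total


# Hypothetical collection statistics, for illustration only.
doc = ["paris", "open", "tennis", "paris"]
query = ["paris", "tennis"]
doc_freqs = {"paris": 10, "tennis": 100}
print(score(query, doc, doc_freqs, num_docs=1000, avg_doc_len=4, avg_query_len=2))
```
</preformat>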
      </sec>
      <sec id="sec-2-3">
        <title>Document Expansion</title>
        <p>
          Our document expansion method is similar to a typical query expansion process. We use
pseudo-relevance feedback with Rocchio's algorithm [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] as our document expansion method. The
Rocchio algorithm reformulates the query from three parts: the original query, the feedback terms
from the assumed top relevant documents, and the negative feedback terms from the assumed
non-relevant documents. In the experiments described here, we do not use negative feedback. In our
implementation of Rocchio's algorithm, the factors for original query terms and feedback terms
are both set to 1 (α = 1, β = 1). We choose DBpedia as the external resource for document
expansion, for two reasons:
          <list list-type="order">
            <list-item><p>the DBpedia dataset contains only the definition sentences of Wikipedia terms, which contain less noise than full articles;</p></list-item>
            <list-item><p>the DBpedia documents are also derived from Wikipedia and therefore share some characteristics with our image metadata documents from Wikipedia.</p></list-item>
          </list>
        </p>
        <p>For every metadata document, after preprocessing we use the remaining text as the query. We
retrieve the top 100 DBpedia documents as the assumed relevant documents. From the words in these
100 documents we first remove all stop words. The stop word list is produced
from the DBpedia document collection: we compute term frequencies over the
collection and take the 500 most frequent words as stop words. We then compute a word frequency list
over the top 100 documents, excluding the stop words and the words of the original
query, and select the top five words as the document expansion words.</p>
        <p>The number of assumed relevant documents used for document expansion is higher than usual because the
Wikipedia abstract corpus mostly consists of very short documents. If we used only the top 10 or 20 documents as the
assumed relevant documents, it would be very hard to obtain useful feedback terms from
them. Furthermore, because the original metadata documents are short, we select only
the top 5 terms as feedback terms. The expansion terms are then added to the metadata
document and the index is rebuilt.</p>
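        <p>The term selection step can be sketched as follows. This is a simplified illustration, not the code used for our runs: it ranks candidate terms by raw frequency in the feedback set, as described in the text, and does not reproduce the weighting of any particular Rocchio implementation. The function names, the toy data, and the interpretation of the length threshold as a number of terms are assumptions.</p>
        <preformat>
```python
from collections import Counter


def expansion_terms(doc_terms, feedback_docs, stop_words, num_terms=5):
    """Most frequent feedback terms that are neither stop words
    nor already present in the document being expanded."""
    original = set(doc_terms)
    counts = Counter(
        term
        for feedback in feedback_docs
        for term in feedback
        if term not in stop_words and term not in original
    )
    return [term for term, _ in counts.most_common(num_terms)]


def expand_document(doc_terms, feedback_docs, stop_words, max_length=200, num_terms=5):
    """Append expansion terms only to documents shorter than `max_length` terms."""
    if len(doc_terms) >= max_length:
        return list(doc_terms)
    extra = expansion_terms(doc_terms, feedback_docs, stop_words, num_terms)
    return list(doc_terms) + extra


# Toy data standing in for a metadata document and its top-ranked abstracts.
metadata = ["paris", "open", "1969"]
feedback = [
    ["paris", "tennis", "tournament", "the"],
    ["tennis", "masters", "the"],
    ["tennis", "tournament"],
]
expanded = expand_document(metadata, feedback, stop_words={"the"}, num_terms=2)
print(expanded)
```
</preformat>
        <p>In the setting described in the text, <monospace>feedback</monospace> would hold the top 100 retrieved DBpedia documents and <monospace>num_terms</monospace> would be 5.</p>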
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Content-Based Image Retrieval</title>
      <p>For content-based image retrieval we make use of the following six global visual features defined
in the MPEG-7 specification:</p>
      <list list-type="bullet">
        <list-item><p><bold>Scalable Colour (SC)</bold>: derived from a colour histogram defined in the HSV colour space.
It uses a Haar transform coefficient encoding, allowing scalable representation.</p></list-item>
        <list-item><p><bold>Colour Structure (CS)</bold>: based on colour histograms, this feature represents an image by
both the colour distribution (similar to a colour histogram) and the local spatial structure of
the colour.</p></list-item>
        <list-item><p><bold>Colour Layout (CL)</bold>: a compact descriptor which captures the spatial layout of the
representative colours on a grid superimposed on an image.</p></list-item>
        <list-item><p><bold>Colour Moments (CM)</bold>: similar to Colour Layout, this descriptor divides an image into
4x4 subimages, and for each subimage the mean and the variance of each LUV colour space
component are computed.</p></list-item>
        <list-item><p><bold>Edge Histogram (EH)</bold>: represents the spatial distribution of edges in an image. Edges are
categorized into five types: vertical, horizontal, 45 degrees diagonal, 135 degrees diagonal
and non-directional.</p></list-item>
        <list-item><p><bold>Homogeneous Texture (HT)</bold>: provides a quantitative representation using 62 numbers,
consisting of the mean energy and the energy deviation from a set of frequency channels.</p></list-item>
      </list>
      <p>
        Our approach to visual querying is the same as the one we used in ImageCLEF 2008. For
a visual query, we take the topic images and extract from each its representation
under each of the six features above. For each feature we query its associated retrieval expert (i.e.
visual index and ranking function) to produce a ranked list. The ranking metric for each feature
is as specified by MPEG-7 and is typically a variation on Euclidean distance. For our experiments
we kept the top 1000 results. Each ranked list was then weighted, and the results from all ranked
lists were normalized using MinMax [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and then linearly combined using CombSUM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
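      <p>The fusion of the per-feature ranked lists can be sketched as follows. This is a minimal illustration, not our system: it assumes each expert returns similarity scores where higher is better (distances would first have to be inverted), it normalizes before weighting, and the function names, scores and equal weights are hypothetical. The query-time weighting scheme itself is not reproduced here.</p>
      <preformat>
```python
def min_max(scores):
    """Scale a {doc_id: score} mapping linearly onto [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def weighted_comb_sum(ranked_lists, weights):
    """Sum the weighted, normalised scores of each document across experts."""
    fused = {}
    for name, scores in ranked_lists.items():
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights[name] * s
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from two of the six feature experts.
lists = {
    "EH": {"a": 4.0, "b": 2.0, "c": 0.0},
    "CL": {"a": 1.0, "b": 3.0},
}
weights = {"EH": 0.5, "CL": 0.5}
ranking = weighted_comb_sum(lists, weights)
print(ranking)
```
</preformat>
      <p>A document missing from one expert's list simply receives no contribution from that expert, which is the usual CombSUM behaviour.</p>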
      <p>
        The weighting we employed was linear, with the weights determined
at query time. This approach is the same as the one used in our previous ImageCLEF experiments and
is explained in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>We submitted five runs to the WikipediaMM task: four from the text retrieval system
and one from the image retrieval system. The main techniques used in text retrieval are query
expansion (QE), query expansion from an external resource (QEE), and document expansion from an external
resource (DEE). Our four text runs combine these techniques, except for the baseline run, which uses only
the tf-idf IR model without additions. The results for the English monolingual WikipediaMM 2009
task are shown in Table 1.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Run</th><th>Modality</th><th>Methods</th><th>MAP</th><th>P@10</th></tr>
          </thead>
          <tbody>
            <tr><td>dcutfidf-baseline</td><td>TXT</td><td>BASELINE</td><td>0.1576</td><td>0.2600</td></tr>
            <tr><td>dcutfidf-dbpedia-qe</td><td>TXT</td><td>DEE</td><td>0.1685</td><td>0.2600</td></tr>
            <tr><td>dcutfidf-dbpediametadata-dbpediaqe</td><td>TXT</td><td>QEE+DEE+QE</td><td>0.1641</td><td>0.2378</td></tr>
            <tr><td>dcutfidf-dbpediametadata-qe</td><td>TXT</td><td>DEE+QE</td><td>0.1752</td><td>0.2578</td></tr>
            <tr><td>dcuimg</td><td>IMG</td><td>BASELINE</td><td>0.0079</td><td>0.0244</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Our best result ranks in the middle of all the official runs in the WikipediaMM 2009 task.
Compared with our baseline result, document expansion plays an important role in our best
result. Document expansion improves MAP from 0.1576 to 0.1685, but query expansion
from the external resource in combination with our methods does not show much improvement. Our
image run performs poorly because of a computation error, which will be fixed in future research.</p>
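      <p>The relative improvements can be checked from the MAP values reported for our runs (0.1576 for the baseline, 0.1685 for DEE and 0.1752 for DEE+QE):
        <disp-formula>
          <tex-math>\frac{0.1752 - 0.1576}{0.1576} = \frac{0.0176}{0.1576} \approx 0.1117, \qquad \frac{0.1685 - 0.1576}{0.1576} = \frac{0.0109}{0.1576} \approx 0.0692</tex-math>
        </disp-formula>
That is, the best run improves MAP by about 11.17% over the baseline, while document expansion on its own gives an improvement of about 6.92%.</p>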
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        We presented our system for the WikipediaMM task of CLEF 2009, focusing on document
expansion. Document expansion has not been thoroughly researched for information retrieval. From
past research, it is not obvious whether document expansion can improve retrieval effectiveness, or how to
achieve such an improvement [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Our results show that document expansion can play a role
in the image metadata retrieval task. However, the documents often contain information
unrelated to the content of the picture, such as copyright and author information. When this information
is used in document expansion, it greatly harms the expansion results. In future experiments
we will try to remove such noise from the documents and use only the words related to the content
of the image as the query for document expansion.
      </p>
      <p>From this task, our main finding is that document expansion can considerably improve retrieval
effectiveness when the documents are short. On the other hand, query expansion from
the external resource does not improve performance, since the query text is usually very accurate
and does not need to be expanded with more words.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the
Centre for Next Generation Localisation (CNGL) project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.J.</given-names>
            <surname>Rocchio</surname>
          </string-name>
          .
          <article-title>Relevance feedback in information retrieval</article-title>
          . In Gerard Salton, editor,
          <source>The SMART Retrieval System-Experiments in Automatic Document Processing</source>
          , pages
          <fpage>313</fpage>
          –
          <lpage>323</lpage>
          , Englewood Cliffs, NJ, USA,
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Edward A.</given-names>
            <surname>Fox</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joseph A.</given-names>
            <surname>Shaw</surname>
          </string-name>
          .
          <article-title>Combination of Multiple Searches</article-title>
          .
          <source>In Proceedings of the Third Text REtrieval Conference (TREC-1994)</source>
          , pages
          <fpage>243</fpage>
          –
          <lpage>252</lpage>
          , Gaithersburg, MD, USA,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Anni</given-names>
            <surname>Jarvelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Wilkins</surname>
          </string-name>
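          ,
          <string-name>
            <given-names>Tomasz</given-names>
            <surname>Adamek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eija</given-names>
            <surname>Airio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gareth</given-names>
            <surname>Jones</surname>
          </string-name>
          ,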
          <string-name>
            <given-names>Alan F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Eero</given-names>
            <surname>Sormunen</surname>
          </string-name>
          .
          <article-title>DCU and UTA at ImageCLEFphoto 2007</article-title>
          .
          <source>In ImageCLEF 2007 - The CLEF Cross Language Image Retrieval Track Workshop</source>
          , Budapest, Hungary,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Amit</given-names>
            <surname>Singhal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <article-title>Document Expansion for Speech Retrieval</article-title>
          .
          <source>In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>34</fpage>
          –
          <lpage>41</lpage>
          , Berkeley, California, USA,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Bodo</given-names>
            <surname>Billerbeck</surname>
          </string-name>
          and
          <string-name>
            <given-names>Justin</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>Document expansion versus query expansion for ad-hoc retrieval</article-title>
          .
          <source>In The Tenth Australasian Document Computing Symposium</source>
          , pages
          <fpage>34</fpage>
          –
          <lpage>41</lpage>
          , Sydney, Australia, December
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>