<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Reranking for Image Retrieval over the Wikipedia Corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>route du panorama</string-name>
          <email>P@10</email>
          <email>P@20</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>FONTENAY AUX ROSES</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>France.</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2051</year>
      </pub-date>
      <abstract>
        <p>This paper describes the approach we developed for the WikipediaMM task on 2009 [4], which builds on our last year contribution. The main novelties are the re nement of textual query expansion procedure and the introduction of a k-NN based visual reranking procedure. Our main purpose was to test whether combining textual and content based retrieval improves over purely textual search and the results we report here con rm that combining modalities results in a signi cant improvement of results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The wikipediaMM task consists of retrieving images from a large-scale collection of heterogenous
images [
        <xref ref-type="bibr" rid="ref4 ref5">5, 4</xref>
        ]. We propose a fully automatic approach which rst retrieves images for a given
query using textual query expansion and then reranks the results of the textual run using visual
representations of the queries. Query expansion exploits the categorical structure of Wikipedia
in order to nd and rank relevant concepts from the collaborative encyclopedia. Terms in the
query are compared to Wikipedia categories and articles which matched a maximum number of
such terms are ranked best. We build visual models for each topic by retrieving Web images
which match that topic from Google Image and Yahoo! Image and reranking them using a k-NN
approach and an negative set of images which contains diversi ed images. Each Web image is
compared to all other images retrieved with the same query and to the negative set and is well
ranked if it is visually distant from the external class. Since Web images are usually noisy, we
retain only the best ranked images for creating the visual model. Then, each image retrieved
from the Wikipedia corpus is compared to the visual model and to the negative set in order to
favor those which are close to the visual model. We index images using both local (bags of visual
words) and global (texture-color) descriptors and report results for both types of indexers as well
as for their fusion. There are two main common points to these steps. The rst is that we use
an intermediate conceptual level as a reference to describe the images. The second is that in both
cases, we use external sources (from the web) to build these conceptual level.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Automatic building of conceptual structure</title>
      <p>
        Last year, we introduced a Wikipedia based query expansion procedure which takes advantage of
the categorical structure of the encyclopedia[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and we obtained encouraging results. This year
we re ne our approach by modifying the way query related concepts are ranked.
2.1
      </p>
      <sec id="sec-2-1">
        <title>Data sources</title>
        <p>The main resource we exploited was Wikipedia. Dumps of the encyclopaedia are regularly provided
for a free use. We downloaded the April 2009 English dump, which contains over 2.6 million articles
and is provided as a single le, in XML format. Next, we split the dump into individual articles
in order to process the information faster. The information in Wikipedia spans over a large
number of conceptual domains, with a high number of articles describing known people, places,
entertainment, organisations animals and plants. Each article is placed in at least one category, a
property that facilitates the extraction of IsA relations from the encyclopedia.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Conceptual neighbourhood building</title>
        <p>Wikipedia images are accompanied by brief textual descriptions and query expansion is an
appealing way to improve recall and, if performed in a judicious way, to also improve results precision.
Topics are preprocessed in order to eliminate stop words and eliminate visual terms from a closed
list (including image(s), photograph(s) etc.). Then, we lemmatize remaining terms and arrange
them using their term frequency in Wikipedia in order to favor rare terms (which are more likely
to be discriminant than frequent words). Then, we compare terms in the query to Wikipedia
categories and retain all articles that match at least on term in the query. A limit of 5000 articles
is imposed to speed up the processing. For instance, a query with orthodox icons with Jesus will
have related concepts such as Christ the Redeemer and Theotokos but also Our Lady of Kazan or
Manhattan.</p>
        <p>It it obvious that these concepts are not equally relevant to the query and it is necessary to
rank them so as to put the most pertinent rst. Whereas Christ the Redeemer and Theotokos
match all terms in orthodox icons with Jesus, Our Lady of Kazan matches only orthodox icons
and Manhattan is found because orthodox appears in one of its categories (United States Places
with Orthodox Jewish communities ). We rank concepts by counting the number of terms from
the initial query which appear in each article's categories and by favoring related concepts which
match rare terms in the query. At this point, there are usually several concepts with the same
score and in order to di erentiate them we re ne the ranking by answering the following questions:
is the concept ambiguous?
do all terms in the initial query appear in the rst paragraph of the Wikipedia article?
do all terms in the initial query appear in the article's text?
The re ned list of related concepts will favor unambiguous elements and concepts which contain
all terms in the query either in the rst paragraph of the associated article (which is often a
de nition) or in the remaining text.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Textual retrieval</title>
      <p>Relevant images are found by launching queries with the initial terms and the expanded queries
in the following order:
all terms in the initial queries and related concepts
the initial query
a part of the initial queries and related concepts
related concepts or a part of the initial query
The intuition behind this type of querying is that an image described by many terms (from the
initial query and the related concepts) is more likely to be relevant than another image described
by fewer terms. Weights are applied to the terms in the initial query and to related concepts and
individual image scores are calculated by adding these scores.</p>
      <p>Let Rt be the list of images returned by the textual matching procedure.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Visual reranking</title>
      <p>The basic idea of our content based reranking procedure is that an image which is visual close to
the visual model of a query is more likely to be a good answer than another image which is less
similar to the visual model. To evaluate the similarities between the Rt images and a topic, we
need to obtain a low-level description of that particular topic. We create a visual model (a positive
set of images Rpos which depicts the concept) using Web images downloaded from Google Image
and Yahoo! Image. A negative set Rneg containing diversi ed images is manually constructed
and used as an outlier for all topics in order to discard images which are not visually close to
the topic's visual model. Then we use the visual coherence to evaluate the relevance of both sets
(Rall = Rpos [ Rneg) and rerank them. Eventually, we merge this new ordered list Rt using
window merging or blocks merging.
4.1</p>
      <sec id="sec-4-1">
        <title>Visual coherence</title>
        <p>The visual coherence of an image is a metric mesuring the similarities beetween an image and a
concept. This metric is computed using two sets of images : a positive set Rpos containing Npos
various relevant images of the concept and a negative set Rneg of Nneg non relevant images. These
two sets compose the visual prototype of the concept. We compute the visual coherence score as
a couple of scores :
False neighbours: For an image, we search for its Nneigh closest neighbours in Rpos as well as
its Nneigh closest neighbors in Rneg. The visual similarity is calculated using a low-level
descriptor (global or local) and the euclidean distance. The two lists of Nneigh neighbours
are then merged and the rst part of the VC score is de ned as the number of neighbours
which belong to the negative set among the rst Nneigh images of this 2 Nneigh size list. In
an ideal case, an image which perfectly represents the concepts will not have any pictures of
Rneg among its Nneigh closest neighbours. Inversely, the more negative elements among the
rst Nneigh neighbours are found, the less the image is a good representative of the concept.
Distances: The second score is the sum of the distances of their Nsum closest neighbours in Rpos.</p>
        <p>A small value of this sum implies that the image is visually similar to the concept in the
descriptor's space. The distances computation is based on the same descriptor as in the
previous step.</p>
        <sec id="sec-4-1-1">
          <title>This algorithm is strongly dependant to the descriptor and the metric.</title>
          <p>To rank a set of images according to their likeness to a concept, we sort them using only the
rst score (which is a number between 0 and Nneigh). For two images having the same number of
neighbors, we use the second score of the visual coherence to re ne the ranking.
4.2</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Visual prototype conception</title>
        <p>We hypothesize that a visual prototype of a topic can be extracted by querying a web-scale image
search engine with that particular topic and by reranking the answers in order to reduce noise.
Nq images. We also download a set of Nneg various images that we index to build the negative set
Rneg. The negative set contains diversi ed images which depict a large number of concepts such
as mountain, dog, car, football or protest. Several content-based descriptors are then computed
from the raw set of images. For the visual prototype to be e ective, the retained images need to
be as accurate as possible and it is necessary to lter out noisy results returned by the Web search
engine. For this, we compute the VC on the raw set: for each image, the Nq 1 other images of
the set are temporarily considered as Rpos and used with Rneg to compute its VC score.
We keep the top Npos results only, that are considered good enough to represent the concept and
be part of its visual prototype.</p>
        <p>Since the visual coherence computation depends on the features, a visual prototype is thus also
de ned for each descriptor. It is nally composed of an ordered list of Npos positive images and a
set of Nneg negative images relatively to the concept.
4.3</p>
      </sec>
      <sec id="sec-4-3">
        <title>Textual results reranking</title>
        <p>Let Rt be the list of images returned by the textual matching procedure. It can now be reordered
using the visual prototypes using the same k-NN method used to build the prototype itself. In
practice, we compute the signature of each image of Rt and its distance to all the images of the
visual prototype (both to the Rpos and the Rneg set) and compute their visual coherence. Since
the visual prototype covers di erent aspects of the topic, we expect that a relevant result from the
textual run to be related to some of the images composing the visual prototype. Since the visual
prototype is dependant on the descriptor, we run experiments using:
Descriptor Reranking One descriptor is chosen. We use the visual prototype created with this
descriptor to compute the visual coherence scores of the Rt list. Then, the list is simply
ranked according to the VC score.</p>
        <p>Late-fusion Reranking The picture is reordered according to the sum of its ranks into the lists
coming from several Descriptor Rerankings (i.e. using several descriptors) (cf. gure 1).
Early-fusion Reranking We fuse the visual prototypes which depict the same concept but use
two di erent descriptors by crediting each image with the sum of its rank. Then, the new
visual prototype is used to build each Descriptor Reranking. The Descriptors Rerankings
are merged using the Late-fusion Reranking approach.
Textual results have weights which express their closeness to the topic. We describe two methods
for merging textual and visual information.
4.4.1</p>
        <p>Window merging
In this process, we consider two rankings : Rt and a content-based ranking (Descriptor Reranking,
Late-fusion Reranking, Early-fusion Reranking). A window of Swindow images slides on the textual
order:
1. All the elements in the window are reranked according to the content based order.
2. The rst one is selected as a new member of the window merging order.</p>
        <sec id="sec-4-3-1">
          <title>3. Next, it is removed from the window.</title>
          <p>4. Then, we add the next element of Rt in the window.</p>
          <p>We repeat the process until all the elements of Rt are ranked (cf. gure2).
4.4.2</p>
          <p>Blocks merging
The Rt ranking is divided into blocks and each block is reranked accordind to the same
contentbased order. A block is composed of images which are similarly related to the topic. For instance
described by all terms in the initial query and a related concept or described only by all the
terms is the initial query. Given a query with orthodox icons with Jesus, two images which are
annotated with all the terms in the query and with Christ the Redeemer, respectively Theotokos
are in the same block, a third image annotated with orthodox icons with Jesus is in a second
block and another image tagged with Manhattan (concept which is loosely related to the query
via orthodox ) is in a third block. This is a kind of fusion since the order of the blocks keeps the
Rt ranking but, within each block, we use the content-based ranking (cf. gure 3).
5</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental validation</title>
      <p>
        Our method has been evaluated in the WikipediaMM task of the ImageCLEF 2009 campaign [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
In these experiments, we consider each query as a concept.
For the visual prototypes creation, we query each query's heading using both Google and Yahoo!
search engines to bene t from the di erence between the index and improve our prototype variety.
In our case, Nq = 100: only the top 50 pictures of each search engine results are selected to
illustrate the concept. After processing the raw set, the new positive set contains Npos = 50
images.
      </p>
      <p>For practical reasons, all the concepts share the same Rneg negative set even if it should be
ideally rede ned for each query. Moreover, it is not a trivial task to build a coherent noisy class
without relation with the concept, thus, in practice, this set contains concepts we consider as
generic. We xed Nneg to 300.</p>
      <p>
        To compute the visual coherence score, Nneigh and Nsum are both xed to 10 : the diversity
of a concept can make two pictures semantically relevant but very far from each other according
to a descriptor. The descriptors used are the color and texture based Local Edge Pattern (LEP,
derived from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) and classical bags of features [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Hence, for each query, we obtain three visual
prototypes : LEP visual prototype, Bag of feature visual prototype and the Early-fusion visual
prototype which results from the mix of the two previous. Finally we get four content-based
rerankings : the texture LEP reranking, the Bag of features reranking, the Late-fusion reranking
and the Early-fusion reranking.
      </p>
      <p>For the Window merging method, we use a window of Swindow = 10 elements. For the Blocks
merging method, we divide Rt according to textual score.
5.2</p>
      <sec id="sec-5-1">
        <title>Results</title>
        <p>The results of our methods are reported in table 1. Note that we highlight some mistakes in the
generation of the runs using the window merging approach bringing about di erences with the
results providing with the o cial runs (see footnotes). We report here the results of the correct
runs, these runs do not outperform our best submissions but are still better than text only baseline.</p>
        <sec id="sec-5-1-1">
          <title>Reranking procedure</title>
          <p>Blocks merging(Rt textual ranking, Late-fusion reranking)
Blocks merging(Rt textual ranking, Early-fusion reranking)
Blocks merging(Rt textual ranking, Late fusion reranking)
1000 images maximum/concept
Blocks merging(Rt textual ranking, Bag of features reranking)
Blocks merging(Rt textual ranking, Texture LEP reranking)
Blocks merging(Rt textual ranking, Bag of features reranking)
1000 images maximum/concept
Blocks merging(Rt textual ranking, Texture LEP reranking)
1000 images maximum/concept
Window merging(Rt textual ranking, Early-fusion reranking)
Window merging(Rt textual ranking,Late-fusion reranking)
Window merging(Rt textual ranking, Texture LEP reranking)
Window merging(Rt textual ranking, Bag of features reranking)</p>
          <p>The use of the content-based reranking (i.e the second step) always improves the results
compared to the text-based baseline. However, we noticed during our preliminary
experiments that the results can decrease when the reranking is global (i.e without window or
block merging approaches). We expected such results because we retain up to 1000 results
for each topic and the test collection contains around 150000 images. Consequently, a large
part of the 1000 answers are not relevant and it is useful to exploit the textual ranking.
The block merging is more e cient than the sliding window approach. The di erence
between block and window is signi cant (around 3 points in MAP). This proves that splitting
the textual results according to the relatedness of each image to the query makes sense and
Bag-of-feature (local descriptors) give slightly better results than texture LEP descriptor
(global features). When looking at the P@10 scores, we notice that the bag of features runs
return the best results.</p>
          <p>Fusion reranking signi cantly improves MAP results (around 1 point) in comparison to
simple descriptor reranking. This nding con rms that when dealing with diversi ed topics,
it is interesting to merge local and global descriptors in order to obtain better results.
The late-fusion reranking gives better results than the early-fusion one but the di erence is
not signi cant.
6</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and future work</title>
      <p>We presented methods for expanding textual queries and merging textual and visual information
that are adapted for retrieving images in large datasets. Emphasis was put on designing
techniques which can easily be scaled-up to larger image repositories. Textual results were somehow
1o cial result with the wrongly generated run was map=0.1182 P@10=0.2267 P@20=0.1889
2o cial result with the wrongly generated run was map=0.1178 P@10=0.2267 P@20=0.1900
3o cial result with the wrongly generated run was map=0.0944 P@10=0.1889 P@20=0.1544
4o cial result with the wrongly generated run was map=0.0921 P@10=0.1800 P@20=0.1633
disappointing and we are currently investigating why this happened and how they can be
improved. One interesting work direction is to replace the re nement part of the concept ranking
with techniques such as explicit semantic analysis [8] and it to the current performances of the
system. We will also test the e ects of the query expansion on larger datasets (such as the Web
corpus) in order to compare it to standard search engine results. We are con dent that results
are likely to improve in terms of precision and diversity because the related concepts cover various
aspects of a topic.</p>
      <p>A second line of work concerns the visual processing in our system. We designed a k-NN based
method for image reranking which is fast enough to be introduced in a search engine's pipeline
given that the images are preindexed. We will explore the e ects of introducing other content
descriptors. Visual coherence was used at an image level but it is easy to compute a similar metric
for topics and try to use it in order to determine automatically whether the visual reranking will
be e cient (intuitively, it will work well for visually coherent queries). Visual coherence at a topic
level can also be exploited in order to determine which descriptor to use for the reranking (it is
probable for the best results to be obtained for the descriptor which provides the best separation
between the positive and the negative sets).</p>
      <p>Finally, it is interesting to explore how to adapt the size of the window (window merging) or
the limits of the blocks (Block merging) to each list returned for each topic.
7</p>
    </sec>
    <sec id="sec-7">
      <title>Aknowledgement</title>
      <p>We thank the Direction Generale des Entreprises for funding us through the regional business
cluster Systematic (project POPS) and Cap Digital (project Mediatic).
[8] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based
explicit semantic analysis, Proceedings of IJCAI, pp. 1606-1611, 2007.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Cheng, Ya-Chun and Chen, Shu-Yuan.
          <article-title>Image classi cation using color, texture and regions</article-title>
          .
          <source>Image Vision Computing</source>
          ,
          <volume>21</volume>
          (
          <issue>9</issue>
          ):
          <volume>759</volume>
          {
          <fpage>776</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Adrian</given-names>
            <surname>Popescu</surname>
          </string-name>
          , Herve Le Borgne,
          <article-title>Pierre-Alain Moellic, Conceptual Image Retrieval over the Wikipedia Corpus Working notes for the CLEF</article-title>
          2008 Workshop, Aarhus, Denmark,
          <fpage>17</fpage>
          -
          <issue>19</issue>
          <year>September 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Adrian</given-names>
            <surname>Popescu</surname>
          </string-name>
          , Herve Le Borgne,
          <article-title>Pierre-Alain Moellic, Conceptuel Image retrieval over a Large Scale Database</article-title>
          .
          <source>In Evaluating Systems for Multilingual and Multimodal Information Access, Proceedings of the 9th Workshop of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science</source>
          , vol.
          <volume>5706</volume>
          , Springer 2009.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Theodora</given-names>
            <surname>Tsikrika</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jana</given-names>
            <surname>Kludas</surname>
          </string-name>
          .
          <article-title>Overview of the wikipediaMM task at ImageCLEF 2009</article-title>
          .
          <source>CLEF workng notes 2009</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Theodora</given-names>
            <surname>Tsikrika</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jana</given-names>
            <surname>Kludas</surname>
          </string-name>
          .
          <article-title>Overview of the wikipediaMM task at ImageCLEF 2008</article-title>
          .
          <article-title>In Evaluating Systems for Multilingual and Multimodal Information Access</article-title>
          ,
          <source>Proceedings of the 9th Workshop of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science</source>
          , vol.
          <volume>5706</volume>
          , pp.
          <fpage>539</fpage>
          -
          <lpage>550</lpage>
          ,
          <year>Springer 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marszalek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cordelia</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Local features and kernels for classifcation of texture and object categories: An in-depth study</article-title>
          .
          <source>Technical Report RR-5737</source>
          , INRIA Rh^
          <article-title>one-</article-title>
          <string-name>
            <surname>Alpes</surname>
          </string-name>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. W. .M .</given-names>
            <surname>Smeulders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta &amp; R. Jain</surname>
          </string-name>
          <article-title>Content-Based Image Retrieval at the End of the Early Years</article-title>
          ,
          <source>IEEE Trans. on Patt. Anal. and Machine Intell</source>
          ., vol.
          <volume>22</volume>
          , No.
          <volume>12</volume>
          , pp.
          <fpage>467</fpage>
          -
          <lpage>477</lpage>
          ,
          <year>1997</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>