<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>XRCE's Participation to ImageCLEFphoto 2007</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stephane Clinchant</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean-Michel Renders</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriela Csurka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Xerox Research Centre Europe</institution>
          ,
          <addr-line>6 ch. de Maupertuis, 38240 Meylan</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2007</year>
      </pub-date>
      <fpage>403</fpage>
      <lpage>410</lpage>
      <abstract>
        <p>Our participation in ImageCLEFphoto07, for the first time, was motivated by the assessment of several transmedia similarity measures that we recently designed and developed. The object of investigation here is a set of “intermediate level” fusion approaches, where we use principles coming from pseudo-relevance feedback and, more specifically, use transmedia pseudo-relevance feedback to enrich the mono-media representation of an object with features coming from the other medium. One issue that arises when adopting such a strategy is determining how to compute the mono-media similarity between an aggregate of objects coming from a first (pseudo-)feedback step and one single multimodal object. We propose two alternative ways of addressing this issue, which result in what we call the “transmedia document reranking” and “complementary feedback” methods respectively. This year, with a “lightly” annotated corpus of images, it appears that mono-media retrieval performance is more or less equivalent for pure image and pure text content (around 20% MAP). Using our transmedia pseudo-feedback-based similarity measures allowed us to dramatically increase the performance by ∼50% (relative). Modelling the textual “relevance concept” present in the top-ranked documents issued from a first (purely visual) retrieval and combining this model with the textual part of the original query turns out to be the best strategy, being slightly superior to our transmedia document reranking method. Enriching the image annotations with extra tags extracted from an external resource (namely the Flickr database) does not offer a significant advantage on the ImageCLEF07 corpus, even though we observed an improvement on other multimedia corpora and query sets. From a cross-lingual perspective, the use of domain-specific, corpus-adapted probabilistic dictionaries seems to offer better results than the use of a broader, more general standard dictionary. With respect to the monolingual baselines, multilingual runs show a slight degradation of retrieval performance (∼6 to 10% relative).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Efficient access to multimedia information requires the ability to search and organize the information. While the technology to search text has been available for some time (and, in the form of web search engines, is familiar to many people), the technology to search images and videos is much more challenging. Early systems were based mainly on visual similarity with a query image, making use of low-level features like texture, color, and shape. A purely visual approach to retrieval has several drawbacks: it does not actually bridge the semantic gap, but rather forces the user to work in a low-level feature space. A gap remains between the user’s conceptualization of a query and the query that is actually specified to the system.</p>
      <p>The scientific challenge is to understand the nature of the interaction between text and images: how can a text be associated with an image (and, reciprocally, an illustrative image with a text)? How can we organize and access text and image repositories in a better way than with naive late-fusion techniques? Many attempts have been made to find correlations between image and text features. The main difficulty is to overcome the semantic gap and, especially, the fact that visual and textual features are expressed at different semantic levels. As far as the “pure” visual mode is concerned, the features associated with images have become more and more complex, trying to abstract their representation and to bridge the semantic gap. These attempts take at least two directions. The first is to develop more expressive, potentially hierarchical visual vocabularies, relying on latent semantic extraction techniques such as Probabilistic Latent Semantic Analysis (PLSA) or Latent Dirichlet Allocation (LDA). The second focuses on segmentation approaches, which aim at cutting an image into several semantically related regions.</p>
      <p>
        Departing from the classical “late fusion” strategy, recent approaches have considered fusion at the feature level, estimating correspondences or joint distributions between components across the image and text modes from training data. The main idea is to enrich the images with textual data (annotations), and vice versa, in order to facilitate their retrieval. These methods can be classified into three general families: latent variable models, graph models and cross-lingual models. Latent variable models generally extend PLSA or LDA to jointly explain image and text (e.g. [
        <xref ref-type="bibr" rid="ref1 ref17">17, 1</xref>
        ]). Graph models consider the structure of an image through a graph, e.g. with a Markov network [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or a concept graph (e.g. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]). Finally, cross-lingual models find their inspiration in machine translation and cross-lingual information retrieval (e.g. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]).
      </p>
      <p>
        Our work is most aligned with the third family, namely cross-lingual models (CLM). The main idea of CLM applied to hybrid text-image retrieval is to consider the visual feature space as a language constituted of blobs or patches, which we will simply call visual words. Unlike [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which was inspired by machine translation, Jeon et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposed to extend cross-lingual relevance models to cross-media relevance models. This method estimates the joint probability distribution of the blobs that can appear in an image and the words that can appear in its caption, assuming mutual independence between a word and the blobs given the image. These joint probabilities can be used in two ways, to annotate or to retrieve images. The authors further showed in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] that working directly with continuous features describing the patches, instead of discrete visual words, was even better. Further extensions of this model are an improved normalization in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and a Bernoulli distribution for text in Feng et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Relevance models are a family of pseudo-feedback methods: given a query q and a database of objects o, relevance models first compute the probability P(q|o) and then automatically enhance and enrich the query with textual or visual features. These models, when extended to mixed modalities, can be considered as the ancestors of the methods proposed last year by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], for instance. These latter models can be called intermedia feedback, or transmedia feedback, techniques. They do not rely on relevance models, but act in the same spirit. For example, from a query image, a first visual similarity is computed and an initial set of (assumed) relevant objects is retrieved. As the objects are multimodal, each image also has a text, and this text can feed any text feedback method (other than relevance models). In other words, the modality of the data is switched, from image to text or from text to image, during the (pseudo-)feedback process. In that sense, transmedia techniques generalize the pseudo-feedback idea present in cross-media relevance models, but are freed from the particular textual and/or visual models that cross-media relevance models propose.
      </p>
      <p>In the remainder of this report, we first describe our mono-media similarities. The next section explains the different cross-media similarity models we developed for ImageCLEFphoto 2007. Finally, we describe our official runs and conclude.</p>
    </sec>
    <sec id="sec-2">
      <title>Monomedia Similarities</title>
      <sec id="sec-2-1">
        <title>Cross-Entropy measure of similarity</title>
        <p>Starting from a traditional bag-of-words representation of pre-processed texts (here, preprocessing includes tokenization, lemmatization, word decompounding and standard stopword removal), we adopt the language modelling approach to information retrieval as the basis of our asymmetric similarity measure:
• Let p(w|q) be the multinomial language model of a text query q (obtained by maximum likelihood estimates, i.e. by simple counting and normalisation).
• Let p(w|d) be the multinomial language model of a document d. Document language models are smoothed via the Jelinek-Mercer method (other schemes are applicable, such as Dirichlet prior or absolute discounting):
p(w|d) = α pMLE(w|d) + (1 − α) p(w|Corpus)   (1)
where pMLE(w|d) (resp. p(w|Corpus)) is simply the ratio of the number of occurrences of w in the textual object d (resp. in the global corpus) to the total document (resp. corpus) length in words.</p>
        <p>The cross-entropy function is used as our textual similarity measure:
simtxt(q, d) = CE(q|d) = Σ_w p(w|q) log p(w|d)   (2)</p>
        <p>This is obviously an asymmetric similarity measure. It can be trivially generalised to define the similarity between two textual objects d1 and d2:
simtxt(d1, d2) = CE(d1|d2) = Σ_w pMLE(w|d1) log p(w|d2)   (3)</p>
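        <p>As a minimal illustrative sketch of equations 1 to 3 (function names, the smoothing weight and the toy corpus below are ours, not from our actual system):</p>

```python
import math
from collections import Counter

def lm(tokens):
    """Maximum-likelihood unigram language model p_MLE(w|text):
    simple counting and normalisation."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cross_entropy(query_lm, doc_lm, corpus_lm, alpha=0.7):
    """CE(q|d) with Jelinek-Mercer smoothing of the document model:
    sum_w p(w|q) * log(alpha * p_MLE(w|d) + (1 - alpha) * p(w|Corpus))."""
    score = 0.0
    for w, p_q in query_lm.items():
        p_d = alpha * doc_lm.get(w, 0.0) + (1 - alpha) * corpus_lm.get(w, 1e-9)
        score += p_q * math.log(p_d)
    return score

# toy two-document "annotation" corpus
docs = [["termas", "papallacta", "ecuador"], ["snow", "volcano", "ecuador"]]
corpus_lm = lm([w for d in docs for w in d])
query = lm(["volcano", "ecuador"])
scores = [cross_entropy(query, lm(d), corpus_lm) for d in docs]
assert scores[1] > scores[0]  # the document sharing both query terms ranks first
```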
      </sec>
      <sec id="sec-2-2">
        <title>Enriching Text with Flickr</title>
        <p>
          Motivated by the fact that, this year, the textual content of the documents was very poor (text annotations were limited to the &lt;TITLE&gt; fields of the documents), we decided to enrich the corpus thanks to the Flickr database [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], at least for the texts in English. The Flickr API provides a function to get the tags related to a given tag [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. According to the Flickr documentation, this function returns a list of tags ‘related’ to the given tag, based on clustered usage analysis. It appears that queries, on the one hand, and photographic annotations, on the other hand, adopt different levels of description. Queries are often more abstract and more general than annotations. As a consequence, it is easier and more relevant to enrich the annotations rather than the queries: related tags are often at the same semantic level or at the upper (more general) one. Table 1 shows some examples of enrichment terms related to the annotation corpus. We can observe that the related terms do encode a kind of semantic similarity, often towards a more abstract direction, but also contain some noise or ambiguity.
        </p>
        <p>Below is an example of an enriched document, where each original term has been expanded with its top 20 related terms:
DOCNO: annotations/00/116.eng
ORIGINAL TEXT: Termas de Papallacta Papallacta Ecuador
ADDED TERMS: chillan colina sur caracalla cajon piscina snow roma italy maipo thermal nieve volcan argentina mendoza water italia montaa araucania santiago quito southamerica germany worldcup soccer football bird andes wm church fifa volcano iguana cotopaxi travel mountain mountains cathedral sealion market</p>
        <p>Enriching the text corpus partially solved the term mismatch, but it also introduced a lot of noise into the documents. Hence, most of the probability mass of the language model is devoted to the original text of a document. In the language modelling framework, the enrichment terms act as a smoothing method: we give more weight to the terms from the original text than to those added, by linear interpolation between the original document language model and the word profile derived from Flickr.</p>
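        <p>A rough sketch of this interpolation (the weight and the related-tag profile below are hypothetical placeholders, not the values used in our runs):</p>

```python
import math

def enrich_lm(doc_lm, flickr_lm, beta=0.8):
    """Interpolate the original document LM with the Flickr tag profile,
    giving more weight (beta) to the original terms."""
    vocab = set(doc_lm) | set(flickr_lm)
    return {w: beta * doc_lm.get(w, 0.0) + (1 - beta) * flickr_lm.get(w, 0.0)
            for w in vocab}

doc = {"papallacta": 0.5, "ecuador": 0.5}
# hypothetical word profile derived from Flickr related tags
flickr = {"volcano": 0.4, "thermal": 0.3, "quito": 0.3}
enriched = enrich_lm(doc, flickr)
assert math.isclose(sum(enriched.values()), 1.0)   # still a distribution
assert enriched["papallacta"] > enriched["volcano"]  # original terms dominate
```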
        <p>Note that this kind of semantic enrichment was done only for the English documents (even if some words in other languages are automatically, and erroneously, added). As we decided to investigate the bilingual case as well (choosing German as the second language), we also built probabilistic translation matrices (ENG-GER) with a standard alignment method (GIZA++), using the small set of parallel sentences that we were allowed to exploit in the ImageCLEFphoto 2007 corpus. It appeared that the bilingual lexicons automatically extracted from this parallel corpus provided better results than broader, but more general, standard dictionaries. It is worth emphasizing that such automatically extracted lexicons often have the extra advantage of realizing some semantic smoothing as a side effect: related, but not strict, translations are often derived as potential candidates. The probabilistic translation dictionaries are applied to the source-language model of the query to give the new target-language model of the query by matrix product.</p>
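        <p>A minimal sketch of this matrix product (the vocabularies and translation probabilities below are invented for illustration; since each row of the matrix sums to one, the result remains a probability distribution):</p>

```python
import numpy as np

# hypothetical probabilistic translation matrix P(e|g):
# rows = German source words, columns = English target words
de_vocab = ["kirche", "berg"]
en_vocab = ["church", "cathedral", "mountain"]
T = np.array([[0.7, 0.3, 0.0],   # kirche -> church / cathedral
              [0.0, 0.0, 1.0]])  # berg   -> mountain

de_query_lm = np.array([0.5, 0.5])  # uniform LM over the query "kirche berg"
en_query_lm = de_query_lm @ T       # target-language query LM by matrix product

assert np.isclose(en_query_lm.sum(), 1.0)
assert en_query_lm[0] > en_query_lm[1]  # "church" inherits most of "kirche"'s mass
```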
        <sec id="sec-2-2-1">
          <title>Image</title>
          <p>
            The image similarity was defined from a continuous vectorial representation of the image, obtained as follows. Image patches are first extracted on regular grids at 5 different scales, with a ratio of √2 between two consecutive scales. Two types of low-level features are used: grey-level SIFT-like features [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] and color features. In both cases the image patch is subdivided into 4 × 4 = 16 subregions. The SIFT-like features are then computed as gradient orientation histograms (8 bins) collected on each subregion, leading to a vector of 128 dimensions. The color features are simple means and standard deviations of the 3 RGB channels in the same subregions, which leads to a 96-dimensional feature vector. The dimensionality of both types of features is subsequently reduced to 50 using principal component analysis (PCA).
          </p>
          <p>
            Then, a Gaussian Mixture Model (GMM) clustering [
            <xref ref-type="bibr" rid="ref20 ref6">6, 20</xref>
            ] is performed to build a visual vocabulary [
            <xref ref-type="bibr" rid="ref21 ref4">21, 4</xref>
            ] of low-level image features, where each Gaussian component models a visual word (the vocabulary is characterized by λ = {wi, μi, σi, i = 1...N}, where wi, μi and σi denote respectively the weight, mean vector and covariance matrix of Gaussian i).
          </p>
          <p>Two visual vocabularies are built: one based on texture, the other on color. Both have a dictionary size of 64 (meaning that we have 64 Gaussian components for each).</p>
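          <p>As an illustration, such a vocabulary could be fitted as follows. This is a toy sketch using scikit-learn's GaussianMixture with diagonal covariances; the descriptor dimensionality and vocabulary size are reduced for readability and are not those of our experiments:</p>

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# toy stand-in for PCA-reduced patch descriptors (50-D in the paper, 5-D here)
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(2000, 5))

# one GMM per channel (texture, colour); 64 components in the paper, 8 here
vocab = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
vocab.fit(descriptors)

# each Gaussian component plays the role of one "visual word"
assert vocab.weights_.shape == (8,)
assert vocab.means_.shape == (8, 5)
```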
          <p>
            Finally, we represent each image with a Fisher Kernel based normalized gradient vector as
proposed in [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]. The main idea is that, given a generative model (here the Gaussian Mixture Model) with parameters λ, one can compute the gradient vector of each sample I:
∇λ log p(I|λ)   (4)
Intuitively, the gradient of the log-likelihood describes the direction in which the parameters should be modified to best fit the data. It transforms a variable-length sample I into a fixed-length vector whose size depends only on the number of parameters in the model.
          </p>
          <p>
            Before computing a similarity measure between images, each vector is first normalized using
the Fisher Information matrix Fλ, as suggested in [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]:
</p>
          <p>Fλ = EX [∇λ log p(X|λ) ∇λ log p(X|λ)′]   (5)</p>
          <p>The normalized gradient vector, simply called the Fisher vector, is thus given by:
f = Fλ^(−1/2) ∇λ log p(X|λ)   (6)</p>
          <p>
            See [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] for closed-form approximations of Fλ^(−1/2).
          </p>
          <p>The Fisher vectors for color and texture are then simply concatenated.</p>
          <p>To compute the similarity between two images I and J, we simply use the L1 norm of the difference between their Fisher vectors:
simImg(I, J) = normmax − ||fI − fJ||1 = normmax − Σ_i |fI_i − fJ_i|   (7)
where f_i are the elements of the vector f and normmax = 2.</p>
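          <p>Equation 7 can be sketched as follows (with toy vectors; the real Fisher vectors concatenate the color and texture gradients):</p>

```python
import numpy as np

def sim_img(f_i, f_j, normmax=2.0):
    """Equation (7): normmax minus the L1 distance between two Fisher vectors."""
    return normmax - np.abs(f_i - f_j).sum()

# toy "Fisher vectors" standing in for the concatenated colour + texture gradients
f_a = np.array([0.6, 0.8, 0.0])
f_b = np.array([0.6, 0.8, 0.0])
f_c = np.array([0.0, 0.6, 0.8])

assert sim_img(f_a, f_b) == 2.0              # identical images: maximal similarity
assert sim_img(f_a, f_b) > sim_img(f_a, f_c)  # dissimilar images score lower
```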
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Cross-Media Similarities</title>
      <p>We want to define cross-media similarity measures that are more elaborate, and hopefully more effective, than simple late-fusion approaches. What we want to investigate here are “intermediate level” fusion approaches, where we use principles coming from pseudo-relevance feedback and, more specifically, use transmedia pseudo-relevance feedback to enrich the mono-media representation of an object with features coming from the other medium. Our basic material is a multimedia (text + image) database O: in other words, we assume that we have at our disposal a collection of images with associated text (for each image) or, in a dual view, a collection of texts illustrated with one image (for each textual element).</p>
      <sec id="sec-3-1">
        <title>Transmedia Document Reranking Approach</title>
        <p>The main idea is the following: for a given image i, consider as new features the (textual) terms of the texts associated with the most similar images (from a purely visual viewpoint). We denote this neighbouring set by Nimg(i). Its size is fixed a priori: this is typically the topN objects returned by a retrieval system using equation 7 as the ranking measure. Then we can compute a new similarity with respect to any multimodal object j of the collection O as the textual similarity of this new representation of the image i with the textual part of j.</p>
        <p>We still need to be more precise about how we compute the mono-media similarity between an aggregate of objects Nimg(i) and one single multimodal object. There are three families of approaches:
1. aggregating Nimg(i) to form a single object (typically by concatenation) and then computing the standard similarity between two objects;
2. aggregating all the similarity measures (assuming we can) between all possible couples of objects;
3. using a pseudo-feedback algorithm (for instance Rocchio’s algorithm) to extract the relevant, meaningful features of the aggregate and finally using a mono-media similarity.</p>
        <p>The first approach we propose belongs to family 2 (using a simple sum or, equivalently, an arithmetic average to aggregate the individual similarity measures). The next section (Complementary Feedback) will propose an alternative approach that belongs to family 3.</p>
        <p>More formally, if we denote by T(u) the text associated with multimodal object u and by Tˆ(i) the new textual representation of image i, then the new cross-media similarity measure w.r.t. the multimodal object j is:
simImgTxt(i, j) = simtxt(Tˆ(i), T(j)) = Σ_{d∈Nimg(i)} simtxt(T(d), T(j))   (8)
where simtxt is typically defined by equation 3 (e.g. the one based on language modelling, even if it is asymmetric).</p>
        <p>This method can be seen as a reranking method. Suppose that q is some image query; if T(d) is the text of an image belonging to the initial feedback set Nimg(q), then the rank of T(d)’s own neighbours in the textual sense will be increased, even if they are not so similar from a purely visual viewpoint. In particular, this allows us to define a similarity between a purely image query and a simple textual object without a visual counterpart. To sum up, the main idea of our method amounts to (i) performing an initial retrieval step to identify Nimg(q), and (ii) switching mode and virtually making several queries (one for each element in Nimg(q) instead of one) and combining them afterwards. Due to renormalization effects and smoothing methods, the resulting ranking function is different from the one obtained by considering the simple concatenation of the texts in step (ii), since the considered models are not linear. Lastly, the values simtxt(T(u), T(v)) can be pre-computed in a matrix of textual similarities between all pairs of objects in the multimedia database O, if the corpus is of reasonable size.</p>
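        <p>Assuming precomputed similarity matrices, the reranking similarity of equation 8 can be sketched as follows (the function name is ours, and the matrices below are random placeholders):</p>

```python
import numpy as np

def sim_img_txt(query_idx, sim_img_matrix, sim_txt_matrix, top_n=5):
    """Equation 8: sum, over the query's top-N visual neighbours d in N_img(i),
    of the text similarity between T(d) and every object j of the collection."""
    neighbours = np.argsort(-sim_img_matrix[query_idx])[:top_n]  # N_img(i)
    return sim_txt_matrix[neighbours].sum(axis=0)                # simple-sum aggregation

# random placeholder similarity matrices over a 6-object collection
rng = np.random.default_rng(1)
S_img = rng.random((6, 6))
S_txt = rng.random((6, 6))
scores = sim_img_txt(0, S_img, S_txt, top_n=2)
assert scores.shape == (6,)  # one cross-media score per object in O
```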
        <p>By duality, we can define another cross-media similarity measure: for a given text i, we consider as new features the Fisher vectors of the images associated with the most similar texts (from a purely textual viewpoint) in the multimodal database. We denote this neighbouring set by Ntxt(i). If we denote by I(u) the image associated with multimodal object u and by Iˆ(i) the new visual representation of text i, then the new cross-media similarity measure is:
simTxtImg(i, j) = simimg(Iˆ(i), I(j)) = Σ_{d∈Ntxt(i)} simimg(I(d), I(j))   (9)
where simimg is typically defined by equation 7.</p>
        <p>Note that we could even extend these definitions inside one mode. For instance, we have:
simTxtTxt(i, j) = simtxt(Tˆ(i), T(j)) = Σ_{d∈Ntxt(i)} simtxt(T(d), T(j))   (10)
and
simImgImg(i, j) = simimg(Iˆ(i), I(j)) = Σ_{d∈Nimg(i)} simimg(I(d), I(j))   (11)</p>
        <p>Once again, the process can be fast and efficient if we can precompute the similarity matrices simimg(I(u), I(v)) and/or simtxt(T(u), T(v)) for all pairs (u, v) of multimodal objects in O.</p>
        <p>Finally, we can combine all these similarities to define a global similarity measure between two multimodal objects i and j, for instance using a linear combination:
simglob(i, j) = λ1 simtxt(T(i), T(j)) + λ2 simimg(I(i), I(j)) + λ3 simImgTxt(i, j) + λ4 simTxtImg(i, j)   (12)</p>
      </sec>
      <sec id="sec-3-2">
        <title>Complementary Feedback</title>
        <p>Recall that the fundamental problem in transmedia feedback is to define how we compute the mono-media similarity between an aggregate of objects Nimg(i) (or Ntxt(i)) and one single multimodal object. Instead of adopting the strategy of the previous section, we now consider the set Nimg(i) as the “relevance concept” F and derive its corresponding language model (LM) θF. Afterwards, we can use the cross-entropy criterion between θF and the LM of the textual part of any object j in O as the new transmedia similarity. We illustrate this approach in the case where we use Nimg(i) (using only the image part of the query object i) to derive a textual LM θF that can be used in conjunction with the original LM of the textual part of query i.</p>
        <p>To this aim (building the LM of the relevance concept F), we use a pseudo-feedback method issued from the language modelling approach to information retrieval, namely the mixture model method of Zhai and Lafferty [23], originally designed to enrich textual queries (more elaborate feedback techniques for language models can also be envisaged, e.g. [22]). Let θF be a multinomial parameter standing for the distribution of relevant terms in F: in other words, θF is a probability distribution over words, but peaked on the relevant terms. A generative model is assumed in order to estimate θF from F:
P(F|θF) = Π_{d∈Nimg(i)} Π_w (λ θF,w + (1 − λ) P(w|C))^c(w,d)   (13)
where P(w|C) is the word probability built upon the corpus and λ is a fixed parameter, which can be understood as a noise parameter for the distribution of terms; c(w, d) is the number of occurrences of term w in document d. Finally, θF is learned by maximum likelihood with an Expectation-Maximization (EM) algorithm. Once θF has been estimated, a new query LM can be obtained through interpolation:
θnew query = α θold query + (1 − α) θF   (14)
where θold query corresponds to the LM of the textual part of the query i. As mentioned, we then use the cross-entropy similarity measure to perform a new retrieval on the textual part of the objects in O.</p>
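        <p>A minimal sketch of this estimation and interpolation (our own illustrative implementation of the mixture-model EM; the toy feedback set and corpus model are invented):</p>

```python
import math
from collections import Counter

def estimate_theta_F(feedback_docs, corpus_lm, lam=0.5, n_iter=30):
    """EM for the Zhai-Lafferty mixture model: each term occurrence in the
    feedback set is generated either by the relevance model theta_F (prob. lam)
    or by the corpus background model (prob. 1 - lam)."""
    counts = Counter(w for d in feedback_docs for w in d)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}  # uniform initialisation
    for _ in range(n_iter):
        # E-step: posterior probability that an occurrence of w came from theta_F
        t = {w: lam * theta[w] / (lam * theta[w] + (1 - lam) * corpus_lm.get(w, 1e-9))
             for w in vocab}
        # M-step: re-estimate theta_F from the expected counts
        z = sum(counts[w] * t[w] for w in vocab)
        theta = {w: counts[w] * t[w] / z for w in vocab}
    return theta

def interpolate(old_query_lm, theta_F, alpha=0.5):
    """Equation (14): theta_new = alpha * theta_old + (1 - alpha) * theta_F."""
    vocab = set(old_query_lm) | set(theta_F)
    return {w: alpha * old_query_lm.get(w, 0.0) + (1 - alpha) * theta_F.get(w, 0.0)
            for w in vocab}

# toy example: "beach" is frequent in the feedback set but common in the corpus,
# while "surf" is feedback-specific, so EM should concentrate mass on "surf"
corpus_lm = {"beach": 0.5, "surf": 0.01, "city": 0.49}
feedback = [["beach", "surf"], ["surf", "beach", "surf"]]
theta_F = estimate_theta_F(feedback, corpus_lm)
assert theta_F["surf"] > theta_F["beach"]

new_q = interpolate({"beach": 1.0}, theta_F)
assert math.isclose(sum(new_q.values()), 1.0)  # still a distribution
```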
        <p>Setting the value of α is done experimentally and adapted to the particular collection. The
robustness of the estimation of θF has a significant impact on the value of α. Lastly, the value of
α can be interpreted as a mixing weight between image and text.</p>
        <p>Finally, note that we illustrated the approach using Nimg(i) to derive a textual LM θF that can be used in conjunction with the original LM of the textual part of query i. But we can derive a similar scheme using Ntxt(i) to derive a new representation (actually some generalized Fisher vectors) of the “relevance concept”, this time relying on Rocchio’s method, which is better adapted to continuous feature representations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Description of submitted runs</title>
      <p>
        For a description of the task, we refer to the overview paper [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Table 2 shows the names of our runs and the corresponding mean average precision measures.</p>
      <p>Below is a detailed description of our official runs.</p>
      <sec id="sec-4-1">
        <title>EN-EN-AUTO-FB-TXT FLR</title>
        <p>This run was a pure text run: documents were basically preprocessed and each document was enriched using the Flickr database. For each term of a document, its top 20 related tags from Flickr were added to the document. Then, a unigram language model is estimated for each document, giving more weight to the original document terms. An additional step of pseudo-relevance feedback, using the method explained in [23], is then performed.</p>
      </sec>
      <sec id="sec-4-2">
        <title>AUTO-NOFB-IMG COMBFK</title>
        <p>This run is a pure image run: it uses the Fisher kernel metric (cf. equation 7) to define the image similarity. As a query encompasses 3 visual sub-queries, we have to combine the similarity scores with respect to these 3 sub-queries. To this aim, the result lists from the image sub-queries are renormalized (by subtracting the mean and dividing by the standard deviation) and merged by simple sum.</p>
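        <p>This normalisation-and-sum merging can be sketched as follows (with toy scores; the function name is ours):</p>

```python
import numpy as np

def merge_result_lists(score_lists):
    """Normalise each sub-query's score list (subtract the mean, divide by the
    standard deviation) and merge the lists by simple sum."""
    normalised = [(s - s.mean()) / s.std() for s in score_lists]
    return np.sum(normalised, axis=0)

# toy scores of 3 visual sub-queries over 4 documents, on different scales
sub_q = [np.array([9.0, 1.0, 2.0, 3.0]),
         np.array([0.9, 0.1, 0.8, 0.2]),
         np.array([5.0, 2.0, 6.0, 1.0])]
merged = merge_result_lists(sub_q)
assert merged.shape == (4,)
assert merged.argmax() == 0  # document 0 is top ranked by two of the three lists
```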
      </sec>
      <sec id="sec-4-3">
        <title>AUTO-FB-TXTIMG PREFFKTXT.off</title>
        <p>This run uses both texts and images: it starts from the query images only, to determine Nimg(i) for each query i (as in the previous run), and then implements the method described by equation 8. The size of the neighbouring set is 5.</p>
      </sec>
      <sec id="sec-4-4">
        <title>AUTO-FB-TXTIMG PREFFKTXT FLR</title>
        <p>This is basically the same algorithm as the preceding run, except that the textual part of the data (the annotations) is enriched with Flickr tags.</p>
      </sec>
      <sec id="sec-4-5">
        <title>EN-EN-AUTO-FB-TXTIMG QTXT COMBPREFFKTXT.off</title>
        <p>This run uses the same algorithm as AUTO-FB-TXTIMG PREFFKTXT, but with one more step at the end, which amounts to merging the result lists from AUTO-FB-TXTIMG PREFFKTXT and from the purely textual queries (EN-EN-AUTO-FB-TXT FLR), by summing the relevance scores after normalisation (subtracting the mean and dividing by the standard deviation for each list).</p>
      </sec>
      <sec id="sec-4-6">
        <title>EN-EN-AUTO-FB-TXTIMG MPRF.off</title>
        <p>This run uses both texts and images: it starts from the query images only, to determine the relevance set Nimg(i) for each query i (as in the run AUTO-FB-TXTIMG PREFFKTXT.off), and then implements the method described as “complementary (intermedia) feedback” in section 3.2. The size of the neighbouring set is 15. Referring to the notations of section 3.2, the values of λ and α are both 0.5.</p>
      </sec>
      <sec id="sec-4-7">
        <title>DE-EN-AUTO-FB-TXTIMG MPRF FLR.off</title>
        <p>This run works on the same principle as the previous run EN-EN-AUTO-FB-TXTIMG MPRF.off. The main difference is that the (target) English documents have been enriched with Flickr tags, and that the initial query, in German, was translated by multiplying its language model by the probabilistic translation matrix extracted from the (small) parallel part of the corpus. Otherwise, it uses the same parameters as previously.</p>
        <p>A companion run uses the same process as EN-EN-AUTO-FB-TXTIMG MPRF.off with the reverse starting point: English queries searching German annotations. The English queries are translated with the probabilistic translation matrix extracted from the (small) parallel part of the corpus, and the translated queries follow the same process as in EN-EN-AUTO-FB-TXTIMG MPRF.off, but with different parameters: the size of the neighbouring set is 10, while the values of λ and α are respectively 0.5 and 0.7.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>With a lightly annotated corpus of images, also characterised by an abstraction level in the textual descriptions that is significantly different from the one used in the queries, it appears that mono-media retrieval performance is more or less equivalent for pure image and pure text content (around 20% MAP). Using our transmedia pseudo-feedback-based similarity measures allowed us to dramatically increase the performance by ∼50% (relative). Modelling the textual “relevance concept” present in the top-ranked documents issued from a first (purely visual) retrieval and combining this model with the textual part of the original query turns out to be the best strategy, being slightly superior to our transmedia document reranking method. Enriching the image annotations with extra tags extracted from the Flickr database does not offer a significant advantage on the ImageCLEF07 corpus, even though we observed an improvement on other multimedia corpora and query sets. From a cross-lingual perspective, the use of domain-specific, corpus-adapted probabilistic dictionaries seems to offer better results than the use of a broader, more general standard dictionary. With respect to the monolingual baseline, the multilingual runs show a slight degradation of retrieval performance (∼6 to 10% relative).</p>
      <p>In the future, we want to investigate more systematically and more thoroughly the ways to
combine the numerous transmedia similarity measures we introduced in this report, by determining
in which cases they can provide us with significant advantages with respect to more traditional
“late fusion” approaches.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partly funded by the French Government under the Infomagic project, part of the
Pole CAP DIGITAL (IMVN) de Paris, Ile-de-France. The authors also want to thank Florent
Perronnin for his greatly appreciated help in applying some of the Generic Visual Categorizer
(GVC) components in the ImageCLEF07 experiments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Modeling annotated data</article-title>
          .
          <source>In ACM SIGIR</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Carbonetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>de Freitas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Barnard</surname>
          </string-name>
          .
          <article-title>A statistical model for general contextual object recognition</article-title>
          .
          <source>In ECCV</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.-H.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Approaches of using a word-image ontology and an annotated image corpus as intermedia for cross-language image retrieval</article-title>
          .
          <source>In CLEF 2006 Working Notes</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Csurka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Willamowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bray</surname>
          </string-name>
          .
          <article-title>Visual categorization with bags of keypoints</article-title>
          .
          <source>In ECCV Workshop on Statistical Learning for Computer Vision</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Duygulu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Barnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>de Freitas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Forsyth</surname>
          </string-name>
          .
          <article-title>Object recognition as machine translation: learning a lexicon for a fixed image vocabulary</article-title>
          .
          <source>In ECCV</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Farquhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Szedmak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          .
          <article-title>Improving “bag-of-keypoints” image categorisation</article-title>
          .
          <source>Technical report</source>
          , University of Southampton,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lavrenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          .
          <article-title>Multiple Bernoulli relevance models for image and video annotation</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Flickr</surname>
          </string-name>
          .
          <article-title>Flickr API</article-title>
          . http://www.flickr.com/services/api/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Flickr</surname>
          </string-name>
          . tags.getrelated. http://www.flickr.com/services/api/flickr.tags.getRelated.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grubinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Overview of the ImageCLEFphoto 2007 photographic retrieval task</article-title>
          .
          <source>In Working Notes of the 2007 CLEF Workshop</source>
          , Budapest, Hungary, Sept.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jaakkola</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Haussler</surname>
          </string-name>
          .
          <article-title>Exploiting generative models in discriminative classifiers</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>11</volume>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lavrenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          .
          <article-title>Automatic image annotation and retrieval using cross-media relevance models</article-title>
          .
          <source>In ACM SIGIR</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lavrenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          .
          <article-title>Models for automatic video annotation and retrieval</article-title>
          .
          <source>In ICASSP</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lavrenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeon</surname>
          </string-name>
          .
          <article-title>A model for learning the semantics of pictures</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>IJCV</source>
          ,
          <volume>60</volume>
          (
          <issue>2</issue>
          ),
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Maillot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Chevallet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Valea</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lim</surname>
          </string-name>
          .
          <article-title>IPAL inter-media pseudo-relevance feedback approach to ImageCLEF 2006 photo retrieval</article-title>
          .
          <source>In CLEF 2006 Working Notes</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Monay</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gatica-Perez</surname>
          </string-name>
          .
          <article-title>PLSA-based image auto-annotation: constraining the latent space</article-title>
          .
          <source>In ACM MM</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Duygulu</surname>
          </string-name>
          .
          <article-title>GCap: Graph-based automatic image captioning</article-title>
          .
          <source>In CVPR Workshop on Multimedia Data and Document Engineering</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Dance</surname>
          </string-name>
          .
          <article-title>Fisher kernels on visual vocabularies for image categorization</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Csurka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Bressan</surname>
          </string-name>
          .
          <article-title>Adapted vocabularies for generic visual categorization</article-title>
          .
          <source>In ECCV</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Sivic</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Video Google: A text retrieval approach to object matching in videos</article-title>
          . In ICCV, volume
          <volume>2</volume>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>