              TELECOM ParisTech at ImageClefphoto
                  2008: Bi-Modal Text and Image
               Retrieval with Diversity Enhancement
                              Marin Ferecatu∗†             Hichem Sahbi†∗
∗ Institut TELECOM, TELECOM ParisTech
† CNRS LTCI, UMR 5141
                              46, rue Barrault, 75634 Paris Cedex, France
                           Marin.Ferecatu@telecom-paristech.fr
                            Hichem.Sahbi@telecom-paristech.fr


                                                  Abstract
      In this paper we describe the participation of TELECOM ParisTech in the ImageClefphoto
      2008 challenge. This edition focuses on promoting diversity in the results produced by the
      retrieval systems. Given the high level semantic content of the topics, search engines based
      solely on text or visual descriptors are unlikely to offer satisfactory results. Our system uses
      several text and visual descriptors, as well as several combination algorithms to improve the
      overall retrieval performance. The text part includes a collection of manually built boolean
      queries and a set of textual descriptors extracted automatically using dictionary filtering and
      dimensionality reduction. Text and visual descriptors are combined using two strategies: ad-
      hoc concatenation and re-ranking. Diversification reduces the redundancy in the final results and is
      obtained using two techniques: threshold clustering and maxmin exploration. Several runs were submitted
      to the challenge, including individual (text-only or visual-only) runs, combined runs, and runs with
      different diversity settings. The results show that the combined runs outperform the individual runs by a
      significant margin. These results clearly corroborate (i) the complementarity of text and visual descriptors
      and (ii) the effectiveness of boolean queries, suggesting promising future research directions.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information
Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Manage-
ment]: Languages—Query Languages

General Terms
Measurement, Performance, Experimentation.

Keywords
Image retrieval, Reranking, Support Vector Machines, Hybrid Text and Image Search.


1 Introduction
Stimulated by the exponential growth of multimedia contents, the interest in document indexing and re-
trieval has steadily increased in recent years. Although text based retrieval has been studied for several
decades [26], many problems remain unsolved [20]. Recently, text-based retrieval has been applied successfully to
internet search [17], usually relying on statistical text processing and tag collections. Content based
image retrieval also developed rapidly in the last decade, motivated by the growing amount of image and
video collections and by the need to organize, share and search those contents effectively and efficiently
[28, 4]. Since neither world (text or image indexing and retrieval) offers a satisfactory bridge to solve the in-
famous semantic gap, more and more search engines employ hybrid text and visual representations in order
to describe and search multimedia databases. In this context, the ImageClef challenge offers an excellent
benchmark to test and compare several state of the art algorithms proposed by the different participants.¹
    ImageClefphoto 2008 focuses on promoting diversity in the results produced by search engines, i.e.,
those which reduce the number of redundant elements in their output are preferred. As we shall see in §2,
the query topics are highly semantic, so individual text or visual descriptors are not expected to offer satis-
factory results. In this work, we investigate several combination strategies for text and visual descriptors
as well as several algorithms for reducing the redundancy in the results returned by the system. In order to
describe the textual content we use two representations: (1) a set of manually constructed boolean queries
and (2) a set of automatically extracted vector representations based on dictionary filtering and dimension-
ality reduction. We also use several global image descriptors [19, 4], which even though less performant
than local ones [24], have the advantage of being generic and computationally cheap (see §3).
    We compare hybrid combination by concatenation and re-ranking, using both the Query By Example
(QBE) paradigm and SVM learning. In order to eliminate the redundancy in the final results we employ
two strategies: a modified version of Quality Thresholding algorithm and a maxmin exploration strategy.
Our results show that combining the visual search results (even when using a small number of examples,
e.g. three in the challenge) with the textual results significantly improves the overall performance. Clearly,
the information provided by the two modalities is complementary. Furthermore, the manually prepared
boolean queries provide a noticeable and consistent gain suggesting that semantic parsing and automatic
extraction of boolean representations is a promising research direction.
    The paper is organized as follows. We start by a short presentation of the ImageClef Photo Retrieval
Task (§2). Then, we describe our visual descriptors and the retrieval results obtained using image descrip-
tors only (§3). Manual queries are described in §4.1 while in §4.2 we present our automatic text feature
extraction from raw data and their combination with the visual descriptors. In §5 we describe our diversi-
fication algorithm; we end the paper with a discussion and concluding remarks.


2 ImageCLEF Photo Retrieval Task
In this section we briefly describe the ImageClefphoto retrieval challenge, a description that we use later
to motivate our choices and to discuss various results. This year, ImageClefphoto focuses on promoting
diversity in the results produced by search engines, i.e. those which eliminate (near-)duplicate documents
are likely to produce a higher percentage of meaningful results. The performance measures used in the
challenge are the precision at 20 (P20) defined as
$$P20 = \frac{\mathrm{Relevant}(20)}{20} \qquad (1)$$

and the cluster-recall at 20 (CR20), or sub-topic recall:

$$CR20 = \frac{\left|\bigcup_{i=1}^{20} S(d_i)\right|}{N_S} \qquad (2)$$

where, for a given topic, the document $d_i$ is relevant for the set of sub-topics $S(d_i)$ and $N_S$ is the total
number of sub-topics [31]. The submitted runs were ranked according to the average of these two measures.
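    For concreteness, here is a minimal Python sketch of these two measures; the data structures (lists of document ids, a map from documents to their sub-topics) are our own illustrative assumptions, not the official evaluation tool.

```python
def precision_at_20(ranked_ids, relevant_ids):
    """P20: fraction of relevant documents among the first 20 results (Eq. 1)."""
    return sum(1 for d in ranked_ids[:20] if d in relevant_ids) / 20.0

def cluster_recall_at_20(ranked_ids, subtopics_of, n_subtopics):
    """CR20: fraction of sub-topics covered by the first 20 results (Eq. 2).

    `subtopics_of` maps a document id to the set of sub-topics it is relevant
    for (an empty set for irrelevant documents); `n_subtopics` is N_S.
    """
    covered = set()
    for d in ranked_ids[:20]:
        covered |= subtopics_of.get(d, set())
    return len(covered) / float(n_subtopics)
```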
   The benchmark uses the IAPR-TC12 [12] data collection, which consists of 20,000 images associated
with a set of XML annotations, including text descriptions, location tags, title and date. The challenge pro-
poses 39 query topics, also available in XML format. Each topic is defined by a narrative block indicating
  1 25 research teams submitted runs to the ImageClefphoto 2008 challenge
            Figure 1: IAPR-TC12 database entry description (left) and query topic format (right).


the search target, a diversity criterion and a set of three image queries (Fig. 1 shows one topic description
and a database entry). A short analysis of these data reveals that the topics combine highly semantic con-
cepts and complex logical structures. For each topic, three images are also supplied in order to enhance
understanding and to help formulate visual queries, but they are not intended as a replacement for the text
descriptions. Indeed, due to the complex semantic definitions of the topics, image queries are unlikely to
retrieve all relevant results. Thus, image examples are perhaps best used to enhance the results through
their combination with the text description, and this is the approach we adopted for this challenge.
    Different acronyms were used by participants and stand for the following runs and query types: IMG
(image), TXT (text) and TXTIMG (combination text/image), MAN (manual) and AUTO (automatic).


3 Visual Content Description and Search
In this section we motivate our choice of image descriptors for the ImageClefphoto retrieval use case (see
§3.1 and §3.2). We then present, in §3.3, our querying paradigms and we show in §3.4 some performance
measurements, including a comparison with other ImageClef runs. We end this section with a brief discus-
sion of the limits of visual queries.

3.1 Motivation
Extracting visual features that capture high level semantic content of an image is a difficult task. Indeed,
the main challenge for content based retrieval systems is the infamous “semantic gap”, i.e., the discrepancy
between the low level features used to represent the visual content and the high level concepts
expected by the users [4].
    For some domain specific databases and applications, such as browsing fingerprint or face images, there
exists enough a priori knowledge of the image content to be able to propose more accurate mathematical
models. However, for generic databases, the task becomes much harder since there is no perfect and
unique description of the visual content that agrees with the semantics². Therefore, many systems rely
on holistic combined image descriptions such as color, texture and shape [4, 18, 10] in order to search for
target images. This approach, even though ad-hoc, has been successfully applied for unstructured image
databases (which lack text descriptions) through an interactive description of concepts using relevance
feedback and machine learning techniques [33].
    Recent object recognition approaches rely on local descriptors (for example, image regions or interest
points) in order to describe more precisely the visual content [24]. While local descriptors have many de-
sirable properties, such as stability and invariance to common geometric and photometric transformations
[23, 32], they are resource (time and memory) demanding and cannot be easily extended to large-scale
search engines. Moreover, although these algorithms can be tuned to perform well for some object cate-
  ² This is due to the lack of consensus about the underlying semantics, which usually depends on the context (even humans disagree when interpreting images).
gories, they are less adapted to pictures involving deformable objects or context dependent scenes that are
difficult to describe using individual rigid objects (for example, emotional states or esthetic impressions.)
    Since ImageClefphoto query topics are very semantic and sometimes expressed as a combination of
several concepts, they are not easily translated in terms of low level visual features. The large number of
topics and concepts involved in their definition rules out the possibility of using specific visual models for
each concept. Instead, we use global image descriptors as described below. While being less performant
compared to local descriptions, they are more appropriate for our use case as (1) they have small memory
footprint and thus fit into standard PCs without any specific storage requirements; and (2) they are very
fast to compute as they involve simple distance measure operations, guaranteeing real time responses.
Furthermore, as they do not include any a priori object model, they can be applied to any target category.
Indeed, global descriptors have been shown to perform well in this framework, for example with SVM-
based machine learning [33].

3.2 Global Image Descriptors
As described in §3.1, we use global image descriptors in order to represent the visual content of images.
More precisely, we use a combination of color, texture and shape features, as described below.
Color histograms: they provide a summary description of the color information but ignore spatial corre-
lations between colors; thus, pixels having the same color distribution may not be similar in the context of
their spatial neighborhoods [30], [1]. Alternatively, given an image of size M × N , we weight each color
instance by a measure related to its local context:
$$h(c) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} w(x,y)\,\delta(f(x,y) - c)$$

where $h(c)$ is the frequency of color $c$ and $w(x,y)$ is a pixel-based weighting function. We use
$w(x,y) = \|\Delta(x,y)\|^2$, the squared norm of the Laplacian at pixel $(x,y)$, to emphasize corners and edges
in the image, and the local color frequency to emphasize non-uniform regions.
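As a concrete illustration of this weighted histogram, here is a minimal Python sketch. The coarse color quantization, the grey-level Laplacian and the final normalization are our own assumptions for readability, not necessarily the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import laplace

def weighted_color_histogram(image, n_bins=64):
    """Laplacian-weighted color histogram (illustrative sketch).

    `image` is an H x W x 3 uint8 array. Colors are coarsely quantized into
    `n_bins` = 4 x 4 x 4 cells; each pixel votes with a weight proportional to
    the squared Laplacian of its intensity, emphasizing edges and corners.
    """
    gray = image.mean(axis=2)
    weights = laplace(gray) ** 2                       # w(x, y) = ||Delta(x, y)||^2
    q = (image // 64).astype(np.int32)                 # 4 levels per channel
    labels = q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]
    hist = np.bincount(labels.ravel(), weights=weights.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-12)                 # normalized histogram
```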
Texture features: we use the power spectral density distribution in the complex plane. This has been
shown to perform well when combined with color and shape histograms [19]. Roughly, a high energy
spectrum concentrated at low frequencies highlights large-scale information in an image, while high fre-
quencies correspond to textured regions (small scale details).
Shape features: in order to describe the shape content of an image we use standard edge orientation
histograms. First, edges are extracted from images, then the gradient is computed using only the edge
pixels. The orientation of the gradient is quantized with respect to the angle, resulting in a histogram that is
sensitive to the general flow of lines in the image [14]. More details on image descriptors can be found in
[7].
     Visual feature vectors are combined by concatenation and then, in order to reduce resource requirements
and to avoid the curse of dimensionality, we apply linear PCA [16] and keep the 100 largest principal
components (which preserve 95% of the energy of the signal).
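A possible realization of this concatenation and reduction step, sketched with scikit-learn; the function and array names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_visual_index(color_hists, texture_feats, shape_hists, n_components=100):
    """Concatenate the per-image descriptors (one row per image) and reduce
    the resulting vectors to `n_components` dimensions with linear PCA."""
    X = np.hstack([color_hists, texture_feats, shape_hists])
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X)
    print("preserved variance:", pca.explained_variance_ratio_.sum())
    return pca, X_reduced
```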

3.3 Querying Paradigms
For each topic, we use the three query images combined with two different paradigms: (i) similarity search
by minimum distance and (ii) SVM filtering.
MinDistance Query: let $B$ be the database and let $Q = \{q_1, q_2, q_3\}$ denote the query, where $q_1$, $q_2$ and $q_3$
are the three images of the query topic. We extend the “query by example” search paradigm to multiple
inputs by introducing a composed measure of dissimilarity between an image $x \in B$ and $Q$:

$$d(x, Q) = \min_{i=1,2,3} d(x, q_i) \qquad (3)$$

This definition naturally follows from the fact that images in B close to one of the three query images
{qi } are more likely to belong to the same topic. Instead of taking the min-value, one can consider a
convex combination of distances; nevertheless, if the $\{q_i\}$ are far apart in the description space, the resulting
dissimilarity measure becomes poorly correlated with the semantics captured by the query topics.

Figure 2: Comparative performance of two-class SVM, one-class SVM and similarity retrieval with MinDist (precision as a function of the number K of returned results).
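As a concrete illustration, a minimal sketch of the MinDist ranking of Eq. 3; the use of the Euclidean distance on the PCA-reduced descriptors is our assumption.

```python
import numpy as np

def min_dist_ranking(database, queries):
    """Rank database images by their minimum distance to the query images.

    `database` is an (N, d) array of descriptors and `queries` a (3, d) array
    holding the descriptors of the three topic images.
    """
    # dists[i, j] = ||database[i] - queries[j]||
    dists = np.linalg.norm(database[:, None, :] - queries[None, :, :], axis=2)
    scores = dists.min(axis=1)            # d(x, Q) = min_i d(x, q_i)
    return np.argsort(scores)             # smallest dissimilarity first
```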
SVM Based Querying: Support Vector Machines (SVM) [27] have been applied with success to a great
wealth of practical problems since their inception in the early ’90s by Vapnik [29]. In image retrieval, they
provide state of the art results in many recognition tasks [4] and relevance feedback [33]. They use a linear
combination of kernels as a decision function:
$$f_{\mathrm{SVM}}(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i), \qquad (4)$$

where $K(\cdot,\cdot)$ is a positive definite kernel, $\{\alpha_i\}$ are the signed Lagrange multipliers and $\{x_i : \alpha_i \neq 0\}$ are
known as the support vectors (see [27, 29] for details).
    We trained our SVMs using Q = {q1 , q2 , q3 } as positive examples. One may use one class SVMs for
training, but their generalization performances were reported to be sub-optimal (with respect to standard
SVMs) for image retrieval [4, 6] mainly when the positive classes contain few examples. In practice, we
used, for each topic, standard two class SVMs trained on Q and random subsets of 10 negative examples
taken from B. As the size of B is 20,000, it is unlikely that some relevant images belong to these negative
random subsets. While simple, this procedure proved to be effective in practice and produced better results
compared to one class SVMs.
    Once each topic's SVM is learnt³, the system ranks the database according to the score given by Eq. 4
and returns the most positive examples. In all these experiments, we use the Laplacian kernel defined as
$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|)$. This kernel was advocated in [3] for histogram-based image description
and proved to provide better results than the usual Gaussian kernel for relevance feedback tasks [15].
Notice that the dominant term in its Taylor expansion corresponds to the triangular kernel, $K(x_i, x_j) =
-\gamma \|x_i - x_j\|$, which is proved to be scale invariant with respect to the distribution of the data in the
description space [8]. In practice, we found that a good setting of γ is 1. For consistency and comparison
issues, we fix in all our experiments the kernel and its parameters. Of course, our results could be improved
by fine tuning the kernel parameters or by exploring other kernels, but this is not in the scope of this work.
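The sketch below re-implements this querying scheme with scikit-learn and a precomputed Laplacian kernel; the paper used LIBSVM directly, so the code is an illustrative approximation of the procedure rather than the original implementation.

```python
import numpy as np
from sklearn.svm import SVC

def laplacian_kernel(A, B, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||) for all pairs of rows of A and B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return np.exp(-gamma * d)

def svm_topic_ranking(database, query_images, n_negatives=10, seed=0):
    """Train a two-class SVM on the 3 query images against random negatives,
    then rank the whole database by its decision values (Eq. 4)."""
    rng = np.random.default_rng(seed)
    negatives = database[rng.choice(len(database), n_negatives, replace=False)]
    X_train = np.vstack([query_images, negatives])
    y_train = np.array([1] * len(query_images) + [0] * n_negatives)

    clf = SVC(kernel="precomputed")
    clf.fit(laplacian_kernel(X_train, X_train), y_train)

    scores = clf.decision_function(laplacian_kernel(database, X_train))
    return np.argsort(-scores)            # most positive examples first
```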

3.4 Performance
In Fig. 2 we present the average precision over all topics as a function of the number of results sent by the
query engine. At this stage, we draw the following conclusions:
  3 In our experiments we used the well known LIBSVM package [2], see http://www.csie.ntu.edu.tw/˜cjlin/libsvm
Figure 3: “Bird flying” search topic: comparative results for similarity retrieval by MinDist (left) and SVM
retrieval (right).


   1. As expected, the SVM query procedure described above outperforms all the other paradigms (a gain
      of almost 10% when compared to QBE MinDist and One-class SVM).
   2. One-class SVM performs about the same as the QBE MinDist. This can be explained by the fact
      that the number of learning examples is very small (only three) and thus, not enough to capture the
      complexity of the target topic.
    These runs do not include diversity and were not submitted to the ImageCLEF Photo Retrieval Task (see §5). The
two IMG runs we submitted (using only visual features) were ranked 2nd and 3rd, close behind the 1st ranked run
on the combined P20/CR20 measure (see §5). In terms of the P20 measure (Eq. 1), our
two runs were ranked 4th and 6th, which shows that the diversification algorithms, although lowering the
P20 measure, improved the overall performance. As an illustration, Fig. 3 shows the difference between
SVM and direct MinDist similarity retrieval using the “birds flying” topic. As expected, SVM uniformly
provides more consistent results.

3.5 Limits of the Visual Search
Motivated by a wealth of practical applications, image retrieval by visual content has become a rapidly
evolving research field, although breakthrough advances are still rare. State of the art results are far from
satisfactory and search by image descriptors alone is unlikely to offer complete satisfaction for most practi-
cal tasks [4]. This state of progress clearly motivates the use of hybrid descriptions, for example by combin-
ing visual features with text and other available media (sound, music, meta-tags, etc.) Meanwhile, recent
trends in research suggest that machine learning based search methods with relevance feedback provide
excellent results for many tasks. In the following sections we describe our approach for ImageClefphoto
challenge by using manually prepared boolean queries (§4.1) and combined text and image content repre-
sentations (§4.2).


4 Hybrid Document Search
As mentioned in §3, using only visual descriptors is not enough to provide satisfactory results in this
challenge. Our goal is to measure the gain when combining text and image features; in §4.1 we describe
boolean queries and their combination with visual descriptors while in §4.2 we present our text descriptor
based on dictionary filtering and dimensionality reduction. Both approaches are illustrated by examples of
actual queries; we also present different results from the challenge.
                     Figure 4: Some examples of boolean retrieval (see the text for details).

4.1 Boolean Queries
A short analysis of the query topics reveals complex and highly semantic concepts involving several objects
and relations. One possible way to represent these topics in a principled way is to use boolean queries. Let
us consider the 3rd topic (“religious statue in the foreground”):
      Relevant images will show a statue of one (or more) religious figures such as gods, angels, prophets etc.
      from any kind of religion in the foreground. Non-religious statues like war memorials or monuments are
      not relevant. Images with statues that are not the focus of the image (like the front view of church with
      many small statues) are not relevant. The statues of Easter Island are not relevant as they do not have any
      religious background.
  This can naturally be expressed as a boolean expression over concepts: “(religious
AND statue) AND NOT (memorial OR monument OR war) AND NOT (LOCATION ’Easter Island’)”. Boolean
retrieval relies on the use of logical operators where the terms in queries are linked together using an algebra
of simple operations (including AND, OR and NOT). Nevertheless, automatic extraction of boolean queries
from raw text is known to be a difficult and still unsolved task [11, 20]. Hence, we choose to manually build
these expressions from the raw text. This approach, which is very popular in Internet search, has emerged
as one of the easy-to-use standards in text retrieval.

4.1.1 Query Construction
We first introduce a small querying language adapted for the ImageClefphoto challenge. We use AND, OR
and NOT to connect terms and we use a “LOCATION” specifier that makes it possible to filter documents
by their location (using tags such as “country” or “city”). This information can easily be extracted from the
XML document descriptions.
    For some topics, we filter and tag (as “BW”: black and white) images depending on their grey level
information. This is implemented using the saturation component in the HSV color space. For instance,
the query “church AND BW AND NOT LOCATION France” seeks grey level pictures of churches not located
in France.
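    A rough sketch of how such a "BW" tag can be derived from the saturation channel; the use of OpenCV and the threshold value are our assumptions and would need tuning on the collection.

```python
import cv2

def is_black_and_white(image_bgr, saturation_threshold=25):
    """Tag an image as grey-level ("BW") when its mean HSV saturation is low."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    return hsv[:, :, 1].mean() < saturation_threshold
```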
    Queries are created using a web interface. Using the raw text, the user interactively formulates boolean
queries for different topics. This procedure is similar to relevance feedback as the user iteratively updates
the boolean queries until the results are satisfactory. Fig. 4 shows some results: (left) topic #2 “((church
OR cathedral OR mosque) AND (towers) AND (three OR four OR five))”, (middle) topic #11 “((LOCATION
Russia) AND (BW))” and (right) topic #2 “(lighthouse AND (water OR sea OR ocean))”.

4.1.2 Combination with Visual Descriptors
Boolean queries described earlier return a set of candidate relevant images which are not scored (and hence
unranked) so it is not possible, at this stage, to use merging techniques in order to produce combined
text/image ranks. Instead, we intersect the results of image and text queries. Notice that topics are very
complex and difficult to express exactly using only boolean queries. The latter may produce both relevant
and irrelevant results, which are filtered using visual search⁴.
  4 In practice, visual queries return 1000 documents which are intersected with the results of text and scored using visual distances.
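A minimal sketch of this combination step, under our own assumptions about how the two result sets are represented:

```python
def combine_boolean_and_visual(boolean_ids, visual_ranking, max_visual=1000):
    """Keep the visually ranked documents that also satisfy the boolean query.

    `boolean_ids` is the (unranked) set of ids returned by the boolean query;
    `visual_ranking` is a list of ids ordered by visual similarity to the topic.
    """
    candidates = set(boolean_ids)
    return [d for d in visual_ranking[:max_visual] if d in candidates]
```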
4.1.3 Experiments
In our experiments, we extract the raw text, the LOCATION field from XML documents and the “BW” tag
from the underlying images. Using this information and the boolean queries, the search engine returns
the results in real time. Notice again that the user is involved only in the construction of the boolean queries,
so all further steps are fully automatic. Fig. 5 shows the evaluation of different querying schemes (text,
image and combined text/image).
    We clearly see that boolean queries outperform visual searches. This is predictable since the former are
manually and interactively constructed while visual search is fully automatic. We also see from the results
that combining visual and boolean searches boosts the precision by ∼20%: even though limited, the
information provided by the visual descriptors is always useful. This result is slightly unexpected given
the small number of positive examples (only three) and suggests that the visual descriptors provide
information complementary to the boolean queries. We noticed that the influence of the visual
descriptors decreases as the rank increases: at rank 50, combined text/image performs about the same as
text only. This is due to the small number of examples available to compute the visual similarity, i.e., at
lower ranks the results are less semantically correlated with the topics. Finally, we conclude that hybrid
search always outperforms individual text or image searches.
    The runs we submitted to ImageClefphoto were similar, but they differ in terms of their diversity
algorithms (see §5); they were ranked 1st, 2nd and 3rd in ImageClefphoto.


Figure 5: Precision of boolean retrieval: text and image alone versus combined text and visual queries (precision as a function of K).


4.2 Automatic Queries
In this section, we compare several schemes that combine automatically extracted text descriptors and
visual features. Automatic query description and querying is expected to be less performant than the
manual one. As described below, we consider in these experiments both early and late descriptor merging
techniques.

4.2.1 Text Description
Text indexing and retrieval is a growing field and many existing state of the art techniques provide reason-
ably accurate results [26, 22, 17]. Nevertheless, most of them rely on the estimation of statistical measures
and can only be applied to large datasets. For the IAPR-TC12 database, almost all the terms appear only
once, so these statistical techniques are clearly not applicable. In order to extract meaningful information
from short descriptions, semantic analysis and natural language parsing are necessary; however, these are
known to be difficult and still unsolved problems [21, 20].
    Our goal is to investigate the performance improvement obtained by combining both textual and visual
descriptions, so we adopted a simple vector space model in order to represent the text information. First,
we eliminate the stop words, then we parse the list of terms with the Porter stemmer⁵ [25]. We associate
a coordinate in the feature space with each resulting term. As no relationships are considered between terms,
we only keep the terms that are used in the definition of the query topics. A further step reduces the
dimensionality using linear PCA: we only keep the first 100 principal components, corresponding to 98%
of the total statistical variance of the text data. We measure dissimilarity between documents using the L1
distance (the cosine distance produced similar results). Query topics are described in a similar way but they
are pre-processed in order to eliminate words common to all topics⁶.
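    As an illustration, a simplified version of this text pipeline; the tiny stop-word list, the use of NLTK's Porter stemmer and the exact dictionary filtering are assumptions made for the sake of a self-contained sketch.

```python
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.decomposition import PCA

STOP = {"the", "a", "an", "of", "in", "on", "and", "or", "with", "is", "are"}  # illustrative only
stemmer = PorterStemmer()

def tokenize(text):
    return [stemmer.stem(t) for t in text.lower().split() if t.isalpha() and t not in STOP]

def build_text_descriptors(doc_texts, topic_texts, n_components=100):
    """Term-count vectors restricted to the topic vocabulary, then linear PCA."""
    vocab = sorted({t for txt in topic_texts for t in tokenize(txt)})
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(doc_texts), len(vocab)))
    for row, text in enumerate(doc_texts):
        for t in tokenize(text):
            if t in index:
                X[row, index[t]] += 1.0
    pca = PCA(n_components=min(n_components, len(vocab), len(doc_texts)))
    return pca.fit_transform(X)

def l1_distance(a, b):
    """Dissimilarity between two reduced text descriptors."""
    return np.abs(a - b).sum()
```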

4.2.2 Combination with Visual Descriptors
We use two merging strategies: ’Early merging’ refers to combining descriptors prior to querying while
’late merging’ performs individual text and visual queries prior to combination.
Early merging: we create hybrid descriptors by concatenation (cartesian product space) of text and the
visual feature spaces. To ensure equal contributions in the final feature vector, we normalize individual
features by the mean and the variance computed on the whole dataset.
Late merging: for each document, we first run individual queries (text and visual) obtaining two ranking
lists. Then each document is assigned a final rank based on the MINRANK scheme defined as

$$r(I) = \min(r_V(I), r_T(I)) \qquad (5)$$

where $I \in B$ is an image from the database $B$ and $r_V(I)$ (resp. $r_T(I)$) is its visual (resp. text) rank. The intuition
behind this combination strategy is that the best of the two ranks should be preferred; documents ranked
poorly by both modalities are unlikely to be similar to the query topic.
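A compact sketch of the MINRANK merging of Eq. 5; the list-of-ids representation is our assumption.

```python
def minrank_merge(visual_ranking, text_ranking):
    """Assign each document min(visual rank, text rank) and re-sort (Eq. 5).

    Both inputs are lists of document ids ordered from best to worst.
    """
    r_v = {d: i for i, d in enumerate(visual_ranking)}
    r_t = {d: i for i, d in enumerate(text_ranking)}
    documents = set(r_v) | set(r_t)
    worst = max(len(visual_ranking), len(text_ranking))
    score = {d: min(r_v.get(d, worst), r_t.get(d, worst)) for d in documents}
    return sorted(documents, key=lambda d: score[d])
```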

4.2.3 Experiments
Fig. 6 shows a comparison of the merging techniques. As we see, textual features extracted automatically
significantly outperform the visual ones. We also notice that early combination by concatenation (cartesian
product) produces slightly better results than the MINRANK late combination, but the difference is not
significant. In these experiments, the best results were obtained by applying the MINRANK algorithm
(as described previously) to the hybrid text/image and text-only rankings, giving a ∼10% improvement over
text-only retrieval. Nevertheless, this gain is less significant than that of the boolean queries, which are built
manually (see §4.1).
    Finally, this run ranked 6th in the ImageClefphoto AUTO TXTIMG category and, even though no diversifi-
cation algorithm was used, it ranked 2nd w.r.t. the CR20 measure. This can be explained by the fact
that the MINRANK mixes best ranks from both image and text results and since text and visual features
are independently extracted, some reduction in the redundancy of the results is expected, i.e. the returned
images do not belong to the same sub-categories.


5 Diversity and Clustering
As presented in §2, the measures used to evaluate the runs are precision and cluster-recall on the first 20
results (denoted P20 and CR20, respectively). According to these measures, results should be both relevant and
as diverse as possible. Notice that extra information about the targeted diversity criterion is also provided
for each query topic through a dedicated field in the topic description. In this section, we describe our
diversity schemes and their impact on the P20 and CR20 performance.
  5 http://tartarus.org/˜martin/PorterStemmer
  6 For example, expressions like “relevant images” are used in all topic descriptions.
Figure 6: Precision results for several text and image combination techniques.

5.1 Text Clustering
Some of the diversity criteria required by the query topics can be directly extracted from the XML data.
Each document contains tags such as “city”, “state” and “location”, which can be used to filter and
group documents. In order to keep a good balance between precision and diversity,
we start with a larger selection of 40 results and cluster them according to the specified location tags. The
output is formed by taking the 1st element of each cluster, then the 2nd, etc.
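A sketch of this text-based diversification (grouping by location tag, then interleaving the clusters); how the tags are accessed is an assumption about the parsed XML.

```python
from itertools import zip_longest

def diversify_by_location(ranked_docs, location_of, window=40):
    """Group the top `window` results by location tag, then emit the 1st
    element of each cluster, the 2nd of each cluster, and so on."""
    clusters = {}                                    # insertion order = rank order
    for d in ranked_docs[:window]:
        clusters.setdefault(location_of.get(d, "unknown"), []).append(d)
    output = []
    for layer in zip_longest(*clusters.values()):
        output.extend(d for d in layer if d is not None)
    return output
```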

5.2 Visual Diversification
Visual clustering is applied when the diversity criterion cannot be extracted from the XML tags associated
with the images. We consider two types of “visual diversity” algorithms, described below.

5.2.1 MAXMIN Diversity
The MAXMIN diversity algorithm is based on the maximization of the minimum distance of a given docu-
ment with respect to the (so far) selected results. Our algorithm starts by choosing the document with the
best rank; it then iteratively chooses the next document as the one which maximizes its minimum distance
to the documents already selected. More precisely, let $S$ be the candidate set (the initial window of size 40
in our case) and suppose that $C \subset S$ is the set of already selected examples. Then, the next example is chosen as:

$$x = \arg\max_{x_k \in S \setminus C} \; \min_{x_i \in C} d(x_k, x_i) \qquad (6)$$

This procedure does not produce a clustering, but rather a permutation of the initial selection such that its
prefixes are highly diversified, according to Eq. 6, with respect to the distance defined on the description space.
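A direct sketch of the MAXMIN permutation of Eq. 6 (Euclidean distance assumed):

```python
import numpy as np

def maxmin_diversify(candidates, features):
    """Reorder the (already ranked) `candidates` so that each new element
    maximizes its minimum distance to the elements already selected (Eq. 6).

    `features` maps a document id to its descriptor vector.
    """
    selected = [candidates[0]]                       # start with the best-ranked document
    remaining = list(candidates[1:])
    while remaining:
        def min_dist(d):
            return min(np.linalg.norm(features[d] - features[s]) for s in selected)
        best = max(remaining, key=min_dist)
        selected.append(best)
        remaining.remove(best)
    return selected
```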

5.2.2 QT Clustering
Our second “diversification” method is based on visual clustering. We tested several standard clustering
techniques, such as Fuzzy-K-Means [5] and Competitive Agglomeration [9], but we obtained many clusters
of large size and highly diverse semantic content. To control the size of the generated clusters, we developed
a variant of the Quality Threshold algorithm [13], described below.
    Let $s$ denote the cluster size, defined as $s = N/n_C$, where $N$ and $n_C$ are respectively the number of
images and the expected (fixed) number of clusters. The algorithm builds the clusters iteratively. First, the
center of the new cluster is chosen by minimizing the following criterion:

$$x_t = \arg\min_{x_i \in S_t} R\left(\mathrm{KNN}_s(x_i; S_t)\right) \qquad (7)$$

where $\mathrm{KNN}_s(x_i; S_t)$ denotes the set of $s$ nearest neighbors of $x_i$ in $S_t$ ($S_1 = \{x_1, \ldots, x_N\}$) and $R$ the
radius of the smallest sphere enclosing $\mathrm{KNN}_s(x_i; S_t)$. The new cluster is built as $C_t = \mathrm{KNN}_s(x_t; S_t)$
and then removed from the remaining data, i.e. $S_{t+1} = S_t \setminus C_t$. As for the text clustering, the final output is
formed by taking the 1st element of each cluster, then the 2nd, etc., until all elements are exhausted.
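A sketch of this QT-like variant; the enclosing-sphere radius of Eq. 7 is approximated here by the distance to the farthest of the s nearest neighbours, which is our simplification.

```python
import numpy as np

def qt_like_clustering(features, n_clusters):
    """Iteratively peel off clusters of roughly s = N / n_clusters points, each
    built around the point whose s nearest neighbours are most tightly packed."""
    remaining = list(range(len(features)))
    s = max(1, len(remaining) // n_clusters)
    clusters = []
    while remaining:
        k = min(s, len(remaining))
        X = features[remaining]
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        radii = np.sort(dists, axis=1)[:, k - 1]      # distance to the k-th nearest (incl. self)
        center = int(np.argmin(radii))                # tightest neighbourhood -> new cluster center
        members = np.argsort(dists[center])[:k]       # the center and its k-1 nearest neighbours
        clusters.append([remaining[i] for i in members])
        member_set = set(int(i) for i in members)
        remaining = [remaining[i] for i in range(len(remaining)) if i not in member_set]
    return clusters
```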

5.3 Experiments
We submitted several runs to ImageClefphoto including the diversity settings described above. More specif-
ically, 26 out of the 39 topics have “cluster tags” related to the location of the pictures; diversity is per-
formed for these topics using text clustering, while for the rest it is visual. We observed that applying a
diversity scheme lowers the P20 measure; however, as the CR20 measure increases, we expect the results
to be more meaningful.
    Visual Retrieval allows us to evaluate the impact of visual diversity (using QT or MAXMIN clus-
tering). As already discussed in §3.4, the best “visual retrieval” performances were achieved using the
two-class SVM (P20=0.248) and without diversity. We submitted two runs to the challenge using QT and
MAXMIN clustering; they were ranked 2nd and 3rd on the combined P20 and CR20 score. Their
P20 measure is, as expected, lower (0.20 and 0.17 respectively) when compared to the non-diversified
results, and this suggests that diversity indeed produced an important CR20 gain.

Figure 7: Hybrid text and image descriptors: the effect of result “diversification” on P(K) (left) and CR(K) (right).

    Combined Text/Visual Search: in Fig. 7 we examine the impact of different diversity algorithms on
the P20 and CR20 measures for the combined text and image descriptors (see §4.2.2). First, we observe a
loss of precision for both QT and MAXMIN algorithms. Nevertheless, the QT algorithm failed to produce
better CR rankings when compared to the case without diversity. For this algorithm, we set the number of
clusters to 20 and the initial selection contains the 40 first ranked results. As the size of different clusters
is very small (i.e., s = 2), many clusters are similar and this affects the quality of the diversification. Moreover,
even for a perfect retrieval, there is simply not enough data per class (fewer than 100 images on average)
to generate a consistent clustering. These results show that the space of query results is better explored and
summarized using MAXMIN than with QT clustering, which suffers from the insufficient amount of data.


6 Conclusion and Perspectives
In this paper we presented our experiments investigating the performance of several combination techniques
for image and text descriptors, on the ImageClefphoto challenge database. We compared early merging of
descriptors by concatenation with late ranking combination obtained by separate queries on text and image
features. We also described two schemes for reducing redundancy in the results returned by our search
engine.
    As a first conclusion, we found that even with very few query images (three in the ImageClefphoto challenge), our
system was able to improve the results significantly. Moreover, the improvement is more significant in the case
of manually prepared boolean queries. This clearly indicates that good quality boolean queries are less
likely to return noisy results with respect to the targeted topic. Automatic extraction of boolean queries
from raw text is hence identified as a research direction worth exploring, for instance using Part-of-Speech
(POS) tagging and language parsing.
    As a second conclusion, we noticed that using a diversification algorithm helped improve the rank-
ing of our submitted runs. This is more noticeable for queries using only visual descriptors (see §3), where
the proposed diversification schemes significantly improved the ranking of our runs (2nd and 3rd). However,
because of the limited size of ground truth classes (less than 100 images per topic), it is not possible, at
this stage, to draw firm conclusions. Indeed, in a real search engine, where topics might be represented by
millions of (possibly similar) images, we expect the obtained clusters to be much more consistent.


Acknowledgements
This work was supported by the French National Research Agency (ANR) under the AVEIR⁷ project,
ANR-06-MDCA-002.


References
 [1] Nozha Boujemaa, Julien Fauqueur, Marin Ferecatu, François Fleuret, Valérie Gouet, Bertrand Le
     Saux, and Hichem Sahbi. Ikona: Interactive generic and specific image retrieval. In Proceedings of
     the International workshop on Multimedia Content-Based Indexing and Retrieval (MMCBIR’2001),
     2001.
 [2] Chih-Chung Chang and Chih-Jen Lin.         LIBSVM: a library for support vector ma-
     chines.    Technical report, National Taiwan University, 2001.  Software available at
     http://www.csie.ntu.edu.tw/˜cjlin/libsvm.
 [3] Olivier Chapelle, P. Haffner, and Vladimir N. Vapnik. Support-vector machines for histogram-based
     image classification. IEEE Transactions on Neural Networks, 10(5):1055–1064, 1999.
 [4] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Wang. Image retrieval: Ideas, influences, and trends
     of the new age. ACM Computing Surveys, 40(2):5:1–60, 2008.
 [5] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley Interscience, 2001.
 [6] M. Ferecatu, N. Boujemaa, and M. Crucianu. Semantic interactive image retrieval combining visual
     and conceptual content description. ACM Multimedia Systems Journal, 13(5–6):309–322, 2008.
 [7] Marin Ferecatu. Image retrieval with active relevance feedback using both visual and keyword-based
     descriptors. PhD thesis, INRIA—University of Versailles Saint Quentin-en-Yvelines, France, 2005.
 [8] François Fleuret and Hichem Sahbi. Scale-invariance of support vector machines based on the tri-
     angular kernel. In 3rd International Workshop on Statistical and Computational Theories of Vision,
     October 2003.
 [9] Hichem Frigui and Raghu Krishnapuram. A robust competitive clustering algorithm with applications
     in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):450–465,
     1999.
[10] Theo Gevers and Arnold W. M. Smeulders. Content-based image retrieval: An overview. In
     G. Medioni and S. B. Kang, editors, Emerging Topics in Computer Vision. Prentice Hall, 2004.
[11] David A. Grossman and Ophir Frieder. Information Retrieval: Algorithms and Heuristics. Springer,
     2004.
  7 http://aveir.lip6.fr
[12] M. Grubinger, P. Clough, H. Muller, and T. Deselaers. The IAPR TC-12 benchmark: A new evalua-
     tion resource for visual information systems. In Proceedings of the International Workshop OntoIm-
     age'2006 Language Resources for Content-Based Image Retrieval, held in conjunction with LREC'06,
     2006.
[13] L.J. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: Identification and analysis of
     coexpressed genes. Genome Research, 9(11):1106–1115, 1999.
[14] A.K. Jain and A. Vailaya. Shape-based retrieval: a case study with trademark image databases.
     Pattern Recognition, 31(9):1369–1390, 1998.
[15] Feng Jing, Mingjing Li, Lei Zhang, Hong-Jiang Zhang, and Bo Zhang. Learning in region-based
     image retrieval. In Proceedings of the IEEE International Symposium on Circuits and Systems, 2003.
[16] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, 2002.
[17] Amy N. Langville and Carl D. Meyer. Google’s PageRank and Beyond: The Science of Search Engine
     Rankings. Princeton University Press, 2006.
[18] M. Lew, N. Sebe, C. Djeraba, and R. Jain. Content-based multimedia information retrieval: State-of-
     the-art and challenges. ACM Transactions on Multimedia Computing, Communication, and Applica-
     tions, 2(1):1–19, 2006.
[19] B.S. Manjunath, P. Salembier, and T. Sikora, editors. Introduction to MPEG-7: Multimedia Content
     Description Interface. Wiley, 2002.
[20] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Re-
     trieval. Cambridge University Press, 2008.
[21] Christopher D. Manning and Hinrich Schuetze. Foundations of Statistical Natural Language Pro-
     cessing. The MIT Press, 1999.
[22] Charles T. Meadow, Bert R. Boyce, Donald H. Kraft, and Carol L Barry. Text Information Retrieval
     Systems. Academic Press, 2007.
[23] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE
     Transactions on Pattern Analysis & Machine Intelligence, 27(10):1615–1630, 2005.

[24] Jean Ponce, Martial Hebert, Cordelia Schmid, and Andrew Zisserman. Towards category-level object
     recognition, volume 4170. Springer, 2006.
[25] M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[26] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill,
     1986.
[27] Bernhard Schölkopf and Alexander Smola. Learning with Kernels. MIT Press, 2002.
[28] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the
     end of the early years. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(12):1349–1380,
     2000.
[29] V. N. Vapnik. Statistical Learning Theory. John Wiley, September 1998.
[30] Constantin Vertan and Nozha Boujemaa. Upgrading color distributions for image retrieval: can we
     do better? In International Conference on Visual Information Systems (Visual2000), November 2000.
[31] C.X. Zhai, W.W. Cohen, and J. Lafferty. Beyond independent relevance: methods and evaluation met-
     rics for subtopic retrieval. In Proceedings of the 26th Annual International ACM SIGIR Conference
     on Research and Development in Information Retrieval, August 2003.
[32] Jianguo Zhang, Marcin Marszałek, Svetlana Lazebnik, and Cordelia Schmid. Local features and ker-
     nels for classification of texture and object categories: a comprehensive study. International Journal
     of Computer Vision, 73(2):213–238, 2007.

[33] Xiang Sean Zhou and Thomas S. Huang. Relevance feedback for image retrieval: a comprehensive
     review. Multimedia Systems, 8(6):536–544, 2003.