Diversity in Photo Retrieval:
Overview of the ImageCLEFPhoto Task 2009
Monica Lestari Paramita, Mark Sanderson and Paul Clough
{m.paramita, m.sanderson, p.d.clough}@sheffield.ac.uk
University of Sheffield, United Kingdom
Abstract
The ImageCLEF Photo Retrieval Task 2009 focused on image retrieval and diversity. A new
collection was utilised in this task consisting of approximately half a million images with English
annotations. Queries were based on analysing search query logs and two different types were
released: one containing information about image clusters; the other without. A total of 19
participants submitted 84 runs. Evaluation, based on Precision at rank 10 and Cluster Recall at
rank 10, showed that participants were able to generate runs of high diversity and relevance.
Findings show that submissions using mixed modalities performed best compared to
those using only concept-based or content-based retrieval methods. The selection of query fields
was also shown to affect retrieval performance. Submissions not using the cluster information
performed worse with respect to diversity than those using this information. This paper
summarises the ImageCLEFPhoto task for 2009.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and
Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages-
Query Languages
General Terms
Measurement, Performance, Experimentation
Keywords
Performance Evaluation, Image Retrieval, Diversity, Clustering
1 Introduction
The ImageCLEFPhoto task is part of the CLEF evaluation campaign; for the past two years its focus has been
promoting diversity within image retrieval. The task originally began in 2003 and has since attracted participants
from many institutions worldwide. For the past three years, ImageCLEFPhoto has used a dataset of 20,000
general photos called the IAPR TC-12 Benchmark. In 2008, we adapted this collection to enable the evaluation
of diversity in image retrieval results. We recognised that this setup had limitations and therefore moved to using
a larger and more realistic collection of photos (and associated search query logs) from Belga1, a Belgian press
agency. Even though photos in this collection have English-only annotations and hence provide little challenge
to cross-language information retrieval systems, there are other characteristics of the dataset which provide new
challenges to participating groups (explained in Section 1.1). The resources created for the 2009 task have given
us the opportunity to study diversity for image retrieval in more depth.
1.1 Evaluation Scenario
Given a set of information needs (topics), participants were tasked with finding not only relevant images, but
also generating ranked lists that promote diversity. To make the task harder, we released two types of queries:
the first type of query included written information about the specific requirement for diversity (represented as
clusters); queries of the second type contained a more conventional title and example relevant images. In the
former type of query participants were required to retrieve diverse results with some indication of what types of
clusters were being sought; in the latter type of query little evidence was given for what kind of diversity was
required. Evaluation rewarded runs that presented diverse results without sacrificing precision over those
exhibiting less diversity.
1 Belga Press Agency: http://www.belga.be
1.2 Evaluation Objectives for 2009
The Photo Retrieval task in 2009 focused on studying diversity further. Using resources from Belga, we
provided a much larger collection, containing just under half a million images, compared to 20,000 images
provided in 2008. We also obtained statistics on popular queries submitted to the Belga website in 2008 [1],
which we exploited to create representative queries for this diversity task. We experimented with different ways
of specifying the need for diversity to participants, and this year decided to release half of the queries without
any indication of the diversity required or expected. We were interested in addressing the following
research questions:
• Can results be diverse without sacrificing relevance?
• How much will knowing about query clusters a priori help increase diversity in image search results?
• Which approaches should be used to maximize diversity and relevance for image search results?
These research questions will be discussed further in Section 4.
2 Evaluation Framework
One of the major challenges for participants of the 2009 ImageCLEFPhoto task was a new collection which was
25 times larger than that used for 2008. Query creation was based completely on query log data, which helped to
make the retrieval scenario as realistic as possible [2]. We believe this new collection will provide a framework
in which to conduct a more thorough analysis of diversity in image retrieval.
2.1 Document Collection
The collection consists of 498,920 images with English-only annotations (i.e. captions) describing the content of
the image. However, unlike the structured annotations of 2008, the annotations in this collection are
presented in an unstructured way (Table 1). This increases the challenge for participants, as they must
automatically extract information about the location, date, photographic source, etc. of the image as part of the
indexing and retrieval process. The photos cover a wide-ranging time period, and there are many cases where
pictures have not been orientated correctly, thereby increasing the challenge for content-based retrieval methods.
Table 1. Example image and caption
Annotation:
20090126 - DENDERMONDE, BELGIUM: Lots of people
pictured during a commemoration for the victims of the
knife attack in Sint-Gilles, Dendermonde, Belgium, on
Monday 26 January 2009. Last friday 20-Year old Kim De
Gelder killed three people, one adult and two childs, in a
knife attack at the children's day care center "Fabeltjesland"
in Dendermonde. BELGA PHOTO BENOIT DOPPAGNE
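To illustrate the kind of extraction required, the following is a minimal sketch (our own illustration, not part of any participant's system) that uses simple regular expressions to pull the date, location and photographer credit from a caption following the pattern of the example above; real Belga captions may deviate from this pattern, so every field is treated as optional.

```python
import re

# Truncated version of the Table 1 caption, for illustration only.
CAPTION = ("20090126 - DENDERMONDE, BELGIUM: Lots of people pictured during a "
           "commemoration for the victims of the knife attack in Sint-Gilles, "
           "Dendermonde, Belgium, on Monday 26 January 2009. "
           "BELGA PHOTO BENOIT DOPPAGNE")

def parse_caption(caption: str) -> dict:
    """Best-effort extraction of date, location and photographer from a caption.

    Assumes the 'YYYYMMDD - PLACE, COUNTRY: text ... BELGA PHOTO NAME' layout
    seen in the Table 1 example; any field that does not match stays None.
    """
    fields = {"date": None, "location": None, "photographer": None}
    header = re.match(r"(\d{8})\s*-\s*([^:]+):", caption)
    if header:
        fields["date"] = header.group(1)              # e.g. '20090126'
        fields["location"] = header.group(2).strip()  # e.g. 'DENDERMONDE, BELGIUM'
    credit = re.search(r"BELGA PHOTO\s+(.+?)\s*$", caption)
    if credit:
        fields["photographer"] = credit.group(1)      # e.g. 'BENOIT DOPPAGNE'
    return fields

print(parse_caption(CAPTION))
```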
2.2 Query Topics
Based on search query logs from Belga, 50 example topics were generated and released as two query types (as
mentioned previously). From this set, we randomly chose 25 queries to be released with information including
the title, cluster titles, cluster descriptions and example images, as shown in Table 2. We refer to these queries as
Query Part 1. In this example, the topic ‘clinton’ requires three different clusters: ‘hillary clinton’,
‘obama clinton’ and ‘bill clinton’. Results covering other aspects of “Clinton”,
such as Chelsea Clinton or Clinton Cards, will not be counted towards the final diversity score. More
information about these clusters and the method used to produce them can be found in [2].
One might argue that producing diverse results for Query Part 1 is relatively easy, since detailed information
about the different sub-topics is provided as part of the query topic; in practice, however, there are often
instances when little or no query log information is available to indicate possible clusters. We therefore also
released 25 queries containing no information about the kind of diversity expected (referred to as Query Part 2).
An example of this query type is given in Table 3. It should be noted that the cluster titles and descriptions for
these queries were also created from Belga’s query logs; however, we did not release this information to the
participants.
Table 2. Example of Query Part 1
Topic number: 12
Title: clinton
Cluster 1: hillary clinton
  Description: Relevant images show photographs of Hillary Clinton. Images of Hillary with other people are relevant if she is shown in the foreground. Images of her in the background are irrelevant.
  Example image: belga26/05859430.jpg
Cluster 2: obama clinton
  Description: Relevant images show photographs of Obama and Clinton. Images of those two with other people are relevant if they are shown in the foreground. Images of them in the background are irrelevant.
  Example image: belga28/06019914.jpg
Cluster 3: bill clinton
  Description: Relevant images show photographs of Bill Clinton. Images of Bill with other people are relevant if he is shown in the foreground. Images of him in the background are irrelevant.
  Example image: belga44/00085275.jpg
Table 3. Example of Query Part 2
Topic number: 26
Title: obama
Example images: belga30/06098170.jpg, belga28/06019914.jpg, belga30/06107499.jpg
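As a rough illustration of the difference between the two query types, the sketch below models a topic in the way just described; this is our own illustrative representation based on Tables 2 and 3, not the official topic format distributed to participants.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cluster:
    title: str           # e.g. 'hillary clinton'
    description: str     # narrative defining which images are relevant
    example_image: str   # e.g. 'belga26/05859430.jpg'

@dataclass
class Topic:
    number: int
    title: str
    example_images: List[str] = field(default_factory=list)
    clusters: List[Cluster] = field(default_factory=list)  # empty for Part 2 topics

# Query Part 1 topic: cluster information is released with the query.
topic_12 = Topic(
    number=12,
    title="clinton",
    clusters=[
        Cluster("hillary clinton", "Relevant images show photographs of Hillary Clinton ...",
                "belga26/05859430.jpg"),
        Cluster("obama clinton", "Relevant images show photographs of Obama and Clinton ...",
                "belga28/06019914.jpg"),
        Cluster("bill clinton", "Relevant images show photographs of Bill Clinton ...",
                "belga44/00085275.jpg"),
    ],
)

# Query Part 2 topic: only a title and example images, no diversity hints.
topic_26 = Topic(
    number=26,
    title="obama",
    example_images=["belga30/06098170.jpg", "belga28/06019914.jpg", "belga30/06107499.jpg"],
)
```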
The list of 50 topics used in this collection is given in Table 4. Since Belga is a press agency based in Belgium,
there are a large number of queries which contain the names of Belgian politicians, Belgian football clubs and
members of the Belgian royal family. Other queries, however, are more general such as Beckham, Obama, etc.
There are some queries which are very broad and under-specified (e.g. Belgium); others are highly ambiguous
(e.g. Prince and Euro).
Table 4. Overall list of topics used in the 2009 task
Query Part 1 Query Part 2
1 leterme 14 princess** 26 obama* 39 beckham*
2 fortis 15 monaco** 27 anderlecht 40 prince**
3 brussels** 16 queen** 28 mathilde 41 princess mathilde
4 belgium** 17 tom boonen 29 boonen 42 mika*
5 charleroi 18 bulgaria** 30 china** 43 ellen degeneres
6 vandeurzen 19 kim clijsters 31 hellebaut 44 henin
7 gevaert 20 standard 32 nadal 45 arsenal
8 koekelberg 21 princess maxima 33 snow** 46 tennis**
9 daerden 22 club brugge 34 spain** 47 ronaldo*
10 borlee* 23 royals** 35 strike** 48 king**
11 olympic** 24 paola* 36 euro* 49 madonna
12 clinton* 25 mary* 37 paris** 50 chelsea
13 martens* 38 rochus
* = ambiguous, ** = under-specified queries, bold queries: queries with more than 677 (median) relevant
documents
2.3 Relevance Assessments
Relevance assessments were performed using the DIRECT (Distributed Information Retrieval Evaluation
Campaign Tool)2, a system which enables assessors to work in a collaborative environment. We hired 25
assessors for this process, and assessments were divided into two phases. In the first phase, assessors were asked
to identify images relevant to a given query; information about all the clusters relevant to the topic was given to
assessors to ensure they were aware of the scope of relevant images for a query. The number of relevant
images for each query resulting from this stage is shown in Figure 1.
Figure 1. Number of relevant documents per query
Given the different query types shown in Table 4, we then analysed the number of relevant documents per type.
These data, shown in Table 5, illustrate that under-specified queries have the highest average number of relevant
documents.
Table 5. Number of relevant documents in each type
                    All Queries   Ambiguous Queries   Under-Specified Queries   Other Queries
Number of Queries   50            10                  16                        24
Average Docs        697.74        490                 1050.19                   549.33
Min                 2             35                  246                       2
Max                 2210          1052                2210                      1563
Standard Dev        512.16        366.28              459.29                    490.5
After the set of relevant images was identified, in the second stage different assessors were asked to assign
relevant images to each cluster (an image could belong to multiple clusters). Since topics varied widely in content
and diversity, the number of relevant images per cluster varied from 1 to 1,266. Initially, 206 clusters were
created for the 50 queries, but this number dropped to 198 after 8 clusters with no relevant images were removed.
There is an average of 208.49 relevant documents per cluster, with a standard deviation of 280.59. The
distribution of clusters is shown in Figure 2.
2 http://direct.dei.unipd.it
Figure 2. Distribution of clusters in the queries
2.4 Generating the Results
The method for generating results from participants’ submissions was similar to that used in 2008 [3]. The
precision of each run at rank 10 (P@10) was computed using trec_eval, and cluster recall at rank 10 (CR@10)
was used to measure diversity. Since the maximum number of clusters was set to 10 [2], we focused evaluation
on P@10 and CR@10. The F1 score is the harmonic mean of these two measures.
3 Overview of Participation and Submissions
A total of 44 different institutions registered for the ImageCLEFPhoto task (the highest number of applications
ever received for this task). Of these, 19 institutions from 10 different countries submitted runs to the evaluation.
Due to the large number of runs received last year, we limited the number of submitted runs to 5 per participant.
A total of 84 runs were submitted and evaluated (some groups submitted fewer than 5 runs).
3.1 Overview of Submissions
The participating groups for 2009 are listed in Table 8. Of the 24 groups that participated in the 2008 task, 15
returned this year (Returning); four participants joined the task for the first time (New).
Participants were asked to specify the query fields used in their search and the modality of the runs. Query fields
were described as T (Title), CT (Cluster Title), CD (Cluster Description) and I (Image). The modality was
described as TXT (text-based search only), IMG (content-based image search only) or TXT-IMG (both text and
content-based image search). The range of approaches is shown in Tables 6 and 7 and summarised in Figure 3.
Table 6. Choice of query fields
Query Fields Number of Runs
T 17
T-CT-CD-I 15
T-CT 15
T-CT-I 9
T-CT-CD 9
I 8
T-I 7
CT-I 2
CT 2
Table 7. Modality of the runs
Modality TXT-IMG TXT IMG
Number of Runs 36 41 7
Figure 3. Summary of query fields used in submitted runs
Table 8. Participating groups
No Group ID Institution Country Runs Status
1 Alicante University of Alicante Spain 5 Returning
2 Budapest-ACAD Hungarian Academy of Science, Budapest Hungary 5 Returning
3 Chemnitz Computer Science, Trinity College, Dublin Ireland 4 Returning
4 CLAC-Lab Computational Linguistics at Concordia (CLAC) Lab, Concordia University, Montreal Canada 4 Returning
5 CWI Interactive Information Access Netherlands 5 New
6 Daedalus Computer Science Faculty, Daedalus, Madrid Spain 5 Returning
7 Glasgow Multimedia IR, University of Glasgow UK 5 Returning
8 Grenoble Lab. Informatique Grenoble France 4 Returning
9 INAOE Language Tech Mexico 5 Returning
10 InfoComm Institution for InfoComm Research Singapore 5 Returning
11 INRIA LEAR Team France 5 New
12 Jaen Intelligent Systems, University of Jaen Spain 4 Returning
13 Miracle-GSI Intelligent System Group, Daedalus, Madrid Spain 3 Returning
14 Ottawa NLP, AI.I.Cuza U. of IASI Canada 5 Returning
15 Southampton Electronics and Computer Science, University of Southampton UK 4 New
16 UPMC-LIP6 Department of Computer Science, Laboratoire d’Informatique de Paris 6 France 5 Returning
17 USTV-LSIS System and Information Sciences Lab France 2 Returning
18 Wroclaw Wroclaw University of Technology Poland 5 New
19 XEROX-SAS XEROX Research France 4 Returning
4 Results
This section provides an overview of the results based on the type of queries and modalities used to generate the
runs. As mentioned in the previous section, we used P@10 to calculate the fraction of relevant documents in the
top 10 and CR@10 to evaluate diversity, which calculates the proportion of subtopics retrieved in the top 10
documents as shown below:
$$\mathrm{CR@}K \equiv \frac{\left|\bigcup_{i=1}^{K} \mathrm{subtopics}(d_i)\right|}{n_A}$$

where $\mathrm{subtopics}(d_i)$ denotes the set of subtopics (clusters) to which the document at rank $i$ belongs, and $n_A$ is the total number of subtopics for the topic.
The F1 score was used to calculate the harmonic mean of P@10 and CR@10, to enable the results to be sorted by
one single measure:
$$F_1 = \frac{2 \times P@10 \times CR@10}{P@10 + CR@10}$$
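For concreteness, here is a minimal sketch of how these three measures can be computed for a single topic, assuming a ranked list of document identifiers, the set of relevant documents, and a mapping from documents to the clusters they belong to; the official scores were produced with trec_eval and the task's assessment data, so this is only an illustration of the definitions above.

```python
from typing import Dict, List, Set

def precision_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents that are relevant (P@k)."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def cluster_recall_at_k(ranking: List[str],
                        doc_clusters: Dict[str, Set[str]],
                        all_clusters: Set[str],
                        k: int = 10) -> float:
    """Proportion of the topic's clusters covered by the top-k documents (CR@k)."""
    covered: Set[str] = set()
    for doc in ranking[:k]:
        covered |= doc_clusters.get(doc, set())
    return len(covered) / len(all_clusters)

def f1_score(p: float, cr: float) -> float:
    """Harmonic mean of P@10 and CR@10; zero when both measures are zero."""
    return 2 * p * cr / (p + cr) if (p + cr) > 0 else 0.0

# Toy topic with 3 clusters and a ranked list of 10 documents (d1 .. d10).
clusters = {"hillary clinton", "obama clinton", "bill clinton"}
doc_clusters = {"d1": {"hillary clinton"},
                "d3": {"bill clinton"},
                "d7": {"hillary clinton", "obama clinton"}}
relevant = set(doc_clusters)            # relevant = belongs to at least one cluster
ranking = [f"d{i}" for i in range(1, 11)]

p10 = precision_at_k(ranking, relevant)
cr10 = cluster_recall_at_k(ranking, doc_clusters, clusters)
print(p10, cr10, f1_score(p10, cr10))   # 0.3 1.0 0.4615...
```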
4.1 Results across all Queries
The top 10 runs computed across all 50 queries (ranked in descending order of F1 score) are shown in Table 9.
Table 9. Systems with highest F1 score for all queries
No Group Run Name Query Modality P@10 CR@10 F1
1 XEROX-SAS XRCEXKNND T-CT-I TXT-IMG 0.794 0.8239 0.8087
2 XEROX-SAS XRCECLUST T-CT-I TXT-IMG 0.772 0.8177 0.7942
3 XEROX-SAS KNND T-CT-I TXT-IMG 0.8 0.7273 0.7619
4 INRIA LEAR5_TI_TXTIMG T-I TXT-IMG 0.798 0.7289 0.7619
5 INRIA LEAR1_TI_TXTIMG T-I TXT-IMG 0.776 0.7409 0.7580
6 InfoComm LRI2R_TI_TXT T-I TXT 0.848 0.6710 0.7492
7 XEROX-SAS XRCE1 T-CT-I TXT-IMG 0.78 0.7110 0.7439
8 INRIA LEAR2_TI_TXTIMG T-I TXT-IMG 0.772 0.7055 0.7373
9 Southampton SOTON2_T_CT_TXT T-CT TXT 0.824 0.6544 0.7294
10 Southampton SOTON2_T_CT_TXT_IMG T-CT TXT-IMG 0.746 0.7095 0.7273
Looking at the top 10 runs, we observe that the highest effectiveness is achieved using mixed modalities (text and
image) and information from the query title, cluster title and the image content itself. The scores for P@10,
CR@10 and F1 in this year’s task are notably higher than in last year’s evaluation; the number of relevant images
in this year’s task was also higher. Given the two different query types, we analysed how participants dealt with
each. Tables 10 and 11 summarise the top 10 runs for each query type.
Table 10. Systems with highest F1 score for Queries Part 1
No Group Run Name Query Modality P@10 CR@10 F1
1 Southampton SOTON2_T_CT_TXT T-CT TXT 0.868 0.7730 0.8178
2 Southampton SOTON2_T_CT_TXT_IMG T-CT TXT-IMG 0.804 0.8063 0.8052
3 XEROX-SAS KNND T-CT-I TXT-IMG 0.768 0.8289 0.7973
4 XEROX-SAS XRCE1 T-CT-I TXT-IMG 0.768 0.8289 0.7973
5 XEROX-SAS XRCECLUST T-CT-I TXT-IMG 0.768 0.8289 0.7973
6 XEROX-SAS XRCEXKNND T-CT-I TXT-IMG 0.768 0.8289 0.7973
7 Southampton SOTON1_T_CT_TXT T-CT TXT 0.824 0.7470 0.7836
8 InfoComm LRI2R_TCT_TXT T-CT TXT 0.828 0.7329 0.7776
9 Southampton SOTON1_T_CT_TXT_IMG T-CT TXT-IMG 0.76 0.7933 0.7763
10 INRIA LEAR1_TI_TXTIMG T-I TXT-IMG 0.772 0.7779 0.7749
In contrast to the results presented previously, it is interesting to see that the top run for Queries Part 1 used only
text retrieval. Even though its CR@10 score was lower than that of most other runs, it obtained the highest F1
score due to a high P@10 score. The choice of query fields varies across the results, but the top 9 runs
consistently use both the title and cluster title. We therefore conclude that using the title and cluster title does
help participants achieve good scores in both precision and cluster recall.
For Queries Part 2, participants did not have access to cluster information. This was intended to show how well
systems could find diverse results without any hints. The results of the top runs for Queries Part 2 are shown in
Table 11.
Table 11. Systems with highest F1 score for Queries Part 2
No Group Run Name Query Modality P@10 CR@10 F1
1 XEROX-SAS XRCEXKNND T-I TXT-IMG 0.82 0.8189 0.8194
2 XEROX-SAS XRCECLUST T-I TXT-IMG 0.776 0.8066 0.7910
3 InfoComm LRI2R_TI_TXT T-I TXT 0.828 0.6901 0.7528
4 INRIA LEAR5_TI_TXTIMG T-I TXT-IMG 0.756 0.7399 0.7479
5 INRIA LEAR1_TI_TXTIMG T-I TXT-IMG 0.78 0.7039 0.7400
6 GRENOBLE LIG3_TI_TXTIMG* T-I TXT-IMG 0.7708 0.6711 0.7175
7 XEROX-SAS KNND T-I TXT-IMG 0.832 0.6257 0.7143
8 INRIA LEAR2_TI_TXTIMG T-I TXT-IMG 0.728 0.6849 0.7058
9 GRENOBLE LIG4_TCTITXTIMG T-I TXT-IMG 0.792 0.6268 0.6998
10 GLASGOW GLASGOW4 T TXT 0.76 0.6401 0.6949
* submitted results for 24 out of 25 queries. Score shown is the average of the submitted queries only.
The table shows that the top 9 runs use information from the example images, suggesting that example images
and their annotations might have provided useful hints for detecting diversity. To analyse this further, we
divided the runs into those which used the Image field and those which did not, and found average CR@10
scores of 0.5571 and 0.5270 respectively. We conclude that having example images helps systems identify
diversity and present a more diverse set of results.
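As a simple illustration of this grouping (the run names and scores below are invented placeholders, not the actual 2009 submissions), the analysis amounts to splitting the runs on whether the Image field was used and averaging their CR@10 scores:

```python
# Each tuple: (run name, query fields used, CR@10). All values are illustrative.
runs = [
    ("runA", {"T", "I"},       0.62),
    ("runB", {"T"},            0.48),
    ("runC", {"T", "CT", "I"}, 0.58),
    ("runD", {"CT"},           0.51),
]

with_image    = [cr for _, fields, cr in runs if "I" in fields]
without_image = [cr for _, fields, cr in runs if "I" not in fields]

print(sum(with_image) / len(with_image))        # mean CR@10 of runs using the Image field
print(sum(without_image) / len(without_image))  # mean CR@10 of runs not using it
```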
Comparing the CR@10 scores of the top 10 runs for Queries Part 1 and Queries Part 2, the scores of the latter
group were lower, which implies that systems did not find as many diverse results when cluster information was
unavailable. The F1 scores of these top 10 runs were also lower, but only slightly compared to Queries Part 1.
We also calculated the magnitude of the difference between results for the two query types (shown in Table 12).
This indicates that on average runs do perform worse on Query Part 2; however, the difference is small and not
sufficient to conclude that runs will be less diverse when cluster titles are unavailable (p=0.146).
Table 12. Cluster Recall score difference between Queries Part 1 and Queries Part 2
Mean StDev Max Min
-0.0234 0.1454 0.2893 -0.6459
It is important to note that not all runs for Query Part 1 used the cluster title. To analyse how useful the
“Cluster Title” (CT) information is, we divided the runs for Query Part 1 according to their use of the CT field.
The mean and standard deviation of the P@10, CR@10 and F1 scores are shown in Table 13 (the highest score
shown in italics).
Table 13. Comparison of P@10, CR@10 and F1 scores by use of the Cluster Title field
Queries                     Number of Runs   P@10 Mean   P@10 SD   CR@10 Mean   CR@10 SD   F1 Mean   F1 SD
Query Part 1 with CT        52               0.6845      0.2       0.5939       0.1592     0.6249    0.1701
Query Part 1 without CT     32               0.6641      0.2539    0.5006       0.1574     0.5581    0.1962
Query Part 2                84               0.6315      0.2185    0.5415       0.1334     0.5693    0.1729
Table 13 provides further evidence that the Cluster Title field plays an important role in identifying diversity.
When the Cluster Title is not used, the F1 scores for Query Part 1 and Query Part 2 do not differ significantly.
Figure 4 shows a scatter plot of F1 scores for each query type. Using a two-tailed paired t-test, the scores for
Queries Part 1 and Queries Part 2 were found to be significantly different (p=0.02). There is also a significant
correlation between the scores: the Pearson correlation coefficient is 0.691.
We applied the same test comparing only the Query Part 1 runs that used the Cluster Title with the
corresponding runs for Query Part 2, and found that their scores are also significantly different (p=0.003), with a
Pearson correlation coefficient of 0.745. However, when the same evaluation was performed on runs not using
the Cluster Title, the difference in scores was not significant (p=0.053), although the Pearson correlation
coefficient was 0.963.
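The comparison described above can be reproduced with standard tools; the sketch below assumes the per-run F1 scores for the two query parts are available as two aligned lists (the values shown are placeholders, not the actual run scores).

```python
from scipy import stats

# Hypothetical per-run F1 scores, aligned so that index i is the same run
# evaluated on Query Part 1 and on Query Part 2.
f1_part1 = [0.82, 0.80, 0.76, 0.63, 0.55, 0.48, 0.41]
f1_part2 = [0.82, 0.75, 0.71, 0.60, 0.57, 0.44, 0.38]

t_stat, p_value = stats.ttest_rel(f1_part1, f1_part2)   # two-tailed paired t-test
r, r_p = stats.pearsonr(f1_part1, f1_part2)             # Pearson correlation

print(f"paired t-test: t={t_stat:.3f}, p={p_value:.3f}")
print(f"Pearson correlation: r={r:.3f} (p={r_p:.3f})")
```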
Figure 4. Scatter plot for F1 scores of each run between query types
Table 14 summarises the results across all queries (mean scores). According to these results, the highest scores
across the three conditions are obtained when the query has full information about the potential diversity.
Table 14. Summary of results across all queries
Queries          P@10 Mean   P@10 SD   CR@10 Mean   CR@10 SD   F1 Mean   F1 SD
All Queries      0.655       0.2088    0.5467       0.1368     0.5848    0.1659
Query Part 1     0.6768      0.2208    0.5583       0.1641     0.5995    0.1823
Query Part 2     0.6315      0.2185    0.5415       0.1334     0.5693    0.1729
Figure 5. Scatter plot for mean CR@10 scores for each query
We also analysed whether the number of clusters has any effect on the diversity score. To measure this, we
calculated the mean CR@10 across all runs for each query and plotted these scores against the number of
clusters in the query. This scatter plot, shown in Figure 5, has a Pearson correlation coefficient of -0.600,
indicating that the more clusters a query contains, the lower the CR@10 score.
4.2 Results by Retrieval Modality
In this section, we present an overview of the results for runs using different modalities.
Table 15. Results by retrieval modality
Modality   Number of Runs   P@10 Mean   P@10 SD   CR@10 Mean   CR@10 SD   F1 Mean   F1 SD
TXT-IMG    36               0.713       0.1161    0.6122       0.1071     0.6556    0.1024
TXT        41               0.698       0.142     0.5393       0.0942     0.5976    0.0964
IMG        7                0.103       0.027     0.2535       0.0794     0.1456    0.0401
According to Table 15, both precision and cluster recall scores are highest when systems use both low-level
features based on the content of an image and its associated text. The mean P@10 of the runs using image
content only (IMG) is drastically lower; however, the gap decreases when considering the CR@10 score. Further
research should be carried out to improve runs using content-based approaches only, as the best run using this
approach had the lowest F1 score (0.218) compared to TXT (0.351) and TXT-IMG (0.297).
4.3 Approaches Used by Participants
Having established that the mixed modality performs best, we were also interested in finding the best
combination of query fields to maximize the F1 score of the runs. We therefore calculated the mean F1 score for
each combination of query fields and modality; the results are shown in Table 16, with the highest score for each
modality shown in italics.
Table 16. Choice of query fields with mean F1 score

Query Fields   TXT-IMG             TXT                 IMG                Average F1
T              2 runs    0.4621    14 runs   0.5905    1 run    0.0951    0.5462
T-CT-CD-I      10 runs   0.5729    2 runs    0.4579    3 runs   0.1296    0.4689
T-CT           2 runs    0.7214    13 runs   0.6071    -                  0.6233
T-CT-I         8 runs    0.7344    1 run     0.6842    -                  0.7288
T-CT-CD        2 runs    0.6315    7 runs    0.5688    -                  0.5827
I              4 runs    0.6778    1 run     0.6741    3 runs   0.1786    0.4901
T-I            6 runs    0.7117    1 run     0.7492    -                  0.7171
CT-I           2 runs    0.6925    -                   -                  0.6925
CT             -                   2 runs    0.6687    -                  0.6687
It is interesting to note that the best-performing combination of query fields differed across modalities. The
combination T-CT-I had the highest score for the TXT-IMG modality. For the TXT modality, T-I scored highest,
with T-CT-I in second place; however, since only one TXT run used T-I, this is not enough evidence to draw a
conclusion about the best combination. Calculating the average F1 score regardless of modality shows that the
best results are achieved using a combination of Title, Cluster Title and Image. Using all query fields resulted in
the worst average performance.
5 Conclusions
This paper has reported the ImageCLEF Photo Retrieval Task for 2009. Still focusing on the topic of diversity,
this year’s task introduced new challenges to the participants, mainly through the use of a much larger collection
of images than used in previous years and by other tasks. Queries were released as two ‘types’: the first type
included information about the kind of diversity expected in the results; the second type did not provide this
level of detail.
The number of registered participants this year was the highest of all the ImageCLEFPhoto tasks since 2003.
Nineteen participants submitted a total of 84 runs, which were categorised based on the query fields and the
modalities used. The results showed that participants were able to present diverse results without sacrificing
precision. In addition, results showed the following:
• Information about the cluster title is essential for providing diverse results, as it enables participants to
present images covering each cluster. When the cluster information was not used, the cluster recall score
dropped, showing that participants need better approaches for predicting the diversity required by a query.
• A combination of Title, Cluster Title and Image was shown to maximize the diversity and relevance of
the retrieved results.
• Runs using mixed modalities (text and image) achieved the highest F1 scores compared to those using
text or image features alone.
Considering the increasing interest in ImageCLEFPhoto, the creation of the new collection was a significant step
towards providing a more realistic framework for the analysis of diversity and the evaluation of retrieval systems
aimed at promoting diverse results. The findings from this new collection are promising, and we plan to make
use of other diversity algorithms in the future to enable more thorough evaluation.
Acknowledgments
We would like to thank the Belga Press Agency for providing us with the collection and query logs, and
Theodora Tsikrika for the preprocessed queries which we used as the basis for this research.
The work reported has been partially supported by the TrebleCLEF Coordination Action, within FP7 of the
European Commission, Theme ICT-1-4-1 Digital Libraries and Technology Enhanced Learning (Contract
215231).
References
[1] Tsikrika, T. 2009. Queries Submitted by Belga Users in 2008.
[2] Paramita, M. L, Sanderson, M., and Clough, P. 2009. Developing a Test Collection to Support Diversity
Analysis. SIGIR 2009 Workshop: Redundancy, Diversity, and Interdependent Document Relevance, July
23rd, Boston, Massachusetts, USA.
[3] Arni, T., Clough, P., Sanderson, M., and Grubinger, M. 2008. Overview of the ImageCLEFPhoto 2008
Photographic Retrieval Task. Cross Language Evaluation Forum.