<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Personal Photo Retrieval Pilot Task at ImageCLEF 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Zellhofer</string-name>
          <email>david.zellhoefer@tu-cottbus.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Brandenburg Technical University, Database and Information Systems Group</institution>
          ,
          <addr-line>Walther-Pauer-Str. 1, 03046 Cottbus</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>As a consequence of a discussion at ImageCLEF 2011, the personal photo retrieval pilot task has been designed to represent a personal photo collection. In contrast to other existing collections where the contributors often remain unknown, the proposed collection has been sampled from 19 layperson photographers and enriched by their demographics. To ensure a variance in photographic motifs and style, the contributors have been chosen from different demographic groups. Thus, one can interpret the content of the collection as a mirror of a photographer's lifespan with typical changing usage behaviors, cameras, topics, and places. The task consists of two subtasks. The first task is aiming at retrieving visual concepts such as trees, animals, or market scenes. The second is focussing on the retrieval of particular events such as parties or rock concerts. To solve both tasks, the participants were provided with query-by-example documents in addition to browsing data. The participation in this task was very low as only three groups submitted results. To summarize the first subtask, the best group achieved a precision at 20 of 0.7333 and an NDCG at 20 of 0.5459. In contrast, the second subtask focussing on events was solved with a precision at 20 of 0.9333 and an NDCG at 20 of 0.9697. Regarding the provided browsing data, only one group decided to exploit this resource instead of the provided metadata. Interestingly, it could use this data successfully to solve subtask 1 but reached the last position at subtask 2. This result indicates that there is a particularly strong influence of metadata on the retrieval of events.</p>
      </abstract>
      <kwd-group>
        <kwd>Content-Based Image Retrieval</kwd>
        <kwd>Benchmark</kwd>
        <kwd>Experiments</kwd>
        <kwd>Personal Photograph Collection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        As a consequence of a discussion at ImageCLEF 2011, the personal photo
retrieval pilot task has been designed to represent a personal photo collection.
The presented pilot task aims at providing a test bed for QBE-based
retrieval scenarios in the scope of personal information retrieval. In contrast to
other tasks relying on downloads from Flickr or the like, the underlying data
set reflects an amalgamated personal image collection that has been taken by 19
photographers. Hence, it can be used best as a test set for layperson retrieval
tasks carried out ad hoc on their own collections such as: "find all images with
a street scene", "find a beach similar to this", or more event-based tasks like
"show me more pictures from the last U2 concert". The aim of this pilot task
is to retrieve relevant images based on typical layperson usage scenarios in their
own collections, i.e., the search for similar images or images depicting a similar
event, e.g. a rock concert [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>To ensure a variance in photographic motifs and style, the contributors have
been chosen from different demographic groups. Thus, one can interpret the
content of the collection as a mirror of a photographer's lifespan with typical
changing usage behaviors, cameras, topics, and places.</p>
      <p>
        Unlike system-centric (Cranfield-based) benchmarks, the pilot task tries to
establish a more user-centered perspective on multimodal information retrieval
(MIR) and content-based image retrieval (CBIR). As such, it features two
different retrieval subtasks that can be derived from the camera usage behavior of
the contributing photographers (see below). Additionally, it provides simulated
browsing data reflecting a user's interaction with the system based on multiple
search strategies as observed by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or described by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] respectively.
      </p>
      <p>
        In order to express the subjectivity of relevance assessments, the ground truth
is based on graded relevance judgements. To include these assessments into the
evaluation, the pilot task uses the NDCG metric [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (see Section 4.1) in addition
to precision at various cut-off levels.
      </p>
      <p>As said before, the task consists of two subtasks. The first task is aiming at
retrieving visual concepts such as trees, animals, or market scenes. The second
is focussing on the retrieval of particular events such as parties or rock concerts.
To solve both tasks, the participants were provided with query-by-example
documents in addition to browsing data.</p>
    </sec>
    <sec id="sec-2">
      <title>Task Resources</title>
      <p>
        The pilot task relies on a subset of the Pythia dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] which will be described
in the next section. To complete the description of the provided resources, Section
2.2 will comment on the acquisition of the ground truth. The following section
will then discuss the elicitation of the browsing data offered to the participants
as an additional resource.
      </p>
      <sec id="sec-2-1">
        <title>The Pythia Dataset</title>
        <p>
          To overcome the limitations of binary relevance judgments often found in common
test collections, the Pythia collection [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] has been proposed. The collection is
aiming at providing a benchmark for user-centered or relevance feedback-related
experiments which are affected by subjective relevance levels in particular. The
collection differs from collections consisting of Flickr downloads or the like as it
has been sampled from 19 layperson photographers. For the individual
contribution of the photographers, see Figure 1. In addition to the image data, the
contributors to the collection completed a survey asking for their photograph
taking behavior, their demographics etc. To ensure a variance in photographic
motifs and style, the contributors have been chosen from di erent demographic
groups. Thus, one can interpret the content of the collection as a mirror of a
photographer's lifespan with typical changing usage behaviors, cameras, topics,
and places. The total size of the collection is 5,555 documents.
        </p>
        <p>The documents within the collection have neither been processed extensively
nor have duplicates been removed. Hence, the data can be considered a
realistic sample from a typical user's hard disk. The collection is rich in metadata
including GPS, IPTC, EXIF, and information about the events depicted in each
photograph. All this information is available to the participants of the pilot
task. For an overview, see Table 1.</p>
        <p>Fig. 1. Individual contribution of each photographer (actor) to the collection.</p>
        <p>In order to obtain the ground truth, 42 assessors were asked to participate. The
core characteristics can be subsumed as follows. The majority of the assessors (28
out of 42) are male and born between 1979 and 1991 (median: 1987). Most of the
assessors are students with a background in economics (26), the second largest
group (13) has a background in computer science and information technology.
Figure 2 illustrates the other fields of education or working area. Regarding
their level of expertise in the field of MIR or IR, 9 assessors took classes in MIR
while 11 attended IR classes. When asked directly about their knowledge of the field, the
median lies at "little knowledge" with an average of 1.40, i.e., a trend towards
considering themselves as "informed outsiders".</p>
        <p>Fig. 2. Fields of education or working area of the assessors (Business Admin. &amp; Engineering, Business Admin., eBusiness, Urban &amp; Regional Planning, Information &amp; Media Technology, Computer Science, Travel Agent, Engineer).</p>
        <p>Using a web-based evaluation tool (see Figure 3), the assessors could judge
the relevance of an image with respect to a topic on a graded scale ranging
from 0 (irrelevant) to 3 (fully relevant). All assessors had to judge all documents
regarding a topic. The topics were assigned to the assessors at random. To
keep them motivated, the assessors were allowed to work with the collection from
a place of their choice. Additionally, they could pause an assessment run and
continue later on. A time constraint has not been defined. On average, 2.69
topics were evaluated per assessor (standard deviation: 1.60). The individual
assessments were saved separately in order to maintain them for later usage.</p>
        <p>Calculation of the Ground Truth for each Topic Based on the individual
assessments, an averaged ground truth has been calculated. First, the frequency
of each graded relevance judgement (on a scale from 0 (irrelevant) to 3
(fully relevant)) was counted per image and topic. Based on these relevance
judgment frequencies, an estimation value was calculated and rounded. The rounded
estimation value of the relevance of an image regarding a topic was then used as
the averaged graded relevance assessment for this image. In consequence, each
image could be associated with a graded relevance judgment for each topic.</p>
        <p>Generating Browsing Information As we could not obtain real browsing
information, it had to be generated artificially. Using the graded relevance
assessments, multiple images were chosen as browsing images. The provided browsed
images have a relevance grade ranging from 1 to 2, i.e., they are judged
neither irrelevant nor fully relevant for a given topic. In other words, the browsing
data consists of interesting images which were not fully relevant for the modeled
user, which caused him or her to proceed with the search. This change of search
strategy (from browsing to a directed QBE search) is reflected by the following
subtask.</p>
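        <p>The averaging and browsing-image selection described above can be sketched as follows. This is a minimal illustration, not the task's actual implementation: the helper names are hypothetical, and rounding the expectation value over the grade frequencies is an assumption, as the exact estimator is not specified.</p>
        <preformat>
```python
from collections import Counter

def averaged_grade(judgements):
    """Average the individual graded assessments (0..3) of one image/topic pair."""
    freq = Counter(judgements)  # frequency of each relevance grade
    total = sum(freq.values())
    # expectation value over the observed grade frequencies, then rounded
    expected = sum(grade * count for grade, count in freq.items()) / total
    return round(expected)

def browsing_candidates(ground_truth):
    """Select browsing images: judged neither irrelevant (0) nor fully relevant (3)."""
    return [img for img, grade in sorted(ground_truth.items()) if grade in (1, 2)]
```
        </preformat>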
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Task Description</title>
      <sec id="sec-3-1">
        <title>Subtask 1: Retrieval of Visual Concepts</title>
        <p>
          The objective of the first subtask is to find similar images to a specified visual
concept or topic. Out of the 32 topics provided by the Pythia dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the 24
topics with the most relevant images in the corpus were chosen. The topics are
listed in Table 2.
        </p>
        <p>To solve the task, 5 QBE documents were provided to the participants. All
QBE documents are fully relevant according to our assessors. In addition to
the metadata present in the images, browsing data consisting of images that
have been inspected during the search (see above) is offered. The usage of this
browsing data is voluntary, as is the utilization of image features or metadata (e.g.
GPS information).</p>
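        <p>A purely visual (IMG-type) run over the QBE documents can be sketched as a nearest-neighbour ranking. This is only an illustration under stated assumptions: the feature vectors, the Euclidean distance, and the mean-distance aggregation are choices made here, not the participants' actual methods.</p>
        <preformat>
```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_by_qbe(collection, qbe_examples):
    """Rank collection documents by their mean distance to the QBE feature vectors."""
    def score(vec):
        return sum(euclidean(vec, q) for q in qbe_examples) / len(qbe_examples)
    # smallest mean distance first, i.e. the most similar documents on top
    return sorted(collection, key=lambda doc_id: score(collection[doc_id]))
```
        </preformat>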
      </sec>
      <sec id="sec-3-2">
        <title>Subtask 2: Retrieval of Events</title>
        <p>
          With respect to the fact that most contributors to the collection used their
cameras only at special events [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], an additional event retrieval subtask was
defined. Its objective is to find further images from an event specified by 3 QBE
images from the same event. In contrast to subtask 1, browsing data is not
available. Table 3 lists all events.
        </p>
        <p>
          The events range from special events such as a U2 concert to their
generalization, i.e., a rock concert. It is noteworthy that the events can reoccur and are
not always chronologically connected. The focus on events representing a holiday
or a city trip is not a freely chosen bias. Instead, it reflects the state of randomly
picked images from real-world personal photo collections [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Evaluation Metrics</title>
        <p>
          It is widely known that relevance judgments are highly subjective. Because of
this fact, the presented ground truth is based on a graded scale of relevance.
Unfortunately, traditional measurements such as the mean average precision (MAP)
or precision at n cannot deal with this kind of judgements. Hence, we will rely
on the discounted cumulative gain (DCG) measurement [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] in addition to
precision at n. As stated in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], "DCG relies on graded relevance assessments and has
become more and more used within the information retrieval (IR) community,
which is reflected by a performance evaluation of different metrics presented
at SIGIR '11 showing that DCG 'really is a useful user-centered measure of
system effectiveness' [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Besides its capability of reflecting subjectivity, DCG
also provides more appropriate means to evaluate relevance feedback (RF) or
adaptive systems as it can be used to measure slight changes or re-orderings of
relevant documents with varying degrees of relevance within the result list". The
core idea of DCG is to apply "a discount factor to the relevance scores in order
to devaluate late-retrieved documents" [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In other words, the metric rewards
highly relevant documents at the first positions in the result ranking and
punishes systems retrieving less relevant documents at the first places. For the scope
of this task, the DCG implementation of trec_eval version 9.0 with standard
discount settings is used. A full discussion of the metric is given by Järvelin
and Kekäläinen [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
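        <p>The gain/discount idea can be illustrated with a small sketch. We assume the common scheme where the gain is the relevance grade itself and position rank is discounted by 1/log2(rank + 1); the official scores were computed with trec_eval, so this sketch only illustrates the principle.</p>
        <preformat>
```python
import math

def dcg(grades):
    """Discounted cumulative gain: each grade discounted by 1/log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))

def ndcg_at_k(ranked_grades, all_grades, k):
    """Normalize DCG@k by the DCG of the ideal (descending) ordering."""
    ideal = sorted(all_grades, reverse=True)
    best = dcg(ideal[:k])
    return dcg(ranked_grades[:k]) / best if best > 0 else 0.0
```
        </preformat>
        <p>A perfect ranking yields an NDCG of 1.0; placing less relevant documents first lowers the score, which is exactly the punishment of late-retrieved relevant documents described above.</p>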
      </sec>
      <sec id="sec-4-2">
        <title>Results of the Participants</title>
        <p>Because of the low participation rate, a general interpretation of the results is
hardly possible. Tables 4 and 5 summarize the participants' results. The
submitted runs consist of both automatic and manually assisted runs. While two
groups worked without relevance feedback (NOFB), the University of Cagliari
used it in a binary way. That is, relevance feedback was given as relevance or
irrelevance judgments.</p>
        <p>Regarding the retrieval type, the runs are more diverse. The participants
could use the following combinations of the provided data and metadata:
- visual features alone (IMG)
- visual features and metadata (IMGMET)
- visual features and browsing data (IMGBRO)
- metadata alone (MET)
- metadata and browsing data (METBRO)
- browsing data alone (BRO)
- a combination of all modalities (IMGMETBRO)</p>
        <p>None of the participants used all modalities in combination. The participants
relied on IMG, MET, IMGMET, or IMGBRO alone. Interestingly, only the group
REGIM decided to exploit the browsing data instead of the provided metadata.
Surprisingly, it could use this data successfully to solve subtask 1 but reached
the last position at subtask 2. This result indicates that there is a particularly
strong influence of metadata on the retrieval of events.</p>
        <p>To summarize the first subtask (see Table 4), the best group achieved a
precision at 20 of 0.7333 and an NDCG at 20 of 0.5459. In contrast, the second
subtask focussing on events was solved with a precision at 20 of 0.9333 and an
NDCG at 20 of 0.9697 (see Table 5).</p>
      </sec>
      <sec id="sec-4-3">
        <title>The Effect of Different User Groups on the Retrieval Quality</title>
        <p>Because of the nature of the acquisition of the ground truth (see Section 2.2), distinct ground
truths could be generated per user group. The main objective for these different
ground truths was to examine whether the retrieval metrics for each participant differ
per user group. Hence, 6 user groups were defined on the basis of the demographics
of the assessors. These are:
Experts A group of users that stated that they have expertise in IR.
Non-Experts The complement of the experts group.
Male/Female The assessors divided by gender.
IT This group consists of assessors with an IT background (see Figure 2).
Non-IT The complement of the IT group.</p>
        <p>As not all images and topics have been assessed by members of each separate
user group, missing assessments had to be added from the averaged ground truth
(see above). Figures 4-6 illustrate the results of some sample runs regarding
different user groups. The x-axis indicates different retrieval measurements, i.e.,
1) P@10, 2) P@20, 3) P@30, 4) NDCG@10, 5) NDCG@20, and 6) NDCG@30.
Except in Figure 6, the results of the individual user groups are very close to
the results from the averaged ground truth. Further research is needed to find
out why this is the case. For now, it seems that the addition of missing relevance
assessments is causing this low level of variation.</p>
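        <p>The construction of a group-specific ground truth can be sketched as an overlay of the group's own judgments on top of the averaged ground truth; the dictionary representation and the helper name are assumptions for illustration only.</p>
        <preformat>
```python
def group_ground_truth(group_judgements, averaged):
    """Build a group-specific ground truth: start from the complete averaged
    grades and overwrite every (image, topic) pair the group actually assessed."""
    merged = dict(averaged)
    merged.update(group_judgements)
    return merged
```
        </preformat>
        <p>Since every pair the group did not assess keeps its averaged grade, the group-specific metrics are necessarily pulled towards the averaged results, which is consistent with the low variation observed above.</p>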
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>As this is the pilot phase of a more user-centered benchmark, the task posed
more questions and revealed more issues than it actually answered.</p>
      <p>First, it became obvious that the generation of user-centered tasks and the
acquisition of the accompanying data takes much more time than expected.
Originally, we also wanted to provide data for user simulations to all participants
so they could tune their systems with respect to different user groups. Due to
the time constraints, this data could not be released on time. Whether this had an
impact on the low participation rate remains an open question.</p>
      <p>
        To our surprise, only one group used the provided browsing data. Regarding
this data, we expected more interest as studies in interactive IR clearly show that
users are changing their search strategies during the search process [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Anyhow,
the positive results of this group might motivate others to study further how
to exploit this resource.
      </p>
      <p>Interestingly, there was no interest in solving the so-called user-centered
initiative of the subtasks. The initiative asked for an alternative representation of
the top-k results offering a more diverse view onto the results to the user. This
challenge reflects the assumption that a user-centered system should offer users
good and varying retrieval results. Varying results are likely to compensate for
the vagueness inherent in both retrieval and query formulation. Hence, an
additional filtering or clustering of the result list could improve the effectiveness and
efficiency (in terms of usability) of the retrieval process. It remains unclear if
this task was too complex or just out of the area of expertise of the participants,
who used the dataset for the first time.</p>
      <p>To conclude, we are happy that the participants tried to solve the task
using diverse techniques and hope to motivate further research in the field of
user-centered MIR and CBIR.</p>
      <sec id="sec-5-1">
        <title>Acknowledgments</title>
        <p>This research was supported by a grant of the Federal Ministry of Education
and Research (Grant Number 03FO3072).</p>
        <sec id="sec-5-1-1">
          <title>Table 4: Results of Subtask 1</title>
          <p>Table 4 lists the submitted runs of the groups KIDS, REGIM, and University of Cagliari for subtask 1.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>Table 5: Results of Subtask 2</title>
          <p>Table 5 lists the submitted runs of the groups KIDS and REGIM for subtask 2.</p>
          <p>Fig. 5. Retrieval performance for different user groups (University of Cagliari Run1.1).</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Belkin, N.: Intelligent information retrieval: Whose intelligence? In: ISI '96: Proceedings of the Fifth International Symposium for Information Science, pp. 25-31 (1996)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Carterette, B.: System effectiveness, user models, and user utility: a conceptual framework for investigation. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 903-912. SIGIR '11, ACM (2011), http://doi.acm.org/10.1145/2009916.2010037</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422-446 (2002)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Reiterer, H., Mußler, G., Mann, M.T., Handschuh, S.: INSYDER - an information assistant for business intelligence. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 112-119. SIGIR '00, ACM (2000), http://doi.acm.org/10.1145/345508.345559</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Voorhees, E.M.: On test collections for adaptive information retrieval. Information Processing &amp; Management 44(6), 1879-1885 (2008), http://www.sciencedirect.com/science/article/pii/S0306457308000253</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Zellhofer, D.: An Extensible Personal Photograph Collection for Graded Relevance Assessments and User Simulation. In: Proceedings of the ACM International Conference on Multimedia Retrieval, ICMR '12, ACM (2012)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>