<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Fraunhofer IDMT at ImageCLEF 2011 Photo Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karolin Nagel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefanie Nowak</string-name>
          <email>research@stefanie-nowak.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uwe Kuhhirt</string-name>
          <email>uwe.kuehhirt@idmt.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kay Wolter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Digital Media Technology (IDMT)</institution>
          ,
          <addr-line>Ehrenbergstr. 31, 98693 Ilmenau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the participation of Fraunhofer IDMT in the ImageCLEF 2011 Photo Annotation Task. Our approach focuses on text-based features and on strategies to combine visual and textual information. First, we apply a pre-processing step to the provided Flickr tags to reduce noise. For each concept, tf-idf values per tag are computed and used to construct a text-based descriptor. Second, we extract RGB-SIFT descriptors using the codebook approach. Visual and text-based features are combined, once with early fusion and once with late fusion. The concepts are learned with SVM classifiers. Further, a post-processing step compares tags and concept names to each other. Our submission consists of one text-only run and four multi-modal runs. The results show that combining text-based and visual features improves the result. The best results are achieved with the late fusion approach. The post-processing step improves the results for some concepts only, while others worsen. Overall, we scored a Mean Average Precision (MAP) of 37.1% and an example-based F-Measure (F-ex) of 55.2%.</p>
      </abstract>
      <kwd-group>
        <kwd>image annotation</kwd>
        <kwd>multi-modal fusion</kwd>
        <kwd>tag features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The ImageCLEF 2011 Photo Annotation Task challenges participants to
evaluate their multi-label image annotation approaches on a set of Flickr images with
the goal of achieving the most accurate annotation of these images. The images
belong to 99 different concepts. These range from scene descriptions such as
place and time, through abstract categories such as partylife, to very specific concepts
such as dog or car. This year's newly added concepts focus on emotions that
the images convey, e.g., happy or melancholic. In addition to the images and
concept associations, the participants are provided with the Flickr user tags and
EXIF data of the images. A detailed overview of the data set and the task can
be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Our main objective in this task is to explore how tags can be combined
with visual features in order to optimize the annotation result.</p>
    </sec>
    <sec id="sec-10">
      <title>Annotation System</title>
      <p>[Figure 1: Overview of the annotation system. Block labels recovered from the diagram: Visual Features (RGB-SIFT), Text-based Features (Tags), Multi-modal Features, Training, SVM Models, Test Data, Annotation, Post-processing.]</p>
      <p>
        Figure 1 gives an overview of our annotation system. We use visual
and textual information of the training data to learn models. These are then
employed to annotate the test data. Afterwards, a post-processing step is applied.
The following sections describe each step in detail.
      </p>
      <p>
        Visual Features: As our focus lies on the text-based features and the
combination of different modalities, we use only one visual descriptor. The baseline
makes use of densely sampled RGB-SIFT descriptors [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These scale-invariant
features describe the form and shape of a region around a certain pixel using
edge orientation histograms [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. They are extracted on a 6-pixel-wide grid and
clustered with a k-means algorithm to generate a dictionary that contains
2,000 visual words.
      </p>
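      <p>The codebook step can be illustrated with a minimal sketch. It assumes hard assignment of each descriptor to its nearest visual word and an L1-normalized histogram; neither detail is stated in the text, and the function names and the toy two-dimensional data are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter


def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor."""
    distances = [math.dist(descriptor, word) for word in codebook]
    return distances.index(min(distances))


def bag_of_words(descriptors, codebook):
    """Build a visual-word histogram for one image.

    Assumption: hard assignment and L1 normalization.
    """
    if not descriptors:
        return [0.0] * len(codebook)
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = len(descriptors)
    return [counts[i] / total for i in range(len(codebook))]


# Toy 2-D data; a real codebook would hold 2,000 k-means centroids.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.0, 0.1)]
histogram = bag_of_words(descriptors, codebook)
print(histogram)
```
      </preformat>
      <p>In this toy example, two descriptors fall on the first visual word and two on the second, so the histogram has one entry per codebook word and sums to one.</p>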
      <p>
        Text-based Features: We use the Flickr user tags to construct text-based
feature descriptors. As tagging on Flickr is relatively free, tags exist in different
languages and word variations. In order to reduce this redundancy, we pre-process
the tags prior to the generation of textual features. First, all Flickr user tags
are translated into English by using the Google Translate API [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Afterwards,
tags are stemmed with the help of the Porter Stemming Algorithm [5] in order
to merge word variations like explorer – explored into one tag.
      </p>
      <p>We employ a supervised approach which learns tag frequencies on the
concepts of the training set. Similar to the group of Meiji University [6],
concept-based tf-idf weights [7] are assigned to each tag. A tag's term frequency (tf) is
determined by counting the number of times the tag occurs in a certain concept.
The document frequency (df) term is equivalent to the fraction of concepts the
tag t appears in, as shown in Equation 1. Therefore, tags that appear very often
in only a few concepts get higher weights assigned than tags that appear fairly
often in many concepts:
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[\mathit{df}_t = \frac{\text{number of concepts with tag } t}{\text{total number of concepts}}]]></tex-math>
        </disp-formula>
Finally, the inverse document frequency (idf) is calculated as <inline-formula><tex-math><![CDATA[\log(\mathit{df}_t)]]></tex-math></inline-formula>.</p>
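      <p>A minimal sketch of the concept-based weighting follows. It computes df as in Equation 1, but assumes the conventional idf form log(1/df) so that tags occurring in few concepts receive higher weights; the function name, the data layout, and the toy training set are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter, defaultdict


def concept_tfidf(training_images, num_concepts):
    """Compute a tf-idf weight for every (concept, tag) pair.

    training_images: list of (tags, concepts) pairs; tags are assumed to be
    translated and stemmed already.
    """
    # tf[concept][tag] = number of times the tag occurs in the concept
    tf = defaultdict(Counter)
    for tags, concepts in training_images:
        for concept in concepts:
            tf[concept].update(tags)

    # Number of concepts in which each tag occurs at least once
    concepts_with_tag = Counter(tag for counts in tf.values() for tag in counts)

    weights = defaultdict(dict)
    for concept, counts in tf.items():
        for tag, count in counts.items():
            df = concepts_with_tag[tag] / num_concepts  # Equation 1
            # Assumption: idf = log(1 / df), the conventional form.
            weights[concept][tag] = count * math.log(1.0 / df)
    return weights


# Hypothetical toy training set: (stemmed tags, annotated concepts) per image.
training = [
    (["dog", "anim"], ["dog", "outdoor"]),
    (["dog"], ["dog"]),
    (["tree", "anim"], ["outdoor"]),
]
weights = concept_tfidf(training, num_concepts=2)
print(weights["outdoor"]["tree"])
```
      </preformat>
      <p>In the toy data, the tag tree occurs in only one of the two concepts and therefore receives a positive weight, whereas dog occurs in both concepts and its weight becomes zero.</p>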
      <p>For each concept, the tf-idf values of the tags of an image are summed up.
This leads to a feature vector containing 99 elements, one per concept, with scores normalized to
the range [0, 1]. These features are then employed in the learning stage.</p>
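      <p>The construction of the text-based descriptor might look as follows. This is a sketch only: the normalization scheme is not specified in the text, so division by the largest score is assumed, and the weight table and concept list are hypothetical.</p>
      <preformat>
```python
def tag_feature_vector(image_tags, weights, concepts):
    """Sum the tf-idf weights of an image's tags for every concept."""
    raw = [
        sum(weights.get(concept, {}).get(tag, 0.0) for tag in image_tags)
        for concept in concepts
    ]
    peak = max(raw, default=0.0)
    # Assumption: scale by the largest score so that all values lie in [0, 1].
    return [score / peak for score in raw] if peak > 0 else raw


# Hypothetical non-negative weights; the real vector would have 99 entries.
weights = {"dog": {"dog": 2.0, "anim": 0.5}, "cat": {"anim": 0.5}, "sky": {}}
concepts = ["dog", "cat", "sky"]
vector = tag_feature_vector(["dog", "anim"], weights, concepts)
print(vector)
```
      </preformat>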
      <p>Concept Learning and Annotation</p>
      <p>For each concept, an SVM with an RBF kernel is learned using the one-against-all
strategy and optimized with the concept-based F-Measure on the training set.
To combine visual and textual features, we employ two different approaches:
early fusion and late fusion.</p>
      <p>For the early fusion approach, visual and text-based features are
considered simultaneously to learn the SVM models. The late fusion approach learns
SVM models for each modality separately and then combines the classification
results using the geometric mean.</p>
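      <p>The two fusion strategies can be contrasted in a short sketch. It assumes that early fusion concatenates the two feature vectors and that the classifier outputs used for late fusion are non-negative scores such as probabilities; both points are assumptions, as are the function names and the sample values.</p>
      <preformat>
```python
import math


def early_fusion(visual_features, text_features):
    """Concatenate both modalities into one vector for a single SVM.

    Assumption: 'considered simultaneously' is read as concatenation.
    """
    return list(visual_features) + list(text_features)


def late_fusion(visual_scores, text_scores):
    """Combine the per-concept scores of two classifiers by geometric mean.

    Assumption: scores are non-negative, e.g. probabilities in [0, 1].
    """
    return [math.sqrt(v * t) for v, t in zip(visual_scores, text_scores)]


fused_features = early_fusion([0.5, 0.5, 0.0], [1.0, 0.2])
fused_scores = late_fusion([0.9, 0.2], [0.4, 0.8])
print(fused_features)
print(fused_scores)
```
      </preformat>
      <p>With a geometric mean, a concept receives a high fused score only if both modalities support it, since a score near zero in either modality pulls the product down.</p>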
      <p>Post-processing</p>
      <p>To further optimize the annotation result, we apply a simple post-processing
step. Each image's tags are again translated and stemmed and then
compared to the concept names, which are stemmed as well. If a concept name consists
of more than one word, the tags are compared to each of these words. If a tag
matches at least one word of the concept name, the image is assigned to that concept.</p>
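      <p>The matching rule described above can be sketched as follows. Lower-casing stands in for the translation and Porter stemming steps, and the function name, the word-splitting rule, and the sample concept names are assumptions made for illustration.</p>
      <preformat>
```python
import re


def matching_concepts(image_tags, concept_names, stem=str.lower):
    """Return the concepts whose name shares a stemmed word with the tags.

    Assumption: 'stem' is a stand-in for translation plus Porter stemming.
    """
    stemmed_tags = {stem(tag) for tag in image_tags}
    matches = []
    for name in concept_names:
        # Split multi-word concept names on whitespace or underscores.
        words = {stem(word) for word in re.split(r"[\s_]+", name) if word}
        if words.intersection(stemmed_tags):
            matches.append(name)
    return matches


concepts = ["cat", "small group", "old person", "park or garden"]
assigned = matching_concepts(["Cat", "small", "garden"], concepts)
print(assigned)
```
      </preformat>
      <p>Note that the single tag small is sufficient to assign the concept small group under this rule, because every word of a multi-word concept name is compared on its own.</p>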
      <p>Submission</p>
      <p>We submitted five different runs in total. One run uses only textual information;
the other four runs make use of multi-modal information sources.</p>
      <list list-type="bullet">
        <list-item><p>Tags only</p></list-item>
        <list-item><p>Early fusion of RGB-SIFT and tags</p></list-item>
        <list-item><p>Early fusion of RGB-SIFT and tags with post-processing step</p></list-item>
        <list-item><p>Late fusion of RGB-SIFT and tags</p></list-item>
        <list-item><p>Late fusion of RGB-SIFT and tags with post-processing step</p></list-item>
      </list>
      <p>
        Results and Discussion
      </p>
      <p>
        The results are evaluated with concept-based and example-based performance
measures. Detailed information about the evaluation process can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>[Fig. 3: Comparison of the early fusion results with and without post-processing. Plotted runs: Early Fusion of RGB-SIFT and Tags; Early Fusion of RGB-SIFT and Tags + Post-Processing.]</p>
      <p>
In Table 1, the final scores for the concept-based evaluation with the MAP are
presented. Overall, our system scored a best run of 37.1% MAP for the
multi-modal approach. The text-only approach results in a MAP of 32.6%.</p>
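      <p>For orientation, the MAP measure can be sketched as the mean of the per-concept average precision values. The sketch uses non-interpolated average precision and assumes that every relevant image appears in the ranking; the official evaluation may differ in these details, and the sample rankings are hypothetical.</p>
      <preformat>
```python
def average_precision(ranked_relevance):
    """Non-interpolated AP for one concept.

    ranked_relevance: 1/0 relevance labels of the images, best-ranked first.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average the per-concept AP values over all concepts."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical concepts with ranked relevance labels.
rankings = [[1, 0, 1, 0], [0, 1]]
map_score = mean_average_precision(rankings)
print(round(map_score, 4))
```
      </preformat>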
      <p>The late fusion approach outperformed the early fusion one by about 2.4 percentage points
(37.1% versus 34.7%). The post-processing step does not improve the result of
the late fusion approach, though it improves the results for the early fusion run.
Figure 2 shows that the post-processing actually works well for some concepts,
while the detection performance for others worsens. Concepts that suffer the
most from the post-processing step are those whose names consist of more than
one word, e.g., park or garden, small group or old person. Meanwhile, concepts
like cat, horse, airplane, or skateboard improve significantly. The main reason for
this is the rather simple approach of the post-processing step. Taking
composite concept names into account should help to improve the performance.</p>
      <p>For most of the concepts, early and late fusion perform quite similarly. The
main difference can be found for the concepts abstract, boring and cute as well
as the different kinds of animals and vehicles. Here, late fusion outperforms early
fusion, as can be seen in Figure 3.</p>
      <p>Evaluation per Example</p>
      <p>The first participation of Fraunhofer IDMT in the ImageCLEF Photo
Annotation Task shows promising results. Using our textual descriptor in combination
with one visual descriptor, we achieve annotation results that can compete well
with other systems. The textual features work especially well for rather specific
concepts that describe objects in an image. A combination of different textual
and visual features is likely to result in a very stable annotation.</p>
      <p>Future work will examine the relations between tags as well as between concepts more
closely. Additionally, the inclusion of more visual features and text-based
descriptors will be a main objective.</p>
      <p>Acknowledgements</p>
      <p>This work was partly supported by grant 01MQ07017 of the German research
program THESEUS funded by the Ministry of Economics.</p>
      <p>5. Porter, M.: An algorithm for suffix stripping. Program: electronic library and
information systems 14(3) (1993) 130–137</p>
      <p>6. Motohashi, N., Izawa, R., Takagi, T.: Meiji University at the ImageCLEF2010
Visual Concept Detection and Annotation Task: Working notes. In: Working Notes
of CLEF 2010, Padova, Italy. (2010)</p>
      <p>7. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press (2008)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nagel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liebetrau</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The CLEF 2011 Photo Annotation and Concept-based Retrieval Tasks</article-title>
          . In: CLEF 2011 working notes, Amsterdam, The Netherlands.
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Van de Sande</surname>
            ,
            <given-names>K.E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.M.</given-names>
          </string-name>
          :
          <article-title>Evaluating color descriptors for object and scene recognition</article-title>
          .
          <source>Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>32</volume>
          (
          <issue>9</issue>
          ) (
          <year>2010</year>
          )
          <fpage>1582</fpage>
          –
          <lpage>1596</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          :
          <article-title>Distinctive Image Features from Scale-Invariant Keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>60</volume>
          (
          <issue>2</issue>
          ) (
          <year>November 2004</year>
          )
          <fpage>91</fpage>
          –
          <lpage>110</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <collab>Google Code</collab>
          : Google Translate. http://translate.google.de/, last checked: 05 Aug
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>