<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The PRA and AmILAB at ImageCLEF 2012 Photo Flickr Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Piras</string-name>
          <email>luca.piras@diee.unica.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Tronci</string-name>
          <email>roberto.tronci@diee.unica.it</email>
          <email>roberto.tronci@sardegnaricerche.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Murgia</string-name>
          <email>gabriele.murgia@sardegnaricerche.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Giacinto</string-name>
          <email>giacinto@diee.unica.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AmILAB - Laboratorio Intelligenza d'Ambiente</institution>
          ,
          <addr-line>Sardegna Ricerche</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DIEE - Department of Electric and Electronic Engineering University of Cagliari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
<p>This paper presents the first participation of the Pattern Recognition and Application Group (PRA Group) and the Ambient Intelligence Lab (AmILAB) in the ImageCLEF 2012 Photo Flickr Concept Annotation Task. In this task, the teams' goal is to detect the presence of 94 concepts in the images, and to provide a confidence score for the decision of each concept detector. We faced the challenge by relying on visual information only, combining different image descriptors by means of different score combination techniques. Experimental results show that merely combining concept detectors not specifically designed to handle the large variety of concepts does not allow reaching satisfactory results.</p>
      </abstract>
      <kwd-group>
        <kwd>image annotation</kwd>
        <kwd>dynamic score combination</kwd>
        <kwd>SVM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        The visual concept annotation task is a multi-label classification challenge whose goal is to analyse a collection of photos in order to detect the presence of one or more concepts. The number of selected concepts is 94, and their semantics cover a wide range: they include categories related to persons (e.g. baby, child, teenager, adult), animals (e.g. cat, dog, horse), and sentiments (e.g. unpleasant, euphoric). In addition to the images and the associated concepts, participants are provided with textual and visual features. Our approach to this task is to combine the outputs of visual concept detectors based on different visual descriptors. A detailed overview of the data set and the related task can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
</sec>
    <sec id="sec-2">
      <title>Visual features</title>
      <p>
        For this task, a subset of the MIRFLICKR collection (http://press.liacs.nl/mirflickr/) has been used. This subset comprises 25 thousand images that have been manually annotated with a limited number of concepts. With respect to the previous editions of the competition, this year the annotation process has been carried out by resorting to crowdsourcing mechanisms. Several concepts have been reused from last year's task and, for most of these concepts, the remaining photos of the MIRFLICKR-25K collection that had not yet been used in the previous task have been annotated. In order to improve the annotation quality, all 25,000 images have also been re-annotated for several concepts, and all the images have been annotated for the new concepts. All images are accompanied by different kinds of features: textual and visual. Detailed information about the feature sets can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
<p>
        In our approach, we focused on visual descriptors only. The visual descriptors proposed for this task are the following: SIFT, C-SIFT, RGB-SIFT, and opponent-SIFT. For each descriptor, the histogram of the occurrence frequencies has been extracted by using the Color Descriptors toolkit [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As expected, the K-means clustering used to produce a "bag of visual words" representation is quite slow for the data sizes at hand, as clustering 250,000 points takes at least 12 hours per iteration. The solution usually proposed is to reduce the number of points to cluster. By default, the toolkit extracts 250,000 points regardless of the number of training images, thus automatically reducing the number of points per image as the number of images increases. This means that the toolkit extracts fewer than 17 points per image, thus losing dozens of descriptors per image. At the same time, if up to 200 points are extracted from each of the 15,000 training images, the K-means algorithm would have to cluster 3,000,000 points!
      </p>
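      <p>
        As a rough sketch of the point-budget trade-off just described (the helper names and the 200-point cap are our illustrative assumptions, not the toolkit's actual interface), capping the number of descriptors sampled per image bounds the clustering input, but with 15,000 images it still yields the 3,000,000-point pool mentioned above:
      </p>
      <preformat><![CDATA[
# Illustrative sketch: sample at most a fixed number of descriptors per
# image before clustering. `extract_descriptors` is a stand-in for whatever
# produces the per-image SIFT-like descriptors.
import numpy as np

MAX_POINTS_PER_IMAGE = 200  # assumed cap, taken from the discussion above

def sample_points(images, extract_descriptors, seed=0):
    rng = np.random.default_rng(seed)
    pools = []
    for img in images:
        d = extract_descriptors(img)  # (n_points, dim) array per image
        if len(d) > MAX_POINTS_PER_IMAGE:
            keep = rng.choice(len(d), MAX_POINTS_PER_IMAGE, replace=False)
            d = d[keep]
        pools.append(d)
    # e.g. 15,000 images x 200 points -> up to 3,000,000 rows to cluster
    return np.vstack(pools)
]]></preformat>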
<p>For these reasons, we decided to divide the 15,000 training images into four groups, retaining the same proportion of images per concept as in the whole training set. Then, for each group, we clustered around 750,000 points in order to obtain four different codebooks (one codebook for each descriptor) with 2,048 visual words. Each codebook has then been used to produce the Bag of Visual Words descriptors. This procedure allowed us to obtain a large vocabulary of "visual words" while reducing the number of points to cluster.</p>
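      <p>
        A minimal sketch of this grouping-and-clustering procedure follows; the round-robin split is only a cheap approximation of the concept-proportion-preserving split, and MiniBatchKMeans is our assumption, as the paper does not state which K-means implementation was used:
      </p>
      <preformat><![CDATA[
# Sketch: split the training images into 4 groups, build one 2,048-word
# codebook per group, and quantize descriptors into BoW histograms.
import numpy as np
from sklearn.cluster import MiniBatchKMeans  # assumed scalable K-means

def make_groups(n_images, n_groups=4, seed=0):
    # Approximation: shuffle and deal images round-robin into the groups,
    # which keeps group sizes (and, roughly, concept proportions) balanced.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_images)
    return [order[g::n_groups] for g in range(n_groups)]

def build_codebook(points, n_words=2048, seed=0):
    # `points` is the ~750,000 x dim descriptor pool of one group.
    return MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(points)

def bow_histogram(descriptors, codebook):
    words = codebook.predict(descriptors)  # nearest visual word per point
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1-normalized BoW vector
]]></preformat>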
</sec>
    <sec id="sec-3">
      <title>Concept detection by dynamic combination of visual classifiers</title>
      <p>We submitted three runs in total. All of these runs are based only on the bag of visual words descriptors illustrated in the previous section.</p>
<p>
        For all the runs, we used the Multiple Classifier System paradigm, and the Support Vector Machine has been used as the base classifier for its good performance on various image classification tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We trained a single SVM for each global image descriptor and each visual concept. Thus, for each concept i, a set of four SVMs <inline-formula><tex-math><![CDATA[\{S_i^{sift}, S_i^{rgb}, S_i^{color}, S_i^{opponent}\}]]></tex-math></inline-formula> is available.
      </p>
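      <p>
        A hedged sketch of this SVM bank is shown below; the linear kernel, the regularization constant, and all variable names are our assumptions, as the paper does not report the SVM configuration:
      </p>
      <preformat><![CDATA[
# Sketch: one SVM per (concept, descriptor) pair. X[desc] is the
# (n_images, 2048) BoW matrix for a descriptor, Y the (n_images, 94)
# binary concept-label matrix; all names are illustrative.
from sklearn.svm import LinearSVC

DESCRIPTORS = ["sift", "rgb", "color", "opponent"]

def train_svm_bank(X, Y):
    bank = {}
    for desc in DESCRIPTORS:
        for i in range(Y.shape[1]):  # one binary problem per concept
            clf = LinearSVC(C=1.0)   # kernel and C are assumptions
            clf.fit(X[desc], Y[:, i])
            bank[(i, desc)] = clf
    return bank
]]></preformat>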
<p>
        We classified all the patterns of the test set by means of these sets of SVMs. Thus, for each test pattern <inline-formula><tex-math><![CDATA[x_j]]></tex-math></inline-formula>, we obtained as output the class decision <inline-formula><tex-math><![CDATA[d_{ij}^{k}]]></tex-math></inline-formula> taken by the classifier k (i.e., 1 if the pattern belongs to the concept i, 0 otherwise), and the distance from the decision border is transformed through a min-max normalization into a classification score <inline-formula><tex-math><![CDATA[s_{ij}^{k}]]></tex-math></inline-formula>, in the range [0, 1], of the test pattern with respect to the concept.
      </p>
      <p>We used the following three combination rules:</p>
      <list list-type="bullet">
        <list-item><p>the Mean rule;</p></list-item>
        <list-item><p>the Dynamic Score Selection by Majority Vote;</p></list-item>
        <list-item><p>the Dynamic Score Selection by Mean rule.</p></list-item>
      </list>
      <p>
        For the Mean rule, we computed the average of the classification scores obtained from the K = 4 classifiers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
      </p>
      <disp-formula id="eq1">
        <label>(1)</label>
        <tex-math><![CDATA[s_{ij}^{mean} = \frac{1}{K} \sum_{k=1}^{K} s_{ij}^{k}]]></tex-math>
      </disp-formula>
      <p>
        In the case of the other two combination rules, we used the Dynamic Score Selection (DSC) approach [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which mixes the minimum and maximum of the individual scores through a weight <inline-formula><tex-math><![CDATA[w_{ij}]]></tex-math></inline-formula>:
      </p>
      <disp-formula id="eq2">
        <label>(2)</label>
        <tex-math><![CDATA[s_{ij}^{dsc} = (1 - w_{ij}) \min_k \{ s_{ij}^{k} \} + w_{ij} \max_k \{ s_{ij}^{k} \}]]></tex-math>
      </disp-formula>
      <p>
        This combination rule performs a dynamic combination at the score level, as it allows the best scores and weights to be chosen dynamically. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] different methods to compute the weights dynamically are proposed. In these runs, we used one of those methods, and one that has been specifically designed for the task at hand.
      </p>
      <p>The rule for computing <inline-formula><tex-math><![CDATA[w_{ij}]]></tex-math></inline-formula> for the Dynamic Score Selection by Majority Vote is the following:</p>
      <disp-formula id="eq3">
        <label>(3)</label>
        <tex-math><![CDATA[w_{ij} = \begin{cases} 1, & \text{if at least half of the } d_{ij}^{k} \text{ are equal to } 1 \\ 0, & \text{otherwise} \end{cases}]]></tex-math>
      </disp-formula>
      <p>The rule for computing <inline-formula><tex-math><![CDATA[w_{ij}]]></tex-math></inline-formula> for the Dynamic Score Selection by Mean rule is the following:</p>
      <disp-formula id="eq4">
        <label>(4)</label>
        <tex-math><![CDATA[w_{ij} = \frac{1}{K} \sum_{k=1}^{K} s_{ij}^{k}]]></tex-math>
      </disp-formula>
</sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>
        The performances (Interpolated Mean Average Precision (MiAP), Interpolated Geometric Mean Average Precision (GMiAP), and F1-measure over all concepts) attained by our runs are listed in Table 1, where they are compared to the performances obtained by the other teams that used visual features only. Detailed information about the evaluation process can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
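      <p>
        For reference, a sketch of the per-concept evaluation under the common interpolated-AP convention is given below; the exact definitions used by the benchmark are those of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], so this is only an approximation:
      </p>
      <preformat><![CDATA[
# Sketch: interpolated average precision per concept, then MiAP / GMiAP
# as the arithmetic and geometric means over all concepts.
import numpy as np

def average_precision(y_true, scores):
    order = np.argsort(-scores)            # rank images by decreasing score
    rel = np.asarray(y_true)[order]
    prec = np.cumsum(rel) / (np.arange(len(rel)) + 1.0)
    prec = np.maximum.accumulate(prec[::-1])[::-1]  # interpolation step
    return prec[rel == 1].mean()

def miap_gmiap(Y_true, S):                 # both arrays: (n_images, n_concepts)
    aps = np.array([average_precision(Y_true[:, i], S[:, i])
                    for i in range(Y_true.shape[1])])
    return aps.mean(), np.exp(np.log(aps + 1e-12).mean())
]]></preformat>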
<p>A first conclusion that can be drawn from Table 1 is that using a combination of general-purpose classifiers does not allow obtaining very satisfactory results, as we ranked just tenth out of thirteen participants.</p>
<p>The results also show that the Dynamic Score Selection by Majority Vote does not work as expected, as it is outperformed by the Dynamic Score Selection by Mean rule for the MiAP and GMiAP measures, and by the Mean rule for the F1-measure.</p>
      <p>[Fig. 1: Average Precision per concept for the three submitted runs (Dynamic Mean Rule, Mean rule, Dynamic Majority Vote) compared to the median of all submissions. The x-axis lists the 94 concepts, grouped by category (weather, combustion, lighting, scape, water, flora, fauna, quantity, age, relation, quality, style, view, setting, sentiment, and transport); the y-axis reports Average Precision from 0 to 1.]</p>
    </sec>
    <sec id="sec-2">
      <title>Dynamic"Mean"Rule"</title>
    </sec>
    <sec id="sec-3">
      <title>Mean"rule"</title>
    </sec>
    <sec id="sec-4">
      <title>Dynamic"Majority"Vote"</title>
    </sec>
    <sec id="sec-5">
      <title>MEDIAN"</title>
      <p>Conclusions
In our participation to the ImageCLEF photo annotation task, multiple visual
features has been used for representing the images. We combine the di erent
information using the Bag-of-Words model taking care that a number of image
descriptor big enough was used for each image. After the BoW extraction, we
combined the four feature spaces in three di erent ways. The evaluation results
showed that a simple combination of di erent feature spaces using classi ers not
speci cally designed for taking into account the big variety of concepts is not
able to reach satisfactory results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cristianini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shawe-Taylor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>An Introduction to Support Vector Machines and Other Kernel-based Learning Methods</article-title>
          . Cambridge University Press (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kittler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hatef</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duin</surname>
            ,
            <given-names>R.P.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>On combining classifiers</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>20</volume>
          (
          <issue>3</issue>
          ),
          <fpage>226</fpage>
          -
          <lpage>239</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.M.</given-names>
          </string-name>
          :
          <article-title>Evaluating color descriptors for object and scene recognition</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>32</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1582</fpage>
          -
          <lpage>1596</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Thomee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2012 Flickr photo annotation and retrieval task</article-title>
          .
          <source>CLEF 2012 working notes</source>
          , Rome, Italy (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tronci</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giacinto</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Dynamic score combination: A supervised and unsupervised score combination method</article-title>
          .
          <source>Machine Learning and Data Mining in Pattern Recognition</source>
          <volume>5632</volume>
          ,
          <fpage>163</fpage>
          -
          <lpage>177</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>