<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIRIS-Imagine at ImageCLEF 2012 Photo Annotation task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ningning Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanuel Dellandrea</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liming Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aliaksandr Trus</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chao Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles-Edmond Bichot</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephane Bres</string-name>
          <email>stephane.bres@insa-lyon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bruno Tellez</string-name>
          <email>bruno.tellez@univ-lyon1.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université de Lyon</institution>
          ,
          <addr-line>CNRS, École Centrale de Lyon, LIRIS, UMR5205, F-69134</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present the methods we have proposed and evaluated through the ImageCLEF 2012 Photo Annotation task. More precisely, we have proposed the Histogram of Textual Concepts (HTC) textual feature to capture the relatedness of semantic concepts. In contrast to the term frequency-based text representations mostly used for visual concept detection and annotation, HTC relies on the semantic similarity between user tags and a concept dictionary. Moreover, a Selective Weighted Late Fusion (SWLF) scheme is introduced to combine multiple sources of information by iteratively selecting and weighting the best features for each concept to be classified. The results have shown that the combination of our HTC feature with visual features through SWLF can improve the performance significantly. Our best model, which is a late fusion of textual and visual features, achieved a MiAP (Mean interpolated Average Precision) of 43.67% and ranked first out of the 80 submitted runs.</p>
      </abstract>
      <kwd-group>
        <kwd>textual features</kwd>
        <kwd>visual features</kwd>
        <kwd>feature fusion</kwd>
        <kwd>concept detection</kwd>
        <kwd>photo annotation</kwd>
        <kwd>multimodality</kwd>
        <kwd>ImageCLEF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Machine-based recognition of visual concepts aims at automatically recognizing
high-level semantic concepts (HLSC) from images, including scenes (indoor,
outdoor, landscape, etc.), objects (car, animal, person, etc.), events (travel, work,
etc.), or even emotions (melancholic, happy, etc.). It proves to be extremely
challenging because of large intra-class variations (clutter, occlusion, pose changes,
etc.) and inter-class similarities [1-4]. The past decade has witnessed tremendous
efforts from the research communities, as testified by the multiple challenges in the
field, e.g., ImageCLEF [5-8], TRECVID [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Pascal VOC [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. An increasing number of
works in the literature have discovered the wealth of semantic meaning
conveyed by the abundant textual captions associated with images [11-13]. As a
result, multimodal approaches have been increasingly proposed for the visual concept
detection and annotation task (VCDT), making joint use of user textual tags
and visual descriptions to bridge the gap between low-level visual features and
HLSC [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The VCDT is a multi-label classification challenge. It aims at the automatic
annotation of a large number of consumer photos with multiple annotations.
Remarkable works have been proposed for the ImageCLEF photo
annotation tasks. The LEAR and XRCE group [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] in ImageCLEF 2010 employed
the Fisher vector image representation with the TagProp method for image
auto-annotation. The TUBFI group [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] in ImageCLEF 2011 built textual features
using a soft mapping of textual Bag-of-Words (BoW) and Markov random walks
based on frequent Flickr user tags. Our group in ImageCLEF 2011 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] first
proposed a novel textual representation, named Histogram of Textual Concepts
(HTC), which captures the relatedness of semantic concepts. Meanwhile, we also
proposed a novel Selective Weighted Late Fusion (SWLF) method, which
automatically selects and weights the best discriminative features for each visual concept
to be predicted, optimizing the overall mean average precision. This year, we
have improved our approaches in the following aspects:
- We evaluated different textual preprocessing methods, and proposed
enhanced HTC features using term frequency information. Meanwhile, we
implemented two types of distributional term representations: document
occurrence representation (DOR) and DOR TF-IDF [17].
- We investigated a set of mid-level features, which are related to harmony,
dynamism, aesthetic quality, emotional color representation, etc. Meanwhile,
we improved the harmony and dynamism features by adding local
information.
      </p>
      <p>The rest of this paper is organized as follows. The features are introduced
in Section 2, including textual and visual features as well as the fusion scheme
proposed to combine them. The results are analysed in Section 3. Finally, Section
4 draws the conclusion and gives some hints for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Features for semantic concepts recognition</title>
      <p>In this section, we first present the textual features, including HTC and
enhanced HTC, in Section 2.1, followed by the description (Section 2.2) of visual
features, which can be categorized into four groups: color, texture, shape and
mid-level. The feature fusion scheme, SWLF, is presented in Section 2.3.</p>
      <sec id="sec-2-1">
        <title>Textual features</title>
        <p>The Histogram of Textual Concepts, HTC, of a text document is defined as a
histogram based on a vocabulary or dictionary where each bin of this histogram
represents a concept of the dictionary, whereas its value is the accumulation of
the contribution of each word within the text document toward the underlying
concept according to a predefined semantic similarity measure.</p>
        <p>The advantages of HTC are multiple. First, for a sparse text document such
as image tags, HTC offers a smooth description of the semantic relatedness of
user tags over a set of textual concepts defined within the dictionary. More
importantly, in the case of polysemy, HTC helps disambiguate textual concepts
according to the context. For instance, the concept "bank" can refer to a
financial intermediary but also to the shoreline of a river. However, when a
tag "bank" comes with a photo showing a financial institution, correlated tags
such as "finance", "building", "money", etc., are very likely to be used, thereby
clearly distinguishing the concept "bank" in finance from that of a river, where
correlated tags can be "water", "boat", "river", etc. Similarly, in the case of
synonyms, the HTC will reinforce the concept related to the synonym as far
as the semantic similarity measurement takes into account the phenomenon of
synonymy. The algorithm for the extraction of an HTC feature is detailed below:
The Histogram of Textual Concepts (HTC) algorithm:
Input: tag data W = {w_t} with t ∈ [1, T]; a dictionary D = {d_i} with i ∈ [1, d].
Output: histogram f composed of values f_i with 0 ≤ f_i ≤ 1, i ∈ [1, d].
- Preprocess the tags by using a stop-words filter.
- If the input image has no tags (W = ∅), return f with f_i = 0.5 for all i (see footnote 1).
- For each word w_t ∈ W:
1. Calculate dist(w_t, d_i), where dist is a semantic similarity distance between w_t and d_i.
2. Obtain the semantic matrix S as: S(t, i) = dist(w_t, d_i).
- Calculate the feature f as: f_i = Σ_{t=1}^{T} S(t, i), and normalize it to [0, 1] as: f_i = f_i / Σ_{j=1}^{d} f_j.</p>
        <p>Footnote 1: when an input image has no tags at all, we simply assume in this work that every
bin value is 0.5, therefore halfway between a semantic similarity measurement of 0 (no
relationship at all with the corresponding concept in the dictionary) and 1 (full similarity with
the corresponding concept in the dictionary). Alternatively, these values can also be set to the
mean of the HTCs over the captioned images of a training set.</p>
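        <p>To make the procedure concrete, the following is a minimal Python sketch of the HTC extraction, assuming a generic similarity(word, concept) function with values in [0, 1] (e.g., a WordNet path similarity); the function and variable names are ours, for illustration only:</p>
        <preformat>
from typing import Callable, List

def htc(tags: List[str], dictionary: List[str],
        similarity: Callable[[str, str], float],
        stop_words: frozenset = frozenset()) -> List[float]:
    """Histogram of Textual Concepts: one bin per dictionary concept."""
    # stop-words filtering
    words = [w for w in tags if w not in stop_words]
    if not words:
        # untagged image: neutral histogram (see footnote 1)
        return [0.5] * len(dictionary)
    # accumulate each word's contribution toward every concept
    f = [sum(similarity(w, d) for w in words) for d in dictionary]
    total = sum(f)
    # normalize so the bins sum to 1 (guard against an all-zero histogram)
    return [fi / total for fi in f] if total else [0.5] * len(dictionary)
        </preformat>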
        <p>The computation of HTC requires the definition of a dictionary and a proper
semantic relatedness measurement over textual concepts. For the ImageCLEF
2012 photo annotation task, we used two types of dictionaries. The first one
is a dictionary based on term frequency on the training set; e.g., the dictionary
TF 10T consists of the top 10 thousand words sorted by their frequencies in the
training set. The second one, D Anew, is the set of 1034 English words
used in the ANEW study [18]. The interest of the ANEW dictionary lies in the
fact that each of its words is rated on a scale from 1 to 9 using affective norms
in terms of valence (the affective dimension expressing positive versus negative),
arousal (the affective dimension expressing active versus inactive) and dominance
(the affective dimension expressing dominated versus in control).</p>
        <p>Table 1. Summary of the textual features:
- txtFtr HTC Danew: obtained by using the WordNet path distance on the ANEW dictionary.
- txtFtr TFIDF Danew: obtained on the ANEW dictionary.
- txtFtr eHTC Danew: obtained by adding each bin of txtFtr HTC Danew and txtFtr TFIDF Danew.
- txtFtr HTC VAD: obtained using Eq. 1, Eq. 2 and Eq. 3.
- txtFtr HTC TF 10T: obtained by using the WordNet path distance on the TF 10T dictionary, i.e., the top 10 thousand words sorted by term frequency.
- txtFtr TFIDF TF 10T: obtained on the TF 10T dictionary.
- txtFtr HTC TF 20T: obtained by using the WordNet path distance on the TF 20T dictionary.
- txtFtr TFIDF TF 20T: obtained on the TF 20T dictionary.
- txtFtr eHTC TF 20T: obtained by adding each bin of txtFtr HTC TF 20T and txtFtr TFIDF TF 20T.</p>
        <p>For instance,
according to ANEW, the concept "beauty" has a mean valence of 7.82, a mean
arousal of 4.95 and a mean dominance of 5.23, while the concept "bird" has
a mean valence of 7.27, a mean arousal of 3.17 and a mean dominance of
4.42. Using the affective ratings of the ANEW concepts and the HTCs computed
over image tags, one can further define the coordinates of an image caption in
the three-dimensional affective space [19], in terms of valence, arousal and
dominance, by taking a linear combination of the ANEW concepts weighted by the
corresponding HTC values. More precisely, given an HTC descriptor f extracted
from a text document, the valence, arousal and dominance coordinates of the
text document can be computed as follows:
f_valence = (1/d) Σ_i (f_i · V_i)    (1)
f_arousal = (1/d) Σ_i (f_i · A_i)    (2)
f_dominance = (1/d) Σ_i (f_i · D_i)    (3)</p>
        <p>where V_i, A_i and D_i are respectively the valence, the arousal and the dominance
of the i-th word w_i in the D Anew dictionary, and d is the size of D Anew.</p>
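        <p>A direct Python transcription of Eqs. (1)-(3) follows, as a sketch under the notations above; V, A and D hold the per-word ANEW ratings, and the names are ours:</p>
        <preformat>
def vad_coordinates(f, V, A, D):
    """Map an HTC vector f over D Anew to (valence, arousal, dominance)."""
    d = len(f)
    valence = sum(fi * vi for fi, vi in zip(f, V)) / d      # Eq. (1)
    arousal = sum(fi * ai for fi, ai in zip(f, A)) / d      # Eq. (2)
    dominance = sum(fi * di for fi, di in zip(f, D)) / d    # Eq. (3)
    return valence, arousal, dominance
        </preformat>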
        <p>The HTC features fail to calculate the semantic distance between two terms when
the semantic relatedness measurement is not defined between them.
In order to cope with this problem, we enhanced the HTC features by combining
them with TF/IDF features in a simple way: sum the values on each bin, and then
normalize, for the same dictionary. Meanwhile, we employed the distributional
term representations DOR and DOR-TF/IDF [17]. A summary of the textual features
is given in Table 1.</p>
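        <p>A minimal sketch of this bin-wise combination (our naming; both vectors are assumed to be built on the same dictionary):</p>
        <preformat>
def ehtc(htc_vec, tfidf_vec):
    """Enhanced HTC: sum the HTC and TF/IDF bins, then re-normalize."""
    combined = [h + t for h, t in zip(htc_vec, tfidf_vec)]
    total = sum(combined)
    return [c / total for c in combined] if total else combined
        </preformat>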
      </sec>
      <sec id="sec-2-2">
        <title>Visual features</title>
        <p>
          For the ImageCLEF 2011 photo annotation task, we have introduced various
visual features to describe interesting details and to catch the global image
atmosphere. Thus, 5 groups of features have been considered: color, texture, shape,
local descriptor and mid-level features [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. This year, we have enriched this set
of visual features by adding color SIFT features with 4000 codewords and soft
assignment [20] and the TOP-SURF feature [21]. Moreover, we have enhanced the
mid-level harmony and dynamism features by adding local information through
their computation using a pyramid grid.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Feature fusion through SWLF</title>
        <p>In order to combine textual and visual features efficiently, we have proposed
a Selective Weighted Late Fusion (SWLF) scheme which learns to automatically
select and weight the best features for each visual concept to be recognized.</p>
        <p>The SWLF scheme has a learning phase which requires a training dataset for
the selection of the best experts and their corresponding weights for each visual
concept. Specifically, given a training dataset, we divide it into two disjoint parts
composed of a training set and a validation set. For each visual concept, a binary
classifier (concept versus no concept), also called an expert in the following, is
trained for each type of feature using the data in the training set. Thus,
for each concept, we generate as many experts as the number of different types
of features. The quality of each expert can then be evaluated through a quality
metric using the data in the validation set. In this work, the quality metric is
chosen to be the interpolated Average Precision (iAP). The higher the iAP is for
a given expert, the more weight should be given to the score delivered by that
expert in the late fusion. This fusion is performed as the sum of the weighted
scores. More details on SWLF can be found in [22].</p>
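        <p>The following Python sketch illustrates our reading of the SWLF fusion for a single concept; the data structures (experts mapping a feature name to its validation iAP, scores mapping it to that expert's test scores) are hypothetical, and the normalization of the weights to sum to one is our assumption:</p>
        <preformat>
def swlf_fuse(experts: dict, scores: dict, top_n: int) -> list:
    """Select the top_n experts by validation iAP and fuse their test
    scores as an iAP-weighted sum."""
    best = sorted(experts, key=experts.get, reverse=True)[:top_n]
    weight_sum = sum(experts[name] for name in best)
    n_samples = len(next(iter(scores.values())))
    fused = [0.0] * n_samples
    for name in best:
        w = experts[name] / weight_sum   # weight proportional to iAP
        fused = [f + w * s for f, s in zip(fused, scores[name])]
    return fused
        </preformat>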
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>Our methods have been evaluated through the ImageCLEF 2012 photo
annotation task, and particularly through the visual concept detection, annotation
and retrieval subtask, whose details are provided in [23]. There are 94 concepts
to detect automatically, which can be categorized into 5 groups: natural elements
(day, night, sunrise, etc.), environment (desert, coast, landscape, etc.), people
(baby, child, teenager, etc.), image elements (in focus, city life, active, etc.) and
human elements (rail vehicle, water vehicle, air vehicle, etc.).</p>
      <p>In order to obtain stable and better performance, we divided the training
set into a training part (50%, 7501 images) and a validation part (50%, 7499
images), as required by the SWLF scheme presented in Section 2.3.</p>
      <sec id="sec-3-1">
        <title>The submitted runs</title>
        <p>We submitted 5 runs to the ImageCLEF 2012 photo annotation challenge (2
textual models, 1 visual model and 2 multimodal models). All runs were based on
the features described in the previous sections, including 11 textual ones and 32
visual ones. For the example-based evaluation, we propose two methods to choose
the threshold; a sketch of the first one is given after the list of runs below. The
first is based on the distribution of the training data. More specifically,
we first calculate the distribution of concepts on the training set; then, for
each concept, we set the threshold at the boundary which makes the proportion
of positive samples the same as in the training data. The idea is that the
training and test sets are assumed to share the same distribution for each concept.
The other is to select the threshold which achieves the best F-measure value
on the validation set. Based on previous experiments and observations, we
performed our runs with the following configurations:
1. textual model 1: the combination of the top 4 features among the 11
textual features for each concept, based on the weighted score SWLF scheme.
2. textual model 2: the combination of the top 6 features among the 11
textual features for each concept, based on the weighted score SWLF scheme.
3. visual model 3: the combination of the top 5 features among the 24 visual
features for each concept, based on the weighted score SWLF scheme.
4. multimodal model 4: the combination of the top 22 features among the
43 visual and textual features for each concept, based on the weighted score
SWLF scheme.
5. multimodal model 5: the combination of the top 26 features among the
43 visual and textual features for each concept, based on the weighted score
SWLF scheme.</p>
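        <p>A sketch of the first, distribution-based thresholding strategy (names are illustrative; the second strategy would instead scan candidate thresholds and keep the one maximizing the F-measure on the validation set):</p>
        <preformat>
def threshold_from_prior(test_scores, train_positive_rate):
    """Pick the score threshold that reproduces, on the test scores, the
    concept's positive-example proportion observed on the training set."""
    ranked = sorted(test_scores, reverse=True)
    k = round(train_positive_rate * len(ranked))  # expected number of positives
    # the k-th highest score becomes the decision threshold
    return ranked[min(k, len(ranked) - 1)]
        </preformat>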
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>The results obtained by our 5 runs are given in Table 2. The best performance
was provided by our multimodal models which outperformed the purely textual
and purely visual ones. Moreover, our best model obtained the first rank based
on the MiAP among the 80 runs submitted to the challenge.</p>
        <p>Table 2. Results (MiAP) of the 5 submitted runs (text model 1, text model 2,
visual model 3, multimodal model 4, multimodal model 5); the best run,
multimodal model 5, achieved a MiAP of 43.67%.</p>
        <p>For the textual features, we applied two preprocessing methods. One is the
removal of stop words. The other is stemming in 4 languages (English, German,
French, Italian). On the ImageCLEF 2012 photo annotation dataset, we find that
after these two preprocessing steps, the MiAP performance of the term frequency
features, e.g., TF/IDF and DOR, improves by about 1%. However, stemming is
not appropriate for the HTC features, as the semantic similarity measurement
can no longer be computed after stemming.</p>
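        <p>This preprocessing can be sketched as follows; the paper does not name a toolkit, so NLTK's SnowballStemmer is used here as one possible implementation, and the per-language handling is our assumption:</p>
        <preformat>
from nltk.stem.snowball import SnowballStemmer

def preprocess(tags, stop_words, language="english"):
    """Stop-word removal followed by stemming in the given language
    ("english", "german", "french" or "italian"). Suitable for the term
    frequency features (TF/IDF, DOR), but not for HTC, whose
    WordNet-based similarity is undefined on stemmed forms."""
    stemmer = SnowballStemmer(language)
    kept = [t.lower() for t in tags if t.lower() not in stop_words]
    return [stemmer.stem(t) for t in kept]
        </preformat>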
        <p>For the visual features, the harmony and dynamism features computed
locally using a pyramid grid achieved a 3% improvement in MiAP compared to
the original ones.</p>
        <p>For the HTC, we tested several WordNet semantic distance methods,
including path, wup and lin. We found that the path distance obtained the best
performance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have presented in this paper the models that we have evaluated through
the ImageCLEF 2012 photo annotation challenge. Our best multimodal
prediction model, which relies on the fusion through SWLF of our textual features
(HTC) and visual features including low-level and mid-level information, achieved
a MiAP of 43.67% and ranked first out of the 80 submitted runs.
From the experimental results, we can conclude the following: (i) the proposed
multimodal approach greatly improves the performance over purely textual and
purely visual ones, with a MiAP about 9% higher than the best visual-only model; (ii)
the experts fused through weighted score-based SWLF display very good
generalization on unseen test data and prove particularly useful for the image
annotation task with multi-label scenarios in efficiently fusing visual and textual
features.</p>
      <p>In our future work, we envisage further investigation of the interplay between
textual and visual content, studying in particular visual relatedness with
regard to textual concepts. We also want to study mid-level visual features
or representations, for instance using an attentional model, which better account
for affect-related concepts.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This work was supported in part by the French research agency ANR through
the VideoSense project under the grant 2009 CORD 026 02.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>A. W. M. Smeulders</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Santini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Content-based image retrieval at the end of the early years</article-title>
          ,
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>22</volume>
          (
          <year>2000</year>
          )
          <volume>1349</volume>
          {
          <fpage>1380</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Mojsilovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rogowitz</surname>
          </string-name>
          ,
          <article-title>Semantic-friendly indexing and quering of images based on the extraction of the objective semantic cues</article-title>
          ,
          <source>Int. J. Comput. Vision</source>
          <volume>56</volume>
          (
          <year>2004</year>
          )
          <volume>79</volume>
          {
          <fpage>107</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Automatic linguistic indexing of pictures by a statistical modeling approach</article-title>
          ,
          <source>IEEE Trans. Pattern Anal. Mach</source>
          . Intell. (
          <year>2003</year>
          )
          <volume>1075</volume>
          {
          <fpage>1088</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Lew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Djeraba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Content-based multimedia information retrieval: State of the art and challenges</article-title>
          ,
          <source>TOMCCAP</source>
          (
          <year>2006</year>
          )
          <volume>1</volume>
          {
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>M. J. Huiskes</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          <string-name>
            <surname>Lew</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          <string-name>
            <surname>Lew</surname>
          </string-name>
          ,
          <article-title>The mir ickr retrieval evaluation</article-title>
          ,
          <source>in: Multimedia Information Retrieval</source>
          ,
          <year>2008</year>
          , pp.
          <volume>39</volume>
          {
          <fpage>43</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>M. J. Huiskes</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Thomee</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          <string-name>
            <surname>Lew</surname>
          </string-name>
          ,
          <article-title>New trends and ideas in visual concept detection: The mir ickr retrieval evaluation initiative</article-title>
          ,
          <source>in: MIR '10: Proceedings of the 2010 ACM International Conference on Multimedia Information Retrieval</source>
          ,
          <year>2010</year>
          , pp.
          <volume>527</volume>
          {
          <fpage>536</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Nowak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nagel</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Liebetrau,</surname>
          </string-name>
          <article-title>The clef 2011 photo annotation and conceptbased retrieval tasks</article-title>
          , in: CLEF Workshop Notebook Paper,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Nowak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Huiskes</surname>
          </string-name>
          ,
          <article-title>New strategies for image annotation: Overview of the photo annotation task at imageclef 2010</article-title>
          , in: CLEF Workshop Notebook Paper,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Over</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          ,
          <article-title>Evaluation campaigns and trecvid</article-title>
          ,
          <source>in: MIR '06: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval</source>
          ,
          <year>2006</year>
          , pp.
          <volume>321</volume>
          {
          <fpage>330</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>M. Everingham</surname>
            ,
            <given-names>L. J. V.</given-names>
          </string-name>
          <string-name>
            <surname>Gool</surname>
            ,
            <given-names>C. K. I.</given-names>
          </string-name>
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          <string-name>
            <surname>Winn</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>The pascal visual object classes (voc) challenge</article-title>
          ,
          <source>Int. J. Comput. Vision</source>
          (
          <year>2010</year>
          )
          <volume>303</volume>
          {
          <fpage>338</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoiem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Forsyth</surname>
          </string-name>
          ,
          <article-title>Building text features for object image classication</article-title>
          .,
          <source>in: CVPR</source>
          ,
          <year>2009</year>
          , pp.
          <volume>1367</volume>
          {
          <fpage>1374</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>J. Sivic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Video google: A text retrieval approach to object matching in videos</article-title>
          , in: ICCV,
          <year>2003</year>
          , pp.
          <volume>1470</volume>
          {
          <fpage>1477</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>M. Guillaumin</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          <string-name>
            <surname>Verbeek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>Multimodal semi-supervised learning for image classi cation</article-title>
          .,
          <source>in: CVPR</source>
          ,
          <year>2010</year>
          , pp.
          <volume>902</volume>
          {
          <fpage>909</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. T. Mensink,
          <string-name>
            <given-names>G.</given-names>
            <surname>Csurka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Snchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          ,
          <article-title>Lear and xrce's participation to visual concept detection task</article-title>
          - imageclef
          <year>2010</year>
          , in: CLEF Workshop Notebook Paper,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>A.</given-names>
            <surname>Binder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Samek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kloft</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Muller,</article-title>
          <string-name>
            <surname>K.-R. Mu</surname>
          </string-name>
          <article-title>ller, M. Kawanabe, The joint submission of the tu berlin and fraunhofer rst (tub ) to the imageclef2011 photo annotation task</article-title>
          , in: CLEF Workshop Notebook Paper,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , C.-E. Bichot,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tellez</surname>
          </string-name>
          , LIRIS-Imagine at
          <article-title>ImageCLEF 2011 Photo Annotation task</article-title>
          , in: CLEF Workshop Notebook Paper,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>