<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CEA LIST's participation to the Scalable Concept Image Annotation task of ImageCLEF 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Etienne Gadeski</string-name>
          <email>etienne.gadeski@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hervé Le Borgne</string-name>
          <email>herve.le-borgne@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Popescu</string-name>
          <email>adrian.popescu@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CEA, LIST, Laboratory of Vision and Content Engineering</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our participation in the ImageCLEF 2015 Scalable Concept Image Annotation task. Our system is based only on visual features extracted with a deep Convolutional Neural Network (CNN). The network is trained with noisy web data corresponding to the concepts to detect in this task. We introduce a simple concept localization pipeline that localizes the detected concepts, among the 251 target concepts, within the images.</p>
      </abstract>
      <kwd-group>
        <kwd>image annotation</kwd>
        <kwd>classification</kwd>
        <kwd>concept localization</kwd>
        <kwd>convolutional neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Scalable Concept Image Annotation subtask of ImageCLEF 2015 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is
described in detail in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The system we propose relies only on visual features
extracted with a deep Convolutional Neural Network (CNN). In addition to
the specific training of this CNN for the task, we propose a concept
localization pipeline which uses the spatial information that CNNs offer. We are thus able
to propose a single concept annotation and localization framework for the 251
concepts. Section 2 outlines our method in both the training and testing steps. In
Sect. 3 we present our experiments, report our results and discuss them.
      </p>
      <p>
We now detail the training and testing frameworks we used. Our
method is based upon deep CNNs, which have lately shown outstanding
performance in diverse computer vision tasks such as object classification, localization,
action recognition, etc. [
        <xref ref-type="bibr" rid="ref3">3, 4</xref>
        ].
      </p>
      <sec id="sec-1-1">
        <title>Training</title>
        <p>Data. We have crawled a set of 251,000 images (1,000 images per concept)
from the Bing Images search engine. For each concept we used its name and its
synonyms (if present) to query the search engine. This dataset is of course noisy,
but we believe this is not a major issue for training a deep CNN [5, 6]. We used only
this additional data to train a 16-layer CNN [7]. We used 90% of the dataset for
training and 10% for validation.</p>
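        <p>The sketch below illustrates this collection and splitting step in Python; the helper names are hypothetical and do not correspond to our actual crawler.</p>
        <preformat>
import random

def build_queries(concept, synonyms):
    """One image-search query per term: the concept name plus its synonyms, if any."""
    return [concept] + list(synonyms)

def split_train_val(image_paths, val_ratio=0.1, seed=0):
    """90% / 10% split of the crawled images into training and validation sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_val = int(len(paths) * val_ratio)
    return paths[n_val:], paths[:n_val]

# roughly 1,000 images are kept per concept, i.e. about 251,000 images in total
        </preformat>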
        <p>Network Settings. The network is initialized with ImageNet weights. The
initial learning rate is set to 0.001 and the batch size is set to 256. The last layer
(the classifier) is trained from scratch, i.e. it is initialized with random weights
sampled from a Gaussian distribution (σ = 0.01, μ = 0) and its learning rate
is 10 times larger than for the other layers. During training, the dataset is enhanced
with random transformations: RGB jittering, scale jittering, shearing, rotation,
contrast adjustment, etc. It is known that data augmentation leads to better
models [8] and reduces overfitting. Finally, the network takes a 224 × 224 RGB
image as input and produces 251 outputs, i.e. one per concept. The
models were trained on a single Nvidia Titan Black with our modified version
of the Caffe framework [9].</p>
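        <p>Our models were trained with our modified version of Caffe [9]; the PyTorch sketch below only illustrates the settings above, and the augmentation magnitudes and the use of a torchvision VGG-16-style model are assumptions made for the example.</p>
        <preformat>
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms

# 16-layer CNN [7] initialized with ImageNet weights
net = models.vgg16(weights="IMAGENET1K_V1")

# last layer trained from scratch: 251 outputs, Gaussian init (mean 0, std 0.01)
classifier = nn.Linear(4096, 251)
nn.init.normal_(classifier.weight, mean=0.0, std=0.01)
nn.init.zeros_(classifier.bias)
net.classifier[6] = classifier

# base learning rate 0.001; the new last layer learns 10 times faster
base_params = [p for name, p in net.named_parameters()
               if not name.startswith("classifier.6")]
optimizer = torch.optim.SGD(
    [{"params": base_params, "lr": 0.001},
     {"params": net.classifier[6].parameters(), "lr": 0.01}],
    momentum=0.9)

# random transformations applied during training (illustrative magnitudes)
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                      # scale jittering
    transforms.RandomAffine(degrees=10, shear=10),          # rotation, shearing
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2),                 # RGB jitter, contrast
    transforms.ToTensor(),
])
        </preformat>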
      </sec>
      <sec id="sec-1-2">
        <title>Testing</title>
        <p>Feature Extraction. We convert the fine-tuned CNN model to a fully
convolutional network. To do so, we explicitly convert the first fully-connected layer to
a 7 × 7 convolution layer. The last two fully-connected layers are then converted
to 1 × 1 convolution layers. We are able to do such a conversion because fully-connected
layers are nothing more than 1 × 1 convolutional kernels and a full
connection table. This conversion allows us to feed images of any size (H × W
pixels) to the network. In practice, any image with a side smaller than 256 pixels
is isotropically rescaled so that its smallest side equals 256. In consequence, the
output of the last layer (i.e. the classifier) is no longer a single vector with 251
dimensions but rather a spatial map S of M × N × 251 dimensions. The network
has five 2 × 2 pooling layers, therefore it has a global stride of 32 pixels. We
have:
        <disp-formula id="eq1"><label>(1)</label><tex-math><![CDATA[M = \left\lfloor \frac{H}{32} \right\rfloor - 7 + 1 \qquad \text{and} \qquad N = \left\lfloor \frac{W}{32} \right\rfloor - 7 + 1 \, .]]></tex-math></disp-formula>
For instance, with a 512 × 384 image the network will produce a 10 × 6 × 251
feature map.</p>
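        <p>Our implementation of this conversion relies on our modified Caffe; the PyTorch sketch below only illustrates the idea of reshaping the fully-connected weights into convolution kernels, and the torchvision model and layer indices are assumptions made for the example.</p>
        <preformat>
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative sketch: turn a VGG-16-style network [7] into a fully
# convolutional one so that it accepts images of any size.
vgg = models.vgg16(weights=None)       # weights would come from the fine-tuned model
features = vgg.features                # conv/pool layers, global stride of 32

fc6, fc7 = vgg.classifier[0], vgg.classifier[3]

conv6 = nn.Conv2d(512, 4096, kernel_size=7)                 # first FC layer as a 7x7 conv
conv6.weight.data = fc6.weight.data.view(4096, 512, 7, 7)
conv6.bias.data = fc6.bias.data

conv7 = nn.Conv2d(4096, 4096, kernel_size=1)                # second FC layer as a 1x1 conv
conv7.weight.data = fc7.weight.data.view(4096, 4096, 1, 1)
conv7.bias.data = fc7.bias.data

# classifier: one output per concept (in practice its weights are also
# copied from the fine-tuned 251-way last layer)
conv8 = nn.Conv2d(4096, 251, kernel_size=1)

fully_conv = nn.Sequential(features, conv6, nn.ReLU(inplace=True),
                           conv7, nn.ReLU(inplace=True), conv8)

# A 512 x 384 image yields a 10 x 6 x 251 score map, as given by Eq. (1).
scores = fully_conv(torch.randn(1, 3, 512, 384))            # shape (1, 251, 10, 6)
        </preformat>
        <p>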
Concept Localization. For each image, our network outputs a spatial feature
map, S, of size M × N × 251. In each of the M × N cells of this feature map,
we obtain the probability for each concept to appear at this location. In our
experiments, we chose to apply a max pooling over all 251 concepts in each cell
to keep only the most probable concept at that location. Thus, we obtain a
unique map giving the most probable concept at each cell location:
        <disp-formula id="eq2"><label>(2)</label><tex-math><![CDATA[R = \left\{\, r_{i,j} \,:\, r_{i,j} = \arg\max_{c}\, s_{i,j}(c), \ \forall i \in [1, M], \ \forall j \in [1, N] \,\right\},]]></tex-math></disp-formula>
where s<sub>i,j</sub> is the 251-dimensional vector at position (i, j) in S. As illustrated in
Fig. 1, this gives us an approximate yet interesting localization of the different
concepts within the image. To limit the number of regions where a concept is
likely to appear, we further merge the neighboring cells which share that concept,
by retaining the largest rectangular region containing all the contiguous cells
with a given most probable concept. We map these region coordinates to the
original image coordinates. The final result can be seen in Fig. 2.
</p>
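        <p>A minimal Python sketch of this localization step is given below; it is an illustration rather than our actual code, and the direct multiplication of cell indices by the 32-pixel stride is a simplification of our coordinate mapping.</p>
        <preformat>
import numpy as np
from scipy.ndimage import label

STRIDE = 32  # global stride of the network (five 2x2 poolings)

def localize(S):
    """S is the M x N x 251 score map produced by the fully convolutional network."""
    R = S.argmax(axis=2)                     # Eq. (2): most probable concept per cell
    boxes = []
    for concept in np.unique(R):
        components, n = label(R == concept)  # groups of contiguous cells
        for k in range(1, n + 1):
            ys, xs = np.nonzero(components == k)
            # largest rectangle containing all contiguous cells, mapped to pixels
            boxes.append((int(concept),
                          xs.min() * STRIDE, ys.min() * STRIDE,
                          (xs.max() + 1) * STRIDE, (ys.max() + 1) * STRIDE))
    return R, boxes

# usage on a random 10 x 6 x 251 map, the size produced for a 512 x 384 image
R, boxes = localize(np.random.rand(10, 6, 251))
        </preformat>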
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experiments and Results</title>
      <p>We report the results obtained for the two runs submitted to the campaign and
propose ways of improving our framework.</p>
      <sec id="sec-2-1">
        <title>Runs</title>
        <p>Two runs were submitted for this task. We used two models, one per run. They
only differ in the number of iterations performed in the training step. Results
are reported in Table 1. Figure 3 shows the mean average precision (mAP) of
our runs for varying percentages of overlap between our bounding boxes and
the ground truth.
Table 1 reports, for each model, the number of training iterations (26,000 and
34,000) together with the mAP at 0% and at 50% overlap.
We note that, although there is only a small gap of 8,000 iterations of stochastic
gradient descent between the two models, the second one performs significantly
better (+2.5%) than the first one. Unfortunately, due to lack of time, we
could not train the model further, and we believe doing so would provide a
non-negligible improvement.</p>
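        <p>For reference, the sketch below shows a standard intersection-over-union computation that illustrates what a given overlap percentage means; the exact evaluation protocol of the campaign is described in [2].</p>
        <preformat>
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# at the 50% operating point, a detection matches a ground-truth box when
# overlap_ratio(predicted_box, ground_truth_box) >= 0.5
        </preformat>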
        <p>Regarding the localization, our method is quite efficient, but it is limited in its
ability to detect several similar objects forming a compact group in the image,
such as a herd of cows.</p>
        <p>However, we think there are several ways of improving our method. The first
one would be to directly train the CNN in a "fully convolutional" way, to improve
the localization of concepts that can appear at different scales. We could
also improve the robustness of our method by extracting the spatial feature map
at different scales.</p>
        <p>We proposed a simple and effective framework to annotate and localize concepts
within images. The results obtained in this campaign are an encouraging step for
us toward building a better framework for concept annotation and localization.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bromuri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amin</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohammed</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uskudarli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marvasti</surname>
            ,
            <given-names>N.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aldana</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roldán García</surname>
            ,
            <given-names>M. del Mar</given-names>
          </string-name>
          :
          <article-title>General Overview of ImageCLEF at the CLEF 2015 Labs</article-title>
          . Lecture Notes in Computer Science. Springer International Publishing (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dellandréa</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaizauskas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task</article-title>
          .
          <source>In: CLEF2015 Working Notes. CEUR Workshop Proceedings</source>
          , Toulouse, France, CEUR-WS.org (
          <issue>September 8-11</issue>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sharif Razavian</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azizpour</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sullivan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carlsson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>CNN features off-the-shelf: an astounding baseline for recognition</article-title>
          .
          <source>In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</source>
          . (
          <year>June 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Girshick, R., Donahue, J., Darrell, T., Malik, J.:
          <article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Ginsca, A.L., Popescu, A., Le Borgne, H., Ballas, N., Vo, P., Kanellos, I.:
          <article-title>Large-scale image mining with Flickr groups</article-title>
          .
          <source>In: 21st International Conference on Multimedia Modelling (MMM 15)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Vo, P.D., Ginsca, A.L., Le Borgne, H., Popescu, A.:
          <article-title>Effective training of convolutional networks using noisy web images</article-title>
          .
          <source>In: 13th International Workshop on Content Based Multimedia Indexing (CBMI)</source>
          , Prague, Czech Republic (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Simonyan, K., Zisserman, A.:
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>CoRR abs/1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Wu, R., Yan, S., Shan, Y., Dang, Q., Sun, G.:
          <article-title>Deep image: Scaling up image recognition</article-title>
          .
          <source>CoRR abs/1501.02876</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.:
          <article-title>Caffe: Convolutional architecture for fast feature embedding</article-title>
          .
          <source>arXiv preprint arXiv:1408.5093</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>