<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MRIM-LIG at ImageCLEF 2016 Scalable Concept Image Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maxime Portaz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mateusz Budnik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philippe Mulhem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johann Poignant</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNRS, LIG</institution>
          ,
          <addr-line>F-38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ. Grenoble Alpes, LIG</institution>
          ,
          <addr-line>F-38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes the participation of the MRIM research group of the LIG laboratory in the ImageCLEF 2016 scalable concept image annotation subtask 1. We used a classical framework to annotate the 500K images of this task: we tuned an existing Convolutional Neural Network model to learn the 251 concepts and to locate bounding boxes for these concepts, and we applied a specific process to handle faces and face parts. Because of time constraints, we fully processed 35% of the corpus (i.e. 180K images) and only partially processed the remaining images. For our first participation in this task, the results obtained show that we have to manage the localization in a more effective way.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Landmark face detection</kwd>
        <kwd>ImageNet</kwd>
        <kwd>TRECVID</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
The first participation of the MRIM group from the LIG laboratory at the
ImageCLEF 2016 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] scalable concept image annotation subtask 1 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is presented.
Our approach was to use a classical framework based on face detection [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
followed by facial landmark detection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for faces and face parts (eyes, nose and
mouth), and to rely on convolutional neural networks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for each of the 251
concepts.
      </p>
      <p>
The ImageCLEF 2016 scalable concept image annotation subtask 1 consists
of finding the locations of 251 classes of objects in a corpus of 500K images. This
task is challenging because of the difficulty of finding accurate object locations
in large sets of images. The objective is to assign at most 100 bounding
boxes per image, each bounding box being associated with one or more of the 251
proposed concepts. It is also possible to provide a confidence value for each
assigned tag. The visual concepts defined for this subtask do not fully match
concepts from the well-known ImageNet database [
        <xref ref-type="bibr" rid="ref1">1</xref>
], so specific
work has to be done to tackle these concepts.
      </p>
<p>Because of the time needed to process the whole corpus, we fully processed
around 35% of the image corpus (i.e. 180K images) and only partially processed
the remainder. The results obtained are therefore negatively impacted by this
partial processing.</p>
<p>The rest of this paper is organized as follows. In section 2, we define our
approach: we mainly rely on convolutional neural networks for "classical"
concepts, with a specific process dedicated to faces. Then, in section 3, we detail
the results obtained, as well as some additional elements dedicated to analyzing
our results in more detail. We conclude in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Approach</title>
      <sec id="sec-2-1">
        <title>Overview</title>
<p>The overall process applied for the detection and localization of concepts in
images is described in figure 1. We generate candidate bounding boxes, then apply
Convolutional Neural Networks for each of the 251 concepts. For face and face
part detection, we use face and facial landmark detection. Such approaches
were successfully used by several participants during the 2015 campaign of the
ImageCLEF concept annotation task. We finally rank all the labeled bounding
boxes by score or by size, depending on the run. This ranking is used as a filter
to reduce the number of boxes per image, as we keep only up to 100 boxes for
each image (a limit chosen by the organizers).</p>
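        <p>To make this last step concrete, the following is a minimal sketch (in Python, for illustration only; the function and the tuple layout are our own assumptions, not the authors' code) of keeping at most 100 labeled boxes per image, ranked either by confidence score or by surface area:</p>
        <preformat>
# Minimal illustrative sketch of the ranking/filtering stage.
# Each detection is assumed to be (concept, (x1, y1, x2, y2), score).

def rank_and_cap(detections, by="score", limit=100):
    """Keep at most `limit` boxes, ranked by confidence or by box area."""
    def area(det):
        _, (x1, y1, x2, y2), _ = det
        return (x2 - x1) * (y2 - y1)

    key = (lambda d: d[2]) if by == "score" else area
    return sorted(detections, key=key, reverse=True)[:limit]
        </preformat>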
      </sec>
      <sec id="sec-2-2">
        <title>Convolutional Neural Networks</title>
        <p>
We used a Deep Residual Convolutional Neural Network (ResNet) with 152
layers, presented by Microsoft at the ImageNet challenge [
          <xref ref-type="bibr" rid="ref4">4</xref>
]. The network
was fine-tuned to match the 251 labels from ImageCLEF. Only the final layer
was retrained.
        </p>
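        <p>The paper does not name the deep learning framework used, so the following is only an illustrative sketch, assuming PyTorch/torchvision: the pre-trained layers are frozen and the final fully-connected layer is replaced by a new 251-way classifier, which is the only part retrained.</p>
        <preformat>
# Sketch (assumed PyTorch/torchvision, not the authors' actual setup) of
# retraining only the last layer of a 152-layer ResNet for 251 concepts.
import torch.nn as nn
from torchvision import models

model = models.resnet152(pretrained=True)        # pre-trained on ImageNet
for param in model.parameters():                 # freeze all existing layers
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 251)  # new trainable final layer
        </preformat>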
      </sec>
      <sec id="sec-2-3">
        <title>Data Processed</title>
<p>A first step in the learning process was to map, when possible, the 251 CLEF
concepts to concepts from existing image collections, namely the ImageNet
concepts. From the full set C of 251 concepts, 224 map directly to
ImageNet concepts; for each of the 27 remaining concepts, we acquired 4519
images via the Bing API, using the concept name as the query. We did not
manually filter the resulting set of images.</p>
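        <p>As a rough illustration of this acquisition step, the sketch below assumes a hypothetical search_images(query, count) helper wrapping the Bing image search API; it is not the authors' crawling code.</p>
        <preformat>
# Hypothetical sketch: collect unfiltered Bing images for the 27 concepts
# that have no direct ImageNet mapping. `search_images` is assumed, not real.

def collect_training_images(unmapped_concepts, per_concept=4519):
    images = {}
    for concept in unmapped_concepts:
        # query with the concept name; results are kept unfiltered
        images[concept] = search_images(query=concept, count=per_concept)
    return images
        </preformat>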
<p>As described in figure 2, we also define a second set of images to increase the
quality of the concept detection. This second set includes both Bing API
images and the validation set (2000 images, 10000 tagged bounding boxes) provided
by the organizers of the task.</p>
      </sec>
      <sec id="sec-2-4">
        <title>CNN Processing</title>
        <p>
One specificity of our proposal is a two-step learning process (essentially
two fine-tuning stages) as a way to increase the effectiveness of the concept
detection. The CNN comes pre-trained on the ImageNet dataset [
          <xref ref-type="bibr" rid="ref1">1</xref>
]. We
used two validation sets: a) the first one is the set provided by the organizers of
the ImageCLEF task, and b) a second one that we defined to assess the quality
of the training on "clean" images. The first fine-tuning step is evaluated on these
two validation sets, while during the second learning step the first set (a) is
used for training, together with some additional images (crawled from the
Internet) for the concepts with the lowest recognition rates. After the second
fine-tuning, the system is tested only on validation set (b). In other words:
- On our first set of training images, learn the last layer of the CNN, then
evaluate (success@1 and success@5) on the two validation sets;
- During the second learning stage, for the concepts with low recognition
quality, we generate the second set of 200 additional training images per concept.
As described above, we also add the validation set (a) provided by CLEF. We
retrain the network on this combined and extended set.
        </p>
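        <p>The success@k measure used above is the usual top-k accuracy: an image counts as correct if its true concept appears among the k highest-scoring predictions. A minimal sketch, assuming a NumPy score matrix:</p>
        <preformat>
# Illustrative success@k (top-k accuracy) computation.
import numpy as np

def success_at_k(scores, labels, k):
    """scores: (n_images, n_concepts); labels: (n_images,) true concept ids."""
    topk = np.argsort(scores, axis=1)[:, -k:]   # indices of k best concepts
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))
        </preformat>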
<p>At the end of these two steps, we obtained the results presented in table 1.
The first two rows of this table present the results after the first tuning step;
the remaining two rows give the results after the second phase of fine-tuning.
The second step seems to significantly increase the performance on the Bing
validation set.</p>
<p>The ImageCLEF validation set was included in the training set at the second
stage of tuning. That is why a surprisingly strong result (denoted with "*"),
compared to the first tuning, is obtained: it does not generalize and is included
for illustrative purposes only.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Concept Localization</title>
        <p>
          We used the work of Uijlings, van de Sande, Gevers and Smeulders [
          <xref ref-type="bibr" rid="ref5">5</xref>
] to perform
selective search to define bounding box candidates. The idea is mainly to define
a priori a set of bounding boxes that are expected to contain one visual concept.
The selective search uses the Felzenszwalb algorithm [
          <xref ref-type="bibr" rid="ref2">2</xref>
] for image segmentation. In
our runs, we use a Gaussian kernel width of 0.8 and a scale factor of 500.
The minimum size for a box is set to 200 pixels. These constants give an average
of 517 boxes per image. Each of these boxes is used as an input image on
which the CNN is applied to detect objects.
        </p>
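        <p>For illustration, the segmentation stage with the reported parameters can be reproduced with the Felzenszwalb implementation in scikit-image (a stand-in, not the authors' code); selective search then hierarchically merges the resulting segments into candidate boxes.</p>
        <preformat>
# Illustrative first stage of box generation: Felzenszwalb segmentation
# with sigma 0.8, scale 500, minimum segment size 200 (values from the text).
from skimage import io
from skimage.segmentation import felzenszwalb

image = io.imread("example.jpg")    # any corpus image
segments = felzenszwalb(image, scale=500, sigma=0.8, min_size=200)
        </preformat>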
      </sec>
      <sec id="sec-2-6">
        <title>Actual Processing Achieved</title>
<p>Due to time constraints, we applied the full process (selective search and
clustering of bounding boxes, then CNN detection on each of the selected boxes)
to 180K images. On average, 517 boxes were generated per image. For each of
the remaining 320K images, we applied detection on: a) the full image, and b) a
small subset of the initial boxes selected randomly, giving an average of 8 boxes
per remaining image. Overall, we processed roughly 95 million boxes for our
submissions (180K &#215; 517 &#8776; 93M plus 320K &#215; 8 &#8776; 2.5M).</p>
      </sec>
      <sec id="sec-2-7">
        <title>Face Detection</title>
        <p>
The detection and localization of face parts is achieved through a two-step
process:
- Frontal faces are detected using the "classical" Viola and Jones approach [
          <xref ref-type="bibr" rid="ref8">8</xref>
]
based on a cascade of simple Haar-like features;
- Then 8 facial landmarks [
          <xref ref-type="bibr" rid="ref6">6</xref>
] are detected on these faces. They correspond to
the 2 mouth corners, the 4 eye canthi, the tip of the nose and the center of the
face. We then used simple heuristics to define bounding boxes for faces, eyes,
noses and mouths based on these landmarks.
        </p>
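        <p>A minimal sketch of the first step, using the OpenCV Haar-cascade implementation of the Viola and Jones detector (our choice of library for illustration; the landmark detector of [6] is a separate structured-output SVM and is not reproduced here):</p>
        <preformat>
# Illustrative frontal face detection with OpenCV's Haar cascades [8].
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
gray = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        </preformat>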
<p>All images of the ImageCLEF corpus are processed using the above steps. With
this process, at least one face is detected in 64642 of the 510K images (12.7% of
the whole corpus). A total of 91102 face "boxes" are detected in these images.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Results</title>
<p>The runs submitted by the MRIM-LIG team are the following:
- RUN1 LIG DLo: annotation using the Convolutional Neural Network
described in section 2.2, with the bounding boxes ranked by confidence value;
- RUN2 LIG DLo: annotation using the CNN described in section 2.2, with
the bounding boxes ranked by surface area;
- RUN3 LIG Fo: annotation of the face parts only, using the Viola/Jones
approach described in section 2.3;
- RUN4 LIG DLF: annotation using both the CNN and face part detection,
with the bounding boxes ranked by confidence value;
- RUN5 LIG DLF: annotation using both the CNN and face part detection,
with the bounding boxes ranked by surface area.</p>
      <sec id="sec-3-1">
        <title>Official Results</title>
        <p>The official MAP at 0% overlap and MAP at 50% overlap results of our runs
are presented in table 2. We find that RUN5 (which fuses the face part and
deep learning results, with ranking based on surface) achieves our best result
(rank 11 for overlap 0, and rank 9 for overlap 0.5). At overlap 0.5, our second
best result is RUN4 (which fuses the face part and deep learning results, with
ranking based on confidence values). The difference between RUN5 and RUN4
is negligible. We suppose this comes from the fact that only 180K images were
fully processed; for the remaining images we did not have more than 100 boxes,
and the ranking only plays a role when an image has more than 100 boxes. The
same also holds for our runs RUN1 and RUN2 (based only on deep learning
features).</p>
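        <p>We assume the overlap criterion is the usual intersection-over-union (IoU): a detected box counts at "50% overlap" when its IoU with a ground-truth box is at least 0.5. A minimal sketch of the measure:</p>
        <preformat>
# Illustrative intersection-over-union between two boxes (x1, y1, x2, y2).
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
        </preformat>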
        <p>Compared to the runs of the other participants, we find that our general runs
that integrate deep learning do not obtain very high results. This can be explained
by the fact that, as mentioned before, the whole proposed process was applied
to only 180K of the 510K images of the corpus.</p>
        <p>As expected, our run RUN3, which detects only face parts, has a very low
overall result, ranked 23 for both overlap 0 and overlap 0.5.</p>
        <p>When considering the additional official measures related to the minimum
number of boxes per image, we see a plateau above a minimum of 20 boxes.
This shows that when an image has fewer than 20 boxes in the ground truth set,
our proposal has difficulty finding relevant concepts or boxes. This can also be
attributed to the fact that we did not fully process the whole corpus, as explained
earlier.</p>
      </sec>
      <sec id="sec-3-1">
        <title>Detailed analysis of face parts results</title>
        <p>
Here we try to give additional insight into the results obtained when considering
only the face elements from the deep learning and predefined face extraction
approaches [
          <xref ref-type="bibr" rid="ref6 ref8">8, 6</xref>
]. In table 3, we present the average precision results obtained for
our runs RUN2 (deep learning only), RUN3 (face parts only), and RUN5
(fusion), for the concepts mouth, eye, nose and face.
        </p>
<p>One interesting point from table 3 is that, for the MAP at 0 and for the
face concept, the deep learning approach (RUN2) outperforms both the
predefined detection (RUN3) and the fusion (RUN5). We recall that face is a
concept already available in ImageNet; this is not the case for the other concepts.
When the localization is evaluated, the predefined detection outperforms the
deep learning approach. When considering the fusion run (RUN5), we see that
most of the time the fusion does not work properly, as it does not seem to boost
the results. The only case where the fusion outperforms the other runs is MAP
0.5 for the eye concept, and the increment is marginal.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Current limitations of the scalable concept annotation task</title>
<p>After checking the official global results and the per-concept results, we feel that:
- The size of the ground truth seems small: many per-concept AP values
are equal to 1 (or exactly 0.5, 0.25, etc.), which suggests that only very few
ground truth regions are defined for most concepts. A collaborative annotation
interface open to participants may be a good way to obtain more ground truth,
leading to results that are more statistically valid. In this case, it should be
possible to enforce a minimum number of examples for each concept in the
ground truth;
- The ground truth is not released by the organizers after the official results.</p>
      <p>Even if we understand the reasons why the organizers do this, such ground
truth would be of great help for participants to study why and when their
approaches fail. Alternatively, a bigger and more representative validation set
would be very helpful to participants;
- Without obtaining the ground truth, we think that the number of boxes per
concept in the ground truth should be released, so that participants may
have cues about their results per concept;
- Even if the name of the task is "scalable concept annotation", we wonder
whether it should be possible to get, in addition to the existing measures, other
measures that focus on the runs submitted: limiting the evaluation to the
concepts detected is already possible by averaging a posteriori the AP of a
subset of concepts, but participants that, for any reason, were not able to
process all the images cannot evaluate the quality of their runs on only the
subset of images processed.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
<p>For our first participation in the ImageCLEF scalable concept detection task,
we used classical approaches based on convolutional networks, as well as specific
elements related to the detection of face parts. Selective search was applied to
the images to generate candidate regions for CNN-based concept detection.
Because only a subset (35%) of the whole corpus was fully processed, the official
results we obtained are not as high as they could have been. We found that the
fusion of predefined face part extraction and deep learning detection did not
give positive results: such fusion has to be studied in more detail in the future.
The elements related to the definition of localization also have to be studied in
the future, to allow fast detection of such boxes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR 2009), pages 248&#8211;255. IEEE, 2009.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167&#8211;181, 2004.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. A. Gilbert, L. Piras, J. Wang, F. Yan, A. Ramisa, E. Dellandrea, R. Gaizauskas, M. Villegas, and K. Mikolajczyk. Overview of the ImageCLEF 2016 Scalable Concept Image Annotation Challenge. In CLEF2016 Working Notes, CEUR Workshop Proceedings, &#201;vora, Portugal, September 5-8, 2016. CEUR-WS.org.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 2013.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. M. Uricar, V. Franc, and V. Hlavac. Detector of facial landmarks learned by the structured output SVM. In VISAPP '12: Proceedings of the 7th International Conference on Computer Vision Theory and Applications, pages 547&#8211;556, 2012.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. M. Villegas, H. M&#252;ller, A. Garc&#237;a Seco de Herrera, R. Schaer, S. Bromuri, A. Gilbert, L. Piras, J. Wang, F. Yan, A. Ramisa, E. Dellandrea, R. Gaizauskas, K. Mikolajczyk, J. Puigcerver, A. H. Toselli, J.-A. S&#225;nchez, and E. Vidal. General Overview of ImageCLEF at the CLEF 2016 Labs. Lecture Notes in Computer Science. Springer International Publishing, 2016.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I-511&#8211;I-518. IEEE, 2001.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>