<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MRIM-LIG at ImageCLEF 2016 Scalable Concept Image Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maxime Portaz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mateusz Budnik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philippe Mulhem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johann Poignant</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNRS, LIG</institution>
          ,
          <addr-line>F-38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ. Grenoble Alpes, LIG</institution>
          ,
          <addr-line>F-38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes the participation of the MRIM research group of the LIG laboratory in the ImageCLEF 2016 scalable concept image annotation subtask 1. We used a classical framework to annotate the 500K images of this task: we tuned an existing Convolutional Neural Network model to learn the 251 concepts and to locate bounding boxes for these concepts, and we applied a specific process to handle faces and face parts. Because of time constraints, we fully processed 35% of the corpus (i.e. 180K images) and only partially processed the remaining images. For our first participation in this task, the results obtained show that we have to manage the localization in a more effective way.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Landmark face detection</kwd>
        <kwd>ImageNet</kwd>
        <kwd>TRECVID</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
The first participation of the MRIM group from the LIG laboratory at the
ImageCLEF 2016 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] scalable concept image annotation subtask 1 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is presented.
Our approach was to use a classical framework based on face detection [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
followed by facial landmark detection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for faces and face parts (eyes, nose and
mouth), and to rely on convolutional neural networks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for each of the 251
concepts.
      </p>
      <p>
The ImageCLEF 2016 scalable concept image annotation subtask 1 consists
of finding the locations of 251 classes of objects in a corpus of 500K images. This
task is challenging because of the difficulty of finding accurate object locations
in large sets of images. The objective is to assign at most 100 bounding
boxes per image, each bounding box being associated with one or more of the 251
proposed concepts. It is also possible to provide a confidence value for each
assigned tag. The visual concepts defined for this subtask do not fully match
concepts from the well-known ImageNet database [
        <xref ref-type="bibr" rid="ref1">1</xref>
], so specific
work has to be done to tackle these concepts.
      </p>
<p>Because of the time needed to process the whole corpus, we fully processed
around 35% of the image corpus (i.e. 180K images) and only partially processed
the remainder. The results obtained are therefore negatively impacted by this
partial processing.</p>
<p>The rest of this paper is organized as follows. In section 2, we define our
approach: we mainly rely on convolutional neural networks for "classical"
concepts, with a specific process dedicated to faces. Then, in section 3, we detail
the results obtained, as well as some additional elements dedicated to analyzing
our results in more detail. We conclude in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Approach</title>
      <sec id="sec-2-1">
        <title>Overview</title>
<p>The overall process applied for the detection and localization of concepts in
images is described in figure 1. We generate candidate bounding boxes, then apply
Convolutional Neural Networks for each of the 251 concepts. For face and face
part detection, we use face and facial landmark detection. Such approaches
were successfully used by several participants during the 2015 campaign of the
ImageCLEF concept annotation task. We finally rank all the labeled bounding
boxes by score or by size, depending on the run. This ranking is used as a filter
to reduce the number of boxes per image, as we keep only up to 100 boxes for
each image (a limit chosen by the organizers).</p>
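        <p>To make this last step concrete, the following is a minimal sketch (in Python, for illustration only; the function and the tuple layout are our own assumptions, not the authors' code) of keeping at most 100 labeled boxes per image, ranked either by confidence score or by surface area:</p>
        <preformat>
# Minimal illustrative sketch of the ranking/filtering stage.
# Each detection is assumed to be (concept, (x1, y1, x2, y2), score).

def rank_and_cap(detections, by="score", limit=100):
    """Keep at most `limit` boxes, ranked by confidence or by box area."""
    def area(det):
        _, (x1, y1, x2, y2), _ = det
        return (x2 - x1) * (y2 - y1)

    key = (lambda d: d[2]) if by == "score" else area
    return sorted(detections, key=key, reverse=True)[:limit]
        </preformat>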
      </sec>
      <sec id="sec-2-2">
        <title>Convolutional Neural Networks</title>
        <p>
We used a Deep Residual Convolutional Neural Network (ResNet) with 152
layers, presented by Microsoft at the ImageNet challenge [
          <xref ref-type="bibr" rid="ref4">4</xref>
]. The network
was fine-tuned to match the 251 labels from ImageCLEF. Only the final layer
was retrained.
        </p>
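        <p>The paper does not name the deep learning framework used, so the following is only an illustrative sketch, assuming PyTorch/torchvision: the pre-trained layers are frozen and the final fully-connected layer is replaced by a new 251-way classifier, which is the only part retrained.</p>
        <preformat>
# Sketch (assumed PyTorch/torchvision, not the authors' actual setup) of
# retraining only the last layer of a 152-layer ResNet for 251 concepts.
import torch.nn as nn
from torchvision import models

model = models.resnet152(pretrained=True)        # pre-trained on ImageNet
for param in model.parameters():                 # freeze all existing layers
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 251)  # new trainable final layer
        </preformat>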
      </sec>
      <sec id="sec-2-3">
        <title>Data Processed</title>
<p>A first step in the learning process was to map, when possible, the 251 CLEF
concepts to concepts from existing image collections, namely the ImageNet
concepts. From the full set C of 251 concepts, 224 map directly to
ImageNet concepts; for each of the 27 remaining concepts, we acquired 4519
images via the Bing API, using the concept name as the query. We did not
manually filter the resulting set of images.</p>
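        <p>As a rough illustration of this acquisition step, the sketch below assumes a hypothetical search_images(query, count) helper wrapping the Bing image search API; it is not the authors' crawling code.</p>
        <preformat>
# Hypothetical sketch: collect unfiltered Bing images for the 27 concepts
# that have no direct ImageNet mapping. `search_images` is assumed, not real.

def collect_training_images(unmapped_concepts, per_concept=4519):
    images = {}
    for concept in unmapped_concepts:
        # query with the concept name; results are kept unfiltered
        images[concept] = search_images(query=concept, count=per_concept)
    return images
        </preformat>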
<p>As described in figure 2, we also define a second set of images to increase the
quality of the concept detection. This second set includes both Bing API
images and the validation set (2000 images, 10000 tagged bounding boxes) provided
by the organizers of the task.</p>
      </sec>
      <sec id="sec-2-4">
        <title>CNN Processing</title>
        <p>
One specificity of our proposal is a two-step learning process (essentially
two fine-tuning stages) as a way to increase the effectiveness of the concept
detection. The CNN comes pre-trained on the ImageNet dataset [
          <xref ref-type="bibr" rid="ref1">1</xref>
]. We
used two validation sets: a) the first one is the set provided by the organizers of
the ImageCLEF task, and b) a second one that we defined to assess the quality
of the training on "clean" images. The first fine-tuning step is evaluated on these
two validation sets, while during the second learning step the first set (a) is
used for training, together with some additional images (crawled from the
Internet) for the concepts with the lowest recognition rates. After the second
fine-tuning, the system is tested only on validation set (b). In other words:
- On our first set of training images, learn the last layer of the CNN, then
evaluate (success@1 and success@5) on the two validation sets;
- During the second learning stage, for the concepts with low recognition
quality, we generate the second set of 200 additional training images per concept.
As described above, we also add the validation set (a) provided by CLEF. We
retrain the network on this combined and extended set.
        </p>
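        <p>The success@k measure used above is the usual top-k accuracy: an image counts as correct if its true concept appears among the k highest-scoring predictions. A minimal sketch, assuming a NumPy score matrix:</p>
        <preformat>
# Illustrative success@k (top-k accuracy) computation.
import numpy as np

def success_at_k(scores, labels, k):
    """scores: (n_images, n_concepts); labels: (n_images,) true concept ids."""
    topk = np.argsort(scores, axis=1)[:, -k:]   # indices of k best concepts
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))
        </preformat>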
<p>At the end of these two steps, we obtained the results presented in table 1.
The first two rows of this table present the results after the first tuning step;
the remaining two rows give the results after the second phase of fine-tuning.
The second step seems to significantly increase the performance on the Bing
validation set.</p>
<p>The ImageCLEF validation set was included in the training set at the second
stage of tuning. That is why a surprisingly strong result (denoted with "*"),
compared to the first tuning, is obtained: it does not generalize and is included
for illustrative purposes only.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Concept Localization</title>
        <p>
          We used the work of Uijlings, van de Sande, Gevers and Smeulders [
          <xref ref-type="bibr" rid="ref5">5</xref>
] to perform
selective search to define bounding box candidates. The idea is mainly to define
a priori a set of bounding boxes that are expected to contain one visual concept.
The selective search uses the Felzenszwalb algorithm [
          <xref ref-type="bibr" rid="ref2">2</xref>
] for image segmentation. In
our runs, we use a Gaussian kernel width of 0.8 and a scale factor of 500.
The minimum size for a box is set to 200 pixels. These constants give an average
of 517 boxes per image. Each of these boxes is used as an input image on
which the CNN is applied to detect objects.
        </p>
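        <p>For illustration, the segmentation stage with the reported parameters can be reproduced with the Felzenszwalb implementation in scikit-image (a stand-in, not the authors' code); selective search then hierarchically merges the resulting segments into candidate boxes.</p>
        <preformat>
# Illustrative first stage of box generation: Felzenszwalb segmentation
# with sigma 0.8, scale 500, minimum segment size 200 (values from the text).
from skimage import io
from skimage.segmentation import felzenszwalb

image = io.imread("example.jpg")    # any corpus image
segments = felzenszwalb(image, scale=500, sigma=0.8, min_size=200)
        </preformat>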
      </sec>
      <sec id="sec-2-6">
        <title>Actual Processing Achieved</title>
<p>Due to time constraints, we applied the full process (selective search and
clustering of bounding boxes, then CNN detection on each of the selected boxes)
to 180K images. On average, 517 boxes were generated per image. For each of
the remaining 320K images, we applied detection on: a) the full image, and b) a
small subset of the initial boxes selected randomly, giving an average of 8 boxes
per remaining image. Overall, we processed roughly 95 million boxes for our
submissions (180K &#215; 517 &#8776; 93M plus 320K &#215; 8 &#8776; 2.5M).</p>
      </sec>
      <sec id="sec-2-7">
        <title>Face Detection</title>
        <p>
The detection and localization of face parts is achieved through a two-step
process:
- Frontal faces are detected using the "classical" Viola and Jones approach [
          <xref ref-type="bibr" rid="ref8">8</xref>
]
based on a cascade of simple Haar-like features;
- Then 8 facial landmarks [
          <xref ref-type="bibr" rid="ref6">6</xref>
] are detected on these faces. They correspond to
the 2 mouth corners, the 4 eye canthi, the tip of the nose and the center of the
face. We then used simple heuristics to define bounding boxes for faces, eyes,
noses and mouths based on these landmarks.
        </p>
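        <p>A minimal sketch of the first step, using the OpenCV Haar-cascade implementation of the Viola and Jones detector (our choice of library for illustration; the landmark detector of [6] is a separate structured-output SVM and is not reproduced here):</p>
        <preformat>
# Illustrative frontal face detection with OpenCV's Haar cascades [8].
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
gray = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        </preformat>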
<p>All images of the ImageCLEF corpus are processed using the above steps. With
this process, at least one face is detected in 64642 of the 510K images (12.7% of
the whole corpus). A total of 91102 face "boxes" are detected in these images.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Results</title>
<p>The runs submitted by the MRIM-LIG team are the following:
- RUN1 LIG DLo: annotation using the Convolutional Neural Network
described in section 2.2, with the bounding boxes ranked by confidence value;
- RUN2 LIG DLo: annotation using the CNN described in section 2.2, with
the bounding boxes ranked by surface area;
- RUN3 LIG Fo: annotation of the face parts only, using the Viola/Jones
approach described in section 2.3;
- RUN4 LIG DLF: annotation using both the CNN and face part detection,
with the bounding boxes ranked by confidence value;
- RUN5 LIG DLF: annotation using both the CNN and face part detection,
with the bounding boxes ranked by surface area.</p>
      <sec id="sec-3-1">
        <title>Official Results</title>
        <p>The official MAP at 0% overlap and MAP at 50% overlap results of our runs
are presented in table 2. We find that RUN5 (which fuses the face part and
deep learning results, with ranking based on surface) achieves our best result
(rank 11 for overlap 0, and rank 9 for overlap 0.5). At overlap 0.5, our second
best result is RUN4 (which fuses the face part and deep learning results, with
ranking based on confidence values). The difference between RUN5 and RUN4
is negligible. We suppose this comes from the fact that only 180K images were
fully processed; for the remaining images we did not have more than 100 boxes,
and the ranking only plays a role when an image has more than 100 boxes. The
same also holds for our runs RUN1 and RUN2 (based only on deep learning
features).</p>
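        <p>We assume the overlap criterion is the usual intersection-over-union (IoU): a detected box counts at "50% overlap" when its IoU with a ground-truth box is at least 0.5. A minimal sketch of the measure:</p>
        <preformat>
# Illustrative intersection-over-union between two boxes (x1, y1, x2, y2).
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
        </preformat>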
        <p>Compared to the runs of the other participants, we find that our general runs
that integrate deep learning do not obtain very high results. This can be explained
by the fact that, as mentioned before, the whole proposed process was applied
to only 180K of the 510K images of the corpus.</p>
        <p>As expected, our run RUN3, which detects only face parts, has a very low
overall result, ranked 23 for both overlap 0 and overlap 0.5.</p>
        <p>When considering the additional official measures related to the minimum
number of boxes per image, we see a plateau above a minimum of 20 boxes.
This shows that when an image has fewer than 20 boxes in the ground truth set,
our proposal has difficulty finding relevant concepts or boxes. This can also be
attributed to the fact that we did not fully process the whole corpus, as explained
earlier.</p>
      </sec>
      <sec id="sec-3-1">
        <title>Detailed analysis of face parts results</title>
        <p>
Here we try to give additional insight into the results obtained when considering
only the face elements from the deep learning and predefined face extraction
approaches [
          <xref ref-type="bibr" rid="ref6 ref8">8, 6</xref>
]. In table 3, we present the average precision results obtained for
our runs RUN2 (deep learning only), RUN3 (face parts only), and RUN5
(fusion), for the concepts mouth, eye, nose and face.
        </p>
<p>One interesting point from table 3 is that, for the MAP at 0 and for the
face concept, the deep learning approach (RUN2) outperforms both the
predefined detection (RUN3) and the fusion (RUN5). We recall that face is a
concept already available in ImageNet; this is not the case for the other concepts.
When the localization is evaluated, the predefined detection outperforms the
deep learning approach. When considering the fusion run (RUN5), we see that
most of the time the fusion does not work properly, as it does not seem to boost
the results. The only case where the fusion outperforms the other runs is MAP
0.5 for the eye concept, and the increment is marginal.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Current limitations of the scalable concept annotation task</title>
<p>After checking the official global results and the per-concept results, we feel that:
- The size of the ground truth seems small: many per-concept AP values
are equal to 1 (or exactly 0.5, 0.25, etc.), which suggests that only very few
ground truth regions are defined for most concepts. A collaborative annotation
interface open to participants may be a good way to obtain more ground truth,
leading to results that are more statistically valid. In this case, it should be
possible to enforce a minimum number of examples for each concept in the
ground truth;
- The ground truth is not released by the organizers after the official results.</p>
      <p>Even if we understand the reasons why the organizers do this, such ground
truth would be of great help for participants to study why and when their
approaches fail. Alternatively, a bigger and more representative validation set
would be very helpful to participants;
- Without obtaining the ground truth, we think that the number of boxes per
concept in the ground truth should be released, so that participants may
have cues about their results per concept;
- Even if the name of the task is "scalable concept annotation", we wonder
whether it should be possible to get, in addition to the existing measures, other
measures that focus on the runs submitted: limiting the evaluation to the
concepts detected is already possible by averaging a posteriori the AP of a
subset of concepts, but participants that, for any reason, were not able to
process all the images cannot evaluate the quality of their runs on only the
subset of images processed.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
<p>For our first participation in the ImageCLEF scalable concept detection task,
we used classical approaches based on convolutional networks, as well as specific
elements related to the detection of face parts. Selective search was applied to
the images to generate candidate regions for CNN-based concept detection.
Because only a subset (35%) of the whole corpus was fully processed, the official
results we obtained are not as high as they could have been. We found that the
fusion of predefined face part extraction and deep learning detection did not
give positive results: such fusion has to be studied in more detail in the future.
The elements related to the definition of localization also have to be studied in
the future, to allow fast detection of such boxes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR 2009), pages 248&#8211;255. IEEE, 2009.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167&#8211;181, 2004.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. A. Gilbert, L. Piras, J. Wang, F. Yan, A. Ramisa, E. Dellandrea, R. Gaizauskas, M. Villegas, and K. Mikolajczyk. Overview of the ImageCLEF 2016 Scalable Concept Image Annotation Challenge. In CLEF2016 Working Notes, CEUR Workshop Proceedings, &#201;vora, Portugal, September 5-8, 2016. CEUR-WS.org.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 2013.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. M. Uricar, V. Franc, and V. Hlavac. Detector of facial landmarks learned by the structured output SVM. In VISAPP '12: Proceedings of the 7th International Conference on Computer Vision Theory and Applications, pages 547&#8211;556, 2012.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. M. Villegas, H. M&#252;ller, A. Garc&#237;a Seco de Herrera, R. Schaer, S. Bromuri, A. Gilbert, L. Piras, J. Wang, F. Yan, A. Ramisa, E. Dellandrea, R. Gaizauskas, K. Mikolajczyk, J. Puigcerver, A. H. Toselli, J.-A. S&#225;nchez, and E. Vidal. General Overview of ImageCLEF at the CLEF 2016 Labs. Lecture Notes in Computer Science. Springer International Publishing, 2016.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I-511&#8211;I-518. IEEE, 2001.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>