Hybrid Learning Framework for Large-Scale Web Image Annotation and Localization

Yong Li1, Jing Liu1, Yuhang Wang1, Bingyuan Liu1, Jun Fu1, Yunze Gao1, Hui Wu2, Hang Song1, Peng Ying1, and Hanqing Lu1

1 IVA Group, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
2 Institute of Software, Chinese Academy of Sciences
{yong.li,jliu,yuhang.wang,byliu,peng.ying,luhq}@nlpr.ia.ac.cn
{fujun2015,gaoyunze2015}@ia.ac.cn
wuhui13@iscas.ac.cn, hangsongv@gmail.com
http://www.nlpr.ia.ac.cn/iva
http://www.foreverlee.net

Abstract. In this paper, we describe the details of our participation in the ImageCLEF 2015 Scalable Image Annotation task. The task is to annotate and localize different concepts depicted in images. We propose a hybrid learning framework to solve the scalable annotation task, in which supervised methods trained on limited annotated images and search-based solutions over the whole dataset are explored jointly. We adopt a two-stage solution that first annotates images with possible concepts and then localizes the concepts in the images. In the first stage, we adopt a classification model to obtain class predictions for each image. To overcome the overfitting problem of a classifier trained on limited labelled data, we also use a search-based approach that annotates an image by mining the textual information of its neighbors, which are similar in both visual appearance and semantics. We combine the results of classification and the search-based solution to obtain the annotations of each image. In the second stage, we train a concept localization model based on the Fast R-CNN architecture and output the top-k predicted regions for each concept obtained in the first stage. Meanwhile, localization by search is adopted, which works well for concepts without obvious objects. The final result is achieved by combining the two kinds of localization results. The submitted runs of our team achieved the second place among the participating teams, which shows the effectiveness of the proposed hybrid two-stage learning framework for the scalable annotation task.

Keywords: Hybrid Learning, SVM, Fast R-CNN, Annotation, Concept Localization

1 Introduction

With the advance of digital cameras, high-quality mobile devices and Internet technologies, an increasingly huge number of images is available on the web. This necessitates scalable image annotation techniques to effectively organize and retrieve such large-scale collections. Although some possibly related textual information is present on the web pages associated with images, the relationship between the surrounding text and the images varies greatly, with much of the text being redundant or unrelated. Therefore, how to best exploit the weak supervision from textual information is a challenging problem for the task of scalable image annotation.

The goal of the scalable image annotation task in ImageCLEF 2015 is to describe the visual content of images with concepts and to localize the concepts in the images [7, 17]. The task provides a dataset of 500,000 web images with textual information extracted from their web pages, in which 1,979 items with ground-truth localized concept labels form the development set. The overall performance is evaluated by annotating and localizing concepts on the full set of 500,000 images. The large-scale test data and the new task of concept localization are the main differences from the previous ImageCLEF challenges.
Unlike other popular challenges such as ILSVRC [14] and Pascal VOC [5], this task provides few fully labelled training data but a large amount of raw web resources for model learning.

For our participation in the scalable image annotation task, we adopt a two-stage hybrid learning framework to make full use of the limited labelled data and the large-scale web resources. In the first stage, we train an SVM classifier for each concept in a one-vs-rest manner. To avoid the overfitting problem caused by the small-scale training data, we adopt an unsupervised solution as a complement to enhance the scalability of our approach: we annotate an image by searching over the whole 500,000-image dataset, in which visual and semantic similarities are jointly estimated with deep visual features [10] and deep textual features (i.e., Word2Vec [12]), and WordNet is used to mine the relevant concepts from the textual information of the similar images. After the concept annotation stage, we obtain a set of concepts relevant to each image. We then localize the concepts in the second stage, in which the latest deep model, Fast R-CNN [8], is adopted to predict the possible locations of the concepts obtained in the first stage. Although the deep model can directly predict and localize the concepts depicted in each image, its performance alone is unstable, possibly because of the very small amount of training data with ground-truth localized concept labels, as demonstrated by the experimental results. Thus, we only output the top-K predicted regions for the concepts obtained in the first stage. Besides, we adopt a search-based approach to localize scene-related concepts (e.g., "sea", "beach" and "river"): the location of each predicted scene concept in an image is decided by the spatial layout of its visually similar images in the training dataset. The experimental results show that the hybrid two-stage learning framework contributes to the improvement of image annotation and localization. Furthermore, a few concepts are related to the concept "face" (e.g., "head", "eye", "nose", "mouth" and "beard"). Since face detection and facial point detection have been actively studied over the past years and achieve satisfactory performance [19, 4, 15], we employ them to localize the face-related concepts precisely.

Fig. 1. Flowchart of the proposed method. The top part shows the first stage of image annotation. For a given query image, a deep visual feature is extracted with the VGG 19 network and the SVM classifiers are applied to predict the annotations. In parallel, visual-based retrieval is performed and the result is reranked with the surrounding text; the annotations of the testing image are then mined with WordNet from the textual descriptions of the obtained similar image set. The lower part shows the second stage of concept localization. Two methods are employed: Fast R-CNN, which works well for concepts with obvious objects, and localization by search, which works well for scene-related concepts (e.g., "sea" and "beach"). Finally, the localization results from the two methods are combined.
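To make the flowchart in Fig. 1 easier to follow, the sketch below restates the two-stage pipeline in code form. All object and function names (models.vgg19_feature, database.visual_search, and so on) are hypothetical placeholders for the components described in Sections 2 to 4, not an actual API.

```python
def annotate_and_localize(image, text, models, database):
    """Hypothetical orchestration of the two-stage framework in Fig. 1."""
    feature = models.vgg19_feature(image)                       # Section 2

    # Stage 1: concept annotation
    concepts = set(models.svm_annotate(feature))                # Section 3.1
    neighbors = database.visual_search(feature)                 # Section 3.2
    neighbors = database.semantic_rerank(neighbors, text)
    concepts |= set(models.wordnet_mine(neighbors))
    concepts |= set(models.concept_extension(concepts))         # optional (CE)

    # Stage 2: concept localization
    boxes = models.fast_rcnn_localize(image, concepts)          # Section 4.1
    boxes.update(models.localize_by_search(feature, concepts))  # Section 4.2
    boxes.update(models.localize_face_concepts(image, concepts))# Section 4.3
    return concepts, boxes
```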
The remainder of this working note is structured as follows. Section 2 presents the details of data preparation for model training. Section 3 elaborates how we obtain the image annotation results. Section 4 introduces the annotation localization approach. In Section 5, we discuss the experimental results and parameter settings. Finally, we conclude our participation in ImageCLEF 2015 in Section 6.

2 Data Preparation

This year, hand-labeled data is allowed in the image annotation and localization task. We resort to multiple online resources to perform the task, including the ImageNet database [14], the SUN database [18], WordNet [13] and the online image sharing website Flickr (https://www.flickr.com/). To perform image annotation by classification, we collect training images from the well-labeled ImageNet and SUN datasets. There are 175 concepts shared between the ImageNet dataset and the ImageCLEF task, and 217 concepts shared between the SUN dataset and the ImageCLEF task. For the concepts covered by neither ImageNet nor SUN, images are crawled from Flickr and filtered by humans, with 50 images kept for each concept. In our work, images are represented with deep features [9]: we employ the VGG19 model [10] pretrained on the ImageNet dataset (1000 classes) and average the outputs of its relu6 layer over 10 image patches (the 4 corner and 1 center crops of an image as well as their mirrors) as the visual feature. A development set of 1,979 images has been released for validating the proposed method. The frequency of the different concepts is unbalanced, and 17 concepts do not occur in the development set at all. We therefore collected additional images for such concepts to make the development set more suitable for setting hyper-parameters and validating the proposed method.

3 Concept Annotation

3.1 Annotation By Classification

Image annotation by classification trains a multi-class classifier or a set of one-vs-rest classifiers corresponding to the different concepts. Such a solution is simple and usually achieves satisfactory performance given abundant training data. We choose linear Support Vector Machines (SVM) [6] and train a one-vs-rest classifier for each concept. Since images in the training data are usually labelled with multiple concepts, the negative samples for a given concept classifier are selected as the images none of whose labels include that concept. For a testing image, we select the most confident concepts by thresholding the classification confidences to obtain its annotations; a minimal sketch of this procedure is given below.
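The following sketch illustrates the one-vs-rest training and thresholding described above. It assumes precomputed 4096-dimensional deep features and uses scikit-learn's LinearSVC as a stand-in for liblinear [6]; the threshold value and all variable names are placeholders, not the exact values used in our runs.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(features, labels, concepts, C=1.0):
    """Train one linear SVM per concept (one-vs-rest).

    features: (N, 4096) deep features; labels: list of concept sets per image.
    Negatives for a concept are the images whose label set does not contain it.
    """
    classifiers = {}
    for concept in concepts:
        y = np.array([1 if concept in lab else -1 for lab in labels])
        clf = LinearSVC(C=C)              # liblinear-style linear SVM
        clf.fit(features, y)
        classifiers[concept] = clf
    return classifiers

def annotate(classifiers, feature, threshold=0.0):
    """Keep the concepts whose decision score exceeds the threshold."""
    scores = {c: clf.decision_function(feature[None, :])[0]
              for c, clf in classifiers.items()}
    return [c for c, s in scores.items() if s > threshold]
```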
3.2 Annotation By Search

The search-based approach to image annotation relies on the assumptions that visually similar images reflect similar semantic concepts and that most textual information of web images is relevant to their visual content. The search-based annotation process can thus be divided into two phases: the search for similar images, and the selection of relevant concepts from the textual information of those similar images.

First, given a testing image with textual information, we search for its similar neighbors over the whole 500,000-image dataset. As mentioned in Section 2, images are represented with 4096-dimensional deep features. To speed up similar-image retrieval over the large-scale image database, we adopt a hash encoding algorithm. Specifically, we map the deep features to 32768-dimensional binary hash codes using the random projection algorithm proposed in [2], and employ the Hamming distance to rank the images in the dataset.

To further improve the results of the visual similarity search, we exploit the textual information of the given image and perform a semantic similarity search on the top-NA visually similar images to rerank the similar image set. Here, we use the publicly available Word2Vec tool [12] to compute vector representations of the textual information of images, which is provided in the form of scofeat descriptors. With the word vector representations, the cosine distance is used to rerank the images, yielding a set of visually and semantically similar images. A sketch of this retrieval pipeline is given at the end of this section.

Next, the annotations of the testing image are mined from the textual descriptions of the obtained similar image set. For the annotation mining, we employ a WordNet-based approach similar to the solution in [3]. The major difference is that we mine the concepts from a set of visually and semantically similar images, while they considered only the visual similarities among images. A candidate concept graph is built with the help of WordNet, and the top-NW concepts with the highest numbers of links are selected as the final image description.

We combine the results of the classification-based solution and the search-based solution with different threshold settings; their respective performances are discussed in the experimental section.

Concept extension is adopted to exploit the strong correlations among concepts and make the annotation result more complete. Among the given 251 concepts, some have strong correlations, such as "eyes" and "nose", which usually occur together. Hierarchical relations may also exist, as for the concepts "apple" and "fruit": when the child concept "apple" occurs, the parent concept "fruit" must occur as well. These relations are obtained by exploring the WordNet concept hierarchy and the provided ImageCLEF concept set with general-level categories.
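As a rough sketch of the two retrieval phases described above, the code below hashes deep features with sign-based random projections [2], ranks by Hamming distance, and reranks the candidates by cosine similarity of textual vectors. It assumes precomputed deep features and Word2Vec vectors for the scofeat text; the function names and the fixed random seed are illustrative, not the actual implementation.

```python
import numpy as np

rng = np.random.RandomState(0)

def fit_random_projection(dim_in=4096, n_bits=32768):
    """Random projection matrix for sign-based binary hashing [2]."""
    return rng.randn(dim_in, n_bits).astype(np.float32)

def hash_codes(features, projection):
    """Map real-valued features to binary codes (one bit per projection)."""
    return features @ projection > 0            # bool array, shape (N, n_bits)

def hamming_rank(query_code, db_codes, top_k):
    """Rank database images by Hamming distance to the query code."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:top_k]

def cosine_rerank(query_text_vec, db_text_vecs, candidate_ids, top_k):
    """Rerank the visually similar candidates by textual (Word2Vec) similarity."""
    cand = db_text_vecs[candidate_ids]
    sims = (cand @ query_text_vec) / (
        np.linalg.norm(cand, axis=1) * np.linalg.norm(query_text_vec) + 1e-8)
    order = np.argsort(-sims)[:top_k]
    return [candidate_ids[i] for i in order]
```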
4 Annotation Localization

4.1 Localization by Fast R-CNN

To localize the objects in the given images, we follow the Fast R-CNN framework (FRCN) proposed in [8], which provides a classification result and a regressed location simultaneously for each candidate object proposal. The FRCN approach operates on a number of regions of interest (RoIs), which are sampled from the object proposals of the images. To reduce the repetitive computation over overlapping regions, the last pooling layer of a traditional CNN is replaced with a RoI pooling layer, which noticeably speeds up training and testing. Furthermore, the network uses two sibling loss terms as supervision to learn the classification and localization information collaboratively, which proves helpful for improving performance and makes the approach a one-stage detection framework.

Each RoI corresponds to a region in the feature map produced by the last convolutional layer. The RoI pooling layer carries out max pooling on each of the corresponding regions and pools them into fixed-size feature maps. The pooling mask of the RoI pooling layer is adjusted automatically according to the spatial size of the input feature map region, so that all outputs have the same size. Therefore, after RoI pooling the feature map of each RoI matches the following fully connected layers seamlessly and contributes to the network as an independent instance.

As for the supervision, the FRCN network employs two sibling output layers to predict the classification probability and the bounding box regression offsets, respectively, for each RoI on each category. The first output layer is a softmax layer which outputs a probability distribution over all categories, constrained by the standard cross-entropy loss:

    L_{cls} = -\log(\hat{p}_{k^*})                                                   (1)

where k^* is the ground-truth label and \hat{p}_{k^*} is the predicted probability of this class, assuming there are K + 1 categories in total, i.e., K object classes and a background class. The second output layer is a regression layer which predicts the bounding box regression offsets for each category as t^k = (t^k_x, t^k_y, t^k_h, t^k_w), where k is the index of the K object classes. Assuming that the ground-truth offsets for class k^* are t^* = (t^*_x, t^*_y, t^*_h, t^*_w), the regression loss is formulated as

    L_{loc} = \sum_{i \in \{x, y, h, w\}} \mathrm{smooth}_{L_1}(t^{k^*}_i - t^*_i)    (2)

where

    \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}    (3)

Thus, the loss function for the whole network is

    L = L_{cls} + \lambda\,[k^* \geq 1]\,L_{loc}                                      (4)

where \lambda is a weighting parameter balancing the two loss terms, and [k^* \geq 1] is an indicator function with the convention that the background class is labeled k^* = 0 and the object classes k^* = 1, 2, \ldots, K, so that the localization regression loss is ignored for background RoIs.

In practice, we first extract object proposals with the selective search method [16] and then sample RoIs from them for training. We take 25% of the RoIs from the object proposals that overlap a ground-truth bounding box with more than 0.5 IoU (intersection over union) and give them the label of that ground-truth box. The remaining RoIs are sampled from the object proposals whose maximum IoU with the ground-truth boxes lies between 0.1 and 0.5, and are labeled as background, as instructed in [8]. During testing, we feed all object proposals into the network and predict labels and regression offsets for all of them. A preliminary screening with non-maximum suppression is also implemented in this step to discard object proposals with too low classification probabilities.
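To make Eqs. (1) to (4) concrete, the following is a minimal NumPy sketch of the two loss terms for a single RoI. It is a didactic re-implementation, not the actual Caffe-based FRCN training code; variable names and shapes are illustrative.

```python
import numpy as np

def smooth_l1(x):
    """Eq. (3): 0.5 x^2 if |x| < 1, else |x| - 0.5 (element-wise)."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def frcn_loss(class_probs, box_offsets, gt_class, gt_offsets, lam=1.0):
    """Multi-task loss of Eq. (4) for one RoI.

    class_probs : (K+1,) softmax output, index 0 = background.
    box_offsets : (K, 4) predicted (x, y, h, w) offsets per object class.
    gt_class    : ground-truth label k* (0 for background).
    gt_offsets  : (4,) ground-truth offsets t* (ignored for background).
    """
    l_cls = -np.log(class_probs[gt_class] + 1e-12)      # Eq. (1)
    if gt_class >= 1:                                    # indicator [k* >= 1]
        diff = box_offsets[gt_class - 1] - gt_offsets
        l_loc = smooth_l1(diff).sum()                    # Eq. (2)
    else:
        l_loc = 0.0
    return l_cls + lam * l_loc                           # Eq. (4)
```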
4.2 Localization by Search

We give special consideration to the scene-related concepts (e.g., "beach", "sea", "river" and "valley") during annotation localization. If an image is predicted to contain a scenery concept, we first find its top-NL visually similar neighbors carrying the same concept in the localization training data, and use their merged bounding box as the location of the scenery concept. Figure 2 shows some examples of search-based localization results; a sketch of the procedure is given at the end of Section 4. The final localization results are composed of the predictions of the Fast R-CNN model and the search-based localization results.

Fig. 2. Examples of search-based localization for the concepts "valley", "shore" and "bathroom". The annotation of the query image is obtained by the method introduced in Section 3, and similar images are retrieved by visual similarity. The localization of a given concept in the query image is obtained by transferring the bounding boxes from the similar images.

4.3 Localization of Face Related Concepts

A few concepts are related to the concept "face" (e.g., "head", "eye", "nose", "mouth" and "beard"). To localize the face-related concepts precisely, we employ a face detection algorithm with aggregate channel features [19, 4]. Facial point detection is then performed to locate the key points in the face [15]. The localization of the concepts "eye", "nose", "mouth" and "beard" is derived from the key points, while the concepts "ear" and "neck" are located based on the relative position of the face, using empirical rules. Besides, linear classifiers [6] are trained on SIFT [11] features extracted at the facial points to determine the concepts "man", "woman", "male child" and "female child".
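The sketch below illustrates the search-based localization of Section 4.2. It assumes the localization training data is available as deep features plus per-image dictionaries mapping concepts to bounding boxes; taking the enclosing box of the neighbors' boxes is one plausible reading of the "merged bounding box", and all names are placeholders.

```python
import numpy as np

def localize_by_search(query_feat, concept, train_feats, train_boxes, n_l=3):
    """Merge the boxes of the top-N_L visually similar training images
    that carry the same (scene) concept.

    train_feats : (N, D) deep features of the localization training images.
    train_boxes : list of dicts mapping concept -> list of (x1, y1, x2, y2).
    Returns one (x1, y1, x2, y2) box, or None if no neighbor has the concept.
    """
    # keep only training images annotated with this concept
    ids = [i for i, boxes in enumerate(train_boxes) if concept in boxes]
    if not ids:
        return None
    feats = train_feats[ids]
    dists = np.linalg.norm(feats - query_feat, axis=1)   # visual similarity
    nearest = [ids[i] for i in np.argsort(dists)[:n_l]]
    boxes = np.array([b for i in nearest for b in train_boxes[i][concept]])
    # "merged bounding box": here taken as the union (enclosing box)
    return (boxes[:, 0].min(), boxes[:, 1].min(),
            boxes[:, 2].max(), boxes[:, 3].max())
```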
Fig. 3. Mean average precision (%) of settings 1-5 at increasing overlap percentages with the ground-truth labels.

5 Experiments

We submitted 8 runs with 5 different settings, obtained as combinations of the above modules: Annotation By Classification (ABC), Annotation By Search (ABS), Localization by Fast R-CNN (FRCN), Localization By Search (LBS) and Concept Extension (CE). Some runs share the same setting but use different parameter values.

Table 1. Results for the submitted runs with different settings

Method     ABC  CE   ABS  LBS  FRCN  SVM Threshold  MAP (0.5 overlap)  MAP (0 overlap)
Setting 1  yes  no   yes  yes  yes   0.5            0.510              0.642
Setting 2  yes  yes  yes  yes  yes   0.4            0.510              0.635
Setting 3  yes  yes  no   yes  yes   0.4            0.486              0.613
Setting 4  yes  no   no   yes  yes   0.4            0.432              0.552
Setting 5  no   no   no   no   yes   0.4            0.368              0.469

The experimental results of our method with the different settings are presented in Table 1. The last two columns show the mean average precision at different overlap percentages with the ground-truth labels, and detailed results with increasing overlap percentage are shown in Fig. 3. Setting 1, with the modules ABC, ABS, FRCN and LBS, achieves the best result, and setting 2, which extends setting 1 with CE, achieves comparable results. By comparing the results of setting 2 and setting 3, we can see the effectiveness of taking annotation by search into consideration. Furthermore, we validated the effect of FRCN and found that the result of localization by detection alone is unsatisfactory. This is mainly due to two reasons: the training data for each concept is very limited, and for some concepts the content of web images does not exhibit obvious objectness [1]. The proposed hybrid two-stage learning framework is therefore more suitable for this task. Comparisons of our runs (denoted IVANLPR-*) and the runs of the other participants are illustrated in Fig. 4. The submitted runs of our team achieved the second place among the different teams, which shows the effectiveness of the proposed hybrid two-stage learning framework for the scalable annotation and localization task.

Fig. 4. Submission results of different teams with 50% overlap with the ground-truth labels. Results of our runs are colored red.

6 Conclusion

In this paper, we described the participation of the IVANLPR team in the ImageCLEF 2015 Scalable Concept Image Annotation task. We proposed a hybrid learning framework to solve the scalable annotation task, adopting a two-stage solution that first annotates images with possible concepts and then localizes the concepts in the images. In the first stage, both a supervised method and an unsupervised method are adopted to make full use of the available hand-labeled data and the surrounding text in the web pages. In the second stage, Fast R-CNN and a search-based method are adopted to locate the annotated concepts. Extensive experiments demonstrate the effectiveness of the proposed hybrid two-stage learning framework for the scalable annotation task.

7 Acknowledgments

This work was supported by the 973 Program (2012CB316304) and the National Natural Science Foundation of China (61332016, 61272329 and 61472422).

References

1. Alexe, B., Deselaers, T., Ferrari, V.: What is an object? In: CVPR. pp. 73–80 (2010)
2. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: Applications to image and text data. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 245–250. KDD, ACM (2001)
3. Budíková, P., Botorek, J., Batko, M., Zezula, P.: DISA at ImageCLEF 2014: The search-based solution for scalable image annotation. In: Working Notes for CLEF 2014 Conference. pp. 360–371 (2014)
4. Dollár, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on 36(8), 1532–1545 (2014)
5. Everingham, M., Gool, L.J.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88(2), 303–338 (2010)
6. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (Jun 2008)
7. Gilbert, A., Piras, L., Wang, J., Yan, F., Dellandrea, E., Gaizauskas, R., Villegas, M., Mikolajczyk, K.: Overview of the ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task. In: CLEF2015 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Toulouse, France (September 8-11 2015)
8. Girshick, R.: Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015)
9. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
11. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
13. Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
14. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision (2015)
15. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: CVPR. pp. 3476–3483. IEEE (2013)
16. Uijlings, J.R., van de Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. International Journal of Computer Vision 104(2), 154–171 (2013)
17. Villegas, M., Müller, H., Gilbert, A., Piras, L., Wang, J., Mikolajczyk, K., de Herrera, A.G.S., Bromuri, S., Amin, M.A., Mohammed, M.K., Acar, B., Uskudarli, S., Marvasti, N.B., Aldana, J.F., del Mar Roldán García, M.: General Overview of ImageCLEF at the CLEF 2015 Labs. Lecture Notes in Computer Science, Springer International Publishing (2015)
18. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: CVPR. pp. 3485–3492 (2010)
19. Yang, B., Yan, J., Lei, Z., Li, S.Z.: Aggregate channel features for multi-view face detection. In: IJCB. pp. 1–8. IEEE (2014)