<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hybrid Learning Framework for Large-Scale Web Image Annotation and Localization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yong Li</string-name>
          <email>yong.li@nlpr.ia.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuhang Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bingyuan Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jun Fu</string-name>
          <email>fujun2015@ia.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yunze Gao</string-name>
          <email>gaoyunze2015@ia.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hui Wu</string-name>
          <email>wuhui13@iscas.ac.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hang Song</string-name>
          <email>hangsongv@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peng Ying</string-name>
          <email>peng.ying@nlpr.ia.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanqing Lu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IVA Group, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Software, Chinese Academy of Sciences</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we describe the details of our participation in the ImageCLEF 2015 Scalable Image Annotation task. The task is to annotate and localize different concepts depicted in images. We propose a hybrid learning framework to solve the scalable annotation task, in which supervised methods trained on limited annotated images and search-based solutions over the whole dataset are explored jointly. We adopt a two-stage solution that first annotates images with possible concepts and then localizes the concepts in the images. In the first stage, we adopt a classification model to obtain the class predictions for each image. To overcome the overfitting problem of a classifier trained with limited labelled data, we use a search-based approach that annotates an image by mining the textual information of its neighbors, which are similar in both visual appearance and semantics. We combine the results of the classification and the search-based solution to obtain the annotations of each image. In the second stage, we train a concept localization model based on the Fast R-CNN architecture, and output the top-k predicted regions for each concept obtained in the first stage. Meanwhile, localization by search is adopted, which works well for concepts without obvious objects. The final result is achieved by combining the two kinds of localization results. The submitted runs of our team achieved the second place among the participating teams, which shows the effectiveness of the proposed hybrid two-stage learning framework for the scalable annotation task.</p>
      </abstract>
      <kwd-group>
        <kwd>Hybrid Learning</kwd>
        <kwd>SVM</kwd>
        <kwd>Fast R-CNN</kwd>
        <kwd>Annotation</kwd>
        <kwd>Concept Localization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>With the advance of digital cameras and high-quality mobile devices as well as
Internet technologies, an increasingly huge number of images is available
on the web. This necessitates scalable image annotation techniques to effectively
organize and retrieve large-scale datasets. Although some textual information
possibly related to images is present on their associated web pages, the
relationship between the surrounding text and the images varies greatly, with much
of the text being redundant or unrelated. Therefore, how to best exploit the
weak supervision from textual information is a challenging problem for the task
of scalable image annotation.</p>
      <p>
        The goal of the scalable image annotation task in ImageCLEF 2015 is to describe
the visual content of images with concepts, and to localize the concepts in the images
[
        <xref ref-type="bibr" rid="ref7">7, 17</xref>
        ]. The task provides a dataset of 500,000 web images with textual
information extracted from web pages, in which 1,979 items with ground-truth
localized concept labels form the development set. The overall performance is
evaluated by annotating and localizing concepts on the full 500,000 images. The
large-scale test data and the new task of concept localization are the main
differences from the previous ImageCLEF challenges. Unlike other popular
challenges such as ILSVRC [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and Pascal VOC [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], this task provides little fully labelled
training data but a large amount of raw web resources for model learning.
      </p>
      <p>
        For our participation in the scalable image annotation task, we adopt a
two-stage hybrid learning framework to fully use the limited labelled data and the
large-scale web resources. In the first stage, we train an SVM classifier for each
concept in a one-vs-rest manner. To avoid the overfitting problem brought by the
small-scale training data, we adopt an additional unsupervised solution as a
complement to enhance the scalability of our work. We attempt to annotate an image
by search over the whole 500,000-image dataset, in which visual and semantic
similarities are jointly estimated with deep visual features [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and deep textual
features (i.e., Word2Vec [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]), and WordNet is used to mine the relevant
concepts from the textual information of those similar images. After the concept
annotation stage, we obtain a set of concepts relevant to each image. We then
localize the concepts in the second stage, in which a recent deep
model, Fast R-CNN [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], is adopted to predict the possible locations of the concepts
obtained in the first stage. Although the deep model can directly predict and
localize the concepts depicted in each image, its performance is unstable,
possibly due to the small amount of training data with ground-truth localized
concept labels, as demonstrated by the experimental results. Thus,
we output the top-K predicted regions for each concept obtained in the first stage.
Besides, we adopt a search-based approach to localize the scene-related
concepts (e.g., "sea", "beach" and "river"). Specifically, the location of each
predicted scene concept in an image is decided by the spatial layout of its visually similar
images in the training dataset. The experimental results show that the hybrid
two-stage learning framework contributes to the improvement of image annotation
and localization. Furthermore, a few concepts are related to the concept
"face" (e.g., "head", "eye", "nose", "mouth" and "beard"). Since face detection
and facial point detection have been actively studied over the past years and
achieve satisfactory performance [
        <xref ref-type="bibr" rid="ref15 ref4">19, 4, 15</xref>
        ], we employ face detection and facial
point detection to localize the face-related concepts precisely.
      </p>
      <p>[Fig. 1. Overview of the proposed framework. An input image is annotated by classification (VGG19 CNN features + SVM) and by search (LSH-based visual similarity search over the 500K ImageCLEF dataset, followed by semantic reranking of the retrieved textual descriptions), and the predicted concepts are then localized by Fast R-CNN (selective search proposals, RoI pooling layer, softmax and bounding box regressor) and by search.]</p>
      <p>The remainder of this working note is structured as follows. Section 2 presents
the details of data preparation for model training. In Section 3, we elaborate
on how we obtain the image annotation results. Section 4
introduces the annotation localization approach. In Section 5, we discuss the
experimental results and the parameter settings. Finally, we conclude our participation
in ImageCLEF 2015 in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Data Preparation</title>
      <p>
        This year, hand-labeled data is allowed in the image annotation and
localization task. We draw on multiple online resources to perform the task, including the
ImageNet database [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the SUN database [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], WordNet [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the online
image sharing website Flickr (https://www.flickr.com/). To perform image annotation by classification,
we collect training images from the well-labeled ImageNet
and SUN datasets. There are 175 concepts that occur in both the ImageNet dataset
and the ImageCLEF task, and 217 concepts that occur in both
the SUN dataset and the ImageCLEF task. For the concepts in neither
ImageNet nor SUN, images are crawled from Flickr
and filtered by humans, with 50 images kept for each concept. In
our work, the visual features of images are represented with deep features [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]:
we employ the VGG19 model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] pretrained on the ImageNet dataset (1000
classes) and average the output of its relu6 layer over 10-view image patches (4
corner and 1 center patches of an image as well as their mirrors) as our visual
feature.
      </p>
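      <p>The feature extraction step can be sketched as follows. The working note used Caffe [9]; purely as an illustration, this sketch uses torchvision's pretrained VGG19, and the image path is a placeholder:</p>
      <preformat>
import torch
import torchvision.models as models
import torchvision.transforms as T
import torchvision.transforms.functional as TF
from PIL import Image

# Load VGG19 pretrained on ImageNet (1000 classes) and keep everything up to
# the relu6 activation (fc6 + ReLU), which yields a 4096-d vector per view.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
relu6 = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                            vgg.classifier[:2])

def extract_feature(path):
    img = Image.open(path).convert("RGB")
    crops = T.TenCrop(224)(T.Resize(256)(img))   # 4 corners + center + mirrors
    batch = torch.stack([TF.normalize(TF.to_tensor(c),
                                      mean=[0.485, 0.456, 0.406],
                                      std=[0.229, 0.224, 0.225])
                         for c in crops])
    with torch.no_grad():
        feats = relu6(batch)                     # (10, 4096), one row per view
    return feats.mean(dim=0)                     # average the 10 views
      </preformat>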
      <p>A development set of 1,979 images has been released to validate the proposed
method. The frequency of the different concepts is unbalanced, and 17 concepts do not
occur in the development dataset. We therefore collected some images for those
concepts to make the development set more suitable for setting hyper-parameters
and validating the proposed method.</p>
    </sec>
    <sec id="sec-3">
      <title>Concept Annotation</title>
      <sec id="sec-3-1">
<title>Annotation By Classification</title>
        <p>
          Image annotation by classification trains a multi-class classifier or one-vs-rest
classifiers corresponding to the different concepts. This solution is simple, and usually
achieves satisfactory performance given abundant training data. For this
task, we choose a linear Support Vector Machine (SVM) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and train a
one-vs-rest classifier for each concept. Since images in the training data are usually labelled with
multiple concepts, the negative samples for a given concept
classifier are selected as those images none of whose labels include the concept.
For a testing image, we select the most confident concepts by thresholding the
classification confidences to obtain the annotations of the image.
        </p>
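        <p>A minimal sketch of this stage, using scikit-learn's LinearSVC (which wraps the LIBLINEAR library of [6]); the feature matrices, the C value and the threshold are placeholders:</p>
        <preformat>
import numpy as np
from sklearn.svm import LinearSVC

def train_concept_classifiers(features, labels, concepts):
    """One-vs-rest: one linear SVM per concept.

    features: (n_images, 4096) deep features; labels: one set of concepts
    per image. Negatives for a concept are the images whose label set
    does not contain that concept.
    """
    classifiers = {}
    for c in concepts:
        y = np.array([1 if c in img_labels else 0 for img_labels in labels])
        clf = LinearSVC(C=1.0)          # C is a placeholder hyper-parameter
        clf.fit(features, y)
        classifiers[c] = clf
    return classifiers

def annotate(x, classifiers, threshold=0.4):
    # Keep the concepts whose classification confidence clears the threshold,
    # mirroring the SVM-threshold settings reported in the experiments.
    scores = {c: clf.decision_function(x.reshape(1, -1))[0]
              for c, clf in classifiers.items()}
    return {c for c, s in scores.items() if s > threshold}
        </preformat>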
      </sec>
      <sec id="sec-3-2">
        <title>Annotation By Search</title>
        <p>The search-based approach to image annotation works on the assumption that
visually similar images should reflect similar semantic concepts, and that most textual
information of web images is relevant to their visual content. Thus, the
search-based annotation process can be divided into two phases: one is the search for
similar images, and the other is relevant concept selection from the textual
information of those similar images.</p>
        <p>
          First, given a testing image with textual information, we search for its
similar neighbors over the whole 500,000-image dataset. As mentioned in Section 2, images
are represented with 4096-dimensional deep features. To speed up similar image
retrieval over the large-scale image database, we adopt a hash encoding
algorithm. Specifically, we map the deep features to 32768-dimensional binary hash
codes leveraging the random projection algorithm proposed in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and employ
the Hamming distance to rank the images in the dataset.
        </p>
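        <p>A sketch of this hashing step under the stated dimensions (random projection in the spirit of [2]; the seed and the use of a Gaussian projection matrix are our assumptions):</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
# Random projection matrix: 4096-d deep features to 32768 binary hash bits.
R = rng.standard_normal((4096, 32768))

def hash_codes(features):
    # The sign of each projection gives one bit; stored here as uint8 0/1.
    return (features @ R > 0).astype(np.uint8)

def rank_by_hamming(query_code, database_codes, top_n):
    # Hamming distance = number of differing bits.
    dists = np.count_nonzero(database_codes != query_code, axis=1)
    return np.argsort(dists)[:top_n]   # indices of the top-N_A neighbors
        </preformat>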
        <p>
          To further improve the results of the visual similarity search, we exploit the
textual information of the given image and perform a semantic similarity search
on the top-N_A visually similar images to rerank the similar image set. Here, we
use the publicly available Word2Vec tool [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to compute vector representations
of the textual information of images, which is provided in the form of scofeat
descriptors. With the word vector representations, the cosine distance is used to rerank
the images in order to obtain a set of visually and semantically similar images.
        </p>
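        <p>The reranking then reduces to a cosine similarity between text vectors; a sketch, assuming each image's scofeat text has already been collapsed into a single Word2Vec vector:</p>
        <preformat>
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def semantic_rerank(query_vec, neighbor_ids, text_vecs):
    # Reorder the top-N_A visual neighbors by textual similarity to the query.
    sims = [cosine_similarity(query_vec, text_vecs[i]) for i in neighbor_ids]
    order = np.argsort(sims)[::-1]     # most semantically similar first
    return [neighbor_ids[i] for i in order]
        </preformat>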
        <p>
          Next, the annotations for the testing image are mined from the textual
descriptions of the similar image set obtained above. For the annotation mining,
we employ a WordNet-based approach similar to the solution in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
The major difference is that we mine the concepts from a set of visually and
semantically similar images, while they considered only the visual similarities
among images. A candidate concept graph is built with the help of WordNet,
and the top-N_W concepts with the highest numbers of links are selected as the final
image description.
        </p>
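        <p>A sketch of the mining idea with NLTK's WordNet interface; the exact graph construction of [3] and of our variant differ in details, and the similarity measure and threshold here are placeholders:</p>
        <preformat>
from collections import Counter
from itertools import combinations
from nltk.corpus import wordnet as wn

def mine_concepts(candidate_words, n_w, sim_threshold=0.25):
    """Link candidate words whose WordNet senses are close; keep the
    top-N_W words with the most links as the image description."""
    def similarity(w1, w2):
        pairs = [(s1, s2) for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
        sims = [s1.path_similarity(s2) or 0.0 for s1, s2 in pairs]
        return max(sims) if sims else 0.0

    links = Counter()
    for w1, w2 in combinations(set(candidate_words), 2):
        if similarity(w1, w2) > sim_threshold:
            links[w1] += 1
            links[w2] += 1
    return [w for w, _ in links.most_common(n_w)]
        </preformat>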
        <p>We combine the results of the above classification-based solution and the
search-based solution with different thresholding settings; their respective
performances are discussed in the experimental section.</p>
        <p>Concept extension is adopted to deal with the strong correlations among
concepts and make the annotation result more complete. Among the given 251 concepts,
some have strong correlations, like "eyes" and "nose", which usually
occur together. Hierarchical relations may also exist, as between the concepts "apple" and "fruit":
when the child concept "apple" occurs, the parent concept "fruit"
must also occur. These relations are obtained by exploring the WordNet concept
hierarchy and the provided ImageCLEF concept set with its general-level categories.</p>
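        <p>For the hierarchical part of concept extension, a minimal sketch with NLTK's WordNet (the mapping from ImageCLEF concepts to synsets is assumed given):</p>
        <preformat>
from nltk.corpus import wordnet as wn

def extend_with_parents(predicted, concept_synsets, concept_set):
    """Add every ancestor concept implied by a predicted child concept,
    e.g. predicting "apple" also yields "fruit" if "fruit" is a task concept."""
    extended = set(predicted)
    for concept in predicted:
        synset = concept_synsets[concept]          # e.g. wn.synset('apple.n.01')
        for ancestor in synset.closure(lambda s: s.hypernyms()):
            for lemma in ancestor.lemma_names():
                if lemma in concept_set:           # keep only task concepts
                    extended.add(lemma)
    return extended
        </preformat>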
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Annotation Localization</title>
      <sec id="sec-4-1">
<title>Localization by Fast R-CNN</title>
        <p>
          To localize the objects in the given images, we follow the Fast R-CNN framework
(FRCN) proposed in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which provides a classification result and a regressed
location simultaneously for each candidate object proposal. The FRCN approach is
conducted on a number of regions of interest (RoIs), which are sampled from the
object proposals of the images. To reduce repetitive computation over the
overlapping regions, the last pooling layer of a traditional CNN is replaced with an RoI
pooling layer in the FRCN network, which markedly speeds
up the training and testing process. Furthermore, the network uses two sibling
loss terms as supervision to learn the classification and localization information
collaboratively, which proves helpful for improving the performance
and makes the approach a one-stage detection framework.
        </p>
        <p>Each RoI corresponds to a region in the feature map provided by the last
convolutional layer. The RoI pooling layer carries out max pooling on each of
the corresponding regions and pools them into fixed-size feature maps. The scale
of the pooling mask in the RoI pooling layer is auto-adjusted according to the spatial
size of the input feature map region, so that the outputs all have the same
size. Therefore, after RoI pooling, the feature map of each RoI can feed the following fully
connected layers seamlessly and contribute to the network as an
independent instance.</p>
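        <p>A minimal numpy sketch of this adaptive pooling (the 7x7 output grid follows the VGG-based setting of [8]; batching and back-propagation are omitted):</p>
        <preformat>
import numpy as np

def roi_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one RoI of a (C, H, W) feature map into a fixed
    (C, out_h, out_w) grid; the bin size adapts to the RoI size."""
    c = feature_map.shape[0]
    x1, y1, x2, y2 = roi                       # RoI in feature-map coordinates
    ys = np.linspace(y1, y2, out_h + 1).astype(int)
    xs = np.linspace(x1, x2, out_w + 1).astype(int)
    out = np.zeros((c, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out
        </preformat>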
        <p>As for the supervision, the FRCN network employs two sibling output layers
to predict the classification probability and the bounding box regression offsets,
respectively, for each RoI on each category. The first output layer is a softmax layer
which outputs a probability distribution over all categories, and we use the
standard cross-entropy loss function to constrain it as follows:</p>
        <p>
          $$L_{cls} = -\log(\hat{p}_k) \qquad (1)$$
where $k$ is the ground-truth label and $\hat{p}_k$ is the predicted probability for this
class, assuming that there are in total $K+1$ categories, comprising $K$ object classes
and a background class. The second output layer is a regression layer which
predicts the bounding box regression offsets for each category as $t^k = (t^k_x, t^k_y, t^k_h, t^k_w)$,
where $k$ is the index of the $K$ object classes. Assuming that the ground-truth
offsets for class $k$ are $t = (t_x, t_y, t_h, t_w)$, the regression loss function is
formulated as follows:
$$L_{loc} = \sum_{i \in \{x,y,h,w\}} \mathrm{smooth}_{L_1}(t^k_i - t_i) \qquad (2)$$
where
$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 &amp; \text{if } |x| &lt; 1 \\ |x| - 0.5 &amp; \text{otherwise} \end{cases} \qquad (3)$$
Thus, the loss function for the whole network can be formulated as follows:
$$L = L_{cls} + \lambda \, [k \geq 1] \, L_{loc} \qquad (4)$$
where $\lambda$ is a weighting parameter to balance the two loss terms, and $[k \geq 1]$ is
an indicator function with the convention that the background class is labeled
as $k = 0$ and the object classes as $k = 1, 2, \ldots, K$, so that the
localization regression loss term is ignored for the background RoIs.
        </p>
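        <p>For concreteness, Eqs. (1)-(4) for a single RoI in a few lines of numpy (a sketch only; training uses the deep learning framework's differentiable implementation):</p>
        <preformat>
import numpy as np

def smooth_l1(x):
    # Eq. (3): quadratic near zero, linear elsewhere.
    x = np.abs(x)
    return np.where(x &lt; 1.0, 0.5 * x**2, x - 0.5)

def frcn_loss(probs, k, t_pred, t_gt, lam=1.0):
    """Eq. (4) for one RoI: probs is the softmax output, k the ground-truth
    class (0 = background), t_pred/t_gt the 4 offsets for class k."""
    l_cls = -np.log(probs[k])                                    # Eq. (1)
    l_loc = smooth_l1(t_pred - t_gt).sum() if k >= 1 else 0.0    # Eq. (2)
    return l_cls + lam * l_loc
        </preformat>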
        <p>
          In practice, we first extract object proposals with the selective search method
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and then sample RoIs from them for training. We take 25% of the RoIs
from the object proposals that overlap some ground-truth bounding box with
more than 0.5 IoU (intersection over union) and label them the same as the
ground-truth bounding boxes. The remaining RoIs are sampled from the object
proposals whose maximum IoU with the ground-truth bounding
boxes lies between 0.1 and 0.5, and are labeled as background, as instructed in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. During testing,
we input all the object proposals into the network and predict labels and
regression offsets for all of them. A preliminary screening is also implemented in this
step with non-maximum suppression, to discard object proposals with too
low classification probabilities.
        </p>
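        <p>The RoI labeling rule can be sketched as follows, with boxes in (x1, y1, x2, y2) format; the IoU thresholds follow the text, and a training batch is then composed with 25% foreground RoIs:</p>
        <preformat>
import numpy as np

def iou(box, gt_boxes):
    # Intersection over union of one box against every ground-truth box.
    x1 = np.maximum(box[0], gt_boxes[:, 0]); y1 = np.maximum(box[1], gt_boxes[:, 1])
    x2 = np.minimum(box[2], gt_boxes[:, 2]); y2 = np.minimum(box[3], gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    gt_area = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area + gt_area - inter)

def label_rois(proposals, gt_boxes, gt_labels):
    # Foreground: IoU above 0.5 with some ground truth;
    # background: maximum IoU in [0.1, 0.5).
    fg, bg = [], []
    for p in proposals:
        overlaps = iou(p, gt_boxes)
        best = overlaps.argmax()
        if overlaps[best] > 0.5:
            fg.append((p, gt_labels[best]))
        elif 0.1 &lt;= overlaps[best] &lt; 0.5:
            bg.append((p, 0))            # class 0 = background
    return fg, bg
        </preformat>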
        <sec id="sec-4-1-1">
          <title>Valley</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Shore</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>Bathroom</title>
          <p>step with non-maximum suppression to take out some object proposals with too
low classi cation probabilities.
4.2</p>
          <p>Localization by Search
We give a special consideration to the scene related concepts for the annotation
localization ( e.g., \beach", \sea", \river" and \valley"). If an image is predicted
as a scenery concept, we rst nd its top-NL visually similar neighbors with the
same concept in the localization training data, and use their merged bounding
box as the location of the scenery concept. Figure 2 shows some examples about
search-based location results. The nal localization results will be composed of
the predicted results of the Fast R-CNN model and search-based localization
results.
4.3</p>
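        <p>The "merged bounding box" is not further specified; one plausible reading is the union of the neighbors' boxes, sketched here with boxes as (x1, y1, x2, y2) arrays:</p>
        <preformat>
import numpy as np

def merged_bounding_box(neighbor_boxes):
    """Union box of the top-N_L neighbors' boxes for a scene concept:
    the tightest box containing all of them."""
    boxes = np.asarray(neighbor_boxes)
    return np.array([boxes[:, 0].min(), boxes[:, 1].min(),
                     boxes[:, 2].max(), boxes[:, 3].max()])
        </preformat>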
      </sec>
      <sec id="sec-4-3">
        <title>Localization of Face Related Concepts</title>
        <p>
          There are a few concepts related to the concept "face" (e.g., "head", "eye",
"nose", "mouth" and "beard"). To localize the face-related concepts precisely, we
employ a face detection algorithm with aggregate channel features [
          <xref ref-type="bibr" rid="ref4">19, 4</xref>
          ]. Facial
point detection is performed to locate the key points of the face [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The localization
of the concepts "eye", "nose", "mouth" and "beard" is obtained from the key
points. The concepts "ear" and "neck" are located from the relative position
of the face, using empirically chosen offsets. Besides, linear classifiers [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] are trained with SIFT
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] features extracted at the facial points to determine the concepts "man",
"woman", "male child" and "female child".
        </p>
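        <p>As an illustration of deriving concept boxes from facial key points (the box sizes here are purely our assumption; the working note only states that localization is based on the key points):</p>
        <preformat>
import numpy as np

def boxes_from_landmarks(face_box, landmarks):
    """landmarks: dict with 'left_eye', 'right_eye', 'nose', 'mouth' points,
    as produced by a 5-point facial landmark detector such as [15].
    Each concept box is a fixed fraction of the face size around its point."""
    fw = face_box[2] - face_box[0]
    fh = face_box[3] - face_box[1]
    def around(pt, w_frac, h_frac):
        w, h = fw * w_frac / 2, fh * h_frac / 2
        return np.array([pt[0] - w, pt[1] - h, pt[0] + w, pt[1] + h])
    return {
        "eye":   around(landmarks["left_eye"], 0.25, 0.15),
        "nose":  around(landmarks["nose"], 0.30, 0.25),
        "mouth": around(landmarks["mouth"], 0.40, 0.20),
    }
        </preformat>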
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>[Fig. 3. Mean average precision of settings 1-5 as the required overlap with the ground-truth labels increases from 10% to 90%.]</p>
      <p>
We have submitted 8 runs with 5 different settings of combinations of the
above model modules, including Annotation By Classification (ABC),
Annotation By Search (ABS), Localization by Fast R-CNN (FRCN), Localization By
Search (LBS) and Concept Extension (CE). Some runs share the same setting
with different parameter values.</p>
      <table-wrap id="tab1">
        <label>Table 1.</label>
        <caption>
          <p>Mean average precision of the five settings at 50% and 0% overlap with the ground-truth labels.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Method</th><th>ABC</th><th>CE</th><th>ABS</th><th>LBS</th><th>FRCN</th><th>SVM Threshold</th><th>Overlap 0.5</th><th>Overlap 0</th></tr>
          </thead>
          <tbody>
            <tr><td>Setting 1</td><td>yes</td><td>no</td><td>yes</td><td>yes</td><td>yes</td><td>0.5</td><td>0.510</td><td>0.642</td></tr>
            <tr><td>Setting 2</td><td>yes</td><td>yes</td><td>yes</td><td>yes</td><td>yes</td><td>0.4</td><td>0.510</td><td>0.635</td></tr>
            <tr><td>Setting 3</td><td>yes</td><td>yes</td><td>no</td><td>yes</td><td>yes</td><td>0.4</td><td>0.486</td><td>0.613</td></tr>
            <tr><td>Setting 4</td><td>yes</td><td>no</td><td>no</td><td>yes</td><td>yes</td><td>0.4</td><td>0.432</td><td>0.552</td></tr>
            <tr><td>Setting 5</td><td>no</td><td>no</td><td>no</td><td>no</td><td>yes</td><td>0.4</td><td>0.368</td><td>0.469</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The experimental results of our method with the different settings are presented
in Table 1. The last two columns show the mean average precision at
different overlap percentages with the ground-truth labels. Detailed results with increasing
overlap percentage are shown in Fig. 3. We find that setting 1, with the modules
ABC, ABS, FRCN and LBS, achieves the best result, and that setting 2, which extends
setting 1 with CE, achieves comparable results.</p>
      <p>[Fig. 4. Mean average precision with 50% overlap with the GT labels for all submitted runs; our runs are denoted IVANLPR-*.]</p>
      <p>
        By comparing the results of setting 2 and setting 3, we can see the effectiveness
of taking annotation by search into consideration. Furthermore, we have validated the effect of FRCN, and we
find that the result of localization by detection alone is unsatisfactory. This is mainly
due to two reasons: one is that the training data is very limited for each concept,
and the other is that the content of web images does not have obvious objectness
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for some concepts. The proposed hybrid learning framework with a two-stage
process is more suitable for such a task. Comparisons of our runs
(denoted IVANLPR-*) and the other participants' runs are illustrated in Figure 4. The
submitted runs of our team achieved the second place among the different
teams, which shows the effectiveness of the proposed hybrid two-stage learning
framework for the scalable annotation and localization task.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we described the participation of the IVANLPR team in the ImageCLEF
2015 Scalable Concept Image Annotation task. We proposed a hybrid learning
framework to solve the scalable annotation task, adopting a two-stage solution to
first annotate images with possible concepts and then localize the concepts in the
images. In the first stage, both a supervised method and an unsupervised method are
adopted to make full use of the available hand-labeled data and the surrounding text
of the webpages. In the second stage, Fast R-CNN and a search-based method are
adopted to locate the annotated concepts. Extensive experiments demonstrate
the effectiveness of the proposed hybrid two-stage learning framework for the
scalable annotation task.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments.</title>
      <p>This work was supported by the 973 Program (2012CB316304) and the National Natural
Science Foundation of China (61332016, 61272329 and 61472422).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alexe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deselaers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrari</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>What is an object?</article-title>
          .
          <source>In: CVPR</source>
          . pp.
          <volume>73</volume>
          –
          <issue>80</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bingham</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannila</surname>
          </string-name>
          , H.:
          <article-title>Random projection in dimensionality reduction: Applications to image and text data</article-title>
          .
          <source>In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <volume>245</volume>
          –
          <fpage>250</fpage>
          . KDD,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Budíková</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Botorek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zezula</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>DISA at ImageCLEF 2014: The search-based solution for scalable image annotation</article-title>
          .
          <source>In: Working Notes for CLEF 2014 Conference</source>
          . pp.
          <volume>360</volume>
          –
          <issue>371</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dollar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Appel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Fast feature pyramids for object detection</article-title>
          .
          <source>Pattern Analysis and Machine Intelligence</source>
          ,
          <source>IEEE Transactions on 36(8)</source>
          ,
          <volume>1532</volume>
          –
          <fpage>1545</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Everingham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gool</surname>
            ,
            <given-names>L.J.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.K.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winn</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The pascal visual object classes (VOC) challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>88</volume>
          (
          <issue>2</issue>
          ),
          <volume>303</volume>
          –
          <fpage>338</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>J. Mach. Learn. Res. 9</source>
          ,
          <year>1871</year>
          –1874 (Jun
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dellandrea</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaizauskas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task</article-title>
          .
          <source>In: CLEF2015 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Toulouse,
          <source>France (September 8-11</source>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Fast R-CNN</article-title>
          .
          <source>arXiv preprint arXiv:1504.08083</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shelhamer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karayev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guadarrama</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
          </string-name>
          , T.:
          <article-title>Caffe: Convolutional architecture for fast feature embedding</article-title>
          .
          <source>arXiv preprint arXiv:1408.5093</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>60</volume>
          (
          <issue>2</issue>
          ),
          <volume>91</volume>
          –
          <fpage>110</fpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          :
          <article-title>Wordnet: A lexical database for english</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <issue>11</issue>
          ),
          <volume>39</volume>
          –
          <fpage>41</fpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Russakovsky</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Ma,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Berg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.C.</given-names>
            ,
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          :
          <article-title>Imagenet large scale visual recognition challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Deep convolutional network cascade for facial point detection</article-title>
          .
          <source>In: CVPR</source>
          . pp.
          <volume>3476</volume>
          –
          <fpage>3483</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Uijlings</surname>
            ,
            <given-names>J.R</given-names>
          </string-name>
          ., van de Sande,
          <string-name>
            <given-names>K.E.</given-names>
            ,
            <surname>Gevers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Smeulders</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.W.</surname>
          </string-name>
          :
          <article-title>Selective search for object recognition</article-title>
          .
          <source>International journal of computer vision 104(2)</source>
          ,
          <volume>154</volume>
          –
          <fpage>171</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. Villegas, M., Müller, H., Gilbert, A., Piras, L., Wang, J., Mikolajczyk, K., de Herrera, A.G.S., Bromuri, S., Amin, M.A., Mohammed, M.K., Acar, B., Uskudarli, S., Marvasti, N.B., Aldana, J.F., del Mar Roldán García, M.: General Overview of ImageCLEF at the CLEF 2015 Labs. Lecture Notes in Computer Science, Springer International Publishing (2015)
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: CVPR. pp. 3485–3492 (2010)
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19. Yang, B., Yan, J., Lei, Z., Li, S.Z.: Aggregate channel features for multi-view face detection. In: IJCB. pp. 1–8. IEEE (2014)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>