<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xirong Li</string-name>
          <email>xirong@ruc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qin Jin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuai Liao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junwei Liang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xixi He</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujia Huo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weiyu Lan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bin Xiao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanxiong Lu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jieping Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Multimedia Computing Lab, School of Information, Renmin University of China</institution>
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pattern Recognition Center, WeChat Technical Architecture Department</institution>
          ,
          <addr-line>Tencent</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we summarize our experiments in the ImageCLEF 2015 Scalable Concept Image Annotation challenge. The RUC-Tencent team participated in all subtasks: concept detection and localization, and image sentence generation. For concept detection, we experiment with automated approaches to gather high-quality training examples from the Web, in particular visual disambiguation by Hierarchical Semantic Embedding. Per concept, an ensemble of linear SVMs is trained by Negative Bootstrap, with CNN features as the image representation. Concept localization is achieved by classifying object proposals generated by Selective Search. For the sentence generation task, we adopt Google's LSTM-RNN model, train it on the MSCOCO dataset, and fine-tune it on the ImageCLEF 2015 development dataset. We further develop a sentence re-ranking strategy based on the concept detection information from the first task. Overall, our system is ranked 3rd for concept detection and localization, and is the best for image sentence generation in both the clean and noisy tracks.</p>
      </abstract>
      <kwd-group>
        <kwd>Concept detection</kwd>
        <kwd>Concept localization</kwd>
        <kwd>Image Captioning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Hierarchical Semantic Embedding</kwd>
        <kwd>Negative Bootstrap</kwd>
        <kwd>CNN</kwd>
        <kwd>LSTM-RNN</kwd>
        <kwd>Sentence Reranking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This year we participated in all the subtasks, i.e., concept detection and
localization and image sentence generation, in the ImageCLEF 2015 Scalable Concept
Image Annotation challenge. In addition to the 500k web image set (webupv2015
hereafter) and the 2k development set (Dev2k) provided by the organizers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we
leverage a number of external resources, listed below:
- Task 2: MSCOCO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a pretrained VGGNet CNN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and a pretrained
word2vec model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
(X. Li and Q. Jin contributed equally to this work.)
      </p>
      <p>Next, we introduce in Section 2 our concept detection and localization system,
followed by our image sentence generation system in Section 3.</p>
    </sec>
    <sec id="sec-1a">
      <title>Task 1: Concept Detection and Localization</title>
      <p>
        We develop a concept detection and localization system that learns concept
classifiers from image-level annotations and localizes concepts within an image by
region-level classification, as illustrated in Fig. 1. Compared with our earlier
systems [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], this year we make two technical improvements for concept modeling.
First, we replace the bag-of-visual-words features with an off-the-shelf CNN feature,
i.e., the last fully connected layer (fc7) of Caffe [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Second, we employ
Hierarchical Semantic Embedding [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for concept disambiguation, which is found to be
effective for acquiring positive examples for some ambiguous concepts such as
`mouse', `basin', and `blackberry'.
      </p>
      <p>[Fig. 1: System overview. Training (image-level concept modeling): tag-based image search over web / Flickr images, positive example selection by HierSE, CNN feature extraction, and Negative Bootstrap. Test (concept detection and localization): Selective Search object proposals, CNN feature extraction, prediction, and refinement.]</p>
      <p>
        Positive Training Examples Since hand-labeled data is allowed this year, we
collect positive training examples from multiple sources of data, including
1. ImageNet. For 208 of the 251 ImageCLEF concepts, we can find labeled
examples from ImageNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
2. webupv2015. It consists of the 500K web images provided by the organizers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
3. flickr2m. We start with the 1.2 million Flickr image set from [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and extend
it by adding more Flickr images labeled with the ImageCLEF concepts,
resulting in a set of two million images.
      </p>
      <p>
        Since the annotations of webupv2015 and flickr2m are noisy, per concept we
collect its positive training examples by a two-step procedure. In the first step,
we conduct tag-based image search using the tags of the concept as the query to
generate a set of candidate images. For webupv2015, an image is chosen as a
candidate as long as its metadata, e.g., URL and query logs, overlap the tags. In
the second step, these candidate images are sorted in descending order in terms
of their relevance to the concept, and the top-n ranked images are preserved
as the positive training set. To compute the relevance score between the given
concept and a specific image, we embed them into a common semantic space
by the Hierarchical Semantic Embedding (HierSE) algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Consequently,
the cross-media relevance score is computed as the cosine similarity between the
embedding vectors.
      </p>
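      <p>To make the second step concrete, the following is a minimal sketch (not our exact implementation) of the relevance-based ranking and top-n selection; the embed_image callable, which stands in for the HierSE image embedding, is an assumption:</p>
      <preformat preformat-type="code"><![CDATA[
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def rank_candidates(concept_vec, candidate_images, embed_image, top_n=1000):
    """Sort candidate images by their cross-media relevance to the concept and
    keep the top-n as (pseudo) positive training examples."""
    scored = [(img, cosine(concept_vec, embed_image(img))) for img in candidate_images]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
]]></preformat>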
      <p>
        Since HierSE takes the WordNet hierarchy into account, it can resolve
semantic ambiguity by embedding a label into distinct vectors, depending on its
position in WordNet. Consider the label `basin' for instance. In the context of
ImageCLEF, it is `basin.n.01', referring to a bowl-shaped vessel. On the other
hand, it can also be `basin.n.03', meaning a natural depression in the surface of
the land. In HierSE, a label with a given sense is represented by a convex
combination of the embedding vectors of this label and its ancestors tracing back to
the root. Fig. 2 shows the top-ranked images of `basin' from flickr2m, returned
by the tag relevance (tagrel) algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and HierSE, respectively.
      </p>
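      <p>A minimal sketch of this sense-specific embedding, assuming NLTK's WordNet interface and a word2vec-style lookup word_vec; the decay weight beta is an illustrative choice rather than the value used in [4]:</p>
      <preformat preformat-type="code"><![CDATA[
import numpy as np
from nltk.corpus import wordnet as wn  # assumes the WordNet corpus is installed

def hierarchical_embedding(synset_name, word_vec, beta=0.5):
    """Embed a WordNet sense (e.g. 'basin.n.01') as a convex combination of the
    embedding of the label and the embeddings of its ancestors up to the root."""
    path = wn.synset(synset_name).hypernym_paths()[0]  # root -> ... -> synset
    nodes = list(reversed(path))                       # synset first, then ancestors
    weights = np.array([beta ** i for i in range(len(nodes))])
    weights = weights / weights.sum()                  # convex combination
    vecs = np.stack([word_vec(s.lemma_names()[0]) for s in nodes])
    return np.sum(weights[:, None] * vecs, axis=0)
]]></preformat>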
      <p>Given the HierSE ranked images per concept, we empirically preserve the
top n = 1000 images as positive training examples. Since the number of genuine
positives varies over concepts, this year we also consider an adaptive selection
strategy. We train an SVM model using the top 20 images as positives and 20
images at the bottom as negatives. The previously ranked images are classified
by the model and labeled as positive if their classification scores exceed 0. We
denote this strategy as HierSE + SVM.</p>
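      <p>The adaptive HierSE + SVM selection can be sketched as follows, with scikit-learn's LinearSVC standing in for LIBLINEAR and ranked_features assumed to hold the CNN features of the HierSE-ranked images, in ranked order:</p>
      <preformat preformat-type="code"><![CDATA[
import numpy as np
from sklearn.svm import LinearSVC

def adaptive_positive_selection(ranked_features, n_head=20, n_tail=20):
    """Train an SVM on the top-ranked images (pseudo positives) versus the
    bottom-ranked images (pseudo negatives); keep the ranked images whose
    decision score exceeds 0 as the final positive training set."""
    X = np.vstack([ranked_features[:n_head], ranked_features[-n_tail:]])
    y = np.array([1] * n_head + [-1] * n_tail)
    svm = LinearSVC(C=1.0).fit(X, y)
    scores = svm.decision_function(ranked_features)
    return np.where(scores > 0)[0]  # indices of the selected positives
]]></preformat>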
      <p>
        Negative Bootstrap To effectively exploit an overwhelming number of (pseudo)
negative examples, we learn an ensemble of linear SVMs by the Negative
Bootstrap algorithm [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The base classifiers are trained using LIBLINEAR [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
The classifier ensemble is compressed into a single model using the technique
developed in [14] so that the prediction time complexity is independent of the
ensemble size, and linear w.r.t. the visual feature dimension.
      </p>
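      <p>For illustration, the Negative Bootstrap loop [12] can be sketched as below; the pool sizes and the number of iterations are assumptions, LinearSVC again stands in for LIBLINEAR, and the base classifiers are simply averaged instead of being compressed as in [14]:</p>
      <preformat preformat-type="code"><![CDATA[
import numpy as np
from sklearn.svm import LinearSVC

def negative_bootstrap(pos, neg_pool, n_iter=10, sample_size=2000, n_select=1000):
    """Iteratively pick the most misclassified (pseudo) negatives and train an
    ensemble of linear SVMs on them together with the fixed positives."""
    ensemble = []
    for _ in range(n_iter):
        idx = np.random.choice(len(neg_pool), sample_size, replace=False)
        sampled = neg_pool[idx]
        if ensemble:
            # relevant negatives: the ones the current ensemble scores highest
            scores = np.mean([clf.decision_function(sampled) for clf in ensemble], axis=0)
            sampled = sampled[np.argsort(-scores)[:n_select]]
        else:
            sampled = sampled[:n_select]
        X = np.vstack([pos, sampled])
        y = np.array([1] * len(pos) + [-1] * len(sampled))
        ensemble.append(LinearSVC(C=1.0).fit(X, y))
    return ensemble

def ensemble_predict(ensemble, X):
    # Average the decision values of the base classifiers.
    return np.mean([clf.decision_function(X) for clf in ensemble], axis=0)
]]></preformat>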
      <p>
        To evaluate the multiple methods for selecting positive and negative
examples, we split the ImageNet set into two disjoint subsets, imagenet-clef15train
and imagenet-clef15test, which contain positive examples for the 208
ImageCLEF concepts. Using imagenet-clef15test as the test set, the performance of the
individual methods is summarized in Table 1. For webupv2015, the combination
of HierSE + SVM and Negative Bootstrap performs the best. For flickr2m, the
combination of HierSE and Negative Bootstrap performs the best. Still, there is
a substantial gap between the model trained on auto-selected examples (MAP
0.375) and the model trained on hand-labeled examples (MAP 0.567). So for
the 208 concepts, we combine all the models trained on imagenet-clef15train,
webupv2015, and flickr2m, while for the other 43 concepts, we combine all the
models trained on webupv2015 and flickr2m. Although learned weights are found
to be helpful according to our previous experiences [
        <xref ref-type="bibr" rid="ref8 ref15">8, 15</xref>
        ], we use model
averaging in this evaluation due to time constraints.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Concept Localization</title>
      <p>Object Proposal Generation For each of the 500k images, we use Selective
Search [16] to get a number of bounding-box object proposals. In particular, the
image is first over-segmented by the graph-based image segmentation algorithm
[17]. The segmented regions are iteratively merged by hierarchical grouping as
described in [16]. Since a CNN feature has to be extracted per bounding box, the
number of object proposals kept per image is set to 20, given our computational budget.
Detection Given an image and its 20 object proposals, we classify each of
them using the 251 concept models, and label it with the concept of maximum
response. Notice that for each concept, we have two choices to implement its
model. One choice is HierSE, as it was developed for zero-shot learning, and
is thus directly applicable to compute cross-media relevance scores. The other
choice is linear SVMs trained on the selected positives combined with Negative
Bootstrap. HierSE is the same as used for positive example selection, while the
SVMs used here are distinct from the SVMs in Section 2.1.</p>
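      <p>The detection step amounts to scoring every proposal with all 251 concept models and keeping the concept of maximum response per proposal. A sketch, where each concept model is assumed to expose a scoring callable (either the HierSE relevance or the compressed SVM):</p>
      <preformat preformat-type="code"><![CDATA[
import numpy as np

def detect(proposal_features, concept_models):
    """proposal_features: (n_proposals, d) CNN features of the object proposals.
    concept_models: dict mapping concept name -> scoring callable.
    Returns one (concept, score) pair per proposal, taking the maximum response."""
    names = list(concept_models)
    scores = np.array([[concept_models[c](f) for c in names] for f in proposal_features])
    best = scores.argmax(axis=1)
    return [(names[j], float(scores[i, j])) for i, j in enumerate(best)]
]]></preformat>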
      <p>To reduce false alarms, we refine the detections as follows. For object proposals
labeled as the same concept, if their number is lower than a given Minimum
Detection threshold md (we tried 1, 2, and 3), they are discarded, otherwise we
sort them in descending order in terms of detection scores. We go through the
ranked proposals, preserving a proposal if its bounding box has less than 30%
overlap with the previously preserved proposals.
</p>
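      <p>A sketch of this refinement under stated assumptions: intersection-over-union is used as the overlap measure here, which may differ from the exact measure in our system, while md and max_overlap mirror the thresholds described above:</p>
      <preformat preformat-type="code"><![CDATA[
def iou(a, b):
    # Boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter + 1e-12)

def refine(detections, md=2, max_overlap=0.3):
    """detections: list of (concept, score, box) tuples for one image.
    Drop concepts detected on fewer than md proposals, then greedily keep
    high-scoring, weakly overlapping proposals per concept."""
    kept = []
    for concept in {c for c, _, _ in detections}:
        group = [d for d in detections if d[0] == concept]
        if len(group) < md:
            continue
        group.sort(key=lambda d: d[1], reverse=True)
        selected = []
        for _, score, box in group:
            if all(iou(box, b) < max_overlap for _, _, b in selected):
                selected.append((concept, score, box))
        kept.extend(selected)
    return kept
]]></preformat>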
    </sec>
    <sec id="sec-3">
      <title>Submitted Runs</title>
      <p>We submitted eight runs in the concept detection and localization task.
ruc_task1_hierse_md1 is our baseline run, directly using HierSE to compute
the relevance score between an object proposal and a concept, with the Minimum
Detection threshold md set to 1, namely no removal.
ruc_task1_hierse_md2 is the same as ruc_task1_hierse_md1, but with md = 2.
ruc_task1_hierse_md3 is the same as ruc_task1_hierse_md1, but with md = 3.
ruc_task1_svm_md1 uses SVMs for concept detection, with md = 1.
ruc_task1_svm_md2 is the same as ruc_task1_svm_md1, but with md = 2.
ruc_task1_svm_md3 is the same as ruc_task1_svm_md1, but with md = 3.
ruc_task1_svm_md2_nostar is the same as ruc_task1_svm_md2, but
empirically thresholds the detection results of the two concepts `planet' and `star', as
they tend to be over-fired.
ruc_task1_svm_md3_nostar is the same as ruc_task1_svm_md3, but
empirically thresholds the detection results of the two concepts `planet' and `star'.
Result Analysis The performance of the eight runs is shown in Fig. 3. We
attribute the low performance of the HierSE runs to two reasons.
First, HierSE is designed for zero-shot learning; it does not use any examples of
the ImageCLEF concepts. Second, the current implementation of HierSE relies
on the ImageNet 1k label set [18] to embed an image, while the 1k set is only
loosely related to the ImageCLEF concepts. We can also see that the Minimum
Detection threshold is helpful, improving MAP_0.5Overlap from 0.452 to 0.496.
[Fig. 3: MAP_0.5Overlap of our eight runs, compared with runs from other teams.]</p>
    </sec>
    <sec id="sec-4">
      <title>Task 2: Image Sentence Generation</title>
      <p>
We have used the MSCOCO dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] in addition to the Dev2k dataset
provided by the organizers. For the Dev2k dataset, we split it into a training set
with 1,600 images, a validation set with 200 images, and a test set with 200
images.
</p>
      <p>Our image description system is built on the deep model proposed by Vinyals
et al. from Google [19]. The model contains the following key components:
(1) a Convolutional Neural Network (CNN) for image encoding, (2) a
Long Short-Term Memory based Recurrent Neural Network (LSTM-RNN) for
sentence encoding, and (3) an LSTM-RNN for sentence decoding. In our system, as
shown in Fig. 4, we use the pre-trained VGGNet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for CNN feature extraction.
The LSTM-RNN implementation is from the NeuralTalk project (https://github.com/karpathy/neuraltalk). The encoding
LSTM-RNN and the decoding LSTM-RNN are shared.
      </p>
      <p>In the training stage, we first train the LSTM-RNN on the MSCOCO dataset.
We then fine-tune the model on the ImageCLEF Dev2k dataset using a low
learning rate. Beam search is used in text decoding as in [19]. We finally re-rank
the hypothesis sentences utilizing the concept detection results. Given an image,
our system first generates the k best sentences with confidence scores, while the
concept detection system from Task 1 provides the m best detected concepts with
confidence scores. If a detected concept appears in a hypothesis sentence, we
call it a matched concept. We then compute the ranking score for each hypothesis
sentence by matching the detected concepts with the hypothesis sentences as
follows:</p>
      <p>RankScore(hypoSent) = λ · conceptScore + (1 − λ) · sentenceScore,   (1)
where conceptScore refers to the average of the confidence scores of all the
matched concepts in the hypothesis sentence (hypoSent), and sentenceScore refers
to the confidence score assigned to the hypothesis sentence by the RNN model.
The parameter λ has been tuned on the Dev2k validation set. Fig. 5 showcases
some examples of how re-ranking helps the system produce better image
descriptions.</p>
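      <p>A minimal sketch of the re-ranking rule in Eq. (1); the value of λ shown here is illustrative, whereas the value actually used was tuned on the Dev2k validation set:</p>
      <preformat preformat-type="code"><![CDATA[
def rerank(hypotheses, detected_concepts, lam=0.5):
    """hypotheses: list of (sentence, sentence_score) pairs from beam search.
    detected_concepts: dict mapping concept -> confidence score from Task 1.
    Returns the hypotheses sorted by the combined RankScore of Eq. (1)."""
    def rank_score(sentence, sentence_score):
        matched = [s for c, s in detected_concepts.items() if c in sentence]
        concept_score = sum(matched) / len(matched) if matched else 0.0
        return lam * concept_score + (1.0 - lam) * sentence_score
    return sorted(hypotheses, key=lambda h: rank_score(*h), reverse=True)
]]></preformat>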
      <p>In the clean track, "golden" concepts, i.e., ground-truth concepts obtained by hand
labelling, are provided, so we set the confidence score of each concept to 1. In
this condition, if the top-ranked sentence does not contain any of the "golden"
concepts, the word in the top-ranked sentence that is closest to any of the
"golden" concepts is replaced by that concept. The word distance is computed
using a pre-trained word2vec model (code.google.com/p/word2vec) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].</p>
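      <p>For the clean-track substitution just described, a small sketch using gensim to query the pre-trained word2vec model; the model path and the whitespace tokenization are assumptions:</p>
      <preformat preformat-type="code"><![CDATA[
from gensim.models import KeyedVectors

# Assumed local path to the pre-trained word2vec model.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def inject_golden_concept(sentence, golden_concepts):
    """If none of the golden concepts occurs in the sentence, replace the word
    closest (by word2vec similarity) to any golden concept with that concept."""
    words = sentence.split()
    if any(c in words for c in golden_concepts):
        return sentence
    best = None  # (similarity, word index, concept)
    for i, w in enumerate(words):
        for c in golden_concepts:
            if w in w2v and c in w2v:
                sim = w2v.similarity(w, c)
                if best is None or sim > best[0]:
                    best = (sim, i, c)
    if best is not None:
        words[best[1]] = best[2]
    return " ".join(words)
]]></preformat>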
      <sec id="sec-3-2">
        <title>2 code.google.com/p/word2vec</title>
        <p>
          3.2
We submitted six runs for the noisy track, and two runs for the clean track.
[Noisy track] RUC_run1_dev2k is our baseline, trained on the Dev2k dataset
without sentence re-ranking.
[Noisy track] RUC_run2_dev2k_rerank-hierse is trained on the Dev2k
dataset, using the 251 concepts predicted by HierSE [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for sentence re-ranking.
[Noisy track] RUC_run3_dev2k_rerank-tagrel is trained on the Dev2k
dataset, using the Flickr tags predicted by the tag relevance algorithm [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for
sentence re-ranking.
[Noisy track] RUC_run4_finetune-mscoco is first trained on the MSCOCO
dataset, and then fine-tuned (with a relatively low learning rate) on the Dev2k
dataset, without sentence re-ranking.
[Noisy track] RUC_run5_finetune-mscoco_rerank-hierse is trained the
same way as RUC_run4_finetune-mscoco, using the 251 concepts predicted
by HierSE for sentence re-ranking.
[Noisy track] RUC_run6_finetune-mscoco_rerank-tagrel is trained the
same way as RUC_run4_finetune-mscoco, using the Flickr tags predicted by
the tag relevance algorithm for sentence re-ranking.
[Clean track] RUC_run1_dev2k is trained on the Dev2k dataset, with
sentence re-ranking using the provided "golden" concepts.
[Clean track] RUC_run2_finetune-mscoco is first trained on the MSCOCO
dataset and then fine-tuned on Dev2k, with sentence re-ranking using the
provided "golden" concepts.
        </p>
        <p>Result Analysis The performance of the noisy-track runs is shown in Fig. 6(a).
We see that fine-tuning improves the system performance (RUC_run1_dev2k
0.1659 versus RUC_run4_finetune-mscoco 0.1759). Sentence re-ranking helps as
well (RUC_run1_dev2k 0.1659 versus RUC_run2_dev2k_rerank-hierse 0.1781).</p>
        <p>The performance of the clean-track runs is shown in Fig. 6(b). It shows that
the LSTM-RNN model trained on MSCOCO and fine-tuned on Dev2k is less
effective than the model trained on Dev2k alone. This is somewhat contradictory
to the results of the noisy track. For a more comprehensive evaluation, Table 2
shows precision, recall, and F1 scores of the two runs, from which we see that
RUC_run2_finetune-mscoco obtains a higher precision score. This is probably
because the re-ranking process based on the "golden" concepts makes the system
more precision oriented.</p>
        <p>Since the details of the test sets used in the two tracks are not available to
us, we cannot do a more in-depth analysis.</p>
        <p>For the concept detection and localization task, we find Hierarchical Semantic
Embedding effective for resolving visual ambiguity, and Negative Bootstrap improves
the classification performance further. For the image sentence generation task,
Google's LSTM-RNN model can be improved by sentence re-ranking.</p>
        <p>
          Notice that for varied reasons, including the limited number of submitted runs
and the unavailability of detailed information about the test data, our analysis
is preliminary; in particular, a component-wise analysis is largely missing.
        </p>
        <p>
          Acknowledgements. This research was supported by the National Science
Foundation of China (No. 61303184), the Fundamental Research Funds for the
Central Universities and the Research Funds of Renmin University of China
(No. 14XNLQ01), the Beijing Natural Science Foundation (No. 4142029), the
Specialized Research Fund for the Doctoral Program of Higher Education (No.
20130004120006), and the Scientific Research Foundation for the Returned
Overseas Chinese Scholars, State Education Ministry. The authors are grateful to the
ImageCLEF coordinators for the benchmark organization efforts [
          <xref ref-type="bibr" rid="ref1 ref20">1, 20</xref>
          ].
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dellandrea</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaizauskas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task</article-title>
          .
          <source>In: CLEF Working Notes</source>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          .
          <source>In: CVPR</source>
          . (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shelhamer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karayev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guadarrama</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
          </string-name>
          , T.:
          <article-title>Caffe: Convolutional architecture for fast feature embedding</article-title>
          .
          <source>arXiv:1408.5093</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Zero-shot image tagging by hierarchical semantic embedding</article-title>
          .
          <source>In: SIGIR</source>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Microsoft COCO: common objects in context</article-title>
          .
          <source>CoRR abs/1405</source>
          .0312 (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>CoRR abs/1409</source>
          .1556 (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: NIPS</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          : Renmin University of China at
          <article-title>ImageCLEF 2013 scalable concept image annotation</article-title>
          .
          <source>In: CLEF working notes</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          : Renmin University of China at
          <article-title>ImageCLEF 2014 scalable concept image annotation</article-title>
          .
          <source>In: CLEF working notes</source>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Harvesting social images for biconcept search</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>14</volume>
          (
          <issue>4</issue>
          ) (Aug.
          <year>2012</year>
          )
          <fpage>1091</fpage>
          -
          <lpage>1104</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Learning social tag relevance by neighbor voting</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>11</volume>
          (
          <issue>7</issue>
          ) (Nov.
          <year>2009</year>
          )
          <fpage>1310</fpage>
          -
          <lpage>1322</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koelma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Bootstrapping visual categorization with relevant negatives</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>15</volume>
          (
          <issue>4</issue>
          ) (Jun.
          <year>2013</year>
          )
          <fpage>933</fpage>
          -
          <lpage>945</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          )
          <fpage>1871</fpage>
          -
          <lpage>1874</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Li, X., Snoek, C.: Classifying tag relevance with relevant positive and negative examples. In: ACM MM. (2013)</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Liao, S., Li, X., Shen, H.T., Yang, Y., Du, X.: Tag features for geo-aware image classification. IEEE Transactions on Multimedia 17(7) (2015) 1058-1067</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Uijlings, J., van de Sande, K., Gevers, T., Smeulders, A.: Selective search for object recognition. International Journal of Computer Vision 104(2) (2013) 154-171</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. International Journal of Computer Vision 59(2) (2004) 167-181</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575 (2014)</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. CoRR abs/1411.4555 (2014)</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. Villegas, M., Muller, H., Gilbert, A., Piras, L., Wang, J., Mikolajczyk, K., de Herrera, A.G.S., Bromuri, S., Amin, M.A., Mohammed, M.K., Acar, B., Uskudarli, S., Marvasti, N.B., Aldana, J.F., del Mar Roldan Garcia, M.: General Overview of ImageCLEF at the CLEF 2015 Labs. Lecture Notes in Computer Science. (2015)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>