<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Joint Learning of CNN and LSTM for Image Captioning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yongqing Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangyang Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xue Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinhang Song</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuqiang Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences</institution>
          ,
          <addr-line>No.6 Kexueyuan South Road Zhongguancun, Haidian District, 100190 Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe the details of our methods for participating in the Natural Language Caption Generation subtask of the ImageCLEF 2016 Scalable Image Annotation task. The model we use combines an encoding stage and a decoding stage, consisting of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) based Recurrent Neural Network. We first train a model on the MSCOCO dataset and then fine-tune it on different target datasets collected by us to obtain a model more suitable for the natural language caption generation task. The parameters of the CNN and the LSTM are learned jointly.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional neural network</kwd>
        <kwd>Long Short-Term Memory</kwd>
        <kwd>Image caption</kwd>
        <kwd>Joint learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the rapid development of Internet technologies and the widespread availability of
digital cameras, we are surrounded by a huge number of images, often accompanied
by a large amount of related text. However, the relationship between the surrounding text
and the images varies greatly, and closing the loop between vision and language remains
a challenging problem for the task of scalable image annotation [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>
        It is easy for human beings to describe a picture after a glance at it.
However, it is not easy for a computer to do the same. Though great progress
has been made in visual recognition, computers are still far from generating
descriptions of the quality a human can compose. Approaches to automatically generating
sentence descriptions can be divided into three categories. The first is
template-based [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. These approaches often rely heavily on sentence templates,
so the generated sentences lack variety. The second is retrieval-based [
        <xref ref-type="bibr" rid="ref5 ref6">5,
6</xref>
        ]. The advantage of these methods is that the captions are more human-like.
However, they are not flexible enough to add or remove words based on the content of the
target image. Recently, many researchers have used the combination of a CNN
and an LSTM to translate an image into a linguistic sentence [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
      </p>
      <p>
        Our method is based on the deep model proposed by Vinyals et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which takes
advantage of a Convolutional Neural Network (CNN) for image encoding and
a Long Short-Term Memory based Recurrent Neural Network (LSTM) for sentence
decoding. We first train a model on the MSCOCO [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] dataset. We then fine-tune
the model on different datasets to make it more suitable for the target
task. During training and fine-tuning, the parameters of both the CNN and the LSTM are
learned jointly.
      </p>
      <p>Next, we introduce our method in Section 2, followed by our experimental
results in Section 3. Finally, Section 4 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>
        The model we use contains two types of neural networks, as illustrated in Figure
1. The first stage is a CNN for image encoding and the second stage is a Long Short-Term
Memory (LSTM) based Recurrent Neural Network for sentence decoding
[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. For the CNN, we use the pre-trained VGGNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for feature extraction. Using
the VGGNet, we transform the pixels of an image into a 4096-dimensional
vector. After obtaining the visual features, we train an LSTM to generate linguistic
captions. In the LSTM training procedure, we update the parameters of both the
CNN and the LSTM together. Finally, we fine-tune the pre-trained model to obtain a
model more suitable for the natural language caption generation task.
      </p>
      <p>
        As shown in Figure 1, we use the pre-trained VGGNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for CNN feature
extraction. We first train the LSTM on corpora with paired images and sentence
captions, such as MSCOCO [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Flickr30k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In the training procedure of
the LSTM, we change not only the parameters of the LSTM model but also the
parameters of the CNN model, i.e., a joint learning of CNN and LSTM. We
then fine-tune our model on different datasets, as described in Section 3. Finally,
we use the trained models to predict a linguistic sentence for a given image.
      </p>
      <p>
        Predicting stage. The process of predicting a caption for an image is shown in Figure 2.
To generate a sentence caption for an image, we extract the CNN features of the image, bv,
set the first hidden state h0 = 0 and x0 to the START vector, compute the hidden state h1,
and predict the first word y1. We then take the word y1 predicted by our model,
set its embedding vector as x1, and, together with the previous hidden state h1,
compute the hidden state h2 and use it to predict the next word y2. This process
is repeated until the END token is generated.
      </p>
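      <p>As a concrete illustration of this decoding procedure, the following minimal sketch implements the greedy generation loop in plain Python/NumPy with random toy weights. It is an illustration of the procedure only, not the NeuralTalk code [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; all names (lstm_step, W_embed, and so on) are our own placeholders.</p>
      <preformat>
# Toy illustration of the greedy decoding loop (Figure 2). Random weights stand in
# for the trained CNN projection, word embeddings, LSTM weights and the output layer.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["START", "END", "a", "dog", "runs", "on", "grass"]
V, D, H = len(vocab), 8, 16            # vocabulary size, embedding size, hidden size

W_img   = rng.normal(size=(4096, D)) * 0.01   # projects the 4096-d VGG feature bv
W_embed = rng.normal(size=(V, D)) * 0.01      # word embedding vectors x_t
Wx      = rng.normal(size=(D, 4 * H)) * 0.01  # input-to-gates weights
Wh      = rng.normal(size=(H, 4 * H)) * 0.01  # hidden-to-gates weights
W_out   = rng.normal(size=(H, V)) * 0.01      # hidden state to word scores

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One LSTM step: input, forget, output gates and candidate cell update."""
    a = x @ Wx + h @ Wh
    i, f, o, g = sigmoid(a[:H]), sigmoid(a[H:2*H]), sigmoid(a[2*H:3*H]), np.tanh(a[3*H:])
    c = f * c + i * g
    return o * np.tanh(c), c

def generate(vgg_feature, max_len=10):
    h, c = np.zeros(H), np.zeros(H)              # h0 = 0
    h, c = lstm_step(vgg_feature @ W_img, h, c)  # condition on the image feature bv
    x = W_embed[vocab.index("START")]            # x0 is the START vector
    words = []
    for _ in range(max_len):
        h, c = lstm_step(x, h, c)                # compute the next hidden state
        y = int(np.argmax(h @ W_out))            # greedily pick the most likely word
        if vocab[y] == "END":                    # stop once the END token is generated
            break
        words.append(vocab[y])
        x = W_embed[y]                           # feed the predicted word back as the next input
    return " ".join(words)

print(generate(rng.normal(size=4096)))           # a (meaningless) caption from the toy weights
</preformat>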
    </sec>
    <sec id="sec-3">
      <title>Experiments and Submitted Runs</title>
      <p>
        We first use the LSTM implementation from the NeuralTalk project [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We
train models separately on the MSCOCO [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] dataset, the Flickr8k [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] dataset
and the Flickr30k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] dataset. Then we use each model to predict captions for the images
in the ImageCLEF 2016 validation set. The results are shown in Table 1. We
use Meteor [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to evaluate the sentences generated by a model. The validation set we
use here consists of the 2000 images and their corresponding sentences provided by the
organizers. Because the model trained on the MSCOCO dataset performs better
than those trained on the other datasets, we use it as
our pre-trained model.
      </p>
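      <p>As a hedged illustration of this evaluation step, the snippet below averages sentence-level Meteor scores over a validation set using NLTK's Meteor implementation. This is our own stand-in for the original Meteor toolkit [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and is not part of the authors' pipeline; it requires the NLTK package and its WordNet data.</p>
      <preformat>
# Validation-set scoring with Meteor, using NLTK's implementation as a stand-in
# for the Meteor toolkit. Requires: pip install nltk, and nltk.download("wordnet")
# for the synonym-matching stage.
from nltk.translate.meteor_score import meteor_score

def average_meteor(references, hypotheses):
    """references: one list of reference captions per image; hypotheses: generated captions."""
    total = 0.0
    for refs, hyp in zip(references, hypotheses):
        # Recent NLTK versions expect pre-tokenized references and hypothesis.
        total += meteor_score([r.split() for r in refs], hyp.split())
    return total / len(hypotheses)

# Tiny usage example with made-up captions.
refs = [["a dog runs on the grass", "a dog is running outside"]]
hyps = ["a dog running on grass"]
print(average_meteor(refs, hyps))
</preformat>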
      <p>We then run experiments to decide whether to jointly learn the parameters of the
CNN and the LSTM or to fix the CNN and learn only the parameters of the
LSTM. First, we train a model that only learns the parameters of the LSTM,
then we use this model to predict captions for the images in the validation set. The training
set we use is the MSCOCO dataset, and the test set is the provided
validation set. For comparison, we train a new model that not only learns the
parameters of the LSTM but also fine-tunes the CNN. We then use the
second model to predict captions for the images in the validation set. The results are
shown in Table 2.</p>
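      <p>To make the two settings concrete, the sketch below shows how they could be set up in PyTorch. This is our own illustrative re-implementation under assumed dimensions, not the authors' NeuralTalk-based code: fixing the CNN means excluding its parameters from gradient updates, while joint learning optimizes the CNN and the LSTM together.</p>
      <preformat>
# Illustrative PyTorch setup for the comparison above: a VGG-based encoder with an
# LSTM decoder, trained either with the CNN fixed or with CNN and LSTM learned jointly.
# This is an assumed re-implementation, not the authors' code.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        vgg = models.vgg16()  # pre-trained ImageNet weights would be loaded here in practice
        # Keep the 4096-dimensional fc7 output by dropping the final classification layer.
        vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
        self.cnn = vgg
        self.img_proj = nn.Linear(4096, embed_dim)        # image feature fed as the first input
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # hidden state to word scores

    def forward(self, images, captions):
        feats = self.img_proj(self.cnn(images)).unsqueeze(1)
        words = self.embed(captions)
        hidden, _ = self.lstm(torch.cat([feats, words], dim=1))
        return self.out(hidden)

model = CaptionModel(vocab_size=10000)

# Setting 1: fix the CNN and learn only the LSTM, embeddings and output layer.
for p in model.cnn.parameters():
    p.requires_grad = False
opt_fixed = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)

# Setting 2: joint learning -- gradients also update the CNN parameters.
for p in model.cnn.parameters():
    p.requires_grad = True
opt_joint = torch.optim.Adam(model.parameters())
</preformat>
      <p>The only difference between the two settings is which parameters receive gradient updates during back-propagation.</p>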
      <p>The results demonstrate that the joint learning of CNN and LSTM yields a
significant improvement in performance. To make full use of the MSCOCO dataset,
we then jointly train a model using all of the examples in the MSCOCO dataset, not just
the train split. The results are shown in Table 3. They demonstrate that
more data results in better performance.</p>
      <p>Finally, we fine-tune the jointly learned model on different datasets to obtain a
model more suitable for the natural language caption generation task. We use
the model trained on all of the examples in MSCOCO as the baseline, and fine-tune
it on different datasets. We first fine-tune our model on a very
small dataset: in this experiment, we use 1500 images and their sentences from the
validation set as a training set and use the remaining 500 images to evaluate
the performance of the fine-tuned model. The results are shown in Table 3. We
also fine-tune the baseline model on a larger dataset, the combination of
Flickr30K and Flickr8K. We can see that fine-tuning on a larger dataset gives
better performance. We use the model obtained in the previous step to generate
image captions for all 510123 target images. We then manually select
1000 satisfactory pairs of image and generated sentence from all the generated
captions and add them to the fine-tuning dataset. As shown in Table 3, this
pipeline gives the best performance. We use the model fine-tuned
on the combination of Flickr8K, Flickr30K and the selected 1000 examples
as our final model.</p>
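      <p>Continuing the assumed PyTorch setting from the earlier sketch, the snippet below indicates the shape of this fine-tuning step: the baseline weights would be loaded and optimization simply continues on the new combined data. The checkpoint path, learning rate and the dummy one-batch loader are our own placeholders, not details from the paper.</p>
      <preformat>
# Sketch of the fine-tuning step, continuing the CaptionModel from the previous snippet.
# In the experiments the data would be Flickr8K + Flickr30K + the 1000 selected pairs.
import torch
import torch.nn as nn

# model.load_state_dict(torch.load("baseline_mscoco.pt"))  # hypothetical baseline checkpoint
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # smaller learning rate for fine-tuning

finetune_loader = [(torch.randn(2, 3, 224, 224),            # dummy images standing in for a real loader
                    torch.randint(0, 10000, (2, 12)))]      # dummy caption token ids

for images, captions in finetune_loader:
    logits = model(images, captions[:, :-1])                # predict each next word
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                          # gradients flow into both CNN and LSTM
    optimizer.step()
</preformat>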
      <p>Figure 3 illustrates image captions generated by the different
models. The results qualitatively demonstrate that our final model generates more
satisfactory captions that better reflect the content of the corresponding images.</p>
      <p>We submitted four runs in the natural language caption generation task.
id sentence2.txt is our baseline method. The model is trained only on the MSCOCO
dataset, using all of its examples. The median score (provided by the
server) of the generated sentences is 0.1676 (the same model is used to generate
two of the runs, so id sentence3.txt is identical to id sentence2.txt).</p>
      <p>id sentence.txt contains the results generated by the model that is first trained
using only the examples in the MSCOCO dataset and then fine-tuned on the
combination of Flickr8K and Flickr30K. The median score of the generated
sentences is 0.1710.</p>
      <p>id sentence4.txt contains the results generated by our final model, which is first
trained using only the examples in the MSCOCO dataset and then fine-tuned
on the combination of Flickr8K, Flickr30k and the manually selected examples.
The median score of the generated sentences is 0.1711. This submitted run
is the best of our submitted runs and also the best run for natural language
caption generation at ImageCLEF 2016.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>From the experiments above, we draw the following conclusions. First, by jointly
learning the parameters of the CNN and the LSTM, the performance of the model can
be greatly improved: when we only change the parameters of the LSTM, the
score on the test set is 0.133, whereas when we change the parameters of both neural
networks, the score is 0.165. Secondly, more data results in better
performance: a model trained only on the training split of the MSCOCO dataset
achieves a score of 0.137, whereas when we use all the
data in the MSCOCO dataset, the score is 0.165. Thirdly, fine-tuning the
model on only a small dataset makes the results worse: when we fine-tuned
our model using only 3/4 of the validation set, the result on the remaining 1/4
of the validation set was 0.116, which is worse than the model before fine-tuning.
Finally, the datasets we use to train our model do not include all the concepts
of ImageCLEF 2016, so some sentences predicted by our model might be
weird.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported in part by the National Basic Research Program of
China (973 Program): 2012CB316400, the National Natural Science
Foundation of China: 61532018, 61322212, and the National Hi-Tech Development Program
(863 Program) of China: 2014AA015202. This work was also funded by the Lenovo
Outstanding Young Scientists Program (LOYS).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Gilbert</surname>
          </string-name>
          , Luca Piras, Josiah Wang, Fei Yan, Arnau Ramisa, Emmanuel Dellandrea, Robert Gaizauskas, Mauricio Villegas, and
          <string-name>
            <given-names>Krystian</given-names>
            <surname>Mikolajczyk</surname>
          </string-name>
          .
          <article-title>Overview of the ImageCLEF 2016 Scalable Concept Image Annotation Task</article-title>
          .
          <source>In CLEF2016 Working Notes, CEUR Workshop Proceedings</source>
          , Évora, Portugal, September 5-8,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Mauricio</given-names>
            <surname>Villegas</surname>
          </string-name>
          , Henning Muller, Alba Garc a Seco de Herrera, Roger Schaer, Stefano Bromuri, Andrew Gilbert, Luca Piras, Josiah Wang, Fei Yan, Arnau Ramisa, Emmanuel Dellandrea, Robert Gaizauskas, Krystian Mikolajczyk, Joan Puigcerver,
          <string-name>
            <surname>Alejandro H. Toselli</surname>
          </string-name>
          ,
          <string-name>
            <surname>Joan-Andreu Sanchez</surname>
            , and
            <given-names>Enrique</given-names>
          </string-name>
          <string-name>
            <surname>Vidal</surname>
          </string-name>
          .
          <source>General Overview of ImageCLEF at the CLEF 2016 Labs. Lecture Notes in Computer Science</source>
          . Springer International Publishing,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>G.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Premraj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Berg</surname>
          </string-name>
          .
          <article-title>Baby talk: Understanding and generating simple image descriptions</article-title>
          .
          <source>CVPR</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hejrati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sadeghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rashtchian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hockenmaier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Forsyth</surname>
          </string-name>
          .
          <article-title>Every picture tells a story: Generating sentences from images</article-title>
          .
          <source>ECCV</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Polina</given-names>
            <surname>Kuznetsova</surname>
          </string-name>
          , Vicente Ordonez, Alexander C. Berg,
          <string-name>
            <surname>Tamara L. Berg</surname>
            , and
            <given-names>Yejin</given-names>
          </string-name>
          <string-name>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Collective generation of natural image descriptions</article-title>
          .
          <source>In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1,ACL '12</source>
          , pages
          <fpage>359</fpage>
          –
          <lpage>368</lpage>
          , Stroudsburg, PA, USA,
          <year>2012</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mason</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Charniak</surname>
          </string-name>
          .
          <article-title>Nonparametric method for data-driven image captioning</article-title>
          .
          <source>In ACL</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Alexander Toshev, Samy Bengio, and
          <string-name>
            <given-names>Dumitru</given-names>
            <surname>Erhan</surname>
          </string-name>
          .
          <article-title>Show and Tell: A Neural Image Caption Generator</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Kelvin</given-names>
            <surname>Xu</surname>
          </string-name>
          , Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          .
          <source>arXiv preprint arXiv:1502.03044</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
            , and
            <given-names>C. L.</given-names>
          </string-name>
          <string-name>
            <surname>Zitnick</surname>
          </string-name>
          .
          <article-title>Microsoft coco: Common objects in context</article-title>
          .
          <source>arXiv preprint arXiv:1405.0312</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hodosh</surname>
          </string-name>
          , A. Lai, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hockenmaier</surname>
          </string-name>
          .
          <article-title>From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions</article-title>
          .
          <source>TACL</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>M. Hodosh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Young</surname>
            , and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hockenmaier</surname>
          </string-name>
          .
          <article-title>Framing image description as a ranking task: data, models and evaluation metrics</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>M.</given-names>
            <surname>Denkowski</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          .
          <article-title>Meteor universal: Language specific translation evaluation for any target language</article-title>
          .
          <source>In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>