<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Joint Learning of CNN and LSTM for Image Captioning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yongqing Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangyang Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xue Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinhang Song</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuqiang Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences</institution>
          ,
          <addr-line>No.6 Kexueyuan South Road Zhongguancun, Haidian District, 100190 Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe the details of our methods for participating in the Natural Language Caption Generation subtask of the ImageCLEF 2016 Scalable Image Annotation task. The model we use combines an encoding stage and a decoding stage, consisting of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) based Recurrent Neural Network. We first train a model on the MSCOCO dataset and then fine-tune it on different target datasets collected by us to obtain a model more suitable for the natural language caption generation task. The parameters of the CNN and the LSTM are learned jointly.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional neural network</kwd>
        <kwd>Long Short-Term Memory</kwd>
        <kwd>Image caption</kwd>
        <kwd>Joint learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the rapid development of Internet technologies and the widespread availability of
digital cameras, we are surrounded by a huge number of images, often accompanied
by a large amount of related text. However, the relationship between the surrounding text
and the images varies greatly, and closing the loop between vision and language remains
a challenging problem for the task of scalable image annotation [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>
        It is easy for human beings to describe a picture after a glance at it.
However, it is not easy for a computer to do the same. Though great progress
has been made in visual recognition, computers are still far from generating
descriptions of the quality a human can compose. Approaches to automatically generating
sentence descriptions can be divided into three categories. The first is
template-based [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. These approaches often rely heavily on sentence templates,
so the generated sentences lack variety. The second is retrieval-based [
        <xref ref-type="bibr" rid="ref5 ref6">5,
6</xref>
        ]. The advantage of these methods is that the captions are more human-like.
However, they are not flexible enough to add or remove words based on the content of the
target image. Recently, many researchers have used the combination of a CNN
and an LSTM to translate an image into a linguistic sentence [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
      </p>
      <p>
        Our method is based on the deep model proposed by Vinyals et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which takes
advantage of a Convolutional Neural Network (CNN) for image encoding and
a Long Short-Term Memory based Recurrent Neural Network (LSTM) for sentence
decoding. We first train a model on the MSCOCO [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] dataset. We then fine-tune
the model on different datasets to make it more suitable for the target
task. During training and fine-tuning, the parameters of both the CNN and the LSTM are
learned jointly.
      </p>
      <p>Next, we introduce our method in Section 2, followed by our experimental
results in Section 3. Finally, Section 4 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>
        The model we use contains two types of neural networks, as illustrated in Figure
1. The first stage is a CNN for image encoding and the second stage is a Long Short-Term
Memory (LSTM) based Recurrent Neural Network for sentence decoding
[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. For the CNN, we use the pre-trained VGGNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for feature extraction. Using
the VGGNet, we transform the pixels of an image into a 4096-dimensional
vector. After obtaining the visual features, we train an LSTM to generate linguistic
captions. In the LSTM training procedure, we update the parameters of both the
CNN and the LSTM together. Finally, we fine-tune the pre-trained model to obtain a
model more suitable for the natural language caption generation task.
      </p>
      <p>
        As shown in Figure 1, we use the pre-trained VGGNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for CNN feature
extraction. We first train the LSTM on corpora with paired images and sentence
captions, such as MSCOCO [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Flickr30k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In the training procedure of
the LSTM, we change not only the parameters of the LSTM model but also the
parameters of the CNN model, i.e., a joint learning of CNN and LSTM. We
then fine-tune our model on different datasets, as described in Section 3. Finally,
we use the trained models to predict a linguistic sentence for a given image.
      </p>
      <p>
        Predicting stage. The process of predicting a caption for an image is shown in Figure 2.
To generate a sentence caption for an image, we extract the CNN features of the image, bv,
set the first hidden state h0 = 0 and x0 to the START vector, compute the hidden state h1,
and predict the first word y1. We then take the word y1 predicted by our model,
set its embedding vector as x1, and, together with the previous hidden state h1,
compute the hidden state h2 and use it to predict the next word y2. This process
is repeated until the END token is generated.
      </p>
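      <p>As a concrete illustration of this decoding procedure, the following minimal sketch implements the greedy generation loop in plain Python/NumPy with random toy weights. It is an illustration of the procedure only, not the NeuralTalk code [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; all names (lstm_step, W_embed, and so on) are our own placeholders.</p>
      <preformat>
# Toy illustration of the greedy decoding loop (Figure 2). Random weights stand in
# for the trained CNN projection, word embeddings, LSTM weights and the output layer.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["START", "END", "a", "dog", "runs", "on", "grass"]
V, D, H = len(vocab), 8, 16            # vocabulary size, embedding size, hidden size

W_img   = rng.normal(size=(4096, D)) * 0.01   # projects the 4096-d VGG feature bv
W_embed = rng.normal(size=(V, D)) * 0.01      # word embedding vectors x_t
Wx      = rng.normal(size=(D, 4 * H)) * 0.01  # input-to-gates weights
Wh      = rng.normal(size=(H, 4 * H)) * 0.01  # hidden-to-gates weights
W_out   = rng.normal(size=(H, V)) * 0.01      # hidden state to word scores

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One LSTM step: input, forget, output gates and candidate cell update."""
    a = x @ Wx + h @ Wh
    i, f, o, g = sigmoid(a[:H]), sigmoid(a[H:2*H]), sigmoid(a[2*H:3*H]), np.tanh(a[3*H:])
    c = f * c + i * g
    return o * np.tanh(c), c

def generate(vgg_feature, max_len=10):
    h, c = np.zeros(H), np.zeros(H)              # h0 = 0
    h, c = lstm_step(vgg_feature @ W_img, h, c)  # condition on the image feature bv
    x = W_embed[vocab.index("START")]            # x0 is the START vector
    words = []
    for _ in range(max_len):
        h, c = lstm_step(x, h, c)                # compute the next hidden state
        y = int(np.argmax(h @ W_out))            # greedily pick the most likely word
        if vocab[y] == "END":                    # stop once the END token is generated
            break
        words.append(vocab[y])
        x = W_embed[y]                           # feed the predicted word back as the next input
    return " ".join(words)

print(generate(rng.normal(size=4096)))           # a (meaningless) caption from the toy weights
</preformat>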
    </sec>
    <sec id="sec-3">
      <title>Experiments and Submitted Runs</title>
      <p>
        We first use the LSTM implementation from the NeuralTalk project [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We
train models separately on the MSCOCO [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] dataset, the Flickr8k [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] dataset
and the Flickr30k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] dataset. Then we use each model to predict captions for the images
in the ImageCLEF 2016 validation set. The results are shown in Table 1. We
use Meteor [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to evaluate the sentences generated by a model. The validation set we
use here consists of the 2000 images and their corresponding sentences provided by the
organizers. Because the model trained on the MSCOCO dataset performs better
than those trained on the other datasets, we use it as
our pre-trained model.
      </p>
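      <p>As a hedged illustration of this evaluation step, the snippet below averages sentence-level Meteor scores over a validation set using NLTK's Meteor implementation. This is our own stand-in for the original Meteor toolkit [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and is not part of the authors' pipeline; it requires the NLTK package and its WordNet data.</p>
      <preformat>
# Validation-set scoring with Meteor, using NLTK's implementation as a stand-in
# for the Meteor toolkit. Requires: pip install nltk, and nltk.download("wordnet")
# for the synonym-matching stage.
from nltk.translate.meteor_score import meteor_score

def average_meteor(references, hypotheses):
    """references: one list of reference captions per image; hypotheses: generated captions."""
    total = 0.0
    for refs, hyp in zip(references, hypotheses):
        # Recent NLTK versions expect pre-tokenized references and hypothesis.
        total += meteor_score([r.split() for r in refs], hyp.split())
    return total / len(hypotheses)

# Tiny usage example with made-up captions.
refs = [["a dog runs on the grass", "a dog is running outside"]]
hyps = ["a dog running on grass"]
print(average_meteor(refs, hyps))
</preformat>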
      <p>We then run experiments to decide whether to jointly learn the parameters of the
CNN and the LSTM or to fix the CNN and learn only the parameters of the
LSTM. First, we train a model that only learns the parameters of the LSTM,
then we use this model to predict captions for the images in the validation set. The training
set we use is the MSCOCO dataset, and the test set is the provided
validation set. For comparison, we train a new model that not only learns the
parameters of the LSTM but also fine-tunes the CNN. We then use the
second model to predict captions for the images in the validation set. The results are
shown in Table 2.</p>
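      <p>To make the two settings concrete, the sketch below shows how they could be set up in PyTorch. This is our own illustrative re-implementation under assumed dimensions, not the authors' NeuralTalk-based code: fixing the CNN means excluding its parameters from gradient updates, while joint learning optimizes the CNN and the LSTM together.</p>
      <preformat>
# Illustrative PyTorch setup for the comparison above: a VGG-based encoder with an
# LSTM decoder, trained either with the CNN fixed or with CNN and LSTM learned jointly.
# This is an assumed re-implementation, not the authors' code.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        vgg = models.vgg16()  # pre-trained ImageNet weights would be loaded here in practice
        # Keep the 4096-dimensional fc7 output by dropping the final classification layer.
        vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
        self.cnn = vgg
        self.img_proj = nn.Linear(4096, embed_dim)        # image feature fed as the first input
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # hidden state to word scores

    def forward(self, images, captions):
        feats = self.img_proj(self.cnn(images)).unsqueeze(1)
        words = self.embed(captions)
        hidden, _ = self.lstm(torch.cat([feats, words], dim=1))
        return self.out(hidden)

model = CaptionModel(vocab_size=10000)

# Setting 1: fix the CNN and learn only the LSTM, embeddings and output layer.
for p in model.cnn.parameters():
    p.requires_grad = False
opt_fixed = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)

# Setting 2: joint learning -- gradients also update the CNN parameters.
for p in model.cnn.parameters():
    p.requires_grad = True
opt_joint = torch.optim.Adam(model.parameters())
</preformat>
      <p>The only difference between the two settings is which parameters receive gradient updates during back-propagation.</p>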
      <p>The results demonstrate that the joint learning of CNN and LSTM yields a
significant improvement in performance. To make full use of the MSCOCO dataset,
we then jointly train a model using all of the examples in the MSCOCO dataset, not just
the train split. The results are shown in Table 3. They demonstrate that
more data results in better performance.</p>
      <p>Finally, we fine-tune the jointly learned model on different datasets to obtain a
model more suitable for the natural language caption generation task. We use
the model trained on all of the examples in MSCOCO as the baseline, and fine-tune
it on different datasets. We first fine-tune our model on a very
small dataset: in this experiment, we use 1500 images and their sentences from the
validation set as a training set and use the remaining 500 images to evaluate
the performance of the fine-tuned model. The results are shown in Table 3. We
also fine-tune the baseline model on a larger dataset, the combination of
Flickr30K and Flickr8K. We can see that fine-tuning on a larger dataset gives
better performance. We use the model obtained in the previous step to generate
image captions for all 510123 target images. We then manually select
1000 satisfactory pairs of image and generated sentence from all the generated
captions and add them to the fine-tuning dataset. As shown in Table 3, this
pipeline gives the best performance. We use the model fine-tuned
on the combination of Flickr8K, Flickr30K and the selected 1000 examples
as our final model.</p>
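      <p>Continuing the assumed PyTorch setting from the earlier sketch, the snippet below indicates the shape of this fine-tuning step: the baseline weights would be loaded and optimization simply continues on the new combined data. The checkpoint path, learning rate and the dummy one-batch loader are our own placeholders, not details from the paper.</p>
      <preformat>
# Sketch of the fine-tuning step, continuing the CaptionModel from the previous snippet.
# In the experiments the data would be Flickr8K + Flickr30K + the 1000 selected pairs.
import torch
import torch.nn as nn

# model.load_state_dict(torch.load("baseline_mscoco.pt"))  # hypothetical baseline checkpoint
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # smaller learning rate for fine-tuning

finetune_loader = [(torch.randn(2, 3, 224, 224),            # dummy images standing in for a real loader
                    torch.randint(0, 10000, (2, 12)))]      # dummy caption token ids

for images, captions in finetune_loader:
    logits = model(images, captions[:, :-1])                # predict each next word
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                          # gradients flow into both CNN and LSTM
    optimizer.step()
</preformat>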
      <p>Figure 3 illustrates image captions generated by the different
models. The results qualitatively demonstrate that our final model generates more
satisfactory captions that better reflect the content of the corresponding images.</p>
      <p>We submitted four runs in the natural language caption generation task.
id sentence2.txt is our baseline method. The model is trained only on the MSCOCO
dataset, using all of its examples. The median score (provided by the
server) of the generated sentences is 0.1676 (the same model is used to generate
two of the runs, so id sentence3.txt is identical to id sentence2.txt).</p>
      <p>id sentence.txt contains the results generated by the model that is first trained
using only the examples in the MSCOCO dataset and then fine-tuned on the
combination of Flickr8K and Flickr30K. The median score of the generated
sentences is 0.1710.</p>
      <p>id sentence4.txt contains the results generated by our final model, which is first
trained using only the examples in the MSCOCO dataset and then fine-tuned
on the combination of Flickr8K, Flickr30k and the manually selected examples.
The median score of the generated sentences is 0.1711. This submitted run
is the best of our submitted runs and also the best run for natural language
caption generation at ImageCLEF 2016.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>From the experiments above, we draw the following conclusions. First, by jointly
learning the parameters of the CNN and the LSTM, the performance of the model can
be greatly improved: when we only change the parameters of the LSTM, the
score on the test set is 0.133, whereas when we change the parameters of both neural
networks, the score is 0.165. Secondly, more data results in better
performance: a model trained only on the training split of the MSCOCO dataset
achieves a score of 0.137, whereas when we use all the
data in the MSCOCO dataset, the score is 0.165. Thirdly, fine-tuning the
model on only a small dataset makes the results worse: when we fine-tuned
our model using only 3/4 of the validation set, the result on the remaining 1/4
of the validation set was 0.116, which is worse than the model before fine-tuning.
Finally, the datasets we use to train our model do not include all the concepts
of ImageCLEF 2016, so some sentences predicted by our model might be
weird.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported in part by the National Basic Research Program of
China (973 Program): 2012CB316400, the National Natural Science
Foundation of China: 61532018, 61322212, and the National Hi-Tech Development Program
(863 Program) of China: 2014AA015202. This work was also funded by the Lenovo
Outstanding Young Scientists Program (LOYS).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Gilbert</surname>
          </string-name>
          , Luca Piras, Josiah Wang, Fei Yan, Arnau Ramisa, Emmanuel Dellandrea, Robert Gaizauskas, Mauricio Villegas, and
          <string-name>
            <given-names>Krystian</given-names>
            <surname>Mikolajczyk</surname>
          </string-name>
          .
          <article-title>Overview of the ImageCLEF 2016 Scalable Concept Image Annotation Task</article-title>
          .
          <source>In CLEF2016 Working Notes, CEUR Workshop Proceedings</source>
          , Évora, Portugal, September 5-8,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Mauricio</given-names>
            <surname>Villegas</surname>
          </string-name>
          , Henning Muller, Alba Garc a Seco de Herrera, Roger Schaer, Stefano Bromuri, Andrew Gilbert, Luca Piras, Josiah Wang, Fei Yan, Arnau Ramisa, Emmanuel Dellandrea, Robert Gaizauskas, Krystian Mikolajczyk, Joan Puigcerver,
          <string-name>
            <surname>Alejandro H. Toselli</surname>
          </string-name>
          ,
          <string-name>
            <surname>Joan-Andreu Sanchez</surname>
            , and
            <given-names>Enrique</given-names>
          </string-name>
          <string-name>
            <surname>Vidal</surname>
          </string-name>
          .
          <source>General Overview of ImageCLEF at the CLEF 2016 Labs. Lecture Notes in Computer Science</source>
          . Springer International Publishing,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>G.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Premraj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Berg</surname>
          </string-name>
          .
          <article-title>Baby talk: Understanding and generating simple image descriptions</article-title>
          .
          <source>CVPR</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hejrati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sadeghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rashtchian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hockenmaier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Forsyth</surname>
          </string-name>
          .
          <article-title>Every picture tells a story: Generating sentences from images</article-title>
          .
          <source>ECCV</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Polina</given-names>
            <surname>Kuznetsova</surname>
          </string-name>
          , Vicente Ordonez, Alexander C. Berg,
          <string-name>
            <surname>Tamara L. Berg</surname>
            , and
            <given-names>Yejin</given-names>
          </string-name>
          <string-name>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Collective generation of natural image descriptions</article-title>
          .
          <source>In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1,ACL '12</source>
          , pages
          <fpage>359</fpage>
          –
          <lpage>368</lpage>
          , Stroudsburg, PA, USA,
          <year>2012</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mason</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Charniak</surname>
          </string-name>
          .
          <article-title>Nonparametric method for data-driven image captioning</article-title>
          .
          <source>In ACL</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Alexander Toshev, Samy Bengio, and
          <string-name>
            <given-names>Dumitru</given-names>
            <surname>Erhan</surname>
          </string-name>
          .
          <article-title>Show and Tell: A Neural Image Caption Generator</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Kelvin</given-names>
            <surname>Xu</surname>
          </string-name>
          , Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          .
          <source>arXiv preprint arXiv:1502.03044</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
            , and
            <given-names>C. L.</given-names>
          </string-name>
          <string-name>
            <surname>Zitnick</surname>
          </string-name>
          .
          <article-title>Microsoft coco: Common objects in context</article-title>
          .
          <source>arXiv preprint arXiv:1405.0312</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hodosh</surname>
          </string-name>
          , A. Lai, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hockenmaier</surname>
          </string-name>
          .
          <article-title>From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions</article-title>
          .
          <source>TACL</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>M. Hodosh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Young</surname>
            , and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hockenmaier</surname>
          </string-name>
          .
          <article-title>Framing image description as a ranking task: data, models and evaluation metrics</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>M.</given-names>
            <surname>Denkowski</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          .
          <article-title>Meteor universal: Language specific translation evaluation for any target language</article-title>
          .
          <source>In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>