The neural network image captioning model based on adversarial training

K. P. Korshunova
The Branch of National Research University "Moscow Power Engineering Institute" in Smolensk, Russia

Abstract. The paper presents a model for image captioning based on deep neural networks and adversarial training. The model consists of a convolutional network as an image encoder, a recurrent network as a natural language generator and another convolutional network as an adversarial discriminator. The structure of the model, the training algorithm, experimental results and an evaluation using popular metrics are presented.

1. Introduction
Nowadays, complex artificial intelligence tasks that require processing a combination of visual and linguistic information have received increasing attention from both the computer vision and natural language processing communities. These tasks are called multimodal. They are challenging because they require accurate computational visual recognition, comprehensive world knowledge, and natural language generation. In addition to the individual problems of computer vision and natural language processing, there are problems that arise from combining the two fields. One of the most challenging of these tasks is automatic Image Captioning [1], [2], known since the 1990s [3], [8].

2. Image Captioning Task
Automatic Image Captioning systems generate one or more descriptive sentences in natural language for a given image. The task lies at the intersection of two data analysis fields: pattern recognition and natural language processing. In addition to recognizing visual objects, their attributes and relations, it requires describing them as natural language text [2].
The task of generating image descriptions can be understood as translation from one representation (visual features) to another (text features). In this respect it is similar to machine translation: a data representation in one language/modality (an input image I) is transformed into its representation in the target language/modality (a target sequence of words C) by maximizing the likelihood p(C|I) [22].
Automatic Image Captioning systems include two subsystems: an "encoder" and a "decoder". The encoder reads the source data (the raw pixels of the given image) and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a decoder that generates the target descriptive sentence in natural language.
The most successful Image Captioning approaches are based on deep neural networks: Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). In the general Image Captioning approach (Figure 1), a convolutional neural network (first pre-trained for an image classification task) is used as the image encoder, and its last hidden layer is used as the input to an RNN decoder that generates sentences [22], [12], [5], [24], [10].

Figure 1. General Image Captioning approach.
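As a concrete illustration of this encoder-decoder pipeline, the following is a minimal PyTorch-style sketch. It is a simplified reading of Figure 1 rather than the exact architecture used in this paper: the class names, the projection size and the use of torchvision's VGG16 weights are illustrative assumptions.

```python
# Minimal sketch of the CNN-encoder / RNN-decoder captioning pipeline (Figure 1).
# Names and hyperparameters are illustrative; the paper uses VGG16 + LSTM.
import torch
import torch.nn as nn
import torchvision.models as models


class CaptionEncoder(nn.Module):
    """Pre-trained CNN that maps raw pixels to a fixed-length vector."""
    def __init__(self, embed_size=256):
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")   # older torchvision: pretrained=True
        self.features = vgg.features                  # convolutional backbone
        self.fc = nn.Linear(512 * 7 * 7, embed_size)  # project the last hidden layer

    def forward(self, images):                        # images: (B, 3, 224, 224)
        x = self.features(images).flatten(1)
        return self.fc(x)                             # (B, embed_size)


class CaptionDecoder(nn.Module):
    """LSTM that unrolls the image vector into a word sequence."""
    def __init__(self, vocab_size, embed_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_vec, captions):           # captions: (B, T) word ids
        # The image vector is fed as the first "input token"; training maximizes p(C|I).
        inputs = torch.cat([image_vec.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                       # (B, T+1, vocab_size)
```

In a plain CNN+RNN baseline (the comparison model in Section 3.4), such a decoder would be trained with cross-entropy against the reference captions; the adversarial model proposed below replaces this objective with a discriminator-driven one.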
3. The neural network image captioning model based on adversarial training
Generative Adversarial Nets (GANs [9]), which implement adversarial training, have been used to produce samples of photorealistic images, to model patterns of motion in video, to reconstruct 3D models of objects from images, to improve astronomical images, etc. [23]. In this paper we propose an image captioning approach based on Sequence Generative Adversarial Nets (Sequence GANs [14]).
A GAN is a combination of two neural networks: one network (the generative model G) generates candidates and the other (the discriminative model D) evaluates them. Typically, the generator G learns to map from a latent space to a particular data distribution of interest, while the discriminator D distinguishes between instances from the true data distribution and candidates produced by the generator. This is the essence of adversarial training: the generative model's training objective is to increase the error rate of the discriminative model, i.e. to "fool" the discriminator by producing novel synthesized instances that appear to have come from the true data distribution.

3.1. The structure of the model
The general structure of the proposed neural network model is shown in Figure 2.

Figure 2. The general structure of the neural network image captioning model based on adversarial training.

The model consists of:
1) a convolutional neural network that is used as the image encoder;
2) a recurrent network that produces natural language descriptions;
3) another convolutional neural network that is used as the discriminator during the adversarial training process.
The VGG16 model [19] is used for image encoding (CNN), and an LSTM (Long Short-Term Memory [11]) recurrent network is used for generating text descriptions (G). We choose a convolutional network as the discriminator (D) because this kind of deep network has recently been shown to be highly effective for text (token sequence) classification [13].

3.2. Training algorithm
The training process of the proposed model consists of the following steps:
Step 1. Initialization and pre-training:
1.1. Pre-train CNN and G;
1.2. Generate negative samples using CNN and G;
1.3. Pre-train D.
Step 2. Training (N epochs):
2.1. Train G for g epochs;
2.2. Generate negative samples using CNN and G;
2.3. Train D for d epochs.
We use a reinforcement learning (RL) formulation [20] to train the proposed model, treating the generative model as an RL agent. During adversarial training the discriminative net D learns to distinguish whether a given data instance is real or not, and the generative net G learns to confuse D by generating high-quality data; the discriminator is what makes the training process adversarial.
At inference time only the CNN and the generator G are used: the raw pixels of the given image are read and transformed into a rich fixed-length vector representation by the encoder CNN, and the generator G then produces the target descriptive sentence in natural language from this representation.
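The alternating schedule of Step 2 can be made concrete with the following PyTorch-style sketch. It is a simplified illustration under stated assumptions, not the exact training code: `G.sample` is assumed to return sampled token ids together with their log-probabilities, `D` is assumed to return one real/fake logit per caption, and the generator update uses a plain REINFORCE-style reward on complete sequences instead of the full SeqGAN policy gradient with Monte Carlo rollouts [14]. Step 1 (pre-training) is assumed to have already been performed.

```python
# Schematic version of Step 2 of the training algorithm (Section 3.2).
# encoder, G, D and the data loader are placeholders for the model components;
# G.sample and the per-caption logit of D are assumed interfaces.
import torch

def adversarial_training(encoder, G, D, loader, N=100, g_epochs=1, d_epochs=1):
    g_opt = torch.optim.Adam(G.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(D.parameters(), lr=1e-4)
    bce = torch.nn.BCEWithLogitsLoss()

    for _ in range(N):
        # Step 2.1: train G for g epochs; the reward is the discriminator's
        # belief that a generated caption is real (REINFORCE-style update).
        for _ in range(g_epochs):
            for images, _ in loader:
                img_vec = encoder(images)
                captions, log_probs = G.sample(img_vec)       # token ids, (B, T) log-probs
                reward = torch.sigmoid(D(captions)).detach()  # (B,) reward per caption
                g_loss = -(reward * log_probs.sum(dim=1)).mean()
                g_opt.zero_grad()
                g_loss.backward()
                g_opt.step()

        # Steps 2.2-2.3: generate fresh negative samples and train D for d epochs
        # to separate reference captions (real) from generated ones (fake).
        for _ in range(d_epochs):
            for images, real_caps in loader:
                with torch.no_grad():
                    fake_caps, _ = G.sample(encoder(images))
                logits = torch.cat([D(real_caps), D(fake_caps)])
                labels = torch.cat([torch.ones(len(real_caps)),
                                    torch.zeros(len(fake_caps))])
                d_loss = bce(logits, labels)
                d_opt.zero_grad()
                d_loss.backward()
                d_opt.step()
```

The RL treatment is needed because the sampled word tokens are discrete, so the discriminator's score cannot be back-propagated into G directly; instead it is used as a reward signal for the generator.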
3.3. Experimental results
We have performed experiments on the challenging, publicly available Microsoft COCO Caption dataset [6], which contains images from the Microsoft Common Objects in COntext (COCO) database [16]. All data are divided into a training set and a validation set: we use 32,000 images and 160,000 corresponding text descriptions (five per image) for training and 40,000 image-sentence pairs for validation.
Several sample descriptions produced by the model after 75 training epochs are shown in Figure 3. In many cases the descriptions produced by the proposed model capture the content of the depicted scenes, despite grammatical and semantic inaccuracies; however, there are also some gross mistakes.

Figure 3. Sample image descriptions.

3.4. Evaluation
Although it is sometimes unclear whether a description of a given image should be deemed successful, prior work has proposed several evaluation metrics [17], [15], [7], [21], [4]. These metrics are based on evaluating the similarity of two sentences (a candidate caption and a reference caption). We use the popular metrics BLEU-1, BLEU-2, BLEU-3, BLEU-4 [17], ROUGE-L [15] and CIDEr [21]; a toy example of the n-gram overlap computation behind BLEU is sketched at the end of this subsection. We compare the proposed neural network model based on adversarial training to a CNN+RNN baseline. The image captioning performance of the proposed (GAN) and baseline (CNN+RNN) models is reported in Table 1 and Figures 4-5.

Table 1. The image captioning performance of the models.

Training epochs   Model     BLEU-1   BLEU-2   BLEU-3   BLEU-4   ROUGE-L   CIDEr
10                CNN+RNN   0.290    0.128    0.054    0.025    0.239     0.041
                  GAN       0.309    0.139    0.060    0.027    0.253     0.039
25                CNN+RNN   0.293    0.131    0.056    0.026    0.241     0.042
                  GAN       0.318    0.145    0.062    0.027    0.258     0.044
50                CNN+RNN   0.297    0.133    0.058    0.027    0.244     0.045
                  GAN       0.326    0.151    0.066    0.031    0.262     0.043
75                CNN+RNN   0.297    0.133    0.058    0.027    0.244     0.044
                  GAN       0.314    0.148    0.067    0.031    0.251     0.054
100               CNN+RNN   0.296    0.132    0.057    0.026    0.244     0.044
                  GAN       0.324    0.155    0.074    0.037    0.272     0.054

Figure 4. BLEU values w.r.t. the training epochs.

Figure 5. ROUGE-L and CIDEr values w.r.t. the training epochs.

Table 1 and Figures 4-5 show that the proposed image captioning model based on adversarial training outperforms the CNN+RNN baseline on most metrics, with the largest improvement achieved after 100 training epochs. The performance of the proposed model naturally depends on the detailed model structure and training strategy; choosing the structural attributes of the model (number of layers, etc.) and the values of the training process parameters (number of training and pre-training epochs) is a subject for further research.
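As referenced above, the following toy example illustrates the n-gram-overlap scoring that BLEU-n performs on a candidate caption against its reference captions. The captions, the NLTK usage and the smoothing choice are illustrative assumptions; in practice the scores in Table 1 would be computed at corpus level over all validation images and all five references per image. ROUGE-L and CIDEr score longest-common-subsequence overlap and TF-IDF-weighted n-gram consensus, respectively.

```python
# Toy BLEU-1..4 computation for a single candidate caption against one reference.
# Uses NLTK's sentence-level BLEU; captions here are made-up examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "man", "rides", "a", "horse", "on", "the", "beach"]]
candidate = ["a", "man", "is", "riding", "a", "horse"]

smooth = SmoothingFunction().method1          # avoids zero scores for missing n-grams
for n in range(1, 5):
    weights = (1.0 / n,) * n                  # uniform weights over 1..n-grams
    score = sentence_bleu(references, candidate,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```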
4. Conclusion
In this paper we proposed a neural network image captioning model based on adversarial training. The model combines a convolutional neural network for image processing with a Sequence Generative Adversarial Net for generating text descriptions. Experiments on the challenging Microsoft COCO Caption dataset show that the proposed model can provide better automatic Image Captioning than the conventional CNN+RNN baseline.

5. References
[1] Borisov V. V., Korshunova K. P. Direct and Reverse Image Captioning problem definition // Power Engineering, Computer Science, Innovations - 2017: Proceedings of the VII International Scientific Conference. Smolensk, 2017. pp 228-230 (in Russian).
[2] Korshunova K. P. Automatic Image Captioning: Tasks and Methods // Systems of Control, Communication and Security. 2018. No. 1. pp 30-77. URL: http://sccs.intelgr.com/archive/2018-01/02-Korshunova.pdf (in Russian).
[3] Abella A., Kender J. R., Starren J. Description Generation of Abnormal Densities Found in Radiographs // Proc. Symp. Computer Applications in Medical Care, Journal of the American Medical Informatics Association. 1995. pp 542-546.
[4] Anderson P., Fernando B., Johnson M., Gould S. SPICE: Semantic Propositional Image Caption Evaluation // Lecture Notes in Computer Science, vol. 9909. 2016. pp 382-398.
[5] Chen X., Zitnick C. L. Mind's Eye: A Recurrent Visual Representation for Image Caption Generation // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. pp 2422-2431.
[6] Chen X., Fang H., Lin T. Y., Vedantam R., Gupta S., Dollár P., Zitnick C. L. Microsoft COCO Captions: Data Collection and Evaluation Server // arXiv.org. 2015. URL: https://arxiv.org/abs/1504.00325 (accessed: 01 February 2018).
[7] Denkowski M., Lavie A. Meteor Universal: Language Specific Translation Evaluation for Any Target Language // Proceedings of the Ninth Workshop on Statistical Machine Translation. 2014. pp 376-380.
[8] Gerber R., Nagel H.-H. Knowledge Representation for the Generation of Quantified Natural Language Descriptions of Vehicle Traffic in Image Sequences // Proceedings of the International Conference on Image Processing. 1996. pp 805-808.
[9] Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y. Generative Adversarial Networks // Proceedings of NIPS. 2014. pp 2672-2680.
[10] Gu J., Cai J., Wang G., Chen T. Stack-Captioning: Coarse-to-Fine Learning for Image Captioning // Proceedings of the AAAI Conference on Artificial Intelligence. 2018.
[11] Hochreiter S., Schmidhuber J. Long Short-Term Memory // Neural Computation. 1997. Vol. 9 (8). pp 1735-1780.
[12] Karpathy A., Fei-Fei L. Deep Visual-Semantic Alignments for Generating Image Descriptions // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[13] Kim Y. Convolutional Neural Networks for Sentence Classification // Proceedings of EMNLP. 2014. pp 1746-1751.
[14] Yu L., Zhang W., Wang J., Yu Y. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient // Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 2017.
[15] Lin C. Y. ROUGE: A Package for Automatic Evaluation of Summaries // Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004). 2004. Vol. 1. pp 25-26.
[16] Lin T. Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P., Zitnick C. L. Microsoft COCO: Common Objects in Context // Proceedings of the European Conference on Computer Vision. 2014. pp 740-755.
[17] Papineni K., Roukos S., Ward T., Zhu W. BLEU: A Method for Automatic Evaluation of Machine Translation // Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. pp 311-318.
[18] Robertson S. Understanding Inverse Document Frequency: On Theoretical Arguments for IDF // Journal of Documentation. 2004. Vol. 60 (5). pp 503-520.
[19] Simonyan K., Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition // arXiv.org. 2014. URL: https://arxiv.org/abs/1409.1556 (accessed: 01 February 2018).
[20] Sutton R. S., Barto A. G. Reinforcement Learning: An Introduction. University College London, Computer Science Department, Reinforcement Learning Lectures, 2017.
[21] Vedantam R., Zitnick C. L., Parikh D. CIDEr: Consensus-Based Image Description Evaluation // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. pp 4566-4575.
[22] Vinyals O., Toshev A., Bengio S., Erhan D. Show and Tell: A Neural Image Caption Generator // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. pp 1-10.
[23] Generative Adversarial Network // Wikipedia, the free encyclopedia. 2018. URL: https://en.wikipedia.org/wiki/Generative_adversarial_network (accessed: 01 May 2018).
[24] Xu K., Ba J., Kiros R., Cho K., Courville A., Salakhutdinov R., Zemel R., Bengio Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention // Proceedings of the International Conference on Machine Learning. 2015.

Acknowledgments
This work was supported by the Russian Foundation for Basic Research (Grant No. 18-07-00928).