Deep Learning for Automatic Image Captioning
in Poor Training Conditions

Caterina Masotti, Danilo Croce and Roberto Basili
Department of Enterprise Engineering
University of Roma, Tor Vergata
caterinamasotti@yahoo.it
{croce,basili}@info.uniroma2.it

Abstract

English. Recent advancements in Deep Learning show that the combination of Convolutional Neural Networks and Recurrent Neural Networks enables the definition of very effective methods for the automatic captioning of images. Unfortunately, this straightforward result requires the existence of large-scale corpora, which are not available for many languages. This paper describes a simple methodology to automatically acquire a large-scale corpus of 600 thousand image/sentence pairs in Italian. To the best of our knowledge, this corpus has been used to train one of the first neural systems for this language. The experimental evaluation over a subset of validated image/caption pairs suggests that results comparable with the English counterpart can be achieved.

Italian. The combination of Deep Learning methods (such as Convolutional Neural Networks and Recurrent Neural Networks) has recently made it possible to build very effective systems for the automatic generation of captions from images. Unfortunately, the application of these methods requires huge collections of annotated images, and these resources are not available for every language. This paper presents a simple method for the automatic acquisition of a corpus of 600 thousand image/sentence pairs for Italian, which made it possible to train one of the first neural systems for this language. The evaluation over a manually validated subset of the corpus suggests that results comparable with the systems available for English can be achieved.

1   Introduction

The image captioning task consists in generating a brief description in natural language of a given image, able to capture the depicted objects and the relations between them, as discussed in (Bernardi et al., 2016). More precisely, given an image I as input, an image captioner should be able to generate a well-formed sentence S(I) = (s1, ..., sm), where every si is a word from a vocabulary V = {w1, ..., wn} in a given natural language. Some examples of images and corresponding captions are reported in Figure 1. This task is rather complex, as it involves non-trivial subtasks such as object detection, mapping visual features to text and generating text sequences.

(a) English: A yellow school bus parked in a handicap spot. Italian: Uno scuolabus giallo parcheggiato in un posto per disabili.
(b) English: A cowboy rides a bucking horse at a rodeo. Italian: Un cowboy cavalca un cavallo da corsa a un rodeo.
(c) English: The workers are trying to pry up the damaged traffic light. Italian: I lavoratori stanno cercando di tirare su il semaforo danneggiato.

Figure 1: Three images from the MSCOCO dataset, along with their two human-validated descriptions (English and Italian).

Recently, neural methods based on deep neural networks have reached impressive state-of-the-art results on this task (Karpathy and Li, 2014; Mao et al., 2014; Xu et al., 2015). One of the most successful architectures implements the so-called encoder-decoder end-to-end structure (Goldberg, 2015). Differently from most of the existing encoder-decoder structures, in (Vinyals et al., 2014) the encoding of the input image is performed by a convolutional neural network which transforms it into a dense feature vector; then, this vector is "translated" into a descriptive sentence by a Long Short-Term Memory (LSTM) architecture, which takes the vector as the first input and generates a textual sequence starting from it. This neural model is very effective, but also very expensive to train in terms of time and hardware resources^1, because there are many parameters to be learned; not to mention that the model is overfitting-prone, so it needs to be trained on a training set of annotated images that is as large and heterogeneous

as possible, in order to achieve a good generalization capability. Hardware and time constraints do not always allow training a model in an optimal setting and, for example, cutting down on the dataset size could be necessary: in this case we have poor training conditions. Of course, this reduces the model's ability to generalize on new images at captioning time. Another cause of poor training conditions is the lack of a good quality dataset, for example in terms of annotations: the manual captioning of large collections of images requires a lot of effort and, as of now, human-annotated datasets only exist for a restricted set of languages, such as English. As a consequence, training such a neural model to produce captions in another language (e.g. Italian) is an interesting problem to explore, but also a challenging one, due to the lack of data resources.

A viable approach is building a resource by automatically translating the annotations from an existing dataset: this is much less expensive than manually annotating images, but of course it leads to a loss of human-like quality in the language model. This approach has been adopted in this work to implement one of the first neural-based image captioners for Italian: more precisely, the annotations of the images from the MSCOCO dataset, one of the largest image/caption datasets in English, have been automatically translated to Italian in order to obtain a first resource for this language. This resource has been exploited to train a neural captioner, and its quality can be improved over time (e.g., by manually validating the translations). Then, a subset of this Italian dataset has been used as training data for the neural captioning system defined in (Vinyals et al., 2014), while a subset of the test set has been manually validated for evaluation purposes.

In particular, prior to the experimentations in Italian, some early experiments have been performed with the same training data originally annotated in English, to get a reference benchmark about convergence time and evaluation metrics on a dataset of smaller size. These results in English will suggest whether the Italian image captioner shows similar performance when trained over a reduced set of examples, or whether the noise induced by the automatic translation process compromises the neural training phase. Moreover, these experiments have also been performed with the introduction of a pre-trained word embedding (derived using the method presented in (Mikolov et al., 2013)), in order to measure how it affects the quality of the language model learned by the captioner, with respect to a randomly initialized word embedding that is learned together with the other model parameters.

Overall, the contributions of this work are threefold: (i) the investigation of a simple, automatized way to acquire (possibly noisy) large-scale corpora for the training of neural image captioning methods in poor training conditions; (ii) the manual validation of a first set of human-annotated resources in Italian; (iii) the implementation of one of the first automatic neural-based Italian image captioners.

In the rest of the paper, the adopted neural architecture is outlined in Section 2. The description of a brand new resource for Italian is presented in Section 3. Section 4 reports the results of the early preparatory experimentations for the English language and then the ones for Italian. Finally, Section 5 derives the conclusions.

^1 As of now, training a neural encoder-decoder model such as the one presented at http://github.com/tensorflow/models/tree/master/im2txt on a dataset of over 580,000 image-caption examples takes about two weeks even with a high-performance GPU.
2   The Show and Tell Architecture

The Deep Architecture considered in this paper is the Show and Tell architecture, described in (Vinyals et al., 2014) and sketched in Figure 2. It follows an encoder-decoder structure where the image is encoded into a dense vector by a state-of-the-art deep CNN, in this case InceptionV3 (Szegedy et al., 2015), followed by a fully connected layer; the resulting feature vector is fed to an LSTM, used to generate a text sequence, i.e. the caption. As the CNN encoder has been trained over an object recognition task, it allows encoding the image into a dense vector that is strictly connected to the entities observed in the image. At the same time, the LSTM implements a language model, in line with the idea introduced in (Mikolov et al., 2010): it captures the probability of generating a given word in a string, given the words generated so far. In the overall training process, the main objective is to train an LSTM to generate the next word given not only the string produced so far, but also a set of image features. As the first CNN encoder is (mostly) language independent, it can be totally re-used even for the captioning of images in other languages, such as Italian. On the contrary, the language model underlying the LSTM needs new examples to be trained.

Figure 2: The Deep Architecture presented in (Vinyals et al., 2014): an LSTM model combined with a CNN image embedder and word embeddings. The unrolled connections between the LSTM memories are in blue.
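For concreteness, the following is a minimal sketch of such an encoder-decoder captioner, written with TensorFlow/Keras. It is not the exact im2txt implementation, and the vocabulary size, embedding dimension and caption length below are illustrative placeholders.

    # A minimal sketch of a Show-and-Tell style encoder-decoder captioner,
    # assuming TensorFlow/Keras; all sizes are illustrative placeholders.
    import tensorflow as tf
    from tensorflow.keras import layers

    VOCAB_SIZE = 12000  # |V|, hypothetical
    EMBED_DIM = 512     # word / image embedding size, hypothetical
    MAX_LEN = 20        # maximum caption length, hypothetical

    # Encoder: a pre-trained CNN (InceptionV3, as in the paper) maps the
    # image to a dense feature vector; a fully connected layer projects it
    # into the same space as the word embeddings.
    cnn = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')
    cnn.trainable = False  # the visual encoder is re-used as-is
    image_input = tf.keras.Input(shape=(299, 299, 3))
    image_embedding = layers.Dense(EMBED_DIM)(cnn(image_input))

    # Decoder: an LSTM language model that receives the image embedding as
    # its first input, followed by the (shifted) caption words.
    caption_input = tf.keras.Input(shape=(MAX_LEN,), dtype='int32')
    word_vectors = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_input)
    sequence = layers.Concatenate(axis=1)(
        [layers.Reshape((1, EMBED_DIM))(image_embedding), word_vectors])
    hidden = layers.LSTM(512, return_sequences=True)(sequence)
    next_word = layers.Dense(VOCAB_SIZE, activation='softmax')(hidden)

    model = tf.keras.Model([image_input, caption_input], next_word)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

At training time the model is optimized to predict each next word of the gold caption; at captioning time the LSTM is unrolled from the image embedding alone, generating one word at a time.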
In this work, we will train this architecture over a corpus that has been automatically translated from the MSCOCO dataset. We thus speculate that the LSTM will learn a sort of simplified language model, closer to the automatic translator than to an Italian speaker. However, we are also convinced that the quality achievable by modern translation systems (Bahdanau et al., 2014; Luong et al., 2015), combined with the generalization that can be obtained by an LSTM trained over thousands of (possibly noisy) translations, will make it possible to generate reasonable and intelligible captions.

3   Automatic Acquisition of a Corpus of Captions in Italian

In this section we present the first release of MSCOCO-it, a new resource for the training of data-driven image captioning systems in Italian. It has been built starting from the MSCOCO dataset for English (Lin et al., 2014): in particular, we considered the training and validation subsets, made respectively of 82,783 and 40,504 images, where every image has 5 human-written annotations in English. The Italian version of the dataset has been acquired with an approach that automatizes the translation task: for each image, all of its five annotations have been translated with Bing^2. The result is a large amount of data whose annotations are fully translated, but not of the best quality with respect to fluent Italian. This automatically translated data can be used to train a model, but the evaluation requires a test set of human-validated examples: therefore, the translations of a subset of MSCOCO-it have been manually validated. In (Vinyals et al., 2014), two subsets of 2,024 and 4,051 images from the MSCOCO validation set have been held out from the rest of the data and used for development and testing of the model, respectively. A subset of these images has been manually validated: 308 images from the development set and 596 from the test set. In Table 1, statistics about this brand new corpus are reported, where the specific amount of unvalidated (u.) and validated (v.) data is made explicit^3.

^2 Sentences have been translated between December 2016 and January 2017.
^3 Although Italian annotations are available for all the images of the original dataset, some images were not counted in the table because they are corrupted and therefore have not been used.
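The acquisition step can be sketched as follows, assuming the standard MSCOCO captions JSON format; translate_to_italian() is a hypothetical placeholder for the call to the translation service (Bing, in our case), whose actual API is not shown here.

    # A minimal sketch of the corpus acquisition step, assuming the standard
    # MSCOCO captions JSON format; the translation call is a hypothetical stub.
    import json

    def translate_to_italian(text):
        """Hypothetical wrapper around the automatic translator (e.g. Bing)."""
        raise NotImplementedError("plug in the translation service here")

    with open('captions_train2014.json') as f:
        data = json.load(f)

    # Translate all five captions of every image, keeping the original ids so
    # that image/sentence alignment is preserved in MSCOCO-it.
    for ann in data['annotations']:
        ann['caption'] = translate_to_italian(ann['caption'])

    with open('captions_train2014_it.json', 'w') as f:
        json.dump(data, f, ensure_ascii=False)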
                   #images      #sent       #words
    training  u.   116,195    581,286    6,900,546
    valid.    u.     1,696      8,486      101,448
              p.      (14)         25          304
              v.       308      1,516       17,913
    test      u.     3,422     17,120      202,533
              p.      (23)         41          479
              v.       596      2,941       34,657
    total          122,217    611,415    7,257,880

Table 1: Statistics about the MSCOCO-it corpus. p. stands for partially validated, since some images have only some validated captions out of five. The partially validated images are between parentheses because they are already counted in the validated ones.

4   Experimental Evaluation

In order to be consistent with a scenario characterized by poor training conditions (limited hardware resources and time constraints), all the experimentations in this paper have been made by training
the model on significantly smaller samples of data with respect to the whole MSCOCO dataset (made of more than 583,000 image-caption examples).

First of all, some early experimentations have been performed on smaller samples of data from MSCOCO in English, in order to measure the loss of performance caused by the reduced size of the training set^4. Each training example is an image-caption pair, and the pairs have been grouped into data shards during the training phase: each shard contains about 2,300 image-caption examples. The model has been trained on datasets of 23,000, 34,500 and 46,000 image-caption pairs (less than 10% of the entire dataset).

^4 A proper tuning phase was too expensive, so we adopted the parameters provided in https://github.com/tensorflow/models/tree/master/im2txt
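The grouping into shards can be sketched as follows; the input is assumed to be a list of (image, caption) pairs, and the shard size of 2,300 follows the setup described above (im2txt itself serializes such shards as TFRecord files).

    SHARD_SIZE = 2300  # about 2,300 image-caption examples per shard

    def make_shards(pairs, shard_size=SHARD_SIZE):
        """Split a list of (image, caption) pairs into fixed-size shards."""
        return [pairs[i:i + shard_size]
                for i in range(0, len(pairs), shard_size)]

    # e.g. 10 shards correspond to about 23,000 training examples.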
   In order to balance the reduced size of the train-      word embeddings is beneficial, especially when
ing material and provide some kind of linguistic           the size of the training material is minimal (e.g.
generalization, we evaluated the adoption of pre-          when 1 shard is used, especially if considering the
trained word embedding in the training/tagging             CIDEr metrics). As the amount of training data
process. In fact, in (Vinyals et al., 2014) the LSTM       grows, its impact on the model decreases, until it
architecture initializes randomly all vectors rep-         is not significant anymore.
resenting input words; these are later trained to-
                                                               # Shards    BLEU-4       METEOR          CIDEr
gether with the other parameters of the network.                      1   11.7 / 12.9   16.4 / 16.9   27.4 / 29.4
We wondered if a word embedding already pre-                          2   16.9 / 17.1   18.8 / 18.7   45.7 / 45.6
                                                                      5   22.0 / 21.4   21.2 / 20.9   62.5 / 60.8
trained on a large corpus could help the model to                    10   22.4 / 22.9   22.0 / 21.5   71.9 / 68.8
generalize better on brand new images at test time.                  20   23.7 / 23.8   22.2 / 22.0   73.0 / 73.2
We introduce a word embedding learned through
a Skip-gram model (Mikolov et al., 2013) from an           Table 3: Metrics for the experimentations on
English dump of Wikipedia. The LSTM archi-                 im2txt for the Italian language with a training
tecture has been trained on the same shards but            set of reduced size, without / with and the use of a
initializing the word vectors with this pretrained         pre-trained word embedding.
word embedding.
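A minimal sketch of this initialization, assuming gensim for the Skip-gram vectors; the model file name and the toy vocabulary are hypothetical placeholders.

    # Minimal sketch: initialize the captioner's word embedding matrix with
    # Skip-gram vectors (Mikolov et al., 2013) trained on a Wikipedia dump.
    # The model file and the toy vocabulary below are hypothetical.
    import numpy as np
    from gensim.models import Word2Vec

    w2v = Word2Vec.load('wiki_en_skipgram.model')
    vocab = ['<S>', '</S>', 'a', 'bus', 'parked']  # toy stand-in for V
    embed_dim = w2v.vector_size

    # Words covered by the pre-trained model get their Skip-gram vector;
    # the remaining ones keep a random initialization, as in the baseline.
    embedding_matrix = np.random.uniform(-0.08, 0.08, (len(vocab), embed_dim))
    for i, word in enumerate(vocab):
        if word in w2v.wv:
            embedding_matrix[i] = w2v.wv[word]

The resulting matrix replaces the random initialization of the LSTM's word embedding layer; the vectors are still fine-tuned together with the other parameters during training.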
Table 2 reports results on the English dataset in terms of BLEU-4, CIDEr and METEOR, the same metrics used in (Vinyals et al., 2014): in the first five rows, results are reported both in the case of randomly initialized word embeddings and of pre-trained ones. We compare these results with the ones achieved by the original NIC and NICv2 networks presented in (Vinyals et al., 2014), and with the ones measured by testing a model available on the web^5, trained on the whole original training set.

^5 http://github.com/tensorflow/models/issues/466

    # Shards      BLEU-4         METEOR         CIDEr
           1    10.1 / 11.5    13.4 / 13.1    18.8 / 24.4
           2    15.7 / 18.9    18.2 / 16.3    36.1 / 51.9
           5    22.0 / 22.7    20.2 / 20.4    64.1 / 65.0
          10    22.4 / 24.7    22.0 / 21.7    73.2 / 73.7
          20    26.5 / 26.2    21.9 / 22.3    79.3 / 79.1
         NIC       27.7           23.7           85.5
       NICv2       32.1           25.7           99.8
      im2txt       31.2           25.5           98.1

Table 2: Results of im2txt for the English language with a training set of reduced size, without / with the use of a pre-trained word embedding. Benchmark results are also reported.

Results obtained by the network when trained on a reduced dataset are clearly lower than the NIC results, but remarkably similar results are obtained, especially considering the reduced size of the training material. The contribution of pre-trained word embeddings is not significant, in line with the findings of (Vinyals et al., 2014). However, it is still interesting to note that the lexical generalization of these unsupervised word embeddings is beneficial when the size of the training material is minimal (e.g. when 1 shard is used, especially considering the CIDEr metric). As the amount of training data grows, its impact on the model decreases, until it is not significant anymore.

    # Shards      BLEU-4         METEOR         CIDEr
           1    11.7 / 12.9    16.4 / 16.9    27.4 / 29.4
           2    16.9 / 17.1    18.8 / 18.7    45.7 / 45.6
           5    22.0 / 21.4    21.2 / 20.9    62.5 / 60.8
          10    22.4 / 22.9    22.0 / 21.5    71.9 / 68.8
          20    23.7 / 23.8    22.2 / 22.0    73.0 / 73.2

Table 3: Metrics for the experimentations of im2txt for the Italian language with a training set of reduced size, without / with the use of a pre-trained word embedding.

As for the results on Italian, the experiments have been performed by training the model on samples of 23,000, 34,500 and 46,000 examples, where the captions are automatically
translated with Bing. The model has been evaluated against the validated sentences, and results are reported in Table 3. Results are impressive, as they are in line with the English counterpart. This supports the robustness of the adopted architecture, which seems to learn even from a noisy dataset of automatically translated material. Most importantly, it confirms the applicability of the proposed simple methodology for the acquisition of datasets for image captioning.

When trained with 20 shards, the Italian captioner generates the following descriptions of the images shown in Figure 1. Image 1a: "Un autobus a due piani guida lungo una strada." (A double-decker bus drives along a street.); Image 1b: "Un uomo che cavalca una carrozza trainata da cavalli." (A man riding a horse-drawn carriage.); Image 1c: "Una persona che cammina lungo una strada con un segnale di stop." (A person walking along a street with a stop sign.)

An attempt has also been made to use a word embedding pre-trained on a large corpus (more precisely, on a dump of Wikipedia in Italian), but the empirical results reported in Table 3 show that its contribution is not relevant overall, though still noticeable when fewer examples are adopted.
                                                           abs/1412.6632.
5   Conclusions

In this paper a simple methodology for the training of neural models for the automatic captioning of images has been presented. We generated a large-scale corpus of about 600,000 image captions in Italian by using an automatic machine translator. Despite the noise introduced in this step, the corpus allowed us to train one of the first neural-based image captioning systems for Italian. Most importantly, the quality of this system seems comparable with the English counterpart, when trained over a comparable set of data. These results are impressive and confirm the robustness of the adopted neural architecture. We believe that the obtained resource paves the way to the definition and evaluation of neural models for image captioning in Italian, and we hope to contribute to the Italian community, hopefully using the validated dataset in a future Evalita^6 campaign.

^6 http://www.evalita.it/

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Int. Res., 55(1):409-442, January.

Yoav Goldberg. 2015. A primer on neural network models for natural language processing. CoRR, abs/1510.00726.

Andrej Karpathy and Fei-Fei Li. 2014. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). CoRR, abs/1412.6632.

Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045-1048.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. CoRR, abs/1411.4555.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044.