=Paper=
{{Paper
|id=Vol-2006/paper030
|storemode=property
|title=Deep Learning for Automatic Image Captioning in Poor Training Conditions
|pdfUrl=https://ceur-ws.org/Vol-2006/paper030.pdf
|volume=Vol-2006
|authors=Caterina Masotti,Danilo Croce,Roberto Basili
|dblpUrl=https://dblp.org/rec/conf/clic-it/MasottiC017
}}
==Deep Learning for Automatic Image Captioning in Poor Training Conditions==
Caterina Masotti, Danilo Croce and Roberto Basili
Department of Enterprise Engineering
University of Roma, Tor Vergata
caterinamasotti@yahoo.it, {croce,basili}@info.uniroma2.it
Abstract

English. Recent advancements in Deep Learning show that the combination of Convolutional Neural Networks and Recurrent Neural Networks enables the definition of very effective methods for the automatic captioning of images. Unfortunately, this straightforward result requires the existence of large-scale corpora, which are not available for many languages. This paper describes a simple methodology to automatically acquire a large-scale corpus of 600 thousand image/sentence pairs in Italian. To the best of our knowledge, this corpus has been used to train one of the first neural captioning systems for this language. The experimental evaluation over a subset of validated image/caption pairs suggests that results comparable with the English counterpart can be achieved.

Italiano. La combinazione di metodi di Deep Learning (come Convolutional Neural Network e Recurrent Neural Network) ha recentemente permesso di realizzare sistemi molto efficaci per la generazione automatica di didascalie a partire da immagini. Purtroppo, l'applicazione di questi metodi richiede l'esistenza di enormi collezioni di immagini annotate e queste risorse non sono disponibili per ogni lingua. Questo articolo presenta un semplice metodo per l'acquisizione automatica di un corpus di 600 mila coppie immagine/frase per l'italiano, che ha permesso di addestrare uno dei primi sistemi neurali per questa lingua. La valutazione su un sottoinsieme del corpus manualmente validato suggerisce che è possibile raggiungere risultati comparabili con i sistemi disponibili per l'inglese.

1 Introduction

The image captioning task consists in generating a brief description in natural language of a given image, able to capture the depicted objects and the relations between them, as discussed in (Bernardi et al., 2016). More precisely, given an image I as input, an image captioner should be able to generate a well-formed sentence S(I) = (s1, ..., sm), where every si is a word from a vocabulary V = {w1, ..., wn} in a given natural language. Some examples of images and corresponding captions are reported in Figure 1. This task is rather complex, as it involves non-trivial subtasks such as object detection, mapping visual features to text and generating word sequences.

Figure 1: Three images from the MSCOCO dataset, along with two human-validated descriptions.
(a) English: "A yellow school bus parked in a handicap spot." / Italian: "Uno scuolabus giallo parcheggiato in un posto per disabili."
(b) English: "A cowboy rides a bucking horse at a rodeo." / Italian: "Un cowboy cavalca un cavallo da corsa a un rodeo."
(c) English: "The workers are trying to pry up the damaged traffic light." / Italian: "I lavoratori stanno cercando di tirare su il semaforo danneggiato."

Recently, neural methods based on deep neural networks have reached impressive state-of-the-art results in this task (Karpathy and Li, 2014; Mao et al., 2014; Xu et al., 2015). One of the most successful architectures implements the so-called encoder-decoder end-to-end structure (Goldberg, 2015). Differently from most of the existing encoder-decoder structures, in (Vinyals et al., 2014) the encoding of the input image is performed by a convolutional neural network that transforms it into a dense feature vector; this vector is then "translated" into a descriptive sentence by a Long Short-Term Memory (LSTM) architecture, which takes the vector as its first input and generates a textual sequence starting from it. This neural model is very effective, but also very expensive to train in terms of time and hardware resources [1], because there are many parameters to be learned; moreover, the model is prone to overfitting, so it needs to be trained on a set of annotated images that is as large and heterogeneous as possible, in order to achieve a good generalization capability. Hardware and time constraints do not always allow training a model in an optimal setting, and, for example, cutting down the dataset size may be necessary: in this case we speak of poor training conditions. Of course, this reduces the model's ability to generalize on new images at captioning time. Another cause of poor training conditions is the lack of a good quality dataset, for example in terms of annotations: the manual captioning of large collections of images requires a lot of effort and, as of now, human-annotated datasets only exist for a restricted set of languages, such as English. As a consequence, training such a neural model to produce captions in another language (e.g. Italian) is an interesting problem to explore, but also a challenging one due to the lack of data resources.

[1] As of now, training a neural encoder-decoder model such as the one presented at http://github.com/tensorflow/models/tree/master/im2txt on a dataset of over 580,000 image-caption examples takes about two weeks, even with a very fast GPU.

A viable approach is to build a resource by automatically translating the annotations of an existing dataset: this is much less expensive than manually annotating images, but it of course leads to a loss of human-like quality in the language model. This approach has been adopted in this work to develop one of the first neural-based image captioning systems for Italian: more precisely, the annotations of the images from the MSCOCO dataset, one of the largest English datasets of image/caption pairs, have been automatically translated into Italian in order to obtain a first resource for this language. This resource has been exploited to train a neural captioner, and its quality can be improved over time (e.g., by manually validating the translations). A subset of this Italian dataset has then been used as training data for the neural captioning system defined in (Vinyals et al., 2014), while a subset of the test set has been manually validated for evaluation purposes.

In particular, prior to the experiments in Italian, some early experiments have been performed with the same training data originally annotated in English, to get a reference benchmark about convergence time and evaluation metrics on a dataset of smaller size. These results in English will suggest whether the Italian image captioner shows similar performance when trained over a reduced set of examples, or whether the noise induced by the automatic translation process compromises the neural training phase. Moreover, these experiments have also been performed with the introduction of a pre-trained word embedding (derived using the method presented in (Mikolov et al., 2013)), in order to measure how it affects the quality of the language model learned by the captioner, with respect to a randomly initialized word embedding that is learned together with the other model parameters.

Overall, the contributions of this work are three-fold: (i) the investigation of a simple, automatized way to acquire (possibly noisy) large-scale corpora for the training of neural image captioning methods in poor training conditions; (ii) the manual validation of a first set of human-annotated resources in Italian; (iii) the implementation of one of the first automatic neural-based Italian image captioners.

In the rest of the paper, the adopted neural architecture is outlined in Section 2. The description of a brand new resource for Italian is presented in Section 3. Section 4 reports the results of the early preparatory experiments for the English language and then the ones for Italian. Finally, Section 5 derives the conclusions.
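Before turning to the architecture, it may help to make the training criterion behind such encoder-decoder captioners explicit. Using the notation introduced above, the following is the standard formulation of captioning as conditional language modelling, consistent with the description in (Vinyals et al., 2014):

```latex
% Captioning as conditional language modelling: the parameters \theta are trained
% to maximize the likelihood of the reference caption S(I) = (s_1, \ldots, s_m)
% given the image I, factorized word by word.
\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log P(S \mid I; \theta)
           = \arg\max_{\theta} \sum_{(I,S)} \sum_{t=1}^{m} \log P\bigl(s_t \mid I, s_1, \ldots, s_{t-1}; \theta\bigr)
```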
2 The Show and Tell Architecture

The deep architecture considered in this paper is the Show and Tell architecture, described in (Vinyals et al., 2014) and sketched in Figure 2. It follows an encoder-decoder structure where the image is encoded into a dense vector by a state-of-the-art deep CNN, in this case InceptionV3 (Szegedy et al., 2015), followed by a fully connected layer; the resulting feature vector is fed to an LSTM, which is used to generate a text sequence, i.e. the caption. As the CNN encoder has been trained over an object recognition task, it encodes the image in a dense vector that is strictly connected to the entities observed in the image. At the same time, the LSTM implements a language model, in line with the idea introduced in (Mikolov et al., 2010): it captures the probability of generating a given word in a string, given the words generated so far. In the overall training process, the main objective is to train the LSTM to generate the next word given not only the string produced so far, but also a set of image features.

Figure 2: The deep architecture presented in (Vinyals et al., 2014): an LSTM model combined with a CNN image embedder and word embeddings. The unrolled connections between the LSTM memories are shown in blue.

As the CNN encoder is (mostly) language independent, it can be fully re-used for the captioning of images in other languages, such as Italian. On the contrary, the language model underlying the LSTM needs new examples to be trained. In this work, we will train this architecture over a corpus that has been automatically translated from the MSCOCO dataset. We thus speculate that the LSTM will learn a sort of simplified language model, closer to the automatic translator than to an Italian speaker. However, we are also convinced that the quality achievable by modern translation systems (Bahdanau et al., 2014; Luong et al., 2015), combined with the generalization that can be obtained by an LSTM trained over thousands of (possibly noisy) translations, will be able to produce reasonable and intelligible captions.
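To make this concrete, the following is a minimal sketch of the decoding side of such an architecture in PyTorch. It is only an illustration of the encoder-decoder scheme under our own assumptions (layer names, dimensions and the choice of PyTorch); the system actually used in this paper is the TensorFlow im2txt implementation referenced above.

```python
import torch
import torch.nn as nn

class CaptionDecoderSketch(nn.Module):
    """Minimal decoder in the spirit of Show and Tell (Vinyals et al., 2014).

    The CNN encoder (e.g. InceptionV3) is assumed to be run separately and to
    provide a fixed-size feature vector per image; all names and sizes here
    are illustrative assumptions, not the authors' implementation.
    """

    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=12000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # fully connected layer on top of the CNN
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings (random or pre-trained)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # scores over the vocabulary

    def forward(self, img_feats, caption_ids):
        # The projected image vector is fed to the LSTM as the first input,
        # followed by the embeddings of the caption words produced so far.
        img_step = self.img_proj(img_feats).unsqueeze(1)   # (batch, 1, embed_dim)
        word_steps = self.embed(caption_ids[:, :-1])       # (batch, T-1, embed_dim)
        states, _ = self.lstm(torch.cat([img_step, word_steps], dim=1))
        return self.out(states)                            # next-word scores at every step
```

Training then amounts to minimizing the cross-entropy between these scores and the reference caption words, i.e. the objective sketched at the end of Section 1.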
3 Automatic acquisition of a Corpus of Captions in Italian

In this section we present the first release of MSCOCO-it, a new resource for the training of data-driven image captioning systems in Italian. It has been built starting from the MSCOCO dataset for English (Lin et al., 2014): in particular, we considered the training and validation subsets, made respectively of 82,783 and 40,504 images, where every image has 5 human-written annotations in English. The Italian version of the dataset has been acquired with an approach that automatizes the translation task: for each image, all of its five annotations have been translated with Bing [2]. The result is a large amount of data whose annotations are fully translated, but not of the best quality with respect to fluent Italian. This automatically translated data can be used to train a model, but the evaluation requires a test set of human-validated examples: for this reason, the translations of a subset of MSCOCO-it have been manually validated. In (Vinyals et al., 2014), two subsets of 2,024 and 4,051 images from the MSCOCO validation set have been held out from the rest of the data and used for development and testing of the model, respectively. A subset of these images has been manually validated here: 308 images from the development set and 596 from the test set. In Table 1, statistics about this brand new corpus are reported, where the specific amounts of unvalidated (u.) and validated (v.) data are made explicit [3].

[2] Sentences have been translated between December 2016 and January 2017.
[3] Although Italian annotations are available for all the images of the original dataset, some images were not counted in the table because they are corrupted and therefore have not been used.

| Split    |    | #images | #sent   | #words    |
|----------|----|---------|---------|-----------|
| training | u. | 116,195 | 581,286 | 6,900,546 |
| training | v. | 308     | 1,516   | 17,913    |
| valid.   | u. | 1,696   | 8,486   | 101,448   |
| valid.   | p. | (14)    | 25      | 304       |
| valid.   | v. | 596     | 2,941   | 34,657    |
| test     | u. | 3,422   | 17,120  | 202,533   |
| test     | p. | (23)    | 41      | 479       |
| total    |    | 122,217 | 611,415 | 7,257,880 |

Table 1: Statistics about the MSCOCO-it corpus. "p." stands for partially validated, since some images have only some of their five captions validated. The partially validated images are reported between parentheses because they are already counted in the validated ones.
4 Experimental Evaluation

In order to be consistent with a scenario characterized by poor training conditions (limited hardware resources and time constraints), all the experiments in this paper have been carried out by training the model on significantly smaller samples of data with respect to the whole MSCOCO dataset (made of more than 583,000 image-caption examples).

First of all, some early experiments have been performed on smaller samples of data from MSCOCO in English, in order to measure the loss of performance caused by the reduced size of the training set [4]. Each training example is an image-caption pair, and the examples have been grouped into data shards during the training phase: each shard contains about 2,300 image-caption examples. The model has been trained on datasets of 23,000, 34,500 and 46,000 image-caption pairs (less than 10% of the entire dataset).

[4] A proper tuning phase was too expensive, so we adopted the parameters provided in https://github.com/tensorflow/models/tree/master/im2txt

In order to balance the reduced size of the training material and provide some kind of linguistic generalization, we evaluated the adoption of pre-trained word embeddings in the training/tagging process. In fact, in (Vinyals et al., 2014) the LSTM architecture randomly initializes all the vectors representing input words; these are later trained together with the other parameters of the network. We wondered whether a word embedding already pre-trained on a large corpus could help the model to generalize better on brand new images at test time. We introduce a word embedding learned through a Skip-gram model (Mikolov et al., 2013) from an English dump of Wikipedia. The LSTM architecture has been trained on the same shards, but initializing the word vectors with this pre-trained word embedding.

Table 2 reports the results on the English dataset in terms of BLEU-4, METEOR and CIDEr, the same metrics used in (Vinyals et al., 2014): in the first five rows, results are reported both with a randomly initialized word embedding and with pre-trained ones. We compare these results with the ones achieved by the original NIC and NICv2 networks presented in (Vinyals et al., 2014), and with the ones measured by testing a model available on the web [5], trained on the original whole training set.

[5] http://github.com/tensorflow/models/issues/466

| # Shards | BLEU-4      | METEOR      | CIDEr       |
|----------|-------------|-------------|-------------|
| 1        | 10.1 / 11.5 | 13.4 / 13.1 | 18.8 / 24.4 |
| 2        | 15.7 / 18.9 | 18.2 / 16.3 | 36.1 / 51.9 |
| 5        | 22.0 / 22.7 | 20.2 / 20.4 | 64.1 / 65.0 |
| 10       | 22.4 / 24.7 | 22.0 / 21.7 | 73.2 / 73.7 |
| 20       | 26.5 / 26.2 | 21.9 / 22.3 | 79.3 / 79.1 |
| NIC      | 27.7        | 23.7        | 85.5        |
| NICv2    | 32.1        | 25.7        | 99.8        |
| im2txt   | 31.2        | 25.5        | 98.1        |

Table 2: Results of im2txt for the English language with a training set of reduced size, without / with the use of a pre-trained word embedding. Benchmark results are also reported.
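The pre-trained initialization compared in Table 2 can be sketched as follows: words covered by the Skip-gram model keep their pre-trained vectors, while the remaining words fall back to the usual random initialization. The function below is only an illustrative sketch; the dimensionality, the initialization range and the names are our assumptions.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim=300, seed=0):
    """Build the initial word-embedding matrix for the decoder.

    vocab: list of vocabulary words (index = word id);
    pretrained: dict mapping a word to its Skip-gram vector (e.g. trained on a
    Wikipedia dump). Uncovered words keep a random initialization, as in the
    baseline configuration.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.uniform(-0.08, 0.08, size=(len(vocab), dim)).astype("float32")
    covered = 0
    for idx, word in enumerate(vocab):
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = vector
            covered += 1
    print(f"initialized {covered}/{len(vocab)} words from the pre-trained embedding")
    return matrix
```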
The results obtained by the network when trained on a reduced dataset are clearly lower than the NIC results, but comparable results are nevertheless obtained, especially considering the reduced size of the training material. The contribution of pre-trained word embeddings is not significant, in line with the findings of (Vinyals et al., 2014). However, it is still interesting to note that the lexical generalization of these unsupervised word embeddings is beneficial when the size of the training material is minimal (e.g. when 1 shard is used, especially considering the CIDEr metric). As the amount of training data grows, its impact on the model decreases, until it is no longer significant.
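For reference, BLEU-4 scores such as those in Tables 2 and 3 can be reproduced with any standard implementation; the snippet below uses NLTK purely as an illustration, since the paper does not state which evaluation toolkit was adopted.

```python
# Illustrative BLEU-4 computation with NLTK (an assumption; not necessarily the
# toolkit used by the authors).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu4(references, hypotheses):
    """references: one list of reference captions (each a token list) per image;
    hypotheses: one generated caption (token list) per image."""
    return corpus_bleu(
        references,
        hypotheses,
        weights=(0.25, 0.25, 0.25, 0.25),  # uniform 4-gram weights = BLEU-4
        smoothing_function=SmoothingFunction().method1,
    )

# Toy usage with one reference per image; MSCOCO provides five references per image.
refs = [[["a", "yellow", "school", "bus", "parked", "in", "a", "handicap", "spot"]],
        [["a", "cowboy", "rides", "a", "bucking", "horse", "at", "a", "rodeo"]]]
hyps = [["a", "yellow", "bus", "parked", "in", "a", "spot"],
        ["a", "man", "rides", "a", "horse", "at", "a", "rodeo"]]
print(round(bleu4(refs, hyps), 3))
```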
As for the results on Italian, the experiments have been performed by training the model on samples of 23,000, 34,500 and 46,000 examples, where the captions have been automatically translated with Bing. The model has been evaluated against the validated sentences, and the results are reported in Table 3. These results are impressive, as they are in line with the English counterpart. This supports the robustness of the adopted architecture, as it seems to learn even from a noisy dataset of automatically translated material. Most importantly, it confirms the applicability of the proposed simple methodology for the acquisition of datasets for image captioning.

| # Shards | BLEU-4      | METEOR      | CIDEr       |
|----------|-------------|-------------|-------------|
| 1        | 11.7 / 12.9 | 16.4 / 16.9 | 27.4 / 29.4 |
| 2        | 16.9 / 17.1 | 18.8 / 18.7 | 45.7 / 45.6 |
| 5        | 22.0 / 21.4 | 21.2 / 20.9 | 62.5 / 60.8 |
| 10       | 22.4 / 22.9 | 22.0 / 21.5 | 71.9 / 68.8 |
| 20       | 23.7 / 23.8 | 22.2 / 22.0 | 73.0 / 73.2 |

Table 3: Results of im2txt for the Italian language with a training set of reduced size, without / with the use of a pre-trained word embedding.

When trained with 20 shards, the Italian captioner generates the following descriptions of the images shown in Figure 1. Image 1a: "Un autobus a due piani guida lungo una strada."; Image 1b: "Un uomo che cavalca una carrozza trainata da cavalli."; Image 1c: "Una persona che cammina lungo una strada con un segnale di stop."

An attempt to use a word embedding pre-trained on a large corpus (more precisely, on a dump of Wikipedia in Italian) has also been made, but the empirical results reported in Table 3 show that its contribution is not relevant overall, though it is still noticeable when fewer examples are adopted.

5 Conclusions

In this paper a simple methodology for the training of neural models for the automatic captioning of images is presented. We generated a large-scale corpus of about 600,000 image captions in Italian by using an automatic machine translator. Despite the noise introduced in this step, it allowed us to train one of the first neural-based image captioning systems for Italian. Most importantly, the quality of this system seems comparable with the English counterpart when trained over a comparable set of data. These results are impressive and confirm the robustness of the adopted neural architecture. We believe that the obtained resource paves the way to the definition and evaluation of neural models for image captioning in Italian, and we hope to contribute to the Italian community, possibly using the validated dataset in a future Evalita [6] campaign.

[6] http://www.evalita.it/

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Int. Res., 55(1):409–442, January.

Yoav Goldberg. 2015. A primer on neural network models for natural language processing. CoRR, abs/1510.00726.

Andrej Karpathy and Fei-Fei Li. 2014. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: common objects in context. CoRR, abs/1405.0312.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). CoRR, abs/1412.6632.

Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. CoRR, abs/1411.4555.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044.