Generating Text from Images in a Smooth Representation Space

Graham Spinks and Marie-Francine Moens
Department of Computer Science, KU Leuven, Leuven, Belgium
graham.spinks@cs.kuleuven.be sien.moens@cs.kuleuven.be

Abstract. A methodology is described for the generation of relevant captions for images of an extensive medical dataset in the ImageCLEF 2018 Caption Prediction competition. Automatic and accurate textual descriptions of images could help relieve workload pressure for specialists and assist clinical professionals in multiple areas. Instead of generating textual sequences directly from images, we first learn a smooth, continuous representation space for the captions. Subsequently, the task is reduced to minimizing the mapping loss from image to continuous representation with a deep convolutional neural network. We illustrate how our system learns to generate captions by aligning relevant embeddings. The submitted run achieves a score of 13.76% and ranks 4th out of the 5 participating teams. The top submission in the competition achieved a score of 25.01%.

1 Introduction

This paper describes our participation in the 2018 ImageCLEF Caption Prediction task [5], which is part of the 2018 ImageCLEF competition [6]. The goal is to regenerate the original caption for a set of images, where the caption is essentially a concise textual interpretation of the content of the image. The dataset consists of 4 million diverse images that cover a range of radiology/clinical data and was collected from open access biomedical journal articles (PubMed Central). No additional external data was used for this submission.

In the medical field, a large amount of effort is dedicated to correctly interpreting and describing various images. Automation of this process might help reduce the bottleneck in certain diagnosis pipelines and help medical professionals focus on more important tasks.

Generating captions from images is also a task that requires an understanding of data representations in neural networks. The cross-modal nature of the task requires a successful alignment of visual and textual data. These two modalities are quite distinct, as continuous images usually demand different processing techniques than discrete text. While the ImageCLEF competition also contains a concept detection subtask on the same dataset, this submission focuses on directly generating captions without any additional intermediate steps. This approach has the advantage that the text generation does not depend on any pre-fabricated concept labels and is inferred directly from the images.

Current image-to-text systems that employ neural networks typically combine a Convolutional Neural Network (CNN) with a discrete decoder in the form of a Recurrent Neural Network (RNN), which in practice is often a Long Short-Term Memory (LSTM) network [3][9][10]. The difficulty of such approaches often lies in the discrete nature of natural language sentences. Back-propagation is challenging for such data as the error gradient is not well-defined across the boundaries between discrete symbols.

In order to alleviate this problem, our approach starts by creating a smooth continuous code space for text, which is characterized by a coherent local structure where similar inputs are mapped to nearby codes. This contrasts with autoencoders that simply learn an identity mapping with unstructured latent representations.
The advantage is that complex modifications can be made to a text by traversing the data manifold, where nearby codes correspond to slightly modified sentences. In order to obtain such a representation we use an Adversarially Regularized Autoencoder (ARAE) [8], which trains a discrete autoencoder in an adversarial setting. In a subsequent step, we align the images to the continuous data manifold of the captions rather than to the discrete natural language. This has the benefit that in this stage we avoid the complex and costly discrete decoder step which is present in traditional image-to-text systems. Once image and text representations are aligned, we can decode the aligned vector with the decoder obtained in the previous step, thus obtaining natural language text for each image.

We will show that our method creates a textual representation space from which the input can easily be reconstructed. By aligning the visual input to this space, we create varied captions for the images and obtain a score of 13.76% on the test set.

2 Methodology

We briefly describe how the data was prepared before discussing the creation of the text representation as well as the caption generation. An overview of the entire methodology is presented in figure 1.

2.1 Data Preprocessing

In order to simplify the caption generation task, all words are converted to lowercase, while any words that appear fewer than 100 times in the entire dataset are replaced by out-of-vocabulary markers. The remaining vocabulary contains 4303 different words. Captions that exceed the length of 15 words are capped while shorter captions are padded.

All images are randomly cropped to achieve data augmentation and transformed to 256x256 resolution. The images are normalized with a mean and standard deviation of 0.5.

Fig. 1. Overview of the methodology. The encoder (E) of the ARAE model creates a textual representation for all captions which can be decoded to the original input with the decoder (D). In a second step each image is mapped to the continuous representation space with a CNN. D, for which the weights are frozen in this step, decodes the mapping to a caption for each image. CT image reference [4].

2.2 Text Representation

While there are several methodologies to create dense continuous representations for discrete structures, each comes with both advantages and disadvantages. One might, for example, consider vector-based word or sentence embeddings that are trained by predicting the context of a word, or one might simply use an autoencoder that reconstructs the original text from a compact representation. While word or sentence embeddings capture basic semantic information, their performance in additional tasks is often quite limited. Autoencoders do create a dense representation, but the learned representation space is not smooth [8].

For this task, we use an ARAE [8] to construct smooth, continuous representations of the sentences. Such representations have been shown to lie in a smoother, contracted code space than that of a typical autoencoder, with the benefit that similar inputs are mapped to nearby codes. The ARAE combines the training of a generator (G) and critic (C) of a Generative Adversarial Network (GAN) [2] with an encoder (E) and decoder (D) of a regular discrete-input autoencoder. In this setup, E creates a continuous text representation t̂ of the input text, while D uses a cross-entropy loss to try to recreate the original sentence.
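To make this reconstruction path concrete, the listing below sketches an encoder E and decoder D that map padded word ids to a continuous code t̂ and back under a cross-entropy objective. It is a minimal illustration only: the framework (PyTorch), the LSTM layers, all dimensions and the handling of special tokens are assumptions rather than the exact configuration of the ARAE in [8], and the adversarial components G and C discussed next are omitted.

# Minimal sketch of the autoencoder half of an ARAE-style model: an encoder E
# producing a continuous code t_hat and a decoder D reconstructing the tokens.
# All dimensions, the LSTM layers and the special-token handling are assumed.
import torch
import torch.nn as nn

VOCAB_SIZE = 4303 + 4     # vocabulary plus assumed special markers (pad/EOS/OOV/start)
MAX_LEN = 15              # captions are capped/padded to 15 words (section 2.1)
EMB_DIM, HID_DIM, CODE_DIM = 300, 300, 300   # assumed sizes

class Encoder(nn.Module):
    """E: maps a padded sequence of word ids to a continuous code t_hat."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.rnn = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)
        self.proj = nn.Linear(HID_DIM, CODE_DIM)

    def forward(self, tokens):                       # tokens: (batch, MAX_LEN)
        _, (h, _) = self.rnn(self.emb(tokens))
        return torch.nn.functional.normalize(self.proj(h[-1]), dim=-1)

class Decoder(nn.Module):
    """D: predicts the word ids at every position, conditioned on t_hat."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.rnn = nn.LSTM(EMB_DIM + CODE_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, VOCAB_SIZE)

    def forward(self, t_hat, tokens):                # teacher forcing, unshifted for brevity
        code = t_hat.unsqueeze(1).expand(-1, tokens.size(1), -1)
        h, _ = self.rnn(torch.cat([self.emb(tokens), code], dim=-1))
        return self.out(h)                           # (batch, MAX_LEN, VOCAB_SIZE)

E, D = Encoder(), Decoder()
tokens = torch.randint(0, VOCAB_SIZE, (8, MAX_LEN))  # dummy batch of preprocessed captions
logits = D(E(tokens), tokens)
recon_loss = nn.CrossEntropyLoss()(logits.view(-1, VOCAB_SIZE), tokens.view(-1))
recon_loss.backward()                                # reconstruction step only; G and C omitted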
Additionally, G is a 2-layer feedforward network that learns how to generate realistic representations t̂. C estimates the Wasserstein distance between the generated and real distribution as defined in the Wasserstein GAN (WGAN) [1], such that G is explicitly trained to minimize that distance. As a side-result of this setup, G eventually learns to create diverse texts with a low perplexity score [11]. To keep the overview concise, only E and D are shown for the ARAE model in figure 1.

For this competition, the ARAE model of [8] is modified by passing the discrete integer list of text inputs to both the generator and critic after normalization between -0.5 and 0.5. This is done to encourage the ARAE to learn an even smoother version of the code space, as the critic, C, learns to identify when a representation does not match a text input. We also slightly adapt the softmax-temperature parameter, which influences the extent to which the code of one text differs from that of another. To increase variability we use a softmax temperature of 0.1 rather than 1.0 when calculating the cross-entropy loss. In order to obtain good generalization we perform early stopping and select the model for which the reconstruction error on the validation set is minimal.

Using only the captions in the training set, a smooth manifold for the captions is thus created with the above model. In essence, each caption now has an equivalent continuous representation t̂, from which the original caption can be reconstructed. With such a representation, an alignment can be learned between the visual features and the relevant captions for each image, as explained in the next subsection.

2.3 Image-to-Text

In order to map the visual features to the continuous space we created for the captions, the input images are passed through a deep neural network that consists of 8 convolutional blocks. One such block contains a convolutional layer followed by a batch normalization layer and a LeakyReLU activation function. At the end of the network another convolutional layer and two fully connected layers are added. The output is then compared to the continuous textual representation of the caption as constructed in section 2.2.

In our experiments we devised two methods to determine how suitable a text is for each image. The first method simply uses a loss function derived from the cosine similarity between two embeddings. The output of the CNN is then trained to be as similar to the continuous text code, t̂, as possible. The second method essentially does the same but runs the output of the CNN through the decoder, D, that was trained before (see section 2.2). The generated output distribution is subsequently compared to the one generated from the continuous representation of the original caption, using the same cosine similarity metric as before. The reasoning behind this approach is that performance might improve after decoding the representation to individual time-steps, as more information for alignment is available. For both approaches, the weights of D are not updated during this stage.
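As an illustration of the first method, the listing below maps a batch of normalized 256x256 images to the caption code space and trains with a cosine-based loss while the decoder is left untouched. The channel widths, kernel sizes and code dimension are assumptions; only the overall structure (eight convolution/batch-norm/LeakyReLU blocks, a final convolution, two fully connected layers and a cosine similarity objective) follows the description above.

# Minimal sketch of method 1 (section 2.3): a convolutional network maps an
# image to the caption code space and is trained with a cosine-based loss.
# Channel widths, kernel sizes and CODE_DIM are assumptions for illustration.
import torch
import torch.nn as nn

CODE_DIM = 300            # must match the dimensionality of t_hat (assumed)

def conv_block(c_in, c_out):
    """One block: convolution, batch normalization, LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2, inplace=True),
    )

class ImageToCode(nn.Module):
    def __init__(self):
        super().__init__()
        widths = [3, 32, 64, 128, 256, 256, 256, 256, 256]    # 8 blocks, assumed widths
        self.blocks = nn.Sequential(*[conv_block(widths[i], widths[i + 1])
                                      for i in range(8)])
        self.final_conv = nn.Conv2d(256, 256, kernel_size=1)  # a 256x256 input is now 1x1
        self.fc = nn.Sequential(nn.Linear(256, 256), nn.LeakyReLU(0.2),
                                nn.Linear(256, CODE_DIM))

    def forward(self, images):                                # images: (batch, 3, 256, 256)
        h = self.final_conv(self.blocks(images)).flatten(1)
        return self.fc(h)                                     # (batch, CODE_DIM)

model = ImageToCode()
images = torch.randn(4, 3, 256, 256)                          # dummy normalized crops
t_hat = torch.randn(4, CODE_DIM)                              # target caption codes from E
pred = model(images)
# Method 1: pull the predicted code towards the caption code with a cosine loss.
loss = nn.CosineEmbeddingLoss()(pred, t_hat, torch.ones(4))
loss.backward()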
3 Results

A first important task in our system is to create textual representations that can be decoded to match the original text with enough accuracy. In order to do so, we perform preprocessing on the text as detailed in section 2.1 and train an ARAE model to encode and decode the sentences as detailed in section 2.2. For a trained ARAE, we demonstrate a range of good and bad examples of the original and decoded sentences in table 1.

Table 1. Examples of preprocessed captions and their reconstruction from the autoencoder of the ARAE. EOS indicates the end-of-sentence marker while OOV indicates the out-of-vocabulary marker.

Original: computed tomography of the abdomen showing enlarged adrenal glands , the left gland EOS
Reconstruction: computed tomography of the abdomen showing a left gland with with liver kidney EOS

Original: microscopic examination of the tumor specimen by hematoxylin and eosin stain revealed that EOS
Reconstruction: microscopic findings of the tumor showed showed hematoxylin and eosin staining ; that EOS

Original: ultrasound of the right upper quadrant showing the gallbladder free of stones blue EOS
Reconstruction: axial image the right kidney quadrant showing a common with wall the . EOS

Original: secondary electron image OOV of a fractured surface of an OOV lingual bar EOS
Reconstruction: sem structure photomicrograph of of a representative surface showing the OOV showing view EOS

The captions are subsequently encoded and the convolutional network is tasked with finding the optimal caption for each image. The captions are generated from the output representations with a greedy method and are evaluated using the script provided by ImageCLEF. Before calculating the score, stemming is performed and stopwords and punctuation are removed. The evaluation score is the percentage of the obtained BLEU score over all sentences compared to the maximum possible BLEU score; if one were to achieve the best possible BLEU score for each sentence, the score would be 100%.
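The official evaluation script is not reproduced here, but the listing below gives a rough approximation of such a score under stated assumptions: captions are lowercased, punctuation and stopwords are removed, the remaining tokens are stemmed, and per-caption BLEU is averaged and reported as a percentage of the maximum possible score. The NLTK-based implementation and the smoothing choice are illustrative and may differ from the official procedure.

# Rough, unofficial approximation of the evaluation procedure: lowercase,
# strip punctuation and stopwords, stem, then average per-caption BLEU and
# report it as a percentage of the maximum possible score (1.0 per caption).
# The NLTK calls and the smoothing choice are assumptions for illustration.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))
STEM = PorterStemmer().stem

def preprocess(caption):
    """Lowercase, drop punctuation and stopwords, stem the remaining tokens."""
    tokens = caption.lower().translate(str.maketrans("", "", string.punctuation)).split()
    return [STEM(t) for t in tokens if t not in STOP]

def score(references, candidates):
    """Mean sentence-level BLEU, expressed as a percentage of the maximum (100%)."""
    smooth = SmoothingFunction().method1
    bleus = [sentence_bleu([preprocess(ref)], preprocess(cand), smoothing_function=smooth)
             for ref, cand in zip(references, candidates)]
    return 100.0 * sum(bleus) / len(bleus)

print(score(["computed tomography of the abdomen"],
            ["computed tomography scan of the abdomen showing a large mass"]))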
Using the cosine similarity metric, we reach a score of roughly 13.5% on the validation set when comparing the continuous embeddings directly (method 1 in section 2.3). If we use method 2, i.e. after passing the embeddings through the decoder, we obtain a score of roughly 12.4%. For the ImageCLEF submission we only submitted the results of method 1, which obtained a score of 13.76% on the test set, indicating that the system did not overfit on the validation set. Note that since we are using a cutoff of 15 words per caption, the maximum obtainable score is roughly 36.4% for such captions, as measured on our ground truth validation set.

While the performance does improve over the first epochs, it turns out that the network is not able to line up the different embeddings with high accuracy. In fact, the output sentences evolve to quite similar outputs where only some details are modified, as illustrated in table 2.

Table 2. Examples of output texts for different input images. While several captions are quite similar overall, some details are usually slightly modified.

Output examples:
figure 2 a fundus photograph of the right eye showing a large and
figure showing a mass in the right and the uterus and ovaries
computed tomography scan showing a large mass in the right kidney and ureter
computed tomography scan of the abdomen showing a large mass in the right

This research provides an interesting direction for new image-to-text systems as there are several possible avenues for improvement. In a first step, training a stable model for larger captions might provide an immediate boost in performance, as more relevant textual output can be aligned with the images, thus obtaining higher BLEU scores. Another possibility to improve the results is to investigate different distance functions. While a simple cosine embedding loss was used in this paper, this type of alignment might benefit from a measure that expresses a distributional divergence, such as the Wasserstein distance [1]. Finally, besides using an ARAE, other methods that create continuous representations might be more suitable for this type of alignment. For example, one might consider building a representation that includes concept labels or is constructed with image alignment in mind, such as the char-CNN-RNN representation [7].

4 Conclusion

We present an alternative approach to caption generation by leveraging continuous representations for text that were learned with an ARAE model. Images are aligned to the continuous representations rather than to discrete natural language. Measured as a percentage of the obtained BLEU scores over all sentences compared to the maximum possible BLEU score, this methodology achieves 13.76% on the submitted run and offers a promising avenue for follow-up research. The proposed setup can be a starting point for implementations with alternative network configurations and text representations that aim to improve on the obtained results.

References

1. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
2. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2672–2680 (2014)
3. Hasan, S.A., Ling, Y., Liu, J., Sreenivasan, R., Anand, S., Arora, T.R., Datla, V., Lee, K., Qadir, A., Swisher, C., et al.: PRNA at ImageCLEF 2017 caption prediction and concept detection tasks (2017)
4. Hellerhoff: Leberabszess - CT axial PV.jpg, CC BY 3.0
5. García Seco de Herrera, A., Eickhoff, C., Andrearczyk, V., Müller, H.: Overview of the ImageCLEF 2018 caption prediction tasks. In: CLEF2018 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Avignon, France (September 10-14 2018)
6. Ionescu, B., Müller, H., Villegas, M., de Herrera, A.G.S., Eickhoff, C., Andrearczyk, V., Cid, Y.D., Liauchuk, V., Kovalev, V., Hasan, S.A., Ling, Y., Farri, O., Liu, J., Lungren, M., Dang-Nguyen, D.T., Piras, L., Riegler, M., Zhou, L., Lux, M., Gurrin, C.: Overview of ImageCLEF 2018: Challenges, datasets and evaluation. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), Lecture Notes in Computer Science, Springer, Avignon, France (September 10-14 2018)
7. Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language models. In: AAAI. pp. 2741–2749 (2016)
8. Zhao, J., Kim, Y., Zhang, K., Rush, A.M., LeCun, Y.: Adversarially regularized autoencoders for generating discrete structures. arXiv preprint arXiv:1706.04223 (2017)
9. Liang, S., Li, X., Zhu, Y., Li, X., Jiang, S.: ISIA at the ImageCLEF 2017 image caption task (2017)
10. Lyndon, D., Kumar, A., Kim, J.: Neural captioning for the ImageCLEF 2017 medical image challenges (2017)
11. Spinks, G., Moens, M.F.: Generating continuous representations of medical texts. In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2018)