PUC Chile team at Caption Prediction: ResNet visual
encoding and caption classification with Parametric
ReLU
Vicente Castro, Pablo Pino, Denis Parra and Hans Lobel
Pontificia Universidad Católica de Chile, Av. Vicuña Mackenna 4860, Macul, 7820244, Chile


                                      Abstract
                                      This article describes PUC Chile team’s participation in the Caption Prediction task of ImageCLEFmed-
                                      ical challenge 2021, which resulted in the team winning this task. We first show how a very simple
                                      approach based on statistical analysis of captions, without relying on images, results in a competitive
                                      baseline score. Then, we describe how to improve the performance of this preliminary submission by
                                      encoding the medical images with a ResNet CNN, pre-trained on ImageNet and later fine-tuned with
                                      the challenge dataset. Afterwards, we use this visual encoding as the input for a multi-label classifi-
                                      cation approach for caption prediction. We describe in detail our final approach, and we conclude by
                                      discussing some ideas for future work.

                                      Keywords
                                      Image Captioning, Medical Artificial Intelligence, Deep Learning, Perceptual Similarity, Convolutional
                                      Neural Networks




1. Introduction
ImageCLEF [1] is an initiative with the aim of advancing the field of image retrieval (IR) as
well as enhancing the evaluation of technologies for annotation, indexing and retrieval of
visual data. The initiative takes the form of several challenges, and it is especially aware of the
changes in the IR field in recent years, which have brought about tasks requiring the use of
different types of data such as text, images and other features moving towards multi-modality.
ImageCLEF has been running annually since 2003, and since its second edition (2004) medical
images have been involved in some tasks, such as medical image retrieval. Since then, the
ImageCLEFmedical group of tasks [2] has integrated new tasks involving medical images, with
the medical image captioning task taking place since 2017. It consists of two subtasks: concept
detection and caption prediction. Although there have been changes in
the data used for the newest versions of the challenge, the goal of this task is the same: help
physicians reduce the burden of manually translating visual medical information (such as
radiology images) into textual descriptions. In particular, the caption prediction task within
the ImageCLEFmedical challenge 2021 aims at supporting clinicians in their responsibility to
provide clinical diagnoses by composing coherent captions for the entirety of a medical image.


CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" vvcastro@uc.cl (V. Castro)
                                    © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
   In this document we describe the participation of our team from the HAIVis group1 within
the artificial intelligence laboratory2 at Pontificia Universidad Católica de Chile (PUC Chile
team) in the image captioning task at ImageCLEFmedical 2021 [2]. Our team earned 1st place
in this challenge, and our best submission combined deep learning techniques to visually encode
the medical images with a multi-label classification of caption words, which were then re-ordered
using statistical information obtained from the training dataset.
   The rest of the paper is structured as follows: section 2 describes our data analysis, while
in section 3 we provide details of our proposed methods and experiments for model training
and validation. Later, in section 4 we provide details of our results, and finally in section 5 we
conclude our article.


2. Data Analysis
The dataset provided for this challenge consists of two sets of 2,756 and 500 image-caption pairs
for training and validation, respectively. Each caption is a natural language text: a highly
technical annotation written by physicians describing abnormalities and medical objects in the
corresponding image.




Figure 1: Dataset examples


   Each caption was processed with the NLTK library3 [3], following the evaluation methodology
of the task4 : (1) The caption is converted to lowercase. (2) All punctuation marks are removed
and each caption is split into individual words. (3) Stopwords are removed using NLTK’s
“English” stopwords list. (4) Stemming is applied with NLTK’s Snowball stemmer.
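
   A minimal sketch of this preprocessing is shown below, assuming the NLTK English stopword
list and Snowball stemmer are available; the function name is illustrative, not taken from the
official evaluation code.

    import string
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer

    STOPWORDS = set(stopwords.words("english"))
    STEMMER = SnowballStemmer("english")

    def preprocess_caption(caption):
        caption = caption.lower()                                               # (1) lowercase
        caption = caption.translate(str.maketrans("", "", string.punctuation))  # (2) remove punctuation
        words = caption.split()                                                 #     and split into words
        return [STEMMER.stem(w) for w in words if w not in STOPWORDS]           # (3) stopwords, (4) stemming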

   1
     http://haivis.ing.puc.cl/
   2
     http://ialab.ing.puc.cl/
   3
     https://www.nltk.org/
   4
     Evaluation Methodology at https://www.imageclef.org/2021/medical/caption
  Figure 2 shows the distribution of words per number of appearances in the dataset; for example,
28% of the words have only one occurrence. Figure 3 shows the distribution of caption lengths.




Figure 2: Distribution of number of words per number of appearances.




Figure 3: Caption length distribution.


  Figure 4 shows the most common words in the dataset and their number of appearances:
some words are very frequent, appearing in about 40% of all training captions. From a semantic
analysis, these words seem to carry broader, more descriptive meanings about the different
elements in the images. This may directly explain why simpler, naive methods based on
statistical approaches can outperform more complex models.
Figure 4: Most common words in the dataset.




3. Method and Experimentation
While addressing the task we tried three main approaches: a pure statistical method, a multi-label
classification approach (MLC) and a perceptual-similarity-based model (Sim).

3.1. Statistical approach
Our initial approach to the challenge tried to leverage the statistics related to the composition
of each caption. This first model was a naive algorithm that randomly selected a caption length
from the training set, created a list of this length with the most popular words in the dataset,
and shuffled them to get a random order. This simple method obtained a mean BLEU score of
0.357 on the validation set and, when submitted, scored 0.378 points in the test set.
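   A minimal sketch of this naive baseline is shown below; `train_captions` is assumed to be the
list of preprocessed training captions (lists of words).

    import random
    from collections import Counter

    def naive_caption(train_captions):
        lengths = [len(c) for c in train_captions]
        word_counts = Counter(w for c in train_captions for w in c)
        n = random.choice(lengths)                          # caption length sampled from the training set
        words = [w for w, _ in word_counts.most_common(n)]  # the n most popular words
        random.shuffle(words)                               # random order
        return " ".join(words)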
   This first approach helped us gain an intuition about how the BLEU score varied and how
sensitive it was to the different components of a caption. Our hypothesis was that the correctness
of the words in a caption mattered more to the metric than their order. To test this assumption,
we explored a multi-label classification approach that, given an image, predicts the most relevant
words for its caption.

3.2. Multi-label classification approach (MLC)
In this approach we considered each word as a class and trained a Convolutional Neural Network
(CNN) to predict the words of a caption given an image. Then, the top-scoring words are
selected and ordered by a statistical rule to produce the final caption. Figure 5 shows the full
pipeline for caption generation with our approach; we give more details next5:


   5
       For all our implementations we used PyTorch as our main DL framework: https://pytorch.org/
3.2.1. Preprocessing
We process each image-caption pair to reduce the number of target classes (words), and to
prepare the image to pass it through the network. The following steps were applied:

   1. Caption processing: We processed each caption according to the evaluation methodol-
      ogy described in the previous section, transforming each caption into a list of stemmed
      words (labels). The vocabulary is composed of all the words in the training data with
      four or more appearances. We did not perform any special handling for words in the
      validation set that were not present in the training vocabulary. After filtering, the training
      vocabulary size was reduced to 1,075 words (1,189 when using the training and validation
      sets).
   2. Image processing: Each image is transformed to have pixel values within a [0, 1] range
      (in each RGB channel) and is then normalized by the per-channel mean and standard
      deviation, according to the torchvision documentation 6 . As a data augmentation
      method, a 300x300-pixel crop is taken from the image. For the training set, this crop is
      selected from a random location, whereas for validation and testing, the central crop of
      the image is always taken. This is a common training setup and has been used for similar
      purposes in past editions of the challenge [4]; a sketch of this pipeline is shown after
      this list.
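
   A minimal sketch of the image pipeline, assuming the standard ImageNet normalization
statistics from the torchvision documentation; the exact transform objects are illustrative.

    from torchvision import transforms

    # Standard ImageNet statistics from the torchvision model documentation.
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

    train_transform = transforms.Compose([
        transforms.RandomCrop(300),   # 300x300 crop from a random location
        transforms.ToTensor(),        # pixel values scaled to [0, 1]
        normalize,
    ])

    eval_transform = transforms.Compose([
        transforms.CenterCrop(300),   # always the central 300x300 crop
        transforms.ToTensor(),
        normalize,
    ])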

3.2.2. Classification training
Several ResNet [5] and DenseNet [6] model architectures were tested, with and without fine-
tuning from ImageNet [7] pre-trained weights. Fine-tuning of a DenseNet121 model pre-trained
on the ChestX-ray14 dataset [8] was also tested. Different layers of the network were frozen
during fine-tuning, as a measure to avoid over-fitting.
   In addition, the last layer of the network was replaced with a fully connected layer matching
the size of the training vocabulary. Furthermore, we added a dropout layer and passed the
resulting values through a Parametric ReLU (PReLU) [9] activation function. With this, the
output of our model was a vector with one dimension per vocabulary word and an unbounded
range.
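
   A minimal sketch of this classifier head, assuming a ResNet34 backbone; the dropout rate and
the set of frozen layers shown here are illustrative assumptions, since they varied across runs.

    import torch.nn as nn
    from torchvision import models

    VOCAB_SIZE = 1075  # filtered training vocabulary

    model = models.resnet34(pretrained=True)      # ImageNet pre-trained weights
    in_features = model.fc.in_features
    model.fc = nn.Sequential(
        nn.Dropout(p=0.5),                        # dropout rate is an assumption
        nn.Linear(in_features, VOCAB_SIZE),       # one output score per vocabulary word
        nn.PReLU(),                               # unbounded output range
    )

    # Illustrative freezing scheme: only the new head is left trainable.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("fc.")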
   In training, we sought to minimize the Binary Cross Entropy loss between the vector predicted
by the model and the binary (multi-hot) encoded ground truth, calculated as:
$$L = \sum_{c=1}^{C} -w_c \left[\, y_c \cdot \log \sigma(x_c) + (1 - y_c) \cdot \log\left(1 - \sigma(x_c)\right) \right]$$

where 𝐶 is the number of labels to classify and 𝜎 is the sigmoid function. In code, this loss was
calculated with the BCEWithLogitsLoss7 function from PyTorch. As optimizer, we used Adam [10]
with no weight decay and an initial learning rate of 5e-4; after epoch 15, this last hyper-parameter
was reduced to 1e-4.
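
   A minimal sketch of this training setup; the model, data loader and number of epochs are
assumptions standing in for our actual training scripts.

    import torch
    import torch.nn as nn

    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=0)

    for epoch in range(20):                       # number of epochs is an assumption
        if epoch == 15:
            for group in optimizer.param_groups:
                group["lr"] = 1e-4                # reduce the learning rate after epoch 15
        for images, targets in train_loader:      # targets: float multi-hot vectors of size |vocab|
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()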



   6
       Documentation @ https://pytorch.org/vision/stable/models.html
   7
       https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html
Figure 5: Model diagram. Top 𝑁 = 23 classified words are selected for the caption.


3.2.3. Captioning
Once the classification output is obtained from our visual model, it needs to be translated into a
caption. We define N as the length of the output caption, a hyper-parameter of the model and
choose the N highest scoring words. Then, we used a statistical approach to order the words in
a logical sentence: for each word, we define its position as the most common one it has across
all training captions.
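
   A minimal sketch of this captioning step; `vocab` and `word_positions` (each word's most
common position in the training captions) are assumed to be precomputed.

    import torch

    def build_caption(logits, vocab, word_positions, n):
        """logits: 1-D tensor of word scores; n: output caption length (hyper-parameter)."""
        _, top_idx = logits.topk(n)                          # N highest-scoring words
        words = [vocab[i] for i in top_idx.tolist()]
        words.sort(key=lambda w: word_positions.get(w, n))   # order by typical training position
        return " ".join(words)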
   Two output examples from our model are shown next, with good (Fig. 6) and bad (Fig. 7)
performance:




Figure 6: Example of caption prediction with good performance, BLEU = 0.850 (N=23)
Figure 7: Example of caption prediction with bad performance, BLEU = 0 (N=23)


3.3. Similarity-based approach (Sim)
Another method that we used, and which resulted in fairly good experimental performance, was a
similarity-based approach. For each test image, we ranked the most similar images in the training
set using the Learned Perceptual Image Patch Similarity (LPIPS) 8 [11], a learned metric based on
the similarity between deep features from several layers of a neural network (in our experiments,
an AlexNet [12] model). Then, the caption of the closest training image is assigned to the test
image.
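
   A minimal sketch of this retrieval step using the authors' lpips package with an AlexNet
backbone; image tensors are assumed to be RGB, shaped (1, 3, H, W) and scaled to [-1, 1], and
`train_images`/`train_captions` are assumed to be aligned lists.

    import lpips
    import torch

    distance_fn = lpips.LPIPS(net="alex")   # perceptual distance from AlexNet features

    def closest_caption(test_image, train_images, train_captions):
        with torch.no_grad():
            distances = [distance_fn(test_image, img).item() for img in train_images]
        best = min(range(len(train_images)), key=lambda i: distances[i])
        return train_captions[best]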
   This approach resulted in a very good test performance and helped us reach and maintain a
top-3 position on the leaderboard. Furthermore, we tested this approach in the concept detection
task, where it also performed well.


4. Results
To evaluate our model we measured the BLEU score [13] of each generated caption against its
ground truth, following the challenge evaluation procedure9 . It is important to emphasize that
this metric must be computed with version 3.2.2 of the NLTK library, since later versions
change the results considerably. Table 1 shows our methods' scores on the validation set.

Table 1
Results in the validation set.
                                               Method                     BLEU

                               Sim: LPIPS Similarity (from AlexNet)        0.459
                                         MLC: ResNet                       0.544
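
   For reference, a minimal sketch of how a per-caption BLEU score can be computed with NLTK,
assuming captions are first preprocessed as in Section 2; the exact smoothing and settings follow
the official challenge script, and NLTK 3.2.2 must be used.

    from nltk.translate.bleu_score import sentence_bleu

    # `preprocess_caption` is the preprocessing sketch from Section 2.
    reference = preprocess_caption("Axial CT image showing a hypodense lesion in the liver.")
    hypothesis = preprocess_caption("CT image shows a hypodense liver lesion.")
    score = sentence_bleu([reference], hypothesis)   # default 4-gram BLEU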



   Additionally, we measured word recall as a metric for the classification method. Since BLEU is
a precision-based metric, including a recall-based metric should help evaluate the performance
more extensively, leading to better captions.
    8
        Code available @ https://github.com/richzhang/PerceptualSimilarity
    9
        Refer to "Evaluation methodology" @ https://www.imageclef.org/2021/medical/caption
Figure 8: BLEU and recall score during training (N=26)
   The best result was achieved with the multi-label classification approach, using a ResNet34
[5] model pre-trained on ImageNet and fine-tuned for 15 epochs with only the last 5 layers
learnable, while the other layers were kept frozen. The training scheme described above was
followed. For word selection we set N=26, a value inferred from the distribution in Figure 3
and validated with experimental results. Figure 8 shows the evolution of BLEU and word recall
during training.

4.1. CrowdAI Runs
Several submissions were made to crowdai.org using the methods described; the details and
results of the main runs are shown in Table 2.

Table 2
Submission results.
            Run      Method                                                                 BLEU
           Subm1     Statistical: random length + most common words + random order          0.378
           Subm3     MLC: ResNet50, random length + fixed order                             0.351
           Subm4     Sim: LPIPS similarity approach                                         0.442
           Subm6     MLC: ResNet34, most common index for ordering, trained for 20 epochs   0.509
           Subm7     MLC: ResNet34, most common index for ordering, trained for 15 epochs   0.510
5. Conclusion
In this article we have provided details of the participation of the PUC Chile team, winners
of the caption prediction task within the ImageCLEFmedical challenge 2021. In the process of
building our final submission, we tested several approaches, detailed in this paper. Our final
submission was based on a ResNet34 architecture to visually encode the input medical image,
followed by predicting captions as a multi-label word classification task, and finally re-ranking
the word order based on statistical information from the training dataset. In future work, we
plan to test other CNN architectures, perform further experiments exploiting perceptual
similarity, and explore other techniques for neural language modeling.


Acknowledgments
This work was partially funded by ANID - Millennium Science Initiative Program - Code
ICN17_002 and by ANID, FONDECYT grant 1191791.


References
 [1] B. Ionescu, H. Müller, R. Péteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A.
     Hasan, S. Kozlovski, V. Liauchuk, Y. Dicente, V. Kovalev, O. Pelka, A. G. S. de Herrera,
     J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D.
     Stefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid,
     A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval
     in medical, nature, internet and social media applications, in: Experimental IR Meets
     Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International
     Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer
     Science, Springer, Bucharest, Romania, 2021.
 [2] O. Pelka, A. Ben Abacha, A. García Seco de Herrera, J. Jacutprakart, C. M. Friedrich,
     H. Müller, Overview of the ImageCLEFmed 2021 concept & caption prediction task,
     in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest,
     Romania, 2021.
 [3] E. Loper, S. Bird, NLTK: The Natural Language Toolkit, in: Proceedings of the ACL-02
     Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing
     and Computational Linguistics - Volume 1, ETMTNLP ’02, Association for Computational
     Linguistics, USA, 2002, p. 63–70. URL: https://doi.org/10.3115/1118108.1118117. doi:10.
     3115/1118108.1118117.
 [4] D. Lyndon, A. Kumar, J. Kim, Neural Captioning for the ImageCLEF 2017 Medical Image
     Challenges., in: CLEF (Working Notes), 2017.
 [5] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: 2016
     IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
     doi:10.1109/CVPR.2016.90.
 [6] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely Connected Convolutional
     Networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
     2017, pp. 2261–2269. doi:10.1109/CVPR.2017.243.
 [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical
     image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition,
     2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
 [8] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R. M. Summers, ChestX-ray8: Hospital-
     Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and
     Localization of Common Thorax Diseases, in: Proceedings of the IEEE Conference on
     Computer Vision and Pattern Recognition (CVPR), 2017.
 [9] K. He, X. Zhang, S. Ren, J. Sun, Delving Deep into Rectifiers: Surpassing Human-Level Per-
     formance on ImageNet Classification, in: Proceedings of the IEEE International Conference
     on Computer Vision (ICCV), 2015.
[10] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: Y. Bengio, Y. LeCun
     (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego,
     CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/
     1412.6980.
[11] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The Unreasonable Effectiveness of
     Deep Features as a Perceptual Metric, in: Proceedings of the IEEE Conference on Computer
     Vision and Pattern Recognition (CVPR), 2018.
[12] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolu-
     tional Neural Networks, in: Proceedings of the 25th International Conference on Neural
     Information Processing Systems - Volume 1, NIPS’12, Curran Associates Inc., Red Hook,
     NY, USA, 2012, p. 1097–1105.
[13] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: A Method for Automatic Evaluation
     of Machine Translation, in: Proceedings of the 40th Annual Meeting on Association
     for Computational Linguistics, ACL ’02, Association for Computational Linguistics, USA,
     2002, p. 311–318. URL: https://doi.org/10.3115/1073083.1073135. doi:10.3115/1073083.
     1073135.