Kdelab at ImageCLEF 2021: Medical Caption
Prediction with Effective Data Pre-processing and
Deep Learning
Riku Tsuneda¹, Tetsuya Asakawa² and Masaki Aono³
¹ Department of Computer Science and Engineering, Toyohashi University of Technology, Aichi, Japan


                                         Abstract
                                         The ImageCLEF 2021 Caption Prediction Task is an example of a challenging research problem in the
                                         field of image captioning. The goal of this research is to automatically generate accurate captions
                                         describing a given medical image. We describe our approach to captioning medical images and illustrate
                                         the text and image pre-processing that is effective for the task dataset. In this paper, we have applied
                                         sentence-ending period removal as text pre-processing and histogram normalization of luminance as
                                         image pre-processing. Furthermore, we examine the effectiveness of our text data augmentation approach.
                                         The submission of our kdelab team on the task test dataset achieved a BLEU score of 0.362.

                                         Keywords
                                         Image Captioning, Deep Learning, Medical Images




1. Introduction
In recent years, multimodal processing of images and natural language has attracted much
attention in the field of machine learning. Image Captioning is one of these representative tasks,
which aims at generating appropriate captions for input images. As the accuracy of such models
improves, it is expected that computers will not only be able to detect objects in images, but also
to understand the relationships and behaviors between them.
   Image captioning is also effective in the medical field. For example, interpreting and summa-
rizing possible disease symptoms from a large number of radiology images (e.g. X-ray images
and CT images) is a time-consuming task that can only be performed by highly knowledgeable
specialists. If computers could understand medical images and generate accurate captions, it
would help address the world’s growing shortage of medical doctors. However, a remaining
bottleneck is that few physicians are able to provide accurate annotations.
   In this paper, we describe our approach to the Image Captioning task in the medical domain,
as illustrated in Fig. 1 (right).
   The nature of medical images is quite different from that of general images such as those in
MS-COCO [1] in many aspects.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" tsuneda.riku.am@kde.cs.tut.ac.jp (R. Tsuneda); asakawa@kde.cs.tut.ac.jp (T. Asakawa); aono@tut.jp (M. Aono)
 0000-0002-3063-7489 (R. Tsuneda); 0000-0003-1383-1076 (T. Asakawa); 0000-0002-8345-7094 (M. Aono)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Figure 1: Example of general (left) and medical (right) Caption Prediction data. Left image via
MS-COCO, CC BY 4.0 (https://cocodataset.org/).


   In the following, we first describe related work on Image Captioning and Medical Image
Captioning in Section 2, followed by a description of the dataset provided for the ImageCLEF
2021 [2] Medical Image Captioning task [3] in Section 3. In Section 4, we describe the details
of the method we have applied, and then describe the experiments we have conducted in Section 5.
We finally conclude this paper in Section 6.


2. Related Work
In the field of image recognition, convolutional neural networks (CNNs), including VGG [4] and
ResNet [5], have been widely used. In the field of natural language processing for text under-
standing, encoder-decoder models (seq2seq) [6] have been the mainstream, but in recent years
Transformers [7] such as BERT [8] have become common. The Image Captioning task is a
fusion of image recognition and sentence generation, and lies at the intersection of these two fields.
   For example, Oriol Vinyals et al. proposed caption generation using an encoder-decoder
model [9], and Kelvin Xu et al. proposed Show, Attend and Tell, which adds visual attention
to the encoder-decoder model [10]. Recently, P. Anderson et al. presented a model using
Bottom-Up Attention obtained by pre-training a Faster R-CNN used for object detection [11].
   In addition, this is the first time a Caption Prediction Task of this kind has been held at an
ImageCLEF conference. However, a similar task, the VQA-Med task [12], was contested at
ImageCLEF 2018, 2019, and 2020.
3. Dataset of ImageCLEF 2021 Caption Prediction
For the ImageCLEF 2021 Medical Caption Prediction task, organizers have provided us with a
training set of 2,756 radiology images with the same number of captions, a validation set of 500
radiology images with the same number of captions, and a test set of 444 radiology images with
the same number of captions. We use these as our datasets. Most of the images in the dataset
are grayscale, and they potentially include non-essential logos and text. The task participants
have to automatically generate a caption from each radiology image.
   According to our analysis, the top word frequencies were dominated by prepositions and by
words such as “right” and “left” that indicate position. The word cloud of case-insensitive words
and the top 12 words in terms of word frequency are summarized in Figure 2 and Table 1,
respectively.




Figure 2: Word Cloud of caption descriptions




4. Methodology
The overview of our Medical Image Captioning methodology is divided into three main parts,
as shown in Figure 3.
The first is the image and text pre-processing, for which we propose a method of pre-processing
the images and text in the dataset. The second is the encoder part, in which the features of the
image are extracted. The third is the decoder part, in which words are predicted recursively using
an LSTM [13] and an attention mechanism (a minimal decoding-step sketch is given below).

Table 1
Word Frequency Ranking in the Dataset
                   Rank       Word         Freq    Rank       Word        Freq
                    1          right       824      7          axial      372
                    2           left       672      8        images       327
                    3         mass         616      9         image       326
                    4            ct        534      10       within       272
                    5      demonstrates    442      11        lesion      246
                    6        contrast      373      12     demonstrate    244

   We have adopted Show, Attend and Tell as the base model. This model is known to achieve
high accuracy among Image Captioning models that do not use object detection such as Faster
R-CNN [14].
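   To make the decoder part concrete, the following is a minimal sketch of a single attention-weighted
LSTM decoding step in PyTorch, under our assumptions about dimensions and layer names; it is
illustrative, not the exact implementation used in our experiments.

import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """One soft-attention LSTM decoding step (illustrative sketch)."""
    def __init__(self, enc_dim=2048, emb_dim=512, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.att_enc = nn.Linear(enc_dim, hid_dim)   # project encoder features
        self.att_hid = nn.Linear(hid_dim, hid_dim)   # project decoder hidden state
        self.att_out = nn.Linear(hid_dim, 1)         # attention score per image region
        self.lstm = nn.LSTMCell(emb_dim + enc_dim, hid_dim)
        self.fc = nn.Linear(hid_dim, vocab_size)     # next-word logits

    def forward(self, prev_word, features, h, c):
        # features: (batch, num_pixels, enc_dim); h, c: (batch, hid_dim)
        scores = self.att_out(torch.tanh(
            self.att_enc(features) + self.att_hid(h).unsqueeze(1)))  # (B, P, 1)
        alpha = torch.softmax(scores, dim=1)                         # attention weights
        context = (alpha * features).sum(dim=1)                      # (B, enc_dim)
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.fc(h), h, c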




Figure 3: Overview of our captioning framework



4.1. Input Data Pre-processing
4.1.1. Image Pre-processing
Our image pre-processing consists of two steps. In the first step, we normalize each image with
histogram equalization (smoothing) based on the luminance of the image. In the second step, we
resize all images to 256 × 256 pixels.
   We have tried two ways to normalize the luminance distribution of an image. The first is
histogram equalization, which smooths the luminance distribution of the entire image; this
enhances the contrast of the image and makes it clearer. The second is adaptive histogram
equalization, which applies the same equalization to small regions of the image; in general, this
technique reduces the occurrence of tone jumps. A comparison of the raw images and the
pre-processed images is shown in Figure 4.
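   As an illustration, the following is a minimal sketch of the two normalization variants (global
histogram equalization and its adaptive variant, CLAHE) followed by the resize to 256 × 256,
implemented with OpenCV; the clip limit and tile size are illustrative assumptions, not the exact
settings of our experiments.

import cv2

def preprocess_image(path, adaptive=False):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # most dataset images are grayscale
    if adaptive:
        # adaptive histogram equalization applied to small tiles (CLAHE)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        img = clahe.apply(img)
    else:
        # global histogram equalization over the whole image
        img = cv2.equalizeHist(img)
    return cv2.resize(img, (256, 256))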




Figure 4: Raw images and normalized images



4.1.2. Text Pre-processing
We pre-process the text by lowercasing the captions of the training data and removing their
sentence-ending periods. In general, the MS-COCO captioning task is not case-sensitive, and it is
well known that it is better to remove symbols such as periods. If a caption consists of multiple
sentences, only the period at the end of the last sentence is removed; the remaining periods are
kept and treated as words of the sentence, since they occur only inside the caption.
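   The following is a minimal sketch of this caption pre-processing (lowercasing and stripping only
the final sentence-ending period, so that inner periods remain as tokens); it reflects the description
above rather than the exact script we used.

def preprocess_caption(caption: str) -> str:
    caption = caption.strip().lower()        # captions are treated case-insensitively
    if caption.endswith("."):
        caption = caption[:-1].rstrip()      # drop only the final period
    return caption

# Example: inner periods are kept, the trailing one is removed.
print(preprocess_caption("Axial CT image. There is a mass in the right lobe."))
# -> "axial ct image. there is a mass in the right lobe"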

4.2. Caption Data Expanding using EDA
We tried EDA (Easy Data Augmentation) [15] to expand our text dataset. EDA is a set of simple
data augmentation techniques for text classification tasks in natural language processing, and is
known to work well when the dataset is small. In a typical captioning task using MS-COCO, five
captions are provided for each image. However, in the ImageCLEF 2021 dataset, only one caption
per image is provided. We have therefore tested the effectiveness of this approach using the
various data expansion methods in EDA, as sketched below.
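   The sketch below shows two of the EDA operations (random swap and random deletion) applied
to a caption; synonym replacement would additionally require a synonym resource such as WordNet.
The probabilities and the example caption are illustrative assumptions.

import random

def random_swap(words, n=1):
    # swap two randomly chosen word positions, n times
    words = words[:]
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    # delete each word with probability p, keeping at least one word
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

caption = "axial ct demonstrates a mass within the right lobe".split()
augmented = [" ".join(random_swap(caption)), " ".join(random_deletion(caption))]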

4.3. Neural network model
As the base neural network model for caption generation, we have adopted the “Show, Attend and
Tell” model [10]. This model is capable of highly accurate captioning without using object
detection. Our architecture is almost the same as the original, but differs in that we employ
ResNet-101 [5] instead of VGG16 [4] as the CNN encoder.
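   The sketch below illustrates this encoder choice: the ResNet-101 backbone from torchvision with
its average-pooling and classification layers removed, yielding a 2048-channel spatial feature map
for the attention decoder to attend over (illustrative code under our assumptions, not the exact
implementation).

import torch
import torchvision

resnet = torchvision.models.resnet101(pretrained=True)
encoder = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc

images = torch.randn(4, 3, 256, 256)             # a batch of pre-processed images
features = encoder(images)                       # (4, 2048, 8, 8)
features = features.flatten(2).permute(0, 2, 1)  # (4, 64, 2048): regions x channels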


5. Experiments and results
5.1. Setting up hyper-parameters and performing pre-processing with
     validation data
We experimented with hyper-parameter tuning and image pre-processing using the training
and validation data. As noted in Section 4.1, all characters in the training caption data are lowercased.
   We set up the hyper-parameters as follows: a batch size of 32, “Adam” as the optimization
function with a decoder learning rate of 0.001, and 200 epochs. For the implementation, we employ
PyTorch 1.7.1 [16] as our deep learning framework. For the evaluation of captioning, we use
BLEU-4 [17]. Table 2 shows the results, in which we compare the data pre-processing methods in
terms of BLEU.
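   As an example of the metric, BLEU-4 for a set of generated captions can be computed as below
with NLTK; the tokens shown are made-up, and the official task evaluation may differ in
tokenization and smoothing details.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["axial", "ct", "demonstrates", "a", "right", "renal", "mass"]]]
hypotheses = [["axial", "ct", "image", "demonstrates", "a", "mass"]]
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")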

Table 2
Validation BLEU-4 for the image pre-processing methods ("val" in the table means validation).
                     Model                  Pre-processing               val BLEU-4
                                                 None                       0.432
                  Xu et al. [10]       Histogram Normalization              0.437
                                   Adaptive Histogram Normalization         0.436



5.2. The results with test data
The test dataset consists of the test images distributed as described in Section 3, i.e. 444 medical
images without ground-truth captions. In contrast to the text pre-processing in Section 5.1, the
captions used in training were not only lowercased but also had their sentence-ending periods
removed.
   Table 3 shows the BLEU results for the test data. In the experiments on the test data, the
BLEU score was the highest when Histogram Normalization was used. Examples of our
seemingly successful caption generation results are shown in Fig. 5.
   Table 4 shows the BLEU scores for the EDA attempts. The pre-processing of the dataset uses
the method that achieved the highest BLEU score in Table 3. Using EDA’s synonym substitution
and other methods, we compare the cases of adding one, two, and four captions per image.
In all cases where data expansion was performed using EDA, the BLEU score dropped.
Table 3
Results of the image pre-processing experiments on the test data ("val" in the table means validation).
                Model              Image Pre-processing             val BLEU     test BLEU
                                           None                       0.436         0.332
            Xu et al. [10]      Histogram Normalization               0.451        0.362
                             Adaptive Histogram Normalization         0.443         0.352




Figure 5: Example of generated caption


Table 4
The results of using EDA to extend the training data.
              Image Pre-processing       Captions Added by EDA       val BLEU    test BLEU
            Histogram Normalization              none                  0.451        0.362
            Histogram Normalization          one caption               0.417        0.339
            Histogram Normalization          two captions              0.397        0.291
            Histogram Normalization          four captions             0.384           -


  The results of the submissions of the participants with the highest BLEU values are shown in
Table 5. Our submission ranked 4th among the participants.


Table 5
The best participants’ runs submitted for the Caption Prediction task
                                   Group Name           Rank    BLEU
                                    IALab_PLC            1      0.510
                                AUEB_NLP_GROUP           2      0.461
                                  AEHRC-CSIRO            3      0.432
                                       kdelab            4      0.362
                               jeanbenoit_delbrouck      5      0.285
                                     ImageSem            6      0.257
                                      RomiBed            7      0.243
                                  ayushnanda14           8      0.103


6. Conclusions
We have described the system we submitted to the ImageCLEF 2021 Caption Prediction task.
In our system, we performed our own data pre-processing and attempted data augmentation
with EDA. In addition, two types of luminance smoothing and sentence-ending period removal
were applied as image and text pre-processing, respectively. The results demonstrate that these
pre-processing steps improved the caption prediction accuracy of the neural network model,
whereas EDA turned out to be ineffective in this task. Finally, according to the organizers’
evaluation, we achieved a BLEU score of 0.362 in the ImageCLEF 2021 Caption Prediction task,
placing us 4th.


Acknowledgment
A part of this research was carried out with the support of a Grant for Education and Research at
Toyohashi University of Technology.


References
 [1] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan,
     P. Dollár, C. L. Zitnick, Microsoft COCO: common objects in context, CoRR abs/1405.0312
     (2014). URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.
 [2] B. Ionescu, H. Müller, R. Péteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A.
     Hasan, S. Kozlovski, V. Liauchuk, Y. Dicente, V. Kovalev, O. Pelka, A. G. S. de Herrera,
     J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D.
     Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid,
     A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval
     in medical, nature, internet and social media applications, in: Experimental IR Meets
     Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International
     Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer
     Science, Springer, Bucharest, Romania, 2021.
 [3] O. Pelka, A. Ben Abacha, A. García Seco de Herrera, J. Jacutprakart, C. M. Friedrich,
     H. Müller, Overview of the ImageCLEFmed 2021 concept & caption prediction task, in:
     Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the
     12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes
     in Computer Science, Springer, Bucharest, Romania, 2021.
 [4] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image
     Recognition, CoRR abs/1409.1556 (2015).
 [5] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: 2016
     IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
     doi:10.1109/CVPR.2016.90.
 [6] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to Sequence Learning with Neural Networks,
     in: Proceedings of the 27th International Conference on Neural Information Processing
     Systems - Volume 2, NIPS’14, MIT Press, Cambridge, MA, USA, 2014, pp. 3104–3112.
 [7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
     I. Polosukhin, Attention Is All You Need, CoRR abs/1706.03762 (2017). URL: http:
     //arxiv.org/abs/1706.03762. arXiv:1706.03762.
 [8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional
     Transformers for Language Understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.
     org/abs/1810.04805. arXiv:1810.04805.
 [9] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator,
     in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp.
     3156–3164. doi:10.1109/CVPR.2015.7298935.
[10] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio,
     Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, CoRR
     abs/1502.03044 (2015). URL: http://arxiv.org/abs/1502.03044. arXiv:1502.03044.
[11] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and
     top-down attention for image captioning and VQA, CoRR abs/1707.07998 (2017). URL:
     http://arxiv.org/abs/1707.07998. arXiv:1707.07998.
[12] A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, H. Müller, Overview
     of the VQA-Med Task at ImageCLEF 2021: Visual Question Answering and Generation
     in the Medical Domain, in: CLEF 2021 Working Notes, CEUR Workshop Proceedings,
     CEUR-WS.org, Bucharest, Romania, 2021.
[13] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (1997)
     1735–1780.
[14] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with
     region proposal networks, CoRR abs/1506.01497 (2015). URL: http://arxiv.org/abs/1506.
     01497. arXiv:1506.01497.
[15] J. W. Wei, K. Zou, EDA: easy data augmentation techniques for boosting performance on
     text classification tasks, CoRR abs/1901.11196 (2019). URL: http://arxiv.org/abs/1901.11196.
     arXiv:1901.11196.
[16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
     N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Te-
     jani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An impera-
     tive style, high-performance deep learning library, CoRR abs/1912.01703 (2019). URL:
     http://arxiv.org/abs/1912.01703. arXiv:1912.01703.
[17] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation
     of machine translation, in: Proceedings of the 40th Annual Meeting of the Association
     for Computational Linguistics, Association for Computational Linguistics, Philadelphia,
     Pennsylvania, USA, 2002, pp. 311–318. URL: https://www.aclweb.org/anthology/P02-1040.
     doi:10.3115/1073083.1073135.