<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2021 Working Notes, CEUR Workshop Proceedings, Bucharest, Romania</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Kdelab at ImageCLEF 2021: Medical Caption Prediction with Effective Data Pre-processing and Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riku Tsuneda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetsuya Asakawa</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Masaki Aono</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Toyohashi University of Technology</institution>
          ,
          <addr-line>Aichi</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>The ImageCLEF 2021 Caption Prediction Task is an example of a challenging research problem in the field of image captioning. The goal of this research is to automatically generate accurate captions describing a given medical image. We describe our approach to captioning medical images and illustrate the text and image pre-processing that is effective for the task dataset. In this paper, we apply sentence-ending period removal as text pre-processing and histogram normalization of luminance as image pre-processing. Furthermore, we present the effectiveness of our text data augmentation approach. The submission of our kdelab team on the task test dataset achieved a BLEU score of 0.362.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Captioning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Medical Images</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, multimodal processing of images and natural language has attracted much
attention in the field of machine learning. Image Captioning is one of these representative tasks,
which aims at generating proper captions for input images. As captioning accuracy improves, computers
are expected not only to detect objects in images, but also to understand the
relationships and behaviors between those objects.</p>
      <p>Image captioning is also effective in the medical field. For example, interpreting and
summarizing possible disease symptoms from a large number of radiology images (e.g., X-ray images
and CT images) is a time-consuming task that only highly knowledgeable
specialists can perform. If computers could understand medical images and generate accurate captions, it
would help alleviate the world’s growing shortage of medical doctors. However, there is still the
bottleneck problem that few physicians are able to give accurate annotations.</p>
      <p>In this paper, we describe our approach to the general Image Captioning task in the medical domain,
as illustrated in Fig. 1 (right).</p>
      <p>
        The nature of medical images is quite different from that of general images such as MS-COCO [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
in many aspects.
      </p>
      <p>
        In the following, we first describe related work on Image Captioning task and Medical
Image Captioning in Section 2, followed by the description of the dataset provided for
ImageCLEF2021 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] Medical Image Captioning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] task in Section 3. In Section 4, we describe the details
of the method we have applied, and then the experiments we have conducted in Section 5.
We finally conclude this paper in Section 6.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In the field of image recognition, convolutional neural networks (CNN), including VGG [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and
ResNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], have been widely used. In the field of natural language processing for text
understanding, encoder-decoder models (seq2seq) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have been the mainstream, but in recent years
Transformers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] such as BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have become common. The Image Captioning task is a
fusion of image recognition and sentence generation, and lies at the intersection of these two fields.
      </p>
      <p>
        For example, Oriol Vinyals et al. proposed caption generation using an encoder-decoder
model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and Kelvin Xu et al. proposed Show, Attend and Tell, which adds visual attention
to the encoder-decoder model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Recently, P. Anderson et al. presented a model using
Bottom-Up Attention obtained by pre-training a Faster R-CNN for object detection [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        In addition, this is the first time that a Caption Prediction Task of this kind has been held at an
ImageCLEF conference. However, a similar task, the VQA-Med task [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], has been contested at
ImageCLEF 2018, 2019, and 2020.
      </p>
      <p>For the ImageCLEF 2021 Medical Caption Prediction task, the organizers have provided a
training set of 2,756 radiology images with the same number of captions, a validation set of 500
radiology images with the same number of captions, and a test set of 444 radiology images with
the same number of captions. These serve as our datasets. Most of the images
in the dataset are non-colored, and they potentially include non-essential logos and text. The
task participants have to generate captions automatically from the radiology image data.</p>
      <p>According to our analysis, the top word frequencies were dominated by prepositions and
by words such as right and left that indicate position. The word cloud of case-insensitive words
and the 14 most frequent words are summarized in Figure 2 and Table 1, respectively.</p>
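      <p>A minimal sketch of how such a case-insensitive word-frequency count can be produced is shown below; the caption file name and its tab-separated layout are illustrative assumptions, not the actual distribution format.</p>
      <preformat>
# Minimal sketch: case-insensitive word frequencies over the training captions.
# The file name "train_captions.tsv" and its "image_id TAB caption" layout are
# assumptions for illustration only.
import csv
import re
from collections import Counter

counter = Counter()
with open("train_captions.tsv", newline="", encoding="utf-8") as f:
    for image_id, caption in csv.reader(f, delimiter="\t"):
        # Lowercase the caption and split it into alphabetic words.
        counter.update(re.findall(r"[a-z]+", caption.lower()))

# Print the 14 most frequent words, as in Table 1.
for word, freq in counter.most_common(14):
    print(word, freq)
      </preformat>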
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>The overview of our Medical Image Captioning methodology is divided into three main parts,
as shown in Figure 3.</p>
      <p>
        The first is the image and text pre-processing. As preliminaries, we propose a method for
pre-processing the images and text in the dataset. The second is the encoder part. In the encoder
part, the features of the image are extracted. The third is the decoder part. In the decoder part,
words are predicted recursively using LSTM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and an attention mechanism.
      </p>
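      <p>As an illustration, a single decoding step of such an attention-based LSTM decoder can be sketched in PyTorch as follows; the dimensions and module layout are assumptions for this sketch, not our exact implementation.</p>
      <preformat>
# Sketch of one decoding step: additive attention over the CNN feature map,
# followed by one LSTM step and a word prediction. Dimensions are illustrative.
import torch
import torch.nn as nn


class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, features, h, c):
        # features: (batch, num_pixels, feat_dim) spatial feature map from the encoder.
        scores = self.att_out(torch.tanh(
            self.att_feat(features) + self.att_hid(h).unsqueeze(1))).squeeze(2)
        alpha = torch.softmax(scores, dim=1)                   # attention weights over locations
        context = (features * alpha.unsqueeze(2)).sum(dim=1)   # attended image context
        # One recursive LSTM step conditioned on the previous word and the context.
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.fc(h), alpha, h, c                         # scores for the next word
      </preformat>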
      <p>
        We have adopted Show, Attend and Tell as the base model. This model is known to have
high accuracy among Image Captioning models that do not use object detection, such as Faster
R-CNN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>4.1. Input Data Pre-processing</title>
        <sec id="sec-3-1-1">
          <title>4.1.1. Image Pre-processing</title>
          <p>Image pre-processing includes image normalization.</p>
          <p>The image processing consists of two steps. In the first step, we normalize images using
histogram equalization based on the luminance of the image. In the second step, we resize all
images to a size of 256 × 256.</p>
          <p>We have tried two ways to normalize the luminance distribution of an image. The first is
global histogram equalization, which smooths the luminance distribution
of the entire image. After equalization, the contrast of the image is enhanced and the image
becomes clearer. The second is adaptive histogram equalization. This method performs the
equalization described above on small regions of the image. In general, this
technique can reduce the occurrence of tone jumps. A comparison of the raw image and the
pre-processed image is shown in Figure 4.</p>
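          <p>A minimal sketch of these two normalization variants, assuming OpenCV and an illustrative file name, is given below.</p>
          <preformat>
# Sketch of the image pre-processing: global and adaptive histogram equalization
# of the luminance, followed by resizing to 256 x 256. The file name and the
# CLAHE parameters are illustrative assumptions.
import cv2

img = cv2.imread("example_radiology_image.png", cv2.IMREAD_GRAYSCALE)

# Variant 1: global histogram equalization over the whole image.
global_eq = cv2.equalizeHist(img)

# Variant 2: adaptive histogram equalization applied to small tiles of the image.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
adaptive_eq = clahe.apply(img)

# Both variants are resized to 256 x 256 before being fed to the encoder.
resized = cv2.resize(global_eq, (256, 256))
          </preformat>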
        </sec>
        <sec id="sec-3-1-2">
          <title>4.1.2. Text Pre-processing</title>
          <p>We preprocess the text by lowercasing the captions of the training
data and removing their sentence-ending periods. In general, the MS-COCO captioning task is not case-sensitive, and it is well known that
symbols such as periods are better removed. If a single image has multiple caption sentences,
only the period at the end of the last sentence is removed. As a result, any remaining period is
recognized as one of the words in the sentence, since it then appears only inside the caption.</p>
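          <p>A minimal sketch of this caption pre-processing is shown below; the example caption is illustrative.</p>
          <preformat>
# Sketch of the text pre-processing: lowercase the caption and remove only the
# period at the end of the sentence.
def preprocess_caption(caption):
    caption = caption.strip().lower()
    if caption.endswith("."):
        caption = caption[:-1]  # drop the sentence-ending period, keep internal ones
    return caption

print(preprocess_caption("Chest X-ray shows no acute abnormality."))
# chest x-ray shows no acute abnormality
          </preformat>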
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Caption Data Expanding using EDA</title>
        <p>
          We tried EDA (Easy Data Augmentation) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] to expand our text dataset. EDA is a
data augmentation technique for text classification tasks in natural language processing, and it is an effective method that works
well when the dataset is small. In a typical captioning task using MS-COCO, five captions are
provided for one image. However, in the ImageCLEF2021 dataset, only one caption per image
is provided. We have tested the effectiveness of this approach using various data expansion
methods in EDA.
        </p>
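        <p>A minimal sketch of EDA-style synonym replacement, assuming NLTK WordNet and an illustrative caption, is shown below; it is not the exact augmentation pipeline used in our experiments.</p>
        <preformat>
# Sketch of EDA synonym replacement: replace up to n words in a caption with
# WordNet synonyms to create an additional caption.
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet") beforehand


def synonym_replacement(caption, n=1):
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
    return " ".join(words)


# Each original caption can be expanded into one, two, or four augmented copies.
augmented = [synonym_replacement("axial ct image of the abdomen") for _ in range(2)]
        </preformat>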
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Neural network model</title>
        <p>
          As a base neural network model for caption generation, we have adopted ”Show, Attend and
Tell” model [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. This model is capable of highly accurate captioning without using object
detection. The architecture of the two models is almost the same, but ours differs in that we
employ ResNet-101 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] instead of VGG16 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] as the CNN encoder.
        </p>
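        <p>A minimal sketch of such an encoder, assuming torchvision and 256 × 256 input images, is given below.</p>
        <preformat>
# Sketch of the CNN encoder: a ResNet-101 with its classification head removed,
# so that it returns a spatial feature map for the attention-based decoder.
import torch.nn as nn
import torchvision


class ResNetEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet101(pretrained=True)
        # Drop the final average-pooling and fully connected layers.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):
        # images: (batch, 3, 256, 256); the backbone yields (batch, 2048, 8, 8).
        feats = self.backbone(images)
        # Flatten to (batch, 64, 2048): one 2048-d vector per spatial location.
        return feats.flatten(2).permute(0, 2, 1)
        </preformat>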
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments and results</title>
      <sec id="sec-4-1">
        <title>5.1. Setting up hyper-parameters and performing pre-processing with validation data</title>
        <p>We experimented with hyper-parameter adjustment and image pre-processing using the training
and validation data. As noted in Section 4.1, all characters in the training caption data are lowercased.</p>
        <p>
          We set up the hyper-parameters as follows: a batch size of 32, the “Adam” optimizer
with a decoder learning rate of 0.001, and 200 epochs. For
the implementation, we employ PyTorch 1.7.1 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] as our deep learning framework. For the
evaluation of captioning, we utilize BLEU-4 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Table 2 shows the results, comparing the
data pre-processing variants in terms of BLEU.
        </p>
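        <p>A minimal sketch of these settings, with a placeholder decoder module and illustrative caption lists, is shown below.</p>
        <preformat>
# Sketch of the training and evaluation settings: batch size 32, Adam with a
# decoder learning rate of 0.001, 200 epochs, and BLEU-4 for evaluation.
# The decoder and the caption lists here are stand-ins, not the actual data.
import torch
import torch.nn as nn
from nltk.translate.bleu_score import corpus_bleu

BATCH_SIZE = 32
EPOCHS = 200

decoder = nn.LSTMCell(512, 512)  # placeholder for the attention-based LSTM decoder
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

# BLEU-4: each image in this dataset has a single reference caption.
reference_captions = ["axial ct image of the abdomen"]
generated_captions = ["axial ct of the abdomen"]
references = [[ref.split()] for ref in reference_captions]
hypotheses = [hyp.split() for hyp in generated_captions]
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(bleu4)
        </preformat>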
      </sec>
      <sec id="sec-4-2">
        <title>5.2. The results with test data</title>
        <p>The test dataset consists of the test images distributed as described in Section 4.1. The test set
consists of 444 medical images without the correct-answer captions. In contrast to the text
pre-processing in Section 5.1, the captions used for training were all lowercased and the periods
at the end of sentences were deleted.</p>
        <p>Table 3 shows the BLEU results for the test data. In the experiments on the test data, the
BLEU score was highest when Histogram Normalization was used. Examples of our
seemingly successful caption generation results are shown in Fig. 5.</p>
        <p>Table 4 shows the BLEU scores for the EDA attempts. The pre-processing of the dataset uses
the method that achieved the highest BLEU score in Table 3. Using EDA’s synonym replacement
and other methods, we compare the cases of adding one, two, and four captions.
In all cases where data expansion was performed using EDA, the BLEU score dropped.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>We have described the system we submitted to the ImageCLEF 2021 Caption
Prediction task. In our system, we have performed our own data pre-processing and have attempted to
add data augmentation with EDA. In addition, two types of luminance histogram equalization and period
removal were applied as image and text pre-processing, respectively. The results demonstrate that these
processes have improved the caption prediction accuracy of the neural network model. EDA
turned out to be ineffective in this task. Finally, according to the organizers’ evaluation, we achieved a
BLEU score of 0.362 in the ImageCLEF 2021 Caption Prediction task, placing us 4th.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>A part of this research was carried out with the support of a Grant for Education and Research at
Toyohashi University of Technology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, CoRR abs/1405.0312 (2014). URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] B. Ionescu, H. Müller, R. Péteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, S. Kozlovski, V. Liauchuk, Y. Dicente, V. Kovalev, O. Pelka, A. G. S. de Herrera, J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D. Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid, A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer, Bucharest, Romania, 2021.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] O. Pelka, A. Ben Abacha, A. García Seco de Herrera, J. Jacutprakart, C. M. Friedrich, H. Müller, Overview of the ImageCLEFmed 2021 concept &amp; caption prediction task, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer, Bucharest, Romania, 2021.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556 (2015).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778. doi:10.1109/CVPR.2016.90.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, MIT Press, Cambridge, MA, USA, 2014, pp. 3104-3112.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017). URL: http://arxiv.org/abs/1706.03762. arXiv:1706.03762.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3156-3164. doi:10.1109/CVPR.2015.7298935.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, CoRR abs/1502.03044 (2015). URL: http://arxiv.org/abs/1502.03044. arXiv:1502.03044.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and VQA, CoRR abs/1707.07998 (2017). URL: http://arxiv.org/abs/1707.07998. arXiv:1707.07998.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, H. Müller, Overview of the VQA-Med task at ImageCLEF 2021: Visual question answering and generation in the medical domain, in: CLEF 2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest, Romania, 2021.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735-1780.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, CoRR abs/1506.01497 (2015). URL: http://arxiv.org/abs/1506.01497. arXiv:1506.01497.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. W. Wei, K. Zou, EDA: Easy data augmentation techniques for boosting performance on text classification tasks, CoRR abs/1901.11196 (2019). URL: http://arxiv.org/abs/1901.11196. arXiv:1901.11196.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, CoRR abs/1912.01703 (2019). URL: http://arxiv.org/abs/1912.01703. arXiv:1912.01703.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] K. Papineni, S. Roukos, T. Ward, W. Zhu, Bleu: A method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311-318. URL: https://www.aclweb.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>