<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural Captioning for the ImageCLEF 2017 Medical Image Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Lyndon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashnil Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jinman Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Technologies, University of Sydney</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Manual image annotation is a major bottleneck in the processing of medical images and the accuracy of these reports varies depending on the clinician's expertise. Automating some or all of this process would have an enormous impact in terms of efficiency, cost and accuracy. Previous approaches to automatically generating captions from images have relied on hand-crafted pipelines of feature extraction and techniques such as templating and nearest neighbour sentence retrieval to assemble likely sentences. Recent deep learning-based approaches to general image captioning use fully differentiable models to learn how to generate captions directly from images. In this paper, we address the challenge of end-to-end medical image captioning by pairing an image-encoding convolutional neural network (CNN) with a language-generating recurrent neural network (RNN). Our method is an adaptation of the NICv2 model that has shown state-of-the-art results in general image captioning. Using only data provided in the training dataset, we were able to attain a BLEU score of 0.0982 on the ImageCLEF 2017 Caption Prediction Challenge and an average F1 score of 0.0958 on the Concept Detection Challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Image Captioning</kwd>
        <kwd>LSTM</kwd>
        <kwd>CNN</kwd>
        <kwd>RNN</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
Generating a textual summary of the insights gleaned from a medical image is
a routine, yet nonetheless time-consuming, task requiring much human effort on
the part of highly trained clinicians. Prior efforts to automate this task relied
on hand-crafted pipelines, employing manually designed feature extraction and
techniques such as templating and sentence retrieval to assemble likely sentences
[
        <xref ref-type="bibr" rid="ref13 ref14 ref9">9, 13, 14</xref>
]. Recent deep learning-based approaches to general image captioning,
however, use fully differentiable models to learn how to generate captions directly
from images. In general, the advantage of such fully learnable models is that any
part of the model can adapt in the manner most useful for the problem at hand,
whereas a hand-designed system is constrained by the assumptions made during
feature extraction, concept detection, and sentence generation.
      </p>
      <p>
In this paper we describe the submission of the University of Sydney's
Biomedical Engineering &amp; Technology (BMET) group to the caption prediction and
concept detection tasks of the ImageCLEF 2017 caption challenge [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
]. This
submission employs a fully differentiable model, pairing an image-encoding CNN
with a language-generating RNN to generate captions for images from a range
of modalities.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
Image captioning, whereby the contents of an image are automatically described
in natural language, is a challenging task in machine learning, requiring methods
from both image and natural language processing. Many early approaches to
this problem involved complex systems comprising visual feature extractors
and rule-based methods for sentence generation. Li et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
] utilise image
feature similarity measures to locate likely n-grams from a large corpus of images
and text, then use a simple sentence template and local search to generate a
caption. Yao et al. [22] extract image features such as SIFT and edges to match
images to a concept database, then apply a graph-based ontology to these
concepts to produce readable sentences. Ordonez et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] use image features and a
ranking-based approach to locate likely sentences in an extremely large database
of images and text. Such methods require a great deal of hand-crafted
optimisation and produce systems which are brittle and limited to specialised domains.
      </p>
      <p>
        Recently, deep learning-based encoder-decoder frameworks for machine
translation [16] have been adapted and applied to the problem of image captioning. By
replacing the language-encoding Long Short-Term Memory (LSTM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] RNN
with an image-encoding CNN, the model is able to learn to generate captions
directly from images. The entire model is completely differentiable, so errors
are propagated to the different components in proportion to their contribution to
the error, allowing them to adapt appropriately. While there were several
precursors that replaced various components of existing image-to-caption frameworks
with trainable RNNs or CNNs, Vinyals et al. [19] proposed the first end-to-end
neural network based approach to captioning with their "Show and Tell" (also
called Neural Image Captioning (NIC)) model. An updated method, NICv2 [20],
won the Microsoft Common Objects in Context (MSCOCO) challenge in 2015.
Qualitative analysis has shown that neural captioning methods are preferred in
comparison with conventional nearest-neighbour sentence lookup approaches [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        There has been limited work in adapting such methods to the medical
domain, despite the large volume of image and text data found in Picture
Archiving and Communication Systems (PACS). Schlegl et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
] present the first such work, leveraging text reports to improve the
classification accuracy of a CNN applied to Optical Coherence Tomography (OCT)
images. Syeda-Mahmood et al. [17] present a method that uses hand-coded topic
extraction, hand-coded image features and an SVM-based correlation system. Shin
et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
] document efforts to mine an extremely large database of images and
text extracted from the PACS of the National Institutes of Health Clinical
Center (approximately 216 thousand images), using latent Dirichlet allocation (LDA)
to extract topics from the raw text and then correlate these topics to image
features. Kisilev et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
] propose an SVM-based approach to highlight regions of
interest (ROIs) and generate template-based captions for the Digital Database
for Screening Mammography (DDSM). This is extended using a multi-task loss
CNN in a later work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>To the best of our knowledge, only one published work exists for applying
neural image captioning to a medical dataset [15]. In this work the authors
employ an architecture similar to Vinyals et al. [19] to generate an array of
keywords for a radiological dataset.
</p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
<p>Unless otherwise specified, the same method was applied for both the caption
prediction and concept detection tasks. The set of concepts assigned to an image
in the concept detection task is considered to be a caption where each concept
label is a word in the sentence. In both cases only the supplied training dataset
was used to train the models.</p>
      <sec id="sec-3-1">
        <title>Preprocessing</title>
        <p>
In order to simplify the task, each caption in the training set was preprocessed in
accordance with the task's evaluation preprocessing specifications. This involved
converting the caption to lower case, removing all punctuation (some captions
contained multiple sentences; after this step each caption became a
single sentence), removing stopwords using the NLTK [
          <xref ref-type="bibr" rid="ref1">1</xref>
] English stopword list, and
finally applying stemming using NLTK's Snowball stemmer. No preprocessing
was applied to the 'sentences' for the concept detection task. After this
preprocessing, the count of each unique word in the training corpus was taken. Words
that appeared fewer than 4 times were discarded, resulting in a dictionary
of 25237 distinct words. For the RNN framework described below, two reserved
words indicating the start and end of sentences were added to the dictionary and
used to prepend and append each sentence.
        </p>
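        <p>A minimal sketch, in Python, of this caption preprocessing and vocabulary construction is given below; the function names and the reserved start/end token strings are illustrative rather than drawn from our implementation, and NLTK's stopword corpus is assumed to be installed.</p>
        <preformat>
import string
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_caption(caption):
    # Lower-case and strip punctuation (collapsing multi-sentence captions
    # into a single sentence), drop stopwords, then stem what remains.
    words = caption.lower().translate(PUNCT_TABLE).split()
    return [STEMMER.stem(w) for w in words if w not in STOPWORDS]

def build_vocabulary(captions, min_count=4):
    # Count each unique word, discard those seen fewer than min_count
    # times, and reserve start/end-of-sentence tokens (illustrative names).
    counts = Counter(w for c in captions for w in preprocess_caption(c))
    words = ["SENT_START", "SENT_END"]
    words += [w for w, n in counts.items() if n >= min_count]
    return {word: index for index, word in enumerate(words)}
        </preformat>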
        <p>
The images are first resized to 324x324px, and a 299x299px crop is then
selected. During training this is a random crop, but during evaluation a central
crop is used. We apply image augmentation during training to regularise the
model [
          <xref ref-type="bibr" rid="ref10">10</xref>
]. This augmentation consists of distorting the image, first by randomly
flipping it horizontally and then randomly adjusting the brightness, saturation, hue
and contrast. The random cropping and distortion are performed each time an
image is passed into the model, meaning that it is extremely rare for exactly
the same image to be seen twice.
        </p>
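        <p>A minimal sketch of this resize, crop and distortion pipeline using TensorFlow 1.x tf.image operations follows; the distortion magnitudes are illustrative, as the exact thresholds are not specified above.</p>
        <preformat>
import tensorflow as tf

def preprocess_image(image, training=True):
    # Resize to 324x324, then take a 299x299 crop: random while training,
    # central at evaluation time.
    image = tf.image.resize_images(image, [324, 324])
    if training:
        image = tf.random_crop(image, [299, 299, 3])
        # Distortion: random horizontal flip, then random brightness,
        # saturation, hue and contrast (magnitudes are illustrative).
        image = tf.image.random_flip_left_right(image)
        image = tf.image.random_brightness(image, max_delta=32.0 / 255.0)
        image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
        image = tf.image.random_hue(image, max_delta=0.032)
        image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    else:
        image = tf.image.resize_image_with_crop_or_pad(image, 299, 299)
    return image
        </preformat>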
<p>A validation set was provided by the organisers of the task and was entirely
reserved for validation. No part of it was used for training; in particular, we
did not use it to build the dictionary of unique words.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Model</title>
        <p>
Our method extends Vinyals et al.'s NICv2 model [20], as per the TensorFlow-Slim
implementation available at https://github.com/tensorflow/models/tree/master/im2txt.
The NICv2 model consists of two different types of neural networks paired together
to form an image-to-language encoder-decoder pair. A CNN, specifically the
InceptionV3 [18] architecture, is used as the image encoder. InceptionV3 is one of the
most accurate architectures for general image classification according to the ImageNet [
          <xref ref-type="bibr" rid="ref2">2</xref>
] benchmark, but is significantly more computationally efficient than alternatives such
as Residual Networks [
          <xref ref-type="bibr" rid="ref5">5</xref>
]. We utilised an RNN based on LSTM units as the
language decoder, as per the original paper; however, we doubled the number of
units from 512 to 1024 as this showed improved results in our experiments.
        </p>
<p>An image is first preprocessed as described above and then fed to the input
of the CNN. The logits of the CNN are passed into a single-layer fully-connected
neural network which functions as an image embedding layer. This image
embedding then becomes the initial state of the LSTM network. As per [20], the
embedding is passed only at the initial state and is not used subsequently. At
each subsequent time step, the LSTM's output is passed to a word embedding
layer and then to a softmax layer; at each time step the output of the softmax
is the probability of each word in the dictionary. For two of the caption
prediction experiments (PRED2 &amp; PRED4) we modified the baseline language
model to use a 3-layer LSTM with a single dropout layer on the output.
Increasing the number of LSTM layers improves the ability of the language model
to represent complex sentences and long-term dependencies; industrial neural
machine translation models have been demonstrated to use decoders with up
to 8 layers [21].</p>
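        <p>The wiring described above can be sketched in TensorFlow 1.x as follows; the dimensions follow the text (1024 embedding units, 1024 LSTM units), while the variable and placeholder names are illustrative rather than those of the actual implementation.</p>
        <preformat>
import tensorflow as tf

vocab_size = 25239       # 25237 words plus the start/end tokens
embedding_size = 1024
num_lstm_units = 1024

image_features = tf.placeholder(tf.float32, [None, 2048])  # CNN logits
input_words = tf.placeholder(tf.int32, [None, None])       # caption word ids

# Single fully-connected layer acting as the image embedding.
image_embedding = tf.layers.dense(image_features, embedding_size)

# Word embedding lookup for the caption words.
word_embedding = tf.get_variable("word_emb", [vocab_size, embedding_size])
word_inputs = tf.nn.embedding_lookup(word_embedding, input_words)

lstm = tf.nn.rnn_cell.LSTMCell(num_lstm_units)

with tf.variable_scope("lstm") as scope:
    # The image embedding is fed once, to set the initial LSTM state,
    # and is not used at any later time step.
    batch = tf.shape(image_embedding)[0]
    _, initial_state = lstm(image_embedding, lstm.zero_state(batch, tf.float32))
    scope.reuse_variables()
    # Unroll over the caption words from that initial state.
    outputs, _ = tf.nn.dynamic_rnn(lstm, word_inputs,
                                   initial_state=initial_state, scope=scope)

# At every time step, a softmax over the dictionary gives the
# probability of each word.
logits = tf.layers.dense(outputs, vocab_size)
word_probabilities = tf.nn.softmax(logits)
        </preformat>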
<p>In all our models we used 1024 units for both the image and word
embedding layers. The CNN was initialised using weights from a model trained on the
ImageNet dataset, while the weights for the LSTM were initialised from a
random uniform distribution with values between -0.08 and 0.08 as per [16]. For the
final caption prediction experiment (PRED4) we attempted to transfer the
updated CNN weights from the DET2 concept detection model. This was
attempted to avoid corruption of the CNN during end-to-end training (discussed in
Sect. 4).</p>
      </sec>
      <sec id="sec-3-3">
        <title>Training</title>
<p>The loss optimised during training is the summed cross entropy of the output of
the softmax compared to the one-hot encoding of the next word in the ground
truth sentence. This loss was minimised with standard Stochastic Gradient
Descent (SGD) using an initial learning rate of 1.0 and a decay procedure that
reduced the learning rate by half every 8 epochs (there were 164541 examples
in the training set and a batch size of 16, so each epoch contains 10284
minibatches). Gradients were clipped to 5.0 for all experiments.</p>
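        <p>A minimal sketch of this objective and optimisation schedule in TensorFlow 1.x is shown below; the placeholders stand in for the decoder outputs and target words, and their names and shapes are illustrative.</p>
        <preformat>
import tensorflow as tf

vocab_size = 25239
targets = tf.placeholder(tf.int32, [None, None])        # next-word ids
mask = tf.placeholder(tf.float32, [None, None])         # 1.0 on real words
lstm_outputs = tf.placeholder(tf.float32, [None, None, 1024])

# Per-step logits over the dictionary (standing in for the decoder output).
logits = tf.layers.dense(lstm_outputs, vocab_size)

# Summed cross entropy against the one-hot encoding of the next word.
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=targets, logits=logits)
loss = tf.reduce_sum(cross_entropy * mask)

# SGD at an initial rate of 1.0, halved every 8 epochs
# (10284 minibatches per epoch).
global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    1.0, global_step, decay_steps=8 * 10284, decay_rate=0.5, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)

# Clip gradients to 5.0 before applying them.
grads_and_vars = optimizer.compute_gradients(loss)
grads, tvars = zip(*grads_and_vars)
grads, _ = tf.clip_by_global_norm(grads, 5.0)
train_op = optimizer.apply_gradients(zip(grads, tvars), global_step=global_step)
        </preformat>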
        <p>[Fig. 1: Overview of the model, showing a CNN vision encoder feeding an LSTM language generator, illustrated with a sample image and its generated (preprocessed) caption: "orthomorpha unk sp n holotyp b right gonopod mesal later view respect cf distal part right gonopod mesal later subor subcaud view respect scale bar 02 mm".]</p>
        <sec id="sec-3-3-1">
          <title>LSTM</title>
        </sec>
        <sec id="sec-3-3-2">
          <title>Language</title>
        </sec>
        <sec id="sec-3-3-3">
          <title>Generator</title>
        </sec>
        <sec id="sec-3-3-4">
          <title>CNN Vision Encoder</title>
        <p>As suggested by Vinyals et al. [19], we use beam search to generate sentences
at inference time. This avoids the non-trivial issue that greedily selecting the
most probable word at each time step may result in a sentence which is itself of
low probability. Ideally we would search the entire space for the most probable
sentence; however, this has an exponential computational cost, as a forward pass
through the entire model must be made for each node of the search tree. Some
search procedure is therefore required to find the most probable sentence given
limited computational resources. Our best results were achieved with a beam size
of 3 and a maximum caption length of 50. The sentence output of the concept
detection task was converted to an ordered set of concept labels.</p>
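        <p>A minimal sketch of this beam search in plain Python follows; the step function, assumed to return candidate next words with their log-probabilities for a partial sentence, stands in for a forward pass through the model.</p>
        <preformat>
import heapq

def beam_search(step, start_id, end_id, beam_size=3, max_len=50):
    beams = [(0.0, [start_id])]     # (log probability, partial sentence)
    complete = []
    for _ in range(max_len):
        candidates = []
        for log_prob, seq in beams:
            if seq[-1] == end_id:   # finished sentences leave the beam
                complete.append((log_prob, seq))
                continue
            # One forward pass per expanded hypothesis: exhaustive search
            # would require exponentially many of these.
            for word_id, word_log_prob in step(seq):
                candidates.append((log_prob + word_log_prob, seq + [word_id]))
        if not candidates:          # every hypothesis has finished
            break
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    else:
        complete.extend(beams)      # hypotheses cut off at max_len
    return max(complete, key=lambda c: c[0])[1]
        </preformat>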
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>Tables 3 &amp; 4 detail the training, validation and test results of the various runs.
Please note that, due to the high cost of inference, the training scores are
estimated based on a random sample of 10000 images from the training set.</p>
      <p>[Examples of actual vs. predicted captions (in preprocessed form): actual
"orthomorpha latiterga sp n holotyp b right gonopod mesal later view respect cf
distal part right gonopod mesal later subor subcaud view respect scale bar 02 mm"
vs. predicted "orthomorpha unk sp n holotyp b right gonopod mesal later view
respect cf distal part right gonopod mesal later subor subcaud view respect scale
bar 02 mm" (BLEU: 0.9304, BLEU1: 0.9629, BLEU2: 0.92, BLEU3: 0.9231, BLEU4:
0.9167); actual "preoper later radiograph right knee" vs. predicted "later radiograph
right knee" (BLEU: 0.7788, BLEU1: 1.0, BLEU2: 1.0, BLEU3: 1.0).]</p>
      <p>We attempted a two-phase training procedure, as suggested by Vinyals et
al. [20], for DET2 and some unsubmitted experiments. In the first phase we froze
the CNN weights and trained only the LSTM and embedding layers. Then, once
the language model had begun to converge, we trained the entire model
end-to-end with a very small learning rate (1e-5). Vinyals et al. suggest this is
necessary as otherwise the CNN will become corrupted and never recover. However,
we found that despite training the LSTM for a very long time in the first phase
and using a very small learning rate in the second phase, we would very quickly
corrupt the CNN, as evidenced by a sharp increase in dead ReLUs and a large
decrease in BLEU score. We found that BLEU scores would eventually return to
those achieved in the first phase of training; however, the dead ReLUs did not
revive. We believe that the underlying issue is that the degree of domain transfer
required to go from general images to medical images is vastly greater than that
required to go from one collection of general images to another (i.e. ImageNet
to MSCOCO).</p>
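      <p>This two-phase schedule can be sketched in TensorFlow 1.x as below, using a toy loss over stand-in variables; the scope names and the loss are illustrative only.</p>
      <preformat>
import tensorflow as tf

# Toy stand-ins: "cnn" weights (frozen in phase one) and "lstm" weights.
with tf.variable_scope("cnn"):
    cnn_w = tf.get_variable("w", [8, 8])
with tf.variable_scope("lstm"):
    lstm_w = tf.get_variable("w", [8, 8])
loss = tf.reduce_sum(tf.square(tf.matmul(cnn_w, lstm_w)))

# Phase one: the CNN is frozen; only LSTM/embedding variables are updated.
lstm_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="lstm")
phase1_op = tf.train.GradientDescentOptimizer(1.0).minimize(
    loss, var_list=lstm_vars)

# Phase two: once the language model has begun to converge, train the
# whole model end-to-end with a very small learning rate.
phase2_op = tf.train.GradientDescentOptimizer(1e-5).minimize(loss)
      </preformat>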
      <p>Based on the small variance between our training and validation scores, we do
not believe that the models were overfitting; however, the large variance between
validation and test scores indicates that there was a large disparity between the
training and validation data and the test data. Given this lack of overfitting, we
could potentially train a much larger vision model for longer and improve the
overall performance.</p>
      <p>Additionally, the fact that we could not successfully train the vision model
without corrupting the network was a major limiting factor in our experiments.</p>
          <p>Future work will investigate the potential of larger language models and
devising a training regime that allows true end-to-end training for medical images.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>The authors are grateful to the NVIDIA Corporation for their donation of the
Titan X GPU used in this research.
15. Shin, H.C., Roberts, K., Lu, L., Demner-Fushman, D., Yao, J., Summers, R.M.:</p>
      <sec id="sec-4-1">
        <title>Learning to read chest x-rays: recurrent neural cascade model for automated im</title>
        <p>age annotation. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. pp. 2497{2506 (2016)
16. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger,</p>
      </sec>
      <sec id="sec-4-2">
        <title>K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 3104{3112.</title>
        <p>Curran Associates, Inc. (2014)
17. Syeda-Mahmood, T., Kumar, R., Compas, C.: Learning the correlation between
images and disease labels using ambiguous learning. In: Medical Image Computing
and Computer-Assisted Intervention { MICCAI 2015. pp. 185{193. Springer, Cham
(Oct 2015)
18. Szegedy, C., Vanhoucke, V., Io e, S., Shlens, J., Wojna, Z.: Rethinking the
inception architecture for computer vision (Dec 2015)
19. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image
caption generator (Nov 2014)
20. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: Lessons learned from
the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach.</p>
        <p>Intell. (Jul 2016)
21. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun,</p>
      </sec>
      <sec id="sec-4-3">
        <title>M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X.,</title>
      </sec>
      <sec id="sec-4-4">
        <title>Kaiser, ., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G.,</title>
      </sec>
      <sec id="sec-4-5">
        <title>Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O.,</title>
      </sec>
      <sec id="sec-4-6">
        <title>Corrado, G., Hughes, M., Dean, J.: Google's neural machine translation system:</title>
      </sec>
      <sec id="sec-4-7">
        <title>Bridging the gap between human and machine translation (Sep 2016)</title>
        <p>22. Yao, B.Z., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2t: Image parsing to text
description. Proc. IEEE 98(8), 1485{1508 (Aug 2010)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>NLTK: The natural language toolkit</article-title>
          .
          <source>In: Proceedings of the COLING/ACL on Interactive Presentation Sessions</source>
          . pp.
          <volume>69</volume>
–
          <fpage>72</fpage>
          . COLING-ACL '
          <fpage>06</fpage>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          .
          <source>In: 2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>248</volume>
–
          <issue>255</issue>
          (Jun
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            , R., Mitchell,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Exploring nearest neighbor approaches for image captioning</article-title>
          (May
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
<surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwall</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , García Seco de Herrera, A., Müller, H.:
          <article-title>Overview of ImageCLEFcaption 2017 - image caption prediction and concept detection for biomedical images</article-title>
          .
          <source>In: CLEF 2017 Labs Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org &lt;http://ceur-ws.org&gt;, Dublin, Ireland (September 11-14
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <article-title>Deep residual learning for image recognition</article-title>
          (
          <year>Dec 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Comput</source>
          .
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <volume>1735</volume>
–1780 (Nov
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
<surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dicente Cid</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
<surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia Seco de Herrera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Islam</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwall</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Overview of ImageCLEF 2017: Information extraction from images</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction 8th International Conference of the CLEF Association, CLEF 2017. Lecture Notes in Computer Science</source>
          , vol.
          <volume>10456</volume>
          . Springer, Dublin, Ireland (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kisilev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sason</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barkan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hashoul</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Medical image description using multi-task-loss CNN</article-title>
          .
          <source>In: Deep Learning and Data Labeling for Medical Applications</source>
          , pp.
          <volume>121</volume>
–
          <fpage>129</fpage>
          . Springer, Cham (Oct
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kisilev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walach</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hashoul</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barkan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ophir</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alpert</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
<article-title>Semantic description of medical image findings: structured learning approach</article-title>
          .
<source>In: Proceedings of the British Machine Vision Conference</source>
          <year>2015</year>
          . pp.
          <volume>171</volume>
          .
          <issue>1</issue>
–
          <fpage>171</fpage>
          .11.
          <string-name>
            <surname>British Machine Vision Association</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lyndon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fulham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
<article-title>An ensemble of Fine-Tuned convolutional neural networks for medical image classification</article-title>
          .
          <source>IEEE journal of biomedical and health informatics 21(1)</source>
          ,
          <volume>31</volume>
–40 (Jan
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulkarni</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Composing simple image descriptions using web-scale n-grams</article-title>
          .
          <source>In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning</source>
          . pp.
          <volume>220</volume>
–
          <fpage>228</fpage>
          . CoNLL '11,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ordonez</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulkarni</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          :
          <article-title>Im2Text: Describing images using 1 million captioned photographs</article-title>
          . In:
          <string-name>
            <surname>Shawe-Taylor</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>P.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.Q</given-names>
          </string-name>
          . (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>24</volume>
          , pp.
          <volume>1143</volume>
–
          <fpage>1151</fpage>
          . Curran Associates, Inc. (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Schlegl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waldstein</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vogl</surname>
            ,
            <given-names>W.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt-Erfurth</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langs</surname>
          </string-name>
          , G.:
          <article-title>Predicting semantic descriptions from medical images with convolutional neural networks</article-title>
          .
          <source>In: Information Processing in Medical Imaging</source>
          . pp.
          <volume>437</volume>
–
          <fpage>448</fpage>
          . Springer, Cham (Jun
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>H.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
<surname>Seff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Summers</surname>
            ,
            <given-names>R.M.:</given-names>
          </string-name>
          <article-title>Interleaved text/image deep mining on a very large-scale radiology database</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>1090</volume>
–
          <issue>1099</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Shin, H.C., Roberts, K., Lu, L., Demner-Fushman, D., Yao, J., Summers, R.M.: Learning to read chest x-rays: recurrent neural cascade model for automated image annotation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2497–2506 (2016)</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 3104–3112. Curran Associates, Inc. (2014)</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Syeda-Mahmood, T., Kumar, R., Compas, C.: Learning the correlation between images and disease labels using ambiguous learning. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. pp. 185–193. Springer, Cham (Oct 2015)</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision (Dec 2015)</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator (Nov 2014)</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. (Jul 2016)</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., Dean, J.: Google's neural machine translation system: Bridging the gap between human and machine translation (Sep 2016)</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>22. Yao, B.Z., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2t: Image parsing to text description. Proc. IEEE 98(8), 1485–1508 (Aug 2010)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>