<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Full training versus fine tuning for radiology images concept detection task for the ImageCLEF 2019 challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Priyanshu Sinha</string-name>
          <email>sinha@outlook.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saptarshi Purkayastha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Judy Gichoya</string-name>
          <email>gichoya@ohsu.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indiana University Purdue University</institution>
          ,
          <addr-line>Indianapolis, IN 46202</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mentor Graphics India Pvt. Ltd.</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Oregon Health &amp; Science University</institution>
          ,
          <addr-line>Portland, OR 97239</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Concept detection from medical images remains a challenging task that limits implementation of clinical ML/AI pipelines because of the scarcity of highly trained experts to annotate images. There is a need for automated processes that can extract concrete textual information from image data. ImageCLEF 2019 provided us with a set of images labeled with UMLS concepts. We participated for the first time in the concept detection task, using transfer learning. Our approach involved an experiment of layerwise fine tuning versus full training, based on previously reported recommendations for training classification, detection and segmentation tasks in medical imaging. We ranked number 9 in this year's challenge, with an F1 score of 0.05 after three entries. We had a poor result from layerwise fine tuning (F1 score of 0.014), which is consistent with previous authors who have described the benefit of full training for transfer learning. However, when a radiologist reviewed the results, the terms did not make clinical sense. We hypothesize that we can achieve better performance by using medically pretrained image models, for example PathNet, and by utilizing a hierarchical training approach, which is the basis of our future work on this dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Transfer Learning</kwd>
        <kwd>Layerwise Fine Tuning</kwd>
        <kwd>Deep Learning in Radiology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Concept detection from medical images remains a challenging task that limits
implementation of clinical ML/AI pipelines because of the scarcity of highly
trained experts to annotate images. ImageCLEF is an annual challenge, now in its
third year, that seeks contributions providing techniques to map visual
information to condensed textual descriptions. The process of automatically extracting
high-level concepts from low-level features is difficult when the images have
occlusion, background clutter, intra-class variation, and pose and lighting changes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Participants from past challenges in 2017 and 2018 noted a broad range of
content and hence the 2019 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] challenge was narrowed down in focus to only
radiology images [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The focus on concept detection in the 2019 challenge is
important because it is the first step of automatic image captioning, while also
providing metadata to support context-based image retrieval.
      </p>
      <p>
        This was our first time participating in the ImageCLEF challenge. The
challenge is a multi-label classification problem, where one radiology image can have
multiple labels. Previous participants had good performance when using transfer
learning, hence we focused on optimizing the ResNet50 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] network, which had
the best performance compared to VGG19 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Xception Net [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and
InceptionResNetV2 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We ranked number 9 in this year's challenge, with an F1 score of
0.05 after three entries. We had a poor result from layerwise fine tuning
(F1 score of 0.014), which is consistent with previous authors who have described
the benefit of full training during transfer learning. However, when a radiologist
reviewed the results, the terms did not make clinical sense, and we hypothesize
that we can achieve better performance by using medically pretrained image
models, for example PathNet, which is the basis of our future work on this dataset.
We describe our approach in detail in the remaining sections of this paper.
      </p>
    </sec>
    <sec id="sec-2">
      <title>DATASET</title>
      <p>
        A total of 6,031,814 image-caption pairs were extracted from PubMed Open
Access and, after processing, were reduced to 72,187 radiology images from
various modalities. This dataset included archived images from February 2018 to
February 2019 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Table 1 shows a summary of the images in the training, test
and validation sets. We did not use additional radiology training data for the
purpose of our submission to this challenge. Each label is a UMLS concept provided
in a CSV file. Table 2 shows a representative sample of the data, with images
from the training set (first row), validation set (second row) and test set (third row).
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Number of images in the training, validation and test sets.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Set</th>
              <th>No of Images</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Training</td>
              <td>56629</td>
            </tr>
            <tr>
              <td>Validation</td>
              <td>14157</td>
            </tr>
            <tr>
              <td>Test</td>
              <td>10000</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>STUDY EXPERIMENT</title>
      <p>
        Data Analysis.
The ImageCLEF images were formatted to the ImageNet directory style, where
the directory name is the UMLS label. This was because our approach was
mainly based on transfer learning, and this layout made repeat experiments easy to
perform. Summary statistics of the dataset found 5217 unique UMLS label
concepts. There was image imbalance, with approximately 90% of the labels
containing fewer than 100 images and 30% of labels containing only a single image. Table 3
shows the top 10 concepts occurring at the highest frequency in the training set.
Analysis of the top 25 labels (summarized in Figure 1) shows that there is
persistent data imbalance, with one label containing more than 6500 images (C0441633,
"Scanning") and one label containing fewer than 2000 images (C0006104,
"Brain"). We therefore discarded labels containing fewer than 1000 images and
used the class weight technique from sklearn for balancing our training data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Each input image was resized to 224x224 pixels without cropping. We used a
batch size of 32 with a learning rate of 0.0001. The batches were formed by randomly
shuffling the dataset. Optimization was performed using the Adam optimizer with
default beta 1 (0.9) and beta 2 (0.999). Image augmentation during training was
performed using the Keras ImageDataGenerator. Augmentations performed
include rescaling, rotation, zooming, shearing and horizontal flipping. A total of
100 epochs were executed. We split the data into an 85% training set and a 15%
validation set. The network was trained using the Keras framework with TensorFlow
as the backend, running on an NVIDIA Quadro P6000 GPU.
      </p>
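      <p>
        To make the pipeline concrete, the following is a minimal sketch of the class
weighting and augmentation setup described above, using the Keras
ImageDataGenerator and sklearn's compute_class_weight; the directory path and the
augmentation magnitudes are illustrative assumptions, not our exact configuration.
      </p>
      <preformat>
# Sketch of the data pipeline described above (path and augmentation
# magnitudes are hypothetical).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentations named in the text: rescaling, rotation, zoom, shear,
# horizontal flip; plus the 85%/15% train/validation split.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    zoom_range=0.1,
    shear_range=0.1,
    horizontal_flip=True,
    validation_split=0.15,
)

# ImageNet-style layout: one sub-directory per UMLS concept label.
train_gen = datagen.flow_from_directory(
    "data/train",              # hypothetical path
    target_size=(224, 224),    # resize without cropping
    batch_size=32,
    shuffle=True,
    subset="training",
)

# Balance the imbalanced labels with sklearn's class weight technique.
weights = compute_class_weight(
    class_weight="balanced",
    classes=np.unique(train_gen.classes),
    y=train_gen.classes,
)
class_weight = dict(enumerate(weights))  # pass to model.fit(..., class_weight=...)
      </preformat>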
      <p>
        We treated this as a multi-label classification problem and limited our training
to the top 25 labels. Our base model was ResNet50, from which we removed the
fully connected top layers and added our own auxiliary convolutional layer along
with dense layers. To prevent overfitting, we used dropout between the dense layers.
After evaluating our performance with fine tuning of the last layers, and reviewing
the literature on fine tuning versus full training [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we embarked on layerwise
fine tuning using ResNet50 (run 2). In the second run we sequentially trained each
layer while freezing the others. For this approach we decreased the learning rate for
the higher layers and fine tuned layer by layer, unfreezing the layers below a particular
layer.
      </p>
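      <p>
        As a sketch of this architecture (the layer sizes and the unfreezing cutoff
below are illustrative assumptions, not our exact configuration), the pretrained
ResNet50 has its fully connected top removed, an auxiliary convolutional layer
and dense layers with dropout are added, and a 25-way sigmoid output handles
the multi-label problem:
      </p>
      <preformat>
# Sketch of the transfer-learning model described above (sizes illustrative).
from tensorflow.keras import Model, layers, optimizers
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3))
base.trainable = False  # run 1: keep the pretrained layers frozen

x = layers.Conv2D(512, 3, padding="same", activation="relu")(base.output)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dropout(0.5)(x)                        # dropout between dense layers
out = layers.Dense(25, activation="sigmoid")(x)   # one sigmoid per label

model = Model(base.input, out)
model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999),
    loss="binary_crossentropy",
)

# Run 2 (layerwise fine tuning): unfreeze the base from a chosen layer
# upward while the rest stays frozen, and recompile with a lower
# learning rate before continuing training.
def unfreeze_from(model, base, layer_name, lr=1e-5):
    trainable = False
    for layer in base.layers:
        if layer.name == layer_name:
            trainable = True
        layer.trainable = trainable
    model.compile(optimizer=optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy")
      </preformat>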
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Analysis</title>
      <p>
        Tajbakhsh et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] performed the most comprehensive experiment evaluating
the approach of fine tuning a network versus training a network from scratch.
In their review of classification, detection and segmentation tasks using
multiple imaging modalities, including radiology, colonoscopy and ultrasound, they
demonstrated better performance with layerwise fine tuning. Our attempt to
replicate their superior performance on the concept detection task on the
ImageCLEF 2019 dataset led to lower performance with layerwise fine tuning
(F1 score of 0.014) versus fine tuning the network as a whole (F1 score of
0.05), summarized in Table 4. Our poor comparative performance may be due to
poor selection of hyperparameters for fine tuning the network.
      </p>
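      <p>
        For reference, the following is a minimal sketch of the scoring as we understand
it: an F1 score computed per image between the predicted and ground-truth
UMLS concept sets, averaged over all images; the image IDs and concept codes in
the example are hypothetical.
      </p>
      <preformat>
# Per-image F1 between predicted and ground-truth concept sets, averaged.
def f1(pred: set, truth: set) -> float:
    if not pred and not truth:
        return 1.0
    tp = len(pred.intersection(truth))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions: dict, ground_truth: dict) -> float:
    scores = [f1(predictions.get(image_id, set()), concepts)
              for image_id, concepts in ground_truth.items()]
    return sum(scores) / len(scores)

# Hypothetical example: one of two ground-truth concepts detected.
gt = {"img1": {"C0441633", "C0006104"}}
pred = {"img1": {"C0441633"}}
print(mean_f1(pred, gt))  # 0.666...
      </preformat>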
      <p>Our approach included a clinical review of some of the sample output by a
radiologist who is one of the authors of this paper, and we noticed a large
discrepancy in the utility of the generated concepts (Table 5). For example, the first
row demonstrates a chest x-ray with a pneumoperitoneum, and our model does
not generate terms closely related to the actual radiograph interpretation. We
hypothesize that a stepwise approach to training in which ontology hierarchies, for
example laterality and anatomy, are maintained may generate superior
performance that is clinically meaningful.</p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION</title>
      <p>Despite previous documentation of superior performance with layerwise fine
tuning on medical image tasks, we had poor performance with this approach
for concept detection. There is an opportunity to improve on layerwise fine
tuning for such tasks. We advance the challenge by reviewing the clinical relevance
of the output: despite ranking number 9 in the challenge, we
found that the clinical utility of the detected concepts was low, and we hypothesize
that we can achieve better performance and improved clinical utility using a
hierarchical approach to training.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. D. Katsios and E. Kavallieratou, <article-title>"Concept Detection on Medical Images using Deep Residual Learning Network,"</article-title> <source>CLEF</source>, <year>2017</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. B. Ionescu, H. Müller, R. Péteri, D.-T. Dang-Nguyen, L. Piras, M. Riegler, M.-T. Tran, M. Lux, C. Gurrin, Y. Dicente Cid, V. Liauchuk, V. Kovalev, A. Ben Abacha, S. A. Hasan, V. Datla, J. Liu, D. Demner-Fushman, O. Pelka, C. M. Friedrich, J. Chamberlain, A. Clark, A. García Seco de Herrera, N. Garcia, E. Kavallieratou, C. R. del Blanco, C. Cuevas Rodríguez, N. Vasillopoulos, and K. Karampidis, <article-title>"Overview of ImageCLEF 2019: Challenges, Datasets and Evaluation,"</article-title> in Experimental IR Meets Multilinguality, Multimodality, and Interaction, <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019)</source>, <year>2019</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. O. Pelka, S. Koitka, J. Rückert, F. Nensa, and C. M. Friedrich, <article-title>"Radiology objects in context (ROCO): A multimodal image dataset,"</article-title> in <source>Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings</source>, vol. <volume>11043</volume>, D. Stoyanov, Z. Taylor, S. Balocco, R. Sznitman, A. Martel, L. Maier-Hein, L. Duong, G. Zahnd, S. Demirci, S. Albarqouni, S.-L. Lee, S. Moriconi, V. Cheplygina, D. Mateus, E. Trucco, E. Granger, and P. Jannin, Eds. Cham: Springer International Publishing, <year>2018</year>, pp. <fpage>180</fpage>-<lpage>189</lpage>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. K. He, X. Zhang, S. Ren, and J. Sun, <article-title>"Deep residual learning for image recognition,"</article-title> in <source>Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <year>2016</year>, pp. <fpage>770</fpage>-<lpage>778</lpage>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. K. Simonyan and A. Zisserman, <article-title>"Very Deep Convolutional Networks for Large-Scale Image Recognition,"</article-title> <source>arXiv</source>, Sep. <year>2014</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. F. Chollet, <article-title>"Xception: Deep Learning with Depthwise Separable Convolutions,"</article-title> in <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <year>2017</year>, pp. <fpage>1800</fpage>-<lpage>1807</lpage>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, <article-title>"Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning,"</article-title> <source>arXiv</source>, Feb. <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. "sklearn.utils.class_weight.compute_class_weight," <source>scikit-learn 0.21.2 documentation</source>. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html. [Accessed: 27-May-2019].</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, <article-title>"Convolutional neural networks for medical image analysis: Full training or fine tuning?,"</article-title> <source>IEEE Trans. Med. Imaging</source>, vol. <volume>35</volume>, no. <issue>5</issue>, pp. <fpage>1299</fpage>-<lpage>1312</lpage>, Mar. <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. O. Pelka, C. M. Friedrich, A. García Seco de Herrera, and H. Müller, <article-title>"Overview of the ImageCLEFmed 2019 Concept Prediction Task,"</article-title> <source>CLEF2019 Working Notes, CEUR Workshop Proceedings (CEUR-WS.org)</source>, ISSN 1613-0073, http://ceur-ws.org/Vol-2380/, <year>2019</year>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. B. Ionescu, H. Müller, R. Péteri, Y. Dicente Cid, V. Liauchuk, V. Kovalev, D. Klimuk, A. Tarasau, A. Ben Abacha, S. A. Hasan, V. Datla, J. Liu, D. Demner-Fushman, D.-T. Dang-Nguyen, L. Piras, M. Riegler, M.-T. Tran, M. Lux, C. Gurrin, O. Pelka, C. M. Friedrich, A. García Seco de Herrera, N. Garcia, E. Kavallieratou, C. R. del Blanco, C. Cuevas Rodríguez, N. Vasillopoulos, K. Karampidis, J. Chamberlain, A. Clark, and A. Campello, <article-title>"ImageCLEF 2019: Multimedia Retrieval in Medicine, Lifelogging, Security and Nature,"</article-title> in Experimental IR Meets Multilinguality, Multimodality, and Interaction, <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019)</source>, <year>2019</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>