Finding and Classifying Tuberculosis Types for a Targeted Treatment: MedGIFT–UPB Participation in the ImageCLEF 2017 tuberculosis Task Liviu–Daniel Ştefan1 , Yashin Dicente Cid2,3 , Oscar Jimenez–del–Toro2,3 , Bogdan Ionescu1 , and Henning Müller2,3 1 University Politehnica of Bucharest, 061071 Romania 2 University of Applied Sciences Western Switzerland (HES–SO), Sierre, Switzerland 3 University of Geneva, Switzerland Abstract. This paper describes the participation of the MedGIFT/UPB group in the ImageCLEF 2017 tuberculosis task. This task includes two subtasks: (1) multi–drug resistance detection (MDR), with the goal of determining the probability of a tuberculosis patient having a resistant form of tuberculosis and (2) tuberculosis type detection (TBT), with the goal of classifying each tuberculosis patient into one of the following five types: infiltrative, focal, tuberculoma, miliary and fibro–cavernous. Two runs were submitted for the TBT subtask and one run for the MDR subtask. Both of them use visual features learned with a deep learning approach directly from slices of patient CT (Computed Tomography) scans. For the TBT subtask the submitted runs obtained the 3rd and 8th position out of 23 runs submitted for this task, with a top Kappa value of 0.2329. In the MDR subtask, the proposed approach obtained the 7th position according to the accuracy (0.5352) out of 20 partici- pant runs. Three main techniques were exploited during model training: pre–training the last layer of a neural network, small learning rates and data augmentation techniques. Data augmentation resulted in an effec- tive and efficient data transformation that enhanced small lesions in the full image space. Keywords: ImageCLEF, tuberculosis, medical image analysis, computed tomography, deep learning. 1 Introduction According to the World Health Organization (WHO), tuberculosis remains as one of the top 10 causes of death worldwide [1]. Particularly when the disease is multi–drug–resistant (MDR), current treatment can be toxic and patient out- come is often poor [2]. An important challenge is still to correctly detect these MDR cases, as it has been estimated that only 20% of MDR tuberculosis cases are actually detected and treated accordingly [3]. Therefore, a better detection and classification of tuberculosis types can result in a more targeted treatment for the patients [4]. ImageCLEF4 is an image retrieval and analysis evaluation campaign (part of CLEF, the Cross Language Evaluation Forum) where many algorithms can be compared on the same basis [5]. ImageCLEF started in 2003, with a medical task included since 2004 and held every year since then [6]. In ImageCLEF 2017, the topic of one of the two medical tasks was related to tuberculosis data analysis from chest CT (Computed Tomography) images [7]. Two independent subtasks were proposed to the participants: (1) Multi–drug resistance (MDR), including 230 3D CT scans from only HIV-negative patients, no relapses and classified according to their tuberculosis treatment: drug sensitive or multi–drug resistant, and (ii) Tuberculosis type (TBT), containing 500 3D CT scans of TB patients, classified into five disease types: infiltrative, focal, tuberculoma, miliary and fibro–cavernous. For the MDR subtask the performance evaluation was done using ROC–curves generated from the probabilities provided by the contestants. On the other hand, for the TBT subtask the evaluation of performance was done using the unweighted Cohens Kappa. [8] gives more details on the dataset, task setup and full participant results, also for the other tasks of ImageCLEF 2017. The MedGIFT–UBP group, submitted 3 runs in total within the ImageCLEF 2017 tuberculosis task, with participation in both subtasks. The runs were based on visual features extracted from the 2D axial slices of the patient CT scans, learned using a very deep Convolutional Neural Network (CNN). Two main research questions were addressed while training and testing algorithms with a heterogeneous and challenging dataset: 1) how to design an effective and efficient data transformation to enhance small lesions in the full image space?; 2) how to learn the ConvNet models given the limited training samples?. The remainder of this paper is organized as follows: In Section 2.1, the data transformation method used to enhance small lesions in the full image space is de- scribed. Then, in Section 2.2, we investigate the network from a model-selection and optimization perspective. Section 3 reports the experimental setup. In Sec- tion 4 we report the results of our runs with respect to the top scores. Finally, in Section 5 we present conclusions and discuss current and future directions. 2 Methods The proposed methods were built on top of the successful deep learning architec- ture described in [9] while tackling the problems mentioned above. In terms of volume structure modeling, a key observation is that consecutive slices are highly redundant. Therefore, a sparse structural sampling strategy was favorable in this case. To unleash the full potential of ConvNets with the provided data, several training practices were implemented to overcome the aforementioned difficulties resulting from a limited number of training samples. These practices include cross–modality pre–training and enhanced data augmentation. 4 http://www.imageclef.org/ We apply for both of the subtasks the provided mask of the lungs [10] and extract the slices using a sparse structural sampling strategy described later in the paper. Then, we use a very deep CNN to perform the feature extraction from each slice of the CT volume of the patients. The parameters for each subtask are described in Section 3. Finally, we take the average of scores and use a Softmax classification to achieve the training-classification of new instances. 2.1 Data Enhancement In this section, we first briefly describe our image representation method sampled using a sparse structural sampling strategy followed by several good practices for improving the size of the dataset for learning. Data transformation The range of gray scale values in a natural image is of 256, but when considering color images, the possible values are 2563 , i.e. more than 16M. The range of Hounsfield Units (HU) in a CT slice could be grater than 4000 containing negative values (from -1024 to more than 3000), so a direct usage of a pre-trained GoogleNet network may not work. However, if we see each HU as a color code, then it should be possible to use a pre-trained GoogleNet network. Following this idea, we transformed each HU-based slice into a 2D RGB (Red, Green, Blue) image. Since we were interested only in the lung region containing HU in the range [-1024,300], we thresholded the image to this range, assigning the values out of the range to the respective limits. Then, each HU was mapped to a uniform distribution of the HSV space, varying from red, passing by yellow, green, cyan, blue, magenta, and red again. This mapping set to red any value outside the specified range. Furthermore, a key observation is that consecutive slices are highly redundant and therefore we used a sparse structural sampling strategy to enhance the quality of the dataset. Examples of a few 2D RGB images and their respective slices are depicted in Figure 1. Multi–scale Cropping Augmentation To reduce the over-fitting problem due to the reduced training set we used a 10–crops data augmentation scheme to generate diverse training samples. We only crop 4 corners and 1 center of the images as well as their horizontal reflections (hence ten crops in all). We fix the input image size as 256 × 256 and randomly sample the cropping width and height from {256, 224, 192, 168}. After that, we resize the cropped regions to 224 × 224. At test time we average the predictions made by the network’s Softmax layer on the 10 crops. 2.2 Network Architecture GoogLeNet, the winner of the ILSVRC 2014 challenge, is a 22-layer deep con- volutional network with layers stacked upon each other with different sizes. In order to speed up the computational efficiency it uses 1×1 convolutional opera- tions for dimension reduction. Furthermore, it uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much. More details can be found in its original paper [9]. We choose this architecture due to its improved utilization of the computing resources that allows to increase the depth and width of the network while keep- ing the computational budget constant that can therefore be trained effectively within a reasonable amount of time. (a) (b) (c) (d) (e) (f) Fig. 1. The figure shows examples of the gray scale (a–c) and RGB modalities (d–f). 3 Experimental Setup In this section, we give a detailed description of our proposed methods. We first present the network input and the training details. After that, we describe our testing strategies for both of the subtasks. 3.1 ImageCLEFtuberculosis: Task Setup for the Tuberculosis Type Training Run 1: TBT T GNet: In this run, the network weights are learned using the mini-batch stochastic gradient descent with momentum set to 0.9. At each iteration, a mini-batch of 32 samples is constructed by sampling 32 training slices from the dataset using a batch accumulation of 2. We did not use any data augmentation techniques in this run. The learning rate is initially set to 10−5 , and then the rate is changed to 10−6 after 1/3 of the iterations, then to 10−7 after 2/3 of the iterations, and finally 10−8 after which the training is stopped. As the network in this run takes RGB images as input, we pre–trained it using the ImageNet [11] model as initialization. Run 2: TBT TEST RUN 2 GoogleNet 10crops at (different scales): In this run, the network weights are learned using the mini-batch stochastic gradient descent with momentum set to 0.9. At each iteration, a mini-batch of 32 samples is constructed by sampling 32 training slices from the dataset. In net training, a sub–image is randomly cropped from the selected slice using the techniques described in the previous section. The learning rate is initially set to 10−5 sched- uled with a polynomial decreasing policy at a power of 5. As the network in this run takes gray level images as input, we pretrained it using the ImageNet model as initialization. First, we discretize the slices into the interval from 0 to 255 by a linear transformation. This step makes the range to be the same with RGB images. Then, we modify the weights of first convolution layer of the ImageNet RGB model to handle the input of grayscale images. Specifically, we average the model filters of first layer across the channel. This initialization method works pretty well and reduce the effect of overfitting in experiments. Testing Run 1: TBT T GNet: At test time, given a volume, we extract all the slices (samples). The class scores for the whole volume are then obtained by averaging the scores across the slices. Finally, we use a softmax classification to achieve the training–classification of new instances. This yields an accuracy of 46% on the validation set. Run 2: TBT TEST RUN 2 GoogleNet 10crops at (different scales): At test time, given a volume, we extract all the slices. The class scores for the whole volumes are then obtained by averaging the scores across the slices and crops therein. Finally, we use a Softmax classification to achieve the training–classification of a new volume. This yields an accuracy of 41% on the validation set. In this run we simultaneously enriched the dataset both in quality and quan- tity, but this resulted in performance deterioration due to the noise introduced. We also tried the data enrichment approach using the RGB modality, but we did not notice any improvement. 3.2 ImageCLEFtuberculosis: Setup and Results for the Multi-drug Resistant Task Training MDR TST RUN 1 : The network weights are learned using the mini- batch stochastic gradient descent with momentum set to 0.9. At each iteration, a mini-batch of 32 samples is constructed by sampling 32 training slices from the dataset using a batch accumulation of 2. In net training, a 224×224 sub–image is randomly cropped from the selected frame. The learning rate is initially set to 10−5 scheduled with a polynomial decreasing policy at a power of 5. Testing MDR TST RUN 1 : In the test phase, given a volume, we extract all the slices. The class scores for the whole volume is then obtained by averaging the scores across the slices. 4 Results For the TBT subtask the submitted runs that obtained the 3rd and 8th position out of 23 runs submitted for this task, with a top Kappa value of 0.2329. In Table 1, the performance of the proposed and the top techniques for the TBT subtask can be seen. The Table reports scores that correspond to the test set, from the metric proposed by the organizers. Our best score is reported in run1, which uses the RGB modality without data augmentation techniques. On the other hand, the run2 reports scores obtained by using the grayscale modality with data augmentation techniques. In the latter result, we consider that the representation is affected by noise when increasing the number of samples. Table 1. Selected subset of results from ImageCLEFtuberculosis (2017) – TBT sub- task. Comparison of our submitted runs to top scores. Group name Run name Run type Kappa ACC SGEast TBT resnet full Not applicable 0.24 0.4 SGEast TBT LSTM 17 wcrop Not applicable 0.23 0.39 MedGIFT-UPB Run 1 Automatic 0.23 0.38 MedGIFT-UPB Run 2 Automatic 0.19 0.37 In the MDR subtask, the proposed approach obtained the 7th position ac- cording to the accuracy (0.5352) out of 28 participant runs. In Table 2, the performance of the proposed and the top techniques for the MDR subtask can be seen. The table reports scores that correspond to the test set using the metrics proposed by the organizers. Overall, our group obtained good positions both in the TBT and MDR tasks. 5 Conclusions In this work we propose two fully automatic tuberculosis classification methods and one automatic predictor for assessing the probability of TB patients having a multi-drug resistant TB. These approaches were evaluated in the TBT and Table 2. Selected subset of results of the ImageCLEF 2017 tuberculosis task – MDR subtask. Comparison of our submitted runs to the best results. Group Name Run name Run type AUC ACC MedGIFT MDR Top1 correct Automatic 0.58 0.51 MedGIFT MDR submitted topBest3 correct Automatic 0.57 0.46 MedGIFT MDR submitted topBest5 correct Automatic 0.56 0.48 MedGIFT-UPB Run 1 (baseline) Automatic 0.51 0.53 MDR subtasks of the ImageCLEF 2017 tuberculosis task. Due to the fact that the datasets are relatively small, we tested several good practices for training the ConvNets. Relying on the proposed training strategies, the approaches achieved an accuracy of 38.7% on the TB type dataset and 51% on the MDR dataset. A research direction is to introduce more aggressive data augmentation techniques designed to improve the network generalization capabilities aligned with new techniques to generate medical hypotheses. References 1. Organization, W.H., et al.: Global tuberculosis report 2016. (2016) 2. Ahuja, S.D., Ashkin, D., Avendano, M., Banerjee, R., Bauer, M., Bayona, J.N., Becerra, M.C., Benedetti, A., Burgos, M., Centis, R., Chan, E.D., Chiang, C.Y., Cox, H., D’Ambrosio, L., DeRiemer, K., Dung, N.H., Enarson, D., Falzon, D., Flanagan, K., Flood, J., Garcia-Garcia, M.L., Gandhi, N., Granich, R.M., Hollm- Delgado, M.G., Holtz, T.H., Iseman, M.D., Jarlsberg, L.G., Keshavjee, S., Kim, H.R., Koh, W.J., Lancaster, J., Lange, C., de Lange, W.C.M., Leimane, V., Leung, C.C., Li, J., Menzies, D., Migliori, G.B., Mishustin, S.P., Mitnick, C.D., Narita, M., O’Riordan, P., Pai, M., Palmero, D., Park, S.k., Pasvol, G., Pea, J., Prez-Guzmn, C., Quelapio, M.I.D., Ponce-de Leon, A., Riekstina, V., Robert, J., Royce, S., Schaaf, H.S., Seung, K.J., Shah, L., Shim, T.S., Shin, S.S., Shiraishi, Y., Sifuentes- Osornio, J., Sotgiu, G., Strand, M.J., Tabarsi, P., Tupasi, T.E., van Altena, R., Van der Walt, M., Van der Werf, T.S., Vargas, M.H., Viiklepp, P., Westenhouse, J., Yew, W.W., Yim, J.J.: Multidrug resistant pulmonary tuberculosis treatment regimens and patient outcomes: an individual patient data meta–analysis of 9,153 patients. PLoS med 9(8) (2012) e1001300 3. Rendon, A., Tiberi, S., Scardigli, A., DAmbrosio, L., Centis, R., Caminero, J.A., Migliori, G.B.: Classification of drugs to treat multidrug–resistant tuberculosis (mdr–tb): evidence and perspectives. Journal of Thoracic Disease 8(10) (2016) 2666 4. Horsburgh, C.R.J., Barry, C.E.I., Lange, C.: Treatment of tuberculosis. New England Journal of Medicine 373(22) (2015) 2149–2160 5. Villegas, M., Müller, H., Gilbert, A., Piras, L., Wang, J., Mikolajczyk, K., Garcı́a Seco de Herrera, A., Bromuri, S., Amin, M.A., Kazi Mohammed, M., Acar, B., Uskudarli, S., Marvasti, N.B., Aldana, J.F., Roldán Garcı́a, M.d.M.: General overview of ImageCLEF at the CLEF 2015 labs. In: Working Notes of CLEF 2015. Lecture Notes in Computer Science. Springer International Publishing (2015) 6. Kalpathy-Cramer, J., Garcı́a Seco de Herrera, A., Demner-Fushman, D., Antani, S., Bedrick, S., Müller, H.: Evaluating performance of biomedical image retrieval systems: Overview of the medical image retrieval task at ImageCLEF 2004–2014. Computerized Medical Imaging and Graphics 39(0) (2015) 55 – 61 7. Dicente Cid, Y., Kalinovsky, A., Liauchuk, V., Kovalev, V., , Müller, H.: Overview of ImageCLEFtuberculosis 2017 - predicting tuberculosis type and drug resistances. In: CLEF2017 Working Notes. CEUR Workshop Proceedings, Dublin, Ireland, CEUR-WS.org ¡http://ceur-ws.org¿ (September 11-14 2017) 8. Ionescu, B., Müller, H., Villegas, M., Arenas, H., Boato, G., Dang-Nguyen, D.T., Dicente Cid, Y., Eickhoff, C., Garcia Seco de Herrera, A., Gurrin, C., Islam, Bayzidul and, K.V., Liauchuk, V., Mothe, J., Piras, L., Riegler, M., Schwall, I.: Overview of ImageCLEF 2017: Information extraction from images. In: CLEF 2017 Proceedings. Lecture Notes in Computer Science, Dublin, Ireland, Springer (September 11-14 2017) 9. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CoRR abs/1409.4842 (2014) 10. Dicente Cid, Y., Jiménez del Toro, O.A., Depeursinge, A., Müller, H.: Efficient and fully automatic segmentation of the lungs in ct volumes. In Goksel, O., Jiménez del Toro, O.A., Foncubierta-Rodrı́guez, A., Müller, H., eds.: Proceedings of the VIS- CERAL Anatomy Grand Challenge at the 2015 IEEE ISBI. CEUR Workshop Proceedings, CEUR-WS (May 2015) 31–35 11. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3) (2015) 211–252