ImageCLEF 2019: A 2D Convolutional Neural Network Approach for Severity Scoring of Lung Tuberculosis using CT Images

Kavitha S [0000-0003-3439-2383], Nandhinee PR, Harshana S, Jahnavi Srividya S and Harrinei K

Department of CSE, SSN College of Engineering, Kalavakkam 603110, India
kavithas@ssn.edu.in, {nandhinee16066,harshana17053,jahnavisrividya17061,harrinei17052}@cse.ssn.edu.in

Abstract. Tuberculosis (TB) is an airborne disease that affects the lungs and often spreads through sputum. According to a World Health Organization report, about 9 million people worldwide are affected by TB. Tuberculosis can be cured more easily when diagnosed at an early stage with accurate CT analysis. As an effort to form a technical forum for effective analysis and diagnosis, ImageCLEF released the Tuberculosis 2019 tasks, each dealing with one aspect of understanding and tackling the disease. We have taken up the subtask that aims at assessing the severity of tuberculosis as low or high. The task is implemented using a deep learning approach based on a 2D Convolutional Neural Network (CNN) with appropriate preprocessing. The CT volumes are segmented with the provided masks and further pre-processed with the aid of med2image, a Python utility, to obtain slices of the CT scans prior to training the model. The best run of the proposed CNN model achieved an accuracy of 0.607 and an AUC of 0.626, which placed it 9th in the overall leaderboard of the ImageCLEF 2019 Tuberculosis challenge for severity scoring.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

Keywords: Severity scoring; Lung tuberculosis; Pre-processing; Lung mask; CNN; AUC; Accuracy.

1 Introduction

Tuberculosis (TB) is an airborne disease that affects the lungs. Often spread through sputum, cough and infected droplets, it is quite widespread, affecting about 9 million people worldwide. The treatment depends upon the degree of infection, i.e. the severity [1]. Severity evaluation is carried out by medical practitioners using a diverse set of tools, including mycobacterial culture tests, pleural fluid and cerebrospinal fluid analysis, and lesion patterns obtained from radiological images of the lungs, besides individual factors such as the patient's age, prior treatment, etc. Computed Tomography (CT) is widely used for analysis of the lesion patterns. Besides being prone to errors, a manual approach can prove to be costly, both in terms of capital and time. A computerized method, on the other hand, upholds time efficiency and precision.

In this paper, a Convolutional Neural Network (CNN) approach for severity scoring of lung tuberculosis based on CT scans is discussed with results. This work is a subtask of the tuberculosis tasks of ImageCLEF 2019 [2, 8]. It establishes a standard scale against which the CT volume in question can be evaluated for determining the severity.

The remaining sections are organized as follows. Section 1 gives a brief introduction to the importance of this problem and the necessity of finding the severity of tuberculosis. Section 2 gives a glimpse of the dataset and how it is spread across the two classes, and Section 2.1 details the data preprocessing procedures. Section 3 explains the proposed model using a convolutional neural network with the parameters chosen for analysis. In Section 4, the results of the various runs are discussed.
Finally, Section 5 concludes the paper and looks into future directions for further improvement of the proposed model.

2 Dataset

In this edition of the ImageCLEF 2019 TB tasks, the dataset contains 335 chest CT scans of TB patients along with a set of clinically relevant metadata; data of 218 patients are used for training and 117 for testing. For all patients, 3D CT images are stored in the compressed NIfTI (Neuroimaging Informatics Technology Initiative) file format with a slice size of 512×512 pixels, and the number of slices varies from 50 to 400 per patient. This file format stores raw voxel intensities in Hounsfield Units (HU) as well as the corresponding image metadata such as image dimensions, voxel size in physical units, slice thickness, etc. The selected metadata includes the following binary measures: disability, relapse, symptoms of TB, comorbidity, bacillary, drug resistance, higher education, ex-prisoner, alcoholic and smoking, together with a severity score ranging from 1 to 5 assigned by medical doctors. To treat this task as a binary classification problem, the severity scores are grouped as high severity (scores 1, 2 and 3) and low severity (scores 4 and 5). Moreover, automatically extracted masks of the lungs are provided for all patients. Table 1 gives the number of patients of each severity class in the training set and the total number of patients in the test set [2].

Table 1. Severity scoring dataset – patient-wise training and test sets

Severity type     Training    Testing
Low               118
High              100
Total patients    218         117

From the given dataset, sample images of the "high severity" and "low severity" classes are shown in Figures 1 and 2.

Fig. 1. "High severity", patient ID 196, slice 66. Left: CT scan of the lung from the dataset; middle: corresponding mask of the lung; right: masked image.

Fig. 2. "Low severity", patient ID 181, slice 65. Left: CT scan of the lung from the dataset; middle: corresponding mask of the lung; right: masked image.

2.1 Data Preprocessing

The dataset for the TB tasks is given in compressed NIfTI format. Initially, each file is decompressed and the slices are extracted using med2image, a Python utility. For each NIfTI volume we obtain between 50 and 400 JPEG slice images. The lung masks provided by the organizers are used to avoid potential confusion caused by lung-like structures in other parts of the CT images. The next step involves masking the images. The given masks are converted to grayscale and each pixel is checked individually; if the pixel is not black, it is set to white. In this way, a final binary mask is created with pixel values of black (0) or white (255). The original scan of the lung is then converted to grayscale and combined with the corresponding final mask using a bitwise AND operation. Thus, the lungs are segmented from the original scans [4]. On the other hand, not all slices necessarily contain information that is useful for identifying the severity of TB. For this reason, it is essential to filter the slices and preserve only those that are informative. Upon visual inspection, slices ranging between 55 and 85 are used and the remaining slices are eliminated from further processing. Since the slices are ordered, the 31 most informative ones usually fall at the center of the list. The workflow of the preprocessing stages is given in Figure 3, and a code sketch of the masking step is given below.
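The masking step described above can be summarized in a short sketch. This is an illustrative reconstruction rather than the exact script used for the submissions: the use of OpenCV and the example file names and paths are assumptions.

```python
import cv2  # OpenCV assumed; any image library with thresholding and bitwise ops would do


def mask_lung_slice(scan_path, mask_path):
    """Segment the lungs from a single CT slice using its provided lung mask."""
    # Read both the CT slice and its mask as grayscale images
    scan = cv2.imread(scan_path, cv2.IMREAD_GRAYSCALE)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)

    # Any non-black mask pixel becomes white (255), giving a binary 0/255 mask
    _, binary_mask = cv2.threshold(mask, 0, 255, cv2.THRESH_BINARY)

    # Bitwise AND keeps scan intensities only where the mask is white
    return cv2.bitwise_and(scan, binary_mask)


# Hypothetical usage for one slice of one patient (paths are placeholders)
segmented = mask_lung_slice("slices/patient196_slice066.jpg",
                            "masks/patient196_slice066.jpg")
cv2.imwrite("segmented/patient196_slice066.jpg", segmented)
```

Applying this routine to slices 55 through 85 of each patient yields the 31 segmented slices per patient that are fed to the network.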
Fig. 3. General flow of the preprocessing stage.

3 Methodology

A convolutional neural network takes an image as input and passes it through a series of convolutional layers, nonlinear activation layers, pooling (downsampling) layers and a fully connected layer to output the classification labels. It differs from an ordinary neural network in two aspects: at least one convolutional layer and learned filters. The model for TB severity scoring is created as a 2D convolutional neural network using the Keras library [5] with TensorFlow [6] as backend. The 3D volumes of the procured CT scans are sliced and converted to 2D images in the preprocessing stage. The network is designed with three 2D convolution layers using the rectified linear unit (ReLU) activation function, each convolution layer followed by a max pooling layer. The resulting feature maps are flattened and connected to a dense layer with 1000 outputs and ReLU activation. Finally, these activations are passed through a softmax layer, which outputs a tensor of size 2, one value per class. Binary cross-entropy is used as the loss function, with Adam and RMSProp as optimizers. Model files are built for the different runs by varying the hyperparameters of this base model. The corresponding CNN design structure is shown in Figure 4; for clarity, each layer is annotated with the values from the model summary of one run.

Fig. 4. Base design of the 2D CNN used for training.

4 Experiments and Results

The CNN model is trained by varying hyperparameters such as the number of filters, the number of epochs and the optimizer. All runs use a filter size of 64×64 with batch size 32 and binary cross-entropy as the loss. The differences in accuracy come from changing the number of epochs and the optimizer (Adam or RMSProp). The different runs of the CNN model obtained by varying the hyperparameters are given in Table 2.

Table 2. Different runs of the CNN model – varying hyperparameters

Hyperparameter                 Run 1                 Run 2                 Run 3                 Run 4
No. of convolutional layers    3                     3                     3                     3
No. of filters in each layer   16×32×64              16×32×64              64×32×32              64×32×16
Size of each filter            64×64                 64×64                 64×64                 64×64
Pooling function               max                   max                   max                   max
Activation functions           relu, softmax         relu, softmax         relu, softmax         relu, softmax
Batch size                     32                    32                    32                    32
Number of epochs               15                    20                    15                    15
Loss type                      binary cross entropy  binary cross entropy  binary cross entropy  binary cross entropy
Optimizer                      RMSProp               Adam                  RMSProp               RMSProp

Fig. 5. Visualization of the three convolution layers and max pooling.

The intermediate visualization of the convolution layers and max pooling for Run 1, patient ID 181, slice number 65, is shown in Figure 5. The results of the four submitted runs are listed in Table 3 for the training, validation and test datasets with the necessary parameters. In testing, 31 slices per patient are considered, as in training and validation, for all 117 patients. The probability of high severity for each patient is calculated as the average of the "probability of high" over all 31 slices of that patient. For example, the class probability of patient ID 77 in testing, for slice number 60, is represented as [0. 1.]; here "0." is the probability of low severity and "1." is the probability of high severity. We compute the probability of high severity for each of the 31 slices of the patient and then take their average. The average probability of high severity for patient ID 77 in each run is given in Table 4. A sketch of the base model definition and this per-patient averaging is given below.
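The following Keras sketch illustrates the base architecture of Section 3 and the per-patient averaging used at test time. It is a minimal sketch under stated assumptions: the filter counts correspond to Run 1 (16×32×64), while the 3×3 kernel, the 128×128 input resolution and the load_slice() helper are assumptions introduced here for illustration, not values confirmed by the paper.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_model(input_shape=(128, 128, 1)):
    """Three Conv2D + MaxPooling2D blocks, a 1000-unit dense layer and a 2-way softmax."""
    model = keras.Sequential([
        layers.Conv2D(16, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(1000, activation="relu"),
        layers.Dense(2, activation="softmax"),  # [P(low severity), P(high severity)]
    ])
    model.compile(optimizer="rmsprop",           # Adam is used instead for Run 2
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


def patient_high_probability(model, slice_paths, load_slice):
    """Average the softmax probability of high severity over a patient's 31 slices.

    load_slice is a hypothetical helper returning one preprocessed slice
    as an array shaped like the model input, e.g. (128, 128, 1).
    """
    slices = np.stack([load_slice(p) for p in slice_paths])  # (31, 128, 128, 1)
    probs = model.predict(slices)                            # (31, 2) softmax outputs
    return float(probs[:, 1].mean())                         # mean P(high severity)
```

The training loop, data generators and label encoding are omitted; the same 31-slice window (slices 55 to 85) described in Section 2.1 is assumed for the test patients.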
The test dataset results are evaluated using two metrics, accuracy and Area Under the ROC Curve (AUC), and ranking is carried out among the participating teams.

Table 3. Results of different runs – training, validation and testing

Run No.    Training accuracy    Validation accuracy    Training loss    Validation loss    AUC       Accuracy
1          0.8314               0.8011                 0.3698           0.4223             0.5446    0.5299
2          0.8491               0.8390                 0.3378           0.3496             0.6067    0.5726
3          0.8840               0.8434                 0.2869           0.3132             0.6264    0.6068
4          0.8754               0.8103                 0.2979           0.4284             0.6133    0.5385

Table 4. Test run of patient ID 77 with its probability score of high severity

Test run    Probability score of high severity
Run 1       0.67741935
Run 2       0.80645161
Run 3       0.83870968
Run 4       0.86362070

For better visualization, the same information is plotted as graphs in Figures 6 and 7 using the Tableau tool. In Figure 6, the values of the evaluation metrics on the test set are given for all four runs; the graph clearly shows that Run 3 has a higher AUC and accuracy than the remaining runs. In Figure 7, the accuracy on the training, validation and test datasets is given for all four runs; again, Run 3 has the highest value in all cases. In addition, Run 4 has higher training and validation accuracy than Run 2 but lower test accuracy, which might be due to the chosen filter configuration of each layer.

Fig. 6. Performance analysis – runs vs. metrics.

Fig. 7. Comparison of accuracy between training, validation and testing.

In the ImageCLEF 2019 Tuberculosis severity scoring subtask, four runs were submitted and the best run of our team is ranked 9th overall among the participating teams, as given in Table 5 [3].

Table 5. Top 10 rankings of the ImageCLEF 2019 Tuberculosis – severity scoring task

Rank    Team name               AUC      Accuracy    No. of runs submitted
1       UIIP_BioMed             0.788    0.718       2
2       SergeKo                 0.775    0.718       2
3       KirillB                 0.770    0.692       10
4       CompElecEngCU           0.763    0.658       2
5       agentili                0.721    0.684       9
6       yashindc (Organizer)    0.720    0.641       6
7       UniversityAlicante      0.701    0.701       10
8       MostaganemFSEI          0.651    0.615       10
9       Kavitha                 0.626    0.607       4
10      Shopon                  0.611    0.615       2

When the results of all runs are sorted in descending order of AUC for the SVR subtask, the four runs submitted by our team obtained the 29th, 31st, 35th and 43rd ranks [2, 7].

5 Conclusion and Future Work

In this paper, the severity scoring (SVR) subtask for lung tuberculosis is addressed using a 2D convolutional neural network. The classification results obtained for the given set of 3D CT images were submitted for evaluation. In our approach, the dataset is preprocessed to convert the volumes into 2D slices, and the images are split into training and validation sets. The proposed CNN model is trained and validated while tuning the hyperparameters for four different runs. Among the submitted runs, the primary run is ranked 9th among the participating teams.

CNN is a preferred approach, since it facilitates automatic detection of low-level and high-level features from a large training dataset. However, a large dataset might prove disadvantageous in terms of memory during the training phase. This can be overcome by the use of a GPU and by choosing optimal hyperparameters. In future, the proposed model can be improved by considering all the slices of the CT volumes, building the trained model on a GPU and adopting a transfer learning approach.

References

1. WHO page, https://www.who.int/news-room/fact-sheets/detail/tuberculosis, last accessed May 2019.
2. ImageCLEF 2019 page, https://www.imageclef.org/2019/medical/tuberculosis, last accessed May 2019.
3. CrowdAI page, https://www.crowdai.org/challenges/imageclef-2019-tuberculosis-severity-scoring/leaderboards, last accessed May 2019.
4. Yashin Dicente Cid, Oscar A. Jiménez-del-Toro, Adrien Depeursinge, Henning Müller, Efficient and fully automatic segmentation of the lungs in CT volumes. In: Goksel, O., et al. (eds.) Proceedings of the VISCERAL Challenge at ISBI. No. 1390 in CEUR Workshop Proceedings (Apr 2015).
5. Keras documentation, https://keras.io/, last accessed May 2019.
6. TensorFlow documentation, https://www.tensorflow.org/, last accessed May 2019.
7. Yashin Dicente Cid, Vitali Liauchuk, Dzmitri Klimuk, Aleh Tarasau, Vassili Kovalev, Henning Müller, Overview of ImageCLEFtuberculosis 2019 - Automatic CT-based Report Generation and Tuberculosis Severity Assessment. CLEF 2019 Working Notes, CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-2380/.
8. Bogdan Ionescu, Henning Müller, Renaud Péteri, Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Dzmitri Klimuk, Aleh Tarasau, Asma Ben Abacha, Sadid A. Hasan, Vivek Datla, Joey Liu, Dina Demner-Fushman, Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Minh-Triet Tran, Mathias Lux, Cathal Gurrin, Obioma Pelka, Christoph M. Friedrich, Alba García Seco de Herrera, Narciso Garcia, Ergina Kavallieratou, Carlos Roberto del Blanco, Carlos Cuevas Rodríguez, Nikos Vasillopoulos, Konstantinos Karampidis, Jon Chamberlain, Adrian Clark, Antonio Campello, ImageCLEF 2019: Multimedia Retrieval in Medicine, Lifelogging, Security and Nature. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), Lugano, Switzerland, LNCS Lecture Notes in Computer Science, Springer (September 9-12, 2019).
9. Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Henning Müller, Overview of ImageCLEFtuberculosis 2018 - Detecting multi-drug resistance, classifying tuberculosis type, and assessing severity score. CLEF Working Notes, CEUR, 2018.