ImageCLEF 2018: Lesion-based TB-descriptor for CT Image Analysis

Vitali Liauchuk1, Aleh Tarasau2, Eduard Snezhko1, and Vassili Kovalev1

1 United Institute of Informatics Problems, Minsk, Belarus
vitali.liauchuk@gmail.com
2 Scientific and Practical Center for Pulmonology and Tuberculosis, Minsk, Belarus

Abstract. The paper presents the image description and classification method used by the United Institute of Informatics Problems (UIIP BioMed) group in the three subtasks of the ImageCLEFtuberculosis task. The image description method is based on automated detection of tuberculosis (TB) lesions of different types in 3D lung Computed Tomography (CT) scans. Lesion detection is based on a Coder-Decoder Convolutional Neural Network trained on a third-party dataset of 149 CT scans with lesions labeled by a qualified radiologist. The combination of the lesion-based TB-descriptor and a Random Forests classifier achieved the best performance in the TB type classification and TB severity scoring subtasks.

Keywords: tuberculosis, TB-descriptor, lesions, CT, image analysis

1 Introduction

The tuberculosis task [3] of the ImageCLEF 2018 Challenge [5] comprises three subtasks, all dealing with 3D CT images. Subtask #1 is dedicated to distinguishing multi-drug resistant tuberculosis (MDR TB) cases from drug sensitive (DS) ones based on a single image. The task remains very challenging and so far has no solution with sufficient prediction accuracy. A recent analysis of published evidence reports statistically significant links between drug resistance and multiple thick-walled caverns [12]. So far, computerized methods detect MDR TB from images at a level barely beyond statistical significance [4, 8, 9].
Compared to the 2017 data [2], the datasets for the MDR detection subtask were extended by adding several cases of extensively drug-resistant tuberculosis (XDR TB), a rare and more severe subtype of MDR TB. The training data for the MDR detection subtask thus included 259 CT images: 134 drug sensitive and 125 drug resistant cases. The test set consisted of 236 CT images: 101 drug sensitive and 135 drug resistant cases.

Subtask #2 of the ImageCLEFtuberculosis task is aimed at automatic categorization of CT images into one of five types of tuberculosis: Infiltrative, Focal, Tuberculoma, Miliary and Fibro-cavernous. Compared to 2017, the datasets were extended with new CT scans of patients involved earlier and with CT images of some new patients. However, in this study only the first CT scan of each patient was used.

The newly introduced subtask #3 was dedicated to assessing the severity of TB based on a single CT image of a patient. The severity score is a cumulative score of the severity of a TB case assigned by a medical doctor. Originally, the severity scores were assigned as natural numbers between 1 (“critical/very bad”) and 5 (“very good”). For binary classification, the scores were additionally converted to binary values: scores from 1 to 3 corresponded to “high severity” and the remaining scores 4 and 5 to “low severity”. When scoring, the medical doctors considered many factors such as patterns of lung lesions, results of microbiological tests, duration of treatment, patient’s age and others. One of the goals of this subtask is to distinguish “low severity” from “high severity” based solely on the CT scan.

2 Detection of lung lesions in CT, TB-descriptor

In this section, a method for automated detection of lung lesions in 3D CT images is described.
The method is based on training a Deep Convolutional Neural Network (CNN) on data derived from 3D CT images with manually labeled lesions of different types. The method uses the slice-wise image segmentation technique previously described in [6]. This technique splits the original 3D image into a number of smaller 2D regions, processes the regions one by one and collects the CNN output into a 3D probability map (see Fig. 1). Finally, a quantitative TB-descriptor is built from the lesion probability maps.

Fig. 1. General scheme of the slice-wise lesion segmentation method

2.1 Data preparation

TB lesions were labeled manually on a total of 198 3D CT scans. The labeling was performed in two stages. The first stage was performed by a qualified radiologist and was aimed at coarse localization of TB lesions of different types in the lungs without exact delineation. The second stage was aimed at correcting the initial lesion labeling and segmenting the lesions more precisely (see Fig. 2). Both stages of labeling were performed using an auxiliary software tool designed by the authors (see Fig. 3).

Fig. 2. Labeling stages, axial slices (top) and frontal projections (bottom): a) initial stage, rough labeling; b) second stage, more precise segmentation of lesions

The developed software tool allows labeling of 10 different types of TB lesions. Some types of lesions were well represented in the dataset, whilst lesions of some other types (Pleuritis, Atelectasis, Pneumothorax) were present in only a few images. The lesion types and the corresponding frequencies of occurrence in the dataset images are shown in Table 1. As a result of the labeling process, 3D masks with the corresponding lesion indexes were obtained.

Fig. 3. Screenshot of the developed software tool for lesion segmentation

Table 1.
Presence of lesions of different types in the dataset

Index  Type of lesion            Number of images
1      Focus < 10 mm             140
2      Focus 10–30 mm            38
3      Infiltrate                26
4      Focus 0–30 mm (mix)       85
5      Focus + Infiltrate (mix)  30
6      Caverns                   81
7      Fibrosis                  56
8      Pleuritis                 13
9      Atelectasis               7
10     Pneumothorax              4

2.2 Segmentation of lung regions

For extraction of lung regions for both lesion detection and the ImageCLEFtuberculosis subtasks, an in-house implementation of a conventional segmentation-by-registration approach [11] was employed instead of the one provided by the organizers. The method used 130 reference CT scans with manually segmented lungs. Projections along the X, Y and Z axes are calculated for each reference CT scan. The three normalized projections are concatenated into a quantitative descriptor of the reference image. For a target CT scan, a similarity measure between the target image and each reference image is calculated from these descriptors, and the 5 most similar reference images are selected. The selected images, along with the corresponding lung masks, are non-rigidly registered to the target image using the ’elastix’ software tool [7], and the final segmentation mask is obtained by averaging. The implemented method is highly robust to the presence of large lesions in the lungs (see Fig. 4).

Fig. 4. Example slices of CT images with segmented lungs

2.3 Training the Convolutional Neural Network

One possible way to apply Deep Learning algorithms to 3D images is to operate at the slice level, representing each 3D CT image as a set of 2D slices. An advantage of this approach is relatively low memory usage, since the large 3D image is processed slice by slice. In the current study, 2D image regions of 128×128 pixels were extracted from slices of the original CT images with a 64-pixel stride.
Three neighboring slices were used to compose a single RGB image in order to exploit spatial information along the Z axis of the original CT images. Finally, the image regions were up-sized to 256×256 pixels using bicubic interpolation. The up-sizing was performed to improve the detection of small lesions, since the first convolutional layer of the network used (AlexNet) has a 4-pixel stride and some lesions in the images are only 2–3 pixels in size.

Of the 198 labeled 3D scans, 149 were used for training the algorithms and the remaining 49 for validation. Lesion types with indexes 1–5 were merged into one class “Foci”, as they have a similar nature and/or are mixtures of classes. From the 149 training CT images, 268,278 2D image tiles were extracted. For each tile, a corresponding label image was composed from the manually labeled lesion data (see Fig. 5). Image regions lying outside the lung segmentation masks are marked with a special “don’t care” label. The neural network ignores these regions at both the training and validation stages, which focuses the available computational resources on the actual regions of interest. On the label images such regions are shown in gray.

For segmentation of lesions in 2D slice regions, a Fully Convolutional Network based on AlexNet [10] was used. To increase the convergence rate and overall accuracy, a publicly available ILSVRC2012-trained model was used to initialize the network’s weights. The network was set to recognize multiple lesion types at the same time.

Fig. 5. Examples of 2D slice regions (top row) and the corresponding label images (bottom row)

Training was performed on a personal computer equipped with an Intel i7-6700K CPU and a dedicated Nvidia TITAN X GPU with 3072 CUDA cores and 12 GB of GDDR5 onboard memory. The NVIDIA DIGITS interface and the Caffe framework were used.
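
The tile preparation described in this section (128×128 regions, 64-pixel stride, three neighboring slices stacked as channels) can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function name `extract_tiles`, the (z, y, x) axis order and the handling of border slices are assumptions made for illustration, and the bicubic up-sizing to 256×256 is left out to keep the example dependency-free.

```python
import numpy as np

TILE = 128    # tile side in pixels, as stated in the text
STRIDE = 64   # stride between tile origins, as stated in the text


def extract_tiles(volume, tile=TILE, stride=STRIDE):
    """Yield (z, y, x, patch) tuples, where patch has shape (tile, tile, 3).

    volume is a 3D array indexed as (z, y, x) (assumed axis order).
    The three channels are slices z-1, z and z+1, so the first and last
    slices are skipped (an assumption; border handling is not specified).
    """
    depth, height, width = volume.shape
    for z in range(1, depth - 1):
        for y in range(0, height - tile + 1, stride):
            for x in range(0, width - tile + 1, stride):
                block = volume[z - 1:z + 2, y:y + tile, x:x + tile]
                # Move the slice axis last: (3, tile, tile) -> (tile, tile, 3)
                yield z, y, x, np.moveaxis(block, 0, -1)


# Synthetic stand-in for a CT volume: 5 slices of 256x256 pixels
ct = np.random.default_rng(0).normal(size=(5, 256, 256)).astype(np.float32)
tiles = list(extract_tiles(ct))
print(len(tiles), tiles[0][3].shape)  # 27 (128, 128, 3)
```

With a 64-pixel stride neighboring tiles overlap by half, so each interior pixel is covered by several tiles; how overlapping network outputs are merged into the probability map is not specified here.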
The network training parameters were set to the following values: number of epochs = 60, activation function = ReLU, batch size = 64, solver type = SGD (Caffe solver). The learning rate was set to 0.001 for the first 20 epochs, 0.0001 for the next 20 and 0.00001 for the last 20.

2.4 Obtaining probability maps

Once training is finished, the trained network model can be used to detect lesions in an arbitrary 3D CT scan. The CT image undergoes the same procedures as the training images:
– segmentation of lung regions;
– extraction of 2D tiles;
– processing the tiles with the trained CNN and obtaining probability maps for each lesion type considered;
– collecting the obtained 2D probability maps into 3D probability maps for each lesion type separately.

Additionally, the probability maps can be smoothed to reduce the number of falsely detected lesions, or thresholded so that all probability values below a minimum allowed value are set to zero. Fig. 6 shows the lesions detected on test CT scans. Lesion regions were obtained from the corresponding probability maps by thresholding with P_thres = 0.5. The resulting lesion regions are colored according to the colormap of Fig. 5.

Fig. 6. Detected lesions on test CT images: frontal projections (top row) and axial slices (bottom row)

2.5 Building TB-descriptor

Once the probability maps are built, the TB-descriptor proposed in this study is constructed as follows. The lung region of the CT image is divided into 6 parts as shown in Fig. 7. The parts have equal height along the Z axis. For every lesion type, its presence in each of the six parts is calculated as the sum of probabilities over the voxels of that part divided by the number of lung voxels within the part. Since all probabilities lie between 0 and 1, the lesion presence score for each part is also a number from 0 to 1.
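
The presence score defined above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the six parts are assumed to be the left and right lungs, each split into three equal-height bands along Z, and separate left and right lung masks are assumed to be available. The function name `tb_descriptor` and the array layouts are hypothetical.

```python
import numpy as np


def tb_descriptor(prob_maps, left_mask, right_mask, n_z_parts=3):
    """Return presence scores of shape (n_lesion_types, 2 * n_z_parts).

    prob_maps:  float array (n_lesion_types, Z, Y, X) with values in [0, 1]
    left_mask, right_mask: boolean arrays (Z, Y, X), one per lung (assumed)
    """
    lungs = left_mask | right_mask
    z_idx = np.flatnonzero(lungs.any(axis=(1, 2)))
    # Equal-height bands between the first and last slice containing lung
    edges = np.linspace(z_idx[0], z_idx[-1] + 1, n_z_parts + 1).round().astype(int)

    scores = []
    for side in (left_mask, right_mask):
        for lo, hi in zip(edges[:-1], edges[1:]):
            part = np.zeros_like(side)
            part[lo:hi] = side[lo:hi]
            n_vox = part.sum()
            if n_vox == 0:
                # No lung voxels in this part: score defined as 0 (assumption)
                scores.append(np.zeros(prob_maps.shape[0]))
            else:
                # Sum of probabilities in the part / number of lung voxels in it
                scores.append(prob_maps[:, part].sum(axis=1) / n_vox)
    return np.stack(scores, axis=1)


# Toy example: 2 lesion types on a 6x4x4 volume split into left and right halves
probs = np.zeros((2, 6, 4, 4))
probs[0] = 1.0        # type 0 present everywhere
probs[1, :2] = 0.5    # type 1 present only in the first Z band
left = np.zeros((6, 4, 4), dtype=bool)
left[:, :, :2] = True
right = ~left
descriptor = tb_descriptor(probs, left, right)
print(descriptor.shape)  # (2, 6)
```

Flattening the result with `descriptor.ravel()` gives one feature vector per CT scan.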
Finally, the presence scores obtained for each lesion type and each lung part are concatenated into a single TB-descriptor of size N_lesion_types × N_parts. Thus, the proposed TB-descriptor indicates the presence of lesions of certain types in different parts of the lungs: upper left, middle right, etc. The portion of the affected lung volume is captured as well. This TB-descriptor was used for recognizing drug resistance status, type and severity of tuberculosis in the ImageCLEF challenge subtasks.

3 Submissions and results

For all the ImageCLEFtuberculosis subtasks the following prediction scheme was used:
– segmentation of lung regions for each CT image;
– detection of lesions;
– calculation of TB-descriptors for each image;
– prediction of the desired values using a suitable classifier.

Fig. 7. Lungs region partitioning

The subtasks of ImageCLEFtuberculosis required different types of predictions: multiple-class prediction, where only the index of the predicted class must be provided; two-class prediction, where the probability of belonging to the positive class must be provided as well; and regression, where the method needs to predict the value of a continuous variable as precisely as possible. For all three subtasks, a Random Forests model was used, which can handle all of the above-mentioned prediction types. The performance of the algorithms was assessed on the training data using k-fold cross-validation with k = 5.

3.1 Subtask #1: MDR detection

Following the above-mentioned prediction scheme, TB-descriptors were calculated for all the available CT images. A Random Forests classifier was trained on the set of TB-descriptors concatenated with meta-data values: patients’ age and gender. Based on a series of experiments, the number of trees in the classifier was set to 150 for this subtask. Accuracy assessment within 5-fold cross-validation yielded an Area Under the ROC Curve (AUC) value of 0.6385.
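
The cross-validated AUC estimate described above can be reproduced in outline with scikit-learn. The sketch below is an assumption-laden illustration, not the authors' pipeline: the data are synthetic, the descriptor length of 36 is a placeholder, the folds are stratified and shuffled (the text only states k = 5), and the helper name `cv_auc` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def cv_auc(features, labels, n_trees=150, n_folds=5, seed=0):
    """AUC of out-of-fold class-1 probabilities from a Random Forest."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    proba = cross_val_predict(clf, features, labels, cv=folds,
                              method="predict_proba")
    return roc_auc_score(labels, proba[:, 1])


# Synthetic stand-in data (NOT the challenge data)
rng = np.random.default_rng(0)
n_cases = 200
labels = rng.integers(0, 2, size=n_cases)          # 1 = drug resistant (toy labels)
descriptors = rng.random((n_cases, 36))            # placeholder TB-descriptors
descriptors[:, :3] += 0.5 * labels[:, None]        # inject a weak artificial signal
age = rng.integers(18, 80, size=n_cases)           # meta-data: age
gender = rng.integers(0, 2, size=n_cases)          # meta-data: gender, coded 0/1
features = np.column_stack([descriptors, age, gender])

auc = cv_auc(features, labels)
print(f"5-fold cross-validated AUC: {auc:.4f}")
```

Because every prediction comes from a model that did not see that case during training, the resulting AUC is an out-of-sample estimate, although it can still be optimistic relative to a separate test set.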
One run was submitted as the result of prediction on the test data. A total of 39 runs were submitted by 7 participating groups for the MDR detection subtask. Table 2 shows the top-15 results in terms of AUC value. The lesion-based TB-descriptor resulted in an AUC of 0.5558 and ranked 14th among the 39 runs. The best result, achieved by the VISTA@UEvora team with an AUC of 0.6178, outperforms the previous year’s result of 0.5825 AUC. However, MDR detection performance still remains close to the level of random classification. The increase in prediction performance might be caused by the addition of more severe cases with XDR TB to the dataset and by the use of information about patients’ age and gender.

Table 2. Top-15 submitted runs with highest AUC values for MDR detection subtask.

Group Name             Run                               AUC     Rank
VISTA@UEvora           06-Mohan-SL-F3-Personal           0.6178  1
San Diego VA HCS/UCSD  MDSTest1a                         0.6114  2
VISTA@UEvora           08-Mohan-voteLdaSmoF7-Personal    0.6065  3
VISTA@UEvora           09-Sk-SL-F10-Personal             0.5921  4
VISTA@UEvora           10-Mix-voteLdaSl-F7-Personal      0.5824  5
HHU-DBS                FlattenCNN DTree                  0.5810  6
HHU-DBS                FlattenCNN2 DTree                 0.5810  7
HHU-DBS                Conv68adam fl                     0.5768  8
VISTA@UEvora           07-Sk-LDA-F7-Personal             0.5730  9
UniversityAlicante     MDRBaseline0                      0.5669  10
HHU-DBS                Conv48sgd                         0.5640  11
HHU-DBS                Flatten                           0.5637  12
HHU-DBS                Flatten3                          0.5575  13
UIIP BioMed            TBdescs2 zparts3 thrprob50 rf150  0.5558  14
UniversityAlicante     testSVM SMOTE                     0.5509  15

3.2 Subtask #2: TBT classification

For the TB type classification subtask, a similar procedure was carried out, with the difference that the Random Forests classifier was trained for multiple image classes. The number of trees for this subtask was set to 150. Instead of using all the available data, only the first CT scan of each patient was used, both for training the algorithms and for the final prediction of the patient’s TB class.

In total, 39 runs were submitted by 8 participating groups for the TB type classification subtask.
The results were evaluated and ranked by accuracy and by Cohen’s Kappa coefficient [1], which is preferable for unbalanced datasets. Among the submitted runs, our method based on lesion detection demonstrated the best TB type recognition performance in terms of both Kappa coefficient (0.2312) and accuracy (0.4227) (see Table 3). Compared to 2017, the overall TB type classification results are less accurate. This is probably caused by the increased imbalance between TB types. Using more than one CT scan per patient might also confuse prediction methods and worsen the final results.

Table 3. Top-15 submitted runs with highest Kappa values for TB type subtask.

Group Name             Run                               Kappa   Rank
UIIP BioMed            TBdescs2 zparts3 thrprob50 rf150  0.2312  1
fau ml4cv              m4 weighted                       0.1736  2
MedGIFT                AllFeats std euclidean TST        0.1706  3
MedGIFT                Riesz AllCols euclidean TST       0.1674  4
VISTA@UEvora           02-Mohan-RF-F20I1500S20-317       0.1664  5
fau ml4cv              m3 weighted                       0.1655  6
VISTA@UEvora           05-Mohan-RF-F20I2000S20           0.1621  7
MedGIFT                AllFeats AllCols correlation TST  0.1531  8
MedGIFT                AllFeats mean euclidean TST       0.1517  9
MedGIFT                Riesz std euclidean TST           0.1494  10
San Diego VA HCS/UCSD  Submission64a                     0.1474  11
San Diego VA HCS/UCSD  TBTTask 2 128                     0.1454  12
MedGIFT                AllFeats AllCols correlation TST  0.1356  13
VISTA@UEvora           03-Mohan-RF-7FF20I1500S20-Age     0.1335  14
San Diego VA HCS/UCSD  TBTLast                           0.1251  15

3.3 Subtask #3: Severity scoring

In contrast to the two previous subtasks, the TB severity scoring subtask was evaluated in two fundamentally different ways. The first used the original severity scores from 1 to 5 as provided by the doctors, and the task for participants was to predict those numerical scores as precisely as possible. Here, the Root Mean Square Error (RMSE) was computed between the ground truth and the predicted severity scores provided by participants. The goal was to achieve the lowest possible RMSE value. The other way of evaluation considered a binary classification problem.
The original severity index was transformed into two class values: cases with scores from 1 to 3 were labeled as “high severity” and cases with scores 4 and 5 as “low severity”. With this way of evaluation, participants were to provide the probabilities of TB cases belonging to the “high severity” class. The results were ranked by AUC value. The top-10 runs for both evaluation methods are shown in Tables 4 and 5.

Table 4. Top-10 submitted runs with lowest RMSE values for Severity scoring.

Group Name             Run                               RMSE    Rank
UIIP BioMed            TBdescs2 zparts3 thrprob50 rf100  0.7840  1
MedGIFT                HOG std euclidean TST             0.8513  2
VISTA@UEvora           07-Mohan-MLP-6FTT100              0.8883  3
MedGIFT                AllFeats AllCols euclidean TST    0.8883  4
MedGIFT                AllFeats AllCols correlation TST  0.8934  5
MedGIFT                HOG mean euclidean TST            0.8985  6
MedGIFT                HOG mean correlation TST          0.9237  7
MedGIFT                HOG AllCols euclidean TST         0.9433  8
MedGIFT                HOG AllCols correlation TST       0.9433  9
HHU-DBS                RanFrst                           0.9626  10

Table 5. Top-10 submitted runs with highest “low severity”/“high severity” prediction performance.

Group Name             Run                               AUC     Rank
MedGIFT                AllFeats AllCols correlation TST  0.7708  1
MedGIFT                HOG AllCols correlation TST       0.7608  2
MedGIFT                HOG mean euclidean TST            0.7443  3
MedGIFT                HOG AllCols euclidean TST         0.7268  4
MedGIFT                HOG std euclidean TST             0.7162  5
UIIP BioMed            TBdescs2 zparts3 thrprob50 rf100  0.7025  6
San Diego VA HCS/UCSD  SVRSubmission                     0.6984  7
HHU-DBS                RanFRST depth 2 Ludmila new new   0.6862  8
HHU-DBS                DTree Features Best All           0.6750  9
MedGIFT                AllFeats AllCols euclidean TST    0.6733  10

In total, 36 runs were submitted by 7 participants for this subtask. As can be seen from the tables, the lesion-based TB-descriptor proved useful for assessing TB severity, with the best result in terms of regression (minimum RMSE among all runs) and the 6th best result in terms of “low severity”/“high severity” classification. The number of trees for these experiments was set to 100.
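
The two evaluation modes used in this subtask can be made concrete with a small NumPy sketch. The scores and probabilities below are invented for illustration, the function names are hypothetical, and the AUC is computed with the standard pairwise-ranking formulation, which may differ in implementation from the organizers' evaluation script.

```python
import numpy as np


def rmse(true_scores, predicted_scores):
    """Root Mean Square Error between two sequences of severity scores."""
    t = np.asarray(true_scores, dtype=float)
    p = np.asarray(predicted_scores, dtype=float)
    return float(np.sqrt(np.mean((t - p) ** 2)))


def to_high_severity(scores):
    """Map scores 1-3 to 1 ("high severity") and scores 4-5 to 0 ("low severity")."""
    return (np.asarray(scores) <= 3).astype(int)


def auc(binary_labels, positive_scores):
    """Probability that a positive case is ranked above a negative one (ties count 0.5)."""
    y = np.asarray(binary_labels)
    s = np.asarray(positive_scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))


# Invented toy data: doctor-assigned scores, predicted scores, predicted P(high severity)
truth = [1, 2, 3, 4, 5, 5]
predicted = [2, 2, 3, 3, 5, 4]
p_high = [0.9, 0.8, 0.4, 0.6, 0.1, 0.2]

print(round(rmse(truth, predicted), 4))                  # 0.7071
print(round(auc(to_high_severity(truth), p_high), 4))    # 0.8889
```

The two metrics reward different things: RMSE penalizes the size of each numerical error, while AUC depends only on how well the predicted probabilities rank high-severity cases above low-severity ones.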
The highest binary classification performance, with an AUC value of 0.7708, was achieved by the MedGIFT group.

4 Conclusions

The results of this study allow the following conclusions to be drawn:
– The combination of the lesion-based TB-descriptor and a Random Forests classifier achieved the best performance in the TB type classification and TB severity scoring subtasks.
– Similar to the 2017 results, image-based MDR TB detection performance remains low (AUC 0.6178, accuracy 55.93%) despite the addition of XDR TB cases to the dataset and the use of information about patients’ age and gender.
– The lesion-based TB-descriptor derived from lung CT scans conveys valuable information on the patient’s state and is worth considering in CT image analysis of TB patients.
– Extending the training data for lesion detection is desirable for further improvement of computerized TB diagnosis.

In this paper, an image description and analysis method based on automatic detection of TB lesions in the lungs and composition of a TB-descriptor is presented. The method was employed by the UIIP BioMed group in all three subtasks of the ImageCLEFtuberculosis 2018 challenge.

Acknowledgements

This study was supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, U.S. Department of Health and Human Services, USA through the CRDF project DAA3-17-63599-1 “Year 6: Belarus TB Database and TB Portals”.

References

1. Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46 (1960)
2. Dicente Cid, Y., Kalinovsky, A., Liauchuk, V., Kovalev, V., Müller, H.: Overview of ImageCLEFtuberculosis 2017 - predicting tuberculosis type and drug resistances. In: CLEF2017 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Dublin, Ireland (September 11-14 2017)
3. Dicente Cid, Y., Liauchuk, V., Kovalev, V., Müller, H.: Overview of ImageCLEFtuberculosis 2018 - detecting multi-drug resistance, classifying tuberculosis type, and assessing severity score. In: CLEF2018 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Avignon, France (September 10-14 2018)
4. Ionescu, B., Müller, H., Villegas, M., Arenas, H., Boato, G., Dang-Nguyen, D.T., Dicente Cid, Y., Eickhoff, C., Garcia Seco de Herrera, A., Gurrin, C., Islam, B., Kovalev, V., Liauchuk, V., Mothe, J., Piras, L., Riegler, M., Schwall, I.: Overview of ImageCLEF 2017: Information extraction from images. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction 8th International Conference of the CLEF Association, CLEF 2017. Lecture Notes in Computer Science, vol. 10456. Springer, Dublin, Ireland (September 11-14 2017)
5. Ionescu, B., Müller, H., Villegas, M., de Herrera, A.G.S., Eickhoff, C., Andrearczyk, V., Cid, Y.D., Liauchuk, V., Kovalev, V., Hasan, S.A., Ling, Y., Farri, O., Liu, J., Lungren, M., Dang-Nguyen, D.T., Piras, L., Riegler, M., Zhou, L., Lux, M., Gurrin, C.: Overview of ImageCLEF 2018: Challenges, datasets and evaluation. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), LNCS Lecture Notes in Computer Science, Springer, Avignon, France (September 10-14 2018)
6. Kalinovsky, A., Liauchuk, V., Tarasau, A.: Lesion detection in CT images using Deep Learning semantic segmentation technique. In: International Workshop “Photogrammetric and computer vision techniques for video surveillance, biometrics and biomedicine”. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XLII, pp. 13–17. Moscow, Russia (May 2017). https://doi.org/10.5194/isprs-archives-XLII-2-W4-13-2017, http://www.int-arch-photogramm-remote-sens-spatial-inf-sci.net/XLII-2-W4/13/2017/
7. Klein, S., Staring, M., Murphy, K., Viergever, M.A., Pluim, J.P.: Elastix: a toolbox for intensity-based medical image registration. IEEE Transactions on Medical Imaging 29(1), 196–205 (2010)
8. Kovalev, V., Liauchuk, V., Kalinouski, A., Rosenthal, A., Gabrielian, A., Skrahina, A., Astrauko, A., Tarasau, A.: Utilizing radiological images for predicting drug resistance of lung tuberculosis. In: Computer Assisted Radiology - 27th International Congress and Exhibition (CARS-2015). vol. 10, pp. 129–130. Springer, Barcelona (2015)
9. Kovalev, V., Liauchuk, V., Safonau, I., Astrauko, A., Skrahina, A., Tarasau, A.: Is there any correlation between the drug resistance and structural features of radiological images of lung tuberculosis patients? In: Computer Assisted Radiology - 27th International Congress and Exhibition (CARS-2013). vol. 8, pp. 18–20. Springer, Heidelberg (2013)
10. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 640–651 (April 2017). https://doi.org/10.1109/TPAMI.2016.2572683
11. Sluimer, I., Prokop, M., van Ginneken, B.: Toward automated segmentation of the pathological lung in CT. IEEE Transactions on Medical Imaging 24(8), 1025–1038 (Aug 2005). https://doi.org/10.1109/TMI.2005.851757
12. Wang, Y.X.J., Chung, M.J., Skrahin, A., Rosenthal, A., Gabrielian, A., Tartakovsky, M.: Radiological signs associated with pulmonary multi-drug resistant tuberculosis: an analysis of published evidences. Quantitative Imaging in Medicine and Surgery 8(2), 161–173 (2018)