Classification of Tuberculosis Type on CT Scans of Lungs using a Fusion of 2D and 3D Deep Convolutional Neural Networks

Emad Aghajanzadeh¹, Behzad Shomali¹, Diba Aminshahidi¹ and Navid Ghassemi¹
¹ Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran

Abstract
In this paper, we present a novel deep-learning-based method for dealing with volumetric data such as CT scans. The method ensembles a 2-dimensional convolutional neural network (2D-CNN) with a 3D-CNN followed by a recurrent neural network (RNN). We used this approach and its constituent models to solve the task of categorizing tuberculosis type in the context of ImageCLEF 2021. Our best run ranked 4th on the Kappa metric with a value of 0.181 and 3rd on accuracy with 0.404. It is also worth mentioning that our results were very close to those of the third team, which reached a Kappa of 0.190, while there was a clear gap to the fifth team, with a Kappa of 0.140.

Keywords
Deep Learning, Information Fusion, Tuberculosis, CT Scan, Diagnosis, Volumetric Data

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
emad.aghajanzadeh@mail.um.ac.ir (E. Aghajanzadeh); behzad.shomali@mail.um.ac.ir (B. Shomali); d.aminshahidi@mail.um.ac.ir (D. Aminshahidi); navidghassemi@mail.um.ac.ir (N. Ghassemi)

1. Introduction
Tuberculosis (TB) is an airborne disease that usually affects the lungs and causes severe coughing, chest pains, and fever. The disease is still one of the main health concerns worldwide, ranking second among causes of high mortality [1]. Approximately 10.0 million people around the world caught TB in 2019, according to the WHO [World Health Organization. Global tuberculosis report 2020. Geneva, Switzerland: World Health Organization; 2020]. A CT scan, or Computerized Tomography scan, is a versatile medical imaging modality that uses computers and rotating X-ray machines to create cross-sectional images of the patient's body. By increasing the number of detector rows along the z-axis, the whole organ can be imaged at once, which reduces acquisition time. CT also offers several advantages, including improved image quality, reduced radiation exposure, and clear depiction of the soft tissues, blood vessels, and bones of the patient's body [2, 3]. Despite all these advantages, CT-scan-based diagnosis faces challenges in terms of the variety of images, their size, and the complexity of the diagnostic process itself. Moreover, factors such as eye fatigue and the large number of patients lead to human error [4]. These challenges have motivated researchers to use Artificial Intelligence (AI) to create automated diagnosis systems that increase the accuracy of medical diagnosis on CT scans [5, 6]. In recent years, Deep Learning (DL), a sub-field of AI, has shown encouraging results in medical diagnosis [7, 8]. In this paper, we present a strategy based on deep learning approaches to detect the type of TB disease, in the context of the ImageCLEF tuberculosis task [9, 10]. ImageCLEF 2021 is an evaluation campaign organized as part of the CLEF initiative labs.
The campaign offers several research tasks that welcome participation from teams around the world. In 2021, there were three medical subtasks, one of which was Tuberculosis CT analysis, in which we participated. The task is to classify CT scans into five classes based on their TB type. Besides the dataset, the organizers also provided two versions of extracted masks for each lung [11, 12]. We analyzed the effectiveness of three main approaches based on Convolutional Neural Networks (CNN) [13] for this task. The first is to use a 2-dimensional convolutional neural network (2D-CNN) to learn slice-level features and then obtain the final prediction label through one of several strategies, such as majority voting or the most certain prediction. The second approach is to utilize a 3-dimensional convolutional neural network (3D-CNN) to capture the spatial features that are not extracted by the 2D-CNN. Finally, the last approach is to combine the 2D-CNN and 3D-CNN models to obtain a model that benefits from both the slice-level and the inter-slice features.

The rest of the paper is organized as follows. Section 2 describes the competition. Section 3 introduces the preprocessing steps that were used. Section 4 explains the proposed method, whose results are presented in Section 5. Finally, Section 6 concludes the paper with directions for future work.

2. ImageCLEF Tuberculosis: Task, Data, Evaluation
The tuberculosis task of the ImageCLEF 2021 challenge was to categorize each TB case by type into five categories: Infiltrative, Focal, Tuberculoma, Miliary, and Fibro-cavernous. Figure 1 illustrates one example of each TB type.¹ The dataset contains 1338 CT images stored in the NIfTI (Neuroimaging Informatics Technology Initiative) format with a resolution of 512 × 512 pixels and around 100 slices per scan. The file format stores raw voxel intensities in Hounsfield Units (HU). The training dataset consists of 917 CTs, each of which belongs to exactly one of the five classes; hence, the task is a multi-class classification (see Table 1).

Table 1
The number of training samples for each of the five TB types.

Type              # of samples
Infiltrative      419
Focal             226
Tuberculoma       101
Miliary           101
Fibro-cavernous   70
Total             917

¹ https://www.imageclef.org/2021/medical/tuberculosis

Figure 1: Examples of the five types of TB.

The results are evaluated using unweighted Cohen's Kappa [14] and accuracy, but the primary ranking is based on the Kappa metric only.

3. Preprocessing
As the dataset files were provided in the NIfTI format with the extension .nii, we used the Nibabel package² in Python³ to load the dataset. Following this, the HU values of each CT scan are thresholded to the window between -1000 and 400 and then scaled to lie between 0 and 1. One of the major outcomes of this normalization is reducing the contrast variation among the data (see Figure 2). The volumes are then rotated by 90 degrees so that their orientation is fixed. We did not use the masks provided by the task organizers. In most cases, the first and last slices do not contain features that are beneficial for a model [15]; therefore, we selected only the 50 middle slices and removed the rest. This number was chosen empirically by testing a few alternatives. This choice also saves computational resources, helping us to search through different models and settings more efficiently. Moreover, for the same reason, we resized each slice to 100 × 100, so that we finally had a set of 100 × 100 × 50 CT scans.

² https://github.com/nipy/nibabel
³ https://github.com/python
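A minimal sketch of this preprocessing pipeline is given below, assuming Nibabel, NumPy, and SciPy; reading the threshold as clip-then-scale and the use of linear interpolation for resizing are our choices for illustration.

```python
# Minimal sketch of the preprocessing described above: load a NIfTI volume,
# threshold HU values to [-1000, 400], scale to [0, 1], fix the orientation,
# keep the 50 middle slices, and resize each slice to 100 x 100.
# Interpolation settings are illustrative choices.
import nibabel as nib
import numpy as np
from scipy import ndimage

HU_MIN, HU_MAX = -1000, 400
N_SLICES, TARGET_SIZE = 50, 100

def preprocess_ct(path):
    volume = nib.load(path).get_fdata()              # (512, 512, ~100), HU
    volume = np.clip(volume, HU_MIN, HU_MAX)         # threshold the HU range
    volume = (volume - HU_MIN) / (HU_MAX - HU_MIN)   # scale to [0, 1]
    volume = np.rot90(volume, axes=(0, 1))           # fix the orientation
    mid = volume.shape[2] // 2                       # keep 50 middle slices
    volume = volume[:, :, mid - N_SLICES // 2:mid + N_SLICES // 2]
    factors = (TARGET_SIZE / volume.shape[0], TARGET_SIZE / volume.shape[1], 1)
    return ndimage.zoom(volume, factors, order=1)    # (100, 100, 50)
```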
Figure 3 shows the slices of a CT scan after the preprocessing phase.

Figure 2: Examples of slices with markedly different contrast.
Figure 3: Illustration of the 50 selected slices of a CT scan after preprocessing.

4. Proposed Method
In recent years, Deep Neural Networks (DNNs) have shown great performance in various tasks, and medical diagnosis has been no exception [16, 17]. More specifically, Convolutional Neural Networks (CNNs), a well-known deep learning architecture inspired by the visual perception mechanism of living creatures, have been used to solve many image processing tasks. The architecture takes its name from convolution, a mathematical operation that maps input data into a new space. The main advantage of a CNN is that its kernels can automatically extract important features from the input data, such as edges and color distributions in an image, which other networks are unable to do; this makes CNNs very robust in tasks like image classification. However, despite all the aforementioned advantages, CNN models are data-hungry [18], which makes them less useful when not enough data are available, as is common in medical tasks. Moreover, there are further challenges in training these models properly, such as imbalanced data [19]. In our attempt to use CNNs to categorize the CT images, we remedied the mentioned problems as follows:

• Small dataset: We used data duplication as well as data augmentation techniques to increase our training data. For augmentation, a rotation angle between -5 and 5 degrees was randomly chosen and applied to each sample; we then zoomed each sample by a ratio of 1.25 and resized it back to its original size. By putting resizing at the last step, we ensured that image quality was preserved during augmentation (a code sketch of this procedure follows Table 2).
• Imbalanced data: As shown in Table 1, the number of samples varies dramatically across classes. To overcome this issue, two distinct approaches were used. The first is to assign different penalties to the classes, that is, the error multiplier becomes larger for classes with fewer samples. The second is to remove samples from the classes with more data; for this purpose, we removed a large number of samples of classes 0 and 1 from the dataset.

By doing these steps, the training dataset became larger and more balanced (see Table 2).

Table 2
The number of training samples after the preprocessing step.

Type              # of samples
Infiltrative      376
Focal             262
Tuberculoma       342
Miliary           250
Fibro-cavernous   250
Total             1480
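The sketch below illustrates the two remedies; reading "resized to its original size" as a zoom followed by a center crop, and inverse-frequency class penalties, are our assumptions rather than confirmed implementation details.

```python
# Hedged sketch of the augmentation (rotation in [-5, 5] degrees, zoom by
# 1.25, then back to the original size) and of per-class loss penalties.
# The zoom-then-center-crop reading is an assumption; a plain rescale back
# to the original size is another possible reading.
import numpy as np
from scipy import ndimage

def augment_slice(img, rng=np.random.default_rng()):
    angle = rng.uniform(-5.0, 5.0)                       # random rotation
    img = ndimage.rotate(img, angle, reshape=False, order=1)
    h, w = img.shape
    zoomed = ndimage.zoom(img, 1.25, order=1)            # zoom by 1.25
    top, left = (zoomed.shape[0] - h) // 2, (zoomed.shape[1] - w) // 2
    return zoomed[top:top + h, left:left + w]            # back to (h, w)

def class_penalties(counts):
    """Larger loss multipliers for classes with fewer samples
    (inverse-frequency weighting, an assumed choice)."""
    counts = np.asarray(counts, dtype=float)
    weights = counts.sum() / (len(counts) * counts)
    return weights / weights.mean()
```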
In the proposed method, we mainly used two different approaches, 2D-CNN-based and 3D-CNN-based, with the following training settings:

• Optimizer: The Adam optimizer with default parameters (alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-7) [20] was used for all epochs of the 2D models and the first 30 epochs of the 3D models. For the remaining epochs of the 3D models, we decayed the learning rate by a ratio of 1/2.
• Train/validation split: The dataset was split into training and validation partitions with a validation ratio of 0.2.
• Batch size: Due to memory constraints, we had to keep the batch size small; we therefore set the batch sizes to 32 and 8 for the 2D and 3D models, respectively.
• Loss function: A combination of cross-entropy and weighted Kappa [21], with multipliers of 0.7 and 0.3, was used for all epochs of the 2D models and the first 30 epochs of the 3D models. The contribution ratio of the losses was then changed to 0.85 and 0.15 for the last 10 epochs of the 3D models (a code sketch of this combined loss follows this list). The cross-entropy loss is defined as

L_{CE} = -\sum_{i=1}^{C} t_i \log(p_i)   (1)

where C is the number of classes, t_i is the ground truth, and p_i is the predicted probability for the i-th class. The weighted Kappa, with the matrix of observed scores O, the matrix of expected scores based on chance agreement E, and the weight matrix ω, is defined as

\kappa = 1 - \frac{\sum_{i,j} \omega_{i,j} O_{i,j}}{\sum_{i,j} \omega_{i,j} E_{i,j}}, \quad \forall i, j \in \{1, 2, \ldots, C\}   (2)

where O_{i,j} is the number of observations predicted as class i whose true class is j, E is the outer product of the prediction and ground-truth vectors, and ω_{i,j} represents the weight penalization for every pair (i, j).

It is worth mentioning that all of our experiments were run on Google Colab [22].
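A minimal sketch of the combined objective follows, assuming TensorFlow/Keras, one-hot labels, and a quadratic weight matrix in the spirit of [21]; these framework and weighting choices are illustrative assumptions. Since κ in Eq. (2) equals one minus the weighted ratio, minimizing the ratio itself maximizes κ.

```python
# Hedged sketch of the combined loss: 0.7 * cross-entropy + 0.3 * weighted
# kappa. TensorFlow/Keras, one-hot float labels, a soft (differentiable)
# confusion matrix, and quadratic weights omega[i][j] = (i - j)^2 are all
# assumed choices, not confirmed implementation details.
import tensorflow as tf

N_CLASSES = 5
_idx = tf.range(N_CLASSES, dtype=tf.float32)
OMEGA = tf.square(_idx[:, None] - _idx[None, :])   # quadratic penalties

def weighted_kappa_loss(y_true, y_pred, eps=1e-7):
    """Differentiable form of Eq. (2): sum(w * O) / sum(w * E)."""
    observed = tf.matmul(y_true, y_pred, transpose_a=True)      # soft O (C, C)
    expected = tf.matmul(tf.reduce_sum(y_true, axis=0, keepdims=True),
                         tf.reduce_sum(y_pred, axis=0, keepdims=True),
                         transpose_a=True) / tf.reduce_sum(y_true)  # outer E
    return (tf.reduce_sum(OMEGA * observed)
            / (tf.reduce_sum(OMEGA * expected) + eps))

def combined_loss(y_true, y_pred, w_ce=0.7, w_kappa=0.3):
    ce = tf.reduce_mean(
        tf.keras.losses.categorical_crossentropy(y_true, y_pred))
    return w_ce * ce + w_kappa * weighted_kappa_loss(y_true, y_pred)
```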
4.1. 2D
In this method, we examined each slice of a single CT individually. To be more specific, in the training phase we assigned the label of each CT to all of its slices and fed each slice to the 2D-CNN separately. In this case, for each CT we obtain a vector of 50 predicted labels (one per slice) as output. To obtain the final prediction for each CT in the testing phase, we used two different approaches:

• pick the label that appears most often (majority voting);
• pick the label whose corresponding probability is the highest, i.e., the label the model is most certain about.

To configure the hyper-parameters, we examined various settings, such as using skip connections, changing the number of neurons in the last hidden layer, using different activation functions, and varying the kernel size of the convolution layers, and selected our final network empirically. We obtained the best result with the model shown in Figure 4.

Figure 4: Illustration of the 2D model.

The learning curve of this model is illustrated in Figure 5. For the evaluation of the model, we used accuracy, Kappa, and F1-score on the validation data, reported in Table 3.

Figure 5: Learning curve of the 2D-CNN model.

Table 3
Evaluation results of the 2D-CNN model.

Criteria   Score
Accuracy   0.614
Kappa      0.509
F1-score   0.618

4.2. 2D + RNN
In this method, we used the best trained 2D-CNN found in the previous section, but in order to build a more accurate classifier, we added a simple Recurrent Neural Network (RNN). As illustrated in Figure 6, we used the features extracted by the 2D-CNN as input to the RNN. To be more precise, we tried two different approaches:

• feeding the output of the 2D-CNN (the vector of 50 labels) to the RNN (see Figure 6a);
• feeding the features extracted by the last hidden layer of the 2D-CNN to the RNN (see Figure 6b).

We tried different settings for the RNN, such as different architectures, including long short-term memory (LSTM) [23] and gated recurrent units (GRU) [24], as well as different numbers of units. Despite all these efforts, the accuracy obtained with this method was almost in the same range as the 2D-CNN, and neither was superior to the other. Therefore, we decided not to submit the results of this approach.

(a) Feeding the output of the 2D CNN to the RNN. (b) Feeding the extracted features of the 2D CNN to the RNN.
Figure 6: Two ways of employing an RNN model after the 2D CNN.

4.3. 3D
Generally, 2D-CNNs are unable to capture the information that exists among slices, i.e., spatial information. This is because they take a single slice as input and the learning process is applied to each slice individually, so some of the spatial information may be lost in the process. In contrast, the input of a 3D-CNN is a 3D matrix with dimensions of height, width, and depth, and the kernel slides over all three dimensions. This property enables 3D-CNNs to capture the spatial information between slices. For this reason, we used a 3D model consisting of 5 convolution layers, as shown in Figure 7; its learning curve is displayed in Figure 8. The evaluation results of the model are listed in Table 4.

Figure 7: Illustration of the proposed 3D model.
Figure 8: Learning curve of the 3D-CNN model.

Table 4
Evaluation results of the 3D-CNN model.

Criteria   Score
Accuracy   0.646
Kappa      0.547
F1-score   0.656

4.4. 3D + Transfer learning
In order to make use of pre-trained networks, we designed a model that first transforms the input images into three-channel images using convolution layers and then feeds them to a pre-trained model. In this experiment, we used ResNet [25], VGG16 [26] and EfficientNet [27], none of which obtained better results than the 3D model itself.

4.5. Fusion of 2D and 3D with RNN
A 3D-CNN can capture the spatial information among CT slices, while a 2D-CNN can better extract 2D features within each slice. We assumed that ensembling these two models could result in a model with both advantages. Therefore, we first put the features of all slices together and then concatenated them with the features obtained by the 3D-CNN model. This forms a feature vector for each CT image, which is then passed to the RNN model, as shown in Figure 9. The result was similar to that of the 3D model, which implies that the contribution of the 3D model to the learning process dominates that of the other model.

Figure 9: Illustration of the fused 2D and 3D model.

4.6. Fusion of 2D and 3D to use the best of both
After investigating the confusion matrices obtained from the 3D and 2D models, we noticed that the 3D model separates the last three classes well but cannot properly categorize the first two, while this pattern was completely reversed for the 2D model. Therefore, we decided to select the final predictions for classes 1 and 2 on the test data manually from the predictions of both models (a sketch of this rule follows). The result was similar to that of the 3D model, which shows that the 2D model did not help the 3D model in the prediction of classes 1 and 2.
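Although this selection was performed manually, the underlying rule could be sketched as follows; the zero-based indices for the first two classes and the function names are illustrative assumptions.

```python
# Hedged sketch of the selection rule in Section 4.6: prefer the 2D model's
# prediction when it votes for one of the first two classes (where it
# separated cases better), and the 3D model's prediction otherwise. The
# authors applied this selection manually; this automated form is only an
# illustration.
import numpy as np

def fuse_predictions(pred_2d, pred_3d, preferred_classes=(0, 1)):
    """pred_2d, pred_3d: arrays of predicted class indices, one per CT."""
    pred_2d, pred_3d = np.asarray(pred_2d), np.asarray(pred_3d)
    fused = pred_3d.copy()
    mask = np.isin(pred_2d, preferred_classes)   # trust 2D on classes 0-1
    fused[mask] = pred_2d[mask]
    return fused
```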
5. Comparative results
Table 5 shows the results obtained on the test data of the competition. As can be seen, the 2D-CNN model obtained scores of 0.036 Kappa and 0.373 accuracy, which is the worst Kappa and the second-best accuracy among all our submissions. The Kappa increases by 0.02 when the RNN module is added on top of the 2D-CNN, while the accuracy decreases by 0.031. Furthermore, the 3D-CNN model obtained 0.181 Kappa and 0.404 accuracy, our best score on both metrics. These figures drop to 0.136 and 0.371 when the 2D features are added and the final prediction is selected manually where the two models conflict on the first two classes.

Table 5
Results obtained on the test data.

Method             Kappa   Acc
2D                 0.036   0.373
2D + RNN           0.056   0.342
3D                 0.181   0.404
3D + 2D + Manual   0.136   0.371

6. Conclusion and Future Works
In this paper, we have described our proposed method for the tuberculosis task of ImageCLEF 2021. We proposed three different approaches and analyzed their results. The results show that the 2D-CNN did not work well, although we believe it can be significantly improved by applying a smarter voting mechanism for producing the final label, such as applying Gaussian distribution normalization or moving a window of fixed size k through the vector of 50 labels to pick the final label (sketched below). With a better 2D-CNN at hand, we can expect improvements when ensembling the 2D and 3D-CNN models. Moreover, comparing the results obtained on the validation data (Tables 3 and 4) with the results on the test data (Table 5) shows a considerable gap between them, which may be caused by different distributions of the validation and test datasets. To resolve this issue, we suggest exchanging the order of duplicating and splitting the data; this guarantees that there is no overlap between training and validation data. We also plan to employ more sophisticated data augmentation approaches so that the models can learn and generalize more robustly. During our experiments, we also found that most of the models' incorrect predictions were caused by confusing the first two classes with each other. With this in mind, the problem could be addressed by training a separate binary classifier on those two classes.
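As a pointer for this future direction, the windowed voting we have in mind could look like the following; the window size k and the agreement criterion are illustrative assumptions, since this is a proposal rather than a reported method.

```python
# Hedged sketch of the proposed sliding-window voting: move a window of
# fixed size k over the vector of 50 per-slice labels and return the label
# of the window with the strongest agreement. This is a future-work idea,
# not a method used in the submitted runs.
from collections import Counter

def windowed_vote(slice_labels, k=5):
    best_label, best_count = None, -1
    for start in range(len(slice_labels) - k + 1):
        label, count = Counter(slice_labels[start:start + k]).most_common(1)[0]
        if count > best_count:
            best_label, best_count = label, count
    return best_label
```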
References
[1] N. Fogel, Tuberculosis: a disease without boundaries, Tuberculosis 95 (2015) 527–531.
[2] M. Fu, S.-L. Yi, Y. Zeng, F. Ye, Y. Li, X. Dong, Y.-D. Ren, L. Luo, J.-S. Pan, Q. Zhang, Deep learning-based recognizing COVID-19 and other common infectious diseases of the lung by chest CT scan images, medRxiv (2020).
[3] P. T. Johnson, D. G. Heath, B. S. Kuszyk, E. K. Fishman, CT angiography with volume rendering: advantages and applications in splanchnic vascular imaging, Radiology 200 (1996) 564–568.
[4] S. Hu, Y. Gao, Z. Niu, Y. Jiang, L. Li, X. Xiao, M. Wang, E. F. Fang, W. Menpes-Smith, J. Xia, et al., Weakly supervised deep learning for COVID-19 infection detection and classification from CT images, IEEE Access 8 (2020) 118869–118883.
[5] D. L. Pham, C. Xu, J. L. Prince, Current methods in medical image segmentation, Annual Review of Biomedical Engineering 2 (2000) 315–337.
[6] A. El-Baz, G. M. Beache, G. Gimel'farb, K. Suzuki, K. Okada, A. Elnakib, A. Soliman, B. Abdollahi, Computer-aided diagnosis systems for lung cancer: challenges and methodologies, International Journal of Biomedical Imaging 2013 (2013).
[7] A. Bhandary, G. A. Prabhu, V. Rajinikanth, K. P. Thanaraj, S. C. Satapathy, D. E. Robbins, C. Shasky, Y.-D. Zhang, J. M. R. Tavares, N. S. M. Raja, Deep-learning framework to detect lung abnormality: a study with chest X-ray and lung CT scan images, Pattern Recognition Letters 129 (2020) 271–278.
[8] G. van Tulder, M. de Bruijne, Combining generative and discriminative representation learning for lung CT analysis with convolutional restricted Boltzmann machines, IEEE Transactions on Medical Imaging 35 (2016) 1262–1272.
[9] S. Kozlovski, V. Liauchuk, Y. Dicente Cid, V. Kovalev, H. Müller, Overview of ImageCLEFtuberculosis 2021: CT-based tuberculosis type classification, in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest, Romania, 2021.
[10] B. Ionescu, H. Müller, R. Peteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, S. Kozlovski, V. Liauchuk, Y. Dicente, V. Kovalev, O. Pelka, A. G. S. de Herrera, J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D. Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid, A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer, Bucharest, Romania, 2021.
[11] V. Liauchuk, V. Kovalev, ImageCLEF 2017: Supervoxels and co-occurrence for tuberculosis CT image classification, in: CLEF2017 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Dublin, Ireland, 2017.
[12] Y. Dicente Cid, O. A. Jiménez del Toro, A. Depeursinge, H. Müller, Efficient and fully automatic segmentation of the lungs in CT volumes, in: O. Goksel, O. A. Jiménez del Toro, A. Foncubierta-Rodríguez, H. Müller (Eds.), Proceedings of the VISCERAL Anatomy Grand Challenge at the 2015 IEEE ISBI, CEUR Workshop Proceedings, CEUR-WS.org, 2015, pp. 31–35.
[13] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25 (2012) 1097–1105.
[14] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37–46.
[15] M. Sohrabi, M. Parsi, S. H. Tabrizi, Statistical analysis for obtaining optimum number of CT scanners in patient dose surveys for determining national diagnostic reference levels, European Radiology 29 (2019) 168–175.
[16] M. Khodatars, A. Shoeibi, N. Ghassemi, M. Jafari, A. Khadem, D. Sadeghi, P. Moridian, S. Hussain, R. Alizadehsani, A. Zare, et al., Deep learning for neuroimaging-based diagnosis and rehabilitation of autism spectrum disorder: A review, arXiv preprint arXiv:2007.01285 (2020).
[17] A. Shoeibi, M. Khodatars, R. Alizadehsani, N. Ghassemi, M. Jafari, P. Moridian, A. Khadem, D. Sadeghi, S. Hussain, A. Zare, et al., Automated detection and forecasting of COVID-19 using deep learning techniques: A review, arXiv preprint arXiv:2007.10785 (2020).
[18] G. Marcus, Deep learning: A critical appraisal, arXiv preprint arXiv:1801.00631 (2018).
[19] Y. Sun, A. K. Wong, M. S. Kamel, Classification of imbalanced data: A review, International Journal of Pattern Recognition and Artificial Intelligence 23 (2009) 687–719.
[20] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[21] J. de la Torre, D. Puig, A. Valls, Weighted kappa loss function for multi-class classification of ordinal data in deep learning, Pattern Recognition Letters 105 (2018) 144–154.
[22] E. Bisong, Google Colaboratory, in: Building Machine Learning and Deep Learning Models on Google Cloud Platform, Springer, 2019, pp. 59–64.
[23] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[24] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).
[25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[26] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[27] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 6105–6114.