UAIC2021: Lung Analysis for Tuberculosis Classification

Alexandra Hanganu, Cristian Simionescu, Lucia-Georgiana Coca and Adrian Iftene
"Alexandru Ioan Cuza" University, Faculty of Computer Science, Iasi, Romania

Abstract
This article presents a methodology for chest CT scan analysis that enables the automatic categorization of pulmonary tuberculosis cases into one of the following five types: Infiltrative, Focal, Tuberculoma, Miliary, Fibro-cavernous. We showcase several deep learning methods for classifying tuberculosis in CT scans from the ImageCLEF 2021 Tuberculosis - TBT classification challenge. Furthermore, we explore the use of pre-trained models, as well as training from scratch on volumetric data and 2D projections.

Keywords
Computed Tomography, Tuberculosis, Deep Learning, 2D Projections, k-Means

1. Introduction

According to the NIH (National Institute of Allergy and Infectious Diseases), tuberculosis (TB) is the leading infectious cause of death worldwide. Despite technological progress, in 2017, 10 million people became ill with TB and 1.6 million people died of the disease, including 230,000 children, according to the World Health Organization. Over the past 200 years, TB has claimed the lives of more than one billion people—more deaths than those caused by malaria, influenza, smallpox, HIV/AIDS, cholera, and plague combined. Although tuberculosis treatment exists and further research is ongoing, drug resistance poses a continued threat.

ImageCLEF [1] is an evaluation campaign organized as part of the CLEF initiative labs. The ImageCLEFmedical 2021 challenge includes a task [2] which consists of automatically categorizing each TB case into one of the following five types: (1) Infiltrative, (2) Focal, (3) Tuberculoma, (4) Miliary, (5) Fibro-cavernous. The organizers provided participants with two types of masks. The first version of the segmentation provides more accurate masks [3], but it tends to miss large abnormal regions of the lungs in the most severe TB cases. The second segmentation, on the contrary, provides rougher mask bounds [4], but behaves more stably in terms of including lesion areas. Participants were free to use the provided masks in their experiments.

This paper is organized as follows: Section 2 describes related work, Section 3 presents our methods, Section 4 evaluates the proposed methods and, in the end, we draw conclusions.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
alexandra.hanganu@gmail.com (A. Hanganu)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Related Work

In previous editions, different approaches were proposed for the tuberculosis task. In 2017, there were two important tasks: MDR Detection, which consisted of assigning each TB patient the probability of having a resistant form of tuberculosis based on the analysis of a chest CT scan, and Detection of TB Type, which consisted of automatically categorizing each TB case into one of the following five types: Infiltrative, Focal, Tuberculoma, Miliary, Fibro-cavernous. In 2018, the most notable result was obtained by team UIIP-BioMed, which used a deep CNN [5], followed by MedGift, which used an SVM with an RBF kernel [6]. In 2019, UIIP-BioMed [7] was the leader again, followed by CompElecEngCU [8]. UIIP-BioMed used a 2D CNN, whilst CompElecEngCU used a 2D CNN based on AlexNet [9].
In 2019, the solution developed by our group used stage-wise boosting in low-resource environments [10]. In 2020, the majority of the participants used some variation of the projection-based approach and built 2D CNNs; only four groups tried 3D CNNs for a direct analysis of the CT volumetric data [11]. SenticLab.UAIC [12] was the winner of the task, using 2D and 3D CNNs. SDVA-UCSD ranked second, using a 3D CNN with a convolutional block attention module (CBAM) and a customized loss function [13]. Our team ranked 7th, using SVMs and CNNs for lung-wise processing [14].

3. Methods

We propose an extension to the 2D projection methods that were successfully applied in previous years, replacing the arithmetic mean aggregation function with k-means. We made this choice because the average can be seen as a special case of k-means with k=1. With this, we hope to capture more relevant characteristics from the data, while keeping the advantage of working with lower-dimensional data. Transforming 3D CT scans into 2D images does have the downside of losing spatial proximity information along the third dimension and introduces the risk of losing important features. For this reason, we also experimented with working on the 3D samples directly, in order to have a reference point for how the two modalities (2D and 3D) perform.

3.1. Dataset

Based on the metadata file provided by the organizers, we concluded that the data at hand is highly imbalanced. Since this could have negatively influenced our models, we decided to use a weighted loss in order to counteract this phenomenon. Table 1 shows the distribution of each affection in the training set, as well as the weight used for the loss function. To calculate each weight, we divided the lowest number of patients affected by one of the types (in our case, 70) by the number of patients affected by that type.

Table 1
Patients Affected in the Training Set and Weighted Loss Used

Tuberculosis Type      Patients Affected (out of 915)   Weighted Loss
(1) Infiltrative       418                              0.1674
(2) Focal              226                              0.3097
(3) Tuberculoma        101                              0.6930
(4) Miliary            100                              0.7000
(5) Fibro-cavernous    70                               1
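For illustration, a minimal sketch of how such class weights can be derived and passed to a weighted cross-entropy loss, assuming a PyTorch setup (the variable names below are illustrative, not taken from our actual pipeline; the counts follow Table 1):

    import torch
    import torch.nn as nn

    # Patients per TB type in the training set (Table 1), in class order:
    # Infiltrative, Focal, Tuberculoma, Miliary, Fibro-cavernous.
    patients_per_type = torch.tensor([418.0, 226.0, 101.0, 100.0, 70.0])

    # Weight of each class = smallest class count divided by that class's count,
    # so the rarest class (Fibro-cavernous) receives weight 1.0.
    class_weights = patients_per_type.min() / patients_per_type

    # Weighted cross-entropy: misclassifying a rare class costs more.
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    logits = torch.randn(4, 5)            # e.g. a batch of 4 samples, 5 classes
    targets = torch.tensor([0, 2, 4, 1])  # ground-truth TB types
    loss = criterion(logits, targets)

With this weighting, a mistake on a Fibro-cavernous case contributes roughly six times more to the loss than a mistake on an Infiltrative case.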
3.2. Data Pre-Processing

Given that the provided masks did not always yield the best possible segmentation, we decided to combine the two of them and compute a new mask that would be closer to what we needed. However, this turned out to be quite hard, as some of the masks did not have a proper lung segmentation, in accordance with the anatomical position given by the midsagittal line in the upper thoracic area. Furthermore, some issues were caused not by the segmentations but by the CT scans themselves, since some of them could not be opened or were faulty. Table 2 lists the issues we encountered when processing the data and the solutions we adopted for them.

The most difficult case to treat was the test CT scan of patient TST_247, because we were not able to load it properly and it still needed to be processed and labeled. Since it had an issue similar to the scan of patient TRN_360, we assumed that both files have a similar origin and that the problem stemmed either from an underlying anatomical condition or from a similar device fault during acquisition. Thus, we manually assigned it the same TBT class that TRN_360 was assigned in the training metadata.

Table 2
Dataset Issues and Solutions

Data Type      Patient ID                  Issue                                  Solution
Train CT Scan  TRN_305                     Corrupt file                           Left out
Train CT Scan  TRN_360                     CT scan consisted of only some white   Left out
                                           pixels on a black background and
                                           could not be used
Train Mask1    TRN_366, TRN_433, TRN_867,  Wrong lung segmentation                Recolored the mask image according to the
               TRN_887, TRN_891, TRN_894,                                         initial colour scheme and removed the
               TRN_897                                                            trachea, as it was also selected as part
                                                                                  of the lungs
Test Mask1     TST_051, TST_093, TST_124,  Wrong lung segmentation                Recolored the mask image according to the
               TST_269, TST_362                                                   initial colour scheme and removed the
                                                                                  trachea, as it was also selected as part
                                                                                  of the lungs
Test CT Scan   TST_247                     CT scan consisted of only some white   Assumed it was similar to patient TRN_360
                                           pixels on a black background and       and assigned the same label (Type 1)
                                           could not be used

1 The mask referred to is the fully automatic multistage one.

3.2.1. Data Normalization

When normalizing the images, we opted for an approach similar to that of past high-ranking participants [7]. The erosion radius adopted was 10, and the voxel intensity values were increased by 1024 Hounsfield Units (HU), with the threshold set to -1200 and the values clipped to 600.

3.2.2. Segmentation Pre-Processing

The two types of lung segmentations provided by the organizers had different issues. One of them has accurate edges, but it tends to overlook or miss entirely some of the large cavities or other lesions in the lungs that would be relevant for the model. The other covers the entire lung better, but it is generally less accurate, especially along the edges of the lung lobes. The solution we proposed for this issue was to generate a mean of the two masks. Combining them and retaining all of the information helped ensure that no particularity of the affections is overlooked, while also highlighting the increased importance of the regions where the two segmentation methods overlap and hence agree. To complete this task, we took the first type of mask, the fully automatic multistage one, considering only its non-zero values, added it to the second type of mask and then computed the mean of the two, multiplying the result by 255 in the end. We thus obtained a new image with the opacity adjusted for the parts of the lungs where the two segmentations did not overlap. The resulting masks were applied to the raw CT scans by multiplying the two tensors. When completing this step, a threshold was set in order to help eliminate the non-lung pixels present in the resulting image.

3.2.3. Data Augmentation

Multiple types of data augmentation techniques were tried in order to prevent early over-fitting of the models. To this end, we employed Randaugment [15], only disabling the "Invert" and "Solarize" augmentations and setting N = 2 and M = 14. Additionally, we used Random Erasing [16], a technique that randomly crops out a patch covering between 5% and 15% of the original image. We also tried other approaches to augmenting the dataset, such as Mixup [17] and Cutmix [18], but these methods produced images which made it too difficult for the model to learn.
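As a rough approximation of this configuration, the sketch below wires the two augmentations using torchvision (an assumed library choice; our projections have more than 3 channels and torchvision's RandAugment neither handles them nor exposes disabling individual operations such as Invert and Solarize, so this only illustrates the N=2, M=14 and 5-15% erasing settings on a standard 3-channel tensor):

    import torch
    from torchvision import transforms

    # Illustrative pipeline: RandAugment with N=2 operations at magnitude M=14,
    # followed by Random Erasing of a patch covering 5%-15% of the image area.
    augment = transforms.Compose([
        transforms.RandAugment(num_ops=2, magnitude=14),
        transforms.ConvertImageDtype(torch.float32),
        transforms.RandomErasing(p=0.5, scale=(0.05, 0.15)),
    ])

    image = torch.randint(0, 256, (3, 384, 384), dtype=torch.uint8)  # dummy 3-channel input
    augmented = augment(image)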
3.3. Volumetric Classification

When working with the 3D volumetric data, we looked for a model that had been pre-trained, preferably on medical data. Thus, we came across MedicalNet [19], which provides several models pre-trained on medical datasets. We worked with these in order to obtain better results than could be achieved when training the network from scratch. However, due to the large size of the samples and the hardware limitation imposed by using a single Nvidia RTX 2080Ti GPU, we had to resize all the volumes to 256x256x32. Taking these constraints into account, the only variant we managed to fit into GPU memory was MedicalNet10, which is essentially a pre-trained 3D-ResNet10 network.

One technique adopted during training was gradient accumulation. Since we could only use a maximum batch size of 2, we accumulated 16 gradient backpropagation steps before updating the weights, effectively allowing us to simulate a batch size of 32. We loaded the MedicalNet10 weights pre-trained on 23 datasets, replacing its segmentation head with a newly initialized classification head, in which a linear layer outputs the probabilities for each of the 5 classes.

As for the hyper-parameter tuning step, we attempted multiple combinations of optimizers and data augmentations, but our attempts were consistently faced with early overfitting of the model: it would occur after the first 40 epochs of training when very little or no data augmentation was applied. Another issue was that we frequently encountered an unstable training process, caused by exploding gradients and followed by a model collapse, usually occurring in the first 5 epochs when stronger augmentations were used. Our best submitted run is a MedicalNet10 model trained for 44 epochs using the AdamP optimizer [20] with the default parameters, a cross-entropy loss and no data augmentations. Attempts to train 3D-ResNet variants from scratch suffered from the same issues, but the overfitting started sooner, and as such these models produced poorer results.
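The sketch below illustrates the two mechanisms described above, head replacement and gradient accumulation, under stated assumptions: `backbone`, `TBTClassifier` and `train_epoch` are hypothetical names, and loading the actual MedicalNet10 weights is omitted.

    import torch.nn as nn

    ACCUM_STEPS = 16  # 16 backward passes with batch size 2 simulate a batch of 32

    class TBTClassifier(nn.Module):
        """Pre-trained 3D encoder with a newly initialized classification head."""
        def __init__(self, backbone: nn.Module, feature_dim: int, num_classes: int = 5):
            super().__init__()
            self.backbone = backbone                         # e.g. a 3D-ResNet10 encoder
            self.head = nn.Linear(feature_dim, num_classes)  # replaces the segmentation head

        def forward(self, x):
            return self.head(self.backbone(x))

    def train_epoch(model, loader, optimizer, criterion):
        model.train()
        optimizer.zero_grad()
        for step, (volumes, labels) in enumerate(loader):  # batches of 2 resized volumes
            loss = criterion(model(volumes), labels) / ACCUM_STEPS
            loss.backward()                                # gradients accumulate across steps
            if (step + 1) % ACCUM_STEPS == 0:
                optimizer.step()                           # one weight update per 16 batches
                optimizer.zero_grad()

In our runs, the optimizer plugged into such a loop was AdamP with its default parameters and a cross-entropy criterion.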
3.4. Projection Classification

3.4.1. Projection Generation

In previous years, computing and using 2D projections has proven to be one of the most successful ways of pre-processing a 3D volume, mostly due to its effective reduction in dimensionality without the loss of relevant information, so this year we attempted to build upon this approach by adapting the projection method to use the k-means algorithm [21][22]. As described in [7] and [12], generating 2D projections along the X, Y and Z axes, by aggregating along a given direction with the max(), average() and std() functions and stacking the results as channels of the same image, proved to preserve enough relevant information while drastically reducing the size of the data. We propose extending the family of aggregation functions used for this purpose with k-means, taking the centroid values produced by k-means as outputs that construct multiple image channels. Seen this way, the average() function from the original projection method becomes a particular case of k-means, namely k=1.

Using k-means as an aggregation function gives us more flexibility with regard to the number of potentially relevant features that we extract as centroids from the 3D volume. Since the function outputs a number of centroids equal to the value of k (not just one, as in the case of average()), each of these centroids is mapped to a different channel in the resulting projection. The centroids are initialized randomly; this did not seem to have a large influence on the results, since the algorithm converges to the final centroids in very few steps, being a very simple 1D k-means computation. The mapping is done by sorting the centroid values: the smallest centroid is mapped to the first channel, the second smallest to the second channel, and so on. Our final projection algorithm still uses max() and std() as aggregation functions and replaces average() with k-means where k=5. The resulting images consist of 7-channel projections for each of the axis directions X, Y and Z. The use of max() could be replaced by higher values of k, since the centroid at the higher end of the interval will probably be very close in value to the maximum. An ablation study would be needed to identify an optimal value for k; as a starting point, we arbitrarily picked k=5, and since a stable training procedure was not obtained, we did not get to fine-tune this parameter.

The Y projection had to be treated separately, since projecting along that direction would overlap the two lungs. In order to prevent losing relevant information at the lung level, we first separated the lungs with the help of the lung-wise segmentation and then calculated the projection separately for each one. We then concatenated the two images such that the two lungs would be positioned next to each other. The resulting X, Y and Z projections can be observed in Fig. 1. Because the projections have more than 3 channels, the figures were obtained by resizing the number of channels to 3 for RGB display; they therefore contain less information, as some of it was lost when resizing the channels for visualization purposes.

Figure 1: X, Y and Z Axis Projections Approximations.

After computing the X, Y and Z projections, each with 7 channels, we decided to stack them along the channel dimension. Given that the anatomical positioning is the same, overlapping the lobes was not considered an issue, especially since the model has 21 individual channels to look at in order to learn the particularities of each tuberculosis type. An approximation of the resulting images is presented in Fig. 2. This was favorable especially because the algorithm would only use one image for the training and testing process, meaning that the data would be fed all at once. A fixed random seed was used to generate the k-means projections.

Figure 2: Stacked Projection Approximation.
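A minimal, unoptimized sketch of the k-means aggregation is given below, assuming scikit-learn's KMeans and NumPy; the function name is ours, and the real pipeline additionally applies the lung masks and the split handling of the Y axis described above.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_projection(volume: np.ndarray, axis: int = 0, k: int = 5) -> np.ndarray:
        """Aggregate a 3D volume along `axis` into k channels of sorted k-means
        centroids, plus one max() and one std() channel (k + 2 channels total).
        Unoptimized reference sketch: one 1D k-means fit per output pixel."""
        moved = np.moveaxis(volume, axis, -1)          # put the aggregated axis last
        h, w, depth = moved.shape
        out = np.empty((h, w, k + 2), dtype=np.float32)
        for i in range(h):
            for j in range(w):
                line = moved[i, j].reshape(-1, 1).astype(np.float32)
                km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(line)
                centroids = np.sort(km.cluster_centers_.ravel())  # smallest -> channel 0
                out[i, j, :k] = centroids
                out[i, j, k] = line.max()
                out[i, j, k + 1] = line.std()
        return out

    # Example: a small fake volume projected along the Z axis into 7 channels.
    proj = kmeans_projection(np.random.rand(48, 48, 32), axis=2, k=5)

Called with axis values 0, 1 and 2 and k=5, this yields three 7-channel projections which are then stacked into the 21-channel input image.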
3.4.2. Models

The considerably smaller 2D image size allowed us to use larger models and batch sizes. To do so, we tried resizing the images while preserving the number of channels. As CNN models, we experimented with PreResNet56 [23] and the Swin Transformer [24] in its large, base and tiny variants. For the Swin Transformer models, we also tried to use the weights trained on ImageNet-21k [25] and to apply various transfer learning techniques in order to adapt them to the TBT classification task. One of the steps implemented to do this was to reinitialize the PatchEmbed layer and the final normalization, pooling and linear layers, so that they match our 5 classes. The Swin Transformer runs were not able to noticeably reduce the loss, whether trained from scratch or initialized with the pre-trained weights, but we believe this might have been due to an implementation error on our part or to a faulty normalization technique. The two models that delivered the best results were both based on PreResNet with a depth of 56.

Most of the experiments with this type of model were based on hyper-parameter tuning, as the base was a standard PreResNet56 trained from scratch on the task data. For the hyper-parameters, we used the RAdam optimizer [26] with the default parameters, a cross-entropy loss function and a batch size of 16 with 4 gradient accumulation steps, resulting in a simulated batch size of 64. Both models use Lookahead [27]. Further information about the differences between the parameters of the two models is presented in Table 3.

Table 3
Comparison of the Two PreResNet56 Models

Parameter       #135756    #135798
Epochs          55         28
Image Size      384x384    512x512
Learning Rate   0.0001     1.0e-05
Lookahead K     8          16

4. Evaluation

When evaluating our experiments, the main reference was the kappa score, defined on the interval [-1, 1]. This score is not differentiable, so we could not use it directly as a loss function for our models; therefore, the training process was also monitored through the cross-entropy loss. To track the progress of the training process, we randomly split the initial training dataset into train and validation splits, while preserving the same class imbalance in both splits. In order for the results to be consistent between different runs, all runs used the same data split.

The experiments performed using only flips and rotations as data augmentations caused the models to overfit very early on, without recovering with further training. This phenomenon convinced us to apply stronger augmentation techniques, such as Randaugment and Random Erasing, in order to better populate the latent space. However, when these stronger augmentations were used, the models suffered from training instability: the loss values would explode, and hence the gradients would too, resulting in numeric overflow. Once the loss exploded, the model collapsed and did not learn anything else. This happened very fast, usually in the first 5 epochs of training.

The four submissions made by our team are listed in Table 4, detailing which model was used, whether or not the model had pre-trained weights, the local validation kappa score and the test kappa score. Since training with strong data augmentations was unstable and stopped the models from learning in the first few epochs, no submissions were made with them. All of the runs we submitted suffered from relatively fast overfitting, occurring within approximately the first 50 epochs.

Table 4
Submission Runs - Official Evaluation Results

ID       Model          Pre-trained   Modality   Val. Kappa   Test Kappa
135708   MedicalNet10   Yes           3D         0.232        0.129
135756   ResNet56       No            2D         0.209        -0.003
135750   3D-ResNet20    No            3D         0.117        -0.019
135798   ResNet56       No            2D         0.134        0.016

5. Future Work

Further work needs to be invested into adapting state-of-the-art data augmentations from the general computer vision domain to medical images. Since the datasets usually encountered in this field are small and acquiring additional data is frequently not feasible, stable data augmentation strategies are required. As a retrospective analysis of our methods, we need to carefully investigate the exact causes of the training instability. We suspect that, because we did not normalize the data after the final step of the pre-processing/augmentation pipeline, the resulting images might have very different value distributions between samples, which could even escape our intended value interval of [0.0, 1.0].
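For illustration, the kind of missing final normalization step we have in mind could be as simple as the following hypothetical helper (not part of our submitted pipeline), which forces every sample back into the intended interval before it reaches the model:

    import torch

    def rescale_to_unit_interval(sample: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """Min-max rescale a single (C, H, W) sample so all values lie in [0, 1],
        regardless of what the preceding pre-processing/augmentation steps produced."""
        lo, hi = sample.min(), sample.max()
        return (sample - lo) / (hi - lo + eps)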
Another potential cause of our issues could be the large size of the linear classification layer, which might explain the exploding gradients we noticed; we have to investigate whether using a pooling layer before the classification layer helps solve this issue.

6. Conclusion

In this paper we presented ways of detecting different types of tuberculosis in lungs. The methods that proved to work best for us are based on pre-trained models for classifying 3D volumetric data and on models trained from scratch for 2D projections. The best result from our submissions can still be improved, but in doing so we would have to investigate different normalization techniques and types of data augmentation better suited to the task. In order to work with volumetric data properly, we would also need better hardware, with the aim of being able to apply larger models, as well as larger batch sizes, to medical datasets.

7. Acknowledgements

This work was supported by project REVERT (taRgeted thErapy for adVanced colorEctal canceR paTients), Grant Agreement number: 848098, H2020-SC1-BHC-2018-2020/H2020-SC1-2019-Two-Stage-RTD.

References

[1] B. Ionescu, H. Müller, R. Peteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, S. Kozlovski, V. Liauchuk, Y. Dicente, V. Kovalev, O. Pelka, A. G. S. de Herrera, J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D. Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid, A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer, Bucharest, Romania, 2021.
[2] S. Kozlovski, V. Liauchuk, Y. Dicente Cid, V. Kovalev, H. Müller, Overview of ImageCLEFtuberculosis 2021 - CT-based tuberculosis type classification, in: CLEF 2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest, Romania, 2021.
[3] Y. Dicente Cid, O. A. Jiménez del Toro, A. Depeursinge, H. Müller, Efficient and fully automatic segmentation of the lungs in CT volumes, in: O. Goksel, O. A. Jiménez del Toro, A. Foncubierta-Rodríguez, H. Müller (Eds.), Proceedings of the VISCERAL Anatomy Grand Challenge at the 2015 IEEE ISBI, CEUR Workshop Proceedings, CEUR-WS.org, 2015, pp. 31–35.
[4] V. Liauchuk, V. Kovalev, ImageCLEF 2017: Supervoxels and co-occurrence for tuberculosis CT image classification, in: CLEF 2017 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Dublin, Ireland, 2017.
[5] V. Liauchuk, A. Tarasau, E. Snezhko, V. Kovalev, ImageCLEF 2018: Lesion-based TB-descriptor for CT image analysis, CLEF 2018 Working Notes (2018).
[6] Y. Dicente Cid, H. Müller, Texture-based graph model of the lungs for drug resistance detection, tuberculosis type classification, and severity scoring: Participation in the ImageCLEF 2018 tuberculosis task, CLEF 2018 Working Notes (2018).
[7] V. Liauchuk, ImageCLEF 2019: Projection-based CT image analysis for TB severity scoring and CT report generation, CLEF 2019 Working Notes 2380 (2019).
[8] A. Mossa, A. Yibre, U. Evik, Multi-view CNN with MLP for diagnosing tuberculosis patients using CT scans and clinically relevant metadata, CLEF 2019 Working Notes 2380 (2019).
[9] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM 60 (6) (2017) 84–90.
[10] A. Tabarcea, V. Rosca, A. Iftene, ImageCLEFmed Tuberculosis 2019: Predicting CT scans severity scores using stage-wise boosting in low-resource environments, CLEF 2019 Working Notes 2380 (2019).
[11] S. Kozlovski, V. Liauchuk, Y. Dicente Cid, A. Tarasau, V. Kovalev, H. Müller, Overview of ImageCLEFtuberculosis 2020 - automatic CT-based report generation, CLEF 2020 Working Notes (2020).
[12] R. Miron, C. Moisii, M. Breaban, Revealing lung affections from CTs. A comparative analysis of various deep learning approaches for dealing with volumetric data, CLEF 2020 Working Notes (2020).
[13] X. Lu, E. Chang, Z. Liu, C. Hsu, J. Du, A. Gentili, ImageCLEF2020: Laterality-reduction three-dimensional CBAM-ResNet with balanced sampler for multi-binary classification of tuberculosis and CT auto reports, CLEF 2020 Working Notes (2020).
[14] L. Coca, A. Hanganu, C. Cusmuliuc, A. Iftene, UAIC2020: Lung analysis for tuberculosis detection, CLEF 2020 Working Notes (2020).
[15] E. D. Cubuk, B. Zoph, J. Shlens, Q. V. Le, RandAugment: Practical automated data augmentation with a reduced search space, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702–703.
[16] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 13001–13008.
[17] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, arXiv preprint arXiv:1710.09412 (2017).
[18] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, CutMix: Regularization strategy to train strong classifiers with localizable features, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6023–6032.
[19] S. Chen, K. Ma, Y. Zheng, Med3D: Transfer learning for 3D medical image analysis, arXiv preprint arXiv:1904.00625 (2019).
[20] B. Heo, S. Chun, S. J. Oh, D. Han, S. Yun, G. Kim, Y. Uh, J.-W. Ha, AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant weights, in: Proceedings of the International Conference on Learning Representations (ICLR), Online, 2021, pp. 3–7.
[21] S. P. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory 28 (1982) 129–136.
[22] J. MacQueen, et al., Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, Oakland, CA, USA, 1967, pp. 281–297.
[23] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[24] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, arXiv preprint arXiv:2103.14030 (2021).
[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[26] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, J. Han, On the variance of the adaptive learning rate and beyond, CoRR abs/1908.03265 (2019). URL: http://arxiv.org/abs/1908.03265. arXiv:1908.03265.
[27] M. R. Zhang, J. Lucas, G. E. Hinton, J. Ba, Lookahead optimizer: k steps forward, 1 step back, CoRR abs/1907.08610 (2019). URL: http://arxiv.org/abs/1907.08610. arXiv:1907.08610.