ImageCLEFmed Tuberculosis 2019: Predicting CT Scans Severity Scores using Stage-Wise Boosting in Low-Resource Environments

Augustus Tabarcea, Valentin Rosca, and Adrian Iftene
"Alexandru Ioan Cuza" University, Iasi, Romania
{augustus.tabarcea,valentin.rosca,adiftene}@info.uaic.ro

Abstract. Tuberculosis (TB) remains a persistent threat and a leading cause of death worldwide. The different types of TB require different treatments, usually with antibiotics, so detecting the TB type and evaluating the severity stage are very important. For the ImageCLEF 2019 Tuberculosis task, our group submitted a solution that addresses the problem of tuberculosis severity prediction in low-resource environments by attempting to minimize the information required from the CT scan, using a regularized variant of the SAMME.R algorithm.

Keywords: Tuberculosis · Boosting · ImageCLEF

1 Introduction

Tuberculosis (TB) is a disease caused by bacteria (Mycobacterium tuberculosis) that most often affects the lungs. Even though it is curable and preventable, tuberculosis is the second biggest killer globally (after HIV infection) [15]. In 2016, an epidemiological study estimated 10.4 million new cases and 1.7 million deaths. Tuberculosis is spread from person to person through the air. When people with active infection cough, sneeze or spit, they propel the TB germs into the air, and a person needs to inhale only a few of these germs to become infected. About one-quarter of the world's population has latent TB, meaning they have been infected by TB bacteria but are not (yet) ill with the disease and cannot transmit it. People infected with TB bacteria have a 5-15% lifetime risk of falling ill with TB [17]. The 2019 edition of ImageCLEF (the Image Retrieval and Analysis Evaluation Campaign of the Cross-Language Evaluation Forum) [11] targets two tasks in the direction of serving tuberculosis diagnosis [4].
The first task involves automated assessment of tuberculosis severity, while the second one involves generating a computed tomography report based on CT scans and the patient's metadata.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

The CT report task is a multi-binary classification problem that addresses the presence of calcification, presence of caverns, presence of TB in the right lung, presence of TB in the left lung, pleurisy and lung capacity decrease. The severity scoring task is a binary classification problem centered around labeling the TB severity as "HIGH" or "LOW". Each training instance is labelled with one of 5 classes, where labels 1, 2 and 3 represent "HIGH" severity and labels 4 and 5 represent "LOW" severity.

The dataset provided by ImageCLEFmed Tuberculosis is composed of CT scans of 218 patients for training and 117 for testing, each with a collection of metadata regarding the respective patient. The metadata is supplied to support better classification, in consideration of the elaborate process a doctor conducts to give a diagnosis. Each CT scan is made of slices (Figure 1); their number varies from 50 to 400, and every slice has a dimension of 512 x 512.

Fig. 1: Slices of lung example

Considering the need for rapid, reliable and cost-effective TB treatment regimens, we investigate which learning algorithm best fits our requirements and outputs the severity of the patient's condition. Our solution addresses the problem of tuberculosis severity prediction in low-resource environments by attempting to minimize the information required from the CT scan. This approach is beneficial for low-budget medical organizations and small practices, as they may not have access to the high-end computational resources needed for computer-aided diagnosis.
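The label binarization described above is a direct mapping; a minimal sketch (the function name is ours):

```python
def binarize_severity(score: int) -> str:
    """Map the original 1-5 severity score to the task's binary label:
    scores 1-3 are HIGH severity, scores 4-5 are LOW severity."""
    return "HIGH" if score <= 3 else "LOW"

print([binarize_severity(s) for s in range(1, 6)])
# ['HIGH', 'HIGH', 'HIGH', 'LOW', 'LOW']
```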
2 State of the art

A number of different approaches have been proposed for the task of severity prediction. Most notably, in the second edition of ImageCLEFmed TB 2018 [5][12], the team UIIP BioMed [14] obtained the highest kappa score, 0.2312, with an accuracy of 0.4227 and a root mean squared error of 0.7840. Furthermore, the team MedGift [6] scored 0.7708 in terms of ROC AUC, the highest in that year. The methods used by the two teams consisted of a Deep Convolutional Neural Network model submitted by UIIP BioMed and an SVM with RBF kernel submitted by the MedGift team. It is worth noting that the two teams used completely different pre-processing procedures. The correlation of better outcomes with more clinical data, in addition to the CT scans of patients, was studied before in [13]; as a result, in 2019 more metadata about the patients was provided. In the 2019 edition, the best accuracy was again obtained by UIIP BioMed, with a score of 0.7350.

Our experience in prediction algorithms comes from prediction of the cryptocurrency market [3] and from predicting user activities using GPS collections [1]. At the same time, we have experience with image processing in the medical domain [7], in the photo annotation task of CLEF 2010 [10], and in combining semantic resources for image retrieval [9], [19].

3 Architecture

3.1 Data pre-processing

The CT scans were preprocessed through methods whose foundation relies upon [21]. Depending on the procedure and device used to make the CT scans, the slices' pixel spacing varies and, as a result, the number of slices differs from patient to patient. To simplify the learning process, we resampled each scan's number of slices to the mean across patients, which is approximately 130 vertical slices. Because we were limited by the available resources, we decreased the size of each slice from 512 x 512 to 256 x 256.
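As an illustration, the resampling step can be sketched with a simple nearest-neighbour scheme (the tutorial in [21] uses spline interpolation via SciPy instead; the fixed target shape below reflects the values reported in the text):

```python
import numpy as np

def resample_scan(voxel, target_slices=130, target_hw=256):
    """Nearest-neighbour resample of a CT voxel of shape (slices, H, W)
    to a fixed (target_slices, target_hw, target_hw) shape."""
    s, h, w = voxel.shape
    zi = np.linspace(0, s - 1, target_slices).round().astype(int)
    yi = np.linspace(0, h - 1, target_hw).round().astype(int)
    xi = np.linspace(0, w - 1, target_hw).round().astype(int)
    # open-mesh fancy indexing picks the nearest source voxel per target cell
    return voxel[np.ix_(zi, yi, xi)]

scan = np.zeros((57, 512, 512), dtype=np.int16)   # e.g. a 57-slice scan
print(resample_scan(scan).shape)                  # (130, 256, 256)
```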
Each pixel is a signal on the Hounsfield Unit (HU) scale (Table 1); hence, we store each pixel as a float16 value to spare memory and speed up learning (Figure 2).

Table 1: Hounsfield units scale

    Substance               HU
    Air                     -1000
    Lung                    -500
    Fat                     -100 to -50
    Water                   0
    CSF                     15
    Kidney                  30
    Blood                   +30 to +45
    Muscle                  +10 to +40
    Grey matter             +37 to +45
    White matter            +20 to +30
    Liver                   +40 to +60
    Soft Tissue, Contrast   +100 to +300
    Bone                    +700 to +3000

Fig. 2: HU pixel values of a patient

The HU signals for bones start at around 400 units and can exceed 3000 units. Figure 3a shows merged slices of a patient for a more reliable overview of the patient's bone structure (HU signal values representing bones in axial view are plotted in Figure 3b). The pixels representing bones, whose value is greater than 400, were replaced by those representing air, for a better focus on the lung tissue, both affected and healthy, so that any classifier could distinguish between those two categories. Afterward, we normalized the values to [0, 1].

(a) Axial slice overview of bones (b) Maximal HU pixel value in each axial point of the structure
Fig. 3: Axial slice overview (left) and maximal HU pixel value (right)

To summarize, these pre-processing steps were applied to our data:
• resample the depth of the lung voxel (number of slices per patient);
• resample the width and height of the lung voxel (all slices of each patient);
• convert the signals to the float16 type (with negligible precision loss at HU magnitudes);
• replace pixels representing bone structures in the lung voxel;
• normalize the pixel values.

3.2 Further proposed pre-processing methods

Further pre-processing methods were conceptualized, although they were not exploited because of a lack of time and scarce computational resources. It is proper to acknowledge the future use of these methods for possibly better results.
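The bone-removal and normalization steps summarized above can be sketched as follows (the 400 HU threshold and the -1000 HU air value follow the text; the function name is ours):

```python
import numpy as np

BONE_HU = 400     # HU values above this are treated as bone
AIR_HU = -1000    # HU value of air

def preprocess(voxel):
    """Replace bone pixels with air, min-max normalize to [0, 1],
    and store compactly as float16."""
    v = voxel.astype(np.float32)
    v[v > BONE_HU] = AIR_HU                      # drop bone, keep lung tissue
    v = (v - v.min()) / (v.max() - v.min())      # min-max normalization
    return v.astype(np.float16)                  # compact storage

scan = np.array([[-1000, -500, 0, 2000]], dtype=np.int16)
print(preprocess(scan))  # the 2000 HU bone pixel is mapped to air, i.e. 0.0
```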
One such method is the application of a mask over the lung voxel to focus on the tissue within the lung (Figure 4c), where the tuberculosis is located. This mask [21] computes the connected regions of the voxel (Figure 4a) by labeling all connected pixels with the same value. The same mask further computes the lung without its internal structure by finding, for every slice, the maximum connected component of pixels; these form the lung frontier if the slice contains lungs, and nothing otherwise (Figure 4b).

(a) Lung voxel (b) Voxel of lung without internal structures (c) Lung internal structure
Fig. 4: Masks of a lung

3.3 Building the prediction algorithm

Instead of training a binary classifier for LOW and HIGH severity, we used the original severity scores from 1 to 5 found in the patients' metadata. For the final prediction, we added the probabilities estimated by our model for severity scores 1, 2 and 3 to obtain the probability of HIGH severity.

As previously stated in the introduction, the objective of our paper is to explore methods that minimize the amount of resources used for the prediction. This implies finding a way to reduce the amount of information used during the prediction. Our idea is to use a model capable of selecting the most important features. We consider each pixel a feature for our model. In this way, by selecting a subset of the features, we are actually restricting the information we need from the CT to make predictions. Moreover, we would like to supply an upper bound on the number of utilized pixels.

A good choice for such a model is the decision tree algorithm described in [18]. A simple variant for limiting the number of nodes is to stop the algorithm when the number of nodes reaches a threshold. Another method for building decision trees with a maximum number of nodes is provided by [8], where constraints are introduced into the building phase of the algorithm.
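As an illustration of this idea (using scikit-learn's off-the-shelf size constraint rather than the authors' own implementation; the data below is a hypothetical stand-in): capping the number of leaves also caps the number of distinct pixels the tree can ever consult.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))              # 200 "patients" x 5,000 "pixels"
y = (X[:, 42] + X[:, 1337] > 0).astype(int)   # arbitrary informative pixels

# A binary tree with at most 16 leaves has at most 15 internal splits,
# so at prediction time it reads at most 15 distinct pixels of the scan.
tree = DecisionTreeClassifier(max_leaf_nodes=16, random_state=0).fit(X, y)
used_pixels = {int(f) for f in tree.tree_.feature if f >= 0}
print(len(used_pixels) <= 15)  # True
```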
By setting a size constraint on the decision tree, the best nodes will be selected within that constraint.

Boosting algorithms use a linear combination of weak classifiers to make predictions. They learn the coefficient of each weak classifier in a stage-wise fashion, by first training a weak learner and then selecting its weight so as to minimize a loss function. If each weak learner utilizes only a limited number of different features, an upper bound on the number of features used by the boosting algorithm is the number of weak learners times the number of different features per learner. This motivates our choice of a boosting algorithm with decision trees as weak classifiers. Additionally, given the small number of instances (<400), a boosting algorithm helps prevent overfitting while also letting us limit the number of features the algorithm is allowed to use for the prediction.

For predicting the original severity scores, the ability to supply probabilities is required. A boosting algorithm that can estimate probabilities in a multi-class setting, called SAMME.R, is described in [20]. This particular algorithm provides class probability estimates for the prediction. For the weak learners, we used decision stumps.

The major problem we faced while training the model was overfitting. To mitigate this problem, we used four different approaches.

The first technique is limiting the number of features a weak learner trains on. For each boosting iteration, we limited the training of the weak learner to 1,000-10,000 pixels sampled at random from a uniform distribution. This method also decreased the training runtime drastically, making it possible to train the classifier in low-resource environments, while further reducing overfitting.

Another method was regularization of the decision stumps. This technique was already applied to the AdaBoost algorithm in [2].
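A minimal binary sketch of this first technique (discrete AdaBoost with stumps rather than the full multi-class SAMME.R; the data, pool sizes and round counts here are illustrative, not the submitted configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y, w, pool):
    """Best weighted-error decision stump over a random feature subset.
    Labels y are in {-1, +1}; w are the boosting sample weights."""
    best = (0, 0.0, 1, np.inf)
    for f in pool:
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] > thr, pol, -pol)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (f, thr, pol, err)
    return best

def boost(X, y, n_rounds=10, pool_size=5):
    """Each round sees only `pool_size` random features, so the ensemble
    uses at most n_rounds * pool_size distinct features overall."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        pool = rng.choice(d, size=pool_size, replace=False)
        f, thr, pol, err = fit_stump(X, y, w, pool)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = np.where(X[:, f] > thr, pol, -pol)
        w *= np.exp(-alpha * y * pred)   # up-weight the mistakes
        w /= w.sum()
        ensemble.append((f, thr, pol, alpha))
    return ensemble

X = rng.normal(size=(100, 40))
y = np.where(X[:, 0] > 0, 1, -1)
model = boost(X, y)
print(len({f for f, *_ in model}))  # number of distinct features, at most 50
```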
We used the same principle and changed the formula to adapt this method to SAMME.R. Instead of minimizing the weighted training error ε_w while searching for the best split, we minimize the following objective function:

    ε_w + λ · (P_w(x < s) · H_w(x < s | Y) + P_w(x >= s) · H_w(x >= s | Y))    (1)

where H_w is the weighted conditional entropy, P_w is the weighted probability estimate, λ is the regularization factor and s is the split value.

Our third method consists of splitting the training set further into an initial training set and an extended training set. During training, starting from the initial training set, samples from the extended training set are gradually added. The sample weights are then renormalized to accommodate the new samples, and the algorithm continues as normal. This method was adapted to SAMME.R from [16].

Our final approach was pruning weak classifiers. We start with only the decision of the first weak classifier and incrementally add weak learners in the order they were trained. The configuration with the highest score on validation data is then chosen.

4 Evaluation

For the ImageCLEFmed Tuberculosis SVR prediction task, we submitted a trained model based on the SAMME.R algorithm with 1,600 weak learners, where each weak learner is trained on 10,000 features (pixels and metadata) sampled randomly from a uniform distribution at each iteration. At the time, we had not yet conceptualized the methods described in the training section, and our algorithm suffered from overfitting. This approach obtained a ROC AUC (receiver operating characteristic area under the curve) score of 0.5692 and an accuracy of 0.5556. In selecting the hyperparameters and the model we used 5-fold cross-validation, with ROC AUC and accuracy as metrics. To select the best model, we looked at each fold and chose the most balanced one in terms of our metrics.
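The fold construction for this selection procedure can be sketched as a plain shuffled split (the helper name and seed are ours; the 218 patients are those of the training set):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle the n patient indices and split them into k folds; each
    fold serves once as held-out data while the rest is used for training."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = kfold_indices(218)          # 218 training patients
print([len(f) for f in folds])      # [44, 44, 44, 43, 43]
```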
These are our local results at the time of our submission:

Table 2: 5-fold cross-validation of the first run

    #   Train ACC   Train AUC   Train ACC B   Test ACC   Test AUC   Test ACC B
    0   0.9537      0.9967      0.9653        0.3111     0.6511     0.5333
    1   0.9306      0.9913      0.9479        0.4444     0.6284     0.5555
    2   0.9827      0.9998      0.9827        0.4090     0.5960     0.6136
    3   0.9657      0.9969      0.9657        0.4418     0.7489     0.6511
    4   0.9717      0.9979      0.9717        0.4634     0.6238     0.6585

    # - fold index used for testing
    Train ACC - training set accuracy, original severity score
    Train AUC - training set ROC AUC, binary severity
    Train ACC B - training set accuracy, binary severity
    Test ACC - test set accuracy, original severity score
    Test AUC - test set ROC AUC, binary severity
    Test ACC B - test set accuracy, binary severity

After adding the regularization, validation boosting and pruning of weak learners, our scores improved significantly:

Table 3: 5-fold cross-validation of the second run

    #   Train ACC   Train AUC   Train ACC B   Test ACC   Test AUC   Test ACC B
    0   0.7572      0.9227      0.8208        0.5555     0.7529     0.6666
    1   0.9364      0.9858      0.9421        0.4222     0.5988     0.5777
    2   0.9195      0.9902      0.9310        0.5227     0.6714     0.6363
    3   0.9142      0.9868      0.9428        0.4883     0.7207     0.6976
    4   0.8079      0.9320      0.8474        0.5365     0.6095     0.6097

The regularization factor used was 1, and the fraction of training samples used at iteration t was log t / log n, where n is the number of weak classifiers. It can clearly be noticed that the model tested on the first fold did not overfit the training data as badly as the others, having the lowest training accuracy and the highest test ROC AUC score. We also trained the model for 1,000 iterations.

5 Conclusion

Recent work on tuberculosis severity prediction has been shown to be progressing, with the ImageCLEFmed Tuberculosis 2019 edition yielding better results both in terms of ROC AUC and accuracy.
Current findings may be considered further validation of the fact that boosting, one of the modern machine learning approaches, still suffers from severe overfitting when applied to a very noisy dataset with significantly more features than training instances. However, by employing various data-cleaning and novel regularization techniques, we demonstrated that improvements can still be made to combat this problem. With each improvement on the subject, the prediction of tuberculosis severity will also become more accessible in low-resource environments, for use by low-budget medical organizations and small practices.

6 Future work

Besides the various pre-processing methods described in the Architecture section, we can make use of the pixels selected by the model to render a helpful 3D voxel of the lung, with emphasis on the regions that contribute most to the tuberculosis prediction. Such a visualization of a patient's data, linked with the low resource-demanding model, is likely to improve tuberculosis investigations.

References

1. Andries, S., Iftene, A.: Predicting user activities using GPS collections. pp. 53-60. Proceedings of the 14th Conference on Human Computer Interaction - RoCHI 2017, Craiova, Romania (09 2017)
2. Bereta, M.: Entropy-based regularization of AdaBoost. Computer Assisted Methods in Engineering and Science 24(2), 89-100 (2017)
3. Chelmus, R., Gifu, D., Iftene, A.: Prediction of cryptocurrency market. Proceedings of the 19th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing). LNCS, Hanoi, Vietnam (09 2018)
4. Dicente Cid, Y., Liauchuk, V., Klimuk, D., Tarasau, A., Kovalev, V., Müller, H.: Overview of ImageCLEFtuberculosis 2019 - automatic CT-based report generation and tuberculosis severity assessment. In: CLEF2019 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (September 9-12 2019)
5.
Dicente Cid, Y., Liauchuk, V., Kovalev, V., Müller, H.: Overview of ImageCLEFtuberculosis 2018 - detecting multi-drug resistance, classifying tuberculosis type, and assessing severity score. In: CLEF2018 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Avignon, France (September 10-14 2018)
6. Dicente Cid, Y., Müller, H.: Texture-based graph model of the lungs for drug resistance detection, tuberculosis type classification, and severity scoring: participation in the ImageCLEF 2018 tuberculosis task (09 2018)
7. Gabor, S., Iftene, A.: MIR - a system for medical image retrieval. pp. 9-19. Curative Power of Medical Data - Selected Papers of the First International Workshop MEDA 2017, "Alexandru Ioan Cuza" University, Iasi, Constanta, Romania (09 2017)
8. Garofalakis, M., Hyun, D., Rastogi, R., Shim, K.: Building decision trees with constraints. Data Mining and Knowledge Discovery 7(2), 187-214 (Apr 2003)
9. Iftene, A., Siriteanu, A.: Using semantic resources in image retrieval. 20th International Conference on Knowledge Based and Intelligent Information and Engineering Systems, KES2016, vol. 96, pp. 436-445. Elsevier, York, United Kingdom (09 2016)
10. Iftene, A., Vamanu, L., Croitoru, C.: UAIC at ImageCLEF 2009 photo annotation task. pp. 283-286. C. Peters et al. (Eds.): CLEF 2009, LNCS 6242, Part II (Multilingual Information Access Evaluation Vol. II Multimedia Experiments), Springer, Heidelberg (09 2010)
11. Ionescu, B., Müller, H., Péteri, R., Cid, Y.D., Liauchuk, V., Kovalev, V., Klimuk, D., Tarasau, A., Abacha, A.B., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman, D., Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran, M.T., Lux, M., Gurrin, C., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Garcia, N., Kavallieratou, E., del Blanco, C.R., Rodríguez, C.C., Vasillopoulos, N., Karampidis, K., Chamberlain, J., Clark, A., Campello, A.: ImageCLEF 2019: Multimedia retrieval in medicine, lifelogging, security and nature.
In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), LNCS Lecture Notes in Computer Science, Springer, Lugano, Switzerland (September 9-12 2019)
12. Ionescu, B., Müller, H., Villegas, M., de Herrera, A.G.S., Eickhoff, C., Andrearczyk, V., Cid, Y.D., Liauchuk, V., Kovalev, V., Hasan, S.A., Ling, Y., Farri, O., Liu, J., Lungren, M., Dang-Nguyen, D.T., Piras, L., Riegler, M., Zhou, L., Lux, M., Gurrin, C.: Overview of ImageCLEF 2018: challenges, datasets and evaluation. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), LNCS Lecture Notes in Computer Science, Springer, Avignon, France (September 10-14 2018)
13. Kovalev, V., Liauchuk, V., Skrahina, A., Astrauko, A., Rosenthal, A., Gabrielian, A.: Examining the utility of clinical, laboratory and radiological data for scoring severity of pulmonary tuberculosis (06 2018)
14. Liauchuk, V., Tarasau, A., Snezhko, E., Kovalev, V.: ImageCLEF 2018: lesion-based TB-descriptor for CT image analysis (09 2018)
15. McIntosh, J.: All you need to know about tuberculosis (2019)
16. Meijer, D.W.J., Tax, D.M.J.: Regularizing AdaBoost with validation sets of increasing size. In: 2016 23rd International Conference on Pattern Recognition (ICPR). pp. 192-197 (Dec 2016)
17. World Health Organization: BCG vaccines: WHO position paper - February 2018. Weekly Epidemiological Record 93(08), 73-96 (Feb 2018)
18. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81-106 (Mar 1986)
19. Serban, C., Siriteanu, A., Gheorghiu, C., Iftene, A., Alboaie, L., Breaban, M.: Combining image retrieval, metadata processing and naive Bayes classification at Plant Identification 2013.
Notebook Paper for the CLEF 2013 LABs Workshop - ImageCLEF - Plant Identification, Valencia, Spain (09 2013)
20. Zhu, J., Rosset, S., Zou, H., Hastie, T.: Multi-class AdaBoost. Statistics and Its Interface 2 (02 2006)
21. Zuidhof, G.: Full preprocessing tutorial (2019)