ImageCLEF2019: Tuberculosis - Severity Scoring and CT Report with Neural Networks, Transfer Learning and Ensembling

Amilcare Gentili 1,2 [0000-0002-5623-7512]
1 San Diego VA Health Care System, San Diego, CA, USA
2 University of California, San Diego, CA, USA
agentili@ucsd.edu

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

Abstract. The diagnosis of tuberculosis is challenging. We present our approach for classifying whether a patient has high or low severity tuberculosis and for detecting which lung is involved, whether lung capacity is decreased, and whether pleurisy, calcifications, or caverns are present. Our best results for the CT report task were obtained by converting the volume images into 8x4 montages of sagittal or coronal images and ensembling the results of networks trained separately on the sagittal and coronal montages. The best results for severity scoring were obtained by ensembling the results from the CT report with the provided metadata.

Keywords: Deep Learning, Convolutional Neural Network, Tuberculosis, CT Scans.

1 Introduction

Tuberculosis is a common disease for which fast diagnosis using CT images can often improve treatment results. An accurate and automatic method for classifying tuberculosis from CT images may be especially useful in regions of the world with few radiologists. The ImageCLEF 2019 evaluation campaign [1] includes a tuberculosis task with two subtasks [2]: 1) scoring the severity of tuberculosis from CT images, and 2) creating a report that identifies whether the left lung is affected, whether the right lung is affected, whether calcifications, caverns, and/or pleurisy are present, and whether lung capacity is decreased.

2 Methods

2.1 Data

Both the CT report subtask and the severity scoring subtask of the ImageCLEF 2019 Tuberculosis task [2] use the same dataset, containing 335 chest CT scans of TB patients along with a set of clinically relevant metadata; 218 patients are used for training and 117 for testing. The provided metadata includes information about disability, relapse, symptoms of TB, comorbidity, bacillary status, drug resistance, education level, incarceration history, alcohol consumption, and smoking history. A set of lung masks was also provided for all patients [3].

For the CT report task, the training set distribution of pathology was somewhat unbalanced, with lung involvement being very common and calcifications and pleurisy rare. See Figure 1.

Fig. 1. Distribution of manifestations of tuberculosis in the training dataset

For the severity scoring task, the training set distribution of high and low severity was balanced. See Figure 2.

Fig. 2. Distribution of high and low severity scores in the training dataset

2.2 Metadata Analysis

Reviewing the metadata shows that some factors are strong predictors of a high severity score. See Table 1.

Table 1. Odds ratio of high severity for different factors

Factor                   High   Low   Total    OR
Comorbidity                71    51     122   2.32
Disability                 25     9      34   3.46
Symptoms of TB             69    48     117   2.38
Relapse                    50    26      76   2.87
Drug Resistance            88    51     139   5.45
Bacillary                  99    86     185   3.60
Higher Education            7    21      28   0.30
Alcoholic                  30    19      49   1.89
Ex-Prisoner                19     8      27   2.78
Smoking                    63    51     114   1.68
Left Lung Affected         89    67     156   3.25
Right Lung Affected        94    83     177   2.44
Lung Capacity Decrease     43    21      64   2.88
Calcification              14    14      28   1.04
Pleurisy                   14     2      16   8.20
Caverns                    58    31      89   3.05

Drug resistance, disability, and bacillary status had the strongest influence on increasing the probability of high severity, and higher education had the strongest influence on increasing the probability of low severity.
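As a worked example, the odds ratios in Table 1 can be reproduced from simple counts. The sketch below is only illustrative: the function name is arbitrary, and the 107/111 split of high- and low-severity training cases is an assumption consistent with the near-balanced distribution in Figure 2 and with the reported odds ratios.

```python
def odds_ratio(high_with, low_with, high_total, low_total):
    """Odds ratio of high severity for patients with vs. without a factor."""
    high_without = high_total - high_with
    low_without = low_total - low_with
    return (high_with / high_without) / (low_with / low_without)

# Assumed totals in the 218-patient training set (approximately balanced, Figure 2).
HIGH_TOTAL, LOW_TOTAL = 107, 111

print(round(odds_ratio(14, 2, HIGH_TOTAL, LOW_TOTAL), 2))   # Pleurisy: ~8.20
print(round(odds_ratio(7, 21, HIGH_TOTAL, LOW_TOTAL), 2))   # Higher Education: ~0.30
```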
2.3 Preprocessing

The images for the ImageCLEF tuberculosis task were provided as NIfTI 3D datasets. We used two different approaches for preprocessing the images. For the first runs (SVR_5, CTR_3) we used a method similar to the one we employed for the ImageCLEF 2018 challenge [4]. We converted the images using med2image, a Python3 utility that converts medical image files into more visually friendly formats such as png and jpg. After reconstructing them in all three planes, we decided to use the coronal plane images, since they had the most images containing areas of abnormal lung. Although we did not visually verify the images of this dataset, tuberculosis usually involves the upper lobes with relatively unaffected lung bases; as a result, axial images through the lung bases could be normal even in patients with severe disease in the upper lobes. Because med2image did not take slice thickness into consideration, the reconstructed coronal images were deformed and of varying heights. To correct this problem, all images were resized to a 512 x 512 matrix. Lung masks were available [3] and were used to select the 200 images with the largest area of lung for each patient. For these first runs, all image equalization and data augmentation was done at training time using the fastai library [5].

For the remaining runs (SVR_1, SVR_2, SVR_3, CTR_1, CTR_2) we used a different approach. We used the nibabel library [6] to convert the NIfTI 3D datasets into NumPy 3D arrays and, using the provided lung masks [3], cropped the arrays to the smallest box enclosing the lungs. We equalized each array and reshaped it to 31-32 slices in either the sagittal or coronal plane with a 256x256 matrix. Using a montage function, we combined the slices into a single image; we did not correct for differences in slice thickness. See Figures 3 and 4. Data augmentation was done at training time using the fastai library.

Fig. 3. Montage of equalized images in the coronal plane.

Fig. 4. Montage of equalized images in the sagittal plane.
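A minimal sketch of this second preprocessing approach is shown below, assuming a NIfTI volume and its lung mask as inputs. The function name, the zoom-based resampling, the axis convention for the coronal and sagittal planes, and the use of scikit-image's equalization and montage utilities are illustrative choices, not necessarily the exact implementation used.

```python
import numpy as np
import nibabel as nib
from scipy.ndimage import zoom
from skimage import exposure
from skimage.util import montage

def volume_to_montage(ct_path, mask_path, plane="coronal", n_slices=32, size=256):
    """Crop a CT volume to the lung bounding box, equalize it, resample to a
    fixed number of slices, and tile the slices into a single 2D montage."""
    vol = nib.load(ct_path).get_fdata()
    mask = nib.load(mask_path).get_fdata() > 0

    # Crop to the smallest box containing the lung mask.
    coords = np.argwhere(mask)
    mins, maxs = coords.min(axis=0), coords.max(axis=0) + 1
    vol = vol[mins[0]:maxs[0], mins[1]:maxs[1], mins[2]:maxs[2]]

    # Histogram-equalize the cropped volume.
    vol = exposure.equalize_hist(vol)

    # Put the chosen plane on the first axis (axis order is an assumption
    # about the NIfTI orientation).
    if plane == "coronal":
        vol = np.transpose(vol, (1, 0, 2))

    # Resample to n_slices x size x size; slice thickness is not corrected.
    factors = (n_slices / vol.shape[0], size / vol.shape[1], size / vol.shape[2])
    vol = zoom(vol, factors, order=1)

    # Tile the slices as a 4 x 8 grid in a single image.
    return montage(vol, grid_shape=(4, 8))
```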
2.4 Neural Network Training

For training the neural networks, we used a workstation with an AMD Ryzen Threadripper 1950X CPU (16 cores, 32 threads), an Nvidia Quadro P6000 GPU, 64 GB of RAM, and a 1 TB solid-state drive. We took advantage of the fastai library to perform transfer learning with convolutional neural networks. We tried the following architectures available in the fastai library: resnet18, resnet34, resnet50, resnet101, resnet152, squeezenet1_0, squeezenet1_1, densenet121, densenet161, densenet169, densenet201, vgg16_bn, vgg19_bn, and alexnet. Resnet50, resnet101, densenet121, densenet161, and densenet169 gave the best results, so we decided to ensemble them. For training the CNNs, image sizes of 224x224, 299x299, and 384x384 were used. The learning rate was chosen after running the learning rate finder function and plotting learning rate versus loss.

2.5 Ensembling Results and Metadata Analysis

Orange [7] was used to create a prediction based on metadata only (SVR_4) and to combine metadata results with neural network results (SVR_1, SVR_2). See Figure 5.

Fig. 5. Example of an Orange3 workflow to compare different machine learning approaches.

3 Results

3.1 CT Report Task

For the CTR_3 submission, for each patient we took the 200 images with the largest lung area, scored each of those images separately using all pretrained CNNs available in the fastai library, and averaged the results. Both the mean AUC and the minimum AUC were low, probably because only a few images of each patient show pathology, and averaging the results decreased the probability of positive findings.

For the CTR_1 and CTR_2 submissions we created a 4x8 montage of sagittal or coronal images for each patient. We scored the sagittal and coronal montages separately with six neural networks. For the CTR_2 submission we ensembled all results, and for the CTR_1 submission we ensembled the three best results.

Table 2. CT Report Task

Run Id   Run                                  Mean AUC   Min AUC
CTR_1    CTR_Cor_32_montage.txt                 0.6631    0.5541
CTR_2    CTR_ReportsubmissionEnsemble2.csv      0.6532    0.5904
CTR_3    TB_ReportsubmissionLimited1.csv        0.5811    0.4111

3.2 Severity Scoring Task

Table 3. Severity Scoring Task

Run Id   Run                                          AUC   Accuracy
SVR_1    SVR_From_Meta_Report1c.csv                 0.7214     0.6838
SVR_2    SVR_Meta_Ensemble.txt                      0.7123     0.6667
SVR_3    SVR_LAstEnsembleOfEnsemblesReportCl.csv    0.7038     0.6581
SVR_4    SVRMetadataNN1_UTF8.txt                    0.6956     0.6325
SVR_5    SVT_Wisdom.txt                             0.627      0.6581

For the SVR_5 submission, we once again took the 200 images with the largest lung area for each patient, scored each of those images separately using all pretrained neural networks available in the fastai library, and averaged the results. Both metrics were low, for reasons similar to those for the CT report task.

Fig. 6. ROC curves of different models trained using only the metadata of the training set, based on 10-fold cross-validation, calculated with the Orange3 workflow from Figure 5

For the SVR_4 submission, we trained the different machine learning models available in Orange3 (Constant, AdaBoost, Tree, CN2 rule inducer, Random Forest, SVM, kNN, Logistic Regression, Neural Network, Naive Bayes) and, based on validation results, selected the top four to ensemble for the submission. See Figure 6.

For SVR_3 we took the results of classifying the 4x8 montages of sagittal or coronal images as high or low severity and ensembled them. Each sagittal and coronal montage was scored by ensembling the results of six neural networks.

For SVR_2 we ensembled SVR_3 with the metadata.

For SVR_1 we used Orange3 to create a model from the training metadata (Comorbidity, Disability, Symptoms of TB, Relapse, Drug Resistance, Bacillary, Higher Education, Alcoholic, Ex-Prisoner, Smoking) and the training report labels (Left Lung Affected, Right Lung Affected, Lung Capacity Decrease, Calcification, Cavity, Pleurisy); for the prediction we used the test metadata together with the results from CTR_1 for the same report labels. Although we tried Constant, AdaBoost, Tree, CN2 rule inducer, Random Forest, SVM, kNN, Logistic Regression, Neural Network, and Naive Bayes models, after evaluating the validation results we used only the SVM, Logistic Regression, Neural Network, and Naive Bayes models in the ensemble for the final submission.
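As a concrete illustration of the ensembling used in these runs, the sketch below averages per-network probabilities within and across the sagittal and coronal planes (as in SVR_3) and then combines CT-report predictions with the metadata (as in SVR_1). scikit-learn's LogisticRegression is used here only as a stand-in for the Orange3 workflow, and all names and array shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_planes(probs_sagittal, probs_coronal):
    """probs_*: (n_networks, n_patients) arrays of predicted probabilities of
    high severity; average across networks, then across the two planes."""
    return 0.5 * (probs_sagittal.mean(axis=0) + probs_coronal.mean(axis=0))

def combine_with_metadata(meta_train, report_train, y_train,
                          meta_test, report_test_pred):
    """Fit on training metadata plus ground-truth report labels, then predict
    the test set from its metadata plus the CTR_1 report predictions."""
    X_train = np.hstack([meta_train, report_train])
    X_test = np.hstack([meta_test, report_test_pred])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.predict_proba(X_test)[:, 1]   # probability of high severity
```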
4 Conclusion

In this paper, we presented the use of transfer learning to quickly train CNNs to classify the severity of tuberculosis and to detect different pathological manifestations of tuberculosis.

5 Perspectives for Future Work

The training dataset for the CT report task was imbalanced, with only a few cases of calcification or pleurisy, and we did not try to compensate for this imbalance; doing so may improve results. We trained the neural networks as a multilabel task on the same set of equalized images. Using images with different windows to enhance calcifications, training neural networks to detect only calcifications or only caverns, and using window settings that visually enhance air within the lungs may improve results. Using Hounsfield units from the original images instead of the values in the png files may also be more accurate. As our best results for the severity task came from combining the results of the CT report task with the metadata, improving the CT report results should improve the severity results as well.

References

1. Ionescu, B., H. Müller, R. Péteri, Y.D. Cid, V. Liauchuk, V. Kovalev, D. Klimuk, A. Tarasau, A.B. Abacha, S.A. Hasan, V. Datla, J. Liu, D. Demner-Fushman, D.-T. Dang-Nguyen, L. Piras, M. Riegler, M.T. Tran, M. Lux, C. Gurrin, O. Pelka, C.M. Friedrich, A. García Seco de Herrera, N. García, E. Kavallieratou, C.R. del Blanco, C.C. Rodríguez, N. Vasillopoulos, K. Karampidis, J. Chamberlain, A. Clark, and A. Campello: ImageCLEF 2019: Multimedia Retrieval in Medicine, Lifelogging, Security and Nature. In: Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019), Lugano, Switzerland. Lecture Notes in Computer Science (LNCS), Springer, 2019.
2. Cid, Y.D., V. Liauchuk, D. Klimuk, A. Tarasau, V. Kovalev, and H. Müller: Overview of ImageCLEFtuberculosis 2019 - Automatic CT-based Report Generation and Tuberculosis Severity Assessment. CLEF 2019 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-2380/, 2019.
3. Cid, Y.D., O.A. Jiménez-del-Toro, A. Depeursinge, and H. Müller: Efficient and fully automatic segmentation of the lungs in CT volumes. In: Goksel, O., et al. (eds.) Proceedings of the VISCERAL Challenge at ISBI. No. 1390 in CEUR Workshop Proceedings, 2015.
4. Gentili, A.: ImageCLEF2018: Transfer Learning for Deep Learning with CNN for Tuberculosis Classification. CEUR Workshop Proceedings, 2018.
5. Howard, J., et al.: fastai. GitHub, 2018.
6. Brett, M., M. Hanke, C. Markiewicz, M.-A. Côté, P. McCarthy, and C. Cheng: nipy/nibabel: 2.3.3. Zenodo, 2019.
7. Demšar, J., T. Curk, et al.: Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research 14, 2349-2353, 2013.