PUC Chile team at TBT Task: Diagnosis of
Tuberculosis Type using segmented CT scans
José Miguel Quintana1 , Daniel Florea1 , Ria Deane1 , Denis Parra1 , Pablo Pino1 ,
Pablo Messina1 and Hans Löbel1
1
    Department of Computer Science, School of Engineering, Pontificia Universidad Católica de Chile, Chile


                                         Abstract
                                         This article describes the participation and results of the PUC Chile team in the Tuberculosis task in
                                         the context of the ImageCLEFmedical challenge 2021. We were ranked 7th based on the kappa metric and
                                         4th in terms of accuracy. We describe three approaches we tried in order to address the task. Our best
                                         approach used 2D images visually encoded with a DenseNet neural network, whose representations
                                         were concatenated to finally output the classification with a softmax layer. We describe this approach
                                         and two others in detail, and we conclude by discussing some ideas for future work.


1 Introduction
ImageCLEF [1] is an initiative with the aim of advancing the field of image retrieval (IR) as
well as enhancing the evaluation in various fields of IR. The initiative takes the form of several
challenges, and it is especially aware of the changes in the IR field in recent years, which have
brought about tasks requiring the use of different types of data such as text, images and other
features moving towards multi-modality. ImageCLEF has been running annually since 2003, and
since the second version (2004) there are medical images involved in some tasks, such as medical
image retrieval. Since then, new tasks involving medical images have been integrated into
the ImageCLEFmedical challenge group of tasks [2], and that is how the task of Tuberculosis
type classification has been taking place since 2017. Although there have been changes in the
data used for the newest versions of the challenge, the goal of this task is the same: automatic
detection of tuberculosis (TB) types using Computed Tomography (CT) volumes as input data.
   In this document we describe the participation of our team from the HAIVis group (footnote 1) within the
artificial intelligence laboratory (footnote 2) at PUC Chile (PUC Chile team) in the TB classification task
at ImageCLEFmedical 2021 [2]. Our team earned 7th place in terms of the kappa metric and
4th place in terms of accuracy in the challenge. Our best submission was a combination of
deep learning techniques for two 2D views of each input CT volume, followed by a traditional
multi-class classification via a softmax layer.
   The rest of the paper is structured as follows: Section 2 describes our data analysis, and
in section 3 we provide details of our proposed approaches, including data augmentation. In
section 4 we provide details of our results, and finally in section 5 we conclude our article.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
josemiguelquinta@uc.cl (J. M. Quintana); dparra@ing.puc.cl (D. Parra); halobel@ing.puc.cl (H. Löbel)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1 http://haivis.ing.puc.cl/
2 http://ialab.ing.puc.cl/

Figure 1: CT scans of different types of tuberculosis


2 ImageCLEFmed Tuberculosis: tasks, data, evaluation
The challenge for the 2021 ImageCLEFmed Tuberculosis (TB) task is to automatically categorize
a CT scan into one of five TB types. The generated prediction must indicate one label for each
image, specifying the type of TB it contains.
   The total dataset consists of 1,338 CT scans of TB patients, 917 assigned for training and 421
for testing. Each CT scan corresponds to exactly one TB category. Of the 917 training scans,
420 have Infiltrative TB, 226 Focal, 101 Tuberculoma, 100 Miliary and 70 Fibro-cavernous.
Because we segment each CT scan into its left and right lungs, we end up with twice as many
images.
   The task is therefore a multi-class classification problem. To rank submissions,
each result is evaluated using the unweighted Cohen’s Kappa as the primary metric and accuracy
as the secondary metric.
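
The two ranking metrics can be illustrated with a minimal, dependency-free sketch. The function names below are our own and are not part of the challenge tooling; the kappa follows the standard unweighted definition (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the two marginal frequencies, summed over labels.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Degenerate case (a single label everywhere): kappa is undefined,
        # this sketch returns 0.0 by its own convention.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


# A classifier that always predicts the same class gets kappa 0,
# whatever its accuracy is.
constant = cohen_kappa([0, 0, 1, 1], [0, 0, 0, 0])
```

This is why kappa is informative under class imbalance: always predicting the majority class can reach a non-trivial accuracy while scoring a kappa of zero.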

2.1 Dataset Analysis
Before carrying out our different approaches to classify each CT scan into a type of tuberculosis,
we studied the training dataset. Figure 2 shows the number of images in each class.
  Table 1 summarizes this information and presents the relative prevalence of each type of
Tuberculosis with respect to the total amount of images in the training dataset. It is important
to note the clear class imbalance present, where nearly half of all images have Infiltrative
Tuberculosis.
Figure 2: Number of images by class


Table 1
Number of images of each type of Tuberculosis and their percentage of the training dataset
                   TB Type       Number of images       Percentage of dataset (%)
               Infiltrative           420                         45.8
               Focal                  226                         24.6
               Tuberculoma            101                         11.0
               Miliary                100                         10.9
               Fibro-cavernous         70                          7.6
               Total                  917                        100.0


3 Approaches
3.1 The Human-Inspired 2D Approach
After asking medical professionals how they would personally address the challenge, we
hypothesized that a model could perform better if it were trained and evaluated on the same data
that doctors review when analyzing CT scans, that is, top-down views of the volume ordered from
top to bottom.
  The main reason for taking this approach was that human-inspired designs have proven helpful in
other computer vision tasks [3], and we found no documentation about the subject. We aim to
investigate it further in future work.
3.1.1 Preprocessing
First, the segmentation masks provided by ImageCLEF were applied to each 3D CT scan in order
to separate the lungs of each patient into two separate inputs. This practice was validated by
medical professionals, as diseases such as Tuberculosis tend to manifest in both lungs at the same time.
Second, each 3D matrix was split into 2D images viewed along the z-axis. This step gives the
approach its name: radiologists only use top-down images to review CT scans, mainly because
in other views noise from other parts of the internal structure of the lungs can mislead the
professional in their diagnosis. The same reasoning was applied when training the model. The
medical professionals also commented that tuberculosis often concentrates in the upper part of
the lungs. Based on this observation, the bottom 20% of the images of each scan were discarded.
   After this first preprocessing step, each image was normalized and later concatenated in order to
produce RGB images. This task was performed by the dataset object, which was also in charge of
applying augmentations to each item when loaded and then transforming it into a tensor.
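
The masking, lung separation and slice selection described above could look roughly like the following NumPy sketch. Several details are assumptions made for illustration only: the volume layout (z, y, x) with index 0 at the top of the lungs, a mask that labels the two lungs as 1 and 2, and the background fill value of -1024. The function names are hypothetical.

```python
import numpy as np


def split_lungs(volume, mask, background=-1024):
    """Return one masked copy of the volume per lung.

    Assumes (hypothetically) that the mask uses label 1 for one lung and
    label 2 for the other; voxels outside the lung are set to `background`.
    """
    return [np.where(mask == label, volume, background) for label in (1, 2)]


def axial_slices(lung, discard_bottom=0.20):
    """Split a (z, y, x) volume into 2D top-down slices.

    The last `discard_bottom` fraction of slices (the bottom of the scan,
    under the assumed orientation) is dropped.
    """
    n_slices = lung.shape[0]
    keep = n_slices - int(round(n_slices * discard_bottom))
    return [lung[z] for z in range(keep)]


# Toy example: a 10-slice volume whose left half belongs to lung 1
# and whose right half belongs to lung 2.
volume = np.arange(10 * 4 * 4).reshape(10, 4, 4)
mask = np.zeros_like(volume)
mask[:, :, :2] = 1
mask[:, :, 2:] = 2
lungs = split_lungs(volume, mask)
slices = axial_slices(lungs[0])
```

With 10 slices per lung, 8 slices remain after discarding the bottom 20%. Normalization and the grouping of slices into RGB images are not covered here, since the text does not specify them.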

3.1.2 Augmentations
Each image was cropped, re-centered and randomly scaled along both the x and y axes.
After this, a rotation was applied to the image, as well as shear and a random horizontal flip.
The flip was performed in order to avoid biases between left and right lungs.
   We considered flipping all images to the same side, but rejected this idea because
human professionals are skilled enough to detect the presence of tuberculosis in a lung
regardless of its orientation or which of the two lungs the scan shows.
   As for the parameters used, the cropping center for each axis was chosen by taking the
original center of the image and displacing it by a random number of pixels between 0 and 32.
The rotation angle and the shear were random floats between 0 and 6.0 and between 0 and 4.0,
respectively. Finally, for the scaling, each axis was multiplied by $2^x$, with $x$ being a random
float between 0 and 0.15.
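
As an illustration of these ranges, the sketch below samples one set of augmentation parameters using only the standard library. The function and key names are hypothetical, the flip probability of 0.5 is an assumption (the text only says the flip is random), and applying the resulting transform to an image is left out.

```python
import random


def sample_augmentation_params(rng=random):
    """Draw one random set of augmentation parameters.

    `rng` can be the `random` module or a `random.Random` instance.
    """
    return {
        # Displacement of the crop center from the image center, in pixels.
        "center_shift_x": rng.randint(0, 32),
        "center_shift_y": rng.randint(0, 32),
        # Rotation angle and shear, sampled as floats.
        "angle": rng.uniform(0.0, 6.0),
        "shear": rng.uniform(0.0, 4.0),
        # Per-axis scale factor 2**x with x in [0, 0.15].
        "scale_x": 2 ** rng.uniform(0.0, 0.15),
        "scale_y": 2 ** rng.uniform(0.0, 0.15),
        # Assumed flip probability of 0.5.
        "horizontal_flip": rng.random() < 0.5,
    }


params = sample_augmentation_params(random.Random(0))
```

Sampling the exponent rather than the factor means the scale factor always lies between 1 and 2^0.15 (about 1.11).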


Figure 3: Multiple augmentations on the same crop of a CT Scan
3.1.3 Model
The model used was a lightly modified DenseNet121 pretrained on ImageNet. We used the Adam
optimizer to speed up training, given the large number of images used during
the training process. For the loss, a weighted cross entropy was used in order to mitigate the
class imbalance. Finally, the learning rate was reduced when the validation loss reached a plateau.
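
The text does not state how the class weights of the weighted cross entropy were computed. One common choice, shown here purely as an assumption, is inverse-frequency weighting, w_c = N / (K * n_c), using the training counts from Table 1. All names in this sketch are hypothetical.

```python
import math

# Training counts per class, taken from Table 1.
TRAIN_COUNTS = {
    "Infiltrative": 420,
    "Focal": 226,
    "Tuberculoma": 101,
    "Miliary": 100,
    "Fibro-cavernous": 70,
}


def inverse_frequency_weights(counts):
    """Weight each class by N / (K * n_c), so rare classes weigh more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def weighted_cross_entropy(probs, target, weights):
    """Cross entropy of one sample, scaled by the weight of its true class."""
    return -weights[target] * math.log(probs[target])


CLASS_WEIGHTS = inverse_frequency_weights(TRAIN_COUNTS)
```

Under this scheme an error on a Fibro-cavernous scan costs about six times as much as the same error on an Infiltrative scan (420 / 70 = 6).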
   The model was trained on individual images, receiving an image and a label as input,
and failed to converge to a good solution for the test set even after many epochs of training. We
suspected that this was due to the large number of input images that contained no relevant
information. This was later confirmed by inspecting the output predictions
for images belonging to the same CT scan: clusters of correct predictions were found in the
middle of many differing predictions, and a medical expert later identified them as the exact
slices in which the disease was present.
   Due to a lack of time, we could not implement a solution to this issue. Given the uncertainty
about the effectiveness of this approach, it is left for future work.

3.2 A Simple 2D Approach
This approach was a modified version of the one proposed in [4], which represents 3D images
by squeezing the volumetric data into 2D projections. Unfortunately, it did not work as expected
due to an exploding gradient problem which, because of a lack of time, we were not
able to resolve.
  We believe it is still useful to explain this approach and study the reasons it did not work.

3.2.1 Preprocessing
As mentioned in 3.2, this approach was a modified version of the one proposed in [4]. In this case,
instead of calculating the mean, maximum and standard deviation in each dimension, we only
calculated the maximum along each axis and created a single 2D, three-channel image from the
three resulting matrices, which can be interpreted as an RGB image.
   First of all, we applied the segmentation mask provided by ImageCLEF to each 3D CT-scan,
dropped the first 10% and last 20% of the image to eliminate unnecessary information, and
then calculated the maximum values across each dimension. This produced three 2D matrices,
which were subsequently concatenated to produce one single RGB image. We only
calculated the maximum because we believe that, since tuberculosis manifests as
various small nodules spread across the lungs, the average and the standard
deviation across dimensions are not accurate measures for determining whether there is tuberculosis in
the lungs. This is mainly because most of the lung is empty space, so the average and standard
deviation would be close to zero. In contrast, the maximum would show whether there are
higher values present in the lung, an indication of tuberculosis, which we hypothesize could
also help determine the type of tuberculosis.
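
A minimal NumPy sketch of the maximum-projection idea follows. It is an illustration under assumptions, not the code used in our experiments: the helper names are hypothetical, the nearest-neighbour resize stands in for whatever resizing was actually used, and the default output size is arbitrary. Masking and the removal of the first 10% and last 20% of the image are not shown.

```python
import numpy as np


def resize_nearest(image, size):
    """Nearest-neighbour resize of a 2D array to (size, size)."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows[:, None], cols[None, :]]


def max_projection_rgb(volume, size=256):
    """Stack the maximum projection along each axis into one 3-channel image.

    The three projections generally have different shapes, so each one is
    resized to (size, size) before stacking.
    """
    channels = [resize_nearest(volume.max(axis=a), size) for a in range(3)]
    return np.stack(channels, axis=-1)


# Toy volume: empty space with a single bright voxel, standing in for a nodule.
volume = np.zeros((6, 8, 10))
volume[2, 3, 4] = 5.0
image = max_projection_rgb(volume, size=16)
```

The toy volume shows the motivation given above: a single bright voxel survives in every maximum projection, while its contribution to the mean of the same volume would be diluted by the surrounding empty space.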
3.2.2 Augmentations
The augmentations applied in this approach were resizing, to have three equally sized channels,
random horizontal flips, and normalization of the final image.

3.2.3 Model
The model was a single network: a fine-tuned ResNet50 pretrained on ImageNet,
available from the Torchvision library in PyTorch (footnote 3). We used Cross Entropy as the loss
function, since the challenge is a multi-class classification problem, and Adam as the
optimization algorithm. To regularize the model, we applied dropout with a drop rate of
0.3. Additionally, we set class weights in our loss function to address the dataset
imbalance.
   As mentioned earlier, this approach presented exploding gradient problems. When these
started to appear, we implemented gradient clipping and varied the learning rate. The
learning rate that performed best before presenting exploding gradients was 0.0001.
Unfortunately, this was not enough to resolve the problem. We believe this is because using
only the maximum, although normalized, still caused all images to present high values and
prevented the network from learning correctly.
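
The text does not say which form of gradient clipping was used. As an illustration only, the sketch below shows clipping by global norm, one common variant, written in plain NumPy so it does not depend on a particular deep learning framework. The function name and the small epsilon are our own choices.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    # The scale is capped at 1.0, so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


# A gradient of norm 5 is scaled down to (almost exactly) norm 1.
clipped, norm_before = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
```

Clipping bounds the size of each update but keeps its direction, so it cannot compensate for inputs whose scale is badly matched to the pretrained network.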
   In the following subsection, we present a different approach which did not present exploding
gradient issues but also implemented a modification of the volumetric squeezing approach
presented in [4].

3.3 The 2D Approach Scoring the Best
In this approach, the first step was to apply the segmentation masks provided by
ImageCLEF, specifically using the method proposed in [5], to separate the lungs in each image
and remove the irrelevant parts of the CT scan. After that, we divided the segmented image
in half, obtaining two 3D matrices, one for each lung.
   Each 3D CT scan can be reduced to a 2D representation by computing different statistics
across each dimension of the image. We used the procedure suggested in [4], which consists
of computing the mean, maximum and standard deviation over the 3D images, but applied
it only over the 1st and 2nd axes (lateral and superior). In this way we obtained 2 images per
matrix, each of which can be interpreted as an RGB image. After that we applied the preprocessing
steps implemented in [4], which consist of increasing the voxel intensities in the CT by 1024 HU,
dividing the maximum values by 1500, and dividing the mean values and standard deviation
values by their maximum. Additionally, we resized the images to 256 x 256 pixels. In the end,
we have 2 RGB images for each lung, which represent its lateral and superior views.
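
The squeezing and normalization steps can be sketched in NumPy as follows. This is a hedged illustration rather than our exact implementation: the function names are hypothetical, the channel order (mean, maximum, standard deviation) and the mapping of the two views to array axes 0 and 1 are assumptions, and the final resize to 256 x 256 is omitted.

```python
import numpy as np


def squeeze_view(volume_hu, axis, eps=1e-8):
    """Collapse one axis of a CT volume into a 3-channel 2D image.

    Channels (assumed order): mean, maximum, standard deviation.
    """
    shifted = volume_hu.astype(np.float32) + 1024.0   # shift intensities by 1024 HU
    max_img = shifted.max(axis=axis) / 1500.0          # maximum divided by 1500
    mean_img = shifted.mean(axis=axis)
    std_img = shifted.std(axis=axis)
    mean_img = mean_img / max(mean_img.max(), eps)     # mean divided by its maximum
    std_img = std_img / max(std_img.max(), eps)        # std divided by its maximum
    return np.stack([mean_img, max_img, std_img], axis=-1)


def lung_views(volume_hu, axes=(0, 1)):
    """Return one squeezed image per selected axis (two views per lung)."""
    return [squeeze_view(volume_hu, axis) for axis in axes]


# Toy lung: air everywhere (-1024 HU) except one denser voxel at 476 HU.
lung = np.full((4, 5, 6), -1024.0)
lung[1, 2, 3] = 476.0
views = lung_views(lung)
```

The `eps` guard only prevents a division by zero for a completely empty volume; it is not part of the procedure described in the text.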


3 https://pytorch.org/vision/stable/models.html
Figure 4: Images after preprocessing


3.3.1 Augmentations
The images pass through multiple augmentations, each one of them with a varying probability
of being applied. In summary, the augmentations implemented were image rotation, with a
rotation range of 25 degrees, width and height shift, with a shift range of 15% of the image,
zoom, with a range of 20% of the image, and horizontal and vertical flipping.

3.3.2 Model
With the 2 RGB images for each lung obtained as described in 3.3, we applied the augmentations
in 3.3.1 to obtain the images fed to the model. We trained one network for each axis, using a
fine-tuned DenseNet121 pre-trained on ImageNet, with average pooling in the output layer, and
added some layers at the end to reduce dimensions. Subsequently, we concatenated the outputs
of the two networks and added a softmax layer on top to get the final prediction. Using no
regularization, we obtained a Cohen’s Kappa on training and validation of approximately
0.11, whereas with L2 regularization we got 0.236 with 0.511 accuracy on the training set,
and 0.186 with 0.467 accuracy on our validation set. On the test set we got a Cohen’s Kappa of
0.120 with an accuracy of 0.401.
   In an effort to reduce the impact of the most represented classes in the dataset, we tried
weighting the loss of each label according to its representation in the dataset. With this setup,
we tried multiple fine-tuning configurations of the network, achieving the best results
with the last twenty layers unfrozen. Using that configuration, we scored a Cohen’s Kappa of
0.206 on the training set and 0.105 on the validation set.
   We also tried sharing weights between the networks of the two axes, in order to reduce the
number of parameters and overfitting. This did not improve the results: the best Cohen’s
Kappa score was 0.141 on training and 0.098 on the validation set.
Figure 5: Visual representation of the model


3.3.3 Evaluation of the Model
In order to evaluate the model, we preprocessed the data as explained in 3.3, resulting in 2
images per lung. Then, we obtained the predictions for each lung, summed the softmax
outputs, and kept the class with the highest summed value as the final prediction.
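
This aggregation rule can be sketched in a few lines of NumPy. The function names are hypothetical and the logits in the example are made up for illustration; they are not model outputs.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def predict_scan(logits_per_lung):
    """Sum the per-lung softmax outputs and return the winning class index."""
    probs = softmax(np.asarray(logits_per_lung, dtype=np.float64))
    summed = probs.sum(axis=0)
    return int(summed.argmax()), summed


# Illustrative logits for the two lungs of one scan, over the five TB types.
label, summed = predict_scan([[2.0, 0.0, 0.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0, 0.0, 0.0]])
```

Summing probabilities lets a confident prediction on one lung outweigh a weak prediction on the other: here the first lung favours class 0 more strongly than the second favours class 1, so the scan is assigned class 0.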


4 Results
In this section we present the results of our approaches on our own development servers and also
on the crowdai.org platform.
   Table 2 shows the results obtained while training, including a fourth approach that was not
submitted.

Table 2
Train/validation results for each approach in our development servers
    Rank                Approach                  Train CK     Train Acc   Val. CK   Val. Acc
       1     Best performing 2D Approach             0.236       0.511      0.186     0.467
       2     2D Approach with shared weights         0.141       0.316      0.098     0.194
       3     Human inspired approach                -0.040       0.337        -          -
       4     Simple 2D Approach                      0.000       0.472        -          -

   Table 3 presents the evaluation metrics obtained for each approach when evaluated with the
test dataset.

Table 3
Final results for each approach at crowdai.org
              Rank              Approach             Kappa (K)    Accuracy (Acc)
                 1     Best performing 2D Approach      0.120         0.401
                 2     Human inspired approach         -0.040         0.337
                 3     Weighted random choice*         -0.048         0.245
   *Weighted based on class prevalence
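
A weighted random baseline of this kind can be sketched as follows. We assume here, for illustration, that the prevalence is taken from the training counts in Table 1; the function name and the fixed seed are hypothetical, and the sketch is not expected to reproduce the exact scores in Table 3.

```python
import random

# Training counts per class, taken from Table 1.
TRAIN_COUNTS = {
    "Infiltrative": 420,
    "Focal": 226,
    "Tuberculoma": 101,
    "Miliary": 100,
    "Fibro-cavernous": 70,
}


def weighted_random_predictions(n_scans, counts=TRAIN_COUNTS, seed=0):
    """Draw one label per scan with probability proportional to class prevalence."""
    rng = random.Random(seed)
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n_scans)


# One random label for each of the 421 test scans.
baseline = weighted_random_predictions(421)
```

Because such predictions are independent of the scans, their expected Cohen's Kappa is zero, which makes this baseline a useful reference point for the other rows.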


5 Conclusions
In this article we have provided details of the participation of the PUC Chile team in the
Tuberculosis type (TBT) classification task within the ImageCLEFmedical challenge 2021. In the
process of building our final submission, we tested several approaches. Our final submission was
based on a DenseNet architecture for visually encoding the input medical volumes represented
as 2D images, followed by a softmax classification layer. In future work, we plan to address
the task based on the process that actual radiologists follow when classifying CT scans for
Tuberculosis, which we described in section 3.1. Another idea we plan to further investigate is
using perceptual image similarity [6] to leverage approaches based on K-NN, which have had
interesting results in previous challenges. Finally, we plan to use methods which can directly
deal with volumes (3D images) rather than 2D images.


Acknowledgements
This work was partially funded by ANID - Millennium Science Initiative Program - Code
ICN17_002 and by ANID, FONDECYT grant 1191791.
References
[1] B. Ionescu, H. Müller, R. Péteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A.
    Hasan, S. Kozlovski, V. Liauchuk, Y. Dicente, V. Kovalev, O. Pelka, A. G. S. de Herrera,
    J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D.
    Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid,
    A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval
    in medical, nature, internet and social media applications, in: Experimental IR Meets Multi-
    linguality, Multimodality, and Interaction, Proceedings of the 12th International Conference
    of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer,
    Bucharest, Romania, 2021.
[2] S. Kozlovski, V. Liauchuk, Y. Dicente Cid, V. Kovalev, H. Müller, Overview of ImageCLEFtu-
    berculosis 2021 - CT-based tuberculosis type classification, in: CLEF2021 Working Notes,
    CEUR Workshop Proceedings, CEUR-WS.org, Bucharest, Romania, 2021.
[3] O. Mendez, S. Hadfield, N. Pugeault, R. Bowden, Sedar: Reading floorplans like a hu-
    man—using deep learning to enable human-inspired localisation, International Journal of
    Computer Vision 128 (2020) 1286–1310.
[4] R. Miron, C. Moisii, M. Breaban, Revealing lung affections from CTs. A comparative analysis
    of various deep learning approaches for dealing with volumetric data, arXiv preprint
    arXiv:2009.04160 (2020).
[5] V. Liauchuk, V. Kovalev, ImageCLEF 2017: Supervoxels and co-occurrence for tuberculosis CT
    image classification, in: CLEF2017 Working Notes, CEUR Workshop Proceedings,
    CEUR-WS.org, Dublin, Ireland, 2017.
[6] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of
    deep features as a perceptual metric, 2018. arXiv:1801.03924.