=Paper= {{Paper |id=Vol-2696/paper_64 |storemode=property |title=Ensemble of Deep Learning Models for Automatic Tuberculosis Diagnosis Using Chest CT Scans: Contribution to the ImageCLEF-2020 Challenges |pdfUrl=https://ceur-ws.org/Vol-2696/paper_64.pdf |volume=Vol-2696 |authors=Abdela Ahmed Mossa,Halit Eriş,Ulus Çevik |dblpUrl=https://dblp.org/rec/conf/clef/MossaEC20 }} ==Ensemble of Deep Learning Models for Automatic Tuberculosis Diagnosis Using Chest CT Scans: Contribution to the ImageCLEF-2020 Challenges== https://ceur-ws.org/Vol-2696/paper_64.pdf
    Ensemble of Deep Learning Models for
 Automatic Tuberculosis Diagnosis Using Chest
CT Scans: Contribution to the ImageCLEF-2020
                 Challenges

 Abdela A. Mossa1[0000−0002−6168−5002], Halit Eriş2[0000−0002−2384−5052], and
                     Ulus Çevik2[0000−0002−0956−9725]
   1 Department of Computer Engineering, Çukurova University, Adana, Turkey
   2 Department of Electrical-Electronics Engineering, Çukurova University, Adana,
            Turkey (amossa@student., heris@, ucevik@)cu.edu.tr




        Abstract. Tuberculosis (TB) is a bacterial infection that mainly affects
        the lungs. It is a potentially serious disease that kills around 2 million
        people a year. Nevertheless, it can be cured if treated with the right
        antibiotics. Manual diagnosis of TB can be difficult, however, and
        clinicians usually conduct several tests. Consequently, automated
        diagnosis of TB based on chest Computed Tomography (CT) images, which
        promises rapid and accurate results, is currently of great interest.
        Recently, deep learning algorithms, and in particular convolutional
        neural networks (CNNs), have been shown to be the state of the art in
        automatic medical image analysis because they can learn low- and
        high-level discriminative features directly from images in an
        end-to-end architecture. In this work, we developed a deep learning
        model for automated TB diagnosis using an ensemble of different CNN
        architectures trained on 2D images sliced from volumetric chest CT
        scans. The CNN-based methods proposed in this study include Multi-View
        and Triplanar CNN architectures using pre-trained AlexNet, VGG11, VGG19
        and GoogLeNet feature extraction layers as a backend. Using five-fold
        cross validation, the average AUC, Accuracy, Sensitivity and
        Specificity of the proposed ensemble method were 0.799, 77.1, 0.57 and
        0.824, respectively, for multi-label binary classification on the
        ImageCLEFtuberculosis 2020 training dataset of the lung-based automated
        CT report generation task, part of a well-benchmarked public challenge
        that has run every year since 2017. The result shows the strength of
        our model, trained on a small dataset with highly unbalanced label
        distributions, and led to 4th place on the leaderboard, with a mean AUC
        of 0.767 on the test dataset.

        Keywords: Automatic CT Report Generation · Deep Learning · Con-
        volutional Neural Network · Tuberculosis Diagnosis · 3D Medical Image
        Analysis.

 Copyright © 2020 for this paper by its authors. Use permitted under Creative
 Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25
 September 2020, Thessaloniki, Greece.
1   Introduction

Tuberculosis (TB) is a highly contagious disease that typically attacks the lungs.
Every year, approximately ten million people become infected with TB, with
around one and a half million deaths, making the disease a global health
problem [1]. Even though much research has been done to reduce the spread of
TB in society, the 2019 report by the World Health Organization (WHO) [2]
indicates that TB remains among the top ten causes of death worldwide and is
epidemic in 202 countries and territories (see Table 1).


Table 1. Number of countries and territories that reported the TB incidents to WHO
in 2019.

                         Regions                Numbers
                         Africa                 46
                         European               45
                         Region of the Americas 43
                         Western Pacific        35
                         Eastern Mediterranean 22
                         South-East Asia        11
                         Global                 202



    Computed Tomography (CT) is one of the most commonly used non-invasive
medical imaging techniques in the diagnosis and management of patients with
TB [3]. A volumetric chest CT scan of a person with suspected TB is obtained
and examined either for abnormalities suggestive of TB or for detection of any
kind of TB abnormality. It helps physicians visualize lesions with specific
manifestations in the lung tissue altered by tuberculosis [4]. However, CT
generates thousands of images per patient, which makes reading them
time-consuming and subjective, and can make a high level of performance
impossible to achieve in the absence of expert radiologists [5]. Hence, the
development of computer-aided diagnosis (CAD) techniques to assist physicians
in tuberculosis detection and diagnosis has attracted much attention from
researchers at the intersection of medicine and artificial intelligence [6–9].
    Deep learning (DL) [10] based CAD algorithms, especially convolutional
neural networks (CNNs) [11], learn visual patterns directly from images with
minimal pre-processing and without an intermediate step involving experts, and
have recently proven effective in medical imaging and other computer vision
applications [12–14]. Along these lines, as part of CLEF (Conference and Labs
of the Evaluation Forum), a series of campaigns carried out in the information
retrieval domain since 2000, ImageCLEF 2020 presented an evaluation campaign
that invites researchers around the world to participate in the
ImageCLEFtuberculosis task, which is running for the fourth consecutive
year [15, 16].
    The task provided by the ImageCLEFtuberculosis organizers varies from year
to year. Last year the tasks were Severity Score Prediction (SVR) and CT-based
automatic report generation (CTR), both based on volumetric chest CT scans and
clinical information of patients [17]. This year's challenge
(ImageCLEFtuberculosis 2020), however, was lung-based automatic CT report
generation based solely on CT images [18]. In last year's tuberculosis
challenge, even though we participated for the first time, our Multi-View CNN
based approach ranked 4th with a mean AUC of 0.707 [19]. Since that approach
produced competitive results, we decided to improve it and adapt it to the
requirements of this year's challenge. In this study, we therefore developed a
novel CAD system for automated TB diagnosis by using different Multi-View and
Triplanar CNN architectures with an ensemble method on chest CT images. We
developed the CNN architectures using pre-trained AlexNet [20], GoogLeNet [21],
and VGG [22] feature extraction layers as a backend.
    This paper is structured as follows. Section 2 presents the dataset, image
pre-processing, CNN architectures and ensemble methods used in this work.
Results and discussion are reported in Section 3. Finally, Section 4 concludes
the paper and outlines future work.


2     Materials and Methods
2.1   Dataset and Image Pre-processing
The training and test datasets provided by the ImageCLEFtuberculosis 2020
task organizers consist of 403 studies of people with TB, which the organizers
divided into 283 training and 120 test studies. Each study contains the
patient's volumetric chest CT scan stored in NIFTI file format, automatically
extracted masks of the lungs obtained with the algorithm discussed in [23], and
six lung-based diagnosis labels:
    (i) LeftLungAffected (LL) - binary label for presence of any TB lesions in
the left lung;
    (ii) RightLungAffected (RL) - binary label for presence of any TB lesions in
the right lung;
    (iii) CavernsLeft (CVL) - binary label for presence of caverns in the left
lung;
    (iv) CavernsRight (CVR) - binary label for presence of caverns in the right
lung;
    (v) PleurisyLeft (PL) - binary label for presence of pleurisy in the left lung;
    (vi) PleurisyRight (PR) - binary label for presence of pleurisy in the right
lung.
    The training dataset provided by the task organizers is highly imbalanced:
there are more positive than negative cases for the LL and RL labels, and
fewer positive than negative cases for the other diagnosis labels. The PL
label has the most unbalanced distribution in the dataset, with only about
2.5% of the training cases being positive. Even the CVR label, which is better
balanced than the other labels, has only 27.9% of the training cases labelled
positive. Fig. 1 depicts the number of positive and negative patients for each
of the six diagnosis labels of the training dataset. More details about the
datasets can also be found in [17].
    All the volumetric chest CT scans have a size of 512 × 512 × k, where the
image length and width are 512 and k is the number of slices in the axial
plane, varying from 47 to 264 for the training dataset and from 101 to 258 for
the test dataset. We used the training dataset to develop a model that
generates multi-label binary classification predictions for the three
diagnosis conditions of each lung. In other words, our model simultaneously
predicts whether a certain condition is present ("positive", numerically 1) or
absent ("negative", numerically 0) for each of the three diagnosis labels of
each lung.




Fig. 1. Distribution of the positive and negative cases across the different diagnosis
labels. For LL and RL, the majority of cases are “Positive” compared to the minority
of “Negative” cases. However, for CVL, CVR, PL and PR labels the majority of cases
are “Negative”.



    As we planned to leverage 2D CNN models pre-trained on natural images
of a fixed image resolution, we reformatted each 3D chest CT scan into groups
of stacked 2D slices in the axial, coronal and sagittal views. Each axial
slice was then cropped to a fixed size of 256 × 256 pixels around the left and
right lung regions, respectively. Similarly, we cropped each sagittal and
coronal slice to a fixed size of 128 × 256 pixels around the left and right
lung regions, respectively. The rectangular bounding box locations around each
lung were selected through visual inspection of a few mid-level slices using
the provided segmentation masks. To avoid processing background that does not
contain any lung tissue, and to process the scans within the memory constraints
of the GPU, only 30 axial, 60 coronal and 60 sagittal mid-level slices were
selected from each volumetric chest CT exam. In addition, to avoid the effect
of image enlargement on the models' classification performance, each pair of
consecutive sagittal slices and each pair of consecutive coronal slices was
concatenated and reshaped to 256 × 256 pixels, which turns the 60 slices of
each of these views into 30 images. We then rescaled the intensity values of
the slices to the [0, 255] range, converted them to PNG format, and normalized
them to have zero mean and unit variance. Finally, all the sliced axial,
sagittal and coronal PNG images were stacked per view and saved in serialized
form with the pickle module. Our input shape is therefore (30, 3, 256, 256):
the first value is the number of axial, coronal or sagittal slices after
pre-processing, 3 is the number of color channels, and the last two values are
the width and height of the images. The image pre-processing steps are
sketched in Fig. 2.
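The slice selection and intensity normalization steps described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: a volume is represented as a plain list of 2D slices, normalization is applied per slice (the text does not say whether statistics are computed per slice or per dataset), and the helper names `select_mid_slices`, `rescale_to_255` and `standardize` are hypothetical rather than taken from the authors' code.

```python
import math


def select_mid_slices(volume, n):
    """Return the n central slices of a volume given as a list of 2D slices."""
    if n >= len(volume):
        return list(volume)
    start = (len(volume) - n) // 2
    return volume[start:start + n]


def rescale_to_255(slice_2d):
    """Min-max rescale the intensities of one slice to the [0, 255] range."""
    flat = [v for row in slice_2d for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # guard against constant slices
    return [[255.0 * (v - lo) / span for v in row] for row in slice_2d]


def standardize(slice_2d):
    """Normalize one slice to zero mean and unit variance (per-slice: an assumption)."""
    flat = [v for row in slice_2d for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    std = math.sqrt(var) or 1.0
    return [[(v - mean) / std for v in row] for row in slice_2d]


# Toy volume with 47 axial slices of 2 x 2 pixels (47 is the smallest k in the training set).
volume = [[[float(k), float(k + 1)], [float(k + 2), float(k + 3)]] for k in range(47)]
mid = select_mid_slices(volume, 30)
processed = [standardize(rescale_to_255(s)) for s in mid]
```

Cropping around the lung bounding boxes and the pairwise concatenation of coronal and sagittal slices are omitted here because their exact coordinates and layout are not given in the text.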




Fig. 2. Pre-processing stages for the 3D chest CT scan of patient CTR-TRN-051.
From top to bottom: 2D slices extracted from the 3D scan and then cropped around
the left and right lung regions in the axial, coronal, and sagittal views,
respectively.
2.2   Model Development

Convolutional neural networks (CNNs), also known as deep learners, are machine
learning methods designed to process image data via convolutional, pooling and
fully connected layers. Convolution and pooling layers alternate to extract
high-level features, and fully connected layers perform the classification. In
this paper, we aimed to develop a DL model that simultaneously predicts
lung-based TB diagnosis labels by using different CNN architectures with an
ensemble method on chest CT images. We address this as a multi-label binary
classification problem. We trained each proposed architecture twice, once for
the diagnosis labels of each lung.
    Because the training dataset is very small and heavily imbalanced, we
proposed five CNN architectures (three Multi-View CNN architectures, AlexNetMV,
GoogLeNetMV and VGG19MV, and two Triplanar-CNN architectures, AlexNetTP and
VGG11TP) using pre-trained AlexNet, GoogLeNet, VGG11, and VGG19 feature
extraction layers as a backend. All five CNN architectures were trained using
Adam optimization with backpropagation, as these have been applied successfully
in many deep learning models. In addition, all the models were optimized using
a weighted binary cross-entropy loss function to account for the unbalanced
class sizes. The hyperparameters were determined experimentally for each
proposed architecture individually. When the validation loss did not decrease
for 20 epochs, we stopped training early to avoid overfitting. The model with
the lowest average loss on the validation dataset was then selected as our
final model candidate. All the models were developed on a desktop computer with
an NVIDIA GeForce RTX 2070 GPU and the widely used deep learning framework
PyTorch with backend libraries of TensorFlow [24].
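A weighted binary cross-entropy loss and the 20-epoch early-stopping rule can be sketched as below. This is a minimal illustration, not the authors' training code: the paper does not state how the class weights were chosen, so the ratio of negative to positive cases used in `pos_weight` is an assumption, and the function names are hypothetical.

```python
import math


def pos_weight(n_pos, n_neg):
    """Weight for the positive class; negatives/positives is an assumed choice."""
    return n_neg / n_pos


def weighted_bce(probs, targets, w_pos, eps=1e-7):
    """Mean binary cross-entropy in which positive samples are scaled by w_pos."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(w_pos * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def early_stopping(val_losses, patience=20):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has not improved for `patience` epochs;
    the model from best_epoch is the one that would be kept.
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# PL label in the training set: 7 positive and 276 negative cases.
w_pl = pos_weight(7, 276)
loss_example = weighted_bce([0.5], [1], w_pos=2.0)
best, stop = early_stopping([1.0, 0.9] + [0.95] * 30, patience=20)
```

With this weighting a missed positive PL case costs roughly 39 times as much as a false alarm on a negative case, which is what pushes the optimizer away from always predicting the majority class.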
    The individual classification performance of the five CNNs on the training
and testing datasets was compared. Then, in order to obtain a better and more
comprehensive generalized model [25], and motivated by the idea that "two or
more heads are better than one", the probability predictions of the four
best-performing CNNs were fused using different strategies: average, majority
voting and stacking (Naïve Bayes). The probability predictions of GoogLeNetMV
were poor compared to those of the other architectures. Hence, we used the
other four CNN models as the base learners in our ensemble approach. Details
of each model architecture and the results are discussed in the following
parts.
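Two of the fusion strategies named above, averaging and majority voting, can be sketched as follows. The sketch makes several assumptions not specified in the paper: each base learner outputs one probability per label, votes are obtained by thresholding at 0.5, and a 2-2 tie among the four base learners is resolved as negative. The function names are hypothetical, and the Naïve Bayes stacking variant is not shown.

```python
def average_fusion(prob_lists):
    """Average the per-label probabilities of several base learners."""
    n = len(prob_lists)
    return [sum(ps) / n for ps in zip(*prob_lists)]


def majority_vote(prob_lists, threshold=0.5):
    """Threshold each learner's probability and take a strict majority per label."""
    fused = []
    for ps in zip(*prob_lists):
        votes = sum(1 for p in ps if p >= threshold)
        fused.append(1 if votes * 2 > len(ps) else 0)  # ties count as negative
    return fused


# Toy predictions of four base learners for the three labels of one lung.
base_predictions = [
    [0.9, 0.2, 0.6],
    [0.8, 0.4, 0.4],
    [0.7, 0.1, 0.7],
    [0.4, 0.3, 0.2],
]
avg = average_fusion(base_predictions)
votes = majority_vote(base_predictions)
```

Averaging keeps a continuous score that can be ranked for AUC directly, whereas hard voting as written here yields only 0 or 1 per label.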


Model Architectures

Multi-View CNNs. The Multi-View CNN architectures proposed in this work
are an extension of our prior work for last year's TB challenge [19]. In that
paper, coronal and sagittal slices were concatenated before being fed to the
AlexNet-based Multi-View CNN model, and axial slices were not used. In the
Multi-View CNNs proposed here, we used pre-trained VGG19 and GoogLeNet feature
extraction layers as a backend in addition to AlexNet. Moreover, in addition
to coronal and sagittal slices, axial slices are also used to train the
proposed models.
    The basic concept of the proposed Multi-View CNN architecture is that,
during training, we provide the model with a series of 2D axial images sliced
from a 3D CT scan as input, together with the corresponding sagittal and
coronal images as a data augmentation technique, and the model generates
classification predictions for the labels of each lung. As depicted in Fig. 3,
the overall Multi-View CNN architecture consists of three core parts:
    (i) The feature extraction layers of a pre-trained state-of-the-art CNN
model (i.e. VGG19, AlexNet or GoogLeNet).
   (ii) Global average pooling and max pooling layers on top of the feature
extraction layers, applied across the spatial dimensions to reduce the feature
maps, and
   (iii) A dense layer. The dense layer is fed the feature maps resulting from
the pooling operations. The sigmoid function is then applied to the output of
the dense layer to obtain the final binary prediction probability for each of
the three diagnosis labels of each lung.
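The three parts listed above can be traced on toy data with the following sketch. It assumes one particular way of combining the two pooling operations, namely global average pooling over the spatial dimensions of each slice followed by a maximum over slices; the paper does not spell out this order, so it should be read as an assumption. Feature maps are plain nested lists standing in for the output of the pre-trained feature extractor, and all names and weights are hypothetical.

```python
import math


def global_avg_pool(feature_map):
    """Average each channel of one slice's C x H x W feature map, giving C values."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def max_over_slices(slice_vectors):
    """Channel-wise maximum across the pooled vectors of all slices."""
    return [max(vals) for vals in zip(*slice_vectors)]


def dense_sigmoid(vec, weights, biases):
    """One dense layer followed by a sigmoid, giving one probability per label."""
    out = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, vec)) + b
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out


# Two slices, each with two channels of 2 x 2 features (stand-in for CNN features).
slice_a = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
slice_b = [[[3.0, 3.0], [3.0, 3.0]], [[0.0, 0.0], [0.0, 0.0]]]
pooled = max_over_slices([global_avg_pool(s) for s in (slice_a, slice_b)])

# Hypothetical dense-layer parameters: three outputs, one per diagnosis label of a lung.
weights = [[0.0, 0.0], [1.0, -3.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]
probs = dense_sigmoid(pooled, weights, biases)
```

Because the three outputs pass through independent sigmoids rather than a softmax, each label receives its own probability and several conditions can be predicted as present at once.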




Fig. 3. Multi-View CNN architecture: VGG19MV. VGG19MV is an automatic TB
diagnosis Multi-View CNN architecture using VGG19 feature extraction layers as a
backend. The architecture takes as input a series of CT slices and outputs
multi-label binary classification predictions for the CT scan. Global average
and max-pooling operations were used to combine the features obtained from each
slice with the VGG19 feature extraction layers. The resulting feature maps were
then fed to a fully connected layer to generate a probability score for each of
the three diagnosis labels. We trained VGG19MV twice, once for the report
generation of each lung. Using a similar architecture and training procedure, we
developed AlexNetMV and GoogLeNetMV with AlexNet and GoogLeNet feature
extraction layers, respectively.
    Triplanar-CNN. An overview of the proposed Triplanar-CNN architecture
is depicted in Fig. 4. 2D images sliced from the volumetric chest CT scans in
the axial, coronal and sagittal planes were fed into three parallel channels
of the Multi-View CNN architecture, respectively. The features generated by the
three channels were consolidated into a fixed-size feature map to form a single
combined feature representation. Classification is then performed using a
fully connected layer with a sigmoid activation function on top of it. More
details on the Triplanar-CNN architecture are available in our prior work on
automated brain tumor grading [26].




Fig. 4. Triplanar-CNN: VGG19TP. A 3D chest CT scan is first decomposed into
2D axial, coronal and sagittal cross-sectional slices, which are then passed to
the VGG19MV Multi-View CNN feature extraction layers in each of the three
columns, respectively. We trained VGG19TP twice, once for the report generation
of each lung. Using a similar architecture and training procedure, we developed
AlexNetTP using AlexNetMV feature extraction layers in each of the three
columns.




3    Results and Discussion

As mentioned in Section 2.1, the ImageCLEFtuberculosis 2020 dataset was
provided with training and test set partitions. The training dataset is highly
imbalanced for each diagnostic label. We therefore used five-fold stratified
cross-validation on the training dataset to reduce overfitting and avoid bias
in the overall system evaluation on the test dataset. That is, for each
validation fold in the training dataset, the remaining folds were used to train
the models. This procedure ensures that every CT scan in the training dataset
is in the validation set exactly once. The independent test dataset was not
used during training and internal validation; in fact, the diagnosis labels of
the patients in the test dataset were not visible to the challenge
participants. Participants were required to submit probability predictions for
each diagnostic label, and the ranking was based on the average and minimum AUC
over the six diagnostic labels of the left and right lungs. To quantitatively
evaluate the proposed deep learning approach on the training dataset as well,
this paper reports performance measures averaged over the five folds of the
training dataset: the area under the receiver operating characteristic curve
(AUC), precision (PRE), specificity (SPE), and sensitivity (SEN). Accuracy is
not an informative measure of the proposed approach because the distribution of
each diagnostic label is highly unbalanced. The performance of our proposed
system on both the training and test datasets is explained in the following
subsections.
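The fold construction and the threshold-based metrics named above can be sketched as follows. This is a simplified illustration under stated assumptions: stratification is shown for a single binary label (the paper does not say how the six labels were stratified jointly), samples are assigned to folds round-robin without shuffling, and the function names are hypothetical.

```python
def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so that each class is spread evenly."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds


def sen_spe_pre(y_true, y_pred):
    """Sensitivity, specificity and precision from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sen = tp / (tp + fn) if tp + fn else 0.0
    spe = tn / (tn + fp) if tn + fp else 0.0
    pre = tp / (tp + fp) if tp + fp else 0.0
    return sen, spe, pre


# PL label of the training set: 7 positive and 276 negative studies.
pl_labels = [1] * 7 + [0] * 276
folds = stratified_folds(pl_labels, k=5)
positives_per_fold = [sum(pl_labels[i] for i in fold) for fold in folds]

sen, spe, pre = sen_spe_pre([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

With only seven positive PL cases, each validation fold contains one or two positives, so a single misclassified scan changes the fold's sensitivity by 50 or 100 percentage points.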


3.1   Performance of the Five Multi-View and Triplanar-CNN
      Models

Tables 2 and 3 report the multi-label binary classification performance of the
proposed Multi-View CNN and Triplanar-CNN models, respectively, for both the
left and right lung diagnosis labels on the training dataset using five-fold
cross validation. The results show that the AlexNetMV classifier achieved
better classification performance than the other classifiers in terms of mean
AUC. AlexNetMV outperformed its corresponding Triplanar-CNN model, AlexNetTP,
by a marginal 1.6 percentage points in mean AUC. It also outperformed the
VGG19MV and VGG11TP models by 3.2 and 2.6 percentage points in mean AUC,
respectively.
    GoogLeNetMV suffers from overfitting, and its performance (mean AUC of
0.506) was poor compared to the other models. This may be because its
architecture is deeper than the AlexNet and VGG architectures, and because of
the scarcity of the available training data. In addition to axial slices, the
Multi-View CNN classifiers were trained using coronal and sagittal slices as a
data augmentation technique; validation and testing, however, were performed
using axial slices only. The Triplanar-CNN models were trained and evaluated
using axial, coronal and sagittal slices without any data augmentation. Given
the strong performance of the proposed models on the multi-label binary
classification problems across the multiple tasks, as reported in Tables 2 and
3, we expect that our models would perform better if we were to incorporate
extensive data augmentation techniques. In addition, although we used a
weighted cross-entropy loss to account for the imbalanced class sizes, the
performance of the proposed models on some tasks is highly biased towards the
majority class. For instance, as shown in Fig. 1, out of the 283 patients of
the training dataset, only 7 (2.5%) were PL positive, whereas the remaining
276 (97.5%) were PL negative. Hence, the performance of all the models in terms
of SEN for the PL binary classification is very poor, whereas the PRE is
obviously very high. Similarly, only 4.9% of the training cases were PR
positive and the remaining 95.1% were PR negative. Unlike for the PL task,
however, the classification performance of all the models in terms of SEN for
PR was not strongly affected. This shows that the weighted loss computation we
used during training to tackle the class imbalance worked well for some tasks.
Weighted loss computation combined with established resampling techniques
might therefore be investigated further in order to balance the class
distribution and avoid bias in the classification performance of deep learning
models.


Table 2. Performance evaluation of the three Multi-View CNN models. Bold indicates
our best results averaged across the six labels.

                               Left Lung               Right Lung
 Models        Metric    LL     CVL    PL        RL     CVR    PR        Avg.
 AlexNetMV     AUC       0.744  0.775  0.82      0.736  0.717  0.88      0.779
               SEN       0.609  0.663  0.278     0.644  0.594  0.806     0.599
               SPE       0.836  0.799  0.932     0.779  0.746  0.925     0.836
               PRE       0.943  0.561  0.194     0.936  0.482  0.387     0.584
 VGG19MV       AUC       0.76   0.751  0.74      0.71   0.65   0.869     0.747
               SEN       0.632  0.535  0.17      0.656  0.57   0.611     0.529
               SPE       0.71   0.728  0.932     0.714  0.598  0.93      0.769
               PRE       0.923  0.587  0.111     0.92   0.369  0.375     0.548
 GoogLeNetMV   AUC       0.566  0.526  0.456     0.454  0.548  0.486     0.506
               SEN       0.591  0.594  0.433     0.79   0.742  0.306     0.576
               SPE       0      0.371  0.383     0      0.46   0.563     0.296
               PRE       0.782  0.265  0.031     0.824  0.35   0.063     0.386




Table 3. Performance evaluation of the two Triplanar-CNN models. Bold indicates
our best results averaged across the six labels.

                               Left Lung               Right Lung
 Models        Metric    LL     CVL    PL        RL     CVR    PR        Avg.
 AlexNetTP     AUC       0.706  0.789  0.745     0.697  0.698  0.944     0.763
               SEN       0.668  0.646  0         0.665  0.619  0.806     0.567
               SPE       0.553  0.892  0.833     0.698  0.737  0.931     0.774
               PRE       0.85   0.677  0.083     0.914  0.475  0.446     0.574
 VGG11TP       AUC       0.745  0.79   0.69      0.681  0.656  0.957     0.753
               SEN       0.617  0.729  0.0389    0.615  0.553  0.72      0.546
               SPE       0.684  0.741  0.483     0.695  0.714  0.956     0.712
               PRE       0.891  0.582  0.059     0.906  0.45   0.595     0.581




3.2   Performance of Ensemble Multi-View and Triplanar-CNN
      Models
The classification results achieved by each of the three ensemble methods used
in our work, in terms of AUC, SEN, SPE and PRE, are reported in Table 4.
The mean AUC, SEN, SPE, and PRE of the average fusion strategy were 0.799,
0.571, 0.824, and 0.576, respectively. The mean AUC, SEN, SPE, and PRE of
the voting fusion strategy were 0.777, 0.574, 0.821, and 0.574, respectively. The
mean AUC, SEN, SPE, and PRE of the Naïve Bayes fusion strategy were 0.759,
0.573, 0.801, and 0.829, respectively. The average fusion strategy has the
highest mean AUC and SPE values, and Naïve Bayes has the lowest in both
metrics. However, Naïve Bayes has a higher mean PRE than the average and voting
fusion approaches, by more than 25 percentage points. The SEN and SPE of the
three ensemble methods are almost the same, differing by less than 0.5 and 2.5
percentage points, respectively.
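As a worked check of how the "Avg." column relates to the per-label values, the snippet below recomputes the mean AUC of the average fusion strategy from the six per-label AUCs reported in Table 4, and also reports the minimum, the second quantity used for ranking in the challenge. The function name `mean_and_min` is hypothetical.

```python
def mean_and_min(values):
    """Return the arithmetic mean and the minimum of a list of scores."""
    return sum(values) / len(values), min(values)


# Per-label AUCs of the average fusion strategy on the training set (Table 4).
auc_average_fusion = {
    "LL": 0.788, "CVL": 0.815, "PL": 0.841,
    "RL": 0.73, "CVR": 0.689, "PR": 0.93,
}
mean_auc, min_auc = mean_and_min(list(auc_average_fusion.values()))
print(round(mean_auc, 3), min_auc)  # prints: 0.799 0.689
```

The mean reproduces the 0.799 in the table, and the minimum shows that CVR is the weakest label for this ensemble on the training data.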


Table 4. Performance of the proposed system for three different fusion strategies. Bold
indicates our best results averaged across the six labels.

                               Left Lung               Right Lung
 Models        Metric    LL     CVL    PL        RL     CVR    PR        Avg.
 Average       AUC       0.788  0.815  0.841     0.73   0.689  0.93      0.799
               SEN       0.639  0.614  0.286     0.643  0.574  0.667     0.571
               SPE       0.778  0.832  0.932     0.724  0.721  0.956     0.824
               PRE       0.914  0.562  0.154     0.918  0.443  0.462     0.576
 Voting        AUC       0.779  0.8    0.736     0.73   0.682  0.936     0.777
               SEN       0.632  0.614  0.286     0.65   0.596  0.667     0.574
               SPE       0.75   0.84   0.938     0.724  0.721  0.95      0.821
               PRE       0.903  0.574  0.167     0.919  0.452  0.429     0.574
 NaiveBayes    AUC       0.78   0.805  0.68      0.722  0.681  0.886     0.759
               SEN       0.677  0.614  0.143     0.778  0.447  0.778     0.573
               SPE       0.75   0.856  0.938     0.552  0.811  0.931     0.801
               PRE       0.798  0.794  0.926     0.799  0.704  0.955     0.829




3.3   Results Comparison on the Training and Test Datasets

In this challenge, participants were required to come up with an approach that
automatically generates a lung-based report from a volumetric CT image. For
this, the organizers provided training and test datasets. The labels of the
training dataset were given to the participants, whereas the test dataset
labels were not visible to them. Participants were allowed to submit up to
10 runs to the system set up by the organizers, who evaluated the results and
ranked the participants' algorithms accordingly. The results of our proposed
approaches on the test dataset, obtained from the organizers' website, are
shown in Table 5. Among our individual classifiers, AlexNetMV performed best on
the test dataset, with mean and minimum AUC of 0.757 and 0.713, respectively.
Among the proposed ensemble approaches, the average fusion strategy outperforms
all the models, with mean and minimum AUC of 0.767 and 0.733, respectively.
When the best runs of each participant are compared using mean AUC on the test
dataset, our result ranked 4th. Detailed results of each participant's
algorithm on the test dataset using multiple performance metrics can be found
in [17]. In addition, as shown in Fig. 5, the performance of our proposed DL
models on the training and test datasets is nearly the same, indicating the
robustness of our model. This also gives an indication of how well the proposed
DL system would generalize to an unseen dataset in real use.


Table 5. Mean and minimum AUC of each proposed model on the test dataset.

                    Models     Mean AUC Minimum AUC
                    AlexNetMV    0.757      0.713
                    AlexNetTP    0.755      0.707
                    VGG19MV      0.756      0.724
                    VGG11TP      0.731      0.722
                    GoogLeNet    0.427       0.36
                    Average      0.767      0.733
                    Voting       0.757      0.727
                    NaiveBayes   0.759      0.714




Fig. 5. Performance comparisons of the proposed models on the training and test
datasets.




4   Conclusion and Future Work

In conclusion, we propose a robust CAD system for automated tuberculosis
diagnosis using an ensemble of different CNN architectures trained on the
volumetric chest CT scans of fewer than 300 tuberculosis patients. The proposed
CNN architectures include novel Multi-View and Triplanar-CNN architectures
using pre-trained feature extraction layers of state-of-the-art deep learning
models as a backend. Our result, which placed in the top four of the challenge,
demonstrates that the proposed deep learning model can achieve competitive
performance on automated lung-based CT report generation based solely on
volumetric CT images of patients with tuberculosis.
    There is still room for improvement in our proposed CAD system. To crop
the left and right lung regions from the chest CT images, we used a fixed
bounding box location for all the images, chosen through visual inspection of
some random mid-level slices. This could result in missing some abnormal
regions of the lungs, as different CT devices produce images in different
orientations. In the literature, transfer learning combined with different
data augmentation techniques has been used to improve the performance of deep
learning models on datasets of limited size. We, however, only used transfer
learning to increase the performance of our deep learning models on the
limited amount of training data available and did not use data augmentation
techniques. Moreover, although the provided datasets were highly imbalanced,
various class imbalance techniques and ensemble learners with multiple deep
learning base classifiers were not investigated thoroughly due to time
constraints. We would like to address these issues in the future.


5   Acknowledgments
This work was supported by the research fund of Çukurova University, Project
Number: 10683.


References
 1. Bhalla, A.S., Goyal, A., Guleria, R., Gupta, A.K.: Chest tuberculosis: Radiologi-
    cal review and imaging recommendations. Indian J. Radiol. Imaging. 25, 213–225
    (2015). https://doi.org/10.4103/0971-3026.161431.
 2. World Health Organization (WHO) Global Tuberculosis Report 2019, https://
    www.who.int/tb/global-report-2019. Last accessed 23 Jun 2020.
 3. Bomanji, J.B., Gupta, N., Gulati, P., Das, C.J.: Imaging in tuberculosis.
    Cold Spring Harb. Perspect. Med. 5, 1–23 (2015).
    https://doi.org/10.1101/cshperspect.a017814.
 4. Yin, J., Lu, M., Gao, L., Guo, X.: A framework of predicting drug resistance
    of lung tuberculosis by utilizing radiological images. In: Proceedings - 2018
    10th International Conference on Advanced Computational Intelligence, ICACI
    2018. pp. 308–312. Institute of Electrical and Electronics Engineers Inc. (2018).
    https://doi.org/10.1109/ICACI.2018.8377474.
 5. Van’t Hoog, A.H., Meme, H.K., Van Deutekom, H., Mithika, A.M., Olunga, C.,
    Onyino, F., Borgdorff, M.W.: High sensitivity of chest radiograph reading by clin-
    ical officers in a tuberculosis prevalence survey. Int. J. Tuberc. Lung Dis. 15,
    1308–1314 (2011). https://doi.org/10.5588/ijtld.11.0004.
 6. Swanly, V.E., Selvam, L., Kumar, P.M., Renjith, J.A., Arunachalam, M., Shunmu-
    ganathan, K.L.: Smart spotting of pulmonary TB cavities using CT images. Com-
    put. Math. Methods Med. 2013, (2013). https://doi.org/10.1155/2013/864854.
 7. Xu, Z., Bagci, U., Kubler, A., Luna, B., Jain, S., Bishai, W.R., Mollura, D.J.:
    Computer-aided detection and quantification of cavitary tuberculosis from CT
    scans. Med. Phys. 40, (2013). https://doi.org/10.1118/1.4824979.
 8. Harris, M., Qi, A., Jeagal, L., Torabi, N., Menzies, D., Korobitsyn, A., Pai,
    M., Nathavitharana, R.R., Ahmad Khan, F.: A systematic review of the di-
    agnostic accuracy of artificial intelligence-based computer programs to ana-
    lyze chest x-rays for pulmonary tuberculosis. PLoS One. 14, e0221339 (2019).
    https://doi.org/10.1371/journal.pone.0221339.
 9. Melendez, J., Sánchez, C.I., Philipsen, R.H.H.M., Maduskar, P., Dawson, R.,
    Theron, G., Dheda, K., Van Ginneken, B.: An automated tuberculosis screening
    strategy combining X-ray-based computer-aided detection and clinical information.
    Sci. Rep. 6, (2016). https://doi.org/10.1038/srep25265.
10. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature. 521, 436–444 (2015).
    https://doi.org/10.1038/nature14539.
11. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied
    to document recognition. In: Proceedings of the IEEE. pp. 2278–2323 (1998).
    https://doi.org/10.1109/5.726791.
12. Graziani, M., Andrearczyk, V., Marchand-Maillet, S., Müller, H.: Concept
    attribution: Explaining CNN decisions to physicians. Comput. Biol. Med.
    103865 (2020). https://doi.org/10.1016/j.compbiomed.2020.103865.
13. Eriş, H., Çevik, U.: Implementation of Target Tracking Methods on Im-
    ages Taken from Unmanned Aerial Vehicles. In: SAMI 2019 - IEEE 17th
    World Symposium on Applied Machine Intelligence and Informatics, Proceed-
    ings. pp. 311–316. Institute of Electrical and Electronics Engineers Inc. (2019).
    https://doi.org/10.1109/SAMI.2019.8782768.
14. Moon, W.K., Lee, Y.W., Ke, H.H., Lee, S.H., Huang, C.S., Chang, R.F.: Computer-
    aided diagnosis of breast ultrasound images using ensemble learning from convolu-
    tional neural networks. Comput. Methods Programs Biomed. 190, 105361 (2020).
    https://doi.org/10.1016/j.cmpb.2020.105361.
15. Ionescu, B., Müller, H., Péteri, R., Dang-Nguyen, D.T., Zhou, L., Piras, L., Riegler,
    M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Chamberlain, J., Clark, A.,
    Campello, A., Seco de Herrera, A.G., Ben Abacha, A., Datla, V., Hasan, S.A., Liu,
    J., Demner-Fushman, D., Pelka, O., Friedrich, C.M., Dicente Cid, Y., Kozlovski,
    S., Liauchuk, V., Kovalev, V., Berari, R., Brie, P., Fichou, D., Dogariu, M., Ste-
    fan, L.D., Constantin, M.G.: ImageCLEF 2020: Multimedia retrieval in lifelogging,
    medical, nature, and internet applications. In: Lecture Notes in Computer Science
    (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
    Bioinformatics). pp. 533–541. Springer (2020).
    https://doi.org/10.1007/978-3-030-45442-5_69.
16. Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C.,
    Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.): Experimental IR Meets
    Multilinguality, Multimodality, and Interaction. In: Proceedings of the Eleventh
    International Conference of the CLEF Association (CLEF 2020). vol. 12260.
    Springer, Thessaloniki, Greece (2020).
17. Dicente Cid, Y., Liauchuk, V., Klimuk, D., Tarasau, A., Kovalev, V., Müller, H.:
    Overview of ImageCLEFtuberculosis 2019 - Automatic CT-based Report Genera-
    tion and Tuberculosis Severity Assessment. In: CLEF2019 Working Notes (2019).
18. Kozlovski, S., Liauchuk, V., Dicente Cid, Y., Tarasau, A., Kovalev, V., Müller,
    H.: Overview of ImageCLEFtuberculosis 2020 - Automatic CT-based Report Gen-
    eration. In: CLEF2020 Working Notes. http://ceur-ws.org, Thessaloniki, Greece
    (2020).
19. Mossa, A.A., Yibre, A.M., Çevik, U.: Multi-view CNN with MLP for diagnosing
    tuberculosis patients using CT scans and clinically relevant metadata. In: CEUR
    Workshop Proceedings. CEUR-WS (2019).
20. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with
    deep convolutional neural networks. Commun. ACM. 60, 84–90 (2017).
    https://doi.org/10.1145/3065386.
21. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Van-
    houcke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the
    IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
    pp. 1–9 (2015). https://doi.org/10.1109/CVPR.2015.7298594.
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
    image recognition. In: 3rd International Conference on Learning Representations,
    ICLR 2015 - Conference Track Proceedings. International Conference on Learning
    Representations, ICLR (2015).
23. Dicente Cid, Y., del Toro, O.A., Depeursinge, A., Müller, H.: Efficient and fully
    automatic segmentation of the lungs in CT volumes. In: Goksel, O., del Toro, O.A.,
    Foncubierta-Rodriguez, A., and Müller, H. (eds.) Proceedings of the (VISCERAL)
    Anatomy Grand Challenge at the 2015 (IEEE ISBI). pp. 31–35. CEUR-WS (2015).
24. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
    Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch.
    In: Advances in Neural Information Processing Systems 32. pp. 8024–8035
    (2019).
25. Liu, Y., Yao, X.: Ensemble learning via negative correlation. Neural Networks. 12,
    1399–1404 (1999). https://doi.org/10.1016/S0893-6080(99)00073-8.
26. Mossa, A.A., Çevik, U.: Triplanar-CNN for Automated Grading of Gliomas Using
    Preoperative Multi-modal MR Images. In: Proc. of the International E-Conference
    on Advances in Engineering, Technology and Management - ICETM 2020. pp. 21–27.
    SEEK Digital Library (2020). https://doi.org/10.15224/978-1-63248-188-7-05.