    Multi-View CNN with MLP for Diagnosing Tuberculosis
       Patients Using CT Scans and Clinically Relevant Metadata

                   Abdela A. Mossa1, Abdulkerim M. Yibre2 and Ulus Çevik3
     Department of Computer Engineering, Faculty of Engineering, Çukurova University, 01330
                 Sarıçam, Adana, Turkey (Email:abdela4u@gmail.com)
    Department of Computer Engineering, Faculty of Engineering, Konya Technical University,
        42250, Selçuklu, Konya, Turkey (Email:abdukerimm@selcuk.edu.tr)
  Department of Electrical and Electronics Engineering, Faculty of Engineering, Çukurova
           University, 01330 Sarıçam, Adana, Turkey (Email:ucevik@cu.edu.tr)

            Abstract. We propose a hybrid approach that combines multi-view convolutional
            neural networks (CNNs) with a Multi-Layer Perceptron (MLP) to generate an
            automatic medical CT report and to evaluate the severity stage of Tuberculosis
            patients. The models were trained and evaluated on 335 3D chest CT images and
            the available metadata provided by the ImageCLEF2019 organizers to the
            participants of the tuberculosis track. Transfer learning and data
            augmentation were applied to reduce overfitting and improve the performance of
            the model. Our multi-view CNN approach decomposes each 3D CT image into 2D
            axial, coronal and sagittal slices and converts them to PNG format prior to
            training. In the first stage, coronal and sagittal slices were used to train a
            CNN classifier based on a pre-trained AlexNet. In the second stage, MLPs were
            trained on the features extracted in stage one together with the provided
            metadata. Our results ranked 6th, with an AUC of 0.763, in predicting whether
            the severity stage is High or Low, and 4th, with a mean AUC of 0.707, in
            detecting whether the left and right lungs are affected and whether
            calcifications, caverns, pleurisy and lung capacity decrease are present.

            Keywords: Tuberculosis Detection, Severity Score, Automatic CT Report,
            Convolutional Neural Network, Deep Learning, Multi-Layer Perceptron, Medi-
            cal Imaging Analysis.

1           Introduction

About 130 years after its discovery, Tuberculosis (TB) remains one of the 10 leading
causes of death in the world. In 2017 alone, TB caused an estimated 1.3 million deaths
and around 10.0 million people developed TB disease. With early diagnosis and proper
treatment, it is possible each year to save millions of people with TB from death.
   Advanced medical imaging technologies such as Computed Tomography (CT), when
used by expert radiologists with the help of Computer Aided Detection (CAD)
software, can detect subtle alterations in the lung tissue and so help to correctly
identify and diagnose the disease [2]. Despite many advances in both diagnosis and
treatment, TB remains one of the highest causes of mortality from any infectious cause
in the world [3], which shows that challenges in the detection and treatment of TB
still lie ahead.
   It has been reported in [4, 5] that there is a relative lack of expert radiologists
in many high-TB-burden countries, which may impair screening effectiveness and delay
diagnosis. In India, an average TB patient is diagnosed after a delay of nearly 2
months [6] and, overall, a true negative rate as high as 30 % and a false positive rate
of up to 15 % have been reported in radiology [2]. Evidently, there is an unmet need for
fully automatic CAD for TB diagnosis that is efficient, facilitates earlier detection
of disease and saves significant health care costs. To tackle these problems,
ImageCLEF [7] has organized an evaluation campaign that invites researchers around
the world to participate in the ImageCLEFmed Tuberculosis2019 [8] task, held for the
third consecutive year, which comprises two subtasks: Severity Scoring (SVR) and CT
Report (CTR). The challenge is based on 3D CT scans of patients with TB along with
the provided automatic report of the patients. The aim of the SVR subtask is to assess
the severity of each TB case and classify it as either “HIGH” (critical) or “LOW”
(very good). The aim of the CTR subtask is to generate an automatic medical report
based on the status of the left and right lungs, the presence or absence of
calcifications, caverns and pleurisy, and the capacity of the lungs.
   Deep learning approaches [9], in particular deep convolutional neural networks
(CNNs), have been shown to be successful on a large variety of computer vision and
image analysis tasks [10–13]. Recently they have also been broadly applied in the
medical imaging field, although their use there is still in its infancy. In this paper
we present an artificial intelligence-enhanced CAD technology: a fully automatic
hybrid model of CNN and Multi-Layer Perceptron (MLP) for diagnosing people with TB.
The inputs to the CNN architecture that we used are sagittal and coronal view slices.
Henceforth, the CNN architecture used in this paper is referred to as Multi-View CNN.
   This paper is organized as follows. Section 2 gives a brief overview of the
ImageCLEFmed Tuberculosis2019 subtasks and datasets and describes the preprocessing
steps. In Section 3, we discuss the Multi-View CNN and MLP models. The results
obtained using our approach in the two subtasks are shown in Section 4. Finally,
Section 5 concludes our participation in this challenge.

2      Data-Preprocessing

The training and test datasets provided by the ImageCLEFmed Tuberculosis2019 task
organizers consist of 335 chest CT volumetric scans of people with TB (218 for
training and 117 for testing), stored in NIFTI file format. Each volumetric scan has
size 512 × 512 × s, where the image length and width are 512 and s is the number of
slices in the axial plane, which varies from 50 to 400. In addition, for all patients
the provided datasets also include automatically extracted masks of the lungs and
clinically relevant metadata. However, we do not use the provided segmentation masks
in our work.
    Two binary classification subtasks of the TB task were proposed by the organizers:
i) Severity scoring and ii) CT report. The two subtasks share the same datasets. For
both subtasks, we split the provided training dataset into training and validation
sets of 174 and 44 volumetric scans, respectively. The validation data was selected
from the training data using stratified random sampling to avoid bias and to ensure
that a proportional number of positive and negative labels was present in each set.
The validation set also allowed us to tune the hyperparameters and select the best
model for later evaluation on the test set.
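
The stratified split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `stratified_split`, the random seed and the balanced 109/109 label vector are assumptions made here. Only the 218-scan total and the 174/44 split sizes come from the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, n_val, seed=0):
    """Split sample indices into train/validation with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    val_idx = []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Each class contributes to validation in proportion to its size.
        k = round(n_val * len(members) / total)
        val_idx.extend(members[:k])

    val_set = set(val_idx)
    train_idx = [i for i in range(total) if i not in val_set]
    return train_idx, sorted(val_idx)


# Hypothetical labels for 218 training scans (the real class balance is not assumed known).
labels = [1] * 109 + [0] * 109
train_idx, val_idx = stratified_split(labels, n_val=44)
```

Because the per-class counts are rounded independently, the validation size can differ from `n_val` by a sample or two for other class balances; a production split would need to handle that case.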
    We preprocess the NIFTI images in several stages to make them compatible with
transfer learning using AlexNet [13]. First, we reconstruct each 3D CT scan in all
three planes: 2D axial, sagittal and coronal slices are extracted from each patient's
NIFTI volume. We then rescale the pixel intensities of each slice so that the actual
minimum intensity value is mapped to 0 and the actual maximum intensity value is
mapped to 255, which is the standard range for 8-bit PNG images. To avoid processing
background that does not contain any chest tissue and to cope with limited GPU memory,
some slices at the beginning and end of the sagittal and coronal views of each volume
were discarded, leaving 400 slices (200 sagittal and 200 coronal) from each 3D CT
scan. Next, depending on the shape of each extracted sagittal and coronal slice,
cropping or padding was used to bring each slice to 224 × 112 pixels.
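
The intensity rescaling and the crop-or-pad step can be illustrated with NumPy as below. This is a sketch under stated assumptions: the helper names are hypothetical, centring the crop and the padding is a choice made here (the text does not say where slices are cropped or padded), and the input is synthetic rather than a real NIFTI slice.

```python
import numpy as np


def rescale_to_uint8(slice_2d):
    """Map the slice's own min..max intensity range linearly onto 0..255."""
    data = slice_2d.astype(np.float64)
    lo, hi = data.min(), data.max()
    if hi == lo:  # constant slice: avoid division by zero
        return np.zeros(data.shape, dtype=np.uint8)
    return np.round((data - lo) / (hi - lo) * 255.0).astype(np.uint8)


def crop_or_pad(slice_2d, out_h=224, out_w=112):
    """Centre-crop axes that are too long and zero-pad axes that are too short."""
    out = np.zeros((out_h, out_w), dtype=slice_2d.dtype)
    h, w = slice_2d.shape
    copy_h, copy_w = min(h, out_h), min(w, out_w)
    src_top, src_left = (h - copy_h) // 2, (w - copy_w) // 2
    dst_top, dst_left = (out_h - copy_h) // 2, (out_w - copy_w) // 2
    out[dst_top:dst_top + copy_h, dst_left:dst_left + copy_w] = (
        slice_2d[src_top:src_top + copy_h, src_left:src_left + copy_w]
    )
    return out


# Synthetic stand-in for one sagittal slice of a 512 x 512 x 130 volume.
rng = np.random.default_rng(0)
raw = rng.integers(-1024, 2000, size=(130, 512)).astype(np.int16)
prepared = crop_or_pad(rescale_to_uint8(raw))
```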
    We curate three separate datasets in which all slices have the same shape, 224 ×
224, by concatenating the sagittal and coronal slices at the same position. The first
dataset was created by placing the sagittal slice on the left and the coronal slice on
the right, and the second by placing the coronal slice on the left and the sagittal
slice on the right; in each case this reduces the number of slices from 400 (224 × 112)
to 200 (224 × 224) for a single 3D CT scan. We then convert each merged slice to
Portable Network Graphics (PNG) format and normalize it to have zero mean and unit
variance. The third dataset is the combination of datasets one and two. All three
datasets were used for training our models, and only the first dataset was used for
validating and testing the trained models.
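
A short NumPy sketch of how the three datasets could be assembled from position-matched slices is given below. The function names are hypothetical, the slices are random placeholders, and computing the normalization statistics over a whole dataset (rather than per image) is an assumption, since the text does not specify the scope.

```python
import numpy as np


def merge_views(sagittal, coronal):
    """Build the three datasets from position-matched stacks of 224 x 112 slices."""
    # Dataset 1: sagittal on the left, coronal on the right.
    first = np.concatenate([sagittal, coronal], axis=2)
    # Dataset 2: coronal on the left, sagittal on the right.
    second = np.concatenate([coronal, sagittal], axis=2)
    # Dataset 3: the combination of datasets 1 and 2.
    third = np.concatenate([first, second], axis=0)
    return first, second, third


def standardize(images):
    """Shift and scale to zero mean and unit variance (statistics over the whole array)."""
    images = images.astype(np.float32)
    return (images - images.mean()) / (images.std() + 1e-8)


# Placeholder stacks: 200 sagittal and 200 coronal slices of one scan.
rng = np.random.default_rng(0)
sag = rng.integers(0, 256, size=(200, 224, 112)).astype(np.uint8)
cor = rng.integers(0, 256, size=(200, 224, 112)).astype(np.uint8)
d1, d2, d3 = merge_views(sag, cor)
d1_norm = standardize(d1)
```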
    All the preprocessing steps were done using the Python programming language and
the NiBabel package [14]. An example of reconstructed, scaled, and merged PNG images,
displayed using the Matplotlib Python package, is shown in Fig. 1. In this work, axial
slices were not thoroughly investigated; they did not improve the performance when we
tried them, and hence they were not used in the final results.
Fig. 1. Example of CT volumetric scan preprocessing stages. First row: sagittal slice; coronal
slice; second row: sagittal resized; coronal resized; third row: merged-sagittal on the left and
coronal on the right; merged-coronal on the left and sagittal on the right.

3      Model

Problem definition. For each 3D chest CT scan of a TB patient, we assign 7 binary
labels: high severity (scores 1, 2 and 3)/low severity (scores 4 and 5), left lung
affected/not affected, right lung affected/not affected, presence/absence of
calcifications, presence/absence of caverns, presence/absence of pleurisy and lung
capacity decreased/not decreased. Our goal is to develop 7 similar automatic models to
predict the 7 labels for each patient.

Model architecture. Inspired by [15, 16], we developed a hybrid architecture using
CNN and MLP. The overall architecture consists of two core modules: (i) transfer
learning using the pre-trained AlexNet CNN architecture, slightly modified, which
maps the 2D slices of a patient to a probability prediction and serves as a deep
feature extractor for each of the 7 binary classification problems, and (ii) an MLP, a
standard machine learning classifier that takes the deep learning features obtained
from the Multi-View CNN model and the available metadata as input and produces the
final TB diagnosis results. See Fig. 2 for a schematic representation of the
experimental setup of our hybrid model.

[Figure 2: flowchart. Box labels: CT Volumes; Image Preprocessing; Training Set; Data
Augmentation; Trained ConvNet Models; Selected Model; Probability Predictions for the
Test, Validation and Training Sets; Feature Fusion; Trained MLP; Selected Model; Test
Set.]

Fig. 2. A schematic representation of the proposed hybrid model for SVR score assessment
and CT report generation, using the Multi-View CNN for feature learning and an MLP
classifier for the final prediction from the learned features and available metadata.
We use this experimental setup for all 7 binary classification problems of the two
subtasks defined above.

Multi-View CNN. The Multi-View CNN used for extracting deep features is a 2D
convolutional neural network implemented in the Python programming language with
PyTorch [17], an open source deep learning platform whose level of abstraction lies
between TensorFlow and Keras. The architecture takes the stacked 2D slices of a 3D CT
scan as input, with three channels corresponding to RGB, and outputs a probability. As
shown in Fig. 3, the overall network architecture consists of three core parts: (i)
the convolutional base of the pre-trained AlexNet, a state-of-the-art deep learning
model trained on the ImageNet database, which has 1.2 million high-resolution images
belonging to 1000 categories, (ii) global average pooling and max pooling layers on top
of the convolutional base, applied across the spatial dimensions to reduce the features
obtained from the convolutional base, and (iii) a final dense layer with a sigmoid
activation function that outputs a probability for each binary classification problem
of the subtasks.

[Figure 3: block diagram. Block labels: base of AlexNet; Global Average and Max
pooling; Dense layer.]

Fig. 3. Multi-View CNN architecture. It takes the stacked, preprocessed PNG images of a
patient's 3D CT volume, of dimension s × 3 × 224 × 224, as input and outputs a
classification prediction for each binary classification problem, where s is the number
of merged sagittal and coronal slices for a single 3D CT volume.
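
To make the pooling-plus-dense head concrete, the sketch below runs its forward pass in plain NumPy on a placeholder feature map. Several points are assumptions made here, not taken from the text: the feature-map shape (s, 256, 6, 6) is what the AlexNet convolutional base yields for 224 × 224 inputs; the max pooling is taken across the s slices so that a whole scan collapses to one vector (the text does not spell out how the two poolings are combined); and the weights are random rather than trained.

```python
import numpy as np


def head_forward(feature_maps, weights, bias):
    """Forward pass of the head on conv-base activations of shape (s, C, H, W)."""
    # Global average pooling over the spatial dimensions -> (s, C).
    per_slice = feature_maps.mean(axis=(2, 3))
    # Max pooling across the s slices (assumption) -> (C,).
    per_scan = per_slice.max(axis=0)
    # Dense layer followed by a sigmoid -> a single probability.
    logit = float(per_scan @ weights + bias)
    return 1.0 / (1.0 + np.exp(-logit))


# Placeholder activations for one scan with s = 200 merged slices.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 256, 6, 6)).astype(np.float32)
weights = (rng.standard_normal(256) * 0.01).astype(np.float32)
probability = head_forward(features, weights, bias=0.0)
```

In a PyTorch implementation the same three steps would sit on top of the pre-trained convolutional layers, with the dense layer's weights learned during training.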

   We train the network with the backpropagation algorithm, using the binary
cross-entropy loss function and the Adam optimizer with a learning rate of 10^-5.
Furthermore, during training we used data augmentation to increase the diversity of
the data samples and so reduce overfitting caused by the small size of the training
dataset. We apply common augmentation techniques, namely random rotation between -25
and 25 degrees and horizontal flipping, to create new images. We did not use any data
augmentation at validation or test time.
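
For reference, the binary cross-entropy loss named above has the standard form below. The notation is introduced here and is not the authors': y_i in {0, 1} is the label of scan i, \hat{p}_i is the probability output by the network, and N is the number of training scans.

```latex
\mathcal{L}_{\mathrm{BCE}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \Bigl[ y_i \log \hat{p}_i + (1 - y_i) \log\bigl(1 - \hat{p}_i\bigr) \Bigr]
```

The loss is zero only when every prediction matches its label with full confidence, and it grows without bound as a confident prediction turns out wrong.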
   We trained the network three times for each of the 7 binary classification problems,
once for each dataset created in the preprocessing stage, resulting in three different
deep feature outputs for each patient's scanned volume. When training the MLP
classifier, the combination of the three deep features with the metadata achieved
better results than any one of them alone with the metadata.

MLP-Classifier. Once the Multi-View CNN has been trained on the three datasets, the
MLP classifier produces the medical diagnosis report of a TB patient, i.e. the SVR
prediction score and the CTR report, from the 3 deep features extracted at the last
layer of our CNN architecture and the clinically relevant patient metadata. We used the
Weka data mining tool to train and test the MLP classifier.
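
The feature fusion step can be pictured as concatenating the three CNN outputs with the metadata and writing the rows in ARFF, the text format that Weka reads. The sketch below is illustrative only: the function names, attribute names, relation name and all example values are hypothetical, and the metadata is assumed to be already encoded as numbers.

```python
def fuse_features(cnn_probs, metadata):
    """Concatenate the three CNN outputs with numeric metadata into one feature row."""
    if len(cnn_probs) != 3:
        raise ValueError("expected one CNN output per preprocessed dataset")
    return [float(p) for p in cnn_probs] + [float(m) for m in metadata]


def to_arff(rows, labels, n_meta, relation="tb_fusion"):
    """Serialise fused rows and binary labels as ARFF text."""
    lines = [f"@relation {relation}"]
    lines += [f"@attribute cnn_prob_{i} numeric" for i in (1, 2, 3)]
    lines += [f"@attribute meta_{j} numeric" for j in range(1, n_meta + 1)]
    lines.append("@attribute label {0,1}")
    lines.append("@data")
    for row, label in zip(rows, labels):
        lines.append(",".join(str(v) for v in row) + f",{label}")
    return "\n".join(lines) + "\n"


# Made-up example: one patient, three CNN probabilities and four metadata flags.
row = fuse_features([0.81, 0.77, 0.79], [1, 0, 0, 1])
arff_text = to_arff([row], [1], n_meta=4)
```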

4      Results

In addition to the main tests using Multi-View CNN and MLP, extra experiments were run
using Multi-View CNN with various other machine learning methods, such as Naïve Bayes,
Random Forest and Random Tree, but all of them were outperformed by MLP. We also
compared the performance of Multi-View CNN with and without data augmentation: the
model performed better with a moderate amount of additional augmented data but worse
when we tried to use more, due to overfitting. Finally, we investigated the
Multi-Layer Perceptron architecture by increasing and decreasing the number of hidden
layers and the number of nodes in each hidden layer, but we obtained the best
performance with the default MLP architecture of the Weka data mining tool, using a
learning rate of 0.001, a momentum of 0.2 and 725 epochs.

              Table 1. SVR-Severity scoring results of the participant groups.
                Group name                    AUC       Accuracy      Rank
                UIIP_BioMed                   0.7877    0.7179        1
                UIIP                          0.7754    0.7179        2
                HHU                           0.7695    0.6923        3
                HHU                           0.7660    0.6838        4
                UIIP_BioMed                   0.7636    0.7350        5
                CompElecEngCU                 0.7629    0.6581        6
                San Diego VA HCS/UCSD         0.7214    0.6838        7
                San Diego VA HCS/UCSD         0.7214    0.6838        8
                MedGIFT                       0.71      0.641         9
                San Diego VA HCS/UCSD         0.7123    0.6667        10

   The top ten rankings, taken from the results provided by the organizers of
ImageCLEFmed Tuberculosis2019 for both subtasks, are shown in Tables 1 and 2. Up to 10
runs could be submitted in each ImageCLEF2019 TB subtask, but due to time constraints
we submitted only one run (group CompElecEngCU in Tables 1 and 2), and our team ranked
6th and 4th in the SVR and CTR subtasks, respectively. The results are ranked in terms
of AUC and accuracy.
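
For readers unfamiliar with the two metrics, the sketch below computes AUC and accuracy from labels and predicted scores. It is a plain-Python illustration, not the organizers' evaluation code: the function names, the 0.5 decision threshold and the example labels and scores are all assumptions made here.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive scores above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return hits / len(labels)


# Made-up labels and scores, purely for illustration.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.6, 0.7, 0.8, 0.2, 0.4]
auc = roc_auc(y_true, y_score)
acc = accuracy(y_true, y_score)
```

AUC depends only on how the scores rank positives against negatives, whereas accuracy depends on the chosen threshold, which is why the two columns of Table 1 do not always move together.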

                  Table 2. CTR-CT report results of the participant groups.
             Group name                    Mean AUC        Min AUC       Rank
             UIIP_BioMed                   0.7968          0.6860        1
             UIIP_BioMed                   0.7953          0.6766        2
             UIIP_BioMed                   0.7812          0.6766        3
             CompElecEngCU                 0.7066          0.5739        4
             MedGIFT                       0.6795          0.5626        5
             San Diego VA HCS/UCSD         0.6631          0.5541        6
             HHU                           0.6591          0.5159        7
             HHU                           0.6560          0.5159        8
             San Diego VA HCS/UCSD         0.6532          0.5904        9
             UIIP                          0.6464          0.4099        10

5      Conclusion

In this paper, we investigated the use of a combination of a pre-trained CNN and an MLP
classifier to diagnose people with TB, and our approach showed promising results in the
ImageCLEF 2019 TB evaluation track. Due to time constraints and limited computational
power, we did not use the extracted lung masks, the axial slices, or some coronal and
sagittal slices at the beginning and end of each volume, even though we expect that
they would improve the results. We would like to address these issues in future work.


This work was supported by the research fund of the Çukurova University, Project
Number: 10683


