<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ImageCLEF 2017: ImageCLEF Tuberculosis Task - the SGEast Submission</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiamei Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Penny Chong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Xiang Marcus Tan</string-name>
          <email>tang@mymail.sutd.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Binder</string-name>
          <email>binder@sutd.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISTD Pillar, Singapore University of Technology and Design</institution>
          ,
          <addr-line>8 Somapah Road, 487372</addr-line>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ST Electronics-SUTD Cyber Security Laboratory, Singapore University of Technology and Design</institution>
          ,
          <addr-line>8 Somapah Road, 487372</addr-line>
          ,
          <country country="SG">Singapore</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our methodologies for improving the diagnostic accuracy of drug-resistant tuberculosis and for identifying the type of tuberculosis present in a patient, as required by the tuberculosis task of ImageCLEF 2017. First, we employed Convolutional Neural Networks (CNN), which can identify useful features in the Computed Tomography (CT) scans provided in the competition and perform the classification based on them. Second, Recurrent Neural Networks (RNN) were used on top of CNNs, with the CNN acting as a feature extractor and the RNN as a classifier. For our models to produce acceptable results, proper preprocessing, such as image slicing and data augmentation, was performed before feeding the input data to the models for training. Our methods reached ranks 4 and 5 on the subtask involving drug-resistant tuberculosis and ranks 1 and 2 on the subtask of identifying the tuberculosis type, according to the evaluation performed by ImageCLEF.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Residual Networks</kwd>
        <kwd>Recurrent Neural Networks</kwd>
        <kwd>Long-Short Term Memory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Tuberculosis (TB), caused by the bacterium Mycobacterium tuberculosis, is a persistent
and deadly threat that endangers lives even with today's advanced
medical technology. Standard medications are known to be ineffective against
multidrug-resistant (MDR) tuberculosis. Early identification of the
presence of drug resistance (MDR) and accurate diagnosis of the TB type can reduce
the potential detrimental effects on patients. Determining whether a strain of
TB shows signs of MDR is cost- and equipment-intensive, which unfortunately
still cannot be afforded by all patients in need. At the same time, mobile
internet and cloud technologies are becoming widespread even in economically
underdeveloped and remote regions. This nourishes the hope that the
great successes of deep learning can be employed to help in such circumstances.
Applying image processing techniques to CT scan images could provide medical doctors
with hints for a more accurate diagnosis, which is the motivation behind
the ImageCLEF 2017 challenge, more specifically its focus on TB. We refer the
reader to the papers by Dicente et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Ionescu et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for more
details of this competition task.
      </p>
      <p>We decided to participate in this task due to its importance and its
challenging nature compared to many standard datasets used, e.g., in deep
learning tutorials: as tuberculosis does not always affect the whole lung volume,
one can expect that many areas in the lung contain no discriminative
evidence. At the volume or slice level, this implies a very challenging signal-to-noise
ratio, which makes this challenge unique.
This observation, combined with the relatively small sample size of the tasks,
poses a problem even for transfer learning approaches.</p>
      <p>
        In this work, we tackle the problem from two perspectives.
First, we view each patient's 3-dimensional CT scan and slice it into
2-dimensional images to be used as training data for a CNN model. Second, we
view each patient's 3-dimensional CT scan as a sequence of 2-dimensional
images and use these sequences to train an RNN model. For the CNN, we
adopted the ResNet-50 model of He et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and performed transfer learning with
the TB images. For the RNN model, we used several stacked Long Short-Term
Memory (LSTM) layers, inspired by Goh et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], trained on image features
generated by the default ResNet-50 model.
      </p>
      <p>The main contributions of this work are:
1. Aiding medical doctors in the diagnosis of drug-resistant TB and TB-type
identification through image processing techniques.
2. Introducing work towards inexpensive and quick methods for early detection
of the MDR status and TB type in patients.</p>
      <p>The remainder of the paper is organized as follows. Section 2 introduces
the two methods used for the task: transfer learning using a CNN and sequence learning
using an RNN. Experimental results and discussion follow in Section 3, and
possible future work is discussed in Section 4. Finally, we conclude
our work for the ImageCLEF tuberculosis task in Section 5. Lastly, the authors would
like to point out that the models used for this ImageCLEF tuberculosis task have been
uploaded to GitHub for sharing1, and we encourage readers to improve further
on them.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>In this section, we introduce our two solutions to the TB task in detail. The
two sub-tasks of the ImageCLEF tuberculosis task can be treated as typical image
classification tasks, for which deep learning gives state-of-the-art results. We
therefore adopted both CNN and LSTM approaches, which are well known for their
performance on image and sequence classification tasks. We first
transform the CT scans into image slices and preprocess them as described
in Section 2.1. Every preprocessed slice of a patient is then used to fine-tune a
CNN; our early experiments showed that ResNet-50 outperformed GoogLeNet
and Caffe-reference in terms of accuracy (Section 2.2). Our observations also
showed that not all slices of a CT scan contain significant information or
are relevant to tuberculosis. Hence, an LSTM is trained on
ResNet-50 features (Section 2.3) to overcome this problem. Figure 1 shows an overview
of our methodology. Both methods are employed on the two sub-tasks.</p>
      <p>1 https://github.com/maizesix92/ImageCLEF2017_TB_SGEast.git</p>
      <p>Fig. 1: Method description. Preprocessing transforms the CT scans into image
slices along the top-down dimension. Data augmentation increases the training data.
Each slice inherits the label of the patient. For transfer learning, we tried both the
original training set and the masked training set to fine-tune ResNet-50 (Section 2.1).
ResNet-50 outputs the scores (probabilities) of the different classes. For the LSTM
method, we extracted the features of every image slice in the original training set
from the pre-trained ResNet-50 and formed a feature sequence for every patient.
The feature sequences are used to train the LSTM, which likewise outputs the scores
of MDR and of the different TB classes.</p>
      <sec id="sec-2-1">
        <title>Data Preparation</title>
        <p>For data preparation, we emulated what doctors usually do when interpreting
CT scans. The slices were extracted along the top-down dimension so that they
were of size 512*512. All the pixel values of one image slice were scaled linearly
to the range of [0,255] to form a greyscale image. Also, the slices inherit the label
of the CT scan.</p>
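        <p>The slice-wise linear rescaling described above can be sketched as follows (a minimal NumPy sketch; the function name and edge-case handling are our own, not from the paper):</p>
```python
import numpy as np

def slice_to_greyscale(hu_slice):
    """Linearly rescale the pixel values of one CT slice to [0, 255]."""
    s = np.asarray(hu_slice, dtype=np.float64)
    lo, hi = s.min(), s.max()
    if hi == lo:  # constant slice: map everything to black (our choice)
        return np.zeros_like(s, dtype=np.uint8)
    scaled = (s - lo) / (hi - lo) * 255.0
    return scaled.round().astype(np.uint8)
```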
        <p>More often than not, the more training data we feed into a neural network,
the better the performance. To gather more training data, every image
slice was augmented slightly by enhancing contrast and brightness, blurring,
and rotating. Contrast and brightness were enhanced with the Python
ImageEnhance module. A parameter of 1 returns the original image;
we therefore chose 1.5 as the enhancement parameter, so as not to introduce much
degradation. We also used a 3*3 mode filter to blur the image, where the window
size is relatively small. Considering that lung positions may vary a little from
patient to patient, we also rotated image slices left by 5 degrees to account
for slight differences between patients. Visually, the slices still looked the
same after these enhancements; for the neural network, however, the augmented
and original slices count as different. By augmenting the training
data, we hoped to prevent overfitting of the trained model. For simplicity, we
call this dataset the original training set.</p>
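        <p>The contrast and brightness arithmetic behind the factor-1.5 enhancement can be sketched in NumPy, assuming PIL-style blending (contrast blends with a flat image at the mean intensity, brightness blends with black); the function names are our own:</p>
```python
import numpy as np

def enhance_contrast(img, factor=1.5):
    """NumPy analogue of PIL ImageEnhance.Contrast: push pixel values
    away from the image mean by the given factor (1.0 = identity)."""
    mean = img.mean()
    out = mean + factor * (img.astype(np.float64) - mean)
    return np.clip(out, 0, 255).astype(np.uint8)

def enhance_brightness(img, factor=1.5):
    """NumPy analogue of PIL ImageEnhance.Brightness: scale toward/away
    from black (1.0 = identity)."""
    out = factor * img.astype(np.float64)
    return np.clip(out, 0, 255).astype(np.uint8)
```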
        <p>
          We generated another training set, called the masked training set. Since only areas
within the lungs contain relevant information, we can use the mask files provided
by ImageCLEF to extract the lung area. The mask files are also 3D data, with
values 0, 1 and 2: 0 represents the non-relevant area, while 1 and 2 represent the left
and the right lung respectively [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Hence, we can obtain a mask slice for
every image slice with the same method as in Section 2.1. The objective of using
mask files is to train only on the lung area of the images. During CNN training, the
input images are always normalized by subtracting the mean value computed on
the training set from each pixel [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for better backpropagation. Thus, we
set the pixels with mask value 0 to the mean of the image slices in the original training
set, and retain the values of the pixels in the lung area, to obtain the masked training set.
During training, we again used the mean of the original training set, so that the
non-relevant area becomes all zeros after subtracting the mean, thereby highlighting
the lung area.
        </p>
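        <p>The construction of one masked-training-set slice can be sketched as follows (a minimal NumPy sketch under the assumptions above; names are our own):</p>
```python
import numpy as np

def apply_lung_mask(img_slice, mask_slice, train_mean):
    """Keep pixels inside the lungs (mask value 1 = left lung, 2 = right
    lung) and set everything else to the original-training-set mean, so
    that the later mean subtraction zeroes out the non-lung area."""
    out = img_slice.astype(np.float64).copy()
    out[mask_slice == 0] = train_mean
    return out
```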
      </sec>
      <sec id="sec-2-2">
        <title>Transfer learning</title>
        <p>
          Transfer learning has proven useful in image and video processing tasks.
Besides the advantage that the model is fast and easy to train, this approach
usually gives satisfactory results, especially when the training dataset is relatively small
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In transfer learning, we must first choose a CNN model. In our
experiments on different CNN models, including GoogLeNet, Caffe-reference, and
ResNet-50, we found ResNet-50 to be the most stable and the best performing.
Thus, ResNet-50 was used in the following experiments. Both the original training
set and the masked training set were used in transfer learning.
        </p>
        <p>Training stage After preparing the data, we obtained two training sets, the
original set and the masked set. Since both training sets are comprised of 2D
images, the image slices of all patients were shuffled and trained as individual
images. This undoubtedly introduces noise into our training data. Hence,
to address this problem, we used a max-pooling layer instead of average
pooling, so that the pooling layer extracts the features of relevant slices and
reduces the influence of noisy training data. In addition, we also worked around
this problem during the test stage, as discussed in the next
paragraph. When fine-tuning the CNN, layers after the last pooling layer are fine-tuned
more than the other layers in the network, because the
early layers contain generic features, such as edges or blobs, that are relevant to
many image processing tasks. This approach also helps avoid overfitting given
the small training dataset.</p>
        <p>Testing stage At test time, the slices of one patient are fed into the neural network
one at a time, and each slice produces one output. We average the outputs of the slices
of one patient to obtain the final output. As mentioned, training on individual slices
introduces noise into the training data, since some slices contain no TB,
or no lung at all. Nevertheless, averaging the scores
at test time reduces the impact of the noisy training data and yields a more reliable
output.</p>
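        <p>The test-time aggregation above amounts to a simple mean over per-slice class scores (a minimal NumPy sketch; names are our own):</p>
```python
import numpy as np

def patient_score(slice_scores):
    """Average per-slice class scores into one patient-level prediction.

    slice_scores: array of shape (num_slices, num_classes) holding the
    network output for each slice of one patient.
    """
    return np.asarray(slice_scores).mean(axis=0)
```
The predicted class is then the argmax of the averaged score vector.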
      </sec>
      <sec id="sec-2-3">
        <title>Sequence Learning</title>
        <p>
          RNNs are especially useful in solving problems that contain some sort of
sequential or time-series data. This is because they are able to learn the temporal
patterns within them. Variants of RNNs have already been used many times
before, whether in the context of text generation [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] or in the area of anomaly
detection in Cyber-Physical Systems [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In this subsection, we will provide a
brief introduction to RNNs and its variants, before moving on to discuss our
implementation in this image recognition context and the reason for doing so.
Vanilla RNN vs LSTM Vanilla RNNs are very simple cells with just one
activation function within it. These cells will be arranged in a sequence with a
length that is dependent on a user-de ned number. An illustration of a vanilla
RNN cell is shown in Figure 2.
Mathematically, it can be expressed as such:
ht = H(Whht 1 + Wxxt + bh)
yt = Wyht + by
(1)
where H is the activation function in the RNN cell, W are the weights, ht are
the hidden vectors at time step t, xt are the inputs at time step t, yt are the
outputs at time step t and b are the biases. Hence, it can be said that the
output of a RNN cell will become the input of the next RNN cell. The weights
in the network are updated at every training iteration using this concept called
backpropagation through time. At each iteration, the gradients are calculated
and the weights will be ne tuned based on these calculated gradients, so as to
minimize the di erence between the predicted and the actual result.
        </p>
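        <p>A single forward step of Eq. (1) can be sketched in NumPy as follows (assuming H = tanh; the function name and shapes are our own illustration):</p>
```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, W_x, W_y, b_h, b_y):
    """One vanilla RNN step following Eq. (1), with H = tanh."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b_h)  # new hidden state
    y_t = W_y @ h_t + b_y                          # output at step t
    return h_t, y_t
```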
        <p>
          Traditional RNNs are less widely used due to their inherent exploding- and
vanishing-gradient problems, which arise in gradient-based learning with
backpropagation and negatively affect the model's ability to
learn long temporal sequences. A paper by Pascanu et
al. illustrates this point clearly [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>LSTMs are variants of the vanilla RNN with a more complex
cell architecture; as such, backpropagation through time takes longer.
They are also arranged in a sequence, as in the vanilla RNN case, with a
user-defined sequence length. Furthermore, there are different variants
of the LSTM model, one of which is the peephole LSTM. An example of a peephole
LSTM cell is shown in Figure 3.
Mathematically, it can be represented as:
i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
c_t = f_t c_{t-1} + i_t tanh(W_xc x_t + W_hc h_{t-1} + b_c)
o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)
h_t = o_t tanh(c_t)
(2)
where σ is the sigmoid activation function, tanh is the hyperbolic tangent, the W are
the weights, h_t and h_{t-1} are the hidden vectors, c_t and c_{t-1} are the cell states
of an LSTM cell, the b are the biases, o_t are the output gates, i_t are the input gates,
and f_t are the forget gates. This variant is called the peephole LSTM because of
its peephole connections from the cell state c_{t-1} to f_t and i_t, and from c_t to o_t.
The LSTM alleviates the gradient issue of vanilla RNNs mentioned earlier,
as LSTMs have the ability to learn what to remember
and what to forget through the forget gate f_t. Hence, the LSTM is more popular
and more widely used, and it is what we chose to implement in our context.
CNN-LSTM methodology As our raw data were given as CT scans of
patients, we were able to extract the image slices of the lungs. Clearly,
these extracted images form a sequence from the start to the end
of the lungs. Also, as the TB-affected areas do not appear throughout the entire lung,
some images will be free of TB traces. Hence, we explored the
idea of using a sequence of images, potentially showing different parts of the
TB-affected area mixed together with images without TB, to make the classification more
accurate and robust. Thus, we used the CNN to generate image features,
followed by an LSTM for classification; we refer to this as CNN-LSTM.
The CNN-LSTM architecture can be thought of as a two-step process. First,
images are passed into the CNN model to generate image feature vectors,
which describe the images in
numerical form. These feature vectors are then passed into the LSTM
model for classification, since it is not possible to pass an image to a recurrent
layer, to the best of our knowledge. This idea is illustrated in Figure 1.</p>
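        <p>One forward step of the peephole LSTM cell in Eq. (2) can be sketched in NumPy as follows (a sketch under the common assumption that the peephole weights W_ci, W_cf, W_co are diagonal and stored as vectors; the dictionary layout is our own):</p>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x_t, h_prev, c_prev, W, b):
    """One peephole LSTM step following Eq. (2).

    W: dict of weight matrices; the peephole entries 'ci', 'cf', 'co'
    are vectors (diagonal peephole connections). b: dict of bias vectors.
    """
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c_t + b['o'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```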
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiment and submitted runs</title>
      <sec id="sec-3-1">
        <title>Experimental data and Libraries used</title>
        <p>
          Experimental data Our experimental data was provided by ImageCLEF[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
We were given 230 CT scans for the MDR task and 500 for the TB-Type task. To
evaluate the performance of our models, we split off 20% of the data as a local test
set and used the rest, which we call the local training set, for
transfer learning and sequence learning. On the local training set, we employed
5-fold cross-validation to verify that our models were robust. For the final
submitted runs, we trained on the full training set.
        </p>
        <p>
          Adhering to our methodology as described earlier, the CT scans were transformed
into image slices, giving a total of 27992 slices for the MDR task and 68935
slices for the TB-Type task. After data augmentation, we obtained four times
more image slices for training, and all slices inherit the label of their corresponding
patient. However, tuberculosis is present only in some of the slices of a CT scan. Slices
that do not contain tuberculosis but inherit the label of an MDR or tuberculosis
patient affect the accuracy of our model; in other words, the signal-to-noise
ratio of our training set is very low, which is the main reason why the accuracy is
low. Experimental results are shown in the following sections.
Libraries used For the training of ResNet-50, we used the Caffe library [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
by Jia et al. On the other hand, we used Keras [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] with Theano [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] as a backend
for the training of the LSTM neural network.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Experiment of Transfer learning</title>
        <p>
          Transfer learning was implemented based on pre-trained ResNet-50 and we
trained on the original training set and masked training set as described in
Section 2.1. We also trained different models using AVE and MAX pooling
layers. AVE pooling uses the average score as the output of sub-sampling, while
MAX pooling uses the maximal score [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. All slices were resized to 256*256 in
order to increase the batch size and accelerate training. In addition, the crop size
was set to 192 to increase the batch size further while not
losing much information about the lung. At test time, only the center crop was used.
The training parameters were modified from the original training parameters of
ResNet-50; mainly, we lowered the learning rate for transfer learning. We
employed stochastic gradient descent (SGD) as the optimizer. The base learning
rate was set to 0.0001 and decreased every 150000 iterations using the step
policy. The weight decay was set to 0.0001, gamma to 0.1 and momentum to
0.9. To compare the different models, we plot the results of the local experiments
in Figure 4, where "original" and "masked" denote the two training sets
and AVE and MAX the pooling policies chosen.
        </p>
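        <p>In Caffe, the solver settings above correspond to a solver definition along these lines (a hedged sketch: the net path is a placeholder, not the authors' actual file):</p>
```protobuf
# solver.prototxt (sketch, not the authors' exact file)
net: "train_val.prototxt"  # placeholder path to the ResNet-50 net definition
base_lr: 0.0001            # lowered base learning rate for transfer learning
lr_policy: "step"          # decrease the learning rate in steps
stepsize: 150000           # every 150000 iterations
gamma: 0.1                 # multiply the learning rate by 0.1 at each step
momentum: 0.9
weight_decay: 0.0001
```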
        <p>By comparing the black and red lines in Figures 4a and 4b, we can
see that, on average, models trained on the original training set obtained higher
accuracy and were also more stable than models trained on the masked
training set. This suggests that the usage of masks may not be appropriate for
either task. Manual inspection also showed that some masks appeared to
disrupt the features of TB by introducing additional noise into the CT scans. By
comparing the green and red lines, we also observed that max pooling,
compared to average pooling, resulted in lower accuracy on the MDR task but
similar or higher accuracy on the TB-Type task. Generally, we believe
max pooling performs better than average pooling at capturing features of the relevant
slices with tuberculosis. However, our local testing on the MDR task suggests the
opposite, which may be due to our relatively small test set, whose results
may not be a true indicator of model performance.</p>
        <p>Among our submitted runs, MDR resnet partial and TBT resnet partial were
generated by the ResNet-50 model trained on the local training set with average
pooling. As mentioned, the local training set is the data remaining after
the extraction of 20% of the original slices for local testing. MDR resnet full and
TBT resnet full were generated by the ResNet-50 model trained on the entire
training set, which includes all original and augmented slices, with max pooling.
For testing, we used the official data set provided by ImageCLEF.</p>
        <p>The results of our final submitted runs and their corresponding ranks are
shown in Table 1. We have included the performance of our models in terms of
area under the curve (AUC) and accuracy (ACC) for the MDR task, and the Kappa
coefficient and ACC for the TB-Type task. Comparing the results of the submitted runs,
we can see that the max-pooling models indeed performed better than average pooling,
which supports our earlier belief that max pooling can, to some extent, capture
features of the relevant image slices of a patient.</p>
        <p>In order to use an LSTM, some data preprocessing had to be done so that we
could pass our data into the LSTM model for training. First, we used the
pre-trained ResNet-50 model as a feature extractor: we passed the image slices
into the pre-trained ResNet-50 model and extracted the output of the last fully
connected layer, a vector of size 2048, so that each individual slice gets a
vector representation. Next, we grouped feature vectors belonging to the same
patient into the same sequence. As the number of image slices per patient can
vary from 50 to 400, we chose an arbitrary value of 150 feature vectors to form
one sequence per patient. If there were not enough slices to form the 150 feature
vectors, some slices were repeated; if there were more than enough
slices, we sub-sampled the image slices to obtain a sequence length of
150. After this, our data were 3-dimensional, in the format (patient,
sequence, feature vector), where each patient has only one sequence. We then
further lowered the sequence length to 75, so as to obtain a more zoomed-in
representation by increasing the number of sequences per patient, making
the areas with TB more significant. Lastly, we passed these
preprocessed data, with their corresponding labels, into the LSTM model for
training.</p>
        <p>Fig. 4: Local testing results of selected transfer-learned ResNet-50 models.
(a) Accuracy on the MDR task; (b) accuracy on the TB-Type task.</p>
        <p>The LSTM portion of the CNN-LSTM consists of 3 stacked LSTM layers.
With stacked LSTM layers, complex temporal sequences can be learned.
Figure 5 shows the general architecture of our LSTM-based neural network,
with the output dimensionality of the LSTM layers decreasing
after each layer. The model also contains some Dropout layers
to prevent overfitting. The prediction shown in
Figure 5 is either the probability of detecting MDR TB or one of the 5 TB
types.</p>
        <p>Figures 6 and 7 summarize the accuracy we obtained when performing
local testing on two RNN variants, the vanilla RNN and the LSTM. The only
difference between the two architectures was the type of RNN used;
all other factors remained constant. For the MDR task, as shown in
Figure 6, the accuracies of the vanilla RNN and the LSTM were
very similar, but the LSTM still outperformed the vanilla RNN by a slight margin.
For the TB-Type task, as shown in Figure 7, the differences between the vanilla
RNN and the LSTM were much greater than for the MDR task, with the LSTM
clearly outperforming the vanilla RNN. We therefore chose LSTMs as our
RNN layers.</p>
        <p>For our submitted runs, we trained our model on the augmented and
original data, in preparation for predicting labels for the test
data given by ImageCLEF, which was provided at a later date. We also set aside
some labeled data for model verification, to ensure that
our prediction accuracy on labeled data was acceptable before moving
on to the prediction of the unlabelled data. This data split is similar to that
described in Section 3.2, with 80% of the data used for training
and the remaining 20% for validation. Table 2 summarizes the results
our team achieved in the evaluation performed by the ImageCLEF
committee, with the evaluation method as described in Section 3.2.</p>
        <p>As mentioned in Section 2.2, TB does not always affect the whole area of the
lungs, particularly in the early stages. Also, the image slices at the two extreme ends
often do not contain any part of the patient's lungs. Thus only a certain
percentage of the slices contain information relevant to the discrimination between
the various TB types, which made the classification problems in this challenge very
hard due to the low signal-to-noise ratio. Although we tried max pooling
and averaging the patients' slice outputs to address this problem, we also suggest
another, potentially better, method to decrease the noise.</p>
        <p>The proposed way is to manually extract, at least for a subset of patients,
the relevant slices of the different types of TB, and to label the image slices that are
not relevant as belonging to a sixth class. All patient data not
preprocessed in this way would be fully assigned to one of the five existing
classes. Using this approach, even an incomplete annotation of the patients
could increase the signal-to-noise ratio. We did not do this for the
submitted runs, as we lacked the time, and the only expertise present in our group
was on histopathological stains. The emphasis here is that even a partial
annotation of a subset of patients could improve prediction capability. In the
TB-Type task, one can then train on six classes: the 5 types of TB plus
non-discriminative/background slices.</p>
        <p>
          At the testing phase, the image slices of a particular patient are fed into the
network, and the network will not only give the type of TB but also indicate
which slices contain TB and which do not. We can
also use methods such as layer-wise relevance propagation [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and deep Taylor
decomposition[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to further analyse the predictions of trained models. Both
methods can output heatmaps that show pixels' relevance[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to the prediction of
          a model. In our tuberculosis task, heatmaps can highlight the TB-affected part of
an image slice. This may also serve doctors as a second, fall-back check that
helps to identify overlooked areas.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>Our work uses deep learning techniques to complete the ImageCLEF 2017
TB task. By converting CT scans to image slices, we were able to apply
CNN transfer learning and LSTM sequence learning to the MDR and TB-Type
tasks. Both methods gave relatively good results in the competition. Hence, we
can conclude that a transfer-learned CNN can learn to discriminate different
TB types, and that the feature sequences extracted from the CNN also give a good
representation of CT scans. However, due to the highly noisy nature of the
training data, the performance of our neural network training was affected
negatively. If we were able to label the slices more accurately, the two methods
in this paper would give better results.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work was generously supported by the ST Electronics-SUTD Cyber
Security Laboratory. The authors gratefully acknowledge the ISTD-SUTD start-up
funding.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Binder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Montavon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klauschen</surname>
          </string-name>
          , K. Muller, W. Samek, and
          <string-name>
            <given-names>O. D.</given-names>
            <surname>Suarez</surname>
          </string-name>
          .
          <article-title>On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <volume>10</volume>
          (
          <issue>7</issue>
          ):e0130140,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          et al. Keras. https://github.com/fchollet/keras,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dicente Cid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. A.</given-names>
            <surname>Jimenez del Toro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Depeursinge</surname>
          </string-name>
          , and H. Muller. E
          <article-title>- cient and fully automatic segmentation of the lungs in ct volumes</article-title>
          . In O.
          <string-name>
            <surname>Goksel</surname>
            ,
            <given-names>O. A.</given-names>
          </string-name>
          <string-name>
            <surname>Jimenez del Toro</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Foncubierta-Rodr</surname>
            <given-names>guez</given-names>
          </string-name>
          , and H. Muller, editors,
          <source>Proceedings of the VISCERAL Anatomy Grand Challenge at the 2015 IEEE ISBI, CEUR Workshop Proceedings</source>
          , pages
          <volume>31</volume>
          {
          <fpage>35</fpage>
          .
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          , May
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dicente Cid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalinovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Liauchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          , , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Mu</surname>
          </string-name>
          <article-title>ller. Overview of ImageCLEFtuberculosis 2017 - predicting tuberculosis type and drug resistances</article-title>
          .
          <source>In CLEF 2017 Labs Working Notes, CEUR Workshop Proceedings</source>
          , Dublin, Ireland,
          <source>September</source>
          <volume>11</volume>
          -14
          <year>2017</year>
          .
          <article-title>CEUR-WS</article-title>
          .org &lt;http://ceur-ws.
          <source>org&gt;.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adepu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z. S.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Anomaly detection in cyber physical systems using recurrent neural networks</article-title>
          .
          <source>In 2017 IEEE 18th International Symposium on High Assurance Systems Engineering (HASE)</source>
          , pages
          <fpage>140</fpage>
          {
          <fpage>145</fpage>
          ,
          <string-name>
            <surname>Jan</surname>
          </string-name>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>H.</given-names>
            <surname>Greenspan</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. van Ginneken</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Summers</surname>
          </string-name>
          .
          <article-title>Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique</article-title>
          .
          <source>IEEE Transactions on Medical Imaging</source>
          ,
          <volume>35</volume>
          (
          <issue>5</issue>
          ):
          <volume>1153</volume>
          {
          <fpage>1159</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1512.03385</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , H. Muller, M. Villegas,
          <string-name>
            <given-names>H.</given-names>
            <surname>Arenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boato</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.-T.</surname>
            Dang-Nguyen,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Dicente Cid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Eickho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Garcia Seco de Herrera</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Islam</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>and I.</given-names>
          </string-name>
          <string-name>
            <surname>Schwall</surname>
          </string-name>
          . Overview of ImageCLEF 2017:
          <article-title>Information extraction from images</article-title>
          .
          <source>In Experimental IR Meets Multilinguality, Multimodality, and Interaction 8th International Conference of the CLEF Association, CLEF</source>
          <year>2017</year>
          , volume
          <volume>10456</volume>
          of Lecture Notes in Computer Science, Dublin, Ireland,
          <source>September</source>
          <volume>11</volume>
          -14
          <year>2017</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          , and T. Darrell. Ca e:
          <article-title>Convolutional architecture for fast feature embedding</article-title>
          .
          <source>arXiv preprint arXiv:1408.5093</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Imagenet classi cation with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <volume>1097</volume>
          {
          <fpage>1105</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. G. Montavon,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lapuschkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Binder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Samek</surname>
          </string-name>
          , and
          <string-name>
            <surname>K.-R. Mu</surname>
          </string-name>
          <article-title>ller. Explaining nonlinear classi cation decisions with deep taylor decomposition</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <volume>65</volume>
          :
          <fpage>211</fpage>
          {
          <fpage>222</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>R.</given-names>
            <surname>Pascanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>On the di culty of training recurrent neural networks</article-title>
          .
          <source>ICML (3)</source>
          ,
          <volume>28</volume>
          :
          <fpage>1310</fpage>
          {
          <fpage>1318</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>D.</given-names>
            <surname>Scherer</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Muller, and</article-title>
          <string-name>
            <given-names>S.</given-names>
            <surname>Behnke</surname>
          </string-name>
          .
          <article-title>Evaluation of pooling operations in convolutional architectures for object recognition</article-title>
          .
          <source>Arti cial Neural Networks{ICANN</source>
          <year>2010</year>
          , pages
          <fpage>92</fpage>
          {
          <fpage>101</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. I. Sutskever,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martens</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Generating text with recurrent neural networks</article-title>
          .
          <source>In Proceedings of the 28th International Conference on Machine Learning (ICML-11)</source>
          , pages
          <fpage>1017</fpage>
          {
          <fpage>1024</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. Theano Development Team.
          <article-title>Theano: A Python framework for fast computation of mathematical expressions</article-title>
          . arXiv e-prints,
          <source>abs/1605</source>
          .02688, May
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>