 Analysing TB Severity Levels with an Enhanced Deep Residual Learning – Depth-ResNet


                                   Xiaohong Gao, Carl James-Reynolds, Ed Currie

                      Department of Computer Science, Middlesex University, London, UK

                                   {x.gao, c.james-reynolds, e.currie}@mdx.ac.uk



       Abstract. This work responds to the Tuberculosis task of the ImageCLEF 2018 competition. While Task #3
       proved challenging, the experience was very enjoyable; had more time been available, more accurate results
       could certainly have been achieved. The authors submitted 2 runs. Based on the given training datasets with
       severity levels of 1 to 5, an enhanced deep residual learning architecture, depth-ResNet, is developed and
       trained on the datasets to classify the 5 categories. The datasets are pre-processed, with each volume
       segmented into twenty 128×128×depth blocks with ~64-pixel overlaps. While each block is predicted with a
       severity level, assembling all constituent block scores into an overall label for the volume in question proves
       more challenging. Since the probability of high severity is not provided with the training datasets and bears
       little resemblance to the classification probability, in the first run the probabilities 0.9, 0.7, 0.5, 0.3 and 0.1
       were manually assigned to severity levels 1 to 5 respectively. After the deadline was extended, the model was
       re-trained with the number of frames increased from 1 to 8, which takes much longer to train. In addition, a
       new measure was introduced to calculate the overall probability of high severity from the block scores. As a
       result, with regard to classification accuracy, the 2nd submitted run ranked 14th out of a total of 36
       submissions, a significant improvement on the 35th place achieved by the first run.

       Keywords: Deep residual learning, classification, severity of Tuberculosis




1. Deep Residual Learning – Depth-Resnet
Since a convolutional neural network (CNN) architecture is constructed by stacking multiple layers of
convolution and subsampling in an alternating fashion, a CNN can in principle be made deeper by piling up a
large number of layers. However, the increased depth appears to contribute little to the accuracy of a trained
model. This is due to the well-known vanishing gradient problem: as the gradient is back-propagated to earlier
layers, repeated multiplication may make it vanishingly small. As a result, as the network becomes deeper, its
performance saturates or even starts degrading rapidly.

Deep residual networks (ResNet) [1-3] introduced the notion of the 'identity shortcut connection' that bypasses
one or more layers. A key advantage of residual units is that their skip connections allow direct signal
propagation from the first to the last layer of the network, especially during backpropagation. This is because
gradients are propagated directly from the loss layer to any previous layer while skipping intermediate weight
layers that could otherwise cause the gradient signal to vanish or deteriorate.

In this work, an enhanced ResNet, depth-ResNet, which is built on the ResNet-50 model and illustrated in
Figure 1, is applied to analyse the level of severity of tuberculosis in CT lung images.
Fig. 1. The depth-ResNet architecture applied in this paper, where ×N at each conv level indicates that the block (e.g.
conv5_x) is repeated N (e.g. 3) times consecutively.

As shown in Figure 1, which builds on the Inception concept, the depth ($z$) convolution block operates on the
dimensionality-reduced input $x_{l,1}$ with a bank of 3D filters $W_{l,z}$, producing $x_{l,z}$. Biases
$b \in \mathbb{R}^C$ are also applied, with initial values of 0, as formulated in Eq. (1):

                    $x_{l,z} = W_{l,z}\, x_{l,1} + b$                                                                 (1)

Hence the residual unit $\mathcal{F}$ is expressed in Eq. (2):

                    $\mathcal{F} = f\Big(W_{l,3}\big(S_l\, f(x_{l,z}) + f(W_{l,2}\, f(W_{l,1}\, x_{l,1}))\big)\Big)$  (2)


where $S_l$ is an affine scaling along the depth direction with a bias between 0 and 0.01. This scaling is
adaptive, to facilitate generalisation performance, and is learnt during the training of the network. The
convolution at each layer along the depth ($z$) direction ($x_{l,z}$) takes place between 3 neighbouring slices or
feature maps, i.e. front, current and back, with a randomly chosen stride (between 1 and 7 in this study). The
resulting feature is then added to the block with a scaling factor as a component of the residual unit. The pooling
involves two stages: the avg-pool stage performs 2D spatial global average pooling, whereas the max-pool stage
performs global max pooling along the z direction over the resulting feature maps.
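
As a concrete illustration, below is a minimal PyTorch sketch of a residual unit following Eqs. (1) and (2),
together with the two-stage pooling just described. The channel sizes, the 3-slice depth filter and the
initialisation of the scale $S_l$ are assumptions for illustration; the authors' actual implementation is in
Matlab/MatConvNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthResidualUnit(nn.Module):
    """Sketch of Eqs. (1)-(2): a bottleneck branch plus a scaled depth branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.w1 = nn.Conv3d(channels, channels, kernel_size=1)             # W_{l,1}
        self.w2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)  # W_{l,2}
        self.w3 = nn.Conv3d(channels, channels, kernel_size=1)             # W_{l,3}
        # Depth convolution of Eq. (1): a 3-neighbouring-slice filter along z
        # only, with bias b initialised to 0 as stated in the text.
        self.wz = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                            padding=(1, 0, 0))
        nn.init.zeros_(self.wz.bias)
        # Learnable affine scale S_l, initialised between 0 and 0.01 (assumed).
        self.scale = nn.Parameter(torch.tensor(0.005))

    def forward(self, x):            # x: (batch, C, z, H, W)
        x_z = self.wz(x)             # Eq. (1): x_{l,z} = W_{l,z} x_{l,1} + b
        main = F.relu(self.w2(F.relu(self.w1(x))))               # f(W_{l,2} f(W_{l,1} x_{l,1}))
        out = F.relu(self.w3(self.scale * F.relu(x_z) + main))   # Eq. (2)
        return x + out               # identity shortcut connection

def two_stage_pool(feat: torch.Tensor) -> torch.Tensor:
    """2D spatial global average pooling, then global max pooling along z."""
    feat = feat.mean(dim=(-2, -1))   # avg-pool over H and W -> (batch, C, z)
    return feat.max(dim=-1).values   # max-pool over z       -> (batch, C)
```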

To integrate the block scores into a volumetric label for each dataset, a support vector machine (SVM) is
applied. The system is implemented in Matlab with the MatConvNet [4] toolbox, following standard ConvNet
training procedures [5, 6]. During training, 8 slices are chosen from each block, with a randomly selected stride
between 1 and 7, from the 5 categories, with a batch size of 128 (= 16 blocks). At testing time, each dataset
undergoes the same pre-processing procedure to generate 128×128×depth blocks. The trained depth-ResNet
model (Figure 1) then takes each block as a whole, selects 8 slices at equal depth spacing and propagates these
slices through the trained model to produce a single prediction for the block, with severity scores labelled
between 1 and 5.
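
The slice selection and the SVM-based integration can be sketched as follows; scikit-learn's SVC and the
histogram representation of block scores are illustrative substitutes for the authors' Matlab SVM, not their
exact method.

```python
import numpy as np
from sklearn.svm import SVC

def select_slices(block: np.ndarray, n_slices: int = 8) -> np.ndarray:
    """block: (depth, 128, 128); returns n_slices slices at equal depth spacing."""
    idx = np.linspace(0, block.shape[0] - 1, n_slices).round().astype(int)
    return block[idx]

def score_histogram(block_scores: np.ndarray) -> np.ndarray:
    """Represent a volume by the normalised histogram of its block scores (1-5)."""
    return np.bincount(block_scores, minlength=6)[1:] / len(block_scores)

# Hypothetical usage: fit the SVM on per-volume histograms of block scores.
# clf = SVC().fit([score_histogram(s) for s in train_block_scores], train_labels)
```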



2. Datasets
Data are taken from the Tuberculosis severity scoring task (Task #3) of the ImageCLEF 2018 competition
[7, 8, 9], with 170 datasets for training and 109 for testing. The training data comprise chest CT scans of 170 TB
patients together with the corresponding severity scores (1 to 5) and severity levels designated as "high" and
"low": 90 low-severity cases (scores 4 and 5) and 80 high-severity cases (scores 1, 2 and 3).



3. Image Pre-processing
The collected data are pre-processed to remove background and to segment into smaller blocks, which is
because that most abnormalities occur in small regions and spread over only a few slices. Figure 2 demonstrates
the process to remove background by the application of masks that are dilated in advance. As illustrated in
Figure 2 (b), some masks [10] over-remove lung information. Hence all the masks are dilated (Figure 2(d)) by a
diameter of 30 pixels found empirically to ensure the balance between over- and under- removing of
background (Figure 2(e)). Figure 2(f) depicts the final image of removing background, which has a size of
460 × 340 × 𝑧 (z is the depth and varies between 50 to 400).

Fig. 2. The process of removing background using a dilated mask. (a) An original image slice; (b) the original mask and (c)
the result of applying (b) to (a); (d) the dilated mask and (e) the result of applying (d) to (a); (f) the final segmented image
after background removal. The arrow points to the diseased region of concern.
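
A minimal sketch of the dilation step, assuming the lung masks [10] are binary 2D arrays processed slice by
slice; the 30-pixel diameter is the empirical value reported above.

```python
import numpy as np
from scipy import ndimage

def dilate_mask(mask: np.ndarray, diameter: int = 30) -> np.ndarray:
    """Dilate a binary lung mask with a disk of the given diameter."""
    r = diameter // 2
    y, x = np.ogrid[-r:r + 1, -r:r + 1]
    disk = x * x + y * y <= r * r              # circular structuring element
    return ndimage.binary_dilation(mask, structure=disk)

def remove_background(slice_img: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep only pixels inside the dilated mask (Figure 2(e)-(f))."""
    return np.where(dilate_mask(mask), slice_img, 0)
```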

Upon the segmented volume of 460 × 340 × 𝑧, 24 blocks of size 128 × 128 × 𝑧 are then created with overlaps of
~64 pixels, as illustrated in Figure 3 and sketched after its caption.
Fig. 3. Segmentation of the processed volume into 24 overlapping blocks, each of size 128×128×depth.
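
The tiling can be sketched as below; a 64-pixel grid stride is assumed, which on a 460 × 340 slice yields exactly
6 × 4 = 24 blocks.

```python
import numpy as np

def make_blocks(volume: np.ndarray, size: int = 128, stride: int = 64) -> list:
    """volume: (H, W, depth); returns overlapping (size, size, depth) blocks."""
    h, w = volume.shape[:2]
    blocks = []
    for y in range(0, h - size + 1, stride):       # 6 positions for H = 460
        for x in range(0, w - size + 1, stride):   # 4 positions for W = 340
            blocks.append(volume[y:y + size, x:x + size])
    return blocks
```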

Since some corner blocks comprise a large amount of background (i.e. pixels of value 0), frames, in particular at
the front and back of a volume along the 𝑧 direction, are removed when the background region occupies more
than one third of the frame. Hence the depth (𝑧) of each block varies between 11 and 250 across all datasets after
segmentation, and many 3D volumes are left with fewer than 24 blocks after pre-processing. Each block has also
been resized from 128×128×𝑧 to 256×256×𝑧 to save training time.
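
A sketch of this frame-removal rule, assuming background pixels have value 0 after masking:

```python
import numpy as np

def drop_background_frames(block: np.ndarray, max_bg: float = 1 / 3) -> np.ndarray:
    """block: (H, W, depth); removes slices whose background region
    (zero-valued pixels) occupies more than one third of the frame."""
    bg_fraction = (block == 0).mean(axis=(0, 1))   # background ratio per slice
    return block[:, :, bg_fraction <= max_bg]
```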



4. Results
Since each 3D volume contains around 24 blocks, each with an individual severity score, a final score for the
whole dataset has to be integrated from them. In principle, the five severity levels can be treated as 2 classes,
labelled 'high' (scores 1, 2 and 3) and 'low' (scores 4 and 5). Three measures conveying the inter-relationships
between blocks scored 1 to 3, 4 to 5 and 1 to 5 are then calculated using Eqs. (3), (4) and (5) respectively, where
scores of 1 to 5 are assigned initial probabilities of high severity of 0.9, 0.7, 0.5, 0.3 and 0.1 respectively.

         $\mathit{prob\_high} = \dfrac{0.9 \times num_{block1} + 0.7 \times num_{block2} + 0.5 \times num_{block3}}{num_{block1} + num_{block2} + num_{block3}}$                       (3)

         $\mathit{prob\_low} = \dfrac{0.3 \times num_{block4} + 0.1 \times num_{block5}}{num_{block4} + num_{block5}}$                                                                (4)

         $\mathit{prob\_all} = \dfrac{0.9 \times num_{block1} + 0.7 \times num_{block2} + 0.5 \times num_{block3} + 0.3 \times num_{block4} + 0.1 \times num_{block5}}{num_{block1} + num_{block2} + num_{block3} + num_{block4} + num_{block5}}$   (5)
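
The three measures translate directly into code; the sketch below assumes the per-volume block scores have
already been predicted by depth-ResNet.

```python
def severity_probabilities(counts: dict) -> tuple:
    """counts maps each severity score (1-5) to the number of blocks predicted
    with that score; returns (prob_high, prob_low, prob_all) per Eqs. (3)-(5)."""
    p = {1: 0.9, 2: 0.7, 3: 0.5, 4: 0.3, 5: 0.1}   # initial high-severity probabilities

    def weighted(scores):
        total = sum(counts.get(s, 0) for s in scores)
        if total == 0:
            return 0.0                             # guard against empty groups
        return sum(p[s] * counts.get(s, 0) for s in scores) / total

    return weighted([1, 2, 3]), weighted([4, 5]), weighted([1, 2, 3, 4, 5])
```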



The probability of high severity for a whole volume can then be determined from these measures, and is in turn
used to score the overall severity of the volume. For example, in this study, if a dataset has 𝑝𝑟𝑜𝑏_ℎ𝑖𝑔ℎ > 0.7,
𝑝𝑟𝑜𝑏_𝑙𝑜𝑤 < 0.20 and 𝑛𝑢𝑚_𝑏𝑙𝑜𝑐𝑘1 > 0, then the dataset is classified as severity 1. In Table 1, two approaches are
applied. One is based on the overall probability alone (Level-1), as formulated in Eq. (5), which is simple and
straightforward; the other (Level-2) combines all three measures of Eqs. (3) to (5) through decision rules of the
kind given above (sketched below). The Level-1 approach was applied to the ImageCLEF Tuberculosis 2018
competition [8, 9], where the authors' submission ranked 14th (out of 36 submissions) in terms of accuracy
(AUC = 0.6534) on a separate set of test data (n = 109) with unknown severity levels. The results in Table 1 are
based on three runs over the training data, in each of which 100 datasets were randomly selected from the 170
training sets and the remaining 70 used as test.
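
A sketch of the rule-based (Level-2) decision follows; only the severity-1 rule is stated in the text, so the
remaining branches are placeholders.

```python
def classify_volume(prob_high: float, prob_low: float, counts: dict) -> int:
    # Severity-1 rule as given in the text.
    if prob_high > 0.7 and prob_low < 0.20 and counts.get(1, 0) > 0:
        return 1
    # Rules for severities 2-5 are not specified in the paper; analogous
    # threshold tests on prob_high, prob_low and the block counts would follow.
    raise NotImplementedError("remaining decision rules not given in the text")
```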
                    Table 1. The accuracy performance of both the Level-1 and Level-2 calculations.

     Severity       1              2              3              4              5              Average (%)
     Level-1        0.80 ± 0.00    0.60 ± 0.05    0.75 ± 0.00    0.88 ± 0.02    0.82 ± 0.12    75.88 ± 3.80
     Level-2        0.86 ± 0.08    0.70 ± 0.01    0.77 ± 0.02    0.90 ± 0.00    0.84 ± 0.04    85.29 ± 3.00



5. Conclusion
Prediction of the probability of high severity level appears to be a challenging task, since this information has to
be determined from the severity scores of 1 to 5. Due to limited computing power (only 1 GPU was available),
each run took 4 days to train (on 100 datasets) and 2 days to test, so the results from the Level-2 approach were
only obtained after the deadline. However, the experience gained from this competition was very enjoyable,
with many lessons learnt in relation to designing deep residual learning networks.



References
    1.  He K, Zhang X, Ren S, Sun J, Identity Mappings in Deep Residual Networks, European Conference on Computer
        Vision (ECCV) (2016).
    2. He K, Zhang X, Ren S, Sun J, Deep Residual Learning for Image Recognition, IEEE Conference on Computer
        Vision and Pattern Recognition (CVPR) (2016).
    3. Feichtenhofer C, Pinz A, Wildes R, Temporal Residual Networks for Dynamic Scene Recognition, IEEE
        Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
    4.  MatConvNet: http://www.vlfeat.org/matconvnet/. Retrieved May 2018.
    5. LeCun Y, Bengio Y, Hinton G, Deep Learning, Nature. 521: 436-444 (2015).
    6.  Krizhevsky A, Sutskever I, Hinton G, ImageNet Classification with Deep Convolutional Neural Networks,
        Advances in Neural Information Processing Systems (NIPS 2012) (2012).
    7. Cappellato L., Ferro N., Nie J, Soulier L, Eds., CLEF 2018 Working Notes, Working Notes of CLEF 2018 –
        Conference and Labs of the Evaluation Forum, CEUR-WS, Eds. (2018).
    8.  Dicente Cid Y., Liauchuk V., Kovalev V., Müller H., Overview of ImageCLEFtuberculosis 2018 - Detecting
        multi-drug resistance, classifying tuberculosis type, and assessing severity score, CLEF 2018 Working Notes,
        CEUR-WS.org, September 10-14, Avignon, France (2018).
    9. Ionescu B., Müller H., Villegas M., de Herrera A., Eickhoff C., Andrearczyk V., Cid Y.D., Liauchuk V., Kovalev
        V., Hasan S.A., Ling Y., Farri O., Liu J., Lungren M., Dang-Nguyen DT, Piras L., Riegler M., Zhou L., Lux M.,
        Gurrin C., Overview of ImageCLEF 2018: Challenges, Datasets and Evaluation, Experimental IR Meets
        Multilinguality, Multimodality, and Interaction, Proceedings of the Ninth International Conference of the CLEF
        Association (CLEF 2018), September 10-14, Avignon, France (2018).
    10. Cid YD, Jiménez-del-Toro OA, Depeursinge A, Müller H, Efficient and fully automatic segmentation of the lungs
        in CT volumes. In: Goksel, O., et al. (eds.) Proceedings of the VISCERAL Challenge at ISBI. No. 1390 in CEUR
        Workshop Proceedings (2015).