SZTAKI @ ImageCLEFmed 2020 Tuberculosis
                    Task

        Bence Lestyan1 , András A. Benczúr1,2 , and Bálint Daróczy1,2
             1
                 Institute for Computer Science and Control (SZTAKI)
                    H-1111, Kende str. 13-17, Budapest, Hungary
                                2
                                  Széchenyi University
                        H-9026, Egyetem tér 1, Győr, Hungary
                   {lestyan,benczur,daroczyb}@ilab.sztaki.hu


      Abstract. In this paper we describe our submission to the ImageCLEFmed
      2020 Tuberculosis task and discuss additional results on the training set
      with various neural networks. After some centralization and normaliza-
      tion we independently categorized the 2D slices with convolutional neural
      networks (traditional and residual feed-forward networks) and we aggre-
      gated the individual predictions based on the positions of the lung and
      the slices. Our additional experiments with various aggregation methods
      indicate that individual slices do not necessary contain enough informa-
      tion about such complex structures.


Keywords: Computed tomography, Residual networks, Convolutional networks,
Tuberculosis


1    Introduction

The goal of the ImageCLEFmed 2020 Tubercolosis task3 [8, 6] is to detect
whether the different parts of the lung are affected by Mycobacterium tuber-
culosis. The categories are LeftLungAffected, RightLungAffected, CavernsLeft,
CavernsRight, PleurisyLeft, PleurisyRight. The data set contain 403 computed
tomography scans (CT scans). Out of the 403 CT scans 283 scans are used as
a training set with known labels for the participants and 120 CT scans as the
test for the competition. For our experiments we split the training set into two
subsets (163 as training and 120 as validation set) and evaluated our models on
the smaller set. Out of the two lung masks [2, 10] we used the first segmentation
method in the aggregation phase of the slice predictions.

  Copyright c 2020 for this paper by its authors. Use permitted under Creative Com-
  mons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 Septem-
  ber 2020, Thessaloniki, Greece.
3
  https://www.imageclef.org/2020/medical/tuberculosis
2     Models

First, we preprocessed individual CT slices. Based on the provided lung masks
we centralized and rescaled the slices to lower the resolution from 512x512 to
256x256. Additionally we standard normalized the intensity values, see Fig. 1. We
omitted to apply additional augmentation techniques [11] e.g. rotation, mirroring
or random crop as the position of the lung is crucial. We treated the scoring
procedure as a set of binary classification tasks therefore we trained separate
neural networks for each category.
    We chose feed-forward neural networks with a single output node to model
the categories per slice. Every inner layer included Rectangular Linear Units
(ReLU) as non-linear activation functions and we chose sigmoid for the output
unit. We built a traditional Convolutional Neural Network (CNN [9]) with two
convolutional layers with 64 5x5 sized filters and a Residual Network (ResNet
[5]) with three residual blocks, for details see Table 1 and Table 2 respectively.
The residual blocks contained a set of 3x3 sized convolution with 8,32,64 filters
per block followed by a second convolution with the same size and a final resid-
ual connection and a downsizing unit. Between the two convolutions we used
batch normalization and ReLU similarly to the original paper. Before the linear
discriminative layer we downsized the tensor with average pooling. Addition-
ally, for the CNN network we applied Dropout [11] in the second convolutional
layer. We evaluated the performance of the models with the log-likelihood of
the probability of the original label measured by the activation of the output
unit. As an optimization method we used Adam [7] thus we omitted additional
regularization in the loss function.
    We measured the performance of various models on the validation set, a
random subset of the training set. In the testing phase we used every training
scan with the best settings. We implemented the models in PyTorch framework
4
  and did all the experiments in python. Additionally, we used the provided lung
masks based on the first automatic segmentation method described in [2].


2.1    Aggregation

We combined the individual predictions of the slices to compute a single score
per CT scan. During our experiments we applied various methods to define a
single score:

 – Mean score (sc1 ): mean of the individual prediction scores of the CT scan.
 – Maximal score (sc2 ): maximal prediction score in a CT scan.
 – Minimal score (sc3 ): minimal prediction score in a CT scan.
 – Median score (sc4 ): median prediction score in a CT scan.
 – Middle score (sc5 ): prediction score of the center slice.
 – Majority vote (sc6 ): proportion of the positive predictions.
4
    https://pytorch.org
 – Mask score (sc7 ): weighted prediction scores based on the proportion of the
   actual lung in the slices. The proportion of the lung was the proportion
   of the lung segment given the mask files. The masks were extracted by a
   fully automatic lung segmentation method described in [2]. We used the
   corresponding masks per lung per task.


Fig. 1: Examples of modified CT slices. Note, standard normalization is a linear
transformation.


Table 1: Convolutional network layout. We denote 2D convolution, maximal
pooling [9] and Dropout [11] with C, M and DO respectively.
        Layer                  #nodes      #parameters         output
        C5x5 + M2x2              64           1.6k           64x126x126
        C5x5 + M2x2 + DO         64           102k            64x61x61
        Output layer              1           238k                1


3   Results
All of our submitted runs included mean score over the CNN results. Our main
submission (#68061) achieved mean AUC of 0.595. The remaining runs con-
tained single category scores with random scores for the rest of the categories
(we estimated the AUC as AU Cmean ∗ 6 − 0.5 ∗ 5)). The estimated per category
AUC of our submission can be seen in Table 3. We noticed that two of the
categories achieved an AUC under 0.5 thus if we negate the scores the AUC
values will flip to the upper half and the adjusted mean AUC will be 0.6548. Im-
portant to mention, that these adjustments only provide us information about
                       Table 2: Residual network layout.
        Layer                  #nodes      #parameters         output
        Input layer              16           0.2k           16x256x256
        Residual layer 1                        3k            8x256x256
        Residual layer 2                       14k           32x128x128
        Residual layer 3                       73k            64x64x64
        Average pooling 8x8                      0             64x8x8
        Output layer              1             4k                1


the distinguishing capability of the models (how the model differentiate nega-
tive and positive examples), in a realistic scenario the final decisions would be
still wrong as without any test data we would not know that we need to flip
the scores. During the challenge and afterwards we experimented over the small
training (163 CT scans) and validation (120 CT scans) sets with several models
and aggregation methods. Table 4 show the mean AUC results on the valida-
tion set and the detailed AUC scores can be seen for the left and right lung in
Table 5 and in Table 6 respectively. The method (mean score of CNN) in our
main submission achieved a mean AUC 0.595 on the validation set however the
best method (median score of CNN) performed significantly better with AUC
of 0.642. If we select the best model (ResNet or CNN) with the median score
per category the mean AUC will be similar to the median CNN with 0.659. In
comparison, if we select properly both the model and the aggregation method
for each category the mean AUC increases to 0.686, a significant gain on the
validation set in comparison to the submitted run.


                  Table 3: Estimated individual AUC values.
               category                         run        AUC
               Estimated LeftLungAffected      #68052       0.734
               Estimated CavernsLeft           #68055      0.452
               Estimated PleurisyLeft          #68050       0.728
               Estimated RightLungAffected     #68059        0.74
               Estimated CavernsRight          #68049        0.41
               Estimated PleurisyRight         #68058       0.674
               mean                            #68061       0.595
               mean adjusted                               0.6548


4   Conclusions and Future Work

In this paper we described our submission and some additional experiments
over the data set of the ImageCLEFmed 2020 Tuberculosis task. We trained
traditional feed-forward convolutional and residual neural networks over the in-
               Table 4: Mean AUC results on the validation set.
               model                 aggregation     mean AUC
               CNN (submitted)           sc1           0.595
               CNN                       sc2           0.594
               CNN                       sc3           0.614
               CNN                       sc4           0.642
               CNN                       sc5           0.626
               CNN                       sc6           0.558
               CNN                       sc7           0.591
               ResNet                    sc1           0.620
               ResNet                    sc2           0.614
               ResNet                    sc3            0.6
               ResNet                    sc4           0.584
               ResNet                    sc5           0.616
               ResNet                    sc6           0.577
               ResNet                    sc7           0.614
               Best model                sc4           0.659
               Best model & aggr.                      0.686


dividual slices of the CT scans and combined the predictions based on the impor-
tance of the slices according to their position and how well they represent both
of the lungs. We found that median score performed best on average although
in some categories the middle slice score or the mask score outperformed other
aggregation methods. Both ResNet and traditional CNN performed similarly in
our experiments on the validation set while the residual network needed signif-
icantly higher computational power. Our simplest run which was submitted to
the challenge had very low mean AUC score 0.595 meanwhile with additional
aggregations we improved the same method on the validation set to achieve a
mean AUC 0.684. We plan to replace 2D convolutions with 3D convolutions to
take advantage of the complex structure of CT scans. Additionally, we intend to
further expand our experiments with bi-directional Recurrent Neural Networks
(RNN [4]) to read through the CT scans from both ends and classify the se-
quence as a whole, utilize Markov Random Fields [1] over the prior predictions
and generate additional samples with slice transition refinement with inter-slice
reconstruction and with category-wise Generative Adversarial Networks [3] to
boost the training procedure. Based on the submissions of other participants
(SenticLab.UAIC mean AUC 0.924 or SDVA-UCSD mean AUC 0.875) we be-
lieve individual slice predictions may not be representative enough to describe
CT scans as a whole to detect Mycobacterium tuberculosis.


5   Acknowledgement
The publication was supported by the Hungarian Government project GINOP-
2.2.1-18-2018-00004: AI based lung cancer diagnosis by chest CT, 2018-1.2.1-
NKP-00008: Exploring the Mathematical Foundations of Artificial Intelligence,
Table 5: Per category AUC results on the validation set for the left lung. The
best results are highlighted in red.
      model         LeftLungAffected      CavernsLeft     PleurisyLeft
      CNN sc1             0.675              0.506           0.575
      CNN sc2             0.593              0.556           0.512
      CNN sc3             0.762              0.525           0.625
      CNN sc4             0.612               0.7            0.631
      CNN sc5             0.717              0.525           0.575
      CNN sc6             0.725              0.503            0.5
      CNN sc7             0.706              0.506           0.643
      ResNet sc1          0.687              0.681           0.618
      ResNet sc2          0.593               0.7            0.575
      ResNet sc3          0.706              0.581           0.637
      ResNet sc4           0.65              0.587           0.515
      ResNet sc5          0.668              0.681           0.562
      ResNet sc6          0.662              0.628            0.5
      ResNet sc7          0.743              0.637           0.612


by the Higher Education Institutional Excellence Program, and by the Momen-
tum Grant of the Hungarian Academy of Sciences. B.D. was supported by an
MTA Premium Postdoctoral Grant 2018.
Table 6: Per category AUC results on the validation set for the right lung. The
best results are highlighted in red.
    model         RightLungAffected       CavernsRight      PleurisyRight
    CNN sc1             0.712                 0.543             0.543
    CNN sc2               0.7                 0.625             0.581
    CNN sc3             0.693                  0.55             0.531
    CNN sc4             0.712                 0.525             0.675
    CNN sc5             0.712                 0.593             0.631
    CNN sc6               0.5                 0.575             0.546
    CNN sc7             0.543                 0.581             0.568
    ResNet sc1          0.562                  0.6              0.575
    ResNet sc2          0.537                 0.612             0.668
    ResNet sc3          0.587                 0.543             0.543
    ResNet sc4           0.55                 0.587             0.618
    ResNet sc5          0.575                 0.593             0.618
    ResNet sc6          0.562                 0.578             0.531
    ResNet sc7           0.5                  0.581             0.612
References
 1. Daróczy, B., Vaderna, P., Benczúr, A.: Machine learning based session drop predic-
    tion in lte networks and its son aspects. In: 2015 IEEE 81st Vehicular Technology
    Conference (VTC Spring). pp. 1–5. IEEE (2015)
 2. Dicente Cid, Y., Jiménez del Toro, O.A., Depeursinge, A., Müller, H.: Efficient and
    fully automatic segmentation of the lungs in ct volumes. In: Goksel, O., Jiménez del
    Toro, O.A., Foncubierta-Rodrı́guez, A., Müller, H. (eds.) Proceedings of the VIS-
    CERAL Anatomy Grand Challenge at the 2015 IEEE ISBI. pp. 31–35. CEUR
    Workshop Proceedings, CEUR-WS (May 2015)
 3. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial
    examples. arXiv preprint arXiv:1412.6572 (2014)
 4. Greff, K., Srivastava, R.K., Koutnı́k, J., Steunebrink, B.R., Schmidhuber, J.: Lstm:
    A search space odyssey. IEEE transactions on neural networks and learning systems
    28(10), 2222–2232 (2016)
 5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
    pp. 770–778 (2016)
 6. Ionescu, B., Müller, H., Péteri, R., Abacha, A.B., Datla, V., Hasan, S.A., Demner-
    Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O.,
    Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L.,
    Riegler, M., l Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T.,
    Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu,
    M., Ştefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multime-
    dia Retrieval in Medical, Lifelogging, Nature, and Internet Applications. In: Ex-
    perimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings
    of the 11th International Conference of the CLEF Association (CLEF 2020), vol.
    12260. LNCS Lecture Notes in Computer Science, Springer, Thessaloniki, Greece
    (September 22-25 2020)
 7. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
    arXiv:1412.6980 (2014)
 8. Kozlovski, S., Liauchuk, V., Dicente Cid, Y., Tarasau, A., Kovalev, V., Müller, H.:
    Overview of ImageCLEFtuberculosis 2020 - automatic CT-based report generation.
    In: CLEF2020 Working Notes. CEUR Workshop Proceedings
 9. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
    document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
10. Liauchuk, V., Kovalev, V.: Imageclef 2017: Supervoxels and co-occurrence for tu-
    berculosis ct image classification. In: CLEF2017 Working Notes. CEUR Workshop
    Proceedings, CEUR-WS, Dublin, Ireland (September 11-14 2017)
11. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
    Dropout: a simple way to prevent neural networks from overfitting. The journal of
    machine learning research 15(1), 1929–1958 (2014)