MULTI-ORGAN SEGMENTATION USING SIMPLIFIED DENSE V-NET WITH POST PROCESSING

Ming Feng, Weiquan Huang, Yin Wang, Yuxia Xie
Tongji University, Shanghai, China
{1810865, 1730784, yinw, yuxia xie}@tongji.edu.cn

ABSTRACT

With the recent advances in the field of computer vision, Convolutional Neural Networks (CNNs) are widely used in organ segmentation of computed tomography (CT) images. Based on the Dense V-net model, this paper proposes a simplified version with postprocessing methods that reduce the fragments in organ segmentation results. Compared with the baseline method that uses a sharpmask model with conditional random fields (SM+CRF), our model improves the Dice ratio of Esophagus, Heart, Trachea, and Aorta by 10%, 4%, 7%, and 6%, respectively.

Index Terms— Convolutional Neural Networks, CT Segmentation, Dense V-net

1. INTRODUCTION

Organ segmentation of CT images is of great importance in medical diagnosis. The identification and localization of organs are part of the daily work of the radiologist. Since CT images are complex and three-dimensional (3D), distinguishing organs manually is a difficult and tedious task. Automatic segmentation using deep learning methods has therefore received a great deal of attention in medical imaging research. In the field of 3D medical image segmentation, there are two main approaches. The first is to segment each slice independently, e.g., using the U-net model [1]. The other is to use 3D convolutions to aggregate inter-slice information and segment all slices of the CT image at once; V-net [2] is one of the 3D convolutional network models built for this purpose. Gibson et al. [3] integrated the two-dimensional segmentation model of Dense net [4] into V-net and proposed the Dense V-net architecture for multi-organ segmentation. Overall, single-slice segmentation methods cannot exploit inter-slice dependencies for better results but are computationally more efficient, whereas all-slice 3D segmentation can aggregate all layers for better accuracy but is more expensive to compute.

In this paper, we present our multi-organ segmentation solution used in the SegTHOR challenge hosted at the ISBI'19 conference. Observing that the training data is relatively small and easy to overfit with deep convolutional neural networks, we simplify the Dense V-net model to achieve better results on the testing data. Our postprocessing method further reduces fragments in the prediction mask. The overall improvement over the SM+CRF baseline model [5] is between 4 and 10 percent, depending on the organ.

2. OUR MODEL

Fig. 1. Simplified Dense V-net model. (Legend: 128³ input volume; convolutional downsampling; 2x convolution; bilinear upsampling; 4x dense feature stack; activation.)

The structure of our proposed model is shown in Fig. 1. Compared with the original Dense V-net model, there are two main differences. First, the input size is different: the input size of the original model is 144³, but the number of slices in part of our data is less than 144, so we set the input size to 128³. Second, the spatial prior block is discarded.

The encoder block of the segmentation network generates three sets of feature maps of different sizes. The decoder block upsamples the smaller feature maps so that the output mask has the same size as the input image. The output layer generates the segmentation mask with a probability vector over the segmentation classes at each pixel.

3. IMPLEMENTATION

This section discusses various optimization techniques to reduce the Dice loss and to minimize the Hausdorff distance.

3.1. Data preprocessing

Preprocessing is part of our fully automated organ segmentation method. By analyzing the training data provided, we find the following issues. First, the dataset is small, and it is quite easy to overfit our deep neural networks. Second, for a single CT slice, the proportion of pixels belonging to the various organs differs greatly; Fig. 2 shows the imbalance of different organs at different slices. Last, considering the relative position of the scanner and the patient during scanning, the CT images can be scaled and rotated. Based on these observations, we apply the following techniques.

Fig. 2. Background and organ volume proportion in the training data. (a) The ratio of background to organs. (b) The ratio of organs.

Fig. 3. The 119th and 120th slices of patient 30 in the labeled data and the prediction result: (a) slice 119, label; (b) slice 119, prediction; (c) slice 120, label; (d) slice 120, prediction. The heart disappears at the 120th slice in the labeled data; such a sudden disappearance of an organ often leads to incorrect predictions.

3.1.1. Patch sampling

We ensure that each class is sampled with the same probability. According to the slice range of the test dataset, the sample block size is set to 128³.
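The equal-probability class sampling described above can be sketched as follows. This is a simplified NumPy illustration, not the NiftyNet sampler used in the paper; the function name `sample_patch_center` and the clipping behaviour at the volume border are our own assumptions.

```python
import numpy as np

def sample_patch_center(label_volume, num_classes, patch=128, rng=None):
    """Pick a patch center so that each present class is chosen with
    equal probability, then center the patch on a random voxel of that
    class (clipped so the patch stays inside the volume)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Only classes that actually occur in this volume can be sampled.
    present = [c for c in range(num_classes) if np.any(label_volume == c)]
    cls = rng.choice(present)
    coords = np.argwhere(label_volume == cls)   # all voxels of that class
    center = coords[rng.integers(len(coords))]
    half = patch // 2
    # Clip the center so the patch fits inside the volume.
    center = np.minimum(np.maximum(center, half),
                        np.array(label_volume.shape) - half)
    return cls, center
```

Because the class is drawn uniformly before the voxel, small organs such as the esophagus are sampled as often as the background, which counters the volume imbalance shown in Fig. 2.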
3.1.2. Data augmentation

During the training stage, we randomly rotate the images (within -10° to 10°) and randomly scale them (within a -10% to 10% range). We implement the data augmentation on the Niftynet framework [6]. The augmentation used in the training stage does not affect the structure of the Dense V-net.

3.2. Postprocessing

By comparing the prediction results with the ground truth labels, we find the following issues. In the training data all organs are connected, but organs are not always connected in the predicted results, and some areas of the prediction are not smooth (Fig. 3). There are multiple inclusions of an organ in the same slice, which does not occur in reality. Finally, in the prediction result an organ may be connected but contain background noise inside.

For the first issue, we experimented with the following method. The CT image is sliced along each of the three dimensions in turn, and the number of connected blocks of each organ is counted. For each dimension and each organ, the largest connected block is retained; the other parts are considered background noise and are removed. Experiments show that this method achieves an obvious improvement; see Algorithm 1.

Algorithm 1 Axis-based denoise method
Input: the prediction result from the model, Tm;
Output: the denoised prediction result, Qm;
1: for all axis_i of Tm do
2:   for all slice_j along axis_i do
3:     for all category_k of Tm do
4:       set slice[-1] and slice[max+1] to -1;
5:       if the current slice contains category_k and the previous slice does not then
6:         add the current slice index to blockIn;
7:       end if
8:       if the current slice contains category_k and the next slice does not then
9:         add the current slice index to blockOut;
10:      end if
11:    end for
12:  end for
13:  pair each element of blockIn with the corresponding element of blockOut; each pair delimits a contiguous block whose length is the difference of the two indices. Keep the contiguous block of maximum length, and set all other contiguous blocks in Qm to the background class;
14: end for
15: return Qm;

Fig. 4 shows the prediction results with the removal of disconnected blocks. We also tried slicing the CT image along the depth direction and applying 5×5 average filtering to each layer, but this seriously degrades the segmentation of small organs such as the Esophagus and Trachea while having little effect on large organs such as the Heart and Aorta. Likewise, enlarging the organs of each class within each layer has little effect on the segmentation.

Fig. 4. From top to bottom: main view and left view of the true label, the predicted result, and the 3D denoise. The small fragments are significantly reduced.

3.3. DicePlusXEnt loss function

The loss functions commonly used in segmentation are the Cross-Entropy loss and the Dice loss. The Cross-Entropy loss examines each pixel separately and compares the prediction with a one-hot encoded target vector. It does not consider the imbalance between segmentation classes and can lead to poor predictions for minority classes; imbalanced classes are very common in medical image segmentation. The Dice loss is essentially a measurement of the overlap between the predicted mask and the ground truth mask, calculated as follows [7]:

l_{dice} = -\frac{1}{|K|} \sum_{k \in K} \frac{2 \sum_{i \in I} u_i^k v_i^k}{\sum_{i \in I} u_i^k + \sum_{i \in I} v_i^k}    (1)

where K is the set of segmentation classes, I is the set of pixels of the image, and u_i^k, v_i^k are the predicted and ground truth values of class k at pixel i, respectively. The Dice loss is better suited to extremely imbalanced samples, but in our experience, using the Dice loss alone adversely affects backpropagation and makes training extremely unstable.
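Equation (1) translates directly into code. The sketch below is a NumPy illustration, not the NiftyNet implementation; it assumes predictions and labels have been flattened to shape (num_voxels, num_classes), and the small `eps` term (our addition) only guards against empty classes.

```python
import numpy as np

def dice_loss(probs, onehot, eps=1e-7):
    """Multi-class Dice loss of Eq. (1).

    probs, onehot: arrays of shape (num_voxels, num_classes) holding
    the predicted probabilities u and the one-hot ground truth v.
    Returns the negative mean Dice score over the classes in K."""
    inter = (probs * onehot).sum(axis=0)            # sum_i u_i^k v_i^k
    denom = probs.sum(axis=0) + onehot.sum(axis=0)  # sum_i u_i^k + sum_i v_i^k
    return -np.mean(2.0 * inter / (denom + eps))
```

A perfect prediction gives a loss of -1, and a prediction with no overlap gives 0; the averaging over classes is what makes every organ count equally regardless of its volume.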
We use the DicePlusXEnt loss [8], which is the sum of the Cross-Entropy loss and the Dice loss:

l_{total} = l_{dice} + l_{CE}    (2)

This loss function alleviates the sample imbalance to a certain extent and improves the stability of network training. Because of the sample imbalance, we additionally weight the Cross-Entropy term of DicePlusXEnt per class: w(Background) = 1, w(Heart) = 2, w(Trachea) = 3, w(Aorta) = 4, w(Esophagus) = 5.

4. EXPERIMENTS

Our experiments are conducted on the SegTHOR dataset [5]. Niftynet, which is implemented on top of Tensorflow, is used for model training. Based on the preprocessed data, the Dense V-net is trained and then fine-tuned with different parameter configurations. The activation function used in the network is Leaky ReLU, the batch size is four, and we use the Adam optimizer with an initial learning rate of 0.01.

Algorithm 2 Training model
Input: the training data X and labels Y; the number of fused models N; the learning-rate list L;
Output: segmentation result R;
1: for all n_i in range(N) do
2:   for all l_i in L do
3:     while the loss has decreased within the last 500 iterations do
4:       forward and backward pass;
5:     end while
6:   end for
7:   save the model with the lowest validation-set loss during this iteration;
8: end for
9: fuse the saved models to obtain R_ori;
10: R ← Axis-based-denoise(R_ori);
11: return R;
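The learning-rate logic behind this training loop (decay tenfold after a 500-iteration plateau, then reset once the floor of 0.0001 is reached) can be sketched as below. The helper name and the plateau bookkeeping are our own simplification; the numeric values follow the text.

```python
def next_learning_rate(lr, iters_without_improvement, patience=500,
                       reset_rate=0.1, floor=1e-4):
    """One step of the plateau-driven schedule used in training.

    If the loss has not improved for `patience` iterations, divide the
    learning rate by ten; once it has already reached `floor`, reset it
    to `reset_rate` to start a new cycle."""
    if iters_without_improvement < patience:
        return lr            # still improving: keep the current rate
    if lr <= floor:
        return reset_rate    # bottomed out: restart the cycle
    return lr / 10.0         # plateau: decay tenfold
```

Each reset starts a new cycle whose best checkpoint is saved, which is what yields the seven models that are later fused; the cyclical idea follows [9].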
If the loss does not decrease within 500 iterations, the learning rate is decreased tenfold, down to 0.0001. When the learning rate is 0.0001 and the loss still does not change after 500 iterations, the learning rate is reset to 0.1. This process is repeated seven times, and the model with the lowest validation loss during the training process is selected for comparison.

Table 1. Performance of different methods.*

                                                        Dice                                        Hausdorff
Method                                                  Esophagus  Heart     Trachea   Aorta        Esophagus  Heart     Trachea   Aorta
Dense V-net (resize sampling)                           0.588862   0.906035  0.772924  0.780659     1.531403   0.598427  1.783999  0.997311
Dense V-net (balanced sampling)                         0.746470   0.937633  0.875301  0.914082     1.153503   0.221647  1.726525  0.402991
Dense V-net (balanced sampling and average filter)      0.490914   0.914966  0.589199  0.840300     3.246483   0.292705  2.417643  1.066558
Dense V-net (balanced sampling and organ enlargement)   0.486919   0.913697  0.575745  0.841042     4.128935   0.817668  5.587061  1.581914
7 Dense V-net fusion                                    0.763881   0.940254  0.883234  0.915550     0.771958   0.188203  0.597479  0.308775
7 Dense V-net fusion (1D denoise)                       0.763973   0.940255  0.885504  0.915673     0.766507   0.188183  0.330171  0.295968
7 Dense V-net fusion (3D denoise)                       0.765423   0.940225  0.885614  0.915954     0.661974   0.188183  0.325847  0.258024
7 Dense V-net fusion (3D denoise and weighted loss)     0.773450   0.941403  0.892730  0.923325     0.640093   0.182138  0.307711  0.235788
* "Dense V-net" denotes the simplified Dense V-net throughout.
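The "7 Dense V-net fusion" rows in Table 1 combine the predictions of the seven saved models. The paper does not spell out the fusion rule; averaging the per-class probability maps and taking the per-voxel argmax is one plausible reading, sketched here as a NumPy illustration with a hypothetical `fuse_predictions` helper.

```python
import numpy as np

def fuse_predictions(prob_maps):
    """Average the class-probability maps of several models and pick
    the most likely class per voxel.

    prob_maps: list of arrays, each shaped (..., num_classes), one per
    saved model. Returns the fused label map."""
    mean_probs = np.mean(prob_maps, axis=0)   # average over models
    return np.argmax(mean_probs, axis=-1)     # per-voxel class decision
```

Averaging probabilities rather than hard labels lets a confident minority model outvote several uncertain ones, which is one reason ensembles of checkpoints from different learning-rate cycles tend to beat any single member.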
In addition, we pick the parameters with the minimum validation-set loss in each training cycle, seven models in total, and fuse the results for comparison [9]; see Algorithm 2. Table 1 shows the results with different settings.

Overall, the fusion results are much better than the single-model predictions, and the denoising postprocessing further improves the accuracy. Heart and Aorta have much better segmentation results than Esophagus and Trachea.

5. CONCLUSION

Based on the analysis of the training data, we simplified the Dense V-net to perform multi-organ segmentation effectively. We use a variety of optimization techniques such as multi-scale prediction, data augmentation, and data postprocessing to improve the stability and performance of the model. Compared to the SM+CRF baseline model [5], the Dice rate of organ segmentation is improved by up to 10%. After our optimization, there is still room for improvement for small organs, and delineation algorithms could help to refine organ boundaries.

6. REFERENCES

[1] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[2] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi, "V-net: Fully convolutional neural networks for volumetric medical image segmentation," in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565–571.

[3] Eli Gibson, Francesco Giganti, Yipeng Hu, Ester Bonmati, Steve Bandula, Kurinchi Gurusamy, Brian Davidson, Stephen P Pereira, Matthew J Clarkson, and Dean C Barratt, "Automatic multi-organ segmentation on abdominal CT with dense v-networks," IEEE Transactions on Medical Imaging, vol. 37, no. 8, pp. 1822–1834, 2018.

[4] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.

[5] Roger Trullo, Caroline Petitjean, Su Ruan, Bernard Dubray, D Nie, and D Shen, "Segmentation of organs at risk in thoracic CT images using a sharpmask architecture and conditional random fields," in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017). IEEE, 2017, pp. 1003–1006.

[6] Eli Gibson, Wenqi Li, Carole Sudre, Lucas Fidon, Dzhoshkun I Shakir, Guotai Wang, Zach Eaton-Rosen, Robert Gray, Tom Doel, Yipeng Hu, et al., "Niftynet: a deep-learning platform for medical imaging," Computer Methods and Programs in Biomedicine, vol. 158, pp. 113–122, 2018.

[7] Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso, "Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 240–248. Springer, 2017.

[8] Fabian Isensee, Jens Petersen, Andre Klein, David Zimmerer, Paul F Jaeger, Simon Kohl, Jakob Wasserthal, Gregor Koehler, Tobias Norajitra, Sebastian Wirkert, et al., "nnU-net: Self-adapting framework for U-net-based medical image segmentation," arXiv preprint arXiv:1809.10486, 2018.

[9] Leslie N Smith, "Cyclical learning rates for training neural networks," in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017, pp. 464–472.