A 2D DILATED RESIDUAL U-NET FOR MULTI-ORGAN SEGMENTATION IN THORACIC CT

Sulaiman Vesal, Nishant Ravikumar, Andreas Maier
Pattern Recognition Lab, Friedrich-Alexander-University Erlangen-Nuremberg, Germany
(Thanks to EFI-Erlangen for funding.)

ABSTRACT

Automatic segmentation of organs-at-risk (OARs) in computed tomography (CT) is an essential part of planning effective treatment strategies to combat lung and esophageal cancer. Accurate segmentation of the organs surrounding tumours helps account for the variation in position and morphology inherent across patients, thereby facilitating adaptive and computer-assisted radiotherapy. Although manual delineation of OARs is still highly prevalent, it is prone to errors due to complex variations in the shape and position of organs across patients, and due to the low soft-tissue contrast between neighbouring organs in CT images. Recently, deep convolutional neural networks (CNNs) have gained tremendous traction and achieved state-of-the-art results in medical image segmentation. In this paper, we propose a deep learning framework to segment OARs in thoracic CT images, specifically the heart, esophagus, trachea and aorta. Our approach employs dilated convolutions and aggregated residual connections in the bottleneck of a U-Net styled network, which incorporates both global context and dense information. Our method achieved an overall Dice score of 91.57% on 20 unseen test samples from the ISBI 2019 SegTHOR challenge.

Index Terms: Thoracic Organs, Convolutional Neural Network, Dilated Convolutions, 2D Segmentation

1. INTRODUCTION

Organs at risk (OARs) are structures surrounding tumours that are at risk of damage during radiotherapy treatment [1]. Accurate segmentation of OARs is crucial for efficient planning of radiation therapy, a fundamental part of treating different types of cancer. However, manual segmentation of OARs in computed tomography (CT) images for structural analysis is very time-consuming, susceptible to manual errors, and subject to inter-rater differences [1][2]. Soft-tissue structures in CT images normally have very little contrast, particularly in the case of the esophagus. Consequently, an automatic approach to OAR segmentation is imperative for improved radiotherapy treatment planning, delivery and overall patient prognosis. Such a framework would also assist radiation oncologists in delineating OARs more accurately, consistently, and efficiently. Several studies have addressed automatic segmentation of OARs in CT images, with efforts focused mainly on the pelvic, head and neck areas [1][2][3].

Fig. 1. Example of OARs in CT images with axial, sagittal and coronal views, and a 3D surface mesh plot.

In this paper, we propose a fully automatic 2D segmentation approach for the esophagus, heart, aorta, and trachea in CT images of patients diagnosed with lung cancer. Accurate multi-organ segmentation requires the incorporation of both local and global information. Consequently, we modified the original 2D U-Net [4], using dilated convolutions [5] in the lowest layer of the encoder branch to extract features spanning a wider spatial range. Additionally, we added residual connections between the convolution layers in the encoder branch of the network, to better incorporate multi-scale image information and ensure a smoother flow of gradients in the backward pass.

2. METHODS

Segmentation tasks generally benefit from incorporating local and global contextual information. In a conventional U-Net [4], however, the lowest level of the network has a relatively small receptive field, which prevents the network from extracting features that capture non-local information. Hence, the network may lack the information necessary to recognize boundaries between adjacent organs, the connected nature of specific organs, and other properties that require greater global context to be included within the learning process. Dilated convolutions [5] provide a suitable solution to this problem. They introduce an additional parameter, the dilation rate, to convolution layers, which defines the spacing between the weights in a kernel. This dilates the kernel such that a 3×3 kernel with a dilation rate of 2 has a receptive field equal to that of a 5×5 kernel, and with a dilation rate of 3, equal to that of a 7×7 kernel. This is achieved without any increase in complexity, as the number of parameters associated with the kernel remains the same.

We propose a 2D U-Net+DR (refer to Fig. 2), a network inspired by our previous studies [6][7]. It comprises four downsampling and four upsampling convolution blocks within the encoder and decoder branches, respectively. In contrast to our previous approaches, here we employ a 2D version (rather than 3D) of the network with greater depth, because of the limited number of training samples. In each block, we use two convolutions with a kernel size of 3×3 pixels, each followed by batch normalization and a rectified linear unit (ReLU) activation, with a subsequent max-pooling operation. Image dimensions are preserved across the encoder-decoder branches by zero-padding the estimated feature maps following each convolution, which enables corresponding feature maps to be concatenated between the branches. A softmax activation function is used in the last layer to produce five probability maps that distinguish the background from the foreground labels. Furthermore, to improve the flow of gradients in the backward pass of the network, the convolution layers in the encoder branch were replaced with residual convolution layers: in each encoder block, the input to the first convolution layer is concatenated with the output of the second convolution layer (red arrow in Fig. 2), and the subsequent 2D max-pooling layer halves the spatial dimensions. The bottleneck between the branches employs four dilated convolutions with dilation rates 1-4, whose outputs are summed and provided as input to the decoder branch.

Fig. 2. Block diagram of the 2D U-Net+DR architecture for thoracic OAR segmentation. The left side shows the encoding part and the right side the decoding part. The network has four dilated convolutions in the bottleneck and a residual connection in each encoder block (shown as red arrows).
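The two building blocks described above can be sketched in Keras as follows. This is a minimal illustration rather than the authors' released code: the filter counts are placeholders, and since the text does not state whether the four dilated convolutions operate in cascade or in parallel, the sketch cascades them and sums the intermediate outputs.

```python
# Minimal sketch (assumed filter counts and layer ordering) of the
# residual encoder block and dilated bottleneck of 2D U-Net+DR.
from tensorflow.keras import layers

def encoder_block(x, filters):
    """Two 3x3 conv + BN + ReLU layers; the block input is concatenated
    with the second conv's output (the red residual arrow in Fig. 2)."""
    skip = x
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    x = layers.Concatenate()([skip, x])           # residual concatenation
    pooled = layers.MaxPooling2D(pool_size=2)(x)  # halves spatial dims
    return x, pooled                              # x feeds the skip connection

def dilated_bottleneck(x, filters):
    """Four 3x3 dilated convolutions with rates 1-4. Cascading them (an
    assumption here) and summing all intermediate outputs aggregates
    features from several receptive-field sizes."""
    outputs = []
    for rate in (1, 2, 3, 4):
        x = layers.Conv2D(filters, 3, padding="same",
                          dilation_rate=rate, activation="relu")(x)
        outputs.append(x)
    return layers.Add()(outputs)
```

Summing, rather than concatenating, the four outputs keeps the channel count fixed while aggregating features with effective kernel sizes from 3×3 (rate 1) up to 9×9 (rate 4).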
2.1. Dataset and Materials

The ISBI SegTHOR challenge organizers (https://competitions.codalab.org/competitions/21012) provided computed tomography (CT) images from the medical records of 60 patients. The CT scans are 512×512 pixels in size, with an in-plane resolution varying between 0.90 mm and 1.37 mm per pixel. The number of slices varies from 150 to 284, with a z-resolution between 2 mm and 3.7 mm. The most common resolution is 0.98×0.98×2.5 mm³. The SegTHOR dataset (60 patients) was randomly split into a training set of 40 patients (7390 slices) and a testing set of 20 patients (3694 slices). The ground truth for the OARs was delineated by an experienced radiation oncologist [2].

2.2. Pre-Processing

Due to the low contrast of most CT volumes in the SegTHOR dataset, we enhanced the contrast slice-by-slice using contrast limited adaptive histogram equalization (CLAHE), and normalized each volume with respect to its mean and standard deviation. In order to retain just the region of interest (ROI), i.e. the body and its anatomical structures, as the input to our network, each volume was center-cropped to a size of 288×288 along the x and y axes, while all slices along z were retained. We trained the model using the provided training samples via five-fold cross-validation (each fold comprising 32 subjects for training and 8 subjects for validation). Moreover, we applied off-line augmentation to increase the number of subjects within the training set, by flipping the volumes horizontally and vertically.
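A minimal sketch of this pre-processing pipeline is shown below, using scikit-image for CLAHE; the exact CLAHE implementation and clip limit used by the authors are not stated, so those values are illustrative assumptions.

```python
# Sketch of the pre-processing described above: slice-wise CLAHE,
# volume-wise z-score normalisation, and a 288x288 centre crop.
import numpy as np
from skimage import exposure

def preprocess_volume(volume, crop=288):
    """volume: (slices, 512, 512) CT array -> (slices, crop, crop)."""
    # Scale to [0, 1] so equalize_adapthist accepts the input.
    vol = (volume - volume.min()) / (volume.max() - volume.min() + 1e-8)
    # Slice-by-slice CLAHE; clip_limit=0.01 is an assumed default.
    vol = np.stack([exposure.equalize_adapthist(s, clip_limit=0.01)
                    for s in vol])
    # Normalise the whole volume by its mean and standard deviation.
    vol = (vol - vol.mean()) / (vol.std() + 1e-8)
    # Centre-crop in-plane, keeping all slices along z.
    h, w = vol.shape[1:]
    top, left = (h - crop) // 2, (w - crop) // 2
    return vol[:, top:top + crop, left:left + crop]
```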
2.3. Loss Function

In order to train our model, we formulated a modified version of the soft Dice loss [8] for multiclass segmentation, in which the Dice loss for each class is first computed individually and then averaged over the number of classes. Suppose that, for the segmentation of an N×N input image (a CT slice with esophagus, heart, aorta, trachea and background as labels), the outputs are five probability maps for the classes k = 0, 1, 2, 3, 4, such that \sum_k \hat{y}_{n,k} = 1 for each pixel n. Correspondingly, if y_{n,k} is the one-hot encoded ground truth of that pixel, the multiclass soft Dice loss is defined as:

\zeta_{dc}(y, \hat{y}) = 1 - \frac{1}{K} \sum_{k} \frac{2 \sum_{n} y_{nk} \hat{y}_{nk}}{\sum_{n} y_{nk} + \sum_{n} \hat{y}_{nk}}    (1)

In Eq. (1), \hat{y}_{nk} denotes the output of the model, where n indexes the pixels and k the K = 5 classes. The ground truth labels are denoted by y_{nk}.

Furthermore, in the second stage of training (described in detail in the next section), we used the Tversky loss (TL) [9], as the multiclass Dice loss does not incorporate a weighting mechanism for classes with fewer pixels. The TL is defined as:

TL(y, \hat{y}) = 1 - \frac{\sum_{n} y_{nk} \hat{y}_{nk}}{\sum_{n} y_{nk} \hat{y}_{nk} + \alpha \sum_{n} (1 - y_{nk}) \hat{y}_{nk} + \beta \sum_{n} y_{nk} (1 - \hat{y}_{nk})}    (2)

By adjusting the hyper-parameters \alpha and \beta in Eq. (2), we can control the trade-off between false positives and false negatives. In our experiments, we set both \alpha and \beta to 0.5. Training with this loss for additional epochs improved the segmentation accuracy on the validation set as well as on the SegTHOR test set, compared to training with the multiclass Dice loss alone.
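Both losses can be written compactly in TensorFlow/Keras. The sketch below assumes one-hot targets of shape (batch, H, W, 5) with channels last; the smoothing constant eps is an addition for numerical stability and is not part of Eqs. (1)-(2).

```python
# Sketch of the multiclass soft Dice loss, Eq. (1), and the Tversky
# loss, Eq. (2). Per-class terms are averaged over the classes.
import tensorflow as tf

def multiclass_soft_dice_loss(y_true, y_pred, eps=1e-6):
    axes = (0, 1, 2)                    # sum over batch and pixels n
    intersect = tf.reduce_sum(y_true * y_pred, axis=axes)
    denom = tf.reduce_sum(y_true + y_pred, axis=axes)
    dice_per_class = (2.0 * intersect + eps) / (denom + eps)
    return 1.0 - tf.reduce_mean(dice_per_class)   # average over k

def tversky_loss(y_true, y_pred, alpha=0.5, beta=0.5, eps=1e-6):
    axes = (0, 1, 2)
    tp = tf.reduce_sum(y_true * y_pred, axis=axes)            # true pos.
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=axes)    # false pos.
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=axes)    # false neg.
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - tf.reduce_mean(tversky)
```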
2.4. Model Training

The adaptive moment estimation (ADAM) optimizer was used to estimate the network parameters throughout, with the first- and second-moment estimates set to 0.9 and 0.999, respectively. The learning rate was initialized to 0.0001, with a decay factor of 0.2 during training. Validation accuracy was evaluated after each epoch until it stopped increasing; the best performing model was then selected for evaluation on the test set. We first trained our model using five-fold cross-validation without any online data augmentation, using only the multiclass Dice loss function. In the second stage, in order to improve the segmentation accuracy, we loaded the weights from the first stage and trained the model with random online data augmentation (zooming, rotation, shifting, shearing, and cropping) for 50 additional epochs. This led to a significant performance improvement on the SegTHOR test data. As the multiclass Dice loss does not account for class imbalance, we further improved the second stage of training by employing the TL in its place. Consequently, the highest accuracy achieved by our approach employed the TL along with online data augmentation. The network was implemented using Keras, an open-source deep learning library for Python, and was trained on an NVIDIA Titan X-Pascal GPU with 3840 CUDA cores and 12 GB RAM.

On the test dataset, we observed that our model predicted small structures in implausible locations. This was addressed by post-processing the segmentations to retain only the largest connected component for each structure. As the segmentations predicted by our network were already of good quality, this only led to marginal improvements in the average Dice score, of approximately 0.002. However, it substantially reduced the average Hausdorff distance, which is very sensitive to outliers.
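The connected-component filtering can be implemented with scipy.ndimage, as in the sketch below. This is a minimal illustration of the step described above, assuming an integer label volume with background label 0; the function name and the default face connectivity are our choices, not necessarily the authors'.

```python
# Sketch of the post-processing: for each foreground label, keep only
# the largest 3D connected component and discard the rest.
import numpy as np
from scipy import ndimage

def keep_largest_component(label_volume, num_classes=5):
    """label_volume: integer array of per-voxel class labels (0 = background)."""
    cleaned = np.zeros_like(label_volume)
    for cls in range(1, num_classes):
        mask = label_volume == cls
        components, n = ndimage.label(mask)   # face-connectivity labelling
        if n == 0:
            continue
        sizes = ndimage.sum(mask, components, range(1, n + 1))
        largest = np.argmax(sizes) + 1        # component ids start at 1
        cleaned[components == largest] = cls
    return cleaned
```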
2.5. Evaluation Metrics

Two standard evaluation metrics are used to assess segmentation accuracy: the Dice similarity coefficient (DSC) and the Hausdorff distance (HD). The DSC, also known as the F1-score, measures the similarity/overlap between the manual and automatic segmentations. It is the most widely used metric for evaluating segmentation accuracy and is defined as:

DSC(G, P) = \frac{2TP}{FP + 2TP + FN} = \frac{2|G \cap P|}{|G| + |P|}    (3)

The HD is defined as the largest of the distances from the points in one set to their closest points in the other set:

HD(G, P) = \max \left\{ \max_{g \in G} \min_{p \in P} \|g - p\|_2, \; \max_{p \in P} \min_{g \in G} \|p - g\|_2 \right\}    (4)

In Eqs. (3) and (4), G and P represent the ground truth and predicted segmentations, respectively.

3. RESULTS AND DISCUSSION

The average DSC and HD measures achieved by the 2D U-Net+DR across the five-fold cross-validation experiments are summarized in Table 1. The DSC scores achieved by the 2D U-Net+DR without data augmentation were 93.61% on the validation set and 88.69% on the test set. The same network with online data augmentation significantly improved the segmentation accuracy, to 94.53% and 91.43% for the validation and test sets, respectively. Finally, on fine-tuning the trained network using the TL, we achieved DSC scores of 94.59% and 91.57%, respectively. Compared to [2], our method achieved DSC and HD scores of 85.67% and 0.30 mm for the esophagus, the most difficult OAR to segment. Table 2 lists the DSC and HD scores for each individual organ, for the 2D U-Net+DR method with online augmentation and TL, on the test dataset.

Table 1. The DSC and HD scores for the training, validation and test datasets.

Methods                           Train DSC   Validation DSC   Test DSC   Test HD [mm]
2D U-Net + DR                     0.9784      0.9361           0.8869     0.4461
2D U-Net + DR (Augmented)         0.9741      0.9453           0.9143     0.2536
2D U-Net + DR (Augmented) + TL    0.9749      0.9459           0.9157     0.2500

The images presented in Fig. 3 allow a visual assessment of the segmentation quality of the proposed method on three test volumes. Here, green represents the heart, while red, blue and yellow represent the esophagus, trachea, and aorta, respectively. We obtained the best average DSC values for the heart and aorta, owing to their high contrast, regular shape, and larger size compared to the other organs. As expected, the esophagus had the lowest average DSC and the largest HD, due to its irregular shape and low contrast, which make it difficult to identify within CT volumes. Nevertheless, our method achieved a DSC score of 85.8% for the esophagus on the test dataset, outperforming the method proposed in [2], which used a SharpMask architecture and conditional random fields. These results highlight the effectiveness of the proposed approach for segmenting OARs, which is essential for radiation therapy planning.

Table 2. The DSC and HD scores of each organ for the 2D U-Net+DR (Augmented) + TL method.

Metrics     Esophagus   Heart    Trachea   Aorta
DSC         0.858       0.941    0.926     0.938
HD [mm]     0.331       0.226    0.193     0.297

Fig. 3. 3D surface segmentation outputs of the proposed model for three subjects from the ISBI SegTHOR challenge test set.

4. CONCLUSIONS

In this study, we presented a fully automated approach, called 2D U-Net+DR, for the segmentation of OARs (esophagus, heart, aorta, and trachea) in CT volumes. Our approach provides accurate and reproducible segmentations, thereby helping to improve the consistency and robustness of OAR delineation relative to manual segmentation. The method uses both local and global information, by expanding the receptive field in the lowest level of the network using dilated convolutions. The two-stage training strategy employed here, together with the multiclass soft Dice loss and the Tversky loss, significantly improved the segmentation accuracy. Furthermore, we believe that including additional information, e.g. MR images, may be beneficial for OARs with poorly visible boundaries, such as the esophagus.

5. REFERENCES

[1] B. Ibragimov and L. Xing, "Segmentation of organs-at-risks in head and neck CT images using convolutional neural networks," Medical Physics, vol. 44, no. 2, pp. 547-557, 2017.

[2] R. Trullo, C. Petitjean, S. Ruan, B. Dubray, D. Nie, and D. Shen, "Segmentation of organs at risk in thoracic CT images using a SharpMask architecture and conditional random fields," in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), April 2017, pp. 1003-1006.

[3] S. Kazemifar, A. Balagopal, D. Nguyen, S. McGuire, R. Hannan, S. Jiang, and A. Owrangi, "Segmentation of the prostate and organs at risk in male pelvic CT images using deep learning," Biomedical Physics & Engineering Express, vol. 4, no. 5, p. 055003, Jul. 2018.

[4] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, 2015, pp. 234-241.

[5] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," CoRR, vol. abs/1511.07122, 2015.

[6] S. Vesal, N. Ravikumar, and A. Maier, "Dilated convolutions in neural networks for left atrial segmentation in 3D gadolinium enhanced-MRI," in STACOM: Atrial Segmentation and LV Quantification Challenges, 2019, pp. 319-328.

[7] L. Folle, S. Vesal, N. Ravikumar, and A. Maier, "Dilated deeply supervised networks for hippocampus segmentation in MRI," in Bildverarbeitung für die Medizin 2019, 2019, pp. 68-73.

[8] F. Milletari, N. Navab, and S. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," in 2016 Fourth International Conference on 3D Vision (3DV), Oct. 2016, pp. 565-571.

[9] S. Salehi, D. Erdogmus, and A. Gholipour, "Tversky loss function for image segmentation using 3D fully convolutional deep networks," in Machine Learning in Medical Imaging, 2017, pp. 379-387.