=Paper=
{{Paper
|id=Vol-2349/paper-12
|storemode=property
|title=A 2D dilated residual U-Net for multi-organ segmentation in thoracic CT
|pdfUrl=https://ceur-ws.org/Vol-2349/SegTHOR2019_paper_13.pdf
|volume=Vol-2349
|dblpUrl=https://dblp.org/rec/conf/isbi/VesalRM19
}}
==A 2D dilated residual U-Net for multi-organ segmentation in thoracic CT==
Sulaiman Vesal, Nishant Ravikumar, Andreas Maier
Pattern Recognition Lab, Friedrich-Alexander-University Erlangen-Nuremberg, Germany
ABSTRACT

Automatic segmentation of organs-at-risk (OAR) in computed tomography (CT) is an essential part of planning effective treatment strategies to combat lung and esophageal cancer. Accurate segmentation of the organs surrounding tumours helps account for the variation in position and morphology inherent across patients, thereby facilitating adaptive and computer-assisted radiotherapy. Although manual delineation of OARs is still highly prevalent, it is prone to errors, due to the complex variations in organ shape and position across patients and the low soft-tissue contrast between neighbouring organs in CT images. Recently, deep convolutional neural networks (CNNs) have gained tremendous traction and achieved state-of-the-art results in medical image segmentation. In this paper, we propose a deep learning framework to segment OARs in thoracic CT images, specifically the heart, esophagus, trachea and aorta. Our approach employs dilated convolutions and aggregated residual connections in the bottleneck of a U-Net styled network, which incorporates both global context and dense information. Our method achieved an overall Dice score of 91.57% on 20 unseen test samples from the ISBI 2019 SegTHOR challenge.

Fig. 1. Example of OARs in CT images, with axial, sagittal and coronal views and a 3D surface mesh plot.

Index Terms— Thoracic Organs, Convolutional Neural Network, Dilated Convolutions, 2D Segmentation
1. INTRODUCTION

Organs at risk (OAR) are the structures surrounding a tumour that are at risk of damage during radiotherapy treatment [1]. Accurate segmentation of OARs is crucial for the efficient planning of radiation therapy, a fundamental part of treating many types of cancer. However, manual segmentation of OARs in computed tomography (CT) images for structural analysis is very time-consuming, susceptible to manual errors, and subject to inter-rater differences [1][2]. Soft-tissue structures in CT images normally have very little contrast, particularly in the case of the esophagus. Consequently, an automatic approach to OAR segmentation is imperative for improved radiotherapy treatment planning, delivery and overall patient prognosis. Such a framework would also assist radiation oncologists in delineating OARs more accurately, consistently, and efficiently. Several studies have addressed automatic segmentation of OARs in CT images, with efforts focused mainly on the pelvic, and head and neck, regions [1][2][3].

In this paper, we propose a fully automatic 2D segmentation approach for the esophagus, heart, aorta, and trachea in CT images of patients diagnosed with lung cancer. Accurate multi-organ segmentation requires the incorporation of both local and global information. Consequently, we modified the original 2D U-Net [4], using dilated convolutions [5] in the lowest layer of the encoder branch to extract features spanning a wider spatial range. Additionally, we added residual connections between the convolution layers in the encoder branch of the network, to better incorporate multi-scale image information and to ensure a smoother flow of gradients in the backward pass.

(Thanks to EFI-Erlangen for funding.)

2. METHODS

Segmentation tasks generally benefit from incorporating both local and global contextual information. In a conventional U-Net [4], however, the lowest level of the network has a relatively
small receptive field, which prevents the network from extracting features that capture non-local information. Hence, the network may lack the information necessary to recognize the boundaries between adjacent organs, the connected nature of specific organs, and other properties that require greater global context within the learning process. Dilated convolutions [5] provide a suitable solution to this problem. They introduce an additional parameter, the dilation rate, which defines the spacing between the weights of a kernel. This dilates the kernel, such that a 3×3 kernel with a dilation rate of 2 covers the same spatial extent as a 5×5 kernel. Additionally, this is achieved without any increase in complexity, as the number of parameters associated with the kernel remains the same.

We propose a 2D U-Net+DR network (refer to Fig. 2), inspired by our previous studies [6][7]. It comprises four downsampling and four upsampling convolution blocks within the encoder and decoder branches, respectively. In contrast to our previous approaches, here we employ a 2D version (rather than 3D) of the network, with greater depth, because of the limited number of training samples. In each block, we use two convolutions with a kernel size of 3×3 pixels, with batch normalization, rectified linear units (ReLUs) as activation functions, and a subsequent max-pooling operation. Image dimensions are preserved between the encoder and decoder branches by zero-padding the estimated feature maps, which enables the corresponding feature maps to be concatenated between the branches. A softmax activation function is used in the last layer to produce five probability maps, distinguishing the background from the foreground labels. Furthermore, to improve the flow of gradients in the backward pass, the convolution layers in the encoder branch were replaced with residual convolution layers: in each encoder convolution block, the input to the first convolution layer is concatenated with the output of the second convolution layer (red arrows in Fig. 2), and the subsequent 2D max-pooling layer halves the spatial dimensions. The bottleneck between the branches employs four dilated convolutions, with dilation rates 1−4, whose outputs are summed and provided as input to the decoder branch.

Fig. 2. Block diagram of the 2D U-Net+DR architecture for thoracic OAR segmentation. The left side shows the encoding path and the right side the decoding path. The network has four dilated convolutions in the bottleneck and a residual connection in each encoder block (shown as red arrows).
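As a sanity check on the receptive-field arithmetic described above, the effective extent of a dilated kernel can be computed directly. This is a minimal sketch of the standard formula for dilated convolutions, not code from the paper:

```python
def effective_kernel_size(k: int, d: int) -> int:
    """Spatial extent covered by a k x k kernel with dilation rate d.

    The d - 1 implicit zeros between adjacent kernel weights stretch
    the kernel without adding any learnable parameters.
    """
    return k + (k - 1) * (d - 1)

# 3x3 kernels at the dilation rates 1-4 used in the bottleneck:
for rate in range(1, 5):
    print(rate, effective_kernel_size(3, rate))
```

With a dilation rate of 2 a 3×3 kernel spans a 5×5 extent; a rate of 3 is needed to span 7×7, and rate 4 (the largest used in the bottleneck) spans 9×9.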
2.1. Dataset and Materials

The ISBI SegTHOR challenge (https://competitions.codalab.org/competitions/21012) organizers provided computed tomography (CT) images from the medical records of 60 patients. The CT scans are 512 × 512 pixels in size, with an in-plane resolution varying between 0.90 mm and 1.37 mm per pixel. The number of slices varies from 150 to 284, with a z-resolution of between 2 mm and 3.7 mm. The most common resolution is 0.98 × 0.98 × 2.5 mm³. The SegTHOR dataset (60 patients) was randomly split into a training set of 40 patients (7,390 slices) and a testing set of 20 patients (3,694 slices). The ground truth for the OARs was delineated by an experienced radiation oncologist [2].
2.2. Pre-Processing

Due to the low contrast of most CT volumes in the SegTHOR dataset, we enhanced the contrast slice-by-slice using contrast limited adaptive histogram equalization (CLAHE), and normalized each volume with respect to its mean and standard deviation. In order to retain just the region of interest (ROI), i.e. the body and its anatomical structures, as the input to our network, each volume was center-cropped to a size of 288×288 along the x and y axes, while all slices along z were retained. We trained the model using the provided training samples via five-fold cross-validation (each fold comprising 32 subjects for training and 8 subjects for validation). Moreover, we applied offline augmentation to increase the number of subjects within the training set, by flipping the volumes horizontally and vertically.
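The cropping, normalization and flip-augmentation steps can be sketched in NumPy as follows. This is a simplified illustration: the CLAHE step is omitted (in practice an off-the-shelf routine such as `skimage.exposure.equalize_adapthist` would be applied slice-by-slice first), and the array shapes are assumptions.

```python
import numpy as np

def preprocess_volume(vol: np.ndarray, crop: int = 288) -> np.ndarray:
    """Center-crop each slice to crop x crop and z-score normalize.

    vol has shape (slices, height, width); all slices along z are kept.
    """
    _, h, w = vol.shape
    y0, x0 = (h - crop) // 2, (w - crop) // 2
    vol = vol[:, y0:y0 + crop, x0:x0 + crop].astype(np.float32)
    return (vol - vol.mean()) / (vol.std() + 1e-8)

def flip_augment(vol: np.ndarray) -> list:
    """Offline augmentation: the original plus horizontal and vertical flips."""
    return [vol, vol[:, :, ::-1], vol[:, ::-1, :]]
```

Normalization here is per volume, matching the description above; per-slice normalization would be a one-line change to the `mean`/`std` axes.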
2.3. Loss Function

In order to train our model, we formulated a modified version of the soft Dice loss [8] for multiclass segmentation, in which the Dice loss for each class is first computed individually and then averaged over the number of classes. Suppose that, for the segmentation of an N×N input image (a CT slice with esophagus, heart, aorta, trachea and background as labels), the outputs are five probabilities, with classes k = 0, 1, 2, 3, 4, such that Σ_k ŷ_{n,k} = 1 for each pixel n. Correspondingly, if y_{n,k} is the one-hot encoded ground truth of that pixel, then the multiclass soft Dice loss over the K = 5 classes is defined as:

    ζ_dc(y, ŷ) = 1 − (1/K) Σ_k [ (2 Σ_n y_{n,k} ŷ_{n,k}) / (Σ_n y_{n,k} + Σ_n ŷ_{n,k}) ]    (1)

In Eq. (1), ŷ_{n,k} denotes the output of the model, where n indexes the pixels and k the classes. The ground truth labels are denoted by y_{n,k}.
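A minimal NumPy version of this multiclass soft Dice loss may help make the reduction order concrete (an illustrative sketch only; the paper's implementation is in Keras, and the flattened tensor shapes used here are an assumption):

```python
import numpy as np

def multiclass_soft_dice_loss(y_true: np.ndarray, y_pred: np.ndarray,
                              eps: float = 1e-7) -> float:
    """Per-class soft Dice, averaged over classes, as in Eq. (1).

    y_true: one-hot ground truth of shape (n_pixels, K).
    y_pred: softmax probabilities of shape (n_pixels, K).
    """
    intersection = (y_true * y_pred).sum(axis=0)            # sum over pixels n
    denominator = y_true.sum(axis=0) + y_pred.sum(axis=0)
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return float(1.0 - dice_per_class.mean())               # average over classes k
```

A perfect prediction drives the loss to zero, while the small epsilon keeps classes that are absent from a slice from producing a division by zero.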
Furthermore, in the second stage of training (described in detail in the next section), we used the Tversky loss (TL) [9], as the multiclass Dice loss does not incorporate a weighting mechanism for classes with fewer pixels. For each class k, the TL is defined as:

    TL(y, ŷ) = 1 − (Σ_n y_{n,k} ŷ_{n,k}) / (Σ_n y_{n,k} ŷ_{n,k} + α Σ_n (1 − y_{n,k}) ŷ_{n,k} + β Σ_n y_{n,k} (1 − ŷ_{n,k}))    (2)

By adjusting the hyper-parameters α and β in Eq. (2), we can control the trade-off between false positives and false negatives. In our experiments, we set both α and β to 0.5. Training with this loss for additional epochs improved the segmentation accuracy on the validation set, as well as on the SegTHOR test set, compared to training with the multiclass Dice loss alone.
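A NumPy sketch of the per-class Tversky loss, averaged over classes, may clarify the roles of α and β (illustrative only; the one-hot tensor shapes are an assumption):

```python
import numpy as np

def tversky_loss(y_true: np.ndarray, y_pred: np.ndarray,
                 alpha: float = 0.5, beta: float = 0.5,
                 eps: float = 1e-7) -> float:
    """Per-class Tversky loss averaged over classes, as in Eq. (2).

    alpha weights false positives and beta weights false negatives;
    y_true is one-hot with shape (n_pixels, K), y_pred holds probabilities.
    """
    tp = (y_true * y_pred).sum(axis=0)           # true positives per class
    fp = ((1.0 - y_true) * y_pred).sum(axis=0)   # false positives per class
    fn = (y_true * (1.0 - y_pred)).sum(axis=0)   # false negatives per class
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return float(1.0 - tversky.mean())
```

With α = β = 0.5 the denominator becomes tp + (fp + fn)/2, so the Tversky index reduces to the Dice coefficient, matching the setting used in the experiments; raising β penalizes false negatives on under-represented classes more heavily.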
2.4. Model Training

The adaptive moment estimation (Adam) optimizer was used to estimate the network parameters throughout, with the exponential decay rates for the first- and second-moment estimates set to 0.9 and 0.999, respectively. The learning rate was initialized to 0.0001, with a decay factor of 0.2 during training. Validation accuracy was evaluated after each epoch during training, until it stopped increasing; subsequently, the best-performing model was selected for evaluation on the test set. We first trained our model using five-fold cross-validation, without any online data augmentation and using only the multiclass Dice loss. In the second stage, in order to improve the segmentation accuracy, we loaded the weights from the first stage and trained the model with random online data augmentation (zooming, rotation, shifting, shearing, and cropping) for 50 additional epochs. This led to a significant performance improvement on the SegTHOR test data. As the multiclass Dice loss does not account for class imbalance, we further improved the second stage of training by employing the TL in its place; consequently, the highest accuracy achieved by our approach employed the TL together with online data augmentation. The network was implemented using Keras, an open-source deep learning library for Python, and was trained on an NVIDIA Titan X-Pascal GPU with 3840 CUDA cores and 12 GB of RAM. On the test dataset, we observed that our model predicted small structures in implausible locations. This was addressed by post-processing the segmentations to retain only the largest connected component for each structure. As the segmentations predicted by our network were already of good quality, this only led to a marginal improvement in the average Dice score, of approximately 0.002. However, it substantially reduced the average Hausdorff distance, which is very sensitive to outliers.
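The largest-connected-component post-processing can be sketched as follows. This is a NumPy-only 2D illustration with 4-connectivity; the paper does not state which implementation was used, and in practice a library routine such as `scipy.ndimage.label` would be the usual choice:

```python
import numpy as np
from collections import deque

def largest_component(mask: np.ndarray) -> np.ndarray:
    """Keep only the largest 4-connected component of a binary 2D mask."""
    labels = np.zeros(mask.shape, dtype=int)
    sizes = {}
    current = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue
        current += 1
        sizes[current] = 0
        labels[seed] = current
        queue = deque([seed])
        while queue:                      # breadth-first flood fill
            y, x = queue.popleft()
            sizes[current] += 1
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    queue.append((ny, nx))
    if not sizes:
        return mask.copy()
    keep = max(sizes, key=sizes.get)      # label of the largest component
    return (labels == keep).astype(mask.dtype)
```

Applying this separately to each organ label discards small spurious predictions in implausible locations; the same idea extends to 3D volumes with 6-connectivity.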
2.5. Evaluation Metrics

Two standard evaluation metrics are used to assess segmentation accuracy, namely the Dice similarity coefficient (DSC) and the Hausdorff distance (HD). The DSC, also known as the F1-score, measures the similarity/overlap between the manual and automatic segmentations, and is the most widely used metric for evaluating segmentation accuracy. It is defined as:

    DSC(G, P) = 2TP / (FP + 2TP + FN) = 2|G ∩ P| / (|G| + |P|)    (3)

The HD is defined as the largest of the distances from the points in one set to their closest points in the other set. This is expressed as:

    HD(G, P) = max{ max_{g∈G} min_{p∈P} ‖g − p‖₂ , max_{p∈P} min_{g∈G} ‖g − p‖₂ }    (4)

In Eq. (3) and (4), G and P represent the ground truth and predicted segmentations, respectively.
3. RESULTS AND DISCUSSIONS

The average DSC and HD measures achieved by the 2D U-Net+DR across the five-fold cross-validation experiments are summarized in Table 1. The DSC scores achieved by the 2D U-Net+DR without data augmentation, on the validation and test sets, were 93.61% and 88.69%, respectively. The same network with online data augmentation significantly improved the segmentation accuracy, to 94.53% and 91.43% for the validation and test sets, respectively. Finally, on fine-tuning the trained network using the TL, we achieved DSC scores of 94.59% and 91.57%, respectively. Compared to [2], our method achieved DSC and HD scores of 85.67% and 0.30 mm for the esophagus, the most difficult OAR to segment. Table 2 lists the DSC and HD scores for each individual organ, for the 2D U-Net+DR method with online augmentation and TL, on the test dataset.

| Methods | Train DSC [%] | Validation DSC [%] | Test DSC [%] | Test HD [mm] |
| 2D U-Net + DR | 97.84 | 93.61 | 88.69 | 0.4461 |
| 2D U-Net + DR (Augmented) | 97.41 | 94.53 | 91.43 | 0.2536 |
| 2D U-Net + DR (Augmented) + TL | 97.49 | 94.59 | 91.57 | 0.2500 |

Table 1. DSC and HD scores on the training, validation and test datasets.

The images presented in Fig. 3 allow a visual assessment of the segmentation quality of the proposed method on three test volumes. Here, green represents the heart, and red, blue and yellow represent the esophagus, trachea, and aorta, respectively. We obtained the highest average DSC values for the heart and aorta, owing to their high contrast, regular shape, and larger size compared to the other organs. As expected, the esophagus had the lowest average DSC and the largest HD, due to its irregular shape and low contrast, which make it difficult to identify within CT volumes. However, our method achieved a DSC score of 85.8% for the esophagus on the test dataset, demonstrating better performance than the method proposed in [2], which used a SharpMask network architecture and conditional random fields. These results highlight the effectiveness of the proposed approach for segmenting OARs, which is essential for radiation therapy planning.

| Metrics | Esophagus | Heart | Trachea | Aorta |
| DSC [%] | 85.8 | 94.1 | 92.6 | 93.8 |
| HD [mm] | 0.331 | 0.226 | 0.193 | 0.297 |

Table 2. DSC and HD scores of each organ, for the 2D U-Net + DR (Augmented) + TL method.

Fig. 3. 3D surface segmentation outputs of the proposed model for three subjects from the ISBI SegTHOR challenge test set.

4. CONCLUSIONS

In this study, we presented a fully automated approach, called 2D U-Net+DR, for segmentation of the OARs (esophagus, heart, aorta, and trachea) in CT volumes. Our approach provides accurate and reproducible segmentations, thereby aiding in improving the consistency and robustness of OAR delineation relative to manual segmentation. The method uses both local and global information, by expanding the receptive field in the lowest level of the network using dilated convolutions. The two-stage training strategy employed here, together with the multiclass soft Dice loss and the Tversky loss, significantly improved the segmentation accuracy. Furthermore, we believe that including additional information, e.g. MR images, may be beneficial for OARs with poorly visible boundaries, such as the esophagus.

5. REFERENCES

[1] I. Bulat and L. Xing, "Segmentation of organs-at-risks in head and neck CT images using convolutional neural networks," Medical Physics, vol. 44, no. 2, pp. 547–557, 2017.

[2] R. Trullo, C. Petitjean, S. Ruan, B. Dubray, D. Nie, and D. Shen, "Segmentation of organs at risk in thoracic CT images using a SharpMask architecture and conditional random fields," in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), April 2017, pp. 1003–1006.

[3] S. Kazemifar, A. Balagopal, D. Nguyen, S. McGuire, R. Hannan, S. Jiang, and A. Owrangi, "Segmentation of the prostate and organs at risk in male pelvic CT images using deep learning," Biomedical Physics & Engineering Express, vol. 4, no. 5, p. 055003, July 2018.

[4] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 2015, pp. 234–241.

[5] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," CoRR, vol. abs/1511.07122, 2015.

[6] S. Vesal, N. Ravikumar, and A. Maier, "Dilated convolutions in neural networks for left atrial segmentation in 3D gadolinium enhanced-MRI," in STACOM: Atrial Segmentation and LV Quantification Challenges, 2019, pp. 319–328.

[7] L. Folle, S. Vesal, N. Ravikumar, and A. Maier, "Dilated deeply supervised networks for hippocampus segmentation in MRI," in Bildverarbeitung für die Medizin 2019, 2019, pp. 68–73.

[8] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," in 2016 Fourth International Conference on 3D Vision (3DV), Oct. 2016, pp. 565–571.

[9] S. Salehi, D. Erdogmus, and A. Gholipour, "Tversky loss function for image segmentation using 3D fully convolutional deep networks," in Machine Learning in Medical Imaging, 2017, pp. 379–387.