A 2D DILATED RESIDUAL U-NET FOR MULTI-ORGAN SEGMENTATION IN THORACIC CT

Sulaiman Vesal, Nishant Ravikumar, Andreas Maier
Pattern Recognition Lab, Friedrich-Alexander-University Erlangen-Nuremberg, Germany
(Thanks to EFI-Erlangen for funding.)

ABSTRACT

Automatic segmentation of organs-at-risk (OARs) in computed tomography (CT) is an essential part of planning effective treatment strategies to combat lung and esophageal cancer. Accurate segmentation of the organs surrounding tumours helps account for the variation in position and morphology inherent across patients, thereby facilitating adaptive and computer-assisted radiotherapy. Although manual delineation of OARs is still highly prevalent, it is prone to errors due to complex variations in the shape and position of organs across patients, and due to the low soft-tissue contrast between neighbouring organs in CT images. Recently, deep convolutional neural networks (CNNs) have gained tremendous traction and achieved state-of-the-art results in medical image segmentation. In this paper, we propose a deep learning framework to segment OARs in thoracic CT images, specifically the heart, esophagus, trachea and aorta. Our approach employs dilated convolutions and aggregated residual connections in the bottleneck of a U-Net styled network, which incorporates both global context and dense information. Our method achieved an overall Dice score of 91.57% on 20 unseen test samples from the ISBI 2019 SegTHOR challenge.

Index Terms: Thoracic Organs, Convolutional Neural Network, Dilated Convolutions, 2D Segmentation

1. INTRODUCTION

Organs at risk (OARs) are structures surrounding tumours that are at risk of damage during radiotherapy treatment [1]. Accurate segmentation of OARs is crucial for efficient planning of radiation therapy, a fundamental part of treating different types of cancer. However, manual segmentation of OARs in computed tomography (CT) images for structural analysis is very time-consuming, susceptible to manual errors, and subject to inter-rater differences [1][2]. Soft-tissue structures in CT images normally have very little contrast, particularly in the case of the esophagus. Consequently, an automatic approach to OAR segmentation is imperative for improved radiotherapy treatment planning, delivery and overall patient prognosis. Such a framework would also assist radiation oncologists in delineating OARs more accurately, consistently, and efficiently. Several studies have addressed automatic segmentation of OARs in CT images, with efforts focused mainly on the pelvic, head and neck areas [1][2][3].

Fig. 1. Example of OARs in CT images with axial, sagittal and coronal views, and a 3D surface mesh plot.

In this paper, we propose a fully automatic 2D segmentation approach for the esophagus, heart, aorta, and trachea in CT images of patients diagnosed with lung cancer. Accurate multi-organ segmentation requires the incorporation of both local and global information. Consequently, we modified the original 2D U-Net [4], using dilated convolutions [5] in the lowest layer of the encoder branch to extract features spanning a wider spatial range. Additionally, we added residual connections between the convolution layers in the encoder branch of the network, to better incorporate multi-scale image information and ensure a smoother flow of gradients in the backward pass.

2. METHODS

Segmentation tasks generally benefit from incorporating local and global contextual information. In a conventional U-Net [4], however, the lowest level of the network has a relatively small receptive field, which prevents the network from extracting features that capture non-local information. Hence, the network may lack the information necessary to recognize boundaries between adjacent organs, the connected nature of specific organs, and other properties that require greater global context to be included within the learning process. Dilated convolutions [5] provide a suitable solution to this problem. They introduce an additional parameter, the dilation rate, to convolution layers, which defines the spacing between the weights in a kernel. This dilates the kernel such that a 3×3 kernel with a dilation rate of 2 has a receptive field equal to that of a 5×5 kernel, and with a dilation rate of 3, equal to that of a 7×7 kernel. This is achieved without any increase in complexity, as the number of parameters associated with the kernel remains the same.

We propose a 2D U-Net+DR (refer to Fig. 2), a network inspired by our previous studies [6][7]. It comprises four downsampling and four upsampling convolution blocks within the encoder and decoder branches, respectively. In contrast to our previous approaches, here we employ a 2D version (rather than 3D) of the network with greater depth, because of the limited number of training samples. In each block, we use two convolutions with a kernel size of 3×3 pixels, each followed by batch normalization and a rectified linear unit (ReLU) activation, with a subsequent max-pooling operation. Image dimensions are preserved across the encoder-decoder branches by zero-padding the estimated feature maps following each convolution, which enables corresponding feature maps to be concatenated between the branches. A softmax activation function is used in the last layer to produce five probability maps that distinguish the background from the foreground labels. Furthermore, to improve the flow of gradients in the backward pass of the network, the convolution layers in the encoder branch were replaced with residual convolution layers: in each encoder block, the input to the first convolution layer is concatenated with the output of the second convolution layer (red arrow in Fig. 2), and the subsequent 2D max-pooling layer halves the spatial dimensions. The bottleneck between the branches employs four dilated convolutions with dilation rates 1-4, whose outputs are summed and provided as input to the decoder branch.

Fig. 2. Block diagram of the 2D U-Net+DR architecture for thoracic OAR segmentation. The left side shows the encoding part and the right side the decoding part. The network has four dilated convolutions in the bottleneck and a residual connection in each encoder block (shown as red arrows).
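The two building blocks described above can be sketched in Keras as follows. This is a minimal illustration rather than the authors' released code: the filter counts are placeholders, and since the text does not state whether the four dilated convolutions operate in cascade or in parallel, the sketch cascades them and sums the intermediate outputs.

```python
# Minimal sketch (assumed filter counts and layer ordering) of the
# residual encoder block and dilated bottleneck of 2D U-Net+DR.
from tensorflow.keras import layers

def encoder_block(x, filters):
    """Two 3x3 conv + BN + ReLU layers; the block input is concatenated
    with the second conv's output (the red residual arrow in Fig. 2)."""
    skip = x
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    x = layers.Concatenate()([skip, x])           # residual concatenation
    pooled = layers.MaxPooling2D(pool_size=2)(x)  # halves spatial dims
    return x, pooled                              # x feeds the skip connection

def dilated_bottleneck(x, filters):
    """Four 3x3 dilated convolutions with rates 1-4. Cascading them (an
    assumption here) and summing all intermediate outputs aggregates
    features from several receptive-field sizes."""
    outputs = []
    for rate in (1, 2, 3, 4):
        x = layers.Conv2D(filters, 3, padding="same",
                          dilation_rate=rate, activation="relu")(x)
        outputs.append(x)
    return layers.Add()(outputs)
```

Summing, rather than concatenating, the four outputs keeps the channel count fixed while aggregating features with effective kernel sizes from 3×3 (rate 1) up to 9×9 (rate 4).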
2.1. Dataset and Materials

The ISBI SegTHOR challenge organizers (https://competitions.codalab.org/competitions/21012) provided computed tomography (CT) images from the medical records of 60 patients. The CT scans are 512×512 pixels in size, with an in-plane resolution varying between 0.90 mm and 1.37 mm per pixel. The number of slices varies from 150 to 284, with a z-resolution between 2 mm and 3.7 mm. The most common resolution is 0.98×0.98×2.5 mm³. The SegTHOR dataset (60 patients) was randomly split into a training set of 40 patients (7390 slices) and a testing set of 20 patients (3694 slices). The ground truth for the OARs was delineated by an experienced radiation oncologist [2].

2.2. Pre-Processing

Due to the low contrast of most CT volumes in the SegTHOR dataset, we enhanced the contrast slice-by-slice using contrast limited adaptive histogram equalization (CLAHE), and normalized each volume with respect to its mean and standard deviation. In order to retain just the region of interest (ROI), i.e. the body and its anatomical structures, as the input to our network, each volume was center-cropped to a size of 288×288 along the x and y axes, while all slices along z were retained. We trained the model using the provided training samples via five-fold cross-validation (each fold comprising 32 subjects for training and 8 subjects for validation). Moreover, we applied off-line augmentation to increase the number of subjects within the training set, by flipping the volumes horizontally and vertically.
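A minimal sketch of this pre-processing pipeline is shown below, using scikit-image for CLAHE; the exact CLAHE implementation and clip limit used by the authors are not stated, so those values are illustrative assumptions.

```python
# Sketch of the pre-processing described above: slice-wise CLAHE,
# volume-wise z-score normalisation, and a 288x288 centre crop.
import numpy as np
from skimage import exposure

def preprocess_volume(volume, crop=288):
    """volume: (slices, 512, 512) CT array -> (slices, crop, crop)."""
    # Scale to [0, 1] so equalize_adapthist accepts the input.
    vol = (volume - volume.min()) / (volume.max() - volume.min() + 1e-8)
    # Slice-by-slice CLAHE; clip_limit=0.01 is an assumed default.
    vol = np.stack([exposure.equalize_adapthist(s, clip_limit=0.01)
                    for s in vol])
    # Normalise the whole volume by its mean and standard deviation.
    vol = (vol - vol.mean()) / (vol.std() + 1e-8)
    # Centre-crop in-plane, keeping all slices along z.
    h, w = vol.shape[1:]
    top, left = (h - crop) // 2, (w - crop) // 2
    return vol[:, top:top + crop, left:left + crop]
```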
2.3. Loss Function

In order to train our model, we formulated a modified version of the soft Dice loss [8] for multiclass segmentation, in which the Dice loss for each class is first computed individually and then averaged over the number of classes. Suppose that, for the segmentation of an N×N input image (a CT slice with esophagus, heart, aorta, trachea and background as labels), the outputs are five probability maps for the classes k = 0, 1, 2, 3, 4, such that \sum_k \hat{y}_{n,k} = 1 for each pixel n. Correspondingly, if y_{n,k} is the one-hot encoded ground truth of that pixel, the multiclass soft Dice loss is defined as:

\zeta_{dc}(y, \hat{y}) = 1 - \frac{1}{K} \sum_{k} \frac{2 \sum_{n} y_{nk} \hat{y}_{nk}}{\sum_{n} y_{nk} + \sum_{n} \hat{y}_{nk}}    (1)

In Eq. (1), \hat{y}_{nk} denotes the output of the model, where n indexes the pixels and k the K = 5 classes. The ground truth labels are denoted by y_{nk}.

Furthermore, in the second stage of training (described in detail in the next section), we used the Tversky loss (TL) [9], as the multiclass Dice loss does not incorporate a weighting mechanism for classes with fewer pixels. The TL is defined as:

TL(y, \hat{y}) = 1 - \frac{\sum_{n} y_{nk} \hat{y}_{nk}}{\sum_{n} y_{nk} \hat{y}_{nk} + \alpha \sum_{n} (1 - y_{nk}) \hat{y}_{nk} + \beta \sum_{n} y_{nk} (1 - \hat{y}_{nk})}    (2)

By adjusting the hyper-parameters \alpha and \beta in Eq. (2), we can control the trade-off between false positives and false negatives. In our experiments, we set both \alpha and \beta to 0.5. Training with this loss for additional epochs improved the segmentation accuracy on the validation set as well as on the SegTHOR test set, compared to training with the multiclass Dice loss alone.
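Both losses can be written compactly in TensorFlow/Keras. The sketch below assumes one-hot targets of shape (batch, H, W, 5) with channels last; the smoothing constant eps is an addition for numerical stability and is not part of Eqs. (1)-(2).

```python
# Sketch of the multiclass soft Dice loss, Eq. (1), and the Tversky
# loss, Eq. (2). Per-class terms are averaged over the classes.
import tensorflow as tf

def multiclass_soft_dice_loss(y_true, y_pred, eps=1e-6):
    axes = (0, 1, 2)                    # sum over batch and pixels n
    intersect = tf.reduce_sum(y_true * y_pred, axis=axes)
    denom = tf.reduce_sum(y_true + y_pred, axis=axes)
    dice_per_class = (2.0 * intersect + eps) / (denom + eps)
    return 1.0 - tf.reduce_mean(dice_per_class)   # average over k

def tversky_loss(y_true, y_pred, alpha=0.5, beta=0.5, eps=1e-6):
    axes = (0, 1, 2)
    tp = tf.reduce_sum(y_true * y_pred, axis=axes)            # true pos.
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=axes)    # false pos.
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=axes)    # false neg.
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - tf.reduce_mean(tversky)
```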
2.4. Model Training

The adaptive moment estimation (ADAM) optimizer was used to estimate the network parameters throughout, with the first- and second-moment estimates set to 0.9 and 0.999, respectively. The learning rate was initialized to 0.0001, with a decay factor of 0.2 during training. Validation accuracy was evaluated after each epoch until it stopped increasing; the best performing model was then selected for evaluation on the test set. We first trained our model using five-fold cross-validation without any online data augmentation, using only the multiclass Dice loss function. In the second stage, in order to improve the segmentation accuracy, we loaded the weights from the first stage and trained the model with random online data augmentation (zooming, rotation, shifting, shearing, and cropping) for 50 additional epochs. This led to a significant performance improvement on the SegTHOR test data. As the multiclass Dice loss does not account for class imbalance, we further improved the second stage of training by employing the TL in its place. Consequently, the highest accuracy achieved by our approach employed the TL along with online data augmentation. The network was implemented using Keras, an open-source deep learning library for Python, and was trained on an NVIDIA Titan X-Pascal GPU with 3840 CUDA cores and 12 GB RAM.

On the test dataset, we observed that our model predicted small structures in implausible locations. This was addressed by post-processing the segmentations to retain only the largest connected component for each structure. As the segmentations predicted by our network were already of good quality, this only led to marginal improvements in the average Dice score, of approximately 0.002. However, it substantially reduced the average Hausdorff distance, which is very sensitive to outliers.
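The connected-component filtering can be implemented with scipy.ndimage, as in the sketch below. This is a minimal illustration of the step described above, assuming an integer label volume with background label 0; the function name and the default face connectivity are our choices, not necessarily the authors'.

```python
# Sketch of the post-processing: for each foreground label, keep only
# the largest 3D connected component and discard the rest.
import numpy as np
from scipy import ndimage

def keep_largest_component(label_volume, num_classes=5):
    """label_volume: integer array of per-voxel class labels (0 = background)."""
    cleaned = np.zeros_like(label_volume)
    for cls in range(1, num_classes):
        mask = label_volume == cls
        components, n = ndimage.label(mask)   # face-connectivity labelling
        if n == 0:
            continue
        sizes = ndimage.sum(mask, components, range(1, n + 1))
        largest = np.argmax(sizes) + 1        # component ids start at 1
        cleaned[components == largest] = cls
    return cleaned
```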
2.5. Evaluation Metrics

Two standard evaluation metrics are used to assess segmentation accuracy: the Dice similarity coefficient (DSC) and the Hausdorff distance (HD). The DSC, also known as the F1-score, measures the similarity/overlap between the manual and automatic segmentations. It is the most widely used metric for evaluating segmentation accuracy and is defined as:

DSC(G, P) = \frac{2TP}{FP + 2TP + FN} = \frac{2|G \cap P|}{|G| + |P|}    (3)

The HD is defined as the largest of the distances from the points in one set to their closest points in the other set:

HD(G, P) = \max \left\{ \max_{g \in G} \min_{p \in P} \|g - p\|_2, \; \max_{p \in P} \min_{g \in G} \|p - g\|_2 \right\}    (4)

In Eqs. (3) and (4), G and P represent the ground truth and predicted segmentations, respectively.

3. RESULTS AND DISCUSSION

The average DSC and HD measures achieved by the 2D U-Net+DR across the five-fold cross-validation experiments are summarized in Table 1. The DSC scores achieved by the 2D U-Net+DR without data augmentation were 93.61% on the validation set and 88.69% on the test set. The same network with online data augmentation significantly improved the segmentation accuracy, to 94.53% and 91.43% for the validation and test sets, respectively. Finally, on fine-tuning the trained network using the TL, we achieved DSC scores of 94.59% and 91.57%, respectively. Compared to [2], our method achieved DSC and HD scores of 85.67% and 0.30 mm for the esophagus, the most difficult OAR to segment. Table 2 lists the DSC and HD scores for each individual organ, for the 2D U-Net+DR method with online augmentation and TL, on the test dataset.

Table 1. The DSC and HD scores for the training, validation and test datasets.

Methods                           Train DSC   Validation DSC   Test DSC   Test HD [mm]
2D U-Net + DR                     0.9784      0.9361           0.8869     0.4461
2D U-Net + DR (Augmented)         0.9741      0.9453           0.9143     0.2536
2D U-Net + DR (Augmented) + TL    0.9749      0.9459           0.9157     0.2500

The images presented in Fig. 3 allow a visual assessment of the segmentation quality of the proposed method on three test volumes. Here, green represents the heart, while red, blue and yellow represent the esophagus, trachea, and aorta, respectively. We obtained the best average DSC values for the heart and aorta, owing to their high contrast, regular shape, and larger size compared to the other organs. As expected, the esophagus had the lowest average DSC and the largest HD, due to its irregular shape and low contrast, which make it difficult to identify within CT volumes. Nevertheless, our method achieved a DSC score of 85.8% for the esophagus on the test dataset, outperforming the method proposed in [2], which used a SharpMask architecture and conditional random fields. These results highlight the effectiveness of the proposed approach for segmenting OARs, which is essential for radiation therapy planning.

Table 2. The DSC and HD scores of each organ for the 2D U-Net+DR (Augmented) + TL method.

Metrics     Esophagus   Heart    Trachea   Aorta
DSC         0.858       0.941    0.926     0.938
HD [mm]     0.331       0.226    0.193     0.297

Fig. 3. 3D surface segmentation outputs of the proposed model for three subjects from the ISBI SegTHOR challenge test set.

4. CONCLUSIONS

In this study, we presented a fully automated approach, called 2D U-Net+DR, for the segmentation of OARs (esophagus, heart, aorta, and trachea) in CT volumes. Our approach provides accurate and reproducible segmentations, thereby helping to improve the consistency and robustness of OAR delineation relative to manual segmentation. The method uses both local and global information, by expanding the receptive field in the lowest level of the network using dilated convolutions. The two-stage training strategy employed here, together with the multiclass soft Dice loss and the Tversky loss, significantly improved the segmentation accuracy. Furthermore, we believe that including additional information, e.g. MR images, may be beneficial for OARs with poorly visible boundaries, such as the esophagus.

5. REFERENCES

[1] B. Ibragimov and L. Xing, "Segmentation of organs-at-risks in head and neck CT images using convolutional neural networks," Medical Physics, vol. 44, no. 2, pp. 547-557, 2017.

[2] R. Trullo, C. Petitjean, S. Ruan, B. Dubray, D. Nie, and D. Shen, "Segmentation of organs at risk in thoracic CT images using a SharpMask architecture and conditional random fields," in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), April 2017, pp. 1003-1006.

[3] S. Kazemifar, A. Balagopal, D. Nguyen, S. McGuire, R. Hannan, S. Jiang, and A. Owrangi, "Segmentation of the prostate and organs at risk in male pelvic CT images using deep learning," Biomedical Physics & Engineering Express, vol. 4, no. 5, p. 055003, Jul. 2018.

[4] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, 2015, pp. 234-241.

[5] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," CoRR, vol. abs/1511.07122, 2015.

[6] S. Vesal, N. Ravikumar, and A. Maier, "Dilated convolutions in neural networks for left atrial segmentation in 3D gadolinium enhanced-MRI," in STACOM: Atrial Segmentation and LV Quantification Challenges, 2019, pp. 319-328.

[7] L. Folle, S. Vesal, N. Ravikumar, and A. Maier, "Dilated deeply supervised networks for hippocampus segmentation in MRI," in Bildverarbeitung für die Medizin 2019, 2019, pp. 68-73.

[8] F. Milletari, N. Navab, and S. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," in 2016 Fourth International Conference on 3D Vision (3DV), Oct. 2016, pp. 565-571.

[9] S. Salehi, D. Erdogmus, and A. Gholipour, "Tversky loss function for image segmentation using 3D fully convolutional deep networks," in Machine Learning in Medical Imaging, 2017, pp. 379-387.