<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A 2D DILATED RESIDUAL U-NET FOR MULTI-ORGAN SEGMENTATION IN THORACIC CT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name><given-names>Sulaiman</given-names> <surname>Vesal</surname></string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Nishant</given-names> <surname>Ravikumar</surname></string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Andreas</given-names> <surname>Maier</surname></string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pattern Recognition Lab, Friedrich-Alexander-University Erlangen-Nuremberg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automatic segmentation of organs-at-risk (OAR) in computed tomography (CT) is an essential part of planning effective treatment strategies to combat lung and esophageal cancer. Accurate segmentation of organs surrounding tumours helps account for the variation in position and morphology inherent across patients, thereby facilitating adaptive and computer-assisted radiotherapy. Although manual delineation of OARs is still highly prevalent, it is prone to errors due to complex variations in the shape and position of organs across patients, and low soft tissue contrast between neighbouring organs in CT images. Recently, deep convolutional neural networks (CNNs) have gained tremendous traction and achieved state-of-the-art results in medical image segmentation. In this paper, we propose a deep learning framework to segment OARs in thoracic CT images, specifically for the: heart, esophagus, trachea and aorta. Our approach employs dilated convolutions and aggregated residual connections in the bottleneck of a U-Net styled network, which incorporates global context and dense information. Our method achieved an overall Dice score of 91.57% on 20 unseen test samples from the ISBI 2019 SegTHOR challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Thoracic Organs</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Dilated Convolutions</kwd>
        <kwd>2D Segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Organs at risk (OAR) refer to structures surrounding tumours,
at risk of damage during radiotherapy treatment [1]. Accurate
segmentation of OARs is crucial for efficient planning of
radiation therapy, a fundamental part of treating different types
of cancer. However, manual segmentation of OARs in
computed tomography (CT) images for structural analysis, is very
time-consuming, susceptible to manual errors, and is subject
to inter-rater differences [1][
        <xref ref-type="bibr" rid="ref1">2</xref>
        ]. Soft tissue structures in CT
images normally have very little contrast, particularly in the
case of the esophagus. Consequently, an automatic approach
to OAR segmentation is imperative for improved
radiotherapy treatment planning, delivery and overall patient
prognosis. Such a framework would also assist radiation oncologists
in delineating OARs more accurately, consistently, and
efficiently. Several studies have addressed automatic
segmentation of OARs in CT images, with efforts being more focused
on pelvic, head and neck areas [1][
        <xref ref-type="bibr" rid="ref1">2</xref>
        ][
        <xref ref-type="bibr" rid="ref2">3</xref>
        ]. (Thanks to EFI-Erlangen for funding.)
      </p>
      <p>
        In this paper, we propose a fully automatic 2D
segmentation approach for the esophagus, heart, aorta, and trachea, in
CT images of patients diagnosed with lung cancer. Accurate
multi-organ segmentation requires incorporation of both local
and global information. Consequently, we modified the
original 2D U-Net [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ], using dilated convolutions [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ] in the
lowest layer of the encoder-branch, to extract features spanning a
wider spatial range. Additionally, we added residual
connections between convolution layers in the encoder branch of the
network, to better incorporate multi-scale image information
and ensure a smoother flow of gradients in the backward pass.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODS</title>
      <p>
        Segmentation tasks generally benefit from incorporating local
and global contextual information. In a conventional U-Net
[
        <xref ref-type="bibr" rid="ref3">4</xref>
        ] however, the lowest level of the network has a relatively
small receptive field, which prevents the network from
extracting features that capture non-local information. Hence,
the network may lack the information necessary to recognize
boundaries between adjacent organs, the fully connected
nature of specific organs, among other properties that require
greater global context to be included within the learning
process. Dilated convolutions [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ] provide a suitable solution to
this problem. They introduce an additional parameter, i.e. the
dilation rate, to convolution layers, which defines the spacing
between weights in a kernel. This helps dilate the kernel such
that a 3×3 kernel with a dilation rate of 2 results in a receptive
field size equal to that of a 7×7 kernel. Additionally, this is
achieved without any increase in complexity, as the number
of parameters associated with the kernel remains the same.
      </p>
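      <p>
        As a sanity check on this arithmetic, the receptive field of a stack of dilated convolutions can be computed directly. The sketch below is illustrative (the helper is hypothetical, not part of the paper's implementation): following [5], each stride-1 layer with kernel size k and dilation rate d enlarges the receptive field by (k − 1) · d, so an ordinary 3×3 convolution followed by one with dilation rate 2 sees 7×7 pixels.
      </p>
      <preformat>
```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field (per axis) of a stack of stride-1 dilated convolutions.

    Each layer enlarges the receptive field by (kernel_size - 1) * rate,
    while its parameter count stays at kernel_size * kernel_size.
    """
    rf = 1
    for rate in dilation_rates:
        rf += (kernel_size - 1) * rate
    return rf

# An ordinary 3x3 conv followed by a 3x3 conv with dilation rate 2:
print(receptive_field(3, [1, 2]))        # 7
# Exponentially growing rates, as in Yu and Koltun [5]:
print(receptive_field(3, [1, 2, 4]))     # 15
# The four bottleneck rates used in this network:
print(receptive_field(3, [1, 2, 3, 4]))  # 21
```
      </preformat>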
      <p>
        We propose a 2D U-Net+DR (refer to Fig. 3) network
inspired by our previous studies [
        <xref ref-type="bibr" rid="ref5">6</xref>
        ][
        <xref ref-type="bibr" rid="ref6">7</xref>
        ]. It comprises four
downsampling and upsampling convolution blocks within the
encoder and decoder branches, respectively. In contrast to our
previous approaches, here we employ a 2D version (rather
than 3D) of the network with greater depth, because of the
limited number of training samples. For each block, we use
two convolutions with a kernel size of 3×3 pixels, with batch
normalization, rectified linear units (ReLUs) as activation
functions, and a subsequent max pooling operation.
Image dimensions are preserved between the encoder-decoder
branches following convolutions, by zero-padding the
estimated feature maps. This enabled corresponding feature
maps to be concatenated between the branches. A softmax
activation function was used in the last layer to produce five
probability maps to distinguish the background from the
foreground labels. Furthermore, to improve the flow of gradients
in the backward pass of the network, the convolution layers in
the encoder branch were replaced with residual convolution
layers. In each encoder-convolution block, the input to the
first convolution layer is concatenated with the output of
second convolution layer (red line in Fig. 3), and the subsequent
2D max-pooling layer reduces volume dimensions by half.
The bottleneck between the branches employs four dilated
convolutions, with dilation rates of 1 to 4. The outputs of each
are summed up and provided as input to the decoder branch.
      </p>
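      <p>
        The effect of dilation on a kernel can also be made explicit by materialising the zeros it inserts between weights. The NumPy sketch below (dilate_kernel is a hypothetical illustrative helper) shows that, for each of the four bottleneck rates, the spatial extent of a 3×3 kernel grows while its number of non-zero weights stays at nine.
      </p>
      <preformat>
```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between adjacent kernel weights on each axis."""
    k = kernel.shape[0]
    size = k + (k - 1) * (rate - 1)
    dilated = np.zeros((size, size), dtype=kernel.dtype)
    dilated[::rate, ::rate] = kernel
    return dilated

kernel = np.ones((3, 3))
for rate in (1, 2, 3, 4):  # the four bottleneck dilation rates
    dk = dilate_kernel(kernel, rate)
    # Spatial extent grows, but the parameter count stays at 9.
    print(rate, dk.shape, int(np.count_nonzero(dk)))
```
      </preformat>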
    </sec>
    <sec id="sec-3">
      <title>2.1. Dataset and Materials</title>
      <p>
        The ISBI SegTHOR challenge organizers (https://competitions.codalab.org/competitions/21012) provided
computed tomography (CT) images from the medical records of
60 patients. The CT scans are 512×512 pixels in size, with
an in-plane resolution varying between 0.90 mm and 1.37 mm
per pixel. The number of slices varies from 150 to 284 with
a z-resolution between 2 mm and 3.7 mm. The most common
resolution is 0.98 × 0.98 × 2.5 mm³. The SegTHOR dataset
(60 patients) was randomly split into a training set of 40
patients (7390 slices) and a testing set of 20 patients (3694 slices).
The ground truth for OARs was delineated by an experienced
radiation oncologist [
        <xref ref-type="bibr" rid="ref1">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Pre-Processing</title>
      <p>Due to the low contrast of most CT volumes in the SegTHOR
dataset, we enhanced the contrast slice-by-slice, using
contrast limited adaptive histogram equalization (CLAHE), and
normalized each volume with respect to mean and standard
deviation. In order to retain just the region of interest (ROI),
i.e. the body part and its anatomical structures, as the input
to our network, each volume was center cropped to a size of
288×288 along the x and y axes, while the same number of
slices along z were retained. We trained the model using the
provided training samples via five-fold cross-validation (each
fold comprising 32 subjects for training and 8 subjects for
validation). Moreover, we applied off-line augmentation to
increase the number of subjects within the training set, by
flipping the volumes horizontally and vertically.</p>
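      <p>
        A minimal NumPy sketch of the cropping, normalisation and flip-augmentation steps follows (function names are hypothetical, and the CLAHE step is omitted here, as it would typically be applied slice-by-slice with an image-processing library):
      </p>
      <preformat>
```python
import numpy as np

def preprocess(volume, crop=288):
    """Normalise a CT volume and centre-crop each axial slice to crop x crop."""
    vol = (volume - volume.mean()) / (volume.std() + 1e-8)
    _, y, x = vol.shape
    y0, x0 = (y - crop) // 2, (x - crop) // 2
    return vol[:, y0:y0 + crop, x0:x0 + crop]

def augment(volume):
    """Offline augmentation: identity plus horizontal and vertical flips."""
    return [volume, volume[:, :, ::-1], volume[:, ::-1, :]]

ct = np.random.rand(150, 512, 512).astype(np.float32)  # stand-in for one scan
roi = preprocess(ct)
print(roi.shape)          # (150, 288, 288)
print(len(augment(roi)))  # 3
```
      </preformat>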
    </sec>
    <sec id="sec-5">
      <title>2.3. Loss Function</title>
      <p>
        In order to train our model, we formulated a modified version
of soft-Dice loss [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ] for multiclass segmentation. Here the
Dice loss for each class is first computed individually and then
averaged over the number of classes. Suppose, for the
segmentation of an N×N input image (a CT slice with the
esophagus, heart, aorta, trachea and background as labels), the
outputs are five probability maps for the classes k = 0, 1, 2, 3, 4,
such that Σ_k ŷ_{n,k} = 1 for each pixel n. Correspondingly, if
y_{n,k} is the one-hot encoded ground truth of that pixel, then
the multiclass soft Dice loss is defined as follows:
      </p>
      <p>DC(y, ŷ) = 1 − (1/K) Σ_k [ 2 Σ_n y_{n,k} ŷ_{n,k} / (Σ_n y_{n,k} + Σ_n ŷ_{n,k}) ]   (1)</p>
      <p>In Eq. (1), ŷ_{n,k} denotes the output of the model, where n
indexes the pixels, k the classes, and K the number of classes. The
ground truth labels are denoted by y_{n,k}.</p>
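      <p>
        Eq. (1) translates directly into a few lines of array code; the following is an illustrative NumPy sketch, not the paper's Keras implementation:
      </p>
      <preformat>
```python
import numpy as np

def multiclass_soft_dice_loss(y_true, y_pred, eps=1e-7):
    """Multiclass soft Dice loss of Eq. (1).

    y_true: one-hot ground truth, shape (n_pixels, K)
    y_pred: softmax probabilities, shape (n_pixels, K)
    The per-class soft Dice scores are averaged over the K classes.
    """
    intersection = (y_true * y_pred).sum(axis=0)
    denom = y_true.sum(axis=0) + y_pred.sum(axis=0)
    dice_per_class = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice_per_class.mean()

# A perfect prediction yields zero loss.
y = np.eye(5)[np.random.randint(0, 5, size=1000)]  # one-hot, K = 5
print(multiclass_soft_dice_loss(y, y))  # 0.0
```
      </preformat>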
      <p>
        Furthermore, in the second stage of the training (described
in detail in the next section), we used Tversky Loss (TL)[
        <xref ref-type="bibr" rid="ref8">9</xref>
        ],
as the multiclass Dice loss does not incorporate a weighting
mechanism for classes with fewer pixels. The TL is defined
as follows:
      </p>
      <p>TL(y, ŷ) = 1 − Σ_k [ Σ_n y_{n,k} ŷ_{n,k} / (Σ_n y_{n,k} ŷ_{n,k} + α Σ_n (1 − y_{n,k}) ŷ_{n,k} + β Σ_n y_{n,k} (1 − ŷ_{n,k})) ]   (2)</p>
      <p>By adjusting the hyper-parameters α and β (refer to
Eq. 2), we can control the trade-off between false positives and
false negatives. In our experiments, we set both α and β to
0.5. Training with this loss for additional epochs improved the
segmentation accuracy on the validation set as well as on the
SegTHOR test set, compared to training with the multiclass
Dice loss alone.</p>
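      <p>
        The TL can be sketched analogously; with α = β = 0.5, the Tversky index reduces to the Dice score, which is what makes it a natural drop-in replacement in the second training stage (illustrative NumPy code, not the paper's implementation):
      </p>
      <preformat>
```python
import numpy as np

def tversky_loss(y_true, y_pred, alpha=0.5, beta=0.5, eps=1e-7):
    """Multiclass Tversky loss of Eq. (2), averaged over classes.

    alpha weights false positives, beta weights false negatives;
    alpha = beta = 0.5 recovers the soft Dice loss.
    """
    tp = (y_true * y_pred).sum(axis=0)
    fp = ((1.0 - y_true) * y_pred).sum(axis=0)
    fn = (y_true * (1.0 - y_pred)).sum(axis=0)
    tversky_index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - tversky_index.mean()

rng = np.random.default_rng(0)
y_true = np.eye(5)[rng.integers(0, 5, 1000)]
y_pred = rng.random((1000, 5))
y_pred /= y_pred.sum(axis=1, keepdims=True)  # softmax-like normalisation
print(round(tversky_loss(y_true, y_pred), 4))
```
      </preformat>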
    </sec>
    <sec id="sec-6">
      <title>2.4. Model Training</title>
      <p>The adaptive moment estimation (ADAM) optimizer was
used to estimate network parameters throughout, and the 1st
and 2nd-moment estimates were set to 0.9 and 0.999
respectively. The learning rate was initialized to 0.0001 with a
decay factor of 0.2 during training. Validation accuracy was
evaluated after each epoch during training, until it stopped
increasing. Subsequently, the best performing model was
selected for evaluation on the test set. We first trained our model
using five-fold cross-validation without any online data
augmentation and using only multiclass Dice loss function. In
the second stage, in order to improve the segmentation
accuracy, we loaded the weights from the first stage and trained
the model with random online data augmentation (zooming,
rotation, shifting, shearing, and cropping) for 50 additional
epochs. This led to a significant performance improvement on
the SegTHOR test data. As the multiclass Dice loss does not
account for class imbalance, we further improved the second
stage of the training process, by employing the TL in place of
the former. Consequently, the highest accuracy achieved by
our approach employed the TL along with online data
augmentation. The network was implemented using Keras, an
open-source deep learning library for Python, and was trained
on an NVIDIA Titan X-Pascal GPU with 3840 CUDA cores
and 12GB RAM. On the test dataset, we observed that our
model predicted small structures in implausible locations.
This was addressed by post-processing the segmentations, to
retain only the largest connected component for each
structure. As the segmentations predicted by our network were
already of good quality, this only led to marginal
improvements in the average Dice score, of approximately 0.002.
However, it substantially reduced the average Hausdorff
distance, which is very sensitive to outliers.</p>
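      <p>
        The connected-component post-processing can be sketched in pure Python. The paper does not state which library or connectivity was used; this illustration assumes 4-connectivity on a 2D binary mask:
      </p>
      <preformat>
```python
from collections import deque

def largest_component(mask):
    """Keep only the largest 4-connected component of a binary 2D mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = set()
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                comp, queue = set(), deque([(r, c)])
                seen[r][c] = True
                while queue:  # breadth-first flood fill
                    i, j = queue.popleft()
                    comp.add((i, j))
                    for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                        if ni in range(rows) and nj in range(cols) \
                                and mask[ni][nj] and not seen[ni][nj]:
                            seen[ni][nj] = True
                            queue.append((ni, nj))
                if len(comp) > len(best):
                    best = comp
    return [[(r, c) in best for c in range(cols)] for r in range(rows)]

# Two blobs: a 4-pixel square and an isolated pixel; only the square survives.
mask = [[1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
cleaned = largest_component(mask)
print(sum(sum(row) for row in cleaned))  # 4
```
      </preformat>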
    </sec>
    <sec id="sec-7">
      <title>2.5. Evaluation Metrics</title>
      <p>Two standard evaluation metrics are used to assess
segmentation accuracy, namely, the Dice score coefficient (DSC) and
the Hausdorff distance (HD). The DSC metric, also known as the
F1-score, measures the similarity/overlap between manual and
automatic segmentations. It is the most widely used
metric for evaluating segmentation accuracy, and is defined as:</p>
      <p>DSC(G, P) = 2TP / (FP + 2TP + FN) = 2|G_i ∩ P_i| / (|G_i| + |P_i|)   (3)</p>
      <p>The HD is defined as the largest of the pairwise distances
from points in one set to their corresponding closest points in
the other set. This is expressed as:</p>
      <p>HD(G, P) = max( max_{g∈G} min_{p∈P} ||p − g||₂ , max_{p∈P} min_{g∈G} ||p − g||₂ )   (4)</p>
      <p>In Eq. (3) and (4), G and P represent the ground truth
and predicted segmentations, respectively.</p>
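      <p>
        Both metrics are straightforward to compute for binary masks; the following is a minimal NumPy sketch with illustrative helper names:
      </p>
      <preformat>
```python
import numpy as np

def dsc(g, p):
    """Dice score coefficient between two binary masks, as in Eq. (3)."""
    g, p = np.asarray(g, bool), np.asarray(p, bool)
    return 2.0 * np.logical_and(g, p).sum() / (g.sum() + p.sum())

def hausdorff(g_pts, p_pts):
    """Symmetric Hausdorff distance between two point sets, as in Eq. (4)."""
    g_pts = np.asarray(g_pts, float)
    p_pts = np.asarray(p_pts, float)
    # Pairwise Euclidean distances via broadcasting.
    d = np.linalg.norm(g_pts[:, None, :] - p_pts[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

g = np.zeros((4, 4), bool); g[1:3, 1:3] = True
p = np.zeros((4, 4), bool); p[1:3, 1:4] = True  # one extra column of pixels
print(round(dsc(g, p), 4))                        # 0.8
print(hausdorff(np.argwhere(g), np.argwhere(p)))  # 1.0
```
      </preformat>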
    </sec>
    <sec id="sec-8">
      <title>3. RESULTS AND DISCUSSIONS</title>
      <p>
        The average DSC and HD measures achieved by 2D
U-Net+DR across five-fold cross-validation experiments are
summarized in Table 1. The DSC scores achieved by the
2D U-Net+DR without data augmentation for the validation
and test sets were 93.61% and 88.69%, respectively. The
same network with online data augmentation significantly
improved the segmentation accuracy to 94.53% and 91.43%
for the validation and test sets, respectively. Finally, on
finetuning the trained network using the TL we achieved DSC
scores of 94.59% and 91.57%, respectively. Compared to
[
        <xref ref-type="bibr" rid="ref1">2</xref>
        ], our method achieved DSC and HD scores of 85.67%
and 0.30 mm for the esophagus, the most difficult OAR to
segment. Table 2 shows the DSC and HD scores for
each individual organ, for the 2D U-Net+DR method with online
augmentation and TL, on the test data set.
      </p>
      <p>The images presented in Fig. 3 help visually assess the
segmentation quality of the proposed method on three test
volumes. Here, the green color represents the heart, and the
red, blue and yellow colors represent the esophagus, trachea,
and aorta, respectively.</p>
      <p>Table 1. Train and validation DSC for the evaluated models:
2D U-Net + DR (0.9784 / 0.9361), 2D U-Net + DR (Augmented)
(0.9741 / 0.9453), and 2D U-Net + DR (Augmented) + TL
(0.9749 / 0.9459).</p>
      <p>
        We obtained the highest average DSC
value and HD for the heart and aorta because of their high
contrast, regular shape, and larger size compared to the other
organs. As expected, the esophagus had the lowest average DSC
and HD values due to its irregularity and low contrast,
making it difficult to identify within CT volumes. However, our
method achieved a DSC score of 85.8% for the esophagus on
the test data set, demonstrating better performance in
comparison to the method proposed in [
        <xref ref-type="bibr" rid="ref1">2</xref>
] which used a SharpMask
network architecture and conditional random fields. These
results highlight the effectiveness of the proposed approach
for segmenting OARs, which is essential for radiation
therapy planning.
      </p>
      <p>Fig. 3. 3D surface segmentation outputs of the proposed model
for three subjects from the ISBI SegTHOR challenge test set.</p>
    </sec>
    <sec id="sec-9">
      <title>4. CONCLUSIONS</title>
      <p>In this study, we presented a fully automated approach, called
2D U-Net+DR, for automatic segmentation of the OARs
(esophagus, heart, aorta, and trachea) in CT volumes. Our
approach provides accurate and reproducible segmentations,
thereby aiding in improving consistency and robustness in
delineating OARs, relative to manual segmentations. The
method uses both local and global information, by
expanding the receptive-field in the lowest level of the network,
using dilated convolutions. The two-stage training strategy
employed here, together with the multi-class soft Dice loss
and Tversky loss, significantly improved the segmentation
accuracy. Furthermore, we believe that including additional
information, e.g. MR images, may be beneficial for some
OARs with poorly-visible boundaries such as the esophagus.</p>
    </sec>
    <sec id="sec-10">
      <title>5. REFERENCES</title>
      <p>[1] B. Ibragimov and L. Xing, “Segmentation of organs-at-risks
in head and neck CT images using convolutional neural
networks,” Medical Physics, vol. 44, no. 2, pp. 547–557,
2017.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Trullo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Petitjean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dubray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Shen</surname>
          </string-name>
          , “
          <article-title>Segmentation of organs at risk in thoracic ct images using a sharpmask architecture and conditional random fields</article-title>
          ,” in
          <source>2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI</source>
          <year>2017</year>
          ),
          <year>April 2017</year>
          , pp.
          <fpage>1003</fpage>
          -
          <lpage>1006</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kazemifar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Balagopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McGuire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hannan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Owrangi</surname>
          </string-name>
          , “
          <article-title>Segmentation of the prostate and organs at risk in male pelvic CT images using deep learning</article-title>
          ,”
          <source>Biomedical Physics &amp; Engineering Express</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>055003</fpage>
          , jul
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , “
          <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>
          ,” in
          <source>Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yu</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Koltun</surname>
          </string-name>
          , “
          <article-title>Multi-scale context aggregation by dilated convolutions</article-title>
          ,”
          <source>CoRR</source>
          , vol.
          <volume>abs/1511.07122</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vesal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ravikumar</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maier</surname>
          </string-name>
          , “
          <article-title>Dilated convolutions in neural networks for left atrial segmentation in 3D gadolinium enhanced-MRI</article-title>
          ,” in
          <source>STACOM: Atrial Segmentation and LV Quantification Challenges</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>319</fpage>
          -
          <lpage>328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Folle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vesal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ravikumar</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maier</surname>
          </string-name>
          , “
          <article-title>Dilated deeply supervised networks for hippocampus segmentation in mri</article-title>
          ,” in
          <source>Bildverarbeitung fu¨r die Medizin</source>
          <year>2019</year>
          ,
          <year>2019</year>
          , pp.
          <fpage>68</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Milletari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Navab</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          , “
          <article-title>V-net: Fully convolutional neural networks for volumetric medical image segmentation</article-title>
          ,” in
          <source>2016 Fourth International Conference on 3D Vision (3DV)</source>
          , Oct
          <year>2016</year>
          , pp.
          <fpage>565</fpage>
          -
          <lpage>571</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Salehi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erdogmus</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gholipour</surname>
          </string-name>
          , “
          <article-title>Tversky loss function for image segmentation using 3d fully convolutional deep networks</article-title>
          ,” in
          <source>Machine Learning in Medical Imaging</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>379</fpage>
          -
          <lpage>387</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>