<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TWO-STAGE NETWORK FOR OAR SEGMENTATION Pan Chen, Chenghai Xu, Xiaoying Li, Yingying Ma, Fenglong Sun Digital China Health Technologies Co., Ltd.</article-title>
      </title-group>
      <abstract>
        <p>For cancer, radiotherapy is a standard treatment and the first step is to identify the target volumes to be targeted and the healthy organs at risk (OAR) to be protected. In this paper, we propose a deep learning framework for the segmentation of OAR in CT images of the thorax, specifically the heart, the esophagus, the trachea and the aorta. The key idea is using a two-stage strategy. First, we use a 3d U-shape convolution network to get the localization of four organs. Then using next 3d U-shape convolution network we obtain the precise segmentation results. Also, we try some tricks to improve the performance. Index Terms- OAR segmentation, deep learning, twostage</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Delineating organs in risk is a very important routine work
in radiation therapy. However, current manual delineation is
time consuming and relies on the knowledge and experience
of a doctor. In this paper, inspired by [1], we introduce a
two-stage OAR segmentation framework based on DCNN,
consisting of 1) a localization model to detect the interest of
region containing the four organs: esophagus, heart, trachea,
aorta; 2) a segmentation model to focus on this interest of
region and obtain the segmentation result. Since the CT
image is three dimensional volume data and these organs are
intrinsically 3d objects, DCNN filters learning directly on
the overall 3d CT volume enables capturing the complete
spatial context of organs. But due to the computational
intensity and limit amount of GPU memory, the input image
can’t be very large. Our method has two advantages: on the
one hand, we don’t need high resolution image for
localization task; on the other hand, the segmentation model
can only concentrate on the important local region. Both
reduce the sizes of input images and make the algorithm
efficient.</p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD</title>
      <p>As mentioned above, we first determine the region where
OAR are located. Then we focus on this region to get fine
segmentation.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Localization</title>
      <sec id="sec-3-1">
        <title>2.1.1. Data preprocessing</title>
        <p>With CT intensity values being not standardized,
normalization is critical to allow for data from different
processing types (with or without IV contrast). we
normalize each image independently by subtracting the
mean and dividing by the standard deviation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.1.2. Network architecture</title>
        <p>Our network is inspired by the U-Net architecture [2] and
we refer to [3]. See Figure 1. Our architecture comprises of
a context pathway and a localization pathway. In the context
pathway, residual blocks that we call context modules are
used. Every context module consists of two 3x3x3
convolutional layers and a dropout layer (p=0.3) is in
between. Context modules are connected by stride 2 3x3x3
convolutions. As going deeper, the context modules give
more abstract representations of lower image resolution. In
the localization pathway, we have three localization
modules. Every localization module consists of a 3x3x3
convolution, which halves the number of channels,
followed by a 1x1x1 convolution. The upsampling module
consists of an upsampling (size 2, stride 2) and a 3x3x3
convolution which halves the number of channels. Through
upsampling and concatenating, a merged feature map is
send to the localization module. We employ deep
supervision in the localization pathway by integrating
segmentation layers (3x3x3 convolution) at different levels
of the network and combining them via element-wise
summation to form the final network output. Throughout the
network we use leaky ReLU non-linearities for all feature
maps. We furthermore replace the traditional batch
normalization before activation functions with instance
normalization in the case of small batch size.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.1.3. Training procedure</title>
        <p>Our network architecture is trained with 128x128x128
voxels obtained by zooming the original CT images and
batch size 1, due to the memory limitation of GPU. The
employed optimizer is Adam with an initial learning rate
lr=5x10-4 , the following learning rate schedule: after every
epoch, the learning rate is reduced to be 98.5% of the
original and a L2 weight decay of 10-5 .</p>
        <p>In order to deal with the class imbalance in the data, we
use a multiclass Dice loss function, i.e. the average of Dice
loss functions of the four classes. During training, flipping
at three axes on the fly is applied to prevent overfitting.
Note that we segment the image to obtain the localization.</p>
      </sec>
      <sec id="sec-3-4">
        <title>2.1.4. Post-processing</title>
        <p>At inference phase, the prediction is upsampled by a nearest
neighbor interpolation to match the shape of the original
input scan. After that, for each organ, we remove the
connect regions whose volumes are smaller than 5% of the
volume of the maximal connect region.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2. Segmentation</title>
      <sec id="sec-4-1">
        <title>2.2.1. Training data</title>
        <p>In essence, we use the same network architecture and same
training procedure to obtain the segmentation model except
for using different training data. Noting that the training data
of the localization model is zoomed from the original CT
image, so its resolution is lower, which is enough for
localization, but not good for segmentation.</p>
        <p>For saving the resolution as much as possible, we clip
the original data by extending the bounding box which
exactly contains the groundtruth with 10 pixels in all
directions. If one side of this clipped cube is smaller than
128, we replace it by 128. That is, using a bounding box of
shape 128x128x128 to be the lower bound. As additional
data argument, bounding boxes of shape 200x200x150, and
256x256x200 are also used. Then resize this clipped cube to
128x128x128 as our input of segmentation module.</p>
      </sec>
      <sec id="sec-4-2">
        <title>2.2.2. Inference procedure</title>
        <p>At inference phase of our two-stage segmentation, we first
use the localization model to get a rough localization of four
organs, and basing on this information to clip the original
CT image and resize the clipped cube to 128x128x128.
Then send this resized cube to the segmentation model to
obtain the fine segmentation result, which becomes the final
result after post-processing and returning to the original
image. See Figure 2.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS</title>
      <p>We trained and evaluated our network on the SegTHOR
2019 training dataset (40 CTs) and test dataset (20 CTs)[4].
No external data was used and the network was trained
from scratch. To improve performance further, we added a
multi-test step on the esophagus, and consider more detailed
post-processing on the segmentation of heart and esophagus.
It takes about 1 minute for our algorithm to obtain the
segmentation result from one raw CT image. Table 1 gives
the performance of our algorithm on the test dataset.</p>
    </sec>
    <sec id="sec-6">
      <title>4. REFERENCES</title>
      <p>[1] Roger Trullo, et al. "Fully automated esophagus segmentation
with a hierarchical deep learning approach." 2017 IEEE
International Conference on Signal and Image Processing
Applications (ICSIPA) IEEE, 2017.
[3] Isensee, Fabian , et al. "Brain Tumor Segmentation and
Radiomics Survival Prediction: Contribution to the BRATS 2017
Challenge." International MICCAI Brainlesion WorkshopSpringer,
Cham, 2017.
[4] Roger Trullo, Caroline Petitjean, Su Ruan, Bernard Dubray,
Dong Nie, and Dinggang Shen. Segmentation of organs at risk in
thoracic CT images using a sharpmask architecture and conditional
random fields. In IEEE International Symposium on Biomedical
Imaging (ISBI), pp. 1003-06, 2017.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>