<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SEGMENTATION OF THORACIC ORGANS USING PIXEL SHUFFLE</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dmitry Lachinov</string-name>
          <email>dmitry.lachinov@intel.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intel</institution>
          ,
          <addr-line>Nizhny Novgorod</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper summarizes our contribution to the SegTHOR challenge, which addresses the problem of organs-at-risk segmentation in CT images. For lung cancer treatment, segmentation of nearby healthy organs is essential. Organ delineation is largely manual and is therefore a potential source of mistakes; at present, segmenting the organs located close to the tumor is a routine and tedious procedure. To simplify this procedure, we study approaches to automatic organ segmentation in CT images. Our solution is based on deep learning and explores two concepts: an attention mechanism and pixel shuffle as an upsampling operator. In this study, we describe our approach in detail and evaluate it on the test data provided by the challenge organizers. Without any post-processing, our method achieves the following intermediate Dice scores: 0.8303, 0.9381, 0.9088, and 0.9353 for Esophagus, Heart, Trachea, and Aorta, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Processing</kwd>
        <kwd>Pixel Shuffle</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>3D computed tomography (CT) is a powerful tool for examining the
human body. As a noninvasive diagnostic method, it has been firmly
integrated into many therapy protocols. Despite its advantages, the
method has drawbacks: noise, low image contrast, and even the absence
of visible organ contours are the main challenges in CT scan analysis.
In addition, the three-dimensional representation of the input data
restricts the ways in which a scan can be analyzed. Manual methods are
tedious and require a high level of concentration, and any mistake made
during scan analysis can become a serious problem in further therapy.
We therefore focus on developing an automatic solution for the
segmentation routine.</p>
      <p>With the recent advances in machine learning, and deep
learning in particular, the scientific community has developed
techniques for vision tasks that outperform classic computer vision
methods. For semantic segmentation, Fully Convolutional Networks
(FCNs) [1] were the first to achieve good performance, and modern
neural network architectures build on the same concept. Competition
in natural-image semantic segmentation is high: FCNs [1], SegNet [2],
the DeepLab architectures [3], PSPNet [4], and others all perform
well. In medical image semantic segmentation, UNet and its
variants [5, 6, 7, 8, 9, 10] show state-of-the-art results. In this
study, we adapt this existing framework to the segmentation of four
organs: Esophagus, Heart, Trachea, and Aorta.</p>
      <p>
        To address the problem of automatic organs-at-risk
segmentation, the SegTHOR challenge [
        <xref ref-type="bibr" rid="ref1">11</xref>
        ] sets the task of classifying the voxels of a 3D CT scan
into five classes: Background, Esophagus, Heart, Trachea, and Aorta.
Typically, this procedure is manual, takes a large amount of
time, and can produce recurring errors. In this paper, we
propose an automatic solution to this problem and
evaluate it on the data provided by the challenge organizers.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>In this section, we describe the prior work on which this
paper builds.</p>
      <p>
        We base our model on the well-known UNet architecture [5],
introduced by Olaf Ronneberger et al. for biomedical image
segmentation, and cell segmentation in particular. The
network consists of an encoding path and a decoding path.
Skip connections between these paths improve localization
and also help mitigate the vanishing gradient problem. The
high number of channels in the contracting part of the network
allows information to propagate further to the higher-resolution
layers. Later, attention mechanisms incorporated into the UNet
architecture were studied in [
        <xref ref-type="bibr" rid="ref2">12</xref>
        ] and [
        <xref ref-type="bibr" rid="ref3">13</xref>
        ].
      </p>
      <p>
        The second concept we use is the neural network with
residual connections, known as ResNet [
        <xref ref-type="bibr" rid="ref4">14</xref>
        ]. Its authors
propose a deep architecture that can be trained efficiently:
residual blocks let gradients propagate to the early layers of the
network. Later, different residual block architectures [
        <xref ref-type="bibr" rid="ref5">15</xref>
        ] were studied.
      </p>
      <p>
        For the task of image super-resolution, Shi et al. [
        <xref ref-type="bibr" rid="ref6">16</xref>
        ]
proposed pixel shuffle as an upsampling operator. The
idea of pixel shuffle is illustrated in Figure 1. The operator
rearranges input channels to produce a feature map with higher
resolution. Notably, this technique addresses the problem of
checkerboard artifacts in the output image. Later, the
same concept was employed for semantic segmentation tasks
[
        <xref ref-type="bibr" rid="ref7 ref8">17, 18</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. OUR METHOD</title>
      <sec>
        <title>3.1. Data</title>
        <p>The organizers split the dataset into two parts: training and
testing. The training part contains 40 CT images with in-plane pixel
spacing varying between 0.90 mm and 1.37 mm. The majority of
the images in the training set have a 512x512 slice resolution. The
number of slices varies from 150 to 284. Ground truth
labels are provided for every image in the training dataset and
contain manual segmentations of five classes:
Background, Esophagus, Heart, Trachea, and Aorta. An example of
a CT scan and the corresponding labels is shown in
Figure 2. The provided data have not been preprocessed.</p>
        <p>The testing dataset contains 20 images. Participants are allowed
a total of 10 submissions to test their methods.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3.2. Preprocessing</title>
      <p>Since the spatial resolution differs between images in both the
training and testing data, the first step of our preprocessing pipeline
resamples every image to a resolution of 2 × 2 × 2.5 mm³. Next,
we locate the body region by applying a median filter, which eliminates
the examination table from the picture, and crop that region from the
original image. Finally, we compute the mean and standard deviation of
the body voxels and normalize all image voxels with these values.</p>
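      <p>The final normalization step can be sketched as follows. This is a minimal NumPy illustration, not the author's code: the function name <monospace>normalize_by_body</monospace> and the boolean <monospace>body_mask</monospace> argument are hypothetical, and the mask is assumed to come from the body-cropping step described above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def normalize_by_body(image, body_mask):
    """Z-score every voxel using statistics of the body voxels only.

    image:     float array of any shape (e.g. a resampled, cropped CT volume)
    body_mask: boolean array of the same shape, True inside the body
               (hypothetical input; how the mask is built is not shown here)
    """
    body = image[body_mask]
    mean = float(body.mean())
    # Guard against a degenerate (constant) body region.
    std = max(float(body.std()), 1e-8)
    return (image - mean) / std
```
]]></preformat>
      <p>After this step the body voxels have approximately zero mean and unit variance, while voxels outside the body are scaled with the same two constants.</p>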
    </sec>
    <sec id="sec-5">
      <title>3.3. Method</title>
      <p>We employ a fully convolutional neural network
architecture based on UNet, with skip connections between the
contracting and expanding paths and an exponentially growing number
of channels across consecutive spatial resolution levels. We
set the starting number of feature channels in the network
to 16.</p>
      <p>
        The encoding part of our architecture is a residual
network [
        <xref ref-type="bibr" rid="ref4">14</xref>
        ] of depth 3, with 3, 4, and 6 full
pre-activation residual blocks at the respective levels. In our
experiments, we noticed that deeper networks do not
improve results but increase the computational workload and can
be a source of overfitting due to the large number
of parameters. Instead of Batch Normalization [
        <xref ref-type="bibr" rid="ref9">19</xref>
        ] we
use Group Normalization [
        <xref ref-type="bibr" rid="ref10">20</xref>
        ] with the number of groups
equal to 4. As the activation function, we use Leaky ReLU
with a slope of 0.2.
      </p>
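      <p>To make the normalization and activation choices concrete, the sketch below applies group normalization with 4 groups and a Leaky ReLU with slope 0.2 to a single feature volume in plain NumPy. It is an illustration under stated assumptions rather than the network code: the function names are hypothetical, there is no batch axis, and the learnable per-channel scale and shift that library implementations normally include are omitted.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def group_norm(x, num_groups=4, eps=1e-5):
    """Normalize a (C, D, H, W) feature volume within groups of channels.

    Consecutive channels are assigned to the same group, and each group is
    normalized to zero mean and unit variance over all of its values.
    """
    channels = x.shape[0]
    assert channels % num_groups == 0, "channels must divide evenly into groups"
    grouped = x.reshape(num_groups, -1)
    mean = grouped.mean(axis=1, keepdims=True)
    var = grouped.var(axis=1, keepdims=True)
    return ((grouped - mean) / np.sqrt(var + eps)).reshape(x.shape)


def leaky_relu(x, slope=0.2):
    """Identity for non-negative inputs, multiplication by `slope` otherwise."""
    return np.where(x >= 0, x, slope * x)
```
]]></preformat>
      <p>Because the statistics are computed per sample rather than per batch, they do not depend on the batch size, which is one common motivation for this choice when batches are small.</p>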
      <p>
        In the expanding part of the network, we employ two
consecutive convolutions followed by an activation at each scale. As the
upsampling operator, we adapted the pixel shuffle [
        <xref ref-type="bibr" rid="ref6">16</xref>
        ]
technique to handle three-dimensional input. An example is
shown in Figure 3, where eight three-dimensional feature
maps produce a single three-dimensional output feature map
with higher spatial resolution.
      </p>
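      <p>A three-dimensional pixel shuffle of this kind can be sketched in NumPy as a reshape followed by an axis permutation. This is a minimal illustration, not the author's implementation: the function name is hypothetical, and the channel ordering (which of the eight input maps lands at which sub-voxel position) is an assumption that extends the usual two-dimensional convention.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def pixel_shuffle_3d(x, r=2):
    """Rearrange (C * r**3, D, H, W) into (C, D * r, H * r, W * r).

    Assumed mapping:
    out[c, d*r + i, h*r + j, w*r + k] = x[c*r**3 + i*r**2 + j*r + k, d, h, w]
    """
    c_in, d, h, w = x.shape
    assert c_in % r**3 == 0, "channel count must be divisible by r**3"
    c = c_in // r**3
    # Split the channel axis into (C, r, r, r).
    x = x.reshape(c, r, r, r, d, h, w)
    # Interleave each upscaling factor with its spatial axis.
    x = x.transpose(0, 4, 1, 5, 2, 6, 3)  # (C, D, r, H, r, W, r)
    return x.reshape(c, d * r, h * r, w * r)
```
]]></preformat>
      <p>With r = 2, eight input feature maps of size D x H x W yield one output map of size 2D x 2H x 2W. The operation only moves values, so it has no parameters of its own.</p>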
      <p>
        In addition, we employ the attention mechanism
described in [
        <xref ref-type="bibr" rid="ref3">13</xref>
        ]. Due to the nature of the annotation
protocol, we found the attention mechanism to work especially well
on this benchmark.
      </p>
      <p>
        We crop a region of size 176x96x128 from the input image
and randomly mirror it along the first two axes. We then
apply intensity shift augmentation.
      </p>
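      <p>The cropping and augmentation steps might look like the following NumPy sketch. Several details are assumptions made only for illustration and are not stated in the text: the crop position is drawn uniformly at random, each mirror is applied with probability 0.5, and the intensity shift is a single additive offset drawn from a hypothetical range <monospace>[-max_shift, max_shift]</monospace>.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def augment(image, labels, rng, crop=(176, 96, 128), max_shift=0.1):
    """Crop a region, mirror it along the first two axes, shift intensities.

    image, labels: arrays of identical 3D shape, at least as large as `crop`
    rng:           numpy.random.Generator supplying all randomness
    """
    assert image.shape == labels.shape
    assert all(s >= c for s, c in zip(image.shape, crop))
    # Assumed: crop position chosen uniformly at random.
    starts = [int(rng.integers(0, s - c + 1)) for s, c in zip(image.shape, crop)]
    region = tuple(slice(st, st + c) for st, c in zip(starts, crop))
    img, lab = image[region], labels[region]
    # Mirror image and labels together so they stay aligned.
    for axis in (0, 1):
        if rng.random() < 0.5:
            img, lab = np.flip(img, axis), np.flip(lab, axis)
    # Assumed: one additive offset applied to the whole crop, image only.
    img = img + rng.uniform(-max_shift, max_shift)
    return img, lab
```
]]></preformat>
      <p>Applying the same crop and flips to the label volume keeps every voxel paired with its class, while the intensity shift touches the image alone.</p>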
      <p>The loss function we use is the Dice loss, which can
be written as follows:</p>
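      <p>
        <disp-formula><tex-math>L_{\mathrm{Dice}}(\mathrm{gt}, \mathrm{pred}) = 2 \, \frac{\sum \mathrm{gt} \cdot \mathrm{pred} + \epsilon}{\sum \left( \mathrm{gt}^{2} + \mathrm{pred}^{2} \right) + \epsilon}</tex-math></disp-formula>
      </p>
      <p>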
where gt denotes the one-hot encoded ground truth labels and pred
denotes the output logits. For optimization, we use Adam with an
initial learning rate of 1e-3, decayed by a factor
of 0.1 at the 7th and 9th epochs. To evaluate performance, we
use a cross-validation scheme with four splits. We train
our network on three NVIDIA GTX 1080 Ti GPUs with the
PyTorch framework [
        <xref ref-type="bibr" rid="ref11">21</xref>
        ]. The network is trained with a batch
size of 6 for 10 epochs of 3200 iterations each.
The whole training takes approximately one day.
      </p>
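      <p>As a numerical illustration, the Dice expression can be evaluated as below. The sketch follows our reading of the formula in this section: twice the smoothed overlap divided by the smoothed sum of squares. The function name and the value of the smoothing constant are assumptions, and the text does not state whether training minimizes the negative of this quantity or one minus it, so the function returns the raw value.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def dice_score(gt, pred, eps=1e-6):
    """Dice expression 2 * (sum(gt * pred) + eps) / (sum(gt^2 + pred^2) + eps).

    gt:   one-hot encoded ground truth, any shape
    pred: network output of the same shape
    eps:  smoothing constant (assumed value, not given in the text)
    """
    overlap = np.sum(gt * pred)
    squares = np.sum(gt**2 + pred**2)
    return 2.0 * (overlap + eps) / (squares + eps)
```
]]></preformat>
      <p>For a prediction identical to a non-empty one-hot target the value is close to 1, and for a prediction with no overlap it is close to 0.</p>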
    </sec>
    <sec id="sec-6">
      <title>4. EVALUATION</title>
      <p>For evaluation, we use cross-validation with four
splits. Since no validation dataset
was provided and the number of training samples was
limited, we considered this the best option for tracking the
performance of our experiments. The cross-validation accuracy
of our model on the training dataset is reported in Table 1.</p>
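      <p>A four-way split of the 40 training scans could be generated as in the sketch below. The function name, the shuffling, and the fixed seed are assumptions for illustration; the text does not say how scans were assigned to folds.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def k_fold_splits(n_samples, k=4, seed=0):
    """Yield (train_indices, val_indices) for each of k folds.

    Every sample appears in exactly one validation fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffled assignment
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val
```
]]></preformat>
      <p>With 40 scans and four folds, each model is trained on 30 scans and validated on the remaining 10.</p>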
    </sec>
    <sec id="sec-7">
      <title>5. INTERMEDIATE RESULTS AND CONCLUSION</title>
      <p>The scores reported by the testing system are listed in Table
1. Comparing the cross-validation and testing values, we
notice that the Dice scores for cross-validation are consistently lower
than the test results. This might indicate that the training dataset is
more diverse and contains more difficult samples.</p>
      <p>In conclusion, the model proposed in this paper achieves
the following intermediate Dice scores:
0.8303, 0.9381, 0.9088, and 0.9353 for Esophagus, Heart,
Trachea, and Aorta, respectively. These results are
obtained with no post-processing in the segmentation
pipeline.</p>
    </sec>
    <sec id="sec-8">
      <title>6. REFERENCES</title>
      <p>[1] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3431–3440.</p>
      <p>[2] Vijay Badrinarayanan, Ankur Handa, and Roberto Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling,” arXiv preprint arXiv:1505.07293, 2015.</p>
      <p>[3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in The European Conference on Computer Vision (ECCV), September 2018.</p>
      <p>[4] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 6230–6239.</p>
      <p>[5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015.</p>
      <p>[10] Andriy Myronenko, “3d mri brain tumor segmentation using autoencoder regularization,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Alessandro Crimi, Spyridon Bakas, Hugo</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Trullo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Petitjean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dubray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Shen</surname>
          </string-name>
          , “
          <article-title>Segmentation of organs at risk in thoracic ct images using a sharpmask architecture and conditional random fields</article-title>
          ,” in
          <source>2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017)</source>
          ,
          <month>April</month>
          <year>2017</year>
          , pp.
          <fpage>1003</fpage>
          -
          <lpage>1006</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Ozan</given-names>
            <surname>Oktay</surname>
          </string-name>
          , Jo Schlemper, Loic Le Folgoc,
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Lee</surname>
          </string-name>
          , Mattias Heinrich, Kazunari Misawa, Kensaku Mori,
          <string-name>
            <given-names>Steven</given-names>
            <surname>McDonagh</surname>
          </string-name>
          , Nils Y Hammerla,
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Kainz</surname>
          </string-name>
          , et al., “
          <article-title>Attention u-net: learning where to look for the pancreas</article-title>
          ,” arXiv preprint arXiv:1804.03999,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Ruirui</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mingming</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          and JiaCheng Li, “
          <article-title>Connection sensitive attention u-net for accurate retinal vessel segmentation</article-title>
          ,”
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “
          <article-title>Deep residual learning for image recognition</article-title>
          ,” in
          <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “
          <article-title>Identity mappings in deep residual networks</article-title>
          ,” in
          <source>European conference on computer vision</source>
          . Springer,
          <year>2016</year>
          , pp.
          <fpage>630</fpage>
          -
          <lpage>645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Wenzhe</given-names>
            <surname>Shi</surname>
          </string-name>
          , Jose Caballero, Ferenc Huszár, Johannes Totz,
          <string-name>
            <given-names>Andrew P</given-names>
            <surname>Aitken</surname>
          </string-name>
          , Rob Bishop, Daniel Rueckert, and Zehan Wang, “
          <article-title>Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network</article-title>
          ,” in
          <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1874</fpage>
          -
          <lpage>1883</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Kaiqiang</given-names>
            <surname>Chen</surname>
          </string-name>
          , Kun Fu, Menglong Yan,
          <string-name>
            <given-names>Xin</given-names>
            <surname>Gao</surname>
          </string-name>
          , Xian Sun, and Xin Wei, “
          <article-title>Semantic segmentation of aerial images with shuffling convolutional neural networks</article-title>
          ,”
          <source>IEEE Geoscience and Remote Sensing Letters</source>
          , vol.
          <volume>15</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>177</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Hongyang</given-names>
            <surname>Gao</surname>
          </string-name>
          , Hao Yuan,
          <string-name>
            <given-names>Zhengyang</given-names>
            <surname>Wang</surname>
          </string-name>
          , and Shuiwang Ji, “
          <article-title>Pixel deconvolutional networks</article-title>
          ,”
          <source>arXiv preprint arXiv:1705.06820</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Ioffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , “
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          ,”
          <source>arXiv preprint arXiv:1502.03167</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Yuxin</given-names>
            <surname>Wu</surname>
          </string-name>
          and Kaiming He, “Group normalization,”
          <source>in Proceedings of the European Conference on Computer Vision (ECCV)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Paszke</surname>
          </string-name>
          , Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
          <string-name>
            <given-names>Zachary</given-names>
            <surname>DeVito</surname>
          </string-name>
          , Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in pytorch,” in NIPS-W,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>