<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Task Learning for the Segmentation of Thoracic Organs at Risk in CT Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author"><string-name><given-names>Tao</given-names> <surname>He</surname></string-name></contrib>
        <contrib contrib-type="author"><string-name><given-names>Jixiang</given-names> <surname>Guo</surname></string-name></contrib>
        <contrib contrib-type="author"><string-name><given-names>Jianyong</given-names> <surname>Wang</surname></string-name></contrib>
        <contrib contrib-type="author"><string-name><given-names>Xiuyuan</given-names> <surname>Xu</surname></string-name></contrib>
        <contrib contrib-type="author"><string-name><given-names>Zhang</given-names> <surname>Yi</surname></string-name></contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Machine Intelligence Laboratory, Sichuan University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>College of Computer Science, Sichuan University</institution>
          ,
          <addr-line>Chengdu 610065</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The automatic segmentation of thoracic organs has clinical significance. In this paper, we extend the U-Net architecture to obtain a uniform U-like encoder-decoder segmentation architecture for the segmentation of thoracic organs. The encoder part of this architecture can directly incorporate widely used networks (DenseNet or ResNet) by omitting their last linear connection layers. We observe that individual organs do not appear independently in a CT slice. Therefore, we propose to use multi-task learning for the segmentation of thoracic organs: the major task focuses on local pixel-wise segmentation and the auxiliary task focuses on global slice classification. Multi-task learning has two merits. Firstly, the auxiliary task can improve the generalization performance by being learned concurrently with the main task. Secondly, the prediction accuracy of the auxiliary task reaches almost 98% on the validation set, so its predictions can be used to filter false positive segmentation results. The proposed method was tested on the Segmentation of THoracic Organs at Risk (SegTHOR) challenge (submitted name: MILab, as of March 21, 2019, 8:44 a.m. UTC) and achieved second place in the “All” ranking and second place in the “Esophagus” ranking, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Index Terms</title>
      <p>Multi-task learning, automatic segmentation, CT, U-Net</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Contrast-enhanced computed tomography (CT) is a
widely used clinical tool for diagnosing many thoracic
diseases. Manual segmentation of thoracic organs from CT
images is tedious and very time-consuming. Automatic
segmentation from CT images will help oncologists to
delineate the thoracic organs at risk. In this paper, we focus
on the automatic segmentation of thoracic organ data, supported
by the Segmentation of THoracic Organs at Risk (SegTHOR) [1]
challenge. The segmentation task is challenging for the following
reasons: (1) the shape and position of each organ on CT slices
vary greatly between patients; (2) the contours in CT images
have low contrast and can be absent. The challenge focuses on
four organs at risk: the heart, aorta, trachea, and esophagus.</p>
      <p>
        Recently, developments in automatic segmentation based
on deep learning have surpassed traditional feature-extraction
methods. The archetypal medical segmentation model is
U-Net [2], which has carefully designed encoder and decoder
parts with shortcut connections. The most significant advantage
of shortcut connections is that they combine low-level features
with high-level features at different layers. In recent years,
many similar models, termed encoder-decoder architectures,
have been proposed, for example SegNet [3] and the DeepLab
series of networks [
        <xref ref-type="bibr" rid="ref1">4, 5</xref>
        ].
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ], an H-DenseUNet was proposed for liver and tumor
segmentation, where intra-slice and inter-slice features were
extracted and jointly optimized through a hybrid feature
fusion layer. In [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ], a 3D Deeply Supervised Network
(3D-DSN) was proposed to address the liver segmentation
problem. The 3D-DSN injects additional supervision into
hidden layers to counteract the adverse effects of gradient
vanishing, and achieved state-of-the-art results on the
MICCAI-SLiver07 dataset. V-Net [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ] is essentially a 3D
version of U-Net, applied directly to volumetric segmentation
of prostate MRI volumes.
      </p>
      <p>
        3D-CNN-based models fully exploit spatial features, but
training a 3D-CNN-based model is usually time-consuming and
requires a large parameter capacity. Therefore, many previous
works employed 2D-CNNs and trained them on 2.5D data, which
consists of a stack of adjacent slices given as input, with the
lesion regions predicted for the center slice. To achieve
accurate segmentation results with 2D-CNNs, the authors of [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ] proposed a two-step segmentation framework. In the
first step, an FCN was trained to segment the liver as the ROI
input for the second FCN, which solely segmented lesions from
the predicted liver ROIs of step 1. This two-step segmentation
framework has been widely adopted in many segmentation works [
        <xref ref-type="bibr" rid="ref2 ref5 ref6">9, 6, 10</xref>
        ].
      </p>
      <p>In this paper, we propose a uniform U-like
encoder-decoder segmentation architecture. The previous U-Net
and its variants usually have symmetrical encoder and decoder
parts. In the uniform U-like architecture, the encoder part
can directly incorporate the widely used popular networks
(ResNet or DenseNet) by omitting their last linear connection
layers. The encoder therefore has a stronger nonlinear mapping
ability and can adopt transfer learning by initializing its
parameters with popular networks trained on image classification.
The decoder part only works on enlarging the size of the feature
maps and shrinking the number of channels. The uniform
U-like architecture is trained under the multi-task learning
scheme. The major task focuses on the local pixel-wise
segmentation and the auxiliary task focuses on the global slice
classification. There are two merits of multi-task learning.
Firstly, the auxiliary task can improve the generalization
performance by being learned concurrently with the main task.
Secondly, the predictions of the auxiliary task are used to
filter the false positive segmentation results.</p>
    </sec>
    <sec id="sec-3">
      <title>2. METHOD</title>
      <p>In this section, we will introduce the multi-task learning
scheme and the uniform U-like encoder-decoder architecture.</p>
    </sec>
    <sec id="sec-4">
      <title>2.1. Multi-task Learning</title>
      <p>During the automatic segmentation of thoracic organs on the
SegTHOR challenge data, we found that individual organs do not
appear independently in one slice. In Fig. (1), we give a
detailed macro view of Patient01’s CT slices, marking with Y/N
which organs are present in each slice. All patients have a
similar macro appearance order; in other words, the organs
appear dependently. If we can learn this macro classification,
we can use the classification results to filter the false
positive predictions of each organ. We apply the multi-task
learning scheme to concurrently learn the segmentation and
classification tasks, using the combined cost function</p>
      <disp-formula id="eq1">
        <label>(1)</label>
        <tex-math>L = \left(1 - \frac{1}{K_s}\sum_{k=1}^{K_s}\frac{2\sum_{(i,j)\in D} p_{ikj}\,g_{ikj}}{\sum_{(i,j)\in D} p_{ikj} + \sum_{(i,j)\in D} g_{ikj}}\right) - \frac{\lambda}{K_c}\sum_{k=1}^{K_c}\left(h_k \log q_k + (1 - h_k)\log(1 - q_k)\right)</tex-math>
      </disp-formula>
      <p>where Ks = 5 and Kc = 4 indicate the number of segmentation
and classification categories, respectively, and D is the set of
pixel coordinates of a slice. The major segmentation task is
trained with the dice loss and the auxiliary classification task
is trained with multi-label logistic regression. In the dice
loss part, pikj and gikj are the kth output produced by a
softmax function and the kth one-hot target of pixel (i, j),
respectively. In the multi-label logistic regression part, qk
and hk are the kth output produced by the corresponding logistic
function and the kth target, respectively. The coefficient λ is
used to balance the two losses; in our experiments, we set
λ = 0.5.</p>
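      <p>As an illustration of the scheme above, the combined cost of Eq. (1) can be sketched in PyTorch; the helper name multi_task_loss and the tensor layout are our own assumptions, not the authors' released code:</p>
      <preformat>
```python
import torch
import torch.nn as nn


def multi_task_loss(seg_logits, seg_target, cls_logits, cls_target,
                    lam=0.5, eps=1e-6):
    """Dice loss over K_s = 5 segmentation classes plus a multi-label
    logistic (BCE) loss over K_c = 4 organ-presence labels, balanced
    by lam as in the paper (lam = 0.5)."""
    # seg_logits: (B, K_s, H, W); seg_target: (B, H, W) integer class map
    probs = torch.softmax(seg_logits, dim=1)
    one_hot = nn.functional.one_hot(seg_target, num_classes=probs.shape[1])
    one_hot = one_hot.permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice_loss = 1 - ((2 * inter + eps) / (denom + eps)).mean()
    # cls_logits, cls_target: (B, K_c), targets in {0, 1}
    bce = nn.functional.binary_cross_entropy_with_logits(cls_logits, cls_target)
    return dice_loss + lam * bce
```
      </preformat>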
    </sec>
    <sec id="sec-5">
      <title>2.2. Uniform U-like Encoder-Decoder Architecture</title>
      <p>In most segmentation tasks, manual labelling is
time-consuming; therefore, the training sets are usually small.
Transfer learning is a very useful strategy for training a
network on a small data set. In order to apply transfer learning
to the SegTHOR challenge, we abstract a uniform U-like
encoder-decoder architecture, in which the encoder part can
directly incorporate the widely used ResNet or DenseNet by
omitting their last linear connection layers. The encoder part
can adopt transfer learning by initializing the encoder's
parameters with the corresponding networks trained on image
classification. The decoder part only works on enlarging the
size of the feature maps and shrinking the number of channels.
The U-like architecture is depicted in Fig. (2).</p>
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENT</title>
      <p>The SegTHOR Challenge dataset provides 40 thoracic 3D CT
scans for training and 20 for testing. We randomly split the
given 40 training CT volumes into 32 for training and 8 for
validation. The 3D CT scans were cut into slices along the
z-axis. Under the uniform U-like architecture, the encoder part
is freely configurable. We implemented 6 widely used networks
as the encoder part: ResNet-101, ResNet-152, DenseNet-121,
DenseNet-161, DenseNet-169, and DenseNet-201. Their decoder
parts involved only one convolutional layer to shrink the
number of channels.</p>
      <p>
        Training stopped when the dice per case on the validation
set did not improve for 10 epochs. In order to fully use the
given data, we then reloaded the trained model and retrained it
on the full 40 training volumes for a fixed 10 epochs. All
networks were implemented in PyTorch [
        <xref ref-type="bibr" rid="ref7">11</xref>
        ] and
trained using stochastic gradient descent with a momentum of
0.9. All networks were trained on images at the original
resolution in the form of 2.5D data, consisting of 3 adjacent
axial slices. The image intensity values of all scans were
truncated to the range of [-128, 384] HU to omit irrelevant
information. The initial learning rate was 0.01 and was decayed
by a factor of 0.9. For data augmentation, we adopted random
horizontal and vertical flipping and scaling between 0.6 and 1
to alleviate the overfitting problem. The networks were trained
on four NVIDIA Titan Xp GPUs, which took about 6-8 hours. After
testing, we used largest-connected-component labeling to refine
the segmentation results of each organ. The final submitted
result is the ensemble of those 6 U-like networks. The
experimental results are listed in Table 1. We achieved second
place in the “All” ranking and second place in the “Esophagus”
ranking, respectively.
      </p>
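      <p>The preprocessing described above can be sketched as follows; the function name, the [0, 1] rescaling, and the edge-padding of boundary slices are our assumptions (the paper does not specify how edge slices are handled):</p>
      <preformat>
```python
import numpy as np


def make_25d_slices(volume, lo=-128, hi=384):
    """Turn a 3D CT volume (Z, H, W) into 2.5D samples: each sample
    stacks 3 adjacent axial slices as channels, with intensities
    truncated to the [lo, hi] HU window used in the paper and scaled
    to [0, 1]. Edge slices are padded by repetition."""
    vol = np.clip(volume.astype(np.float32), lo, hi)
    vol = (vol - lo) / (hi - lo)                  # scale to [0, 1]
    padded = np.pad(vol, ((1, 1), (0, 0), (0, 0)), mode='edge')
    # sample i carries slices i-1, i, i+1 and predicts the label of slice i
    return np.stack([padded[i:i + 3] for i in range(volume.shape[0])])
```
      </preformat>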
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION</title>
      <p>The uniform U-like architecture is abstracted from the
widely used U-Net. The encoder part of the uniform U-like
architecture can be set to different network structures, and
transfer learning is easy to apply in this design. In our
experiments, transfer learning accelerated the training of those
networks and boosted their performance. Multi-task learning is
helpful for discovering the organs' dependence; however, we did
not analyze its advantages in detail because of the challenge's
time limit.</p>
      <p>We emphasize that connected component labeling is very
useful for the SegTHOR challenge, since all organs are
indivisible and our method was based on 2D-CNNs. Since the
SegTHOR data set provides fewer CT scans than other segmentation
tasks, the trained networks overfit easily. Therefore, the
ensemble strategy is also necessary for the SegTHOR
challenge.</p>
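      <p>The refinement step described above can be sketched with SciPy's connected-component labeling; the function name and the per-organ binary-mask interface are our assumptions:</p>
      <preformat>
```python
import numpy as np
from scipy import ndimage


def keep_largest_component(mask):
    """Post-processing sketch: keep only the largest 3D connected
    component of a binary organ mask, discarding smaller
    false-positive islands."""
    labeled, n = ndimage.label(mask)  # label connected components
    if n == 0:
        return mask                   # empty prediction: nothing to refine
    sizes = ndimage.sum(mask, labeled, index=range(1, n + 1))
    largest = 1 + int(np.argmax(sizes))
    return (labeled == largest).astype(mask.dtype)
```
      </preformat>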
    </sec>
    <sec id="sec-8">
      <title>5. REFERENCES</title>
      <p>[1] Roger Trullo, C. Petitjean, Su Ruan, Bernard Dubray,
Dong Nie, and Dinggang Shen, “Segmentation of
organs at risk in thoracic CT images using a SharpMask
architecture and conditional random fields,” in IEEE
14th International Symposium on Biomedical Imaging
(ISBI), 2017, pp. 1003–1006.
[2] Olaf Ronneberger, Philipp Fischer, and Thomas Brox,
“U-Net: Convolutional networks for biomedical image
segmentation,” in Proceedings of Medical Image
Computing and Computer-Assisted Intervention (MICCAI),
2015, pp. 234–241.
[3] Vijay Badrinarayanan, Alex Kendall, and Roberto
Cipolla, “SegNet: A deep convolutional
encoder-decoder architecture for image segmentation,” IEEE
Transactions on Pattern Analysis and Machine
Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[4] Liang-Chieh Chen, George Papandreou, Iasonas
Kokkinos, Kevin Murphy, and Alan L. Yuille, “DeepLab:
Semantic image segmentation with deep convolutional
nets, atrous convolution, and fully connected CRFs,” IEEE
Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, no. 4, pp. 834–848, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Liang-Chieh</given-names>
            <surname>Chen</surname>
          </string-name>
          , Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, “
          <article-title>Encoder-decoder with atrous separable convolution for semantic image segmentation</article-title>
          ,
          <source>” in European Conference on Computer Vision (ECCV)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>833</fpage>
          -
          <lpage>851</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Xiaomeng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hao</given-names>
            <surname>Chen</surname>
          </string-name>
          , Xiaojuan Qi, Qi Dou, ChiWing Fu, and Pheng-Ann Heng, “
          <article-title>H-denseunet: Hybrid densely connected unet for liver and tumor segmentation from CT volumes</article-title>
          ,
          <source>” IEEE Transactions on Medical Imaging</source>
          , vol.
          <volume>37</volume>
          , no.
          <issue>12</issue>
          , pp.
          <fpage>2663</fpage>
          -
          <lpage>2674</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Qi</given-names>
            <surname>Dou</surname>
          </string-name>
          , Hao Chen, Yueming Jin, Lequan Yu,
          <string-name>
            <given-names>Jing</given-names>
            <surname>Qin</surname>
          </string-name>
          , and Pheng-Ann Heng, “
          <article-title>3D deeply supervised network for automatic liver segmentation from CT volumes</article-title>
          ,”
          <source>in Medical Image Computing and Computer-Assisted Intervention (MICCAI)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Fausto</given-names>
            <surname>Milletari</surname>
          </string-name>
          , Nassir Navab, and
          <string-name>
            <given-names>Seyed-Ahmad</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          , “
          <article-title>V-net: Fully convolutional neural networks for volumetric medical image segmentation,”</article-title>
          <source>in Proceedings of International Conference on 3D Vision</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>565</fpage>
          -
          <lpage>571</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Patrick Ferdinand</given-names>
            <surname>Christ</surname>
          </string-name>
          , Mohamed Ezzeldin A. Elshaer, Florian Ettlinger, Sunil Tatavarty, Marc Bickel, Patrick Bilic, Markus Rempfler, Marco Armbruster, Felix Hofmann, Melvin D'Anastasi, Wieland H. Sommer, Seyed-Ahmad Ahmadi, and Bjoern H. Menze, “
          <article-title>Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields</article-title>
          ,”
          <source>in Proceedings of Medical Image Computing and Computer-Assisted Intervention (MICCAI)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>415</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Yuyin</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Lingxi Xie,
          <string-name>
            <surname>Elliot K. Fishman</surname>
          </string-name>
          , and Alan L. Yuille, “
          <article-title>Deep supervision for pancreatic cyst segmentation in abdominal CT scans</article-title>
          ,”
          <source>in Proceedings of Medical Image Computing and Computer Assisted Intervention (MICCAI)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>222</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Paszke</surname>
          </string-name>
          , Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
          <string-name>
            <given-names>Zachary</given-names>
            <surname>DeVito</surname>
          </string-name>
          , Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “
          <article-title>Automatic differentiation in pytorch,”</article-title>
          <source>in the Workshop of Conference on Neural Information Processing Systems (NIPS Workshop).</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>