<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Vladimir Kondratenko, Dmitry Denisenko, Artem Pimkin, Mikhail Belyaev</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dmitry Denisenko</string-name>
          <email>dmitry.denisenko@skoltech.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artem Pimkin</string-name>
          <email>artem.pimkin@phystech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikhail Belyaev</string-name>
          <email>m.belyaev@skoltech.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkevich Institute for Information Transmission Problems</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Moscow Institute of Physics and Technology</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Skolkovo Institute of Science and Technology</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>SegTHOR 2019 is a competition held in conjunction with the IEEE ISBI 2019 conference that addresses the problem of organs at risk segmentation in Computed Tomography (CT) images. In this paper, we describe our best solution, based on convolutional neural networks, and the challenges that we faced during the competition. With this approach we finished in 24th place on the leaderboard.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional Networks</kwd>
        <kwd>Organs segmentation</kwd>
        <kwd>Computed Tomography (CT) images</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The problem of organs at risk segmentation arises in the field of
radiotherapy. Some approaches allow targeted irradiation of
the foci (usually a tumour) to "burn them out". In this case,
the doctor solves the following planning problem:
1. The tumour receives a dose of at least a certain threshold.
2. Most of the healthy tissue receives no more than a certain
(other) threshold.
3. Some critical structures (brain stem, optic nerves,
heart, etc.) receive a dose of no more than yet another
threshold (as a rule, much lower than the threshold from point 2).</p>
      <p>To carry out these calculations, both the foci and the
critical structures are contoured manually or semi-automatically. The
problem with the foci can be solved more or less well, but with
critical structures the situation is more complicated.</p>
      <p>Modern automatic segmentation approaches often combine
deep neural network architectures like U-net[1] with classic
computer vision. Since these techniques are state-of-the-art, we
decided to use them to create our solution. We discuss the
models' structure and parameter selection in detail in the
corresponding section of this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. PROBLEM</title>
      <p>The goal of the SegTHOR challenge was to automatically
segment four organs at risk (OAR): heart, aorta, trachea and esophagus. As
participants, we were provided with a training set of 40 CT scans
with manual segmentation[2]. The organisers then measured
the quality on a test set of 20 further CT scans, which was
published 1.5 months after the start of the challenge.</p>
      <sec id="sec-2-0">
        <title>2.1. Data</title>
        <p>We got the train and test data in the standard NIfTI-1 file
format. NIfTI-1 includes not only the scan itself but also
impressively rich meta information.</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.1.1. Features and limitations</title>
        <p>The typical limitations in biomedical imaging are the small
number of training samples and the extremely high resolution of
the scans (for example, a matrix of size 512 × 512 × 250). There is
also a list of natural problems connected with CT scans:</p>
        <p>1. People of different height.
2. Different apparatus settings and, hence, different voxel
(3d pixel) sizes.
3. Different equipment start times when creating a CT scan.
4. Implants (such as cardiac pacemakers).
5. Individual features of each person's body.
6. Errors in handmade segmentation.
7. Differences in markup between different doctors.</p>
        <p>NIfTI-1 format references:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3948928/,
https://nifti.nimh.nih.gov/nifti-1/,
https://brainder.org/2012/09/23/the-nifti-file-format/</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2.2. Quality metrics</title>
      <p>The competition uses two standard quality measures:</p>
      <p>1. The overlap Dice metric (DM), based on the pixel
labeling produced by the segmentation algorithm, defined as
DM = 2|A ∩ B| / (|A| + |B|), where A is the set of pixels in the
automatic segmentation and B the set of pixels in the manual one.</p>
      <p>2. The Hausdorff distance (HD), defined as max(h_a, h_b),
where h_a is the maximum distance, over all automatic
contour points, to the closest manual contour point,
and h_b is the maximum distance, over all manual
contour points, to the closest automatic contour point. The
Hausdorff distance is computed in mm thanks to the spatial
resolution.</p>
      <p>Furthermore, in our experiments we also decided to use the
so-called surface dice score[3], which nicely describes the
amount of work required for a doctor to fix an automatic
segmentation. We also found it useful for spotting poorly
performing models without even looking at the output. To
compute the surface dice we used the source code from the
DeepMind repository (https://github.com/deepmind/surface-distance).</p>
    </sec>
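    <p>As a small illustration of the Dice definition above, the metric can be computed directly from binary masks. This is our own toy sketch, not the organisers' evaluation code:</p>

```python
# Toy Dice metric on flat binary masks: DM = 2|A n B| / (|A| + |B|).
def dice(a, b):
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 2 * inter / total if total else 1.0

auto_mask   = [0, 1, 1, 1, 0, 0]  # hypothetical automatic segmentation
manual_mask = [0, 1, 1, 0, 0, 0]  # hypothetical manual segmentation
print(dice(auto_mask, manual_mask))  # 0.8
```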
    <sec id="sec-5">
      <title>3. PROPOSED SOLUTION</title>
      <p>In all experiments, we used the PyTorch library to create and train
models and the deep_pipe library (https://github.com/neuro-ml/deep_pipe)
to manage configurations and run them on an NVIDIA Tesla M40 GPU.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1. Baseline</title>
      <sec id="sec-6-1">
        <title>3.1.1. Network architecture</title>
        <p>We started without any preprocessing, using the 2d T-net
architecture[4]. We need to introduce the notation for the
network configuration, which we'll use several times later:
[[32, 32], shortcut(32, 32), [64, 32, 32]],
[[32, 64, 64], shortcut(64, 64), [128, 64, 32]],
[[64, 128, 128], shortcut(128, 128), [256, 128, 64]],
[[128, 256, 128]]
The numbers in brackets describe the number of filters in each
convolution. For example, the last row means that we
have 128 filters and feed them to a convolution with 256 filters,
then apply another convolution and get 128 feature maps.
The architecture scheme (Fig.1) is quite useful and
convenient for understanding. Unless otherwise stated, we
use (3, 3) convolutions in the 2d network case and (3, 3, 3)
in the 3d case.</p>
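        <p>The filter-list notation can be read as chains of convolutions, with consecutive numbers giving the (input, output) channel pairs. A minimal sketch of this reading (our interpretation for illustration, not the actual model code):</p>

```python
# Expand a filter list like [128, 256, 128] into (in_channels, out_channels)
# pairs for the chain of convolutions it denotes.
def conv_pairs(filters):
    return list(zip(filters[:-1], filters[1:]))

# Last row of the baseline config: 128 -> 256 -> 128 feature maps.
print(conv_pairs([128, 256, 128]))  # [(128, 256), (256, 128)]
```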
      </sec>
      <sec id="sec-6-2">
        <title>3.1.2. Training</title>
        <p>During the training, we picked random slices from the
images and fed them into the network with batch size 8. We
also used 5-fold cross-validation to find a sufficient number
of epochs to train. After that, we trained this model on a
train set of size 22. With this simple approach, we achieved
the following mean dice scores on a test set gathered from the
training data: 0.60 for esophagus, 0.83 for heart, 0.78 for trachea
and 0.78 for aorta.</p>
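        <p>The slice-sampling scheme can be sketched as follows; the nested-list "scans" and the helper names are illustrative assumptions, not the actual training code:</p>

```python
import random

# Pick a random axial slice from a random training scan, batch size 8.
def sample_batch(scans, masks, batch_size=8, rng=random):
    batch = []
    for _ in range(batch_size):
        i = rng.randrange(len(scans))     # random patient
        z = rng.randrange(len(scans[i]))  # random slice index
        batch.append((scans[i][z], masks[i][z]))
    return batch

scans = [[f"patient{p}_slice{z}" for z in range(4)] for p in range(3)]
masks = [[f"mask{p}_slice{z}" for z in range(4)] for p in range(3)]
print(len(sample_batch(scans, masks)))  # 8
```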
      </sec>
      <sec id="sec-6-3">
        <title>3.1.3. Stacking</title>
        <p>We also performed some experiments to check whether stacking
is helpful for the solution. In the first experiments the esophagus
was predicted much worse than the other classes in terms of dice
score. So, we decided to add the heart, aorta and trachea
segmentations as additional channels and feed them to the model.
We saw an increase of 0.1 in dice score for the esophagus and
planned to use this technique in the next experiments with
3d networks.</p>
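        <p>The stacking input can be sketched as follows (shapes and names are assumed for illustration): the CT slice and the three predicted masks become a four-channel input for the esophagus model.</p>

```python
# Stack the CT slice with the heart, aorta and trachea predictions as
# extra input channels (nested lists stand in for a 4 x H x W tensor).
def stack_channels(ct_slice, heart, aorta, trachea):
    return [ct_slice, heart, aorta, trachea]

x = stack_channels([[0.1, 0.2]], [[0, 1]], [[0, 0]], [[1, 0]])
print(len(x))  # 4 input channels
```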
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3.2. Preprocessing</title>
      <p>First of all, using classic computer vision methods, we
removed the medical table and other frames, borders and air.
Then we rescaled the dataset to the same voxel size of 0.97 × 0.97 × 2.5
and cut a bounding box based on the lungs' position (obtained
by classic CV methods) of size (512, 368) along x, y
from the center, plus 20 slices to the top and 125 to the bottom
along the z axis, to reduce the required computational resources and
allow a bigger batch size. Then we got the outputs from the proposed
2d baseline. After that, we performed individual preprocessing
for each organ.</p>
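      <p>The rescaling step amounts to computing per-axis zoom factors from a scan's native voxel size to the common 0.97 × 0.97 × 2.5 grid. A minimal sketch of that arithmetic (the interpolation itself is omitted; names are our own):</p>

```python
# Zoom factors and resulting shape for resampling a scan from its native
# voxel size to the target grid of 0.97 x 0.97 x 2.5.
TARGET = (0.97, 0.97, 2.5)

def zoom_factors(voxel_size, target=TARGET):
    return tuple(v / t for v, t in zip(voxel_size, target))

def resampled_shape(shape, voxel_size, target=TARGET):
    factors = zoom_factors(voxel_size, target)
    return tuple(round(s * f) for s, f in zip(shape, factors))

print(resampled_shape((512, 512, 200), (0.97, 0.97, 3.0)))  # (512, 512, 240)
```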
      <sec id="sec-7-1">
        <title>3.2.1. Esophagus</title>
        <p>We created a bounding box based on the trachea location,
of size (128, 128) along the x, y axes, and +5 and 115 slices
from the beginning of the trachea along the z axis.</p>
      </sec>
      <sec id="sec-7-2">
        <title>3.2.2. Trachea</title>
        <p>Using the proposed 2d baseline, we took a bounding box of size
(128, 128, 60) starting from the 1st slice from the top of the
2d baseline model's output.</p>
      </sec>
      <sec id="sec-7-3">
        <title>3.2.3. Heart</title>
        <p>Using the proposed 2d baseline, we took a bounding box of size
(512, 368) along x, y (since the heart is vast and requires a lot of
context) and +10 and 50 along z from the 1st slice from
the top of the 2d baseline model's output.</p>
      </sec>
      <sec id="sec-7-4">
        <title>3.2.4. Aorta</title>
        <p>Using the proposed 2d baseline, we made two bounding boxes
addressed to two different models. Both are of size
(128, 184) along x, y from the centre. The first is +5 and
75 along z, and the second is +70 and 10 along z, from the
first slice from the top of the 2d baseline model's output.</p>
      </sec>
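      <p>The per-organ crops above share one pattern: an (x, y) box around a centre plus a z-range relative to a reference slice. A hypothetical helper in that spirit (names and clamping behaviour are our assumptions, not the actual pipeline):</p>

```python
# Build slices for a crop of size size_xy around centre_xy, spanning
# dz_above slices above and dz_below slices below a reference z slice,
# clamped to the scan shape.
def crop_box(shape, centre_xy, size_xy, z_ref, dz_above, dz_below):
    cx, cy = centre_xy
    sx, sy = size_xy
    x0 = max(cx - sx // 2, 0)
    y0 = max(cy - sy // 2, 0)
    z0 = max(z_ref - dz_above, 0)
    z1 = min(z_ref + dz_below, shape[2])
    return (slice(x0, min(x0 + sx, shape[0])),
            slice(y0, min(y0 + sy, shape[1])),
            slice(z0, z1))

box = crop_box((512, 512, 250), (256, 256), (128, 128),
               z_ref=40, dz_above=5, dz_below=115)
print(box)  # x 192:320, y 192:320, z 35:155
```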
    </sec>
    <sec id="sec-8">
      <title>3.3. Model selection and training</title>
      <p>The baseline results show that 2d networks do not perform badly,
but there are situations they cannot handle well.
For example, in the upper part of the human body the lungs
surround the heart, but in the lower part there are no similar
reference points (see Fig.2). So spatial information along the z axis
is required for better segmentation, and 2d networks cannot
use this information. Hence, after performing
compression in the preprocessing stage, we were ready to use 3d
neural networks, as the required amount of computational
resources was reduced.</p>
      <p>We picked the number of epochs and the learning rate using a
train-test-val split (27-8-5 respectively) and set the batch size
to 8.</p>
      <p>We found that different organs require different models,
mostly because of their geometric sizes, so we describe
these peculiarities below (note: each network structure starts
with [number of channels, 4, 16] initialization convolution blocks):
1. Esophagus: depth 3 T-net with three convolutions
(instead of two) in up-sampling for better contouring and
(7, 7, 3) convolutions in the bottom level. We divided
each scan into non-intersecting packages of size 40 along z
and fed them into the network.</p>
      <p>[[16, 16], shortcut(16, 16), [32, 16, 8, 4]],
[[16, 32], shortcut(32, 32), [64, 32, 32, 16]],
[[32, 64, 128, 128, 64, 32]]</p>
      <p>2. Trachea: depth 3 T-net with (7, 7, 3) convolutions in the
bottom level. We divided each scan into non-intersecting
packages of size 20 along z (Fig.3) and fed them into the
network.</p>
      <p>[[16, 16], shortcut(16, 16), [32, 16, 8]],
[[16, 32], shortcut(32, 32), [64, 32, 16]],
[[32, 64, 128, 128, 64, 32]]</p>
      <p>3. Heart: depth 3 T-net with seven (7, 7, 3) convolutions
in the bottom level. We divided each scan into
non-intersecting packages of size 20 along z (Fig.3). Then we
scaled the data with a scale factor of 0.25 before feeding it
to the network.</p>
      <p>[[16, 16], shortcut(16, 16), [32, 16, 16]],
[[16, 32, 32], shortcut(32, 32), [64, 32, 16]],
[[32, 32, 64, 128, 128, 64, 32, 32]]</p>
      <p>4. Aorta: depth 3 T-net with three convolutions (instead of
two) in up-sampling for better contouring and (7, 7, 3)
convolutions in the bottom level. We used this model
structure for both the top and bottom bounding boxes
described earlier in the preprocessing part. We
divided each scan into non-intersecting packages of size
40 along z, then scaled the data with a scale factor of 0.5
before feeding it to the network.</p>
      <p>[[16, 16], shortcut(16, 16), [32, 16, 8, 4]],
[[16, 32], shortcut(32, 32), [64, 32, 32, 16]],
[[32, 64, 128, 128, 64, 32]]</p>
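      <p>The "non-intersecting packages" split along z used by all four models can be sketched as follows (boundary handling for the last, possibly shorter chunk is our assumption):</p>

```python
# Chop the z axis into consecutive, non-overlapping chunks of fixed depth;
# the last chunk may be shorter than the rest.
def z_packages(n_slices, depth):
    return [(z, min(z + depth, n_slices)) for z in range(0, n_slices, depth)]

print(z_packages(100, 40))  # [(0, 40), (40, 80), (80, 100)]
```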
    </sec>
    <sec id="sec-9">
      <title>3.4. Postprocessing</title>
      <p>After getting the predictions, we found that for some organs
postprocessing is required: for the heart we used the biggest
connected component and its convex hull; for the aorta we took the
biggest connected component.</p>
      <p>Fig. 2. Example of the lack of spatial information along z for the
2d network (Patient 28). Top left: the ground truth image
with the heart segmentation. Top right: the prediction of our
2d network. Bottom left: the image without segmentation.
Bottom right: the difference between the ground truth image
and our prediction. These images demonstrate that the 2d
network finds incorrect reference points and starts to draw the
heart in an inappropriate place.</p>
      <p>Fig. 3. Non-intersecting packages of z slices fed to the model.</p>
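      <p>The biggest-connected-component step can be illustrated with a pure-Python 4-connected labeling on a tiny 2d mask (in practice this would be done with an image-processing library; this sketch is only for intuition):</p>

```python
from collections import deque

# Return the largest 4-connected component of a binary 2d mask as a set
# of (row, col) coordinates, found by breadth-first search.
def biggest_component(mask):
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in mask]
    best = set()
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                comp, queue = set(), deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (ny in range(rows) and nx in range(cols)
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    return best

grid = [[1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 1]]
print(len(biggest_component(grid)))  # 3
```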
    </sec>
    <sec id="sec-10">
      <title>4. FINAL RESULTS AND CONCLUSION</title>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Final scores per organ.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Metric</th><th>Esophagus</th><th>Heart</th><th>Trachea</th><th>Aorta</th></tr>
          </thead>
          <tbody>
            <tr><td>Dice</td><td>0.80</td><td>0.93</td><td>0.89</td><td>0.92</td></tr>
            <tr><td>Hausdorff</td><td>0.62</td><td>0.30</td><td>0.81</td><td>0.27</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We proposed our solution for organs at risk
segmentation in Computed Tomography (CT) images. Our results are
shown in Table 1 (for trachea we report the score from our
penultimate submission). Moreover, we described our
approach to training complicated models on such big data
samples with constrained computational resources. We used the
division of each image into non-intersecting packages. A
natural improvement is to use intersecting packages and take
mean values of the logits at the reconstruction step, or to feed
these packages to another network to construct the final
output. We also planned to use stacking for 3d models. For
example, we tried to feed the aorta and trachea predictions as
additional channels to predict the esophagus (since it worked nicely
in the 2d case), but for some reason this approach performed
worse than the one described above, so we did not use it in the final
submission.</p>
      <p>We would also like to mention the possibility of choosing
the best segmentation for each patient among previous
submissions. The rules of the challenge did not prohibit this, but
we decided not to use it: although this helps to get the highest score,
it is absolutely inapplicable in clinical practice.</p>
      <p>The results have been obtained under the support of the
Russian Foundation for Basic Research grant 18-29-26030.</p>
    </sec>
    <sec id="sec-11">
      <title>5. REFERENCES</title>
      <p>[1] Olaf Ronneberger, Philipp Fischer, and Thomas Brox,
"U-net: Convolutional networks for biomedical image
segmentation," in International Conference on Medical Image
Computing and Computer-Assisted Intervention. Springer, 2015,
pp. 234–241.</p>
      <p>[3] Stanislav Nikolov, Sam Blackwell, Ruheena Mendes,
Jeffrey De Fauw, Clemens Meyer, Cían Hughes, Harry
Askham, Bernardino Romera-Paredes, Alan Karthikesalingam,
Carlton Chu, et al., "Deep learning to achieve clinically
applicable segmentation of head and neck anatomy for
radiotherapy," arXiv preprint arXiv:1809.04430, 2018.</p>
      <p>[4] Artem Pimkin, Gleb Makarchuk, Vladimir Kondratenko,
Maxim Pisov, Egor Krivov, and Mikhail Belyaev,
"Ensembling neural networks for digital pathology images
classification and segmentation," in International Conference
Image Analysis and Recognition. Springer, 2018, pp. 877–886.</p>
      <p>[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun, "Identity mappings in deep residual networks," 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>