<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning to segment from object sizes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Denis Baručić</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Kybic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deep learning has proved particularly useful for semantic segmentation, a fundamental image analysis task. However, standard deep learning methods need many training images with ground-truth pixel-wise annotations, which are usually laborious to obtain and, in some cases (e.g., medical images), require domain expertise. Therefore, instead of pixel-wise annotations, we focus on image annotations that are significantly easier to acquire but still informative, namely the size of foreground objects. We define the object size in terms of the maximum Chebyshev distance between a foreground pixel and the nearest background pixel. We propose an algorithm for training a deep segmentation network from a dataset of a few pixel-wise annotated images and many images with known object sizes. The algorithm minimizes a discrete (non-differentiable) loss function defined over the object sizes by sampling the gradient and then using the standard back-propagation algorithm. Experiments show that the new approach improves the segmentation performance.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic segmentation</kwd>
        <kwd>weakly-supervised learning</kwd>
        <kwd>deep learning</kwd>
        <kwd>distance transform</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Semantic segmentation is the process of associating a class label to each pixel of an image. With the advent of deep learning, deep networks have achieved incredible performance on many image processing tasks, including semantic segmentation. Deep learning for semantic segmentation has many benefits; for example, it is flexible w.r.t. the model architecture and scales particularly well [1, 2]. On the other hand, standard deep learning demands many ground-truth (GT) pixel-wise annotations to prevent overfitting. Since a human expert annotator must usually provide the GT annotations, acquiring a good-quality training dataset can be difficult. To combat this issue, we focus on learning from GT image annotations that are easier to produce but still informative enough, namely the sizes of foreground objects. In practice, our approach assumes a training dataset that consists of relatively few pixel-wise annotated images and many images with known object sizes. We present a work-in-progress solution.</p>
      <sec id="sec-1-1">
        <title>1.1. Proposed approach</title>
        <p>Suppose a standard convolutional network for image segmentation (e.g., a U-Net [3]). Given an input image, we feed it to the network and collect the output prediction. The prediction is then thresholded to obtain a binary mask, which is processed by a distance transform, assigning to each foreground pixel the shortest distance to the background. Finally, the object size is defined as double the maximum of the computed distances.</p>
        <p>Due to the thresholding, the cost function is not differentiable, and it is therefore not possible to use standard gradient descent for learning. We overcome this obstacle by adding random noise to the output of our network. The predicted binary masks then become stochastic and the gradient can be sampled. A detailed description of our method is given in Sec. 2 and 3.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Related work</title>
        <p>Cano-Espinosa et al. [4] considered a similar learning problem. They proposed a network architecture that performs a biomarker (fat content) regression and image segmentation after being trained directly on images annotated by biomarker values only. Similarly to ours, their method derives the biomarker value from the predicted segmentation deterministically. The difference is that their biomarker, equivalent to the foreground area, can be obtained by a simple summation. Furthermore, the method assumes that the foreground objects can be roughly segmented using thresholding. Pérez-Pelegrí et al. [5] took a similar approach. Although their method does not involve thresholding to produce an approximate segmentation, it was tailored explicitly for learning from images annotated by the foreground volume (as their images are 3D).</p>
        <p>Karam et al. [6] implemented a differentiable distance transform via a combination of convolution operations. The method is fast but exhibits numerical instabilities for bigger images. Resolving the numerical instabilities, Pham et al. [7] later proposed a cascaded procedure with locally restricted convolutional distance transforms. Nonetheless, both methods substitute the minimum function with the log-sum-exp operation, which leads to inaccurate results.</p>
        <p>The way our method deals with a non-differentiable cost function is borrowed from stochastic binary networks [8]. In a stochastic binary network, one needs to deal with a zero gradient after each layer of the network. However, methods such as ARM [9] or PSA [10] are unnecessarily complex. Instead, we employ a single sample estimation, which has been discussed in [11].</p>
      </sec>
      <p>ITAT’22: Information technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia. barucden@fel.cvut.cz (D. Baručić); kybic@fel.cvut.cz (J. Kybic). ORCID: 0000-0003-0428-3354 (D. Baručić), 0000-0002-9363-4947 (J. Kybic). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Model</title>
      <p>The proposed model consists of (1) a segmentation network f, parametrized by θ, and (2) a deterministic algorithm that derives the object size based on a distance transform, denoted as g.</p>
      <p>Given an input image x = (x_1, …, x_N), the network produces a pixel-wise segmentation

s = f(x; θ), (1)

such that s_i ∈ ℝ, 1 ≤ i ≤ N, where N is the number of pixels. The method does not make any assumptions about the network’s technical details, except that it can be trained using the standard back-propagation algorithm and gradient descent. In our experiments, we always employed a U-Net [3] with a residual network encoder [12] and a mirroring decoder.</p>
      <p>To obtain a binary mask ŷ ∈ {±1}^N, the network response s is thresholded,

ŷ_i = sign(s_i), i = 1, …, N. (2)</p>
      <sec id="sec-2-1">
        <title>2.1. Object size</title>
        <p>We use a distance transform of the binary mask to define the object size (see Fig. 1). The distance transform assigns to each pixel the shortest distance to the background, i.e.,

d_i = min_{j : ŷ_j = −1} d(i, j), i = 1, …, N, (3)

where d(i, j) is the Chebyshev (ℓ∞) distance. After that, we take double the maximum distance to define the object size,

ẑ = 2 max_i d_i. (4)</p>
        <p>The composition of the distance transform and the maximum aggregation determines the object size, denoted as g : {±1}^N → ℝ,

g(ŷ) = 2 max_i min_{j : ŷ_j = −1} d(i, j). (5)</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Implementation details</title>
          <p>There is an efficient, two-pass algorithm that computes the distance transform in Θ(N) time. Furthermore, when evaluating a batch of images, it is possible to compute the distance transform on all images in parallel. We have implemented a CPU version of this algorithm that works with PyTorch tensors and is faster than, e.g., the SciPy implementation.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Learning</title>
      <p>Suppose a training dataset D = D_f ∪ D_w consists of fully- and weakly-annotated subsets D_f and D_w. The fully-annotated subset D_f contains pairs (x, y), where x is an input image and y the corresponding GT pixel-wise segmentation, while D_w comprises pairs (x, z), where z is the size of the object present in the image x. We focus on situations when |D_f| ≪ |D_w|.</p>
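      <p>A minimal sketch of the object-size computation of Sec. 2.1: the two-pass (chamfer) chessboard distance transform followed by 2·max. The explicit pixel loops are illustrative only, not the paper's optimized PyTorch implementation; for reference, scipy.ndimage.distance_transform_cdt(mask, metric='chessboard') computes the same transform.</p>

```python
import numpy as np

INF = 10**9

def chebyshev_dt(mask):
    """Two-pass chamfer chessboard distance transform.

    mask: boolean array, True = foreground (+1 in the paper's notation).
    Returns, for every pixel, the Chebyshev distance to the nearest
    background pixel (0 on background), cf. Eq. (3).
    """
    h, w = mask.shape
    d = np.where(mask, INF, 0).astype(np.int64)
    # Forward pass: top-left to bottom-right.
    for i in range(h):
        for j in range(w):
            if d[i, j] == 0:
                continue
            best = d[i, j]
            for di, dj in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    best = min(best, d[ni, nj] + 1)
            d[i, j] = best
    # Backward pass: bottom-right to top-left.
    for i in range(h - 1, -1, -1):
        for j in range(w - 1, -1, -1):
            best = d[i, j]
            for di, dj in ((1, 1), (1, 0), (1, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    best = min(best, d[ni, nj] + 1)
            d[i, j] = best
    return d

# A 3x3 square object in a 7x7 image (hypothetical toy mask).
mask = np.zeros((7, 7), dtype=bool)
mask[2:5, 2:5] = True
dt = chebyshev_dt(mask)
print(2 * dt.max())  # object size per Eq. (4) → 4
```

      <p>The two sweeps together visit each pixel a constant number of times, which is the Θ(N) behavior mentioned above.</p>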
      <sec id="sec-2-2">
        <title>3.1. Supervised pre-training</title>
        <p>Our method starts by optimizing a pixel-wise loss w.r.t. the network parameters θ on the small subset D_f, as in standard supervised learning. For a particular training pair (x, y) ∈ D_f and the corresponding prediction s = f(x; θ) ∈ ℝ^N, the loss function reads

ℓ(s, y) = Σ_{i=1}^{N} ((1 − ȳ_i) s_i + log(1 + exp(−s_i))), (6)

where ȳ_i = (1 + y_i)/2 ∈ {0, 1} encodes the GT label of pixel i, which is known as the binary cross-entropy with logits loss. The optimization continues until convergence.</p>
        <p>Using proper data augmentation to extend the training dataset, the network tends to recognize useful features and produces decent predictions after this initial stage (see Sec. 4.2).</p>
        <p>[Fig. 2: the proposed pipeline — an input image passes through the segmentation network; noise is subtracted and the result thresholded; the distance transform and maximum aggregation derive the size g(y), which enters the loss ℓ(z, g(y)).]</p>
      </sec>
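      <p>The per-pixel loss (6) is the binary cross-entropy with logits written out explicitly. As a sanity check, a short NumPy snippet (the logits and targets below are assumed toy values) confirms that this algebraic form matches the textbook definition −Σ_i (t_i log σ(s_i) + (1 − t_i) log(1 − σ(s_i))):</p>

```python
import numpy as np

def bce_with_logits(s, t):
    """Eq. (6): sum_i (1 - t_i) * s_i + log(1 + exp(-s_i)), with t_i in {0, 1}."""
    return np.sum((1 - t) * s + np.log1p(np.exp(-s)))

# Toy logits and binary targets (assumed values for illustration).
s = np.array([2.0, -1.0, 0.5])
t = np.array([1.0, 0.0, 1.0])

# Textbook binary cross-entropy on the sigmoid outputs.
sig = 1 / (1 + np.exp(-s))
ref = -np.sum(t * np.log(sig) + (1 - t) * np.log(1 - sig))

print(abs(bce_with_logits(s, t) - ref) < 1e-9)  # → True
```

      <p>Writing the loss directly on the logits, as in (6), avoids evaluating log of a near-zero sigmoid and is numerically safer; this is the same reasoning behind the "with logits" loss variants in deep learning frameworks.</p>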
      <sec id="sec-2-3">
        <title>3.2. Weakly-supervised training</title>
        <p>Consider a training pair (x, z) ∈ D_w. As described in Sec. 2, one can obtain a prediction of the object size, ẑ = g(ŷ), from the thresholded network response ŷ. We penalize the prediction error by the square loss

ℓ(z, ẑ) = (z − ẑ)². (7)</p>
        <p>We propose to follow an approach similar to those used in binary neural networks [10] and subtract random noise n from the real predictions s before thresholding. Consequently, the binary segmentation becomes a collection y = (y_1, …, y_N) of N independent Bernoulli variables,

y_i = sign(s_i − n_i), (8)

with

Pr(y_i = +1 | x; θ) = Pr(n_i ≤ s_i) = F(s_i), (9)

where F is the cumulative distribution function (CDF) of the noise n (see Fig. 2).</p>
        <p>Then, instead of minimizing the loss (7), we minimize the expected loss ℒ = E_y[ℓ(z, g(y))],

ℒ = Σ_{y ∈ {±1}^N} Pr(y | x; θ) ℓ(z, g(y)). (10)

Contrary to (7), the expected loss (10) is differentiable, assuming a smooth F.</p>
        <sec id="sec-2-3-1">
          <title>3.2.1. Noise distribution</title>
          <p>Following [10], we sample the noise n_i from the logistic distribution with mean μ = 0 and scale σ = 1. Hence, the CDF of n_i is a smooth, sigmoid function,

F(a) = 1 / (1 + exp(−a)). (11)</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>3.2.2. Exact gradient</title>
          <p>To compute the gradient ∇_θ ℒ, we need to evaluate the derivative

∂ℒ / ∂s_i (12)

for each pixel i = 1, …, N. The gradient can then be computed automatically by the back-propagation algorithm. However, an exact computation of (12) leads to

∂ℒ / ∂s_i = Σ_{y ∈ {±1}^N} (∂Pr(y | x; θ) / ∂s_i) ℓ(z, g(y)), (13)

which involves summing 2^N terms and is thus tractable only for very small images. Instead, we resort to a single sample estimator.</p>
        </sec>
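        <p>The stochastic binarization (8)–(9) with logistic noise (11) can be checked empirically: the frequency of y_i = +1 should match the sigmoid F(s_i). A small Monte Carlo sketch (the logit value s is an assumed example):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

s = 0.8  # a single network logit (assumed example value)

# Eq. (8): subtract logistic noise (mean 0, scale 1) and threshold.
n = rng.logistic(loc=0.0, scale=1.0, size=200_000)
y = np.sign(s - n)

# Eq. (9): Pr(y = +1) = F(s), with F the logistic CDF of Eq. (11).
p_emp = np.mean(y == 1)
p_theo = 1 / (1 + np.exp(-s))

print(abs(p_emp - p_theo) < 0.01)
```

      <p>This is exactly what makes the expected loss (10) smooth in s: nudging a logit changes the probability of each binary mask continuously, even though every individual sample is still hard ±1.</p>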
        <sec id="sec-2-3-3">
          <title>3.2.3. Single sample estimator</title>
          <p>The single sample estimator is based on Lemma 1, which is, in fact, a specific form of [10, Lemma B.1].</p>
          <p>Lemma 1. Let Y = (Y_1, …, Y_N) be a collection of N independent {±1}-valued Bernoulli variables with probabilities Pr(Y_i = +1) = p_i. Let h be a function h : {±1}^N → ℝ. Let y = (y_1, …, y_N) denote a random sample of Y and y↓i = (y_1, …, y_{i−1}, −y_i, y_{i+1}, …, y_N). Then

y_i (h(y) − h(y↓i)) (14)

is an unbiased estimate of ∂/∂p_i E_{Y}[h(Y)].</p>
          <p>Proof. We start from the expectation

E_Y[h(Y)] = Σ_y Pr(y) h(y), (15)

take the derivative ∂Pr(y)/∂p_i = y_i Pr(y_¬i), and write out the sum over y_i,

∂/∂p_i E_Y[h(Y)] = Σ_{y_¬i} Σ_{y_i} y_i Pr(y_¬i) h(y), (16)

where y_¬i denotes the vector y with the i-th component omitted. Notice that the inner sum simplifies and no longer depends on y_i,

Σ_{y_¬i} Pr(y_¬i) (h(y_{i=+1}) − h(y_{i=−1})), (17)

where y_{i=c} is the vector y with the i-th component set to c. Then, we multiply the inner subtraction by the constant factor 1 = p_i + (1 − p_i) = Σ_{y_i} Pr(y_i),

Σ_{y_¬i} Pr(y_¬i) Σ_{y_i} Pr(y_i) (h(y_{i=+1}) − h(y_{i=−1})), (18)

ultimately leading to the following expression for the derivative of (15):

Σ_y Pr(y) (h(y_{i=+1}) − h(y_{i=−1})), (19)

which can be written as

Σ_y Pr(y) y_i (h(y) − h(y↓i)). (20)

Thus, (14) is a single sample unbiased estimate of the derivative of (15).</p>
          <p>According to Lemma 1, an unbiased estimate of the derivative (12) is

F′(s_i) y_i (ℓ(z, g(y)) − ℓ(z, g(y↓i))), (21)

where y is a random sample of Bernoulli variables with probabilities (9) (see a few examples of sampled derivatives in Fig. 3).</p>
        </sec>
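        <p>Lemma 1 can be verified numerically on a problem small enough to enumerate: compare the exact derivative (17) with the average of the single sample estimate (14) over many draws. The probabilities p, the random table h, and the sample count M below are illustrative choices, not values from the paper.</p>

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N = 3  # tiny so the exact derivative is enumerable
p = np.array([0.3, 0.6, 0.8])  # Pr(Y_i = +1), assumed values

# An arbitrary function h : {±1}^N -> R, stored as a random table.
h_table = {y: rng.normal() for y in itertools.product((-1, 1), repeat=N)}
def h(y):
    return h_table[tuple(int(v) for v in y)]

# Exact d/dp_0 E[h(Y)] = sum over y_¬0 of Pr(y_¬0) (h(y_{0=+1}) - h(y_{0=-1})),
# cf. Eq. (17).
def prob_rest(rest):
    return np.prod([p[i + 1] if rest[i] == 1 else 1 - p[i + 1]
                    for i in range(N - 1)])

exact = sum(prob_rest(r) * (h((1,) + r) - h((-1,) + r))
            for r in itertools.product((-1, 1), repeat=N - 1))

# Monte Carlo average of the single sample estimate y_0 (h(y) - h(y↓0)), Eq. (14).
M = 100_000
samples = np.where(rng.random((M, N)) < p, 1, -1)
flipped = samples.copy()
flipped[:, 0] = -flipped[:, 0]
est = np.mean([y[0] * (h(y) - h(yf)) for y, yf in zip(samples, flipped)])

print(abs(est - exact) < 0.05)
```

        <p>Each draw costs only two evaluations of h (here, two size derivations), which is what makes the estimator practical compared with the 2^N-term sum (13).</p>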
      </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>The proposed method was implemented in PyTorch using the PyTorch Lightning framework² and models from the Segmentation Models PyTorch library³. The experiments were run on a machine equipped with an Intel Xeon Silver 4214R (2.40 GHz) and an NVIDIA GeForce RTX 2080 Ti.</p>
      <p>The data for our experiments was based on a dataset of 3D MRI images of the hippocampus [13]. The dataset consists of 394 volumes provided with GT segmentation of the classes hippocampus head, hippocampus body, and background. We decomposed the volumes into individual 2D slices of size 48 × 32 pixels and kept only those with at least 1% foreground, obtaining a total of 6093 images. Next, we merged the hippocampus classes to get a binary segmentation problem (see Fig. 4). Afterward, we derived the object sizes from the GT pixel-wise annotations to use in training. Finally, we randomly split the data into training, validation, and testing subsets containing 70%, 10%, and 20% of the images.</p>
      <p>Given a GT segmentation y and a predicted segmentation ŷ, we evaluate two metrics, the squared size prediction error E and the intersection-over-union IoU,

E(y, ŷ) = ℓ(g(y), g(ŷ)), (22)

IoU(y, ŷ) = (Σ_{i=1}^{N} (1 + y_i + ŷ_i + y_i ŷ_i)) / (Σ_{i=1}^{N} (3 + y_i + ŷ_i − y_i ŷ_i)). (23)</p>
      <p>In the case of the standard supervised method, vertical and horizontal flipping was randomly applied to augment the training dataset. The proposed method did not apply any augmentation.</p>
      <p>²https://github.com/Lightning-AI/lightning</p>
      <p>³https://github.com/qubvel/segmentation_models.pytorch</p>
      <sec id="sec-4-1">
        <title>4.1. Number of derivative samples</title>
        <p>A toy example (see Fig. 3) indicated that taking more samples of the derivatives (21) might lead to better results than taking just one. This experiment investigates how the number of derivative samples K impacts learning speed and prediction quality. We considered four different numbers of samples, K ∈ {1, 2, 4, 8}. For each K, the other parameters (such as the batch size or the learning rate) were the same, and the learning began with the same segmentation network that was pre-trained in the standard way on 85 pixel-wise annotated images from the training subset. The proposed method always ran until the squared error E on the validation data stopped improving.</p>
        <p>To assess the learning speed, we measured the duration of one learning epoch. For K = 1, an epoch took ≈ 10× longer than the standard supervised learning. Generally, the duration grew roughly exponentially with K (see Fig. 5).</p>
        <p>Higher values of K did not lead to a lower E or a faster convergence speed (see Fig. 6). In fact, K = 1 and K = 2 achieved the lowest E, but not by a large margin. Given the speed benefits, we always use K = 1. Interestingly, even though E kept decreasing over the course of learning for all K, IoU improved only slightly and started declining after ≈ 20 epochs. This observation suggests that the squared error of the object size is not a sufficient objective for learning the segmentation.</p>
      </sec>
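      <p>The IoU form (23) for {±1} masks can be sketched directly; the per-pixel products count intersection and union membership (the masks below are assumed toy values):</p>

```python
import numpy as np

def iou(y, y_hat):
    """Eq. (23) for {±1} masks.

    1 + y + ŷ + y·ŷ is 4 iff both pixels are foreground (intersection);
    3 + y + ŷ − y·ŷ is 4 iff at least one is foreground (union), 0 otherwise.
    """
    num = np.sum(1 + y + y_hat + y * y_hat)
    den = np.sum(3 + y + y_hat - y * y_hat)
    return num / den

y = np.array([1, 1, -1, -1, 1])
y_hat = np.array([1, -1, -1, 1, 1])
# Foreground index sets {0, 1, 4} and {0, 3, 4}: intersection 2, union 4.
print(iou(y, y_hat))  # → 0.5
```

      <p>Both sums are plain elementwise arithmetic, so the metric vectorizes over whole batches of masks without any boolean set logic.</p>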
      <sec id="sec-4-2">
        <title>4.2. Pre-training impact</title>
        <p>This experiment tests the essential question: given a segmentation model trained on a few pixel-level annotated images, can we improve its testing performance by further learning from size annotations?</p>
        <p>We trained different segmentation networks until convergence on randomly selected training subsets of size M. Then, we fine-tuned these networks on the whole training dataset using the proposed method. We measured the test performance in terms of IoU.</p>
        <p>The proposed method led to a ≈ 5% increase of IoU for small M &lt; 100 (see Fig. 7), improving the segmentation quality. For higher M, the effect was negligible, which complements the observation from the previous experiment that improving the size estimate does not necessarily improve the segmentation quality.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The method is promising, but there is definitely potential for improvement in both speed and prediction performance.</p>
      <p>The proposed method samples the derivatives according to (21) for each pixel i. However, flipping the prediction, y_i ↦ −y_i, changes the derived size only for some i, particularly those within and on the border of the predicted object. Therefore, given a sample y, ℓ(z, g(y)) = ℓ(z, g(y↓i)) for many pixels i, and the sampled derivatives (21) are sparse. The method might sample only those derivatives that are potentially non-zero and set the rest to zero directly, which would save much computational time.</p>
      <p>We have seen in the experiments that a lower size prediction error does not strictly imply better segmentation. We need to closely investigate in what cases the size prediction loss is insufficient and adjust the objective. The adjustment might involve adding an L1 regularization (as in [4]) or drawing inspiration from unsupervised methods (e.g., demanding that the segmentation respect edges in images, etc.).</p>
      <p>The proposed approach entails some principled limitations. For example, it allows only a single object in an image. We also expect the method to be ill-suited for complex object shapes, but we have not performed any experiments in that regard yet.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We proposed a weakly-supervised method for training a segmentation network from a few pixel-wise annotated images and many images annotated by the object size. The key ingredients are a method for evaluating the object size from a probabilistic segmentation and a method for optimizing a deep network using a non-differentiable objective.</p>
      <p>The achieved results seem promising. We believe the improvements suggested in the discussion will further improve performance, rendering the method valuable for training segmentation models for biomedical images.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors acknowledge the support of the OP VVV funded project “CZ.02.1.01/0.0/0.0/16_019/0000765 Research Center for Informatics”, the Czech Science Foundation project 20-08452S, and the Grant Agency of the CTU in Prague, grant No. SGS20/170/OHK3/3T/13.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[4] C. Cano-Espinosa, et al., Biomarker localization from deep learning regression networks, IEEE Transactions on Medical Imaging 39 (2020) 2121–2132.</p>
      <p>[5] M. Pérez-Pelegrí, et al., Automatic left ventricle volume calculation with explainability through a deep learning weak-supervision methodology, Computer Methods and Programs in Biomedicine 208 (2021) 106275.</p>
      <p>[6] C. Karam, K. Sugimoto, K. Hirakawa, Fast convolutional distance transform, IEEE Signal Processing Letters 26 (2019) 853–857.</p>
      <p>[7] D. D. Pham, G. Dovletov, J. Pauli, A differentiable convolutional distance transform layer for improved image segmentation, in: DAGM German Conference on Pattern Recognition, Springer, 2020, pp. 432–444.</p>
      <p>[8] T. Raiko, M. Berglund, G. Alain, L. Dinh, Techniques for learning binary stochastic feedforward neural networks, in: 3rd International Conference on Learning Representations, 2015.</p>
      <p>[9] M. Yin, M. Zhou, ARM: augment-REINFORCE-merge gradient for stochastic binary networks, in: 7th International Conference on Learning Representations, 2019.</p>
      <p>[10] A. Shekhovtsov, V. Yanush, B. Flach, Path sample-analytic gradient estimators for stochastic binary networks, Advances in Neural Information Processing Systems 33 (2020) 12884–12894.</p>
      <p>[11] Y. Cong, M. Zhao, K. Bai, L. Carin, GO gradient for expectation-based objectives, in: 7th International Conference on Learning Representations, 2019.</p>
      <p>[12] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.</p>
      <p>[13] A. L. Simpson, et al., A large annotated medical image dataset for the development and evaluation of segmentation algorithms, 2019. arXiv:1902.09063.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>A review of deep-learning-based medical image segmentation methods</article-title>
          ,
          <source>Sustainability</source>
          <volume>13</volume>
          (
          <year>2021</year>
          )
          <fpage>1224</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          , et al.,
          <article-title>Image segmentation using deep learning: A survey</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-net:
          <article-title>Convolutional networks for biomedical image segmentation</article-title>
          , in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>