Learning to segment from object sizes

Denis Baručić¹, Jan Kybic¹
¹ Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague

Abstract

Deep learning has proved particularly useful for semantic segmentation, a fundamental image analysis task. However, standard deep learning methods need many training images with ground-truth pixel-wise annotations, which are usually laborious to obtain and, in some cases (e.g., medical images), require domain expertise. Therefore, instead of pixel-wise annotations, we focus on image annotations that are significantly easier to acquire but still informative, namely the size of foreground objects. We define the object size as the maximum Chebyshev distance between a foreground pixel and the nearest background pixel. We propose an algorithm for training a deep segmentation network from a dataset of a few pixel-wise annotated images and many images with known object sizes. The algorithm minimizes a discrete (non-differentiable) loss function defined over the object sizes by sampling the gradient and then using the standard back-propagation algorithm. Experiments show that the new approach improves the segmentation performance.

Keywords

semantic segmentation, weakly-supervised learning, deep learning, distance transform

1. Introduction

Semantic segmentation is the process of associating a class label with each pixel of an image. With the advent of deep learning, deep networks have achieved remarkable performance on many image processing tasks, including semantic segmentation. Deep learning for semantic segmentation has many benefits; for example, it is flexible w.r.t. the model architecture and scales particularly well [1, 2]. On the other hand, standard deep learning demands many ground-truth (GT) pixel-wise annotations to prevent overfitting. Since a human expert annotator must usually provide the GT annotations, acquiring a good-quality training dataset can be difficult. To combat this issue, we focus on learning from GT image annotations that are easier to produce but still informative enough, namely the sizes of foreground objects. In practice, our approach assumes a training dataset that consists of relatively few pixel-wise annotated images and many images with known object sizes. We present a work-in-progress solution.

1.1. Proposed approach

Consider a standard convolutional network for image segmentation (e.g., a U-Net [3]). Given an input image, we feed it to the network and collect the output prediction. The prediction is then thresholded to obtain a binary mask, which is processed by a distance transform, assigning to each foreground pixel the shortest distance to the background. Finally, the object size is defined as double the maximum of the computed distances.

Due to the thresholding, the cost function is not differentiable, and it is therefore not possible to use standard gradient descent for learning. We overcome this obstacle by adding random noise to the output of our network. The predicted binary masks then become stochastic, and the gradient can be sampled. A detailed description of our method is given in Sec. 2 and 3.

1.2. Related work

Cano-Espinosa et al. [4] considered a similar learning problem. They proposed a network architecture that performs biomarker (fat content) regression and image segmentation after being trained directly on images annotated by biomarker values only. Similarly to ours, their method derives the biomarker value from the predicted segmentation deterministically. The difference is that their biomarker, equivalent to the foreground area, can be obtained by a simple summation. Furthermore, the method assumes that the foreground objects can be roughly segmented using thresholding.
Pérez-Pelegrí et al. [5] took a similar approach. Although their method does not involve thresholding to produce an approximate segmentation, it was tailored explicitly for learning from images annotated by the foreground volume (as their images are 3D).

Karam et al. [6] implemented a differentiable distance transform via a combination of convolution operations. The method is fast but exhibits numerical instabilities for bigger images. Resolving the numerical instabilities, Pham et al. [7] later proposed a cascaded procedure with locally restricted convolutional distance transforms. Nonetheless, both methods substitute the minimum function with the log-sum-exp operation, which leads to inaccurate results.

The way our method deals with a non-differentiable cost function is borrowed from stochastic binary networks [8]. In a stochastic binary network, one needs to deal with a zero gradient after each layer of the network. However, methods such as ARM [9] or PSA [10] are unnecessarily complex for our purposes. Instead, we employ a single-sample estimator, which has been discussed in [11].

ITAT'22: Information Technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
barucden@fel.cvut.cz (D. Baručić); kybic@fel.cvut.cz (J. Kybic)
ORCID: 0000-0003-0428-3354 (D. Baručić); 0000-0002-9363-4947 (J. Kybic)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Figure 1: Illustrative example of an object and its derived size. The object is outlined by the thick boundary line. The point 𝑖 denotes the foreground pixel whose shortest distance to the background, 𝑑𝑖, is the largest among all foreground pixels; the derived object size is ˆ𝑠 = 2𝑑𝑖.

2. Model

The proposed model consists of (1) a segmentation network 𝑓𝜃, parametrized by 𝜃, and (2) a deterministic algorithm, denoted 𝑔, that derives the object size based on the distance transform.

Given an input image 𝑥 = (𝑥1, . . . , 𝑥𝑉), the network produces a pixel-wise segmentation

  $a = f_\theta(x)$,  (1)

such that 𝑎𝑖 ∈ ℝ, 1 ≤ 𝑖 ≤ 𝑉, where 𝑉 is the number of pixels. The method does not make any assumptions about the network's technical details, except that it can be trained using the standard back-propagation algorithm and gradient descent. In our experiments, we always employed a U-Net [3] with a residual network encoder [12] and a mirroring decoder.

To obtain a binary mask 𝑦ˆ ∈ {±1}^𝑉, the network response 𝑎 is thresholded,

  $\hat{y}_i = \operatorname{sign} a_i$.  (2)

2.1. Object size

We use a distance transform of the binary mask to define the object size (see Fig. 1). The distance transform assigns to each pixel the shortest distance to the background, i.e.,

  $d_i = \min_{j : \hat{y}_j = -1} \delta(i, j), \qquad i = 1, \dots, V$,  (3)

where 𝛿(𝑖, 𝑗) is the Chebyshev (ℓ∞) distance. After that, we take double the maximum distance to define the object size,

  $\hat{s} = 2 \max_i d_i$.  (4)

The composition of the distance transform and the maximum aggregation determines the object size, denoted 𝑔 : {±1}^𝑉 → ℝ,

  $g(\hat{y}) = 2 \max_i \min_{j : \hat{y}_j = -1} \delta(i, j)$.  (5)

2.1.1. Implementation details

There is an efficient two-pass algorithm that computes the distance transform in Θ(𝑉) time. Furthermore, when evaluating a batch of images, it is possible to compute the distance transform on all images in parallel. We have implemented a CPU version¹ of this algorithm that works with PyTorch tensors and is faster than, e.g., the SciPy implementation.

¹ https://github.com/barucden/chdt
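For illustration, the size derivation 𝑔 of Eq. (5) can be written in a few lines on top of SciPy, whose `distance_transform_cdt` with the "chessboard" metric computes exactly the Chebyshev distance to the nearest background pixel. The following is a minimal single-mask sketch, not the batched PyTorch implementation from the chdt repository above; the function name `object_size` is ours.

```python
import numpy as np
from scipy.ndimage import distance_transform_cdt

def object_size(mask: np.ndarray) -> float:
    """Size derivation g of Eq. (5) for a {+1, -1}-valued binary mask.

    Assumes the mask contains at least one background pixel.
    """
    foreground = mask > 0
    # Chebyshev ("chessboard") distance from each foreground pixel to the
    # nearest background pixel; background pixels receive distance 0.
    d = distance_transform_cdt(foreground, metric="chessboard")
    # Object size = twice the maximum distance, Eq. (4).
    return 2.0 * float(d.max())
```

An all-background mask correctly yields size 0; batching amounts to applying the transform to each image separately, which is what the parallel implementation mentioned above exploits.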
3. Learning

Suppose a training dataset 𝒟 = 𝒟𝑓 ∪ 𝒟𝑤 consists of fully- and weakly-annotated subsets 𝒟𝑓 and 𝒟𝑤. The fully-annotated subset 𝒟𝑓 contains pairs (𝑥, 𝑦), where 𝑥 is an input image and 𝑦 the corresponding GT pixel-wise segmentation, while 𝒟𝑤 comprises pairs (𝑥, 𝑠), where 𝑠 is the size of the object present in the image 𝑥. We focus on situations where |𝒟𝑓| ≪ |𝒟𝑤|.

3.1. Supervised pre-training

Our method starts by optimizing a pixel-wise loss w.r.t. the network parameters 𝜃 on the small subset 𝒟𝑓, as in standard supervised learning. For a particular training pair (𝑥, 𝑦) ∈ 𝒟𝑓 and the corresponding prediction 𝑎 ∈ ℝ^𝑉, the loss function reads

  $\sum_{i=1}^{V} \left( a_i (1 - y_i) + \log(1 + \exp(-a_i)) \right)$,  (6)

which is known as the binary cross-entropy with logits loss. The optimization continues until convergence. Using proper data augmentation to extend the training dataset, the network tends to recognize useful features and produces decent predictions after this initial stage (see Sec. 4.2).

Figure 2: An overview of the proposed probabilistic model. The input image 𝑥 passes through the segmentation network 𝑓𝜃; noise 𝑍 is subtracted before thresholding, yielding the stochastic mask 𝑌; the size derivation 𝑔 (distance transform followed by the maximum) produces 𝑔(𝑌), which the loss 𝑙(𝑠, 𝑔(𝑌)) compares to the GT size 𝑠.

3.2. Weakly-supervised training

Consider a training pair (𝑥, 𝑠) ∈ 𝒟𝑤. As described in Sec. 2, one can obtain a prediction of the object size, ˆ𝑠 = 𝑔(𝑦ˆ), from the thresholded network response 𝑦ˆ. We penalize the prediction error by the square loss

  $l(s, \hat{s}) = (s - \hat{s})^2$.  (7)

We propose to follow an approach similar to those used in binary neural networks [10] and subtract random noise from the real-valued predictions 𝑎𝑖 before thresholding. Consequently, the binary segmentation becomes a collection 𝑌 = (𝑌1, . . . , 𝑌𝑉) of 𝑉 independent Bernoulli variables,

  $Y_i = \operatorname{sign}(a_i - Z_i)$,  (8)

where the 𝑍𝑖 are i.i.d. noise samples with common cumulative distribution function (CDF) 𝐹𝑍, so that

  $\Pr(Y_i = +1 \mid x; \theta) = \Pr(Z_i \le a_i) = F_Z(a_i)$  (9)

(see Fig. 2). Then, instead of minimizing the loss 𝑙 (7), we minimize the expected loss ℒ = E𝑌[𝑙(𝑠, 𝑔(𝑌))],

  $\mathcal{L} = \sum_{y \in \{\pm 1\}^V} \Pr(Y = y \mid x; \theta) \, l(s, g(y))$.  (10)

Contrary to (7), the expected loss (10) is differentiable, assuming a smooth 𝐹𝑍.

3.2.1. Noise distribution

Following [10], we sample the noise 𝑍 from the logistic distribution with mean 𝜇 = 0 and scale 1. Hence, the CDF of 𝑍 is the smooth sigmoid function

  $F_Z(a) = \frac{1}{1 + \exp(-a)}$.  (11)
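In code, Eqs. (8), (9), and (11) amount to one observation: thresholding logits with subtracted logistic noise is the same as sampling each pixel independently as +1 with probability sigmoid(𝑎𝑖). A minimal PyTorch sketch (the function name is ours):

```python
import torch

def sample_mask(a: torch.Tensor) -> torch.Tensor:
    """Sample Y of Eq. (8): Y_i = sign(a_i - Z_i), Z_i ~ Logistic(0, 1).

    Since Pr(Z_i <= a_i) = sigmoid(a_i) (Eqs. (9), (11)), this is equivalent
    to drawing each pixel as +1 with probability sigmoid(a_i).
    """
    p = torch.sigmoid(a)  # p_i = F_Z(a_i)
    return torch.where(torch.rand_like(a) < p,
                       torch.ones_like(a), -torch.ones_like(a))
```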
3.2.2. Exact gradient

To compute the gradient ∇𝜃ℒ, we need to evaluate the derivative

  $\frac{\partial}{\partial F_Z(a_i)} \mathbb{E}_Y[l(s, g(Y))]$  (12)

for each pixel 𝑖 = 1, . . . , 𝑉. The gradient can then be computed automatically by the back-propagation algorithm. However, an exact computation of (12) leads to

  $\sum_{y \in \{\pm 1\}^V} \frac{\Pr(Y = y \mid x; \theta)}{\Pr(Y_i = y_i \mid x; \theta)} \, l(s, g(y)) \, y_i$,  (13)

which involves summing 2^𝑉 terms and is thus tractable only for very small images. Instead, we resort to a single-sample estimator.

3.2.3. Single-sample estimator

The single-sample estimator is based on Lemma 1, which is, in fact, a specific form of [10, Lemma B.1].

Lemma 1. Let 𝑌 = (𝑌1, . . . , 𝑌𝑉) be a collection of 𝑉 independent {±1}-valued Bernoulli variables with probabilities Pr(𝑌𝑖 = +1) = 𝑝𝑖. Let ℎ be a function ℎ : {±1}^𝑉 → ℝ. Let 𝑦 = (𝑦1, . . . , 𝑦𝑉) denote a random sample of 𝑌 and 𝑦↓𝑖 = (𝑦1, . . . , 𝑦𝑖−1, −𝑦𝑖, 𝑦𝑖+1, . . . , 𝑦𝑉). Then

  $y_i \left( h(y) - h(y_{\downarrow i}) \right)$  (14)

is an unbiased estimate of $\frac{\partial}{\partial p_i} \mathbb{E}_{y \sim Y}[h(y)]$.

Proof. We take the derivative of the expectation,

  $\frac{\partial}{\partial p_i} \mathbb{E}_{y \sim Y}[h(y)] = \sum_y \frac{\Pr(y)}{\Pr(y_i)} \, h(y) \, y_i$,  (15)

and write out the sum over 𝑦𝑖,

  $\sum_{y_{\neg i}} \sum_{y_i} \Pr(y_{\neg i}) \, h(y) \, y_i = \sum_{y_{\neg i}} \Pr(y_{\neg i}) \sum_{y_i} h(y) \, y_i$,  (16)

where 𝑦¬𝑖 denotes the vector 𝑦 with the 𝑖-th component omitted. Notice that the inner sum simplifies and no longer depends on 𝑦𝑖,

  $\sum_{y_{\neg i}} \Pr(y_{\neg i}) \left( h(y_{i=+1}) - h(y_{i=-1}) \right)$,  (17)

where 𝑦𝑖=𝑧 is the vector 𝑦 with the 𝑖-th component set to 𝑧. Then, we multiply the inner subtraction by the constant factor $1 = p_i + (1 - p_i) = \sum_{y_i} \Pr(y_i)$,

  $\sum_{y_{\neg i}} \sum_{y_i} \Pr(y_{\neg i}) \Pr(y_i) \left( h(y_{i=+1}) - h(y_{i=-1}) \right)$,  (18)

ultimately leading to the following expression for (15):

  $\sum_y \Pr(y) \left( h(y_{i=+1}) - h(y_{i=-1}) \right)$,  (19)

which can be written as

  $\sum_y \Pr(y) \, y_i \left[ h(y) - h(y_{\downarrow i}) \right]$.  (20)

Thus, (14) is a single-sample unbiased estimate of (15). ∎

According to Lemma 1, an unbiased estimate of the derivative (12) is

  $\frac{\partial}{\partial F_Z(a_i)} \mathbb{E}_Y[l(s, g(Y))] \approx y_i \left[ l(s, g(y)) - l(s, g(y_{\downarrow i})) \right]$,  (21)

where 𝑦 is a random sample of Bernoulli variables with probabilities (9) (see a few examples of sampled derivatives in Fig. 3).

Figure 3: Examples of derivatives (12) computed according to (21) for different numbers of samples 𝑛 (𝑛 = 1, 8, 64, 512), given the output of 𝐹𝑍, for a small 6 × 6 image. The red frame outlines the object.
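To use the estimate (21) with back-propagation, one can compute it without autograd and inject it as the gradient of a surrogate loss that is linear in 𝑝𝑖 = 𝐹𝑍(𝑎𝑖). The sketch below is a deliberately naive illustration of this idea, not the authors' implementation: it re-evaluates 𝑔 once per pixel (the sparsity discussed in Sec. 5 makes most of this work avoidable), and the names `weak_loss` and `g_fn` are ours.

```python
import torch

def weak_loss(a: torch.Tensor, s: float, g_fn) -> torch.Tensor:
    """Surrogate loss whose gradient w.r.t. the logits realizes Eq. (21).

    a    -- network logits (requires_grad=True)
    s    -- ground-truth object size
    g_fn -- black-box size derivation g on a {+1, -1} mask; never differentiated
    """
    p = torch.sigmoid(a)                        # p_i = F_Z(a_i), Eqs. (9), (11)
    with torch.no_grad():
        y = torch.where(torch.rand_like(p) < p,
                        torch.ones_like(p), -torch.ones_like(p))
        base = (s - g_fn(y)) ** 2               # l(s, g(y)), Eq. (7)
        grad = torch.zeros_like(a).view(-1)
        y_flat = y.view(-1)
        for i in range(y_flat.numel()):
            yi = y_flat[i].item()               # original sample y_i
            y_flat[i] = -yi                     # flip to y_{↓i}
            flipped = (s - g_fn(y)) ** 2        # l(s, g(y_{↓i}))
            grad[i] = yi * (base - flipped)     # Eq. (21)
            y_flat[i] = yi                      # restore
    # d/dp_i of (p * grad).sum() equals grad_i, so backward() propagates the
    # sampled estimate through F_Z(a) = sigmoid(a) and into the network.
    return (p * grad.view_as(p)).sum()
```

For CPU tensors, the `object_size` sketch from Sec. 2.1.1 can serve as `g_fn`; averaging several such estimates gives the multi-sample variant (𝑛 > 1) studied in Sec. 4.1.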
4. Experiments

The proposed method was implemented in the PyTorch Lightning framework² using a ResNet implementation from the Segmentation Models PyTorch library³. The presented experiments were performed on a server equipped with an Intel Xeon Silver 4214R CPU (2.40 GHz) and an NVIDIA GeForce RTX 2080 Ti GPU.

² https://github.com/Lightning-AI/lightning
³ https://github.com/qubvel/segmentation_models.pytorch

The data for our experiments were based on a dataset of 3D MRI images of the hippocampus [13]. The dataset consists of 394 volumes provided with GT segmentations of the classes hippocampus head, hippocampus body, and background. We decomposed the volumes into individual 2D slices of size 48 × 32 pixels and kept only those with at least 1% foreground, obtaining a total of 6093 images. Next, we merged the two hippocampus classes to get a binary segmentation problem (see Fig. 4). Afterward, we derived the object sizes from the GT pixel-wise annotations to use in training. Finally, we randomly split the data into training, validation, and testing subsets containing 70%, 10%, and 20% of the images, respectively.

Figure 4: Example of a hippocampus image [13] with the object outlined in red.

Given a GT segmentation 𝑦 and a predicted segmentation 𝑦ˆ, we evaluate two metrics, the squared size prediction error 𝐸 and the intersection-over-union 𝐼𝑜𝑈:

  $E(y, \hat{y}) = l(g(y), g(\hat{y}))$,  (22)

  $IoU(y, \hat{y}) = \frac{\sum_{i=1}^{V} (1 + y_i + \hat{y}_i + y_i \hat{y}_i)}{\sum_{i=1}^{V} (3 + y_i + \hat{y}_i - y_i \hat{y}_i)}$.  (23)

In the case of the standard supervised method, vertical and horizontal flipping was randomly applied to augment the training dataset. The proposed method did not apply any augmentation.
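Both metrics are straightforward to evaluate on {±1}-valued masks. In Eq. (23), the numerator term 1 + 𝑦𝑖 + 𝑦ˆ𝑖 + 𝑦𝑖𝑦ˆ𝑖 equals 4 exactly where both masks are +1 (the intersection) and 0 elsewhere, while the denominator term 3 + 𝑦𝑖 + 𝑦ˆ𝑖 − 𝑦𝑖𝑦ˆ𝑖 equals 4 exactly where either mask is +1 (the union), so the ratio reduces to the usual |intersection| / |union|. A small sketch (function names ours, `g_fn` as before):

```python
import numpy as np

def size_error(y: np.ndarray, y_hat: np.ndarray, g_fn) -> float:
    """Squared size prediction error E of Eq. (22)."""
    return float((g_fn(y) - g_fn(y_hat)) ** 2)

def iou(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Intersection-over-union of Eq. (23) for {+1, -1}-valued masks."""
    intersection = (1 + y + y_hat + y * y_hat).sum()  # 4 * |y ∩ ŷ|
    union = (3 + y + y_hat - y * y_hat).sum()         # 4 * |y ∪ ŷ|
    return float(intersection / union)
```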
4.1. Number of derivative samples

A toy example (see Fig. 3) indicated that taking more samples of the derivatives (21) might lead to better results than taking just one. This experiment investigates how the number of derivative samples 𝑛 impacts learning speed and prediction quality.

We considered four different numbers of samples, 𝑛 ∈ {1, 2, 4, 8}. For each 𝑛, the other parameters (such as the batch size or the learning rate) were the same, and the learning began with the same segmentation network 𝑓𝜃 that was pre-trained in the standard way on 85 pixel-wise annotated images from the training subset. The proposed method always ran until the squared error 𝐸 on the validation data stopped improving.

To assess the learning speed, we measured the duration of one learning epoch. For 𝑛 = 1, an epoch took ≈ 10× longer than in standard supervised learning. Generally, the duration grew roughly exponentially with 𝑛 (see Fig. 5).

Figure 5: Average epoch duration (in seconds) for the proposed method with different numbers of gradient samples 𝑛. The duration of the standard method is given as a reference.

Higher values of 𝑛 did not lead to a lower 𝐸 or faster convergence (see Fig. 6). In fact, 𝑛 = 1 and 𝑛 = 2 achieved the lowest 𝐸, but not by a large margin. Given the speed benefits, we always use 𝑛 = 1. Interestingly, even though 𝐸 kept decreasing over the course of learning for all 𝑛, 𝐼𝑜𝑈 improved only slightly and started declining after ≈ 20 epochs. This observation suggests that the squared error of the object size is not a sufficient objective for learning the segmentation.

Figure 6: Development of the squared size prediction error 𝐸 and the intersection-over-union 𝐼𝑜𝑈 on the validation images over the course of learning for different numbers of derivative samples 𝑛.

4.2. Pre-training impact

This experiment tests the essential question: given a segmentation model trained on a few pixel-wise annotated images, can we improve its testing performance by further learning from size annotations?

We trained different segmentation networks until convergence on randomly selected training subsets of size 𝑚. Then, we fine-tuned these networks on the whole training dataset using the proposed method. We measured the test performance in terms of 𝐼𝑜𝑈.

The proposed method led to a ≈ 5% increase in 𝐼𝑜𝑈 for small 𝑚 < 100 (see Fig. 7), improving the segmentation quality. For higher 𝑚, the effect was negligible, which complements the observation from the previous experiment that improving the size estimate does not necessarily improve the segmentation quality.

Figure 7: 𝐼𝑜𝑈 on the test data for different sizes 𝑚 of the pre-training dataset. The plot shows results achieved by a network after pre-training and after subsequent fine-tuning by the proposed method.

5. Discussion

The method is promising, but there is definitely potential for improvement in both speed and prediction performance.

The proposed method samples the derivatives according to (21) for each pixel 𝑖. However, flipping the prediction, 𝑦𝑖 ↦ −𝑦𝑖, changes the derived size only for some 𝑖, particularly those within and on the border of the predicted object. Therefore, given a sample 𝑦, 𝑙(𝑠, 𝑔(𝑦)) = 𝑙(𝑠, 𝑔(𝑦↓𝑖)) for many pixels 𝑖, and the sampled derivatives (21) are sparse. The method might sample only those derivatives that are potentially non-zero and set the rest to zero directly, which would save much computational time (one possible realization is sketched at the end of this section).

We have seen in the experiments that a lower size prediction error does not strictly imply a better segmentation. We need to investigate closely in which cases the size prediction loss is insufficient and adjust the objective. The adjustment might involve adding an L1 regularization (as in [4]) or drawing inspiration from unsupervised methods (e.g., demanding that the segmentation respect edges in the image, etc.).

The proposed approach entails some principled limitations. For example, it allows only a single object in an image. We also expect the method to be ill-suited for complex object shapes, but we have not performed any experiments in that regard yet.
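As a sketch of the speed-up suggested above (our assumption, not evaluated in the paper): for the Chebyshev-distance size derivation, a flipped pixel can only change 𝑔(𝑦) if it is a predicted foreground pixel (removing it may lower the maximum distance) or a background pixel adjacent to the object (turning it to foreground may raise the distances). Both sets are covered by one dilation of the predicted foreground:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def candidate_pixels(mask: np.ndarray) -> np.ndarray:
    """Boolean map of pixels whose flip may change the derived size g(y).

    Covers the predicted foreground plus its outer one-pixel border;
    the derivatives (21) at the remaining pixels can be set to zero
    directly. A flip far from the object creates an isolated foreground
    pixel of size 2, which cannot raise the maximum while an object
    (size >= 2) is already present.
    """
    foreground = mask > 0
    # 3x3 structuring element = Chebyshev ball of radius 1.
    return binary_dilation(foreground, structure=np.ones((3, 3), dtype=bool))
```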
6. Conclusion

We proposed a weakly-supervised method for training a segmentation network from a few pixel-wise annotated images and many images annotated by the object size. The key ingredients are a method for evaluating the object size from a probabilistic segmentation and a method for optimizing a deep network using a non-differentiable objective.

The achieved results seem promising. We believe the improvements suggested in the discussion will further improve performance, rendering the method valuable for training segmentation models for biomedical images.

Acknowledgments

The authors acknowledge the support of the OP VVV funded project "CZ.02.1.01/0.0/0.0/16_019/0000765 Research Center for Informatics", the Czech Science Foundation project 20-08452S, and the Grant Agency of the CTU in Prague, grant No. SGS20/170/OHK3/3T/13.

References

[1] X. Liu, et al., A review of deep-learning-based medical image segmentation methods, Sustainability 13 (2021) 1224.
[2] S. Minaee, et al., Image segmentation using deep learning: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[3] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[4] C. Cano-Espinosa, et al., Biomarker localization from deep learning regression networks, IEEE Transactions on Medical Imaging 39 (2020) 2121–2132.
[5] M. Pérez-Pelegrí, et al., Automatic left ventricle volume calculation with explainability through a deep learning weak-supervision methodology, Computer Methods and Programs in Biomedicine 208 (2021) 106275.
[6] C. Karam, K. Sugimoto, K. Hirakawa, Fast convolutional distance transform, IEEE Signal Processing Letters 26 (2019) 853–857.
[7] D. D. Pham, G. Dovletov, J. Pauli, A differentiable convolutional distance transform layer for improved image segmentation, in: DAGM German Conference on Pattern Recognition, Springer, 2020, pp. 432–444.
[8] T. Raiko, M. Berglund, G. Alain, L. Dinh, Techniques for learning binary stochastic feedforward neural networks, in: 3rd International Conference on Learning Representations, 2015.
[9] M. Yin, M. Zhou, ARM: Augment-REINFORCE-merge gradient for stochastic binary networks, in: 7th International Conference on Learning Representations, 2019.
[10] A. Shekhovtsov, V. Yanush, B. Flach, Path sample-analytic gradient estimators for stochastic binary networks, Advances in Neural Information Processing Systems 33 (2020) 12884–12894.
[11] Y. Cong, M. Zhao, K. Bai, L. Carin, GO gradient for expectation-based objectives, in: 7th International Conference on Learning Representations, 2019.
[12] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[13] A. L. Simpson, et al., A large annotated medical image dataset for the development and evaluation of segmentation algorithms, 2019. arXiv:1902.09063.