<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning to segment from object sizes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Denis Baručić</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Kybic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deep learning has proved particularly useful for semantic segmentation, a fundamental image analysis task. However, standard deep learning methods need many training images with ground-truth pixel-wise annotations, which are usually laborious to obtain and, in some cases (e.g., medical images), require domain expertise. Therefore, instead of pixel-wise annotations, we focus on image annotations that are significantly easier to acquire but still informative, namely the size of foreground objects. We define the object size in terms of the maximum Chebyshev distance between a foreground pixel and the nearest background pixel. We propose an algorithm for training a deep segmentation network from a dataset of a few pixel-wise annotated images and many images with known object sizes. The algorithm minimizes a discrete (non-differentiable) loss function defined over the object sizes by sampling the gradient and then using the standard back-propagation algorithm. Experiments show that the new approach improves the segmentation performance.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic segmentation</kwd>
        <kwd>weakly-supervised learning</kwd>
        <kwd>deep learning</kwd>
        <kwd>distance transform</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Semantic segmentation is the process of associating a class label to each pixel of an image. With the advent of deep learning, deep networks have achieved incredible performance on many image processing tasks, including semantic segmentation. Deep learning for semantic segmentation has many benefits; for example, it is flexible w.r.t. the model architecture and scales particularly well [1, 2]. On the other hand, standard deep learning demands many ground-truth (GT) pixel-wise annotations to prevent overfitting. Since a human expert annotator must usually provide the GT annotations, acquiring a good-quality training dataset can be difficult. To combat this issue, we focus on learning from GT image annotations that are easier to produce but still informative enough, namely the sizes of foreground objects. In practice, our approach assumes a training dataset that consists of relatively few pixel-wise annotated images and many images with known object sizes. We present a work-in-progress solution.</p>
      <sec id="sec-1-1">
        <title>1.1. Proposed approach</title>
        <p>Suppose a standard convolutional network for image segmentation (e.g., a U-Net [3]). Given an input image, we feed it to the network and collect the output prediction. The prediction is then thresholded to obtain a binary mask, which is processed by a distance transform, assigning to each foreground pixel the shortest distance to the background. Finally, the object size is defined as double the maximum of the computed distances.</p>
        <p>Due to the thresholding, the cost function is not differentiable, and it is therefore not possible to use standard gradient descent for learning. We overcome this obstacle by adding random noise to the output of our network. The predicted binary masks then become stochastic and the gradient can be sampled. A detailed description of our method is given in Sec. 2 and 3.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Related work</title>
        <p>Cano-Espinosa et al. [4] considered a similar learning problem. They proposed a network architecture that performs a biomarker (fat content) regression and image segmentation after being trained directly on images annotated by biomarker values only. Similarly to ours, their method derives the biomarker value from the predicted segmentation deterministically. The difference is that their biomarker, equivalent to the foreground area, can be obtained by a simple summation. Furthermore, the method assumes that the foreground objects can be roughly segmented using thresholding. Pérez-Pelegrí et al. [5] took a similar approach. Although their method does not involve thresholding to produce an approximate segmentation, it was tailored explicitly for learning from images annotated by the foreground volume (as their images are 3D).</p>
        <p>Karam et al. [6] implemented a differentiable distance transform via a combination of convolution operations. The method is fast but exhibits numerical instabilities for bigger images. Resolving the numerical instabilities, Pham et al. [7] later proposed a cascaded procedure with locally restricted convolutional distance transforms. Nonetheless, both methods substitute the minimum function with the log-sum-exp operation, which leads to inaccurate results.</p>
        <p>The way our method deals with a non-differentiable cost function is borrowed from stochastic binary networks [8]. In a stochastic binary network, one needs to deal with a zero gradient after each layer of the network. However, methods such as ARM [9] or PSA [10] are unnecessarily complex. Instead, we employ a single sample estimation, which has been discussed in [11].</p>
      </sec>
      <p>ITAT’22: Information technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia. barucden@fel.cvut.cz (D. Baručić); kybic@fel.cvut.cz (J. Kybic). ORCID: 0000-0003-0428-3354 (D. Baručić), 0000-0002-9363-4947 (J. Kybic). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Model</title>
      <p>The proposed model consists of (1) a segmentation network f, parametrized by θ, and (2) a deterministic algorithm that derives the object size based on a distance transform, denoted as g.</p>
      <p>Given an input image x = (x_1, …, x_N), the network produces a pixel-wise segmentation

s = f(x; θ), (1)

such that s_i ∈ ℝ, 1 ≤ i ≤ N, where N is the number of pixels. The method does not make any assumptions about the network’s technical details, except that it can be trained using the standard back-propagation algorithm and gradient descent. In our experiments, we always employed a U-Net [3] with a residual network encoder [12] and a mirroring decoder.</p>
      <p>To obtain a binary mask ŷ ∈ {±1}^N, the network response s is thresholded,

ŷ_i = sign(s_i), i = 1, …, N. (2)</p>
      <sec id="sec-2-1">
        <title>2.1. Object size</title>
        <p>We use a distance transform of the binary mask to define the object size (see Fig. 1). The distance transform assigns to each pixel the shortest distance to the background, i.e.,

d_i = min_{j : ŷ_j = −1} d(i, j), i = 1, …, N, (3)

where d(i, j) is the Chebyshev (ℓ∞) distance. After that, we take double the maximum distance to define the object size,

ẑ = 2 max_i d_i. (4)</p>
        <p>The composition of the distance transform and the maximum aggregation determines the object size, denoted as g : {±1}^N → ℝ,

g(ŷ) = 2 max_i min_{j : ŷ_j = −1} d(i, j). (5)</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Implementation details</title>
          <p>There is an efficient, two-pass algorithm that computes the distance transform in Θ(N) time. Furthermore, when evaluating a batch of images, it is possible to compute the distance transform on all images in parallel. We have implemented a CPU version of this algorithm that works with PyTorch tensors and is faster than, e.g., the SciPy implementation.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Learning</title>
      <p>Suppose a training dataset D = D_f ∪ D_w consists of fully- and weakly-annotated subsets D_f and D_w. The fully-annotated subset D_f contains pairs (x, y), where x is an input image and y the corresponding GT pixel-wise segmentation, while D_w comprises pairs (x, z), where z is the size of the object present in the image x. We focus on situations when |D_f| ≪ |D_w|.</p>
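      <p>A minimal sketch of the object-size computation of Sec. 2.1: the two-pass (chamfer) chessboard distance transform followed by 2·max. The explicit pixel loops are illustrative only, not the paper's optimized PyTorch implementation; for reference, scipy.ndimage.distance_transform_cdt(mask, metric='chessboard') computes the same transform.</p>

```python
import numpy as np

INF = 10**9

def chebyshev_dt(mask):
    """Two-pass chamfer chessboard distance transform.

    mask: boolean array, True = foreground (+1 in the paper's notation).
    Returns, for every pixel, the Chebyshev distance to the nearest
    background pixel (0 on background), cf. Eq. (3).
    """
    h, w = mask.shape
    d = np.where(mask, INF, 0).astype(np.int64)
    # Forward pass: top-left to bottom-right.
    for i in range(h):
        for j in range(w):
            if d[i, j] == 0:
                continue
            best = d[i, j]
            for di, dj in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    best = min(best, d[ni, nj] + 1)
            d[i, j] = best
    # Backward pass: bottom-right to top-left.
    for i in range(h - 1, -1, -1):
        for j in range(w - 1, -1, -1):
            best = d[i, j]
            for di, dj in ((1, 1), (1, 0), (1, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    best = min(best, d[ni, nj] + 1)
            d[i, j] = best
    return d

# A 3x3 square object in a 7x7 image (hypothetical toy mask).
mask = np.zeros((7, 7), dtype=bool)
mask[2:5, 2:5] = True
dt = chebyshev_dt(mask)
print(2 * dt.max())  # object size per Eq. (4) → 4
```

      <p>The two sweeps together visit each pixel a constant number of times, which is the Θ(N) behavior mentioned above.</p>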
      <sec id="sec-2-2">
        <title>3.1. Supervised pre-training</title>
        <p>Our method starts by optimizing a pixel-wise loss w.r.t. the network parameters θ on the small subset D_f, as in standard supervised learning. For a particular training pair (x, y) ∈ D_f and the corresponding prediction s = f(x; θ) ∈ ℝ^N, the loss function reads

ℓ(s, y) = Σ_{i=1}^{N} ((1 − ȳ_i) s_i + log(1 + exp(−s_i))), (6)

where ȳ_i = (1 + y_i)/2 ∈ {0, 1} encodes the GT label of pixel i, which is known as the binary cross-entropy with logits loss. The optimization continues until convergence.</p>
        <p>Using proper data augmentation to extend the training dataset, the network tends to recognize useful features and produces decent predictions after this initial stage (see Sec. 4.2).</p>
        <p>[Fig. 2: the proposed pipeline — an input image passes through the segmentation network; noise is subtracted and the result thresholded; the distance transform and maximum aggregation derive the size g(y), which enters the loss ℓ(z, g(y)).]</p>
      </sec>
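      <p>The per-pixel loss (6) is the binary cross-entropy with logits written out explicitly. As a sanity check, a short NumPy snippet (the logits and targets below are assumed toy values) confirms that this algebraic form matches the textbook definition −Σ_i (t_i log σ(s_i) + (1 − t_i) log(1 − σ(s_i))):</p>

```python
import numpy as np

def bce_with_logits(s, t):
    """Eq. (6): sum_i (1 - t_i) * s_i + log(1 + exp(-s_i)), with t_i in {0, 1}."""
    return np.sum((1 - t) * s + np.log1p(np.exp(-s)))

# Toy logits and binary targets (assumed values for illustration).
s = np.array([2.0, -1.0, 0.5])
t = np.array([1.0, 0.0, 1.0])

# Textbook binary cross-entropy on the sigmoid outputs.
sig = 1 / (1 + np.exp(-s))
ref = -np.sum(t * np.log(sig) + (1 - t) * np.log(1 - sig))

print(abs(bce_with_logits(s, t) - ref) < 1e-9)  # → True
```

      <p>Writing the loss directly on the logits, as in (6), avoids evaluating log of a near-zero sigmoid and is numerically safer; this is the same reasoning behind the "with logits" loss variants in deep learning frameworks.</p>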
      <sec id="sec-2-3">
        <title>3.2. Weakly-supervised training</title>
        <p>Consider a training pair (x, z) ∈ D_w. As described in Sec. 2, one can obtain a prediction of the object size, ẑ = g(ŷ), from the thresholded network response ŷ. We penalize the prediction error by the square loss

ℓ(z, ẑ) = (z − ẑ)². (7)</p>
        <p>We propose to follow an approach similar to those used in binary neural networks [10] and subtract random noise n from the real predictions s before thresholding. Consequently, the binary segmentation becomes a collection y = (y_1, …, y_N) of N independent Bernoulli variables,

y_i = sign(s_i − n_i), (8)

with

Pr(y_i = +1 | x; θ) = Pr(n_i ≤ s_i) = F(s_i), (9)

where F is the cumulative distribution function (CDF) of the noise n (see Fig. 2).</p>
        <p>Then, instead of minimizing the loss (7), we minimize the expected loss ℒ = E_y[ℓ(z, g(y))],

ℒ = Σ_{y ∈ {±1}^N} Pr(y | x; θ) ℓ(z, g(y)). (10)

Contrary to (7), the expected loss (10) is differentiable, assuming a smooth F.</p>
        <sec id="sec-2-3-1">
          <title>3.2.1. Noise distribution</title>
          <p>Following [10], we sample the noise n_i from the logistic distribution with mean μ = 0 and scale σ = 1. Hence, the CDF of n_i is a smooth, sigmoid function,

F(a) = 1 / (1 + exp(−a)). (11)</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>3.2.2. Exact gradient</title>
          <p>To compute the gradient ∇_θ ℒ, we need to evaluate the derivative

∂ℒ / ∂s_i (12)

for each pixel i = 1, …, N. The gradient can then be computed automatically by the back-propagation algorithm. However, an exact computation of (12) leads to

∂ℒ / ∂s_i = Σ_{y ∈ {±1}^N} (∂Pr(y | x; θ) / ∂s_i) ℓ(z, g(y)), (13)

which involves summing 2^N terms and is thus tractable only for very small images. Instead, we resort to a single sample estimator.</p>
        </sec>
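        <p>The stochastic binarization (8)–(9) with logistic noise (11) can be checked empirically: the frequency of y_i = +1 should match the sigmoid F(s_i). A small Monte Carlo sketch (the logit value s is an assumed example):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

s = 0.8  # a single network logit (assumed example value)

# Eq. (8): subtract logistic noise (mean 0, scale 1) and threshold.
n = rng.logistic(loc=0.0, scale=1.0, size=200_000)
y = np.sign(s - n)

# Eq. (9): Pr(y = +1) = F(s), with F the logistic CDF of Eq. (11).
p_emp = np.mean(y == 1)
p_theo = 1 / (1 + np.exp(-s))

print(abs(p_emp - p_theo) < 0.01)
```

      <p>This is exactly what makes the expected loss (10) smooth in s: nudging a logit changes the probability of each binary mask continuously, even though every individual sample is still hard ±1.</p>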
        <sec id="sec-2-3-3">
          <title>3.2.3. Single sample estimator</title>
          <p>The single sample estimator is based on Lemma 1, which is, in fact, a specific form of [10, Lemma B.1].</p>
          <p>Lemma 1. Let Y = (Y_1, …, Y_N) be a collection of N independent {±1}-valued Bernoulli variables with probabilities Pr(Y_i = +1) = p_i. Let h be a function h : {±1}^N → ℝ. Let y = (y_1, …, y_N) denote a random sample of Y and y↓i = (y_1, …, y_{i−1}, −y_i, y_{i+1}, …, y_N). Then

y_i (h(y) − h(y↓i)) (14)

is an unbiased estimate of ∂/∂p_i E_{Y}[h(Y)].</p>
          <p>Proof. We start from the expectation

E_Y[h(Y)] = Σ_y Pr(y) h(y), (15)

take the derivative ∂Pr(y)/∂p_i = y_i Pr(y_¬i), and write out the sum over y_i,

∂/∂p_i E_Y[h(Y)] = Σ_{y_¬i} Σ_{y_i} y_i Pr(y_¬i) h(y), (16)

where y_¬i denotes the vector y with the i-th component omitted. Notice that the inner sum simplifies and no longer depends on y_i,

Σ_{y_¬i} Pr(y_¬i) (h(y_{i=+1}) − h(y_{i=−1})), (17)

where y_{i=c} is the vector y with the i-th component set to c. Then, we multiply the inner subtraction by the constant factor 1 = p_i + (1 − p_i) = Σ_{y_i} Pr(y_i),

Σ_{y_¬i} Pr(y_¬i) Σ_{y_i} Pr(y_i) (h(y_{i=+1}) − h(y_{i=−1})), (18)

ultimately leading to the following expression for the derivative of (15):

Σ_y Pr(y) (h(y_{i=+1}) − h(y_{i=−1})), (19)

which can be written as

Σ_y Pr(y) y_i (h(y) − h(y↓i)). (20)

Thus, (14) is a single sample unbiased estimate of the derivative of (15).</p>
          <p>According to Lemma 1, an unbiased estimate of the derivative (12) is

F′(s_i) y_i (ℓ(z, g(y)) − ℓ(z, g(y↓i))), (21)

where y is a random sample of Bernoulli variables with probabilities (9) (see a few examples of sampled derivatives in Fig. 3).</p>
        </sec>
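        <p>Lemma 1 can be verified numerically on a problem small enough to enumerate: compare the exact derivative (17) with the average of the single sample estimate (14) over many draws. The probabilities p, the random table h, and the sample count M below are illustrative choices, not values from the paper.</p>

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N = 3  # tiny so the exact derivative is enumerable
p = np.array([0.3, 0.6, 0.8])  # Pr(Y_i = +1), assumed values

# An arbitrary function h : {±1}^N -> R, stored as a random table.
h_table = {y: rng.normal() for y in itertools.product((-1, 1), repeat=N)}
def h(y):
    return h_table[tuple(int(v) for v in y)]

# Exact d/dp_0 E[h(Y)] = sum over y_¬0 of Pr(y_¬0) (h(y_{0=+1}) - h(y_{0=-1})),
# cf. Eq. (17).
def prob_rest(rest):
    return np.prod([p[i + 1] if rest[i] == 1 else 1 - p[i + 1]
                    for i in range(N - 1)])

exact = sum(prob_rest(r) * (h((1,) + r) - h((-1,) + r))
            for r in itertools.product((-1, 1), repeat=N - 1))

# Monte Carlo average of the single sample estimate y_0 (h(y) - h(y↓0)), Eq. (14).
M = 100_000
samples = np.where(rng.random((M, N)) < p, 1, -1)
flipped = samples.copy()
flipped[:, 0] = -flipped[:, 0]
est = np.mean([y[0] * (h(y) - h(yf)) for y, yf in zip(samples, flipped)])

print(abs(est - exact) < 0.05)
```

        <p>Each draw costs only two evaluations of h (here, two size derivations), which is what makes the estimator practical compared with the 2^N-term sum (13).</p>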
      </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>The proposed method was implemented in PyTorch using the PyTorch Lightning framework² and models from the Segmentation Models PyTorch library³. The experiments were run on a machine equipped with an Intel Xeon Silver 4214R (2.40 GHz) and an NVIDIA GeForce RTX 2080 Ti.</p>
      <p>The data for our experiments was based on a dataset of 3D MRI images of the hippocampus [13]. The dataset consists of 394 volumes provided with GT segmentation of the classes hippocampus head, hippocampus body, and background. We decomposed the volumes into individual 2D slices of size 48 × 32 pixels and kept only those with at least 1% foreground, obtaining a total of 6093 images. Next, we merged the hippocampus classes to get a binary segmentation problem (see Fig. 4). Afterward, we derived the object sizes from the GT pixel-wise annotations to use in training. Finally, we randomly split the data into training, validation, and testing subsets containing 70%, 10%, and 20% of the images.</p>
      <p>Given a GT segmentation y and a predicted segmentation ŷ, we evaluate two metrics, the squared size prediction error E and the intersection-over-union IoU,

E(y, ŷ) = ℓ(g(y), g(ŷ)), (22)

IoU(y, ŷ) = (Σ_{i=1}^{N} (1 + y_i + ŷ_i + y_i ŷ_i)) / (Σ_{i=1}^{N} (3 + y_i + ŷ_i − y_i ŷ_i)). (23)</p>
      <p>In the case of the standard supervised method, vertical and horizontal flipping was randomly applied to augment the training dataset. The proposed method did not apply any augmentation.</p>
      <p>²https://github.com/Lightning-AI/lightning</p>
      <p>³https://github.com/qubvel/segmentation_models.pytorch</p>
      <sec id="sec-4-1">
        <title>4.1. Number of derivative samples</title>
        <p>A toy example (see Fig. 3) indicated that taking more samples of the derivatives (21) might lead to better results than taking just one. This experiment investigates how the number of derivative samples K impacts learning speed and prediction quality. We considered four different numbers of samples, K ∈ {1, 2, 4, 8}. For each K, the other parameters (such as the batch size or the learning rate) were the same, and the learning began with the same segmentation network that was pre-trained in the standard way on 85 pixel-wise annotated images from the training subset. The proposed method always ran until the squared error E on the validation data stopped improving.</p>
        <p>To assess the learning speed, we measured the duration of one learning epoch. For K = 1, an epoch took ≈ 10× longer than the standard supervised learning. Generally, the duration grew roughly exponentially with K (see Fig. 5).</p>
        <p>Higher values of K did not lead to a lower E or a faster convergence speed (see Fig. 6). In fact, K = 1 and K = 2 achieved the lowest E, but not by a large margin. Given the speed benefits, we always use K = 1. Interestingly, even though E kept decreasing over the course of learning for all K, IoU improved only slightly and started declining after ≈ 20 epochs. This observation suggests that the squared error of the object size is not a sufficient objective for learning the segmentation.</p>
      </sec>
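      <p>The IoU form (23) for {±1} masks can be sketched directly; the per-pixel products count intersection and union membership (the masks below are assumed toy values):</p>

```python
import numpy as np

def iou(y, y_hat):
    """Eq. (23) for {±1} masks.

    1 + y + ŷ + y·ŷ is 4 iff both pixels are foreground (intersection);
    3 + y + ŷ − y·ŷ is 4 iff at least one is foreground (union), 0 otherwise.
    """
    num = np.sum(1 + y + y_hat + y * y_hat)
    den = np.sum(3 + y + y_hat - y * y_hat)
    return num / den

y = np.array([1, 1, -1, -1, 1])
y_hat = np.array([1, -1, -1, 1, 1])
# Foreground index sets {0, 1, 4} and {0, 3, 4}: intersection 2, union 4.
print(iou(y, y_hat))  # → 0.5
```

      <p>Both sums are plain elementwise arithmetic, so the metric vectorizes over whole batches of masks without any boolean set logic.</p>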
      <sec id="sec-4-2">
        <title>4.2. Pre-training impact</title>
        <p>This experiment tests the essential question: given a segmentation model trained on a few pixel-level annotated images, can we improve its testing performance by further learning from size annotations?</p>
        <p>We trained different segmentation networks until convergence on randomly selected training subsets of size M. Then, we fine-tuned these networks on the whole training dataset using the proposed method. We measured the test performance in terms of IoU.</p>
        <p>The proposed method led to a ≈ 5% increase of IoU for small M &lt; 100 (see Fig. 7), improving the segmentation quality. For higher M, the effect was negligible, which complements the observation from the previous experiment that improving the size estimate does not necessarily improve the segmentation quality.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The method is promising, but there is definitely potential for improvement in both speed and prediction performance.</p>
      <p>The proposed method samples the derivatives according to (21) for each pixel i. However, flipping the prediction, y_i ↦ −y_i, changes the derived size only for some i, particularly those within and on the border of the predicted object. Therefore, given a sample y, ℓ(z, g(y)) = ℓ(z, g(y↓i)) for many pixels i, and the sampled derivatives (21) are sparse. The method might sample only those derivatives that are potentially non-zero and set the rest to zero directly, which would save much computational time.</p>
      <p>We have seen in the experiments that a lower size prediction error does not strictly imply better segmentation. We need to closely investigate in what cases the size prediction loss is insufficient and adjust the objective. The adjustment might involve adding an L1 regularization (as in [4]) or drawing inspiration from unsupervised methods (e.g., demanding that the segmentation respect edges in images, etc.).</p>
      <p>The proposed approach entails some principled limitations. For example, it allows only a single object in an image. We also expect the method to be ill-suited for complex object shapes, but we have not performed any experiments in that regard yet.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We proposed a weakly-supervised method for training a segmentation network from a few pixel-wise annotated images and many images annotated by the object size. The key ingredients are a method for evaluating the object size from a probabilistic segmentation and a method for optimizing a deep network using a non-differentiable objective.</p>
      <p>The achieved results seem promising. We believe the improvements suggested in the discussion will further improve performance, rendering the method valuable for training segmentation models for biomedical images.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors acknowledge the support of the OP VVV funded project “CZ.02.1.01/0.0/0.0/16_019/0000765 Research Center for Informatics”, the Czech Science Foundation project 20-08452S, and the Grant Agency of the CTU in Prague, grant No. SGS20/170/OHK3/3T/13.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[4] C. Cano-Espinosa, et al., Biomarker localization from deep learning regression networks, IEEE Transactions on Medical Imaging 39 (2020) 2121–2132.</p>
      <p>[5] M. Pérez-Pelegrí, et al., Automatic left ventricle volume calculation with explainability through a deep learning weak-supervision methodology, Computer Methods and Programs in Biomedicine 208 (2021) 106275.</p>
      <p>[6] C. Karam, K. Sugimoto, K. Hirakawa, Fast convolutional distance transform, IEEE Signal Processing Letters 26 (2019) 853–857.</p>
      <p>[7] D. D. Pham, G. Dovletov, J. Pauli, A differentiable convolutional distance transform layer for improved image segmentation, in: DAGM German Conference on Pattern Recognition, Springer, 2020, pp. 432–444.</p>
      <p>[8] T. Raiko, M. Berglund, G. Alain, L. Dinh, Techniques for learning binary stochastic feedforward neural networks, in: 3rd International Conference on Learning Representations, 2015.</p>
      <p>[9] M. Yin, M. Zhou, ARM: augment-REINFORCE-merge gradient for stochastic binary networks, in: 7th International Conference on Learning Representations, 2019.</p>
      <p>[10] A. Shekhovtsov, V. Yanush, B. Flach, Path sample-analytic gradient estimators for stochastic binary networks, Advances in Neural Information Processing Systems 33 (2020) 12884–12894.</p>
      <p>[11] Y. Cong, M. Zhao, K. Bai, L. Carin, GO gradient for expectation-based objectives, in: 7th International Conference on Learning Representations, 2019.</p>
      <p>[12] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.</p>
      <p>[13] A. L. Simpson, et al., A large annotated medical image dataset for the development and evaluation of segmentation algorithms, 2019. arXiv:1902.09063.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>A review of deep-learning-based medical image segmentation methods</article-title>
          ,
          <source>Sustainability</source>
          <volume>13</volume>
          (
          <year>2021</year>
          )
          <fpage>1224</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          , et al.,
          <article-title>Image segmentation using deep learning: A survey</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-net:
          <article-title>Convolutional networks for biomedical image segmentation</article-title>
          , in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>