Learning to segment from object sizes

Denis Baručić¹, Jan Kybic¹
¹ Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague

Abstract

Deep learning has proved particularly useful for semantic segmentation, a fundamental image analysis task. However, standard deep learning methods need many training images with ground-truth pixel-wise annotations, which are usually laborious to obtain and, in some cases (e.g., medical images), require domain expertise. Therefore, instead of pixel-wise annotations, we focus on image annotations that are significantly easier to acquire but still informative, namely the size of foreground objects. We define the object size as the maximum Chebyshev distance between a foreground pixel and the nearest background pixel. We propose an algorithm for training a deep segmentation network from a dataset of a few pixel-wise annotated images and many images with known object sizes. The algorithm minimizes a discrete (non-differentiable) loss function defined over the object sizes by sampling the gradient and then using the standard back-propagation algorithm. Experiments show that the new approach improves the segmentation performance.

Keywords

semantic segmentation, weakly-supervised learning, deep learning, distance transform

1. Introduction

Semantic segmentation is the process of associating a class label with each pixel of an image. With the advent of deep learning, deep networks have achieved remarkable performance on many image processing tasks, including semantic segmentation. Deep learning for semantic segmentation has many benefits; for example, it is flexible w.r.t. the model architecture and scales particularly well [1, 2]. On the other hand, standard deep learning demands many ground-truth (GT) pixel-wise annotations to prevent overfitting. Since a human expert annotator must usually provide the GT annotations, acquiring a good-quality training dataset can be difficult. To combat this issue, we focus on learning from GT image annotations that are easier to produce but still informative enough, namely the sizes of foreground objects. In practice, our approach assumes a training dataset that consists of relatively few pixel-wise annotated images and many images with known object sizes. We present a work-in-progress solution.

1.1. Proposed approach

Consider a standard convolutional network for image segmentation (e.g., a U-Net [3]). Given an input image, we feed it to the network and collect the output prediction. The prediction is then thresholded to obtain a binary mask, which is processed by a distance transform, assigning to each foreground pixel the shortest distance to the background. Finally, the object size is defined as double the maximum of the computed distances.

Due to the thresholding, the cost function is not differentiable, and it is therefore not possible to use standard gradient descent for learning. We overcome this obstacle by adding random noise to the output of our network. The predicted binary masks then become stochastic, and the gradient can be sampled. A detailed description of our method is given in Sec. 2 and 3.

1.2. Related work

Cano-Espinosa et al. [4] considered a similar learning problem. They proposed a network architecture that performs biomarker (fat content) regression and image segmentation after being trained directly on images annotated by biomarker values only. Similarly to ours, their method derives the biomarker value from the predicted segmentation deterministically. The difference is that their biomarker, equivalent to the foreground area, can be obtained by a simple summation. Furthermore, the method assumes that the foreground objects can be roughly segmented using thresholding.
Pérez-Pelegrí et al. [5] took a similar approach. Although their method does not involve thresholding to produce an approximate segmentation, it was tailored explicitly for learning from images annotated by the foreground volume (as their images are 3D).

Karam et al. [6] implemented a differentiable distance transform via a combination of convolution operations. The method is fast but exhibits numerical instabilities for bigger images. Resolving the numerical instabilities, Pham et al. [7] later proposed a cascaded procedure with locally restricted convolutional distance transforms. Nonetheless, both methods substitute the minimum function with the log-sum-exp operation, which leads to inaccurate results.

The way our method deals with a non-differentiable cost function is borrowed from stochastic binary networks [8]. In a stochastic binary network, one needs to deal with a zero gradient after each layer of the network. However, methods such as ARM [9] or PSA [10] are unnecessarily complex for our purposes. Instead, we employ a single-sample estimator, which has been discussed in [11].

ITAT'22: Information Technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
barucden@fel.cvut.cz (D. Baručić); kybic@fel.cvut.cz (J. Kybic)
ORCID: 0000-0003-0428-3354 (D. Baručić); 0000-0002-9363-4947 (J. Kybic)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Figure 1: Illustrative example of an object and its derived size. The object is outlined by the thick boundary line. The point 𝑖 denotes the foreground pixel whose shortest distance to the background, 𝑑𝑖, is the largest among all foreground pixels; the derived object size is ˆ𝑠 = 2𝑑𝑖.

2. Model

The proposed model consists of (1) a segmentation network 𝑓𝜃, parametrized by 𝜃, and (2) a deterministic algorithm, denoted 𝑔, that derives the object size based on the distance transform.

Given an input image 𝑥 = (𝑥1, . . . , 𝑥𝑉), the network produces a pixel-wise segmentation

  $a = f_\theta(x)$,  (1)

such that 𝑎𝑖 ∈ ℝ, 1 ≤ 𝑖 ≤ 𝑉, where 𝑉 is the number of pixels. The method does not make any assumptions about the network's technical details, except that it can be trained using the standard back-propagation algorithm and gradient descent. In our experiments, we always employed a U-Net [3] with a residual network encoder [12] and a mirroring decoder.

To obtain a binary mask 𝑦ˆ ∈ {±1}^𝑉, the network response 𝑎 is thresholded,

  $\hat{y}_i = \operatorname{sign} a_i$.  (2)

2.1. Object size

We use a distance transform of the binary mask to define the object size (see Fig. 1). The distance transform assigns to each pixel the shortest distance to the background, i.e.,

  $d_i = \min_{j : \hat{y}_j = -1} \delta(i, j), \qquad i = 1, \dots, V$,  (3)

where 𝛿(𝑖, 𝑗) is the Chebyshev (ℓ∞) distance. After that, we take double the maximum distance to define the object size,

  $\hat{s} = 2 \max_i d_i$.  (4)

The composition of the distance transform and the maximum aggregation determines the object size, denoted 𝑔 : {±1}^𝑉 → ℝ,

  $g(\hat{y}) = 2 \max_i \min_{j : \hat{y}_j = -1} \delta(i, j)$.  (5)

2.1.1. Implementation details

There is an efficient two-pass algorithm that computes the distance transform in Θ(𝑉) time. Furthermore, when evaluating a batch of images, it is possible to compute the distance transform on all images in parallel. We have implemented a CPU version¹ of this algorithm that works with PyTorch tensors and is faster than, e.g., the SciPy implementation.

¹ https://github.com/barucden/chdt
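For illustration, the size derivation 𝑔 of Eq. (5) can be written in a few lines on top of SciPy, whose `distance_transform_cdt` with the "chessboard" metric computes exactly the Chebyshev distance to the nearest background pixel. The following is a minimal single-mask sketch, not the batched PyTorch implementation from the chdt repository above; the function name `object_size` is ours.

```python
import numpy as np
from scipy.ndimage import distance_transform_cdt

def object_size(mask: np.ndarray) -> float:
    """Size derivation g of Eq. (5) for a {+1, -1}-valued binary mask.

    Assumes the mask contains at least one background pixel.
    """
    foreground = mask > 0
    # Chebyshev ("chessboard") distance from each foreground pixel to the
    # nearest background pixel; background pixels receive distance 0.
    d = distance_transform_cdt(foreground, metric="chessboard")
    # Object size = twice the maximum distance, Eq. (4).
    return 2.0 * float(d.max())
```

An all-background mask correctly yields size 0; batching amounts to applying the transform to each image separately, which is what the parallel implementation mentioned above exploits.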
3. Learning

Suppose a training dataset 𝒟 = 𝒟𝑓 ∪ 𝒟𝑤 consists of fully- and weakly-annotated subsets 𝒟𝑓 and 𝒟𝑤. The fully-annotated subset 𝒟𝑓 contains pairs (𝑥, 𝑦), where 𝑥 is an input image and 𝑦 the corresponding GT pixel-wise segmentation, while 𝒟𝑤 comprises pairs (𝑥, 𝑠), where 𝑠 is the size of the object present in the image 𝑥. We focus on situations where |𝒟𝑓| ≪ |𝒟𝑤|.

3.1. Supervised pre-training

Our method starts by optimizing a pixel-wise loss w.r.t. the network parameters 𝜃 on the small subset 𝒟𝑓, as in standard supervised learning. For a particular training pair (𝑥, 𝑦) ∈ 𝒟𝑓 and the corresponding prediction 𝑎 ∈ ℝ^𝑉, the loss function reads

  $\sum_{i=1}^{V} \left( a_i (1 - y_i) + \log(1 + \exp(-a_i)) \right)$,  (6)

which is known as the binary cross-entropy with logits loss. The optimization continues until convergence. Using proper data augmentation to extend the training dataset, the network tends to recognize useful features and produces decent predictions after this initial stage (see Sec. 4.2).

Figure 2: An overview of the proposed probabilistic model. The input image 𝑥 passes through the segmentation network 𝑓𝜃; noise 𝑍 is subtracted before thresholding, yielding the stochastic mask 𝑌; the size derivation 𝑔 (distance transform followed by the maximum) produces 𝑔(𝑌), which the loss 𝑙(𝑠, 𝑔(𝑌)) compares to the GT size 𝑠.

3.2. Weakly-supervised training

Consider a training pair (𝑥, 𝑠) ∈ 𝒟𝑤. As described in Sec. 2, one can obtain a prediction of the object size, ˆ𝑠 = 𝑔(𝑦ˆ), from the thresholded network response 𝑦ˆ. We penalize the prediction error by the square loss

  $l(s, \hat{s}) = (s - \hat{s})^2$.  (7)

We propose to follow an approach similar to those used in binary neural networks [10] and subtract random noise from the real-valued predictions 𝑎𝑖 before thresholding. Consequently, the binary segmentation becomes a collection 𝑌 = (𝑌1, . . . , 𝑌𝑉) of 𝑉 independent Bernoulli variables,

  $Y_i = \operatorname{sign}(a_i - Z_i)$,  (8)

where the 𝑍𝑖 are i.i.d. noise samples with common cumulative distribution function (CDF) 𝐹𝑍, so that

  $\Pr(Y_i = +1 \mid x; \theta) = \Pr(Z_i \le a_i) = F_Z(a_i)$  (9)

(see Fig. 2). Then, instead of minimizing the loss 𝑙 (7), we minimize the expected loss ℒ = E𝑌[𝑙(𝑠, 𝑔(𝑌))],

  $\mathcal{L} = \sum_{y \in \{\pm 1\}^V} \Pr(Y = y \mid x; \theta) \, l(s, g(y))$.  (10)

Contrary to (7), the expected loss (10) is differentiable, assuming a smooth 𝐹𝑍.

3.2.1. Noise distribution

Following [10], we sample the noise 𝑍 from the logistic distribution with mean 𝜇 = 0 and scale 1. Hence, the CDF of 𝑍 is the smooth sigmoid function

  $F_Z(a) = \frac{1}{1 + \exp(-a)}$.  (11)
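In code, Eqs. (8), (9), and (11) amount to one observation: thresholding logits with subtracted logistic noise is the same as sampling each pixel independently as +1 with probability sigmoid(𝑎𝑖). A minimal PyTorch sketch (the function name is ours):

```python
import torch

def sample_mask(a: torch.Tensor) -> torch.Tensor:
    """Sample Y of Eq. (8): Y_i = sign(a_i - Z_i), Z_i ~ Logistic(0, 1).

    Since Pr(Z_i <= a_i) = sigmoid(a_i) (Eqs. (9), (11)), this is equivalent
    to drawing each pixel as +1 with probability sigmoid(a_i).
    """
    p = torch.sigmoid(a)  # p_i = F_Z(a_i)
    return torch.where(torch.rand_like(a) < p,
                       torch.ones_like(a), -torch.ones_like(a))
```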
3.2.2. Exact gradient

To compute the gradient ∇𝜃ℒ, we need to evaluate the derivative

  $\frac{\partial}{\partial F_Z(a_i)} \mathbb{E}_Y[l(s, g(Y))]$  (12)

for each pixel 𝑖 = 1, . . . , 𝑉. The gradient can then be computed automatically by the back-propagation algorithm. However, an exact computation of (12) leads to

  $\sum_{y \in \{\pm 1\}^V} \frac{\Pr(Y = y \mid x; \theta)}{\Pr(Y_i = y_i \mid x; \theta)} \, l(s, g(y)) \, y_i$,  (13)

which involves summing 2^𝑉 terms and is thus tractable only for very small images. Instead, we resort to a single-sample estimator.

3.2.3. Single-sample estimator

The single-sample estimator is based on Lemma 1, which is, in fact, a specific form of [10, Lemma B.1].

Lemma 1. Let 𝑌 = (𝑌1, . . . , 𝑌𝑉) be a collection of 𝑉 independent {±1}-valued Bernoulli variables with probabilities Pr(𝑌𝑖 = +1) = 𝑝𝑖. Let ℎ be a function ℎ : {±1}^𝑉 → ℝ. Let 𝑦 = (𝑦1, . . . , 𝑦𝑉) denote a random sample of 𝑌 and 𝑦↓𝑖 = (𝑦1, . . . , 𝑦𝑖−1, −𝑦𝑖, 𝑦𝑖+1, . . . , 𝑦𝑉). Then

  $y_i \left( h(y) - h(y_{\downarrow i}) \right)$  (14)

is an unbiased estimate of $\frac{\partial}{\partial p_i} \mathbb{E}_{y \sim Y}[h(y)]$.

Proof. We take the derivative of the expectation,

  $\frac{\partial}{\partial p_i} \mathbb{E}_{y \sim Y}[h(y)] = \sum_y \frac{\Pr(y)}{\Pr(y_i)} \, h(y) \, y_i$,  (15)

and write out the sum over 𝑦𝑖,

  $\sum_{y_{\neg i}} \sum_{y_i} \Pr(y_{\neg i}) \, h(y) \, y_i = \sum_{y_{\neg i}} \Pr(y_{\neg i}) \sum_{y_i} h(y) \, y_i$,  (16)

where 𝑦¬𝑖 denotes the vector 𝑦 with the 𝑖-th component omitted. Notice that the inner sum simplifies and no longer depends on 𝑦𝑖,

  $\sum_{y_{\neg i}} \Pr(y_{\neg i}) \left( h(y_{i=+1}) - h(y_{i=-1}) \right)$,  (17)

where 𝑦𝑖=𝑧 is the vector 𝑦 with the 𝑖-th component set to 𝑧. Then, we multiply the inner subtraction by the constant factor $1 = p_i + (1 - p_i) = \sum_{y_i} \Pr(y_i)$,

  $\sum_{y_{\neg i}} \sum_{y_i} \Pr(y_{\neg i}) \Pr(y_i) \left( h(y_{i=+1}) - h(y_{i=-1}) \right)$,  (18)

ultimately leading to the following expression for (15):

  $\sum_y \Pr(y) \left( h(y_{i=+1}) - h(y_{i=-1}) \right)$,  (19)

which can be written as

  $\sum_y \Pr(y) \, y_i \left[ h(y) - h(y_{\downarrow i}) \right]$.  (20)

Thus, (14) is a single-sample unbiased estimate of (15). ∎

According to Lemma 1, an unbiased estimate of the derivative (12) is

  $\frac{\partial}{\partial F_Z(a_i)} \mathbb{E}_Y[l(s, g(Y))] \approx y_i \left[ l(s, g(y)) - l(s, g(y_{\downarrow i})) \right]$,  (21)

where 𝑦 is a random sample of Bernoulli variables with probabilities (9) (see a few examples of sampled derivatives in Fig. 3).

Figure 3: Examples of derivatives (12) computed according to (21) for different numbers of samples 𝑛 (𝑛 = 1, 8, 64, 512), given the output of 𝐹𝑍, for a small 6 × 6 image. The red frame outlines the object.
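To use the estimate (21) with back-propagation, one can compute it without autograd and inject it as the gradient of a surrogate loss that is linear in 𝑝𝑖 = 𝐹𝑍(𝑎𝑖). The sketch below is a deliberately naive illustration of this idea, not the authors' implementation: it re-evaluates 𝑔 once per pixel (the sparsity discussed in Sec. 5 makes most of this work avoidable), and the names `weak_loss` and `g_fn` are ours.

```python
import torch

def weak_loss(a: torch.Tensor, s: float, g_fn) -> torch.Tensor:
    """Surrogate loss whose gradient w.r.t. the logits realizes Eq. (21).

    a    -- network logits (requires_grad=True)
    s    -- ground-truth object size
    g_fn -- black-box size derivation g on a {+1, -1} mask; never differentiated
    """
    p = torch.sigmoid(a)                        # p_i = F_Z(a_i), Eqs. (9), (11)
    with torch.no_grad():
        y = torch.where(torch.rand_like(p) < p,
                        torch.ones_like(p), -torch.ones_like(p))
        base = (s - g_fn(y)) ** 2               # l(s, g(y)), Eq. (7)
        grad = torch.zeros_like(a).view(-1)
        y_flat = y.view(-1)
        for i in range(y_flat.numel()):
            yi = y_flat[i].item()               # original sample y_i
            y_flat[i] = -yi                     # flip to y_{↓i}
            flipped = (s - g_fn(y)) ** 2        # l(s, g(y_{↓i}))
            grad[i] = yi * (base - flipped)     # Eq. (21)
            y_flat[i] = yi                      # restore
    # d/dp_i of (p * grad).sum() equals grad_i, so backward() propagates the
    # sampled estimate through F_Z(a) = sigmoid(a) and into the network.
    return (p * grad.view_as(p)).sum()
```

For CPU tensors, the `object_size` sketch from Sec. 2.1.1 can serve as `g_fn`; averaging several such estimates gives the multi-sample variant (𝑛 > 1) studied in Sec. 4.1.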
4. Experiments

The proposed method was implemented in the PyTorch Lightning framework² using a ResNet implementation from the Segmentation Models PyTorch library³. The presented experiments were performed on a server equipped with an Intel Xeon Silver 4214R CPU (2.40 GHz) and an NVIDIA GeForce RTX 2080 Ti GPU.

² https://github.com/Lightning-AI/lightning
³ https://github.com/qubvel/segmentation_models.pytorch

The data for our experiments were based on a dataset of 3D MRI images of the hippocampus [13]. The dataset consists of 394 volumes provided with GT segmentations of the classes hippocampus head, hippocampus body, and background. We decomposed the volumes into individual 2D slices of size 48 × 32 pixels and kept only those with at least 1% foreground, obtaining a total of 6093 images. Next, we merged the two hippocampus classes to get a binary segmentation problem (see Fig. 4). Afterward, we derived the object sizes from the GT pixel-wise annotations to use in training. Finally, we randomly split the data into training, validation, and testing subsets containing 70%, 10%, and 20% of the images, respectively.

Figure 4: Example of a hippocampus image [13] with the object outlined in red.

Given a GT segmentation 𝑦 and a predicted segmentation 𝑦ˆ, we evaluate two metrics, the squared size prediction error 𝐸 and the intersection-over-union 𝐼𝑜𝑈:

  $E(y, \hat{y}) = l(g(y), g(\hat{y}))$,  (22)

  $IoU(y, \hat{y}) = \frac{\sum_{i=1}^{V} (1 + y_i + \hat{y}_i + y_i \hat{y}_i)}{\sum_{i=1}^{V} (3 + y_i + \hat{y}_i - y_i \hat{y}_i)}$.  (23)

In the case of the standard supervised method, vertical and horizontal flipping was randomly applied to augment the training dataset. The proposed method did not apply any augmentation.
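Both metrics are straightforward to evaluate on {±1}-valued masks. In Eq. (23), the numerator term 1 + 𝑦𝑖 + 𝑦ˆ𝑖 + 𝑦𝑖𝑦ˆ𝑖 equals 4 exactly where both masks are +1 (the intersection) and 0 elsewhere, while the denominator term 3 + 𝑦𝑖 + 𝑦ˆ𝑖 − 𝑦𝑖𝑦ˆ𝑖 equals 4 exactly where either mask is +1 (the union), so the ratio reduces to the usual |intersection| / |union|. A small sketch (function names ours, `g_fn` as before):

```python
import numpy as np

def size_error(y: np.ndarray, y_hat: np.ndarray, g_fn) -> float:
    """Squared size prediction error E of Eq. (22)."""
    return float((g_fn(y) - g_fn(y_hat)) ** 2)

def iou(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Intersection-over-union of Eq. (23) for {+1, -1}-valued masks."""
    intersection = (1 + y + y_hat + y * y_hat).sum()  # 4 * |y ∩ ŷ|
    union = (3 + y + y_hat - y * y_hat).sum()         # 4 * |y ∪ ŷ|
    return float(intersection / union)
```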
4.1. Number of derivative samples

A toy example (see Fig. 3) indicated that taking more samples of the derivatives (21) might lead to better results than taking just one. This experiment investigates how the number of derivative samples 𝑛 impacts learning speed and prediction quality.

We considered four different numbers of samples, 𝑛 ∈ {1, 2, 4, 8}. For each 𝑛, the other parameters (such as the batch size or the learning rate) were the same, and the learning began with the same segmentation network 𝑓𝜃 that was pre-trained in the standard way on 85 pixel-wise annotated images from the training subset. The proposed method always ran until the squared error 𝐸 on the validation data stopped improving.

To assess the learning speed, we measured the duration of one learning epoch. For 𝑛 = 1, an epoch took ≈ 10× longer than in standard supervised learning. Generally, the duration grew roughly exponentially with 𝑛 (see Fig. 5).

Figure 5: Average epoch duration (in seconds) for the proposed method with different numbers of gradient samples 𝑛. The duration of the standard method is given as a reference.

Higher values of 𝑛 did not lead to a lower 𝐸 or faster convergence (see Fig. 6). In fact, 𝑛 = 1 and 𝑛 = 2 achieved the lowest 𝐸, but not by a large margin. Given the speed benefits, we always use 𝑛 = 1. Interestingly, even though 𝐸 kept decreasing over the course of learning for all 𝑛, 𝐼𝑜𝑈 improved only slightly and started declining after ≈ 20 epochs. This observation suggests that the squared error of the object size is not a sufficient objective for learning the segmentation.

Figure 6: Development of the squared size prediction error 𝐸 and the intersection-over-union 𝐼𝑜𝑈 on the validation images over the course of learning for different numbers of derivative samples 𝑛.

4.2. Pre-training impact

This experiment tests the essential question: given a segmentation model trained on a few pixel-wise annotated images, can we improve its testing performance by further learning from size annotations?

We trained different segmentation networks until convergence on randomly selected training subsets of size 𝑚. Then, we fine-tuned these networks on the whole training dataset using the proposed method. We measured the test performance in terms of 𝐼𝑜𝑈.

The proposed method led to a ≈ 5% increase in 𝐼𝑜𝑈 for small 𝑚 < 100 (see Fig. 7), improving the segmentation quality. For higher 𝑚, the effect was negligible, which complements the observation from the previous experiment that improving the size estimate does not necessarily improve the segmentation quality.

Figure 7: 𝐼𝑜𝑈 on the test data for different sizes 𝑚 of the pre-training dataset. The plot shows results achieved by a network after pre-training and after subsequent fine-tuning by the proposed method.

5. Discussion

The method is promising, but there is definitely potential for improvement in both speed and prediction performance.

The proposed method samples the derivatives according to (21) for each pixel 𝑖. However, flipping the prediction, 𝑦𝑖 ↦ −𝑦𝑖, changes the derived size only for some 𝑖, particularly those within and on the border of the predicted object. Therefore, given a sample 𝑦, 𝑙(𝑠, 𝑔(𝑦)) = 𝑙(𝑠, 𝑔(𝑦↓𝑖)) for many pixels 𝑖, and the sampled derivatives (21) are sparse. The method might sample only those derivatives that are potentially non-zero and set the rest to zero directly, which would save much computational time (one possible realization is sketched at the end of this section).

We have seen in the experiments that a lower size prediction error does not strictly imply a better segmentation. We need to investigate closely in which cases the size prediction loss is insufficient and adjust the objective. The adjustment might involve adding an L1 regularization (as in [4]) or drawing inspiration from unsupervised methods (e.g., demanding that the segmentation respect edges in the image, etc.).

The proposed approach entails some principled limitations. For example, it allows only a single object in an image. We also expect the method to be ill-suited for complex object shapes, but we have not performed any experiments in that regard yet.
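As a sketch of the speed-up suggested above (our assumption, not evaluated in the paper): for the Chebyshev-distance size derivation, a flipped pixel can only change 𝑔(𝑦) if it is a predicted foreground pixel (removing it may lower the maximum distance) or a background pixel adjacent to the object (turning it to foreground may raise the distances). Both sets are covered by one dilation of the predicted foreground:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def candidate_pixels(mask: np.ndarray) -> np.ndarray:
    """Boolean map of pixels whose flip may change the derived size g(y).

    Covers the predicted foreground plus its outer one-pixel border;
    the derivatives (21) at the remaining pixels can be set to zero
    directly. A flip far from the object creates an isolated foreground
    pixel of size 2, which cannot raise the maximum while an object
    (size >= 2) is already present.
    """
    foreground = mask > 0
    # 3x3 structuring element = Chebyshev ball of radius 1.
    return binary_dilation(foreground, structure=np.ones((3, 3), dtype=bool))
```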
6. Conclusion

We proposed a weakly-supervised method for training a segmentation network from a few pixel-wise annotated images and many images annotated by the object size. The key ingredients are a method for evaluating the object size from a probabilistic segmentation and a method for optimizing a deep network using a non-differentiable objective.

The achieved results seem promising. We believe the improvements suggested in the discussion will further improve performance, rendering the method valuable for training segmentation models for biomedical images.

Acknowledgments

The authors acknowledge the support of the OP VVV funded project "CZ.02.1.01/0.0/0.0/16_019/0000765 Research Center for Informatics", the Czech Science Foundation project 20-08452S, and the Grant Agency of the CTU in Prague, grant No. SGS20/170/OHK3/3T/13.

References

[1] X. Liu, et al., A review of deep-learning-based medical image segmentation methods, Sustainability 13 (2021) 1224.
[2] S. Minaee, et al., Image segmentation using deep learning: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[3] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[4] C. Cano-Espinosa, et al., Biomarker localization from deep learning regression networks, IEEE Transactions on Medical Imaging 39 (2020) 2121–2132.
[5] M. Pérez-Pelegrí, et al., Automatic left ventricle volume calculation with explainability through a deep learning weak-supervision methodology, Computer Methods and Programs in Biomedicine 208 (2021) 106275.
[6] C. Karam, K. Sugimoto, K. Hirakawa, Fast convolutional distance transform, IEEE Signal Processing Letters 26 (2019) 853–857.
[7] D. D. Pham, G. Dovletov, J. Pauli, A differentiable convolutional distance transform layer for improved image segmentation, in: DAGM German Conference on Pattern Recognition, Springer, 2020, pp. 432–444.
[8] T. Raiko, M. Berglund, G. Alain, L. Dinh, Techniques for learning binary stochastic feedforward neural networks, in: 3rd International Conference on Learning Representations, 2015.
[9] M. Yin, M. Zhou, ARM: Augment-REINFORCE-merge gradient for stochastic binary networks, in: 7th International Conference on Learning Representations, 2019.
[10] A. Shekhovtsov, V. Yanush, B. Flach, Path sample-analytic gradient estimators for stochastic binary networks, Advances in Neural Information Processing Systems 33 (2020) 12884–12894.
[11] Y. Cong, M. Zhao, K. Bai, L. Carin, GO gradient for expectation-based objectives, in: 7th International Conference on Learning Representations, 2019.
[12] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[13] A. L. Simpson, et al., A large annotated medical image dataset for the development and evaluation of segmentation algorithms, 2019. arXiv:1902.09063.