Bladder Semantic Segmentation*

Vadim Chernyshev1[0000-0002-8713-9026], Alexander Gromov1,3[0000-0001-9818-3770], Anton Konushin1,2[0000-0002-6152-0021], and Anna Mesheryakova3[0000-0002-2409-0018]

1 Lomonosov Moscow State University, Moscow, Russia
{vadim.chernyshev, alexander.gromov, anton.konushin}@graphics.cs.msu.ru
2 NRU Higher School of Economics, Moscow, Russia
3 Third Opinion Platform LLC, Moscow, Russia
{alexander.gromov, ceo}@3opinion.ai

* Publication is supported by RFBR grant № 19-07-00844.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Obtaining information about the shape and volume of the bladder plays a significant role in determining the pathologies of this organ. To collect the relevant data, the first step is to separate the bladder from the background in the ultrasound image. This article is devoted to automating this process with an algorithm based on the Unet architecture with an ImageNet-pretrained encoder (ResNet50). The article gives a comparative analysis of several methods known in the literature that improve the accuracy of the proposed algorithm. In a series of experiments, the quality of the basic architecture was improved by more than 4 percentage points on the PR AUC metric (from 84.49% to 89.62%) with the help of automatic annotation of previously unannotated data. In addition, two further results show the practical effectiveness of using data from another medical task (which raised the accuracy to 88.50%) and of using the time-dependent sequence of frames inside the video (which raised the quality to 88.19%).

Keywords: Semantic segmentation · Pseudo Labeling · Bladder ultrasound · 3D convolution · Time dependency.

1 Introduction

The bladder is an organ that performs a very important function in the human body. Its walls are elastic and can stretch or contract depending on certain factors. As a result, parameters such as the shape and volume of the organ change. The analysis of these parameters plays a key role in determining pathologies of the bladder. For this analysis, most clinics use a transabdominal ultrasound image, which shows the entire organ and the surrounding anatomy. To collect information about the volume and the shape of the bladder, the image of the organ itself must be separated from the background. Only physicians with the appropriate qualifications can do this job. However, the task can be solved by automatic semantic segmentation methods, which would significantly reduce the physicians' workload and let them devote more time to treating patients. At the same time, an automatic system may reduce the risk of human error caused by fatigue and monotonous work.

The most promising approach to semantic segmentation of medical images is deep convolutional neural networks. The best results are currently obtained with algorithms based on the Unet [1] architecture, which was originally developed for this purpose. The main limitation of such models (especially in the medical field) is that they require very large amounts of reliable training data: accurate, tightly annotated images whose creation requires substantial human labor and expertise.
The purpose of this work was to review and compare methods that can potentially improve the accuracy of the classical Unet network. We collected 400 ultrasound videos of the bladder, each lasting 10 seconds at 10 fps. Every fifth frame of these videos was extracted and annotated by specialists (physicians). The studied methods are primarily aimed at making smart use of the provided data:

─ There is a connection between the annotated frames: they are linked in time, so it makes sense to exploit this dependency. A series of experiments with volumetric convolutions addresses it.
─ Unannotated frames also have a useful property: they lie between the annotated frames of the video. They are therefore very similar to the annotated images, being only fractions of a second away from them. This means that even an overfitted network should be able to annotate them very well. As a result, we obtain many new, maximally realistic annotated frames (they are fragments of a real ultrasound video), and the annotation error is close to the error that might occur if a physician annotated them manually.
─ What should we do if there are no redundant data? It has been shown experimentally that encoder pretraining on ImageNet has a positive effect on the final accuracy and on the speed of network convergence. In the current work, we considered pretraining the architecture on a dataset from a similar medical problem. We also studied the effect of applying these data directly during network training.

In addition to the main ideas, experiments on the choice of augmentations, input resolution, optimizer, and learning-rate schedule were conducted.

2 Preparation for experiments

2.1 Prepared data

The reference collection used in this work was provided by Third Opinion Platform [9]. It contains 5270 annotated ultrasound images of the bladder taken from 400 videos. An example of a bladder image and its mask from the overall sample is shown in Fig. 1. The annotation contains not only the bladder images and their masks but also information about the videos from which those images were taken, the position of each frame inside its video, and the presence or absence of the bladder. Moreover, we have the ultrasound videos themselves, containing about 18000 unannotated frames. For further experiments, all videos and their corresponding frames were randomly allocated either to the training sample or to the test sample. It is important to emphasize that all frames from the same video belong to only one of the samples; otherwise the purity of the experiment would be violated owing to a false quality improvement caused by testing on images that are similar to the training ones.

Fig. 1. On the left is the ultrasound image; on the right is the bladder mask.

2.2 Problem statement

The main purpose is to obtain the mask of the bladder. In practice, our algorithm should produce the mask directly while the physician is working. Since the physician records a whole ultrasound video, we have the opportunity to feed the algorithm not only the frame that needs to be annotated (Fig. 2a) but also a whole series of pictures taken from the video. So, we can predict masks in both settings: masks for all frames (Fig. 2b) or a mask only for the central one (Fig. 2c).
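To make the last input mode (Fig. 2c) concrete, the following is a minimal sketch of how a symmetric window of neighboring frames around a target frame could be assembled from a video. The function name, the frame array layout, and the clamping of indices at the video boundaries are our own illustrative assumptions rather than details specified in this work.

```python
import numpy as np

def central_frame_window(frames: np.ndarray, center: int,
                         num_neighbors: int = 2, step: int = 1) -> np.ndarray:
    """Stack a symmetric window of frames around `center`.

    frames: array of shape (T, H, W) holding all frames of one video.
    Returns an array of shape (2 * num_neighbors + 1, H, W); indices that fall
    outside the video are clamped to the first/last frame (an assumption,
    not something prescribed in the paper).
    """
    offsets = np.arange(-num_neighbors, num_neighbors + 1) * step
    indices = np.clip(center + offsets, 0, len(frames) - 1)
    return frames[indices]

# Example: a 100-frame video of 256x256 images, a window of 5 frames with step 4.
video = np.random.rand(100, 256, 256).astype(np.float32)
window = central_frame_window(video, center=50, num_neighbors=2, step=4)
print(window.shape)  # (5, 256, 256)
```

The same windowing with a non-unit step reappears in Section 3.4, where the step size is one of the studied parameters.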
However, using more than one frame as input reduces the scope of the algorithm. For example, if only a single image needs to be annotated, an algorithm that requires many frames as input will not be able to handle the task. Nevertheless, we consider all the methods mentioned above, since the main aim of this work is to obtain an increase in accuracy. All experiments were conducted on a single GeForce GTX 1080 Ti video card.

Fig. 2. a) The algorithm accepts one ultrasound image as input and outputs a mask of this image. b) The algorithm accepts a series of images taken from the entire video at a fixed frequency as input and outputs their masks. c) The algorithm accepts a frame and several surrounding images taken from the video as input and outputs a mask for only the central image.

2.3 Metrics

First, we describe the metrics used to evaluate the algorithm's performance. One of the most well-known metrics for binary semantic segmentation is IoU. It is based on counting correctly classified positive pixels (TP), false-positive pixels (FP), and false-negative pixels (FN):

\[ \mathrm{IoU} = \frac{TP}{TP + FP + FN}. \qquad (1) \]

Alternatively, the F-score, the harmonic mean of pixel-wise recall and precision, can be used:

\[ \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad (2) \]
\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad (3) \]
\[ \mathrm{F\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \qquad (4) \]

However, all these metrics depend significantly on the threshold at which a pixel is declared to belong to the bladder after the mask leaves the neural network. To avoid binding the choice of the best solution to this parameter, we rely on a generalization over all thresholds: the area under the Precision-Recall curve (PR AUC, Fig. 3).

Fig. 3. Precision-Recall curve.

3 Related work

3.1 Basic architecture

During neural network training we usually change the network architecture itself or the data-feeding strategy, but we can also vary many other parameters, such as the optimizer, the learning rate, the input resolution, and others. Enumerating all possible combinations of these variables in every experiment would take a tremendous amount of time and would not yield a significant effect. To avoid this, the basic characteristics were selected experimentally once and used in all further experiments. Thus, the baseline is a Unet-like architecture with a pretrained ResNet50 [8] encoder, the NovoGrad [2] optimizer (sometimes AdamW [3] was used instead), a cosine learning-rate schedule [4], and an input resolution of 128×128 or 256×256 pixels.
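For reference, a minimal sketch of such a baseline is given below, assuming the Segmentation Models Pytorch package [9] for the Unet with an ImageNet-pretrained ResNet50 encoder. AdamW with a cosine schedule is shown here only because NovoGrad is not part of core PyTorch; the loss function, learning rate, batch size, and schedule length are illustrative assumptions, not values reported in this work.

```python
import torch
import segmentation_models_pytorch as smp

# Unet with an ImageNet-pretrained ResNet50 encoder; one output channel
# (bladder probability map) for single-channel ultrasound input.
model = smp.Unet(
    encoder_name="resnet50",
    encoder_weights="imagenet",
    in_channels=1,
    classes=1,
)

# AdamW + cosine learning-rate schedule as stand-ins for the NovoGrad setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = torch.nn.BCEWithLogitsLoss()

# One illustrative training step on a dummy 256x256 batch.
images = torch.randn(4, 1, 256, 256)
masks = torch.randint(0, 2, (4, 1, 256, 256)).float()

optimizer.zero_grad()
logits = model(images)
loss = criterion(logits, masks)
loss.backward()
optimizer.step()
scheduler.step()
```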
3.2 Using Pseudo labels

One of the main problems in training neural networks is the difficulty of obtaining a training sample that covers the entire range of situations that may occur when the network is used in practice. We can significantly expand our sample by adding the almost 18 thousand frames that can be parsed from the videos. However, they are not annotated, and their manual annotation is extremely difficult (it takes a lot of time and physicians must be involved).

At first glance, annotation using our own network looks questionable: the accuracy of our best basic architecture reaches 85.68% PR AUC (84.37% F-score). It is logical to assume that using data annotated with an accuracy of about 84% will not raise the accuracy significantly above this number. However, the data we want to annotate are taken only from training videos (using data from test videos could undeservedly improve the result on the test sample), and these new frames are very similar to the surrounding training frames. Indeed, each new frame is separated from an annotated one by no more than 0.1 seconds (see Fig. 4), and the shape of the bladder changes smoothly over time (just like the position of any other physical body).

Fig. 4. Six consecutive frames from a single ultrasound video. Two frames are annotated (located at the edges); the central ones are to be annotated.

As a result, our network should annotate new frames with an accuracy close to its accuracy on the training sample, which equals 95.04%. In total, when we annotate new data with our network, the accuracy of their annotation lies in the range [84.37, 95.04] percent and most likely tends toward the right end of this range. Hence their use in further training should help raise the accuracy of the algorithm to a value within the presented range.

Having annotated about 18000 images with the best network, we used them while training the same basic architecture, adding them to each batch of data in fixed proportions. The result is shown in Table 1. The table shows that the accuracy increases for any percentage of new data, up to very large values of 50 percent. This does not support the theory that the data were annotated with an accuracy close to 95%, but in any case it confirms the effectiveness of using pseudo labels [10].

Table 1. Using pseudo labels. Comparison of the accuracy of trained neural networks depending on the percentage of pictures in each batch occupied by the obtained pseudo labels.

Pseudo labels (%)   PR AuC   Best F-score   F-score   IoU
 0                  0.8568   0.8513         0.8419    0.7364
10                  0.8789   0.8665         0.8481    0.7432
20                  0.8845   0.8710         0.8504    0.7466
25                  0.8871   0.8732         0.8468    0.7431
30                  0.8824   0.8693         0.8458    0.7398
35                  0.8871   0.8729         0.8539    0.7513
40                  0.8851   0.8717         0.8513    0.7482
45                  0.8837   0.8713         0.8468    0.7410
50                  0.8798   0.8680         0.8471    0.7418
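A possible implementation of the batch-mixing scheme evaluated in Table 1 is sketched below. The binarization threshold of 0.5 for the predicted masks and the way the pseudo-labeled pool is stored are our own assumptions; the fraction of 0.35 corresponds to one of the best rows of Table 1.

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(model, unlabeled_frames, device="cpu"):
    """Run the trained baseline on unannotated frames from the training videos
    and binarize its predictions into pseudo masks (0.5 is an assumed threshold)."""
    model.eval().to(device)
    pairs = []
    for frame in unlabeled_frames:                       # frame: (1, H, W) tensor
        prob = torch.sigmoid(model(frame.unsqueeze(0).to(device)))[0]
        pairs.append((frame, (prob > 0.5).float().cpu()))
    return pairs

def mixed_batch(labeled_pairs, pseudo_pairs, batch_size=16, pseudo_fraction=0.35):
    """Assemble one training batch in which a fixed fraction of the samples
    comes from the pseudo-labeled pool (35% was among the best values in Table 1)."""
    n_pseudo = int(batch_size * pseudo_fraction)
    idx_real = torch.randint(len(labeled_pairs), (batch_size - n_pseudo,))
    idx_fake = torch.randint(len(pseudo_pairs), (n_pseudo,))
    chosen = [labeled_pairs[int(i)] for i in idx_real] + \
             [pseudo_pairs[int(i)] for i in idx_fake]
    images = torch.stack([img for img, _ in chosen])
    masks = torch.stack([msk for _, msk in chosen])
    return images, masks
```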
3.3 Using data from another medical task

There are two main reasons to use data from another medical task. The first is the positive experience of using ImageNet-pretrained classifiers as encoders in the Unet network; we would like to strengthen this effect by pretraining our algorithm on the closest possible medical task. The second is a potential solution to the problem of overfitting. During training, the accuracy of our network on the training sample reaches 95 percent and continues to grow, while the accuracy on the control sample begins to decrease over time. The reason is memorization of the training data, that is, overfitting. We suggest that diluting each batch during training with data from a similar medical task may slightly weaken this effect. The heuristic behind this is as follows: every batch the network receives contains data that differ in meaning from a bladder ultrasound. If we assume that the network is overfitted so much that it starts to annotate poorly even slightly different frames from the test sample of bladder ultrasound, then it should annotate the pictures of the other task even worse, and it will be penalized with a large loss. Otherwise, a similar medical task (for example, abdominal ultrasound scans) can help the network discover new useful patterns; these patterns can increase the final accuracy without the heavy burden of learning something completely new. For example, training the neural network on two completely different tasks, such as bladder segmentation in ultrasound images and brain tumor segmentation in MRI, is unlikely to be effective, since the network would have to learn features that are practically unrelated to each other.

An open dataset from the task of finding the circumference of the fetal head in abdominal ultrasound images was taken for further experiments [5][6]. This set contains 1000 training frames in which the circumference of the fetal head is marked. We brought the new data closer to our task by manually re-marking the entire head area (Fig. 5).

Fig. 5. a) An ultrasound image of the abdominal cavity taken from the new dataset; b) the corresponding mask; c) the re-marked mask (this modification makes the new data similar to ours).

This set was chosen because it is both visually as close as possible to our data and as similar as possible in the medical sense (abdominal area, ultrasound). The results of the series of experiments are shown in Table 2. It should be noted that the best results were obtained by combining two approaches: pretraining on the new data and using it during training.

Table 2. New dataset. Comparison of algorithms obtained through various uses of the new dataset. The last line also shows an algorithm trained using pseudo labels.

Experiment                  PR AuC   Best F-score   F-score   IoU
-                           0.8568   0.8513         0.8419    0.7363
pretraining on new data     0.8613   0.8611         0.8357    0.7275
diluting by new data        0.8811   0.8695         0.8453    0.7398
pretraining + diluting      0.8850   0.8702         0.8474    0.7475
diluting by pseudo labels   0.8871   0.8729         0.8539    0.7513

3.4 Using a time dependency

In medicine it is important not only to annotate a two-dimensional image but also to construct a volumetric segmentation map (for example, BraTS brain tumor segmentation). One of the best solutions for such tasks is methods based on the 3D Unet architecture [7]. The main difference between 3D Unet and the classical Unet is the replacement of two-dimensional convolutions with volumetric ones (Fig. 6). These convolutions make it possible to capture spatial dependencies in all three directions: width, height, and depth. However, the last of these spatial directions can be replaced by a temporal one.

We have videos in which all annotated frames are separated from each other by fixed time intervals. We tried to use a tensor consisting of all annotated frames of a video arranged in a row as the input to the 3D Unet network, and obtained a tensor of the same dimension containing masks of all submitted frames as the output (Fig. 2b). The results of the experiment are presented in Table 3, experiment 1. It should be noted that the lack of quality improvement could be due to the following factors:

─ a 20-fold reduction in the number of input data units:
now there is only one tensor containing all 20 images that were previously used as separate inputs;
─ the lack of pretraining and the use of the classic Unet architecture, whereas in the 2D experiments a pretrained ResNet50 was used as the encoder.

We slightly changed the training strategy to avoid these negative aspects: instead of submitting all annotated frames from a single video, we submit a single image together with a certain number of frames preceding it in the video and the same number of frames following it. Now we need to predict the mask only for the main frame (see Fig. 2c), so the surrounding frames may be unannotated, and the number of input data units is the same as before. The results are presented in Table 3, experiment 2. As can be seen, the final quality improved.

Table 3. 3D convolutions. Comparison of three experiments based on the idea of three-dimensional convolution (multiple frames are fed to the networks). "Step" means the distance within the video between adjacent selected frames.

Experiment   Step   PR AuC   Best F-score   F-score   IoU
1            -      0.7170   0.7152         0.7072    0.5546
2            1      0.8239   0.8239         0.7847    0.6765
2            2      0.8394   0.8450         0.7774    0.6735
2            4      0.8501   0.8546         0.7917    0.6835
2            8      0.8311   0.8217         0.7817    0.6817
3            4      0.8819   0.8632         0.8397    0.7320

However, the second problem remained: we still lacked pretraining. To deal with it, we had to abandon the 3D Unet and use a 2D Unet with the ResNet50 encoder. To capture the time dependence, we added a 3D base, which consists of a certain number of 3D convolutions and processes the original tensor, making it two-dimensional; the processed tensor is then fed to the 2D Unet. The results are presented in Table 3, experiment 3. This modification gave an even greater increase in quality and showed that exploiting the time dependence can be useful in similar tasks.
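The exact configuration of this 3D base is not specified here, so the sketch below only illustrates the idea under our own assumptions: a small stack of volumetric convolutions reduces a (B, 1, T, H, W) window of frames to a single-channel 2D map (here by averaging over the temporal axis), which is then passed to a 2D Unet with a pretrained ResNet50 encoder, as in experiment 3 of Table 3.

```python
import torch
import torch.nn as nn
import segmentation_models_pytorch as smp

class TemporalBase(nn.Module):
    """Assumed '3D base': a few volumetric convolutions followed by a reduction
    over the temporal axis, turning (B, 1, T, H, W) into (B, 1, H, W)."""
    def __init__(self, hidden_channels: int = 8):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden_channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv3d(x)           # (B, 1, T, H, W)
        return x.mean(dim=2)         # collapse the temporal axis -> (B, 1, H, W)

class TemporalUnet(nn.Module):
    """3D base followed by a 2D Unet with a pretrained ResNet50 encoder."""
    def __init__(self):
        super().__init__()
        self.base = TemporalBase()
        self.unet = smp.Unet("resnet50", encoder_weights="imagenet",
                             in_channels=1, classes=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.unet(self.base(x))

# Example: a window of 5 frames around the central frame, 256x256 pixels.
model = TemporalUnet()
frames = torch.randn(2, 1, 5, 256, 256)
print(model(frames).shape)  # torch.Size([2, 1, 256, 256])
```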
4 Analysis of the obtained algorithms

We have reviewed several methods that can improve the accuracy of the semantic segmentation of the bladder. Now we would like to show the most advantageous combinations of these methods and the visual difference in their work. The best accuracy was achieved by using pretraining on another medical task in conjunction with the use of pseudo labels during training ("Best our 2D network", Table 4, line 1). Another experiment also deserves attention: it uses a series of 3D convolutions whose output is eventually fed into a 2D Unet ("Best our 3D network", Table 4, line 2). This approach gives lower accuracy on the test sample (Fig. 6a); however, it has some advantages. For example, the network using a single frame as input ("Best our 2D network") often mislabels some very complex frames that contain several shaded areas, whereas the 2D Unet with a 3D base ("Best our 3D network"), which also analyzes adjacent frames, marks them more correctly (Fig. 6b).

Fig. 6. Red means false-positive pixels, blue false-negative, green true-positive, black true-negative. a) The 2D Unet generally performs slightly better than the 3D network. b) Samples that are complex for the 2D Unet but acceptable for the 3D network.

Table 4. Best our 2D network: encoder - ResNet50, optimizer - NovoGrad, augmentation - random rotation (10 degrees), cosine learning rate, pretraining on a similar task, 35% pseudo data. Best our 3D network: encoder - ResNet50, optimizer - NovoGrad, augmentation - random rotation (10 degrees), cosine learning rate.

Experiment             PR AuC   Best F-score   F-score   IoU
Best our 2D network    0.8962   0.8747         0.8654    0.7651
Best our 3D network    0.8819   0.8632         0.8397    0.7320

5 Conclusion

To sum up, in this work dedicated to bladder semantic segmentation we carried out a comparative analysis of well-known methods from the literature that improve the accuracy of the classical Unet network. Pseudo labels for unlabeled frames of the videos were generated using a baseline trained on annotated frames from the same videos; their further use during training of the same model was found to provide a significant quality increase of more than 4 percentage points. Another important conclusion is not only the potential usefulness of pretraining on data from a similar medical task, but also the improvement of bladder segmentation quality achieved by adding these data directly to training.

References

1. Olaf Ronneberger, Philipp Fischer, Thomas Brox: U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015.
2. Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Ryan Leary, Vitaly Lavrukhin, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen: Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments, 2020.
3. Ilya Loshchilov, Frank Hutter: Decoupled Weight Decay Regularization, 2019.
4. Leslie N. Smith: Cyclical Learning Rates for Training Neural Networks, 2017.
5. Thomas L. A. van den Heuvel, Dagmar de Bruijn, Chris L. de Korte, Bram van Ginneken: Automated Measurement of Fetal Head Circumference Using 2D Ultrasound Images, 2018.
6. Dong-Hyun Lee: Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks, 2013.
7. Thomas L. A. van den Heuvel, Dagmar de Bruijn, Chris L. de Korte, Bram van Ginneken: Automated Measurement of Fetal Head Circumference Using 2D Ultrasound Images [Data set], 2018.
8. Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, Olaf Ronneberger: 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation, 2016.
9. Pavel Yakubovskiy: Segmentation Models Pytorch, 2020.
10. "Third Opinion Platform" Limited Liability Company. URL: https://thirdopinion.ai/