Synthetic Data for Unsupervised Polyp Segmentation⋆

Enric Moreu1,2[0000−0003−1336−6477], Kevin McGuinness1,2[0000−0003−1336−6477], and Noel E. O’Connor1,2[0000−0002−4033−9135]

1 Insight SFI Centre for Data Analytics, Ireland
2 Dublin City University, Ireland

Abstract. Deep learning has shown excellent performance in analysing medical images. However, datasets are difficult to obtain due to privacy issues, standardization problems, and a lack of annotations. We address these problems by producing realistic synthetic images using a combination of 3D technologies and generative adversarial networks. Our pipeline uses zero annotations from medical professionals. Our fully unsupervised method achieves promising results on five real polyp segmentation datasets. As part of this study we release Synth-Colon, an entirely synthetic dataset that includes 20 000 realistic colon images and additional details about depth and 3D geometry: https://enric1994.github.io/synth-colon

Keywords: Computer Vision · Synthetic Data · Polyp Segmentation · Unsupervised Learning

⋆ This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 765140. This publication has emanated from research supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289_P2, co-funded by the European Regional Development Fund. Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Colorectal cancer is one of the most commonly diagnosed cancer types. It can be treated with an early intervention, which consists of detecting and removing polyps in the colon. The accuracy of the procedure strongly depends on the medical professional’s experience and hand-eye coordination during the examination, which can last up to 60 minutes. Computer vision can provide real-time support for doctors to ensure a reliable examination by double-checking all the tissues during the colonoscopy.

The data obtained during a colonoscopy comes with a set of issues that prevent creating datasets for computer vision applications. First, there are privacy issues: the recordings are considered personal data that cannot be used without the consent of the patients. Second, a wide range of cameras and lights are used to perform colonoscopies; every device has its own focal length, aperture, and resolution, and there are no large datasets with standardized parameters. Finally, polyp segmentation datasets are expensive because they depend on the annotations of qualified professionals with limited available time.

Fig. 1. Synth-Colon dataset samples include: synthetic image, annotation, realistic image, depth map, and 3D mesh (from left to right).

We propose an unsupervised method to detect polyps that does not require annotations, combining 3D rendering and a CycleGAN [23]. First, we produce artificial colons and polyps based on a set of parameters; annotations of the location of the polyps are automatically generated by the 3D engine. Second, the synthetic images are used alongside real images to train a CycleGAN, which makes the synthetic images appear more realistic. Finally, we train a HarDNet-based model [3], a state-of-the-art polyp segmentation architecture, on the realistic synthetic data and our self-generated synthetic labels.
The contributions of this paper are as follows:

– To the best of our knowledge, we are the first to train a polyp segmentation model with zero annotations from the real world.
– We propose a pipeline that preserves the self-generated annotations when shifting the domain from synthetic to real.
– We release Synth-Colon (see Figure 1), the largest synthetic dataset for polyp segmentation, including additional data such as depth maps and 3D meshes.

The remainder of the paper is structured as follows: Section 2 reviews relevant work, Section 3 explains our method, Section 4 presents the Synth-Colon dataset, Section 5 describes our experiments, and Section 6 concludes the paper.

2 Related work

Here we briefly review relevant work on polyp segmentation and synthetic data.

2.1 Polyp segmentation

Early polyp segmentation was based on the texture and shape of the polyps. For example, Hwang et al. [8] used ellipse-fitting techniques based on shape. However, some colorectal polyps are small (5 mm) and are not detected by these techniques. In addition, their texture is easily confused with that of other tissues in the colon, as can be seen in Figure 2.

Fig. 2. Real samples from CVC-ColonDB with the corresponding annotations made by medical professionals indicating the location of cancerous polyps.

With the rise of convolutional neural networks (CNNs) [10], texture and shape features no longer had to be hand-crafted, and accuracy increased substantially. Several authors have applied deep convolutional networks to the polyp segmentation problem. Brandao et al. [2] proposed a fully convolutional neural network based on the VGG [16] architecture to identify and segment polyps. Unfortunately, the small datasets available and the large number of parameters make such large networks prone to overfitting. Zhou et al. [22] used an encoder-decoder network with dense skip pathways between layers that mitigates the vanishing gradient problem of VGG networks. They also significantly reduced the number of parameters, reducing the amount of overfitting. More recently, Chao et al. [3] reduced the number of shortcut connections in the network to speed up inference time, a critical issue when performing real-time colonoscopies at high resolution. They focused on reducing the memory traffic needed to access intermediate features, thereby reducing latency. Finally, Huang et al. [7] improved the performance and inference time by combining HarDNet [3] with a cascaded partial decoder [21] that discards larger-resolution features from shallower layers to reduce latency.

2.2 Synthetic data

The limitation of large neural networks is that they often require large amounts of annotated data. This problem is particularly acute in medical imaging due to privacy issues, standardization problems, and the lack of professional annotators. Table 1 shows the limited size and resolution of the datasets used to train and evaluate existing polyp segmentation models.

The lack of large datasets for polyp segmentation can be addressed by generating synthetic data. Thambawita et al. [18] used a generative adversarial network (GAN) to produce new colonoscopy images and annotations. They added a fourth channel to SinGAN [14] to generate annotations that are consistent with the colon image, and then used style transfer to improve the realism of the textures. Their results are excellent considering the small quantity of real images and professional annotations used.
Gao et al. [6] used a CycleGAN to translate colonoscopy images into polyp masks. In their work, the generator learns to segment polyps by trying to fool a discriminator.

Table 1. Size and resolution of the real polyp segmentation datasets.

Dataset                 #Images  Resolution
CVC-T [19]                  912   574 × 500
CVC-ClinicDB [1]            612   384 × 288
CVC-ColonDB [17]            380   574 × 500
ETIS-LaribPolypDB [15]      196  1225 × 966
Kvasir [9]                 1000    Variable

Synthetic images combined with generative networks have also been widely used in the depth prediction task [11,12]. This task helps doctors to verify that all the surfaces in the colon have been analyzed. Synthetic data is essential for this task because of the difficulty of obtaining depth information during a real colonoscopy.

Unlike previous works, our method is entirely unsupervised and does not require any human annotations. We automatically generate the annotations by defining the structure of the colon and polyps and transferring the location of the polyps to a 2D mask. The key difference between our approach and the rest of the state of the art is that we combine 3D rendering and generative networks: first, the 3D engine defines the structure of the image and generates the annotations; second, the adversarial network makes the images realistic. Similar unsupervised methods have been successfully applied in other domains such as crowd counting. For example, Wang et al. [20] render crowd images from a video game and then use a CycleGAN to increase their realism.

3 Method

Our approach is composed of three steps: first, we procedurally generate colon images and annotations using a 3D engine; second, we feed a CycleGAN with images from real colonoscopies and our synthetic images; finally, we use the realistic images created by the CycleGAN to train an image segmentation model.

3.1 3D colon generation

The 3D colon and polyps are procedurally generated using Blender, a 3D engine that can be automated via scripting. Our 3D colon structure is a cone composed of 2454 faces. Vertices are randomly displaced following a normal distribution in order to simulate the tissues in the colon. Additionally, the colon structure is modified by displacing 7 segments, as in Figure 3. For the textures we use a base color of [0.80, 0.13, 0.18] (RGB); for each sample we shift this color towards other tones of brown, orange, and pink. A single polyp is placed inside the colon in every image, either on the colon walls or in the middle. Polyps are distorted spheres with 16384 faces. Samples whose polyps occupy fewer than 20,000 pixels are removed.

Fig. 3. The structure of the colon is composed of 7 segments to simulate the curvature of the intestinal tract.

Fig. 4. Synthetic colons with corresponding annotations rendered using a 3D engine.

Lighting is composed of a white ambient light, two white dynamic lights that project glare onto the walls, and three negative lights that project black light at the end of the colon. We found that having a dark area at the end helps the CycleGAN understand the structure of the colon. The 3D scene must be similar to real colon images because otherwise the CycleGAN will not properly translate the images to the real-world domain. Figure 4 illustrates the images and ground truth generated by the 3D engine.
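To make the procedural setup concrete, the following is a minimal Blender (bpy) sketch of the colon geometry and tissue material described above. It is a simplified illustration, not the script used to build Synth-Colon: the cone resolution, noise scale, color-shift range, and object names are our own assumptions.

```python
import random
import bpy

def make_colon(noise_sigma=0.05):
    # Cone approximating the colon cavity. The real mesh has 2454 faces;
    # the vertex count here is an arbitrary illustrative choice.
    bpy.ops.mesh.primitive_cone_add(vertices=64, radius1=1.0, radius2=0.6, depth=10.0)
    colon = bpy.context.active_object

    # Displace every vertex with normally distributed offsets to
    # simulate the irregular surface of the colon tissue.
    for v in colon.data.vertices:
        v.co.x += random.gauss(0.0, noise_sigma)
        v.co.y += random.gauss(0.0, noise_sigma)
        v.co.z += random.gauss(0.0, noise_sigma)

    # Base tissue color [0.80, 0.13, 0.18] (RGB), shifted per sample
    # towards other tones of brown, orange, and pink.
    mat = bpy.data.materials.new(name="colon_tissue")
    mat.use_nodes = True
    bsdf = mat.node_tree.nodes["Principled BSDF"]
    shift = random.uniform(-0.05, 0.05)
    bsdf.inputs["Base Color"].default_value = (0.80 + shift, 0.13 + shift, 0.18 + shift, 1.0)
    colon.data.materials.append(mat)
    return colon

make_colon()
```

A script along these lines can be run headlessly with `blender --background --python generate_colon.py` (the script name is hypothetical), which is what makes rendering a 20 000-image dataset practical.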
3.2 CycleGAN

A standard CycleGAN, composed of two generators and two discriminators, is trained using real images from colonoscopies and synthetic images generated with the 3D engine, as depicted in Figure 6. We train the CycleGAN for 200 epochs and then run the synthetic images through the “Generator Synth to Real” model, producing realistic colon images. Figure 5 displays synthetic images before and after the CycleGAN domain adaptation. Note that the position of the polyps is not altered; hence, the ground truth information generated by the 3D engine is preserved.

Fig. 5. Synthetic images (first row) and realistic images generated by our CycleGAN (second row).

Fig. 6. Our CycleGAN architecture. We train two generator models that try to fool two discriminator models by changing the domain of the images.
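For readers unfamiliar with CycleGAN, the following is a minimal PyTorch sketch of one generator update under the standard objective: least-squares adversarial losses plus an L1 cycle-consistency term. The tiny placeholder networks and the cycle weight of 10 are illustrative assumptions, not the architecture or hyperparameters of our experiments.

```python
import torch
import torch.nn as nn

# Placeholder networks standing in for the CycleGAN generators and
# patch discriminators; only the loss structure matters here.
def tiny_net(out_channels):
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, out_channels, 3, padding=1))

G_s2r, G_r2s = tiny_net(3), tiny_net(3)     # synthetic -> real, real -> synthetic
D_real, D_synth = tiny_net(1), tiny_net(1)  # one discriminator per domain

mse, l1 = nn.MSELoss(), nn.L1Loss()
opt_g = torch.optim.Adam(list(G_s2r.parameters()) + list(G_r2s.parameters()), lr=2e-4)

def generator_step(synth, real, lambda_cyc=10.0):
    opt_g.zero_grad()
    fake_real, fake_synth = G_s2r(synth), G_r2s(real)
    # Adversarial terms: each generator tries to fool its discriminator.
    loss_gan = mse(D_real(fake_real), torch.ones_like(D_real(fake_real))) \
             + mse(D_synth(fake_synth), torch.ones_like(D_synth(fake_synth)))
    # Cycle-consistency terms: translating back must recover the input.
    # This constraint is what keeps the polyp positions, and hence the
    # self-generated masks, intact across the domain shift.
    loss_cyc = l1(G_r2s(fake_real), synth) + l1(G_s2r(fake_synth), real)
    loss = loss_gan + lambda_cyc * loss_cyc
    loss.backward()
    opt_g.step()
    return loss.item()

# Example call with random tensors standing in for image batches.
generator_step(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```

The discriminators are trained with a symmetric step (omitted here); after training, only the synthetic-to-real generator is kept to produce the realistic images used in Section 3.3.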
3.3 Polyp segmentation

After creating a synthetic dataset that has been adapted to the real colon textures, we train an image segmentation model. We use the HarDNet-MSEG [7] architecture because of its real-time performance and high accuracy, with the same hyperparameter configuration as in the original paper.

4 Synth-Colon

We publicly release Synth-Colon, a synthetic dataset for polyp segmentation. It is the first dataset generated using zero annotations from medical professionals. The dataset is composed of 20 000 images with a resolution of 500 × 500. Synth-Colon additionally includes realistic colon images generated with our CycleGAN and the Kvasir training set images. Synth-Colon can also be used for the colon depth estimation task [12] because we provide depth and 3D information for each image. Figure 1 shows some examples from the dataset. In summary, Synth-Colon includes:

– Synthetic images of the colon and one polyp.
– Masks indicating the location of the polyp.
– Realistic images of the colon and polyps, generated using CycleGAN and the Kvasir dataset.
– Depth images of the colon and polyp.
– 3D meshes of the colon and polyp in OBJ format.

5 Experiments

5.1 Metrics

We use two common metrics for evaluation: the mean Dice score, given by

mDice = \frac{2 \times tp}{2 \times tp + fp + fn},   (1)

and the mean intersection over union (IoU),

mIoU = \frac{tp}{tp + fp + fn},   (2)

where, in both formulae, tp is the number of true positives, fp the number of false positives, and fn the number of false negatives.
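As a sanity check on equations (1) and (2), the snippet below computes both scores for a single pair of binary masks with NumPy. The epsilon guarding against empty masks is our own assumption rather than part of any official evaluation code; the reported mDice and mIoU are these per-image scores averaged over a dataset.

```python
import numpy as np

def dice_iou(pred, gt, eps=1e-8):
    """Dice (eq. 1) and IoU (eq. 2) for a pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)   # true positives
    fp = np.sum(pred & ~gt)  # false positives
    fn = np.sum(~pred & gt)  # false negatives
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    return dice, iou

# Example: a prediction covering half of the ground-truth region.
gt = np.zeros((4, 4), dtype=bool); gt[:, :2] = True
pred = np.zeros((4, 4), dtype=bool); pred[:2, :2] = True
print(dice_iou(pred, gt))  # ~(0.667, 0.5)
```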
5.2 Evaluation on real polyp segmentation datasets

We evaluate our approach on five real polyp segmentation datasets. Table 2 shows the results obtained when training HarDNet-MSEG [7] using our synthetic data. Note that our method does not use any annotations. The results are satisfactory considering that the labels have been generated automatically. We found that training the CycleGAN with only the images from the target dataset performs better than training it with all the datasets combined, indicating a domain gap among the real-world datasets.

Table 2. Evaluation of our synthetic approach on real-world datasets. The metrics used are the mean Dice similarity index (mDice) and the mean intersection over union (mIoU).

                    CVC-T         ColonDB       ClinicDB      ETIS          Kvasir
                    mDice  mIoU   mDice  mIoU   mDice  mIoU   mDice  mIoU   mDice  mIoU
U-Net [13]          0.710  0.627  0.512  0.444  0.823  0.755  0.398  0.335  0.818  0.746
SFA [5]             0.467  0.329  0.469  0.347  0.700  0.607  0.297  0.217  0.723  0.611
PraNet [4]          0.871  0.797  0.709  0.640  0.899  0.849  0.628  0.567  0.898  0.840
HarDNet-MSEG [7]    0.887  0.821  0.731  0.660  0.932  0.882  0.677  0.613  0.912  0.857
Synth-Colon (ours)  0.703  0.635  0.521  0.452  0.551  0.475  0.257  0.214  0.759  0.527

5.3 Study with limited real data

In this section we evaluate how our approach, based on synthetic imagery and domain adaptation, compares with the fully supervised state-of-the-art HarDNet-MSEG network when fewer training examples are available. We train the CycleGAN used in the proposed approach, without ground truth segmentation labels, on progressively larger sets of imagery, and compare this with the supervised method trained on the same amount of labelled imagery. Table 3 shows the results of the experiment, which demonstrate that synthetic data is extremely useful for domains where annotations are very scarce. While our CycleGAN can produce realistic images from a sample of only five real images, supervised methods require many images and annotations to achieve good performance. Table 3 shows that our unsupervised approach is preferable when fewer than 50 real images and annotations are available. Note that zero images here means there is no domain adaptation via the CycleGAN.

Table 3. Evaluation of the proposed approach on the Kvasir dataset when few real images are available. Performance is measured using the mean Dice metric.

                  Synth-Colon (ours)  HarDNet-MSEG [7]
0 images                       0.356  -
5 images                       0.642  0.361
10 images                      0.681  0.512
25 images                      0.721  0.718
50 images                      0.735  0.781
900 (all) images               0.759  0.912

6 Conclusions

We successfully trained a polyp segmentation model without annotations from doctors. We used 3D rendering to generate the structure of the colon and generative adversarial networks to make the images realistic, and demonstrated that the resulting model performs reasonably well on several real datasets, even outperforming some fully supervised methods in certain cases. We hope this study helps to align synthetic data and medical imaging in the future. As future work, we will explore how to include our synthetic annotations in the CycleGAN.

References

1. Bernal, J., Sánchez, F.J., Fernández-Esparrach, G., Gil, D., Rodríguez, C., Vilariño, F.: WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics 43, 99–111 (2015)
2. Brandao, P., Mazomenos, E., Ciuti, G., Caliò, R., Bianchi, F., Menciassi, A., Dario, P., Koulaouzidis, A., Arezzo, A., Stoyanov, D.: Fully convolutional neural networks for polyp segmentation in colonoscopy. In: Medical Imaging 2017: Computer-Aided Diagnosis. vol. 10134, p. 101340F. International Society for Optics and Photonics (2017)
3. Chao, P., Kao, C.Y., Ruan, Y.S., Huang, C.H., Lin, Y.L.: HarDNet: A low memory traffic network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3552–3561 (2019)
4. Fan, D.P., Ji, G.P., Zhou, T., Chen, G., Fu, H., Shen, J., Shao, L.: PraNet: Parallel reverse attention network for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 263–273. Springer (2020)
5. Fang, Y., Chen, C., Yuan, Y., Tong, K.y.: Selective feature aggregation network with area-boundary constraints for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 302–310. Springer (2019)
6. Gao, H., Ogawara, K.: Adaptive data generation and bidirectional mapping for polyp images. In: 2020 IEEE Applied Imagery Pattern Recognition Workshop (AIPR). pp. 1–6. IEEE (2020)
7. Huang, C.H., Wu, H.Y., Lin, Y.L.: HarDNet-MSEG: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean Dice and 86 FPS. arXiv preprint arXiv:2101.07172 (2021)
8. Hwang, S., Oh, J., Tavanapong, W., Wong, J., De Groen, P.C.: Polyp detection in colonoscopy video using elliptical shape feature. In: 2007 IEEE International Conference on Image Processing. vol. 2, pp. II–465. IEEE (2007)
9. Jha, D., Smedsrud, P.H., Riegler, M.A., Halvorsen, P., de Lange, T., Johansen, D., Johansen, H.D.: Kvasir-SEG: A segmented polyp dataset. In: International Conference on Multimedia Modeling. pp. 451–462. Springer (2020)
10. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
11. Mahmood, F., Chen, R., Durr, N.J.: Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE Transactions on Medical Imaging 37(12), 2572–2581 (2018)
12. Rau, A., Edwards, P.E., Ahmad, O.F., Riordan, P., Janatka, M., Lovat, L.B., Stoyanov, D.: Implicit domain adaptation with conditional generative adversarial networks for depth prediction in endoscopy. International Journal of Computer Assisted Radiology and Surgery 14(7), 1167–1176 (2019)
13. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
14. Rott Shaham, T., Dekel, T., Michaeli, T.: SinGAN: Learning a generative model from a single natural image. In: IEEE International Conference on Computer Vision (ICCV) (2019)
15. Silva, J., Histace, A., Romain, O., Dray, X., Granado, B.: Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. International Journal of Computer Assisted Radiology and Surgery 9(2), 283–293 (2014)
16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
17. Tajbakhsh, N., Gurudu, S.R., Liang, J.: Automated polyp detection in colonoscopy videos using shape and context information. IEEE Transactions on Medical Imaging 35(2), 630–644 (2015)
18. Thambawita, V., Salehi, P., Sheshkal, S.A., Hicks, S.A., Hammer, H.L., Parasa, S., de Lange, T., Halvorsen, P., Riegler, M.A.: SinGAN-Seg: Synthetic training data generation for medical image segmentation. arXiv preprint arXiv:2107.00471 (2021)
19. Vázquez, D., Bernal, J., Sánchez, F.J., Fernández-Esparrach, G., López, A.M., Romero, A., Drozdzal, M., Courville, A.: A benchmark for endoluminal scene segmentation of colonoscopy images. Journal of Healthcare Engineering 2017 (2017)
20. Wang, Q., Gao, J., Lin, W., Yuan, Y.: Learning from synthetic data for crowd counting in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8198–8207 (2019)
21. Wu, Z., Su, L., Huang, Q.: Cascaded partial decoder for fast and accurate salient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3907–3916 (2019)
22. Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: UNet++: A nested U-Net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11. Springer (2018)
23. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE International Conference on Computer Vision (ICCV) (2017)