Overview of ImageCLEFmedical 2024 – Medical Visual Question Answering for Gastrointestinal Tract
Steven Hicks (1,*), Andrea Storås (1,2), Pål Halvorsen (1,2), Michael Riegler (1) and Vajira Thambawita (1)
(1) SimulaMet - Simula Metropolitan Center for Digital Engineering, Oslo, Norway
(2) OsloMet - Oslo Metropolitan University, Oslo, Norway


                                        Abstract
                                        This paper provides details on the second edition of the Medical Visual Question Answering for the Gastrointestinal
                                        Tract (MedVQA-GI) challenge, which took place during ImageCLEF 2024. This year, we changed the task from
                                        visual question answering to the application of text-to-image models for the creation of synthetic medical images.
                                        There were two sub-tasks in this challenge. The first sub-task involved using prompts to generate realistic-looking
                                        images from the gastrointestinal tract. The second sub-task focused on the technical aspects involved in the
                                        implementation of these models, and optimizing the prompts to generate realistic-looking images using a low
                                        number of tokens. Despite considerable interest in the task, the rate of submissions remained low, suggesting
                                        that participants may have encountered barriers or found the task too complex to complete.

                                        Keywords
                                        Machine learning, medical AI, endoscopy


                         1. Introduction
                         The second edition of the Medical Visual Question Answering for the Gastrointestinal Tract (MedVQA-GI)
                         challenge at ImageCLEF [1] introduces a new goal that focuses on the use of text-to-image generative
                         models in medical diagnosis. This combines natural language processing and image generation
                         to potentially improve diagnostic processes in healthcare by providing more comprehensive datasets
                         that can be used for training machine learning models. In contrast to last year’s focus on a Visual
                         Question Answering (VQA) task that required retrieving images or masks from user questions, this
                         year’s overall goal was to use generative models to create synthetic medical images from textual inputs.
                         Participants were tasked with generating synthetic images using existing generative models developed
                         using a dataset derived from last year’s MedVQA-GI challenge [2].
                            Machine learning has been a common method used to identify lesions in gastrointestinal (GI) images [3,
                         4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. Traditionally, the emphasis in GI analysis has been on disease detection
                         from images or videos, focusing mostly on polyp detection [14, 15, 16, 17, 18, 19, 20]. Several challenges
                         have demonstrated consistent advancements in this field, including some challenges we have organized
                         in the past [21, 22, 23, 24, 25]. However, there has been a growing interest in extending the capabilities
                         of GI image analysis through the generation of synthetic images [26, 27]. This new focus aims to develop
                         models that generate realistic GI images that can be used in place of real data. Such synthetic images
                         can be used to train medical professionals, refine diagnostic algorithms without the privacy concerns of
                         real patient data, and improve the interpretability and reliability of AI systems in clinical settings. To
                         this end, this year’s MedVQA-GI focuses on synthetic GI image generation. The dataset and the scripts
                         used to verify and evaluate submissions are available in our public GitHub repository (footnote 1).
                            The remainder of this paper is organized as follows. First, we start with an explanation of the creation
                         of the dataset, looking at how the data was collected and organized. Then, we discuss the specific
                         sub-tasks involved in the MedVQA-GI challenge and the evaluation methods used. Finally, we present
                         statistics on the participants and the results of the submitted runs.
                          CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                          * Corresponding author.
                          Email: steven@simula.no (S. Hicks); andrea@simula.no (A. Storås); paalh@simula.no (P. Halvorsen);
                          michael@simula.no (M. Riegler); vajira@simula.no (V. Thambawita)
                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                          1 https://github.com/simula/imageCLEFmed-MEDVQA-GI-2024

CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
Figure 1: Examples from the development dataset that was provided by the challenge organizers. The
samples represent different types of images contained within the dataset. Under each image is a prompt
that is associated with an image. Note that there can be multiple prompts that match to the same
image.
[Image grid omitted. Prompts shown under the images:]
    • Generate an image containing a polyp.
    • Generate an image from a colonoscopy procedure.
    • Generate an image containing text.
    • Generate an image containing oesophagitis.
    • Generate an image from a gastroscopy procedure.
    • Generate an image containing metal clip.
    • Generate an image containing biopsy forceps.
    • Generate an image containing the z-line.
    • Generate an image with 2 findings.
    • Generate an image containing tube.
    • Generate an image with a polyp of size < 5mm.
    • Generate a polyp of type paris iia.


2. Dataset
The dataset used for this challenge builds on the data developed for last year’s challenge, which was
in turn based on the HyperKvasir [28] and Kvasir-Instrument [29] datasets. Participants were provided
with 2,000 image and text pairs, organized in a directory containing the images and CSV files that hold
the prompts and link them to the image filenames. Example images and corresponding prompts can be
seen in Figure 1.
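
As a rough illustration of how such image and prompt pairs might be read, the following minimal sketch parses a CSV and groups prompts by image, since several prompts can refer to the same image. The column names `prompt` and `image`, the filenames, and the helper functions are assumptions made for this example; the actual CSV layout is defined in the challenge repository.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV layout: the column names "prompt" and "image" and the
# filenames below are assumptions, not the official dataset format.
SAMPLE_CSV = """prompt,image
Generate an image containing a polyp.,img_0001.jpg
Generate an image from a colonoscopy procedure.,img_0001.jpg
Generate an image containing the z-line.,img_0002.jpg
"""


def load_pairs(handle):
    """Return a list of (prompt, image filename) pairs from an open CSV file."""
    reader = csv.DictReader(handle)
    return [(row["prompt"].strip(), row["image"].strip()) for row in reader]


def prompts_by_image(pairs):
    """Group prompts by image, because several prompts may match one image."""
    grouped = defaultdict(list)
    for prompt, image in pairs:
        grouped[image].append(prompt)
    return dict(grouped)


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
grouped = prompts_by_image(pairs)
```

For a file on disk, `load_pairs(open(path, newline="", encoding="utf-8"))` would be used in place of the in-memory sample.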


3. Task Description and Evaluation
This year, participants could take part in two sub-tasks: Image Synthesis and Optimal Prompt Generation.
Participants could submit to either sub-task, and there was no limit on the number of submissions.

3.1. Sub-task 1: Image Synthesis
The first sub-task, Image Synthesis, involves using text-to-image generative models to construct a
comprehensive dataset of medical images from textual descriptions. This sub-task requires participants
to create accurate visual representations of various medical conditions described solely in text. For
example, with a description such as "An early-stage colorectal polyp," participants must generate an
image that precisely reflects the given text. Participants could use the development dataset, described
in Section 2, to develop their models.
   For the submission, each participant received a list of 5,000 prompts. They were required to create
synthetic images based on these prompts and submit them to the organizers by email. Each submitted
image file was named according to the prompt’s index number in the list. The quality of the synthetic
images was assessed using two metrics: the Inception Score (IS) [30] and the Fréchet Inception Distance
(FID) [31]. The synthetic images were compared against three distinct testing datasets. The first
dataset consisted of images from the previous year’s MedVQA-GI challenge. The second dataset was
GastroVision [32], a newly released open-source collection that includes 8,000 images obtained from
various medical centers. The final dataset used for the evaluation was a combination of the first two
datasets.
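
For reference, the two metrics are commonly defined as follows in [30] and [31]. The notation is ours and is only a sketch of the standard definitions, not a description of the exact evaluation scripts: p(y | x) is the class distribution predicted by an Inception network for a generated image x, p(y) is its average over all generated images, and (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of Inception features for the real and generated images, respectively.

```latex
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g} \big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big)

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \mathrm{Tr}\!\big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \big)
```

A higher IS and a lower FID are generally read as better. Only the FID depends on a reference dataset, while the IS is computed from the synthetic images alone.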

3.2. Sub-task 2: Optimal Prompt Generation
The second sub-task, Optimal Prompt Generation, focuses on participants creating their own prompts
to generate images that meet specific medical imaging requirements. This sub-task asked participants
to tailor their prompt generation skills to produce images that accurately match a set of predefined
categories. These categories are designed to test the model’s ability to generate precise and clinically
relevant images based on the prompts. Participants had to devise the following prompts:

    • A prompt that generates an image containing n polyps.
    • A prompt that generates a polyp in a specific region of the image.
    • A prompt that generates a polyp of a specific type and size.
    • A prompt that generates an image containing no findings from either the esophagus or large
      bowel.
    • A prompt that generates an image containing one of the following instruments: biopsy forceps,
      metal clip, or tube.
    • A prompt that generates an image containing one of the following anatomical landmarks: Z-line,
      pylorus, or cecum.

Each prompt was evaluated not only on the accuracy of the image it produced but also on the
conciseness of the prompt itself. Shorter and more precise prompts were preferred, as they are more
beneficial in clinical settings where clarity and efficiency are necessary. Additionally, the generated
images were subjected to the same quantitative evaluation metrics as in sub-task 1, IS and FID, to
ensure that image quality was assessed consistently across sub-tasks. This dual approach, which combined
a qualitative assessment of prompt effectiveness with quantitative image quality metrics, provided a
comprehensive assessment of participants’ proficiency in generating both relevant prompts and
high-quality synthetic images.
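
To make the conciseness criterion concrete, the sketch below ranks candidate prompts by token count. It covers only the length aspect, not image accuracy. Counting whitespace-separated words is an assumption made for illustration, since a text-to-image model would apply its own tokenizer, and the function names and example prompts are hypothetical.

```python
def token_count(prompt):
    """Count whitespace-separated tokens (a simple proxy, not a model tokenizer)."""
    return len(prompt.split())


def rank_by_conciseness(prompts):
    """Sort prompts from fewest to most tokens; ties keep their input order."""
    return sorted(prompts, key=token_count)


# Hypothetical candidate prompts for the same target image.
candidates = [
    "Generate an endoscopy image that shows a single small polyp near the centre",
    "Generate an image containing a polyp.",
    "Polyp, colonoscopy",
]
ranked = rank_by_conciseness(candidates)
```

In practice, a shorter prompt would only be preferred if the image it produces still matches the target category.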


4. Participation and Results
This section provides an overview of participation in the challenge and discusses the results submitted
by those who completed it.

4.1. Participation
In total, 22 teams signed up for the task, 2 teams submitted runs, and 2 teams submitted working notes
papers [33, 34]. Table 1 shows an overview of the participants and the number of submissions to each
sub-task alongside the number of participants from last year’s challenge. Participation declined
noticeably compared to last year’s challenge. One reason for this may be the complexity of the task
and its hardware and model requirements. Future editions could benefit from enhanced outreach and
support mechanisms, such as tutorials or "getting started" scripts, to broaden participant engagement
and lower the entry barriers.

4.2. Results
A total of six runs were submitted to sub-task 1, and no runs were submitted to sub-task 2. This section
gives an overview of the results of each run and briefly discusses the approach submitted by each
participant. The results can be seen in Table 2.

4.2.1. MMCP Team
Team MMCP’s approach was based on two methods: fine-tuning Kandinsky models and implementing
a Medical Synthesis with Diffusion Model (MSDM). They fine-tuned pre-trained Kandinsky models to
generate images from text prompts. In addition, they experimented with MSDM, which showed improved
results over the Kandinsky-based models. Example images of each submission can be seen in Figure 2. For
more information on their approach, please read their working notes paper [33].


Figure 2: MMCP Team submission examples. Please note that these images have been cherry-picked; please
see the participant paper for more details [33].
[Images omitted. Prompts shown under the images:]
    • Generate an image not containing text.
    • Generate an image containing the z-line.
    • Generate an image containing tube.
    • Generate an image containing a polyp.
    • Generate an image from a colonoscopy.
    • Generate an image from a gastroscopy.


4.2.2. Team 2
Team 2 used a different approach for each submission. The first approach did not generate synthetic
images; rather, it retrieved images closely related to the input prompt using a Contrastive
Language-Image Pre-training (CLIP) model. The second submission used a fine-tuned Stable Diffusion
model to generate synthetic images. The third submission used Low-Rank Adaptation (LoRA) to adapt a
pre-existing Stable Diffusion model so that it produces high-quality images that closely align with
the input specifications. Example images of each submission can be seen in Figure 3. For more
information on their approach, please read their working notes paper [34].
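
To illustrate how a retrieval-based run differs from image generation, the following minimal sketch selects the existing image whose embedding is closest to the prompt embedding by cosine similarity. It is not the team's implementation. The three-dimensional vectors and filenames are toy values standing in for embeddings from a text and image encoder such as CLIP, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(prompt_embedding, image_embeddings):
    """Return the filename whose embedding is most similar to the prompt."""
    return max(
        image_embeddings,
        key=lambda name: cosine_similarity(prompt_embedding, image_embeddings[name]),
    )


# Toy vectors standing in for real encoder outputs (assumed values).
image_embeddings = {
    "polyp.jpg": [0.9, 0.1, 0.0],
    "z_line.jpg": [0.1, 0.9, 0.1],
    "tube.jpg": [0.0, 0.2, 0.9],
}
prompt_embedding = [0.8, 0.2, 0.1]
best_match = retrieve(prompt_embedding, image_embeddings)
```

Because the returned file is an existing image, no new pixels are synthesized, which is why such a run falls outside the intended goal of the sub-task.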


Table 1
An overview of the registrations and submissions to MedVQA-GI in 2023 and 2024.
                              MedVQA-GI 2023   MedVQA-GI 2024   Difference
# Registrations                     26               22             -4
# Teams that submitted               8                2             -6
# Submissions to Task 1             10                6             -4
# Submissions to Task 2              4                0             -4
# Submissions to Task 3              2                -              -
# Paper Submissions                  6                2             -4

Table 2
Results for Task 1. Each submission is evaluated using the FID and the Inception Score (IS). The FID score is
calculated against the MedVQA testing dataset (Single), GastroVision (Multi), and a combination of the two
(Both). The IS score is calculated on a 10-way split of the synthetic images, where we display the mean (avg),
standard deviation (sd), and median (med).
Team        Submission      FID (Single)   FID (Multi)   FID (Both)   IS (avg)   IS (sd)   IS (med)
MMCP Team   submission1     0.125          0.121         0.119        1.773      0.023     1.775
MMCP Team   submission2     0.120          0.117         0.115        1.791      0.028     1.792
MMCP Team   submission3     0.086          0.064         0.066        1.624      0.031     1.633
Team 2      submission1*    0.114          0.128         0.124        1.568      0.025     1.560
Team 2      submission2     0.099          0.064         0.067        2.327      0.065     2.339
Team 2      submission3     0.110          0.073         0.076        2.362      0.050     2.359
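
The split-based IS summary described in the caption of Table 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the official evaluation script: the per-image class probabilities would normally come from an Inception network, the splits here are contiguous chunks, leftover images are dropped, and the population standard deviation is used. The function names and toy data are hypothetical.

```python
import math
import statistics


def inception_score(probabilities):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    n = len(probabilities)
    num_classes = len(probabilities[0])
    marginal = [sum(p[c] for p in probabilities) / n for c in range(num_classes)]
    total_kl = 0.0
    for p in probabilities:
        # Terms with p_c == 0 contribute nothing to the KL divergence.
        total_kl += sum(
            p_c * math.log(p_c / marginal[c]) for c, p_c in enumerate(p) if p_c > 0.0
        )
    return math.exp(total_kl / n)


def split_scores(probabilities, splits=10):
    """IS of each contiguous chunk; images beyond splits * size are dropped."""
    size = len(probabilities) // splits
    return [
        inception_score(probabilities[i * size:(i + 1) * size]) for i in range(splits)
    ]


def summarize(scores):
    """Mean, population standard deviation, and median of the split scores."""
    return {
        "avg": statistics.mean(scores),
        "sd": statistics.pstdev(scores),
        "med": statistics.median(scores),
    }


# Toy predictions: 40 images, each confidently assigned to one of 4 classes.
toy = [[1.0 if c == i % 4 else 0.0 for c in range(4)] for i in range(40)]
scores = split_scores(toy, splits=10)
summary = summarize(scores)
```

In this toy case every chunk contains one confident prediction per class, so each split scores close to the number of classes. Identical predictions for every image would instead give a score of 1.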


Figure 3: Team 2 submission examples. Please note that these images have been cherry picked, please
see the participant paper for more details [].
[Images omitted. Prompts shown under the images:]
    • Generate an image from a colonoscopy procedure.
    • Generate an image with 1 polyp.
    • Generate an image from a gastroscopy procedure.
    • Generate an image from a colonoscopy procedure.


4.3. Discussion
The challenge results highlight several important insights and areas for further exploration. Firstly,
the performance across the two teams and runs varied. This variability underscores the complexity of
creating high-quality medical images. However, we found that the quality of the images did not always
correspond to the scores provided by the quantitative metrics, suggesting that we need more robust
synthetic image quality metrics specifically for medical images and their applications.
   Another notable finding was that there was some confusion surrounding the generation of synthetic
images. One team submitted a run that retrieved "real" images corresponding to the submitted
prompt. This deviated from the intended goal, which was to generate synthetic images, and
highlights the need for clearer communication of the challenge requirements.
   Furthermore, reduced participation compared to last year indicates possible entry barriers that
may include the complexity of tasks or a lack of foundational resources for newcomers. Addressing
these barriers could involve providing more comprehensive datasets, detailed examples of successful
implementations, and potentially simplifying the challenge structure to attract a broader range of
participants.


5. Conclusion and Future Outlook
This paper discussed the second edition of the MedVQA-GI challenge, which took place at ImageCLEF
in 2024. The challenge consisted of two sub-tasks centered on the generation of synthetic images in the
gastrointestinal tract. In future editions, we plan to make the task more robust and to provide more
resources to help participants get started. We also want to merge the tasks from the first year with
this year’s challenge to keep the task more consistent.

References
 [1] B. Ionescu, H. Müller, A.-M. Drăgulinescu, J. Rückert, A. Ben Abacha, A. G. S. de Herrera, L. Bloch,
     R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. Pakull, H. Damm, B. Bracke, C. M.
     Friedrich, A.-G. Andrei, Y. Prokopchuk, D. Karpenka, A. Radzhabov, V. Kovalev, C. Macaire,
     D. Schwab, B. Lecouteux, E. Esperança-Rodier, W.-W. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, S. A.
     Hicks, M. A. Riegler, V. Thambawita, A. Storås, P. Halvorsen, M. Heinrich, J. Kiesel, M. Potthast,
     B. Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical, social media and
     recommender systems applications, in: CLEF2024 Working Notes, CEUR Workshop Proceedings,
     CEUR-WS.org, Grenoble, France, 2024.
 [2] S. A. Hicks, A. Storås, P. Halvorsen, T. de Lange, M. A. Riegler, V. Thambawita, Overview of
     imageclefmedical 2023 – medical visual question answering for gastrointestinal tract, in: CLEF2023
     Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece, 2023.
 [3] C. Hassan, M. Spadaccini, A. Iannone, R. Maselli, M. Jovani, V. T. Chandrasekar, G. Antonelli,
     H. Yu, M. Areia, M. Dinis-Ribeiro, et al., Performance of artificial intelligence in colonoscopy for
     adenoma and polyp detection: a systematic review and meta-analysis, Gastrointestinal endoscopy
     93 (2021) 77–85.
 [4] A. Alammari, A. R. Islam, J. Oh, W. Tavanapong, J. Wong, P. C. De Groen, Classification of ulcerative
     colitis severity in colonoscopy videos using cnn, in: Proceedings of the ACM International
     Conference on Information Management and Engineering (ACM ICIME), 2017, pp. 139–144.
     doi:https://doi.org/10.1145/3149572.3149613.
 [5] D. Bychkov, N. Linder, R. Turkki, S. Nordling, P. E. Kovanen, C. Verrill, M. Walliander, M. Lundin,
     C. Haglund, J. Lundin, Deep learning based tissue analysis predicts outcome in colorectal cancer,
     Scientific Reports 8 (2018) 3395. URL: http://dx.doi.org/10.1038/s41598-018-21758-3.
     doi:https://doi.org/10.1038/s41598-018-21758-3.
 [6] Y. Mori, S.-e. Kudo, M. Misawa, Y. Saito, H. Ikematsu, K. Hotta, K. Ohtsuka, F. Urushibara, S. Kataoka,
     Y. Ogawa, Y. Maeda, K. Takeda, H. Nakamura, K. Ichimasa, T. Kudo, T. Hayashi, K. Wakamura,
     F. Ishida, H. Inoue, H. Itoh, M. Oda, K. Mori, Real-Time Use of Artificial Intelligence in Identification
     of Diminutive Polyps During Colonoscopy: A Prospective Study, Annals of Internal Medicine 169
     (2018) 357–366. doi:https://doi.org/10.7326/M18-0249.
 [7] K. Pogorelov, S. L. Eskeland, T. de Lange, C. Griwodz, K. R. Randel, H. K. Stensland, D.-T. Dang-
     Nguyen, C. Spampinato, D. Johansen, M. Riegler, P. Halvorsen, A holistic multimedia system
     for gastrointestinal tract disease detection, in: Proceedings of the ACM on Multimedia Systems
     Conference (MMSYS), 2017, pp. 112–123. doi:https://doi.org/10.1145/3193740.
 [8] J. Silva, A. Histace, O. Romain, X. Dray, B. Granado, Toward embedded detection of polyps in wce
     images for early diagnosis of colorectal cancer, International Journal of Computer Assisted
     Radiology and Surgery 9 (2014) 283–293. doi:https://doi.org/10.1007/s11548-013-0926-3.
 [9] V. L. Thambawita, D. Jha, H. L. Hammer, H. D. Johansen, D. Johansen, P. Halvorsen, M. Riegler, An
     extensive study on cross-dataset bias and evaluation metrics interpretation for machine learning
     applied to gastrointestinal tract abnormality classification, ACM Transactions on Computing for
     Healthcare (2020).
[10] D. Jha, M. Riegler, D. Johansen, P. Halvorsen, H. Johansen, Doubleu-net: A deep convolutional
     neural network for medical image segmentation, in: Proceeding of the International Symposium
     on Computer Based Medical Systems (CBMS), 2020.
[11] Q. Angermann, J. Bernal, C. Sánchez-Montes, M. Hammami, G. Fernández-Esparrach, X. Dray,
     O. Romain, F. J. Sánchez, A. Histace, Towards real-time polyp detection in colonoscopy videos:
     Adapting still frame-based methodologies for video sequences analysis, in: Proceedings of
     Computer Assisted and Robotic Endoscopy and Clinical Image-Based Procedures (CARE CLIP), volume
     10550, Springer, 2017, pp. 29–41.
[12] K. Pogorelov, M. Riegler, P. Halvorsen, P. T. Schmidt, C. Griwodz, D. Johansen, S. L. Eskeland,
     T. de Lange, Gpu-accelerated real-time gastrointestinal diseases detection, in: Proceedings of the
     International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2016, pp. 185–190.
     doi:https://doi.org/10.1109/CBMS.2016.63.
[13] M. Riegler, K. Pogorelov, P. Halvorsen, T. de Lange, C. Griwodz, P. T. Schmidt, S. L. Eskeland,
     D. Johansen, EIR - efficient computer aided diagnosis framework for gastrointestinal endoscopies,
     in: Proceedings of the IEEE International Workshop on Content-Based Multimedia Indexing
     (CBMI), 2016, pp. 1–6. doi:https://doi.org/10.1109/CBMI.2016.7500257.
[14] Y. Wang, W. Tavanapong, J. Wong, J. H. Oh, P. C. De Groen, Polyp-alert: Near real-time
     feedback during colonoscopy, Computer Methods and Programs in Biomedicine 120 (2015) 164–179.
     doi:https://doi.org/10.1016/j.cmpb.2015.04.002.
[15] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. De Lange, P. Halvorsen, H. D. Johansen,
     Resunet++: An advanced architecture for medical image segmentation, in: Proceedings of the
     International Symposium on Multimedia (ISM), 2019, pp. 225–230.
     doi:https://doi.org/10.1109/ISM46123.2019.00049.
[16] J. Bernal, A. Histace, M. Masana, Q. Angermann, C. Sánchez-Montes, C. Rodriguez, M. Hammami,
     A. Garcia-Rodriguez, H. Córdova, O. Romain, G. Fernández-Esparrach, X. Dray, J. Sanchez, Polyp
     detection benchmark in colonoscopy videos using gtcreator: A novel fully configurable tool for
     easy and fast annotation of image databases, in: Proceedings of Computer Assisted Radiology and
     Surgery (CARS), 2018. doi:https://hal.archives-ouvertes.fr/hal-01846141.
[17] Y. Guo, J. Bernal, B. J Matuszewski, Polyp segmentation with fully convolutional deep neural
     networks—extended evaluation study, Journal of Imaging 6 (2020) 69.
[18] M. Min, S. Su, W. He, Y. Bi, Z. Ma, Y. Liu, Computer-aided diagnosis of colorectal polyps using
     linked color imaging colonoscopy to predict histology, Scientific reports 9 (2019) 2881.
     doi:https://doi.org/10.1038/s41598-019-39416-7.
[19] N. M. Ghatwary, X. Ye, M. Zolgharni, Esophageal abnormality detection using densenet based
     faster r-cnn with gabor features, IEEE Access 7 (2019) 84374–84385.
     doi:https://doi.org/10.1109/ACCESS.2019.2925585.
[20] S. Shah, N. Park, N. E. H. Chehade, A. Chahine, M. Monachese, A. Tiritilli, Z. Moosvi, R. Ortizo,
     J. Samarasena, Effect of computer-aided colonoscopy on adenoma miss rates and polyp detection:
     a systematic review and meta-analysis, Journal of Gastroenterology and Hepatology 38 (2023)
     162–176.
[21] S. Hicks, M. Riegler, P. Smedsrud, T. B. Haugen, K. R. Randel, K. Pogorelov, H. K. Stensland, D.-T.
     Dang-Nguyen, M. Lux, A. Petlund, T. de Lange, P. T. Schmidt, P. Halvorsen, Acm multimedia
     biomedia 2019 grand challenge overview, in: Proceedings of the ACM International Conference
     on Multimedia (ACM MM), 2019, pp. 2563–2567. doi:https://doi.org/10.1145/3343031.3356058.
[22] K. Pogorelov, M. Riegler, P. Halvorsen, S. A. Hicks, K. R. Randel, D.-T. Dang-Nguyen, M. Lux,
     O. Ostroukhova, T. De Lange, Medico multimedia task at mediaeval 2018, in: Proceeding of the
     MediaEval Benchmarking Initiative for Multimedia Evaluation Workshop (MediaEval), 2018.
[23] M. Riegler, K. Pogorelov, P. Halvorsen, K. Randel, S. Eskeland, D.-T. Dang-Nguyen, M. Lux,
     C. Griwodz, C. Spampinato, T. de Lange, Multimedia for medicine: the medico task at mediaeval 2017,
     in: Proceeding of the MediaEval Benchmarking Initiative for Multimedia Evaluation Workshop
     (MediaEval), 2017.
[24] J. Bernal, H. Aymeric, Miccai endoscopic vision challenge polyp detection and segmentation,
     https://endovissub2017-giana.grand-challenge.org/home/, 2017. Accessed: 2017-12-11.
[25] S. Hicks, M. Riegler, P. Smedsrud, T. B. Haugen, K. R. Randel, K. Pogorelov, H. K. Stensland, D.-T.
     Dang-Nguyen, M. Lux, A. Petlund, T. de Lange, P. T. Schmidt, P. Halvorsen, Acm multimedia
     biomedia 2019 grand challenge overview, in: Proceedings of the 27th ACM International
     Conference on Multimedia, MM ’19, Association for Computing Machinery, New York, NY, USA, 2019,
     pp. 2563–2567. URL: https://doi.org/10.1145/3343031.3356058. doi:10.1145/3343031.3356058.
[26] V. Thambawita, P. Salehi, S. A. Sheshkal, S. A. Hicks, H. L. Hammer, S. Parasa, T. d. Lange,
     P. Halvorsen, M. A. Riegler, Singan-seg: Synthetic training data generation for medical image
     segmentation, PLOS ONE 17 (2022) 1–24. URL: https://doi.org/10.1371/journal.pone.0267976.
     doi:10.1371/journal.pone.0267976.
[27] D. Yoon, H.-J. Kong, B. S. Kim, W. S. Cho, J. C. Lee, M. Cho, M. H. Lim, S. Y. Yang, S. H. Lim,
     J. Lee, J. H. Song, G. E. Chung, J. M. Choi, H. Y. Kang, J. H. Bae, S. Kim, Colonoscopic image
     synthesis with generative adversarial network for enhanced detection of sessile serrated lesions
     using convolutional neural network, Sci Rep 12 (2022) 261.
[28] H. Borgli, V. Thambawita, P. H. Smedsrud, S. Hicks, D. Jha, S. L. Eskeland, K. R. Randel, K. Pogorelov,
     M. Lux, D. T. D. Nguyen, et al., Hyperkvasir, a comprehensive multi-class image and video dataset
     for gastrointestinal endoscopy, Scientific data 7 (2020). doi:10.1038/s41597-020-00622-y.
[29] D. Jha, S. Ali, K. Emanuelsen, S. A. Hicks, V. Thambawita, E. Garcia-Ceja, M. A. Riegler, T. de Lange,
     P. T. Schmidt, H. D. Johansen, D. Johansen, P. Halvorsen, Kvasir-instrument: Diagnostic and
     therapeutic tool segmentation dataset in gastrointestinal endoscopy, in: Proceedings of the
     International Conference on MultiMedia Modeling (MMM), 2021, pp. 218–229.
     doi:10.1007/978-3-030-67835-7_19.
[30] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for
     training gans, Advances in neural information processing systems 29 (2016).
[31] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, Gans trained by a two time-scale
     update rule converge to a local nash equilibrium, Advances in neural information processing
     systems 30 (2017).
[32] D. Jha, V. Sharma, N. Dasu, N. K. Tomar, S. Hicks, M. Bhuyan, P. K. Das, M. A. Riegler, P. Halvorsen,
     T. de Lange, U. Bagci, Gastrovision: A multi-class endoscopy image dataset for computer aided
     gastrointestinal disease detection, in: ICML Workshop on Machine Learning for Multimodal
     Healthcare Data (ML4MHD 2023), 2023.
[33] M. Chaychuk, Mmcp team at imageclefmed 2024 task on image synthesis: Diffusion models for
     text-to-image generation of colonoscopy images, in: CLEF2024 Working Notes, CEUR Workshop
     Proceedings, CEUR-WS.org, Grenoble, France, 2024.
[34] E.-P. Oluwafemi Ojonugwa, M. Rahman, F. Khalifa, Advancing ai-powered medical image synthesis:
     Insights from medvqa-gi challenge using clip, fine-tuned stable diffusion, and dream-booth + lora,
     in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France,
     2024.