Method and Software Tool for Generating Artificial Databases of Biomedical Images Based on Deep Neural Networks

Oleh Berezsky, Petro Liashchynskyi, Oleh Pitsun and Grygoriy Melnyk

West Ukrainian National University, 11 Lvivska st., Ternopil, 46001, Ukraine

Abstract
A wide variety of biomedical image data, as well as methods for generating training images using basic deep neural networks, were analyzed. In addition, the main platforms for publishing image datasets were reviewed and their characteristics compared. The article develops a method for generating artificial biomedical images based on GANs, and a GAN architecture for biomedical image synthesis is proposed. A database and a module for generating training images were designed and implemented in a software system. The generated image database was compared with known databases.

Keywords
Breast cancer, image generation, generative adversarial networks, training data sets, digital platform, artificial databases of biomedical images

IDDM'2023: 6th International Conference on Informatics & Data-Driven Medicine, November 17-19, 2023, Bratislava, Slovakia
EMAIL: ob@wunu.edu.ua (A. 1); p.liashchynskyi@st.wunu.edu.ua (A. 2); o.pitsun@wunu.edu.ua (A. 3); mgm@wunu.edu.ua (A. 4)
ORCID: 0000-0001-9931-4154 (A. 1); 0000-0002-3920-6239 (A. 2); 0000-0003-0280-8786 (A. 3); 0000-0003-0646-7448 (A. 4)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

Currently, deep neural networks (DNNs) are widely used for automatic diagnosis in medicine. DNNs are trained on training datasets (TNDs), which help improve the accuracy of automatic diagnosis. In oncology, the following biomedical images are used to make a diagnosis: cytological, histological, and immunohistochemical images. Original sets of these images are limited, which is explained by the objective difficulties of obtaining them.

Sources for creating datasets include international scientific projects and competitions in the development and testing of algorithms (for example, Kaggle). To distribute datasets, resources of individual projects, conferences, and competitions (challenges), as well as special platforms, are used. Examples of such platforms for publishing datasets are Data.gov [1], Kaggle Datasets [2], Zenodo [3], Google Dataset Search [4], and many others.

When designing biomedical image analysis systems, the problem of enlarging the TND is also relevant. Enlarging the TND helps improve model accuracy and takes rare classes into account. To enlarge a dataset, augmentation and synthesis tools can be used. Augmentation is the process of artificially increasing the size of a TND by applying various image transformations. These can be operations such as shifting, scaling, rotation, contrast changes, and others, producing different variations of the same image.
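As a concrete illustration of these augmentation operations, the sketch below builds a small transformation pipeline with the torchvision library. The operator choices and parameter ranges are our own assumptions for demonstration; they are not the pipeline used later in this paper.

```python
# Minimal augmentation sketch using torchvision (illustrative only;
# the transformation set and parameter ranges are assumptions).
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    # Random rotation (up to 15 degrees), shift (up to 10% of the image size)
    # and scaling (90-110%).
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    # Random contrast change.
    transforms.ColorJitter(contrast=0.2),
])

image = Image.open("histology_sample.png")  # hypothetical input file
variants = [augment(image) for _ in range(10)]  # ten variations of one image
```

Each call to the pipeline draws new random parameters, so one original image yields many distinct training samples.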
Synthetic data, in contrast, is created artificially from an initial small dataset. Generative adversarial networks (GANs) are used to create synthetic data.

The purposes of artificial (synthetic) image sets are as follows:
1. Training and testing of algorithms. Synthetic datasets allow developers and researchers to create images of known pathologies. This is useful when real clinical data are limited or unavailable.
2. Studying the robustness of algorithms to various artifacts that may appear in real images.

The characteristics of artificial (synthetic) image sets are as follows:
1. Synthetic datasets allow control over image parameters such as resolution, types of pathologies, degree of complexity, and others. This facilitates the study of specific aspects of diagnosis.
2. When creating synthetic images, the true state of the pathology is known, which makes it possible to accurately evaluate the effectiveness of algorithms and analyze their accuracy.
3. Using synthetic data, large volumes of images can be generated for testing and validating algorithms at different scales.
4. Public access makes these and other datasets available to the community, facilitating data sharing and research collaboration.

Therefore, the generation of biomedical images in oncology is a pressing problem, since it provides the accuracy required for the classification of biomedical images. To solve this problem, GANs were used in this work.

2. Literature review

In their seminal work, Ian Goodfellow et al. introduced the concept of generative adversarial networks (GANs) [5]. The architecture consists of two neural networks: a generator and a discriminator. The generator produces data for the discriminator, and the discriminator distinguishes genuine data from generated data. However, the disadvantages of such networks remain potential training instability and non-convergence of the model.
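To make the generator-discriminator competition concrete, the following is a minimal adversarial training step in PyTorch. The network definitions, latent size, and hyperparameters are placeholders for illustration, not the architecture developed later in this paper.

```python
# Minimal GAN training step (illustrative; networks and sizes are placeholders).
import torch
import torch.nn as nn

latent_dim = 100
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())          # generator
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))                        # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    b = real_batch.size(0)
    # 1) Discriminator step: real images get label 1, generated images label 0.
    fake = G(torch.randn(b, latent_dim)).detach()  # no gradients into G here
    d_loss = bce(D(real_batch), torch.ones(b, 1)) + bce(D(fake), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator step: try to make D classify generated images as real.
    g_loss = bce(D(G(torch.randn(b, latent_dim))), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

The instability mentioned above arises precisely from this alternating optimization: each network's objective shifts as the other one improves.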
In [6], the potential of GANs for datasets focused on liver pathologies was explored. The main theme of the research was the ability of GANs to combat the common problem of overfitting in deep learning models. The paper demonstrated the improvement and extension of medical datasets, which increased the reliability of the model.

Synthesizing medical images using GANs involves not only extending datasets but also ensuring the confidentiality of patient information. In [7], a GAN was used to create synthetic medical data that preserves patient privacy by adding noise to images.

The authors of [8] used a GAN to create MRI images of the heart. The goal of the work was to study the influence of architectural features of networks on the quality of synthesized images. Ensuring high resolution and accuracy of synthesized images remains a major problem.

Other research [9, 10] has shown that GANs can be used to synthesize images of retinal and lung cancer cells. The synthesized lung cancer nodules passed a visual Turing test with the participation of radiologists.

The article [11] explored the problem of "mode collapse," when the generator synthesizes only a small number of distinct images, thus limiting the variety of synthesized images. The article suggests that the difficulty of training GANs and the need for large amounts of training data limit the application of GANs in medicine.

The authors of [12] showed that a GAN is also capable of learning to simulate the distribution of high-resolution MRI images. To synthesize high-resolution images of skin lesions, researchers compared several GAN architectures; a classifier was successfully trained on the synthesized samples [13]. Using the concept of progressive growing of GANs, the authors of [14, 15] generated realistic synthetic images of skin lesions.

A method that allows synthesizing histopathological images was developed in [16]. Based on specific tissue types, cancer subtypes, and known reference image data, the authors synthesize histopathological images.

The authors of [17] used an improved conditional GAN architecture, which they called HistoGAN. The researchers used a self-attention module and other techniques to stabilize the training process and improve the quality of the synthesized images. The authors of [18] propose a generative adversarial network with additional regularization based on a sharpness loss to generate realistic histopathological images; the sharpness loss enhances the contrast of pixels on the contours of the nuclei. Researchers in [19] propose a vision-transformer-based GAN model for synthetically augmenting a set of histopathological images.

The authors of this work have developed a number of CAD systems for automatic diagnosis in oncology [20-24].

Therefore, GANs have significant potential in biomedical image synthesis. However, the application of GANs is constrained by training complexity, data quality, ethical issues, and clinical relevance.

3. Problem statement

The conducted analysis showed that there is a problem of enlarging datasets for the classification of biomedical images. To solve this problem, it is necessary to:
- analyze typical training sets of images;
- develop an artificial image synthesis method;
- develop software for synthesis and image storage;
- conduct computer experiments.

4. Analysis of typical training sets of images

World practice has shown that the creation and storage of datasets is a pressing problem. Creating datasets is a separate task that involves a large number of specialists from different countries. Storing datasets on well-known platforms is also a separate task. Therefore, we first analyze the training datasets.

The main characteristics of a TND [25] are as follows:
1. Image volume: the number of images of normal and pathological tissues of the human body.
2. Image format. The most efficient way to present images is WSI (Whole Slide Imaging), multi-scale whole-slide imaging.

To train algorithms, a TND must contain annotations or expert segmentation of micro-objects. This segmentation is maintained through contour coordinates and binary masks. Automated microscopy systems are used to create TNDs. For example, the Automated Slide Analysis Platform (ASAP) is an open platform for visualization, annotation, and automated analysis of WSI images. ASAP is built on the well-known open-source software OpenSlide, Qt, and OpenCV.

Camelyon16 and TUPAC16 are TNDs playing an important role in the development of modern digital histopathology and cancer diagnosis. Based on large volumes of tissue imaging, these sets provide researchers and clinicians with a unique opportunity to study and develop advanced machine learning algorithms for automated analysis of histological data. The Camelyon16 dataset [26] was built to detect lymph node metastases in breast cancer. It includes about 400 gigabytes of lymph node images, which are used to train and test algorithms. The TUPAC16 dataset [27, 28] was created to investigate morphological characteristics and detect cancerous abnormalities in the mammary glands; the set includes 490 gigabytes of images. The Camelyon17 dataset is an extended version of Camelyon16, containing even more images and clinical data.

Platforms are formed based on datasets. The main functions of the platforms are:
1. Data storage and organization. Platforms allow users to load, store, and organize data in hierarchical structures. Versioning, GitHub integration, and visit and download statistics can also be supported.
2. Availability and exchange. Data can be shared with the global community.
3. Methods and description of datasets. The platforms allow detailed descriptions and metadata for each dataset. Metadata includes the DOI, author information (ORCID), keywords, project and research grant information, source citations, organization type, and publisher type.
4. Licenses and access control. Users can set licenses and rules for accessing datasets. The following licenses are available for public sets: Creative Commons, GPL, Open Data Commons, Community Data License. Collaboration functionality is also available, allowing multiple users to jointly own and maintain a private or publicly accessible dataset.
5. Tools for analysis and rendering. Some platforms provide tools for analyzing, processing, and visualizing data directly on the platform.
6. Support for data formats. Platforms can support different data formats, including text, images, video, audio, and others. In particular, the available formats are XML, PDF, HTML, EXCEL, CSV, JSON, RDF, DOC, ZIP, and BigQuery.

Comparative characteristics of known platforms are presented in Table 1.

Table 1
Comparative characteristics of platforms

Characteristic       | Data.gov [1]        | Kaggle Datasets [2]   | Zenodo [3]          | Google Dataset Search [4]
Type                 | Government resource | Commercial platform   | Academic repository | Search tool
Openness             | Open data           | Open and closed data (sometimes with restrictions) | Open data | Mostly open data
Categories of data   | Government data (education, health, economy, etc.) | Diverse (from science to entertainment) | Academic and research data | Various (from various sources)
File formats         | CSV, JSON, XML, etc. | CSV, SQLite, JSON, and others | PDF, CSV, ZIP, other scientific formats | Depends on the source
API                  | Yes                 | Yes                   | Yes                 | No
Additional functions | Resources for developers, visualization tools | Competitions, kernels (code), discussions | DOI assignment, GitHub integration | Search by metadata
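Three of the four platforms in Table 1 expose a public API. As an illustration, the sketch below queries Zenodo's public REST records endpoint; the endpoint and parameters follow Zenodo's documented API, but the query term and the fields printed are our own assumptions.

```python
# Sketch: searching Zenodo for datasets via its public REST API.
# The search query and the printed fields are illustrative assumptions.
import requests

response = requests.get(
    "https://zenodo.org/api/records",
    params={"q": "histopathology images", "type": "dataset", "size": 5},
    timeout=30,
)
response.raise_for_status()

for record in response.json()["hits"]["hits"]:
    print(record["metadata"]["title"], record["links"]["self"])
```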
5. Artificial image synthesis method

A method for synthesizing artificial biomedical image databases using GANs has been developed in this research. A GAN consists of two neural networks, a generator and a discriminator, trained simultaneously in a competitive process. The generator tries to create realistic data, while the discriminator distinguishes genuine data from artificially generated data. Over time, the generator improves its ability to create realistic data by learning from the discriminator's feedback.

The developed method consists of the following steps (Figure 1):
1. Loading training images from a directory.
2. Extending the training sample by applying affine distortions. In the developed method, the following affine distortions are applied to the images: random scaling, rotation, and shift. Each operation is applied with a 50% probability. The software implementation of the distortions is based on the Rudi library [29].
3. Training and evaluating the GAN on the extended training set from the previous step. The network is trained for a given number of iterations. To assess the quality of the generated images, the following metrics are used: FID (Frechet Inception Distance) and IS (Inception Score). Both metrics are based on the Google Inception v3 classifier model, designed for classifying color images and trained on the ImageNet dataset.

Figure 1: The method of image generation and storage

The FID metric helps evaluate the quality of the generated images, and the IS metric helps evaluate their diversity: smaller FID values indicate better quality of the synthesized images, and larger IS values indicate better diversity.
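A minimal sketch of this evaluation step is shown below. It uses the torchmetrics implementations of FID and IS rather than the authors' own evaluation code, so the batch shapes and update logic are illustrative assumptions.

```python
# Sketch: computing FID and IS with torchmetrics (illustrative only).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

fid = FrechetInceptionDistance(feature=2048)  # Inception v3 pooling features
inception = InceptionScore()

# Both metrics expect uint8 image batches of shape (N, 3, H, W);
# random tensors stand in for real and generated batches here.
real_images = torch.randint(0, 255, (96, 3, 64, 64), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (96, 3, 64, 64), dtype=torch.uint8)

fid.update(real_images, real=True)    # accumulate statistics of real data
fid.update(fake_images, real=False)   # accumulate statistics of generated data
inception.update(fake_images)         # IS uses generated data only

print("FID:", fid.compute().item())   # lower is better (quality)
is_mean, is_std = inception.compute()
print("IS:", is_mean.item())          # higher is better (diversity)
```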
The model is evaluated with these metrics every M training iterations. Once a given number of iterations Iter is reached, N images are generated and stored, together with the trained model, in a database. Thus, the method combines data loading, data augmentation, GAN training, image generation, and image quality evaluation.

A GAN is used to generate the images. The architectures of the generator and the discriminator are based on ResNet blocks, and both networks contain a self-attention mechanism. The architectures of GenBlock and the generator are shown in Figures 2 and 3; the architectures of DiscBlock and the discriminator are shown in Figures 4 and 5.

Figure 2: GenBlock
Figure 3: Generator
Figure 4: DiscBlock
Figure 5: Discriminator

The following training parameters were used:
1. Optimizer: Adam.
2. Generator learning rate: 1e-4.
3. Discriminator learning rate: 4e-4.
4. Loss function: hinge loss.
5. Epochs: 100,000.
6. Batch size: 96.
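The paper does not include source code for the blocks in Figures 2-5, so the sketch below is a minimal PyTorch reconstruction of the ingredients described above: a ResNet-style generator block with upsampling, a SAGAN-style self-attention layer, and the hinge loss from item 4 of the parameter list. Channel counts, normalization choices, and layer ordering are assumptions.

```python
# Sketch: ResNet-style generator block, self-attention, and hinge losses
# (a reconstruction under stated assumptions, not the authors' exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Self-attention over spatial positions (SAGAN-style)."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C/8)
        k = self.key(x).flatten(2)                     # (B, C/8, HW)
        v = self.value(x).flatten(2)                   # (B, C, HW)
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # residual connection

class GenBlock(nn.Module):
    """Residual generator block that doubles the spatial resolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)  # channel match on skip path

    def forward(self, x):
        h = F.interpolate(F.relu(self.bn1(x)), scale_factor=2)
        h = self.conv1(h)
        h = self.conv2(F.relu(self.bn2(h)))
        return h + self.skip(F.interpolate(x, scale_factor=2))

def d_hinge_loss(real_scores, fake_scores):
    # Discriminator hinge loss: push real scores above 1 and fake below -1.
    return F.relu(1.0 - real_scores).mean() + F.relu(1.0 + fake_scores).mean()

def g_hinge_loss(fake_scores):
    # Generator hinge loss: raise the discriminator's score on generated data.
    return -fake_scores.mean()
```

The asymmetric learning rates in the list above (1e-4 for the generator, 4e-4 for the discriminator) follow the common two-timescale heuristic for stabilizing GAN training.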
6. Software synthesis and image storage

The software implementation is based on the Google Cloud Platform (GCP) cloud infrastructure. This approach allows efficient use of data storage resources. The program infrastructure is shown in Figure 6.

Figure 6: Cloud infrastructure of the program

The software implementation has two main files: train.py and generate.py. The train.py file trains the PyTorch GAN model; it defines the architecture and parameters of the model. To train the model, we use the Vertex AI service, which provides infrastructure and tools for efficient model training (with GPU support). The FID and IS metrics are used to evaluate the performance of the trained model. After training is complete, the model is uploaded to Cloud Storage.

Next, the trained GAN model is deployed on the Vertex AI platform, which produces an endpoint URL. This address can be accessed to create images. The deployment process involves specifying the necessary computing resources and packaging our model code into a Docker container.

Once the model is deployed, its master data is stored in a Cloud SQL instance. This database is a central repository for the model endpoint URLs and the image quality metrics, which makes it possible to track model versions and their associated performance characteristics.

The generate.py file is used to create new images. Users specify the number of images to create; an optional second parameter is the identifier of the model used for image synthesis (by default, the last model added to the database is used). For example, to generate 1000 images with model ID 5, run: python generate.py 1000 5. This script interacts with the URL endpoint of the deployed model, sending requests to generate images.

The generated images are stored in GCP cloud storage, while information about the created images is written to a Cloud SQL instance of PostgreSQL. The database schema is shown in Figure 7.

Figure 7: PostgreSQL database schema

Thus, the software infrastructure integrates Python, PyTorch, and Google Cloud Platform services such as Vertex AI, Cloud Storage, and Cloud SQL for PostgreSQL to facilitate GAN model training, deployment, and management. The generate.py script provides a convenient interface for creating new images while maintaining a record of the model's performance and generated content in cloud storage and the database. This architecture makes it possible to efficiently generate images for various purposes.
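The paper does not reproduce generate.py itself, so the following is a minimal sketch of how such a script could call a deployed Vertex AI endpoint through the google-cloud-aiplatform client. The project name, endpoint resource name, and request payload are assumptions; in the described system, the endpoint URL would be looked up in Cloud SQL by model ID.

```python
# Sketch of a generate.py-style client (illustrative; the real script,
# payload schema, and endpoint lookup in Cloud SQL may differ).
import sys
from google.cloud import aiplatform

def main():
    num_images = int(sys.argv[1])                          # e.g. 1000
    model_id = sys.argv[2] if len(sys.argv) > 2 else None  # e.g. "5" (optional)

    aiplatform.init(project="my-gcp-project", location="us-central1")  # hypothetical
    # A hard-coded endpoint resource name stands in for the Cloud SQL lookup.
    endpoint = aiplatform.Endpoint("projects/.../locations/.../endpoints/...")

    # Ask the deployed model for a batch of images; the instance format
    # is an assumption about the model's serving signature.
    prediction = endpoint.predict(instances=[{"num_images": num_images}])
    print(f"received {len(prediction.predictions)} generated images")

if __name__ == "__main__":
    main()
```

Called as python generate.py 1000 5, such a script would resolve model 5's endpoint, request 1000 images, and store them along with their metadata.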
7. Computer experiments

A dataset of histological images measuring 64 by 64 pixels was used for the computer experiments. The dataset is divided into three classes, and the total number of original images is 185. This sample was expanded to approximately 700 images by applying a set of affine transformations (random rotation, translation, and scaling). The characteristics of the initial dataset are given in Table 2.

Table 2
Characteristics of the initial dataset

Type of images: color RGB histological images
Image classes: G1, G2, G3
Example images: one sample image per class (shown in the original table)
Resolution: 64 by 64 pixels (all classes)
Total number of images in the dataset: 700

For the experiments we used a GCP n1-standard-4 virtual machine: 15 GB RAM, 4 vCPUs, and an Nvidia Tesla V100 GPU with 16 GB (13.2 TFLOPS). The network training process lasted about 11 hours. The Inception Score (IS) and FID metrics were used to evaluate the network; the resulting values are IS = 3.025 and FID = 68.

As a result of the experiments, 2000 artificial images were generated for each class. The resolution of the generated images is 64 by 64 pixels. Examples of these images are shown in Figure 8.

Figure 8: Example of generated images

The results of the experiments are stored in the Cloud SQL database. Examples of records from the database are shown in Figure 9.

Figure 9: Records from the tables of generated images

8. Conclusions

1. The known and available training datasets were analyzed, revealing the problem of rare classes in biomedical image datasets.
2. Global image storage platforms were analyzed and their characteristic functions were highlighted. The conducted analysis showed that the number of artificial databases of biomedical images is small.
3. A method for generating artificial images has been developed, which consists of the following steps: affine distortions of the original images, generation of images based on a GAN, and evaluation of the quality of the generated images.
4. A GAN architecture consisting of a generator and a discriminator has been developed. The basis of both the generator and the discriminator is the ResNet block, and a self-attention mechanism is used in both networks.
5. A software module for generating and storing artificial images was developed in the Python programming language. The software infrastructure combines Python, PyTorch, and Google Cloud Platform services such as Vertex AI, Cloud Storage, and Cloud SQL for PostgreSQL.
6. A database of 2000 artificial images per class was created through computer experiments. The images have a resolution of 64x64 pixels.
7. Image quality was assessed based on the IS and FID metrics. The obtained values are IS = 3.025 and FID = 68.

9. References

[1] Data.gov. URL: https://data.gov
[2] Kaggle Datasets. URL: https://www.kaggle.com/datasets
[3] Zenodo. URL: https://zenodo.org
[4] Google Dataset Search. URL: https://datasetsearch.research.google.com
[5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., Generative Adversarial Nets, in: Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, 2014, pp. 2672-2680.
[6] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification, Neurocomputing 321 (2018) 321-331.
[7] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, C. S. Greene, Privacy-preserving generative deep neural networks support clinical data sharing, Circulation: Cardiovascular Quality and Outcomes 12(7). doi:10.1101/159756.
[8] A. Chartsias, T. Joyce, R. Dharmakumar, S. A. Tsaftaris, Adversarial Image Synthesis for Unpaired Multi-modal Cardiac Data, in: S. Tsaftaris, A. Gooya, A. Frangi, J. Prince (Eds.), Simulation and Synthesis in Medical Imaging, SASHIMI 2017, volume 10557 of Lecture Notes in Computer Science, Springer, Cham. doi:10.1007/978-3-319-68127-6_1.
[9] T. Schlegl, P. Seebock, S. M. Waldstein, U. Schmidt-Erfurth, G. Langs, Unsupervised anomaly detection with generative adversarial networks to guide marker discovery, in: International Conference on Information Processing in Medical Imaging, Springer, 2017, pp. 146-157.
[10] M. J. M. Chuquicusma, S. Hussein, J. R. Burt, U. Bagci, How to fool radiologists with generative adversarial networks? A visual Turing test for lung cancer diagnosis, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), 2018, pp. 240-244.
[11] D. Nie et al., Medical Image Synthesis with Deep Convolutional Adversarial Networks, IEEE Transactions on Biomedical Engineering 65(12) (2018) 2720-2730. doi:10.1109/TBME.2018.2814538.
[12] C. Bermudez et al., Learning Implicit Brain MRI Manifolds with Deep Learning, in: Proceedings of SPIE, the International Society for Optical Engineering, vol. 10574 (2018) 105741L. doi:10.1117/12.2293515.
[13] C. Baur, S. Albarqouni, N. Navab, MelanoGANs: High resolution skin lesion synthesis with GANs, CoRR, 2018. doi:10.48550/arXiv.1804.04338.
[14] C. Baur, S. Albarqouni, N. Navab, Generating highly realistic images of skin lesions with GANs, OR 2.0/CARE/CLIP/ISIC@MICCAI (2018). doi:10.1007/978-3-030-01201-4_28.
[15] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive Growing of GANs for Improved Quality, Stability, and Variation, ArXiv abs/1710.10196 (2017).
[16] L. Hou, A. Agarwal, D. Samaras, T. Kurc, R. R. Gupta, J. Saltz, Unsupervised Histopathology Image Synthesis, ArXiv abs/1712.05021 (2017).
[17] Y. Xue et al., Selective Synthetic Augmentation with HistoGAN for Improved Histopathology Image Classification, Medical Image Analysis 67 (2021) 101816. doi:10.1016/j.media.2020.101816.
[18] S. Butte, H. Wang, M. Xian, A. Vakanski, Sharp-GAN: Sharpness Loss Regularized GAN for Histopathology Image Synthesis, in: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), 2022, pp. 1-5.
[19] M. Li, C. Li, C. Peng, B. C. Lovell, Conditioned Generative Transformers for Histopathology Image Synthetic Augmentation (2022). doi:10.48550/arXiv.2212.09977.
[20] O. Berezsky, O. Pitsun, G. Melnyk, T. Datsko, I. Izonin, B. Derysh, An Approach toward Automatic Specifics Diagnosis of Breast Cancer Based on an Immunohistochemical Image, Journal of Imaging 9(1) (2023) 12. doi:10.3390/jimaging9010012.
[21] O. Berezsky, O. Pitsun, P. Liashchynskyi, B. Derysh, N. Batryn, Computational Intelligence in Medicine, volume 149 of Lecture Notes on Data Engineering and Communications Technologies, Springer, Cham, 2023, pp. 488-510. doi:10.1007/978-3-031-16203-9_28.
[22] O. Berezsky, P. Liashchynskyi, O. Pitsun, P. Liashchynskyi, M. Berezkyy, Comparison of Deep Neural Network Learning Algorithms for Biomedical Image Processing, CEUR Workshop Proceedings 3302 (2022) 135-145.
[23] O. M. Berezsky, P. B. Liashchynskyi, Comparison of generative adversarial networks architectures for biomedical images synthesis, Applied Aspects of Information Technology 4(3) (2021) 250-260. doi:10.15276/aait.03.2021.4.
[24] O. Berezsky, O. Pitsun, N. Batryn, T. Datsko, K. Berezska, L. Dubchak, Modern automated microscopy systems in oncology, in: Proceedings of the 1st International Workshop on Informatics & Data-Driven Medicine, Lviv, Ukraine, 28-30 November 2018, pp. 311-325.
[25] The CAMELYON16 challenge. URL: https://camelyon16.grand-challenge.org/Data/
[26] B. E. Bejnordi, M. Veta, P. J. van Diest, et al., Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer, Journal of the American Medical Association 318 (2017) 2199-2210. doi:10.1001/jama.2017.14585.
[27] M. Veta et al., Predicting breast tumor proliferation from whole-slide images: the TUPAC16 challenge, Medical Image Analysis 54 (2019) 111-121. doi:10.1016/j.media.2019.02.012.
[28] Tumor Proliferation Assessment Challenge. URL: https://tupac.grand-challenge.org
[29] P. Liashchynskyi, Rudi library. URL: https://github.com/liashchynskyi/rudi