<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Fingerprint Identification of Generative Models Using a MultiFormer Ensemble Approach Notebook for ImageCLEF Lab at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Md</forename><forename type="middle">Ismail</forename><surname>Siddiqi Emon</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Computer Science</orgName>
								<orgName type="department" key="dep2">SCMNS School</orgName>
								<orgName type="institution">Morgan State University</orgName>
								<address>
									<postCode>21251</postCode>
									<settlement>Baltimore</settlement>
									<region>Maryland</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mahmudul</forename><surname>Hoque</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Computer Science</orgName>
								<orgName type="department" key="dep2">SCMNS School</orgName>
								<orgName type="institution">Morgan State University</orgName>
								<address>
									<postCode>21251</postCode>
									<settlement>Baltimore</settlement>
									<region>Maryland</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><roleName>Md</roleName><forename type="first">Rakibul</forename><surname>Hasan</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Computer Science</orgName>
								<orgName type="department" key="dep2">SCMNS School</orgName>
								<orgName type="institution">Morgan State University</orgName>
								<address>
									<postCode>21251</postCode>
									<settlement>Baltimore</settlement>
									<region>Maryland</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fahmi</forename><surname>Khalifa</surname></persName>
							<email>fahmi.khalifa@morgan.edu</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Electrical &amp; Computer Engineering Dept</orgName>
								<orgName type="department" key="dep2">School of Engineering</orgName>
								<orgName type="institution">Morgan State University</orgName>
								<address>
									<postCode>21251</postCode>
									<settlement>Baltimore</settlement>
									<region>Maryland</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mahmudur</forename><surname>Rahman</surname></persName>
							<email>md.rahman@morgan.edu</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Computer Science</orgName>
								<orgName type="department" key="dep2">SCMNS School</orgName>
								<orgName type="institution">Morgan State University</orgName>
								<address>
									<postCode>21251</postCode>
									<settlement>Baltimore</settlement>
									<region>Maryland</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Fingerprint Identification of Generative Models Using a MultiFormer Ensemble Approach Notebook for ImageCLEF Lab at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3925281B8D619EF3E0EB91519BA8E4D3</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:00+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>GAN Fingerprint</term>
					<term>Multiformer</term>
					<term>BLIP2</term>
					<term>Generative Model Fingerprint</term>
					<term>Training Data Fingerprint</term>
					<term>DINOv2</term>
					<term>Late Fusion</term>
					<term>Thresholding</term>
					<term>Reranking</term>
					<term>ARI Scoring</term>
					<term>CT image denoising</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the ever-changing realm of medical image processing, ImageCLEF brought a new dimension with the Identifying GAN Fingerprint task, catering to the advancement of visual media analysis. This year, the organizers presented the task of detecting training image fingerprints to control the quality of synthetic images for the second time (as Task 1) and introduced the task of detecting generative model fingerprints for the first time (as Task 2). Both tasks aim to discern fingerprints in images, whether left by the real training data or by the generative models themselves. The dataset utilized encompassed 3D CT images of lung tuberculosis patients, with the development dataset featuring a mix of real and generated images and a separate test dataset. Our team 'CSMorgan' contributed several approaches, leveraging multiformer (combined features extracted using BLIP2 and DINOv2) networks, additive and mode thresholding techniques, and late fusion methodologies, bolstered by morphological operations. In Task 1, our optimal performance was attained through a late fusion-based reranking strategy, achieving an F1 score of 0.51, while the additive average thresholding approach closely followed with a score of 0.504. In Task 2, our multiformer model garnered an impressive Adjusted Rand Index (ARI) score of 0.90, and a fine-tuned variant of the multiformer yielded a score of 0.8137. These outcomes underscore the efficacy of the multiformer-based approach in accurately discerning both real image and generative model fingerprints.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>ImageCLEF <ref type="bibr" target="#b0">[1]</ref> was established in 2004 and has since pioneered advancements in medical imaging. A significant evolution was the inclusion of medical Generative Adversarial Network (GAN) tasks, the first iteration of which came in 2023; the task has now advanced to its second edition <ref type="bibr" target="#b1">[2]</ref>. This task explores the hypothesis that GANs might embed "fingerprints" of real images within the synthetic medical images they generate. Confirming this hypothesis could have significant implications: it might lead to a reconsideration of the copyright status of synthetic images, challenging the conventional view that they are entirely artificial. In recent years, there has been a substantial surge in the application of GANs and diffusion models within the medical domain <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>. These sophisticated architectures are capable of generating synthetic images and facilitating their translation across different modalities. This burgeoning utilization underscores the transformative potential of GANs and diffusion models in enhancing medical imaging and diagnostics. Medical imaging professionals have been exploring various applications of GANs in medical image analysis, such as creating artificial medical images and distinguishing between real and fake images. They have developed effective architectures like "Attention GAN" <ref type="bibr" target="#b5">[6]</ref> and "ABC GAN" <ref type="bibr" target="#b6">[7]</ref> to produce lifelike medical images, which assist with various tasks, including training AI models and protecting patient privacy. 
However, despite these advancements <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>, a major challenge remains: differentiating between real and synthetic medical images. This is an area where scientists continue to focus their efforts. Generative models <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>, a recent AI innovation, have driven significant advancements across multiple domains <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>, including generative medical imaging <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19]</ref>. For example, the authors of <ref type="bibr" target="#b16">[17]</ref> show that synthesizing high-resolution images of skin lesions with Generative Adversarial Networks (GANs) can address the lack of labeled data and skewed class distributions in skin image analysis. Using progressive growing, they produce realistic dermoscopic images that are difficult for expert dermatologists to distinguish from real ones, outperforming other GAN architectures like DCGAN and LAPGAN. Synthetic biomedical images serve vital roles in research, healthcare professional training <ref type="bibr" target="#b19">[20]</ref>, and patient care enhancement, with methods ranging from anisotropic diffusion <ref type="bibr" target="#b20">[21]</ref> to AnoGAN, an unsupervised deep convolutional GAN that identifies anomalies in imaging data as disease markers, demonstrated by accurately detecting anomalies in retinal OCT images <ref type="bibr" target="#b21">[22]</ref>. Additionally, recent work <ref type="bibr" target="#b22">[23]</ref> demonstrates that discriminating between malignant and benign lung nodules remains challenging, necessitating computer-aided diagnosis (CAD) systems to assist radiologists. 
Using unsupervised learning with Deep Convolutional Generative Adversarial Networks (DC-GANs), they aim to generate realistic lung nodule samples, hypothesizing that difficult-to-differentiate imaging features will be highly discriminative, thereby improving diagnostic accuracy, training radiologists, and generating realistic samples for deep network training. They address challenges such as data scarcity, cost, and ethical considerations associated with real patient data acquisition.</p><p>The 2024 ImageCLEFmedical GANs Task provides a forum to investigate the influence of GANs on the generation of artificial biomedical images, facilitating the examination of the potential advantages and ethical concerns of their application. This includes two basic objectives: (1) how to identify fingerprints of training data in synthetic biomedical images (inspection), and (2) how to find fingerprints of different generative models on the images that they generate (differentiation). This essentially allows researchers to compare models and highlight the characteristics, patterns, or features present in synthetic images that distinguish one model from another. The dataset includes axial slices of 3D CT images of around 8000 lung tuberculosis patients and acts as a great resource for research.</p><p>Task 1 aims to identify the real images from which the GAN images were generated, so as to address privacy and security concerns associated with the use of artificial images <ref type="bibr" target="#b17">[18]</ref>. The datasets were provided by the organizers, with a development set divided into labeled artificial and real training images; no information is disclosed about the percentage of images from this set that were used or unused. 
This extensive dataset is useful to rigorously test the hypotheses and push the community forward with discoveries in biomedical image synthesis.</p><p>Task 2 is to detect the fingerprints of generative models in GAN-generated images. The authors of ImageCLEFmed GANs argue that one way to understand this behavior is to envision that each AI model imparts its own distinct "signature" on the images it generates. Our aim, therefore, is to reveal these hidden signatures to identify what makes each model unique. It is akin to reading differently styled handwriting from different authors, but in the AI domain. We do not simply aim to differentiate models but rather to gain a deeper understanding of them, by examining the hidden patterns and subtleties contained within the synthetic images.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Datasets</head><p>In the second edition of the ImageCLEFmed GANs challenge, the organizers presented two tasks: "Identify training data fingerprints" and "Detect generative models' fingerprints."</p><p>For the first task, the hypothesis is that when images are generated using diffusion models, the original real images leave specific fingerprints in the generated images. If this hypothesis is true, it could impose additional restrictions on publishing or sharing such images publicly, as they would be as sensitive as the original images. Conversely, if the hypothesis is false, it could lead to a vast dataset of artificially generated images using diffusion models, potentially revolutionizing the medical imaging field. The second task, although different in its subjective nature, shares a similar objective: identifying generative model fingerprints in images produced by various diffusion models. The challenge is to determine whether these models imprint unique fingerprints in the generated images. The organizers did not disclose the number of diffusion models used, but the approach remains consistent regardless of the quantity.</p><p>The organizers of the ImageCLEFmed GANs task provided both a development set and a test set. For Task 1, there were two datasets consisting of axial slices of 3D CT images from approximately 8,000 lung tuberculosis patients. The artificial slice images, sized at 256×256 pixels, were generated using various undisclosed generative adversarial networks and diffusion neural networks. Over 12,000 generated images were included, and the test set contained a total of 8,000 images. For Task 2, a dataset comprising 3,000 generated image files was provided. Figure <ref type="figure" target="#fig_0">1</ref> depicts examples of images from the original dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Proposed Methodology</head><p>Here we explain the details of the approaches utilized in our submissions for both tasks: "Identify training data fingerprints" and "Detect generative models' fingerprints". For the task of identifying training data fingerprints, we first performed morphological operations to reduce noise in the CT images. This preprocessing step was crucial for improving the quality of synthetic medical images generated by GANs. Subsequently, we implemented BLIP and DINOv2 as image signature generators. As illustrated in Figure <ref type="figure" target="#fig_1">2</ref>, these morphological operations helped control the quality of the images. After preprocessing, we conducted individual feature rankings for each model and then concatenated the feature rankings from both models. We also performed dimensionality reduction on the concatenated features to enhance the ranking process. Finally, we applied late fusion to combine the results from the previous steps, optimizing the identification of training data fingerprints. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Morphological operations</head><p>In image processing, opening is a widely utilized morphological operation. It primarily aims to remove small objects from 3D axial CT images while preserving the size and shape of larger structures, which makes it an effective noise-suppression mechanism. As illustrated in Figure <ref type="figure" target="#fig_2">3</ref>, electronic noise in CT images often originates from the combination of the detector system and the reconstruction kernel, with sharper kernels typically resulting in noisier images. According to <ref type="bibr" target="#b23">[24]</ref>, this noise is a consequence of efforts to enhance image quality without increasing the radiation dose. Our hypothesis is that applying image opening will eliminate small, noisy, and irrelevant details, thereby potentially enhancing the efficiency of fingerprint detection in medical images. By preserving key features, image opening maintains the integrity of essential image details, ensuring that the important structures remain intact while the noise is reduced.</p><p>Image opening consists of two main steps: erosion (Eq. 1) followed by dilation (Eq. 2). Image opening (Eq. 3) can be represented as:</p><formula xml:id="formula_0">𝐴 ⊖ 𝐵 = {𝑧 ∈ 𝐸|𝐵 𝑧 ⊆ 𝐴}<label>(1)</label></formula><formula xml:id="formula_1">𝐴 ⊕ 𝐵 = ⋃︁ 𝑏∈𝐵 𝐴 𝑏<label>(2)</label></formula><formula xml:id="formula_2">𝐴 ∘ 𝐵 = (𝐴 ⊖ 𝐵) ⊕ 𝐵<label>(3)</label></formula><p>In equation (<ref type="formula" target="#formula_0">1</ref>), image erosion is performed, where 𝐴 and 𝐵 are the input image and the structuring element, respectively, and 𝐵 𝑧 is the translation of 𝐵 by 𝑧. In equation (<ref type="formula" target="#formula_1">2</ref>), image dilation is performed, where 𝐴 𝑏 is the translation of 𝐴 by 𝑏. Equation (<ref type="formula" target="#formula_2">3</ref>) is the image opening operation.</p></div>
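As a concrete illustration of Eqs. (1)-(3), the sketch below applies greyscale opening to a toy slice using `scipy.ndimage` erosion and dilation primitives; the 3×3 structuring element and the toy image are our illustrative choices, not the exact settings used in the pipeline.

```python
import numpy as np
from scipy import ndimage

def open_image(img, size=3):
    """Morphological opening: erosion followed by dilation with the
    same structuring element, as in Eqs. (1)-(3)."""
    eroded = ndimage.grey_erosion(img, size=(size, size))
    return ndimage.grey_dilation(eroded, size=(size, size))

# Toy CT-like slice: one large bright structure plus an isolated noise speck.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0   # large structure (should be preserved)
img[1, 1] = 1.0          # small "electronic noise" speck (should be removed)

opened = open_image(img, size=3)
```

After opening, the isolated speck is gone while the large square survives intact, matching the noise-suppression behavior described above.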
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Multiformer</head><p>To build the multiformer, we chose two foundation models, BLIP and DINOv2, as backbone architectures (see Fig. <ref type="figure" target="#fig_3">4</ref>); details are given below.</p></div>
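A minimal sketch of how such a multiformer signature can be assembled by concatenating per-backbone embeddings. The embedding dimensions (768 for BLIP, 384 for DINOv2) and the per-backbone L2 normalization are our illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def multiformer_signature(f_blip, f_dino):
    """Concatenate L2-normalized BLIP and DINOv2 embeddings into a single
    image signature; normalizing each half keeps one backbone from
    dominating the distance computations used later for ranking."""
    f_blip = f_blip / np.linalg.norm(f_blip)
    f_dino = f_dino / np.linalg.norm(f_dino)
    return np.concatenate([f_blip, f_dino])

# Hypothetical embedding dimensions for the two backbones.
sig = multiformer_signature(np.ones(768), np.ones(384))
```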
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">BLIP Architecture</head><p>Bootstrapping Language-Image Pre-training 2 (BLIP2) <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b25">26]</ref> utilizes a Vision Transformer (ViT) <ref type="bibr" target="#b26">[27]</ref> for image encoding, dividing images into patches and encoding them into a sequence of embeddings with a [CLS] token representing the global image feature. This method is computationally efficient compared to traditional object detectors. BLIP is a multimodal Mixture of Encoder-Decoder (MED) that operates in three modes: unimodal encoder <ref type="bibr" target="#b27">[28]</ref> (similar to BERT for text), image-grounded text encoder (incorporates visual information via cross-attention layers), and image-grounded text decoder (uses causal self-attention layers for text generation). During pre-training, three objectives are jointly optimized: two for understanding and one for generation. Each image-text pair undergoes one forward pass through the ViT and three through the text transformer for the different tasks. The Image-Text Contrastive Loss (ITC), inspired by <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b29">30]</ref>, aligns the visual and textual feature spaces, improving vision-language understanding by encouraging positive image-text pairs to have similar representations. The Image-Text Matching Loss (ITM) <ref type="bibr" target="#b29">[30]</ref> learns fine-grained multimodal representations through a binary classification task to determine whether image-text pairs are matched, utilizing a hard negative mining strategy. The Language Modeling Loss (LM) trains the model to generate textual descriptions from images using cross-entropy loss, enhancing the model's capability to convert visual information into coherent captions. 
To maximize efficiency and leverage multi-task learning, the text encoder and decoder share parameters except for the self-attention layers, which capture the differences between encoding and decoding tasks. This shared architecture benefits from improved training efficiency and effective multi-task learning.</p><p>The BLIP2 feature extractor is a multi-modal model designed to extract and integrate features from both images and text. It begins with patch embedding for images, converting each image 𝑋 into patches 𝑃 using a convolutional layer, 𝑃 = Conv(𝑋). These patches are then processed through a transformer encoder, 𝐹 𝑣 = Transformer 𝑣 (𝑃 ), to capture visual relationships. Text data 𝐼 is tokenized and embedded into a high-dimensional space, 𝐸 𝑡 = Embedding(𝐼), and passed through another transformer encoder, 𝐹 𝑡 = Transformer 𝑡 (𝐸 𝑡 ). Cross-modal attention mechanisms, 𝐹 𝑣𝑡 = CrossAttention(𝐹 𝑣 , 𝐹 𝑡 ), are used to align and integrate the visual and textual features, bringing the model's performance to the state of the art. The model is pre-trained on large datasets with paired images and text to learn these representations. The pre-trained model can be fine-tuned for downstream tasks to improve performance on specific datasets. The BLIP2 feature extractor thus provides robust, high-level features suitable for tasks such as image classification, object detection, and text-image matching.</p></div>
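The cross-attention step 𝐹 𝑣𝑡 = CrossAttention(𝐹 𝑣 , 𝐹 𝑡 ) can be sketched in NumPy as scaled dot-product attention in which text tokens query the visual patch embeddings. The random projection matrices below stand in for learned weights, and the dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(F_t, F_v, W_q, W_k, W_v):
    """F_vt = CrossAttention(F_v, F_t): text tokens (queries) attend over
    visual patch tokens (keys/values) via scaled dot-product attention."""
    Q, K, V = F_t @ W_q, F_v @ W_k, F_v @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (n_text, n_patches)
    return softmax(scores, axis=-1) @ V          # (n_text, d)

d = 32
F_v = rng.normal(size=(16, d))   # 16 visual patch embeddings (F_v)
F_t = rng.normal(size=(4, d))    # 4 text token embeddings (F_t)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
F_vt = cross_attention(F_t, F_v, W_q, W_k, W_v)
```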
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">DINOv2 Architecture</head><p>DINOv2 <ref type="bibr" target="#b30">[31]</ref> is an enhanced version of DINO <ref type="bibr" target="#b31">[32]</ref>, integrating various improvements and using a larger, more diverse dataset to accelerate and stabilize training at scale. It utilizes the LVD-142M curated dataset, which includes data from sources like ImageNet <ref type="bibr" target="#b32">[33]</ref>, Google Landmarks <ref type="bibr" target="#b33">[34]</ref>, Mapillary SLS <ref type="bibr" target="#b34">[35]</ref>, and Food-101 <ref type="bibr" target="#b36">[36]</ref>. The dataset is de-duplicated using FAISS <ref type="bibr" target="#b37">[37]</ref> batch searches and embeddings. Training combines DINO and iBOT losses with SwAV centering, involving a learnable student and an EMA teacher. Key techniques include multi-crop cross-entropy for global image representation, patch-level masking, and separate weights for image and patch objectives. The teacher's softmax centering is replaced by Sinkhorn-Knopp batch normalization, and the KoLeo regularizer ensures batch uniformity. High-resolution images are used towards the end of pre-training. DINOv2's implementation features FlashAttention <ref type="bibr" target="#b38">[38,</ref><ref type="bibr" target="#b39">39]</ref> for efficiency, nested tensors from xFormers, stochastic depth, and mixed-precision PyTorch FSDP. Distillation <ref type="bibr" target="#b40">[40,</ref><ref type="bibr" target="#b41">41]</ref> into smaller models from a larger teacher model is also included. Ablation studies cover model selection, data curation strategies, model scaling, and loss objectives. DINOv2 achieves results comparable to weakly supervised text models like EVA-CLIP on ImageNet and performs better than SSL methods like Mugs, EsViT, and iBOT. It shows strong performance in domain generalization, image and video classification, instance recognition, and semantic segmentation. 
For depth estimation, DINOv2 uses a linear layer on frozen tokens and experiments with concatenation of ViT <ref type="bibr" target="#b26">[27]</ref> layers/blocks and regression over a DPT decoder, achieving superior results on datasets like NYUd, KITTI, and SUN-RGBd. Qualitative results demonstrate effective semantic matching and foreground extraction using PCA on patch features.</p><p>We utilized the DINO Vision Transformer because it is a sophisticated image processing model capable of handling and analyzing images through a systematic series of steps. Below is a description of how its architecture operates. Initially, the input image is divided into smaller patches, which are then embedded into a higher-dimensional space using a convolutional layer. This embedding captures the local features of each patch. These embedded patches pass through multiple nested tensor blocks, each consisting of several components: layer normalization to standardize inputs, memory-efficient attention mechanisms to focus on different parts of the image, and multi-layer perceptrons for feature refinement. Each block also includes layer scaling and dropout layers to improve training stability and prevent overfitting. The model incorporates 12 of these blocks, creating a deep network capable of learning complex image representations. After processing through all blocks, a final normalization layer is applied, followed by an identity layer that prepares the features for subsequent tasks. This architecture allows the DINO ViT to effectively learn and process detailed image features, making it suitable for various image processing applications.</p></div>
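The patch-embedding step described above (splitting the input image into patches before the transformer blocks) can be sketched as follows. DINOv2 uses 14×14 patches; the linear projection into the embedding space, implemented in practice as a strided convolution, is omitted here for brevity.

```python
import numpy as np

def patchify(img, p=14):
    """Split an (H, W) image into flattened p×p patches, row-major,
    as done before the linear patch projection in a ViT."""
    H, W = img.shape
    return img.reshape(H // p, p, W // p, p).swapaxes(1, 2).reshape(-1, p * p)

# A 224×224 input yields a 16×16 grid of 14×14 patches: 256 tokens of 196 values.
img = np.arange(224 * 224, dtype=float).reshape(224, 224)
tokens = patchify(img)
```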
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Autoencoder for feature reduction</head><p>Initially, we employed raw features for clustering to assign labels to the test set. Subsequently, we integrated robust features from both the BLIP and DINOv2 models. To enhance the clustering results, we implemented an Autoencoder to encode this extensive feature set into a more meaningful and reduced representation. This dimensionality reduction facilitated improved clustering performance. Autoencoders (AEs) have become popular in AI research over recent years, and consequently many studies and advancements have been made on the subject, which is broadly divided into fundamental AEs and their variants. The simplest form, as used in our work, is an auto-associative neural network: a multi-layer perceptron in which the input is reconstructed. An AE consists of an encoder, which compresses an input vector into a code vector using recognition weights, and a decoder, which reconstructs the input vector from the code vector using generative weights. This structure allows each layer of a deep network to be trained separately using the basic AE as a building block. The hidden representation of an input vector 𝑥 is computed as 𝑦 = 𝑓 Θ (𝑥) = 𝑠𝑓 (𝑊 𝑥 + 𝑏), where 𝑊 is a weight matrix, 𝑏 is a bias vector, and 𝑠𝑓 is the encoder activation function (e.g., sigmoid or hyperbolic tangent). The hidden representation 𝑦 is then decoded back to a reconstruction vector 𝑧 using 𝑧 = 𝑔 Θ (𝑦) = 𝑠𝑔(𝑊 ′ 𝑦 + 𝑏 ′ ), where 𝑠𝑔 is the activation function of the decoder. The aim is to make the reconstruction 𝑧 as close to the input 𝑥 as possible. 
To simplify training, the weight matrix 𝑊 ′ is often constrained to be the transpose of 𝑊 (tied weights), reducing the number of free parameters. This mapping process ensures that each input is transformed into a hidden representation and then reconstructed, enabling effective learning and data representation.</p></div>
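A minimal NumPy sketch of the tied-weight autoencoder described above, with 𝑊 ′ = 𝑊 ᵀ. The training loop is omitted, and the layer sizes (a 1024-dimensional input compressed to a 64-dimensional code) are illustrative stand-ins for the concatenated BLIP+DINOv2 feature dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TiedAutoencoder:
    """Basic AE with tied weights: the decoder reuses W.T (W' = W^T),
    halving the number of free weight parameters."""
    def __init__(self, n_in, n_hidden):
        self.W = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.b = np.zeros(n_hidden)        # encoder bias b
        self.b_prime = np.zeros(n_in)      # decoder bias b'

    def encode(self, x):                   # y = sf(W x + b)
        return sigmoid(self.W @ x + self.b)

    def decode(self, y):                   # z = sg(W' y + b'), W' = W.T
        return sigmoid(self.W.T @ y + self.b_prime)

ae = TiedAutoencoder(n_in=1024, n_hidden=64)  # hypothetical feature sizes
x = rng.random(1024)
y = ae.encode(x)                              # reduced representation for clustering
z = ae.decode(y)                              # reconstruction to compare with x
```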
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Result Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">System Specification and Parameter Settings</head><p>The proposed model is implemented on a Google Cloud Vertex AI instance, utilizing an NVIDIA V100 Tensor Core GPU. This GPU provides 5120 CUDA cores, 640 Tensor cores, up to 32 GB of HBM2 memory, and a memory bandwidth of up to 900 GB/s. These specifications allow for highly efficient processing and accelerated computation, which are essential for handling the complex tasks and large datasets involved in our model's execution.</p><p>For the BLIP architecture, the following parameters were used for feature extraction. The input image size was normalized to 256×256 pixels, followed by data augmentations of random cropping, horizontal flipping with probability 𝑝 = 0.5, rotation within ±20 ∘ , and color jittering with brightness adjustment factors [0.75, 1.25]. Feature extraction used the pre-trained ViT large model as the backbone. A learning rate 𝜂 of 0.0005 and a batch size 𝐵 of 16 were employed. The AdamW optimizer with a weight decay factor was used to minimize the loss function 𝐿(𝜃), and normalization layers were used to scale the features to a common range across the dataset.</p><p>For the DINOv2 architecture, a number of key parameters were defined to optimize its performance. The model was pre-trained on a large, well-curated set of images, making it a strong base for later fine-tuning. The input image size was kept uniform during training at 224 × 224 pixels, and several data augmentations, such as center cropping, random affine transformations, resizing, and normalizing, were performed. We used a learning rate 𝜂 of 0.001 and a batch size 𝐵 of 32. The Adam optimizer with 𝛽 1 = 0.9 and 𝛽 2 = 0.999 was used to minimize the loss function 𝐿(𝜃), where 𝜃 denotes the model parameters. 
We applied a dropout with a probability of 𝑝 = 0.5 and layer normalization to prevent overfitting and maintain training stability.</p><p>For both DINOv2 and BLIP, we carefully tuned these parameters to achieve the highest possible performance in identifying generative model fingerprints and training image fingerprints. Executing these procedures with the chosen data augmentations, learning rates, and optimization techniques yielded high-quality, stable embeddings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Identify Training Data Fingerprints Experiments</head><p>For identifying training data fingerprints, we first reduced noise in the CT images through morphological operations. We then used BLIP and DINOv2 as image signature generators. As shown in Figure <ref type="figure" target="#fig_1">2</ref>, these steps improved the quality of synthetic medical images generated by GANs. We ranked features individually and after concatenation, performed dimensionality reduction, and used late fusion to refine our fingerprint identification results.</p><p>For submission 1, we implemented the "additive mode thresholding" technique, which considers local variations in image intensity to enhance image processing. First, we reduced the dimension of the feature vector using Principal Component Analysis (PCA). We then combined all the features into a single weighted score per image. The mode of these weighted scores was used as the threshold. For the test images, we applied the same weighting approach: if the weighted value was less than the mode, the image was tagged as not used; otherwise, it was tagged as used. This method allowed us to account for local intensity variations, improving the accuracy of our thresholding.</p><p>In submission 2, which we titled "additive average thresholding," we took a different approach. We calculated the final result for each subject and then averaged these results across all subjects. This average became the threshold value for classification. By using the average, we aimed to create a more generalized threshold that could effectively classify the images based on the overall distribution of the data.</p><p>For submission 3, we used an encoder model to handle the extensive feature set generated by the backbone models. The encoder compressed this concatenated feature set, reducing its dimensionality. With the reduced feature set, we applied both mode and mean thresholding techniques. 
This dual approach allowed us to leverage the strengths of both thresholding methods, providing a robust classification mechanism.</p><p>In submission 5, we employed a late fusion strategy to combine the decisions from the previous four methods. Late fusion involves aggregating the results at the decision level rather than at the feature level. We used majority voting to finalize the classification, ensuring that the combined decisions of the different methods provided a more accurate and reliable result. This ensemble approach helped to mitigate the weaknesses of individual methods and improved the overall performance of our classification system.</p><p>For the final submission 6, we performed reranking using the Agglomerative Clustering algorithm. This algorithm conducts hierarchical clustering with a bottom-up approach, allowing us to specify parameters such as the number of clusters, distance metric, and linkage criterion. The reranking was based on decisions from the previous submissions.</p></div>
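The additive mode and average thresholding steps described above can be sketched as follows. This is an illustrative NumPy reading of the procedure: the uniform weighting and the histogram approximation of the mode are our assumptions, and the function names are hypothetical:

```python
import numpy as np

def weighted_scores(features, weights=None):
    """Collapse each subject's feature vector into a single scalar:
    a weighted sum of the features, normalized by the total weight."""
    feats = np.asarray(features, dtype=float)
    w = np.ones(feats.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    return feats @ w / w.sum()

def mode_threshold(scores, bins=10):
    """'Additive mode' threshold: the most frequent score value,
    approximated here by the center of the fullest histogram bin."""
    counts, edges = np.histogram(scores, bins=bins)
    i = int(np.argmax(counts))
    return 0.5 * (edges[i] + edges[i + 1])

def classify(test_scores, threshold):
    """Tag a test image 'used' when its weighted score reaches the
    threshold, 'not used' otherwise."""
    return ["used" if s >= threshold else "not used" for s in test_scores]
```

Replacing `mode_threshold(scores)` with `scores.mean()` gives the "additive average" variant of submission 2.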
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Identify Generative Model Fingerprints Experiments</head><p>To identify generative models' fingerprints, we initially reduced noise in the CT images using morphological operations. We then employed the pre-trained BLIP2 and DINOv2 architectures for feature extraction. As illustrated in Figure <ref type="figure" target="#fig_1">2</ref>, our objective was to accurately label each subject with the corresponding model number, determining which generative or diffusion model produced each image.</p><p>For submissions 1 and 2, we utilized a combination of feature sets from BLIP and DINOv2, titled the 'multiformer' architecture, with different augmentation techniques. In submission 1, we applied center cropping and random affine transformations, along with resizing, normalization, and other standard setups. In submission 2, we used random cropping, random horizontal flipping, random rotation, and color jittering. These augmentations introduced variations in size, orientation, and brightness to the training dataset, enhancing the model's robustness and accuracy. We then used k-means and agglomerative clustering to assign labels to each subject. Submissions 3 and 4 followed a similar feature extraction method, but we applied PCA and an autoencoder, respectively, for combined-feature dimensionality reduction. The same clustering algorithms were then used for label assignment.</p><p>In submissions 5 and 6, we leveraged solely the BLIP architecture. Submission 5 used a BLIP base model for feature extraction, while submission 6 utilized the BLIP pre-trained ViT large model. The normalized feature sets were subsequently fed into clustering algorithms for labeling.</p><p>For submissions 7 and 8, we performed ensemble voting and reranking based on the decisions from previous submissions. 
Ensemble voting combined results at the decision level rather than the feature level, employing majority voting to determine the final classification, ensuring a more accurate and reliable outcome. For reranking, we applied Density-Based Spatial Clustering (DBSCAN). This algorithm identifies clusters by ensuring each point within a cluster has a neighborhood defined by a specified radius, containing at least a minimum number of points, thereby separating dense regions from areas with fewer points.</p></div>
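The decision-level majority voting used for ensemble fusion can be sketched as follows. This is a minimal illustration under our own naming, not the exact fusion code:

```python
from collections import Counter

def majority_vote(decisions):
    """Late fusion at the decision level: `decisions` is a list of
    per-method label lists, one label per test item; the fused label
    for each item is the one most methods voted for (ties resolved
    in favor of the earliest-listed method)."""
    n_items = len(decisions[0])
    fused = []
    for i in range(n_items):
        votes = [method[i] for method in decisions]
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused
```

Because fusion happens after each method has produced its labels, the backbone feature dimensions never need to be aligned, which is the practical advantage of late over early fusion here.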
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Results and Discussions</head><p>The results presented in Table <ref type="table" target="#tab_0">1</ref> show the metrics for datasets 1 and 2, referred to as db1 and db2, for the task of identifying training data fingerprints. Initially, we applied morphological opening operations, which first erode the image and then dilate the eroded image using the same structuring element. After performing these operations, we passed the images to our vision transformer models for further processing. Submission 1 utilized the DINOv2 model with additive mode thresholding. This model employs task-agnostic and cognitive approaches through self-supervised learning and features a multipurpose backbone derived from a pre-trained, extensive, and well-curated image set, powered by a visual transformer model. Additive mode thresholding was then used to assign labels to each image. This method achieved (shown in Table <ref type="table" target="#tab_0">1</ref>) an accuracy and precision of 0.5025 for dataset 1, and an accuracy and F1 score of 0.491 for dataset 2. As shown in Table <ref type="table" target="#tab_1">2</ref>, the overall accuracy was 0.492. Submission 2 used the BLIP architecture with additive average thresholding. This model employs an unimodal decoder to extract pattern signatures from images. Subsequently, additive average thresholding was applied to assign labels to each image. This approach resulted in an accuracy and precision of 0.50425 for dataset 1, and an accuracy and F1 score of 0.4995 for dataset 2. As shown in Table <ref type="table" target="#tab_1">2</ref>, the overall accuracy for submission 2 was 0.501875. Submissions 3 and 4 are based on the concatenated multiformer feature fusion. We subsequently applied PCA and an autoencoder for dimensionality reduction to both feature sets derived by the DINOv2 and BLIP models. 
This approach yielded our best results, with submission 4 achieving an accuracy and precision of 0.49575 for dataset 1. For dataset 2, the highest accuracy and F1 score of 0.5005 were attained, making it the best among these submissions. As shown in Table <ref type="table" target="#tab_1">2</ref>, the overall accuracy for submissions 3 and 4 was 0.496357 and 0.4957 respectively. In the final two submissions, 5 and 6, we employed a reranking technique. Submission 6 provided the highest accuracy for dataset 1, reaching 0.51. Meanwhile, for dataset 2, submission 5 achieved the best accuracy at 0.506. As shown in Table <ref type="table" target="#tab_1">2</ref>, the overall accuracy for submissions 5 and 6 was 0.500875 and 0.502625 respectively. Moreover, for Task 2, we investigated the intriguing notion that generative models might leave distinct marks on the images they create. Our goal was to determine whether different models have unique "fingerprints" within the synthetic images they produce. By closely examining these images, we aimed to uncover the specific characteristics that define each model's output. For this task, the organizers of ImageCLEFmed GANs <ref type="bibr" target="#b7">[8]</ref> reported results using the Adjusted Rand Index (ARI), which measures the similarity between two clusterings. The ARI improves upon the Rand Index by considering the likelihood of chance agreements between clusters, thus enhancing its reliability.</p><formula xml:id="formula_3">ARI = \frac{\sum_{ij} \binom{a_{ij}}{2} - \left[\sum_{i} \binom{a_{i.}}{2} \sum_{j} \binom{a_{.j}}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{i} \binom{a_{i.}}{2} + \sum_{j} \binom{a_{.j}}{2}\right] - \left[\sum_{i} \binom{a_{i.}}{2} \sum_{j} \binom{a_{.j}}{2}\right] / \binom{n}{2}}<label>(4)</label></formula><p>As shown in Eqn (4), the ARI can be written compactly as ARI = (Index − Expected Index) / (Max Index − Expected Index). 
Here, the Index represents the raw agreement index, which counts pairs of elements that are either in the same or different clusters in both the true and predicted clusterings. The Expected Index accounts for the expected value of the raw index if the cluster assignments were random, while the Max Index is the maximum value of the raw index, indicating perfect clustering. The ARI ranges from -1 to 1, where 1 signifies perfect agreement, 0 indicates random clustering, and -1 suggests no agreement. The formula leverages a contingency table and binomial coefficients to adjust the Rand Index for chance, providing a more accurate measure of clustering similarity. Among our eight submissions, shown in Table <ref type="table" target="#tab_2">3</ref>, submission 2 stood out with the highest ARI score of 0.900, demonstrating a strong agreement between the predicted clustering and the ground truth, and showcasing its exceptional performance in data clustering. In contrast, submissions 5 and 6 recorded very low ARI scores of 0.001 and 0.002, respectively, indicating poor alignment between the predicted clusters and the actual data.</p><p>Our analysis of the results revealed a spectrum of scores for each submission, illustrating how well they matched the real data. While submissions like submission 2 performed exceptionally well, others such as submissions 5 and 6 fell short of expectations. Overall, our findings highlight the distinctive marks generative models leave on the images they produce, which could aid in recognizing and attributing these images to specific models in the future.</p></div>
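Eqn (4) can be computed directly from the contingency table of the two labelings; a minimal Python sketch (our own implementation, using `math.comb` for the binomial coefficients):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """ARI as in Eqn (4): raw pair-count agreement (Index), corrected by
    its expected value under random labeling (Expected Index) and
    normalized by the maximum attainable value (Max Index)."""
    pairs = Counter(zip(labels_true, labels_pred))  # contingency counts a_ij
    rows = Counter(labels_true)                     # row sums a_i.
    cols = Counter(labels_pred)                     # column sums a_.j
    n = len(labels_true)
    index = sum(comb(a, 2) for a in pairs.values())
    sum_rows = sum(comb(a, 2) for a in rows.values())
    sum_cols = sum(comb(b, 2) for b in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case: no chance correction possible
        return 1.0
    return (index - expected) / (max_index - expected)
```

Note that the ARI is invariant to relabeling of clusters, which is why it suits this task: the model numbers assigned by clustering need not match the ground-truth numbering. `sklearn.metrics.adjusted_rand_score` computes the same quantity and can serve as a cross-check.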
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>In the dynamic field of medical image processing, ImageCLEFmed GANs has launched a pioneering initiative with the Identifying GAN Fingerprint task, alongside the task of detecting generative model fingerprints. In this paper, we tackled the first task by proposing six approaches, utilizing additive thresholding, autoencoder, and reranking techniques to classify images as used or not used for generating synthetic images. To address the task of detecting generative models' fingerprints, we implemented eight approaches involving DINOv2, BLIP, and ensemble feature fusion. These findings highlight the significance of our efforts in advancing medical image analysis techniques. Moving forward, we plan to apply advanced noise-removing techniques to leverage pixel-level connectivity. Additionally, we aim to develop a unified framework that integrates classical and transformer-based architectures to enhance our ability to detect imprint signatures.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Examples of CT images: (a) from the dataset for identifying training data fingerprints; (b) not from the dataset for identifying training data fingerprints; (c) generated from the dataset for identifying training data fingerprints, and (d) generated from the dataset for detecting generative model fingerprints.</figDesc><graphic coords="3,72.00,129.29,451.28,319.48" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A block diagram of the sequential processing steps (a) and detailed schematic of the developed system for identifying training data and detecting generative models' fingerprints (b). The process in (b) begins with morphological operations to reduce noise in CT images, followed by using BLIP and DINOv2 for image signature generation. After concatenating and reducing the dimensionality of features using an encoder and PCA, late fusion is applied to optimize the identification of training data.</figDesc><graphic coords="4,72.00,157.56,451.27,236.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: (a) Noisy CT images sampled from the ImageCLEF GAN dataset for task training image fingerprint identification; (b) Processed images after applying the image opening operation, which involves an erosion step followed by dilation. (c) Noisy CT images sampled from the ImageCLEF GAN dataset for the task of generative model fingerprint identification; (d) Processed images after applying the image opening operation for task 2, which also involves an erosion step followed by dilation.</figDesc><graphic coords="5,72.00,65.61,451.27,325.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: A thematic block diagram of multiformer leveraging the powerful architecture of BLIP2 and DINOv2.</figDesc><graphic coords="6,72.00,65.61,451.26,187.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Submission Results for the Task 1: Identifying Training Data Fingerprints.</figDesc><table><row><cell>Dataset 1 (db1)</cell><cell>Dataset 2 (db2)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Overall Submission Results for the Identifying Training Data Fingerprints task</figDesc><table><row><cell cols="2">Submission Identify</cell><cell>Identify</cell><cell>Identify</cell><cell>Identify</cell></row><row><cell></cell><cell>training data</cell><cell>training data</cell><cell>training data</cell><cell>training data</cell></row><row><cell></cell><cell>"fingerprints"-</cell><cell>"fingerprints"-</cell><cell>"fingerprints"-</cell><cell>"fingerprints"-</cell></row><row><cell></cell><cell>Accuracy</cell><cell>Precision</cell><cell>Recall</cell><cell>F1-score</cell></row><row><cell>1</cell><cell>0.492</cell><cell>0.497</cell><cell>0.497</cell><cell>0.497</cell></row><row><cell>2</cell><cell>0.5</cell><cell>0.501875</cell><cell>0.501875</cell><cell>0.501875</cell></row><row><cell>3</cell><cell>0.496</cell><cell>0.496375</cell><cell>0.496375</cell><cell>0.496375</cell></row><row><cell>4</cell><cell>0.5</cell><cell>0.4957</cell><cell>0.4957</cell><cell>0.4957</cell></row><row><cell>5</cell><cell>0.47</cell><cell>0.500875</cell><cell>0.500875</cell><cell>0.500875</cell></row><row><cell>6</cell><cell>0.483</cell><cell>0.502625</cell><cell>0.502625</cell><cell>0.502625</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Overall Submission Results for the Generative Model Fingerprint Detection Task</figDesc><table><row><cell>Submission</cell><cell>ARI Score</cell></row><row><cell>1</cell><cell>0.8137499357777883</cell></row><row><cell>2</cell><cell>0.9000159097044281</cell></row><row><cell>3</cell><cell>0.26753081555895303</cell></row><row><cell>4</cell><cell>0.36560477207139175</cell></row><row><cell>5</cell><cell>0.0013132463035679</cell></row><row><cell>6</cell><cell>0.0017768435</cell></row><row><cell>7</cell><cell>0.1785452554</cell></row><row><cell>8</cell><cell>0.2323909988</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Acknowledgments</head><p>This work was supported by the National Science Foundation (NSF) grant (ID. 2131307) "CISE-MSI: DP: IIS: III: Deep Learning-Based Automated Concept and Caption Generation of Medical Images Towards Developing an Effective Decision Support".</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>(M. M. Rahman) https://github.com/ismailEmonFu (Md. I. S. Emon); https://github.com/HoqueMahmudul (M. Hoque); https://github.com/Hasan-MdRakibul (M. R. Hasan); https://mdrahmanlab.com/ (M. M. Rahman) 0000-0003-0595-229X (Md. I. S. Emon); 0009-0006-5532-4135 (M. Hoque); 0000-0002-6179-2238 (M. R. Hasan); https://orcid.org/0000-0003-3318-2851 (F. Khalifa)</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEF 2024: Multimedia retrieval in medical applications</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Drăgulinescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Garcıa Seco De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Pakull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Damm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bracke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Andrei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Prokopchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Karpenka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radzhabov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kovalev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Macaire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lecouteux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Esperança-Rodier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yetisgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hicks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Riegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Thambawita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Storås</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Halvorsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Heinrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kiesel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024)</title>
		<title level="s">Springer Lecture Notes in Computer Science LNCS</title>
		<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of 2024 ImageCLEFmedical GANs Task -Investigating Generative Models&apos; Impact on Biomedical Synthetic Images</title>
		<author>
			<persName><forename type="first">A</forename><surname>Andrei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radzhabov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Karpenka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Prokopchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kovalev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF2024 Working Notes, CEUR Workshop Proceedings</title>
				<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Generative adversarial networks in medical image segmentation: A review</title>
		<author>
			<persName><forename type="first">S</forename><surname>Xun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Huang</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.compbiomed.2021.105063</idno>
		<ptr target="https://doi.org/10.1016/j.compbiomed.2021.105063" />
	</analytic>
	<monogr>
		<title level="j">Computers in Biology and Medicine</title>
		<imprint>
			<biblScope unit="volume">140</biblScope>
			<biblScope unit="page">105063</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Generative adversarial networks assisted machine learning based automated quantification of grain size from scanning electron microscope back scatter images</title>
		<author>
			<persName><forename type="first">A</forename><surname>Anantatamukala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">B</forename><surname>Dahotre</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.matchar.2023.113396</idno>
		<ptr target="https://doi.org/10.1016/j.matchar.2023.113396" />
	</analytic>
	<monogr>
		<title level="j">Materials Characterization</title>
		<imprint>
			<biblScope unit="volume">206</biblScope>
			<biblScope unit="page">113396</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The use of generative adversarial networks in medical image augmentation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Makhlouf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Maayah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Abughanam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Comput. Appl</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="24055" to="24068" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Attentiongan: Unpaired image-to-image translation using attention-guided generative adversarial networks</title>
		<author>
			<persName><forename type="first">H</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H S</forename><surname>Torr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sebe</surname></persName>
		</author>
		<idno>CoRR abs/1911.11897</idno>
		<ptr target="http://arxiv.org/abs/1911.11897" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Abc-gan: Spatially constrained counterfactual generation for image classification explanations</title>
		<author>
			<persName><forename type="first">D</forename><surname>Mindlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schilling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Explainable Artificial Intelligence</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Longo</surname></persName>
		</editor>
		<meeting><address><addrLine>Switzerland, Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Nature</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="260" to="282" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEFmedical GANs 2023 task -Identifying Training Data &quot;Fingerprints&quot; in Synthetic Biomedical Images Generated by GANs for Medical Image Security</title>
		<author>
			<persName><forename type="first">A</forename><surname>Andrei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radzhabov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Coman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kovalev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF2023 Working Notes, CEUR Workshop Proceedings, CEUR-WS</title>
				<meeting><address><addrLine>Thessaloniki, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEF 2023: Multimedia retrieval in medical, socialmedia and recommender systems applications</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Drăgulinescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Snider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Adams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yetisgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Garcıa Seco De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hicks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Riegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Thambawita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Storås</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Halvorsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Papachrysos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johanna</forename><surname>Schöler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Manguinhas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ştefan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Constantin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dogariu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deshayes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Popescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 14th International Conference of the CLEF Association (CLEF 2023)</title>
		<title level="s">Springer Lecture Notes in Computer Science LNCS</title>
		<meeting><address><addrLine>Thessaloniki, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Geometrically regularized autoencoders for non-euclidean data</title>
		<author>
			<persName><forename type="first">C</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-K</forename><surname>Noh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">C</forename><surname>Park</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=_q7A0m3vXH0" />
	</analytic>
	<monogr>
		<title level="m">The Eleventh International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Learning from demonstration using a curvature regularized variational auto-encoder (CurvVAE)</title>
		<author>
			<persName><forename type="first">T</forename><surname>Rhodes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bhattacharjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lee</surname></persName>
		</author>
		<idno type="DOI">10.1109/IROS47612.2022.9981930</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="10795" to="10800" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Z</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Perrin-Gilbert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Narmanli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">R</forename><surname>Myers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Waterfall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Sethna</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.01078</idno>
		<title level="m">𝛾-VAE: Curvature regularized variational autoencoders for uncovering emergent low dimensional geometric structure in high dimensional data</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Ahn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">H</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.17377</idno>
		<title level="m">Self-rectifying diffusion sampling with perturbed-attention guidance</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Skandarani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-M</forename><surname>Jodoin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lalande</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2105.05318</idno>
		<title level="m">GANs for medical image synthesis: An empirical study</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A new generative adversarial network for medical images super resolution</title>
		<author>
			<persName><forename type="first">W</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Azmat</surname></persName>
		</author>
		<idno type="DOI">10.1038/s41598-022-13658-4</idno>
		<ptr target="https://doi.org/10.1038/s41598-022-13658-4" />
	</analytic>
	<monogr>
		<title level="j">Scientific Reports</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Hallucinating face in the DCT domain</title>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-K</forename><surname>Cham</surname></persName>
		</author>
		<idno type="DOI">10.1109/TIP.2011.2142001</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Image Processing</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="2769" to="2779" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Generating highly realistic images of skin lesions with GANs</title>
		<author>
			<persName><forename type="first">C</forename><surname>Baur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Albarqouni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Navab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis: First International Workshop, OR 2.0 2018, 5th International Workshop, CARE 2018, 7th International Workshop, CLIP 2018, Third International Workshop, ISIC 2018, Held in Conjunction with MICCAI 2018</title>
		<title level="s">Proceedings</title>
		<meeting><address><addrLine>Granada, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">September 16 and 20, 2018</date>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="260" to="267" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Generative adversarial nets</title>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pouget-Abadie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mirza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Warde-Farley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ozair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Simulation and synthesis in medical imaging</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Frangi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Tsaftaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Prince</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Medical Imaging</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="673" to="679" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">DiCyc: GAN-based deformation invariant cross-domain information fusion for medical image synthesis</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Papanastasiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tsaftaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Newby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Macnaught</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Macgillivray</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.inffus.2020.10.015</idno>
	</analytic>
	<monogr>
		<title level="j">Information Fusion</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="page" from="147" to="160" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Scale-space and edge detection using anisotropic diffusion</title>
		<author>
			<persName><forename type="first">P</forename><surname>Perona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Malik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="629" to="639" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Unsupervised anomaly detection with generative adversarial networks to guide marker discovery</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schlegl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Seeböck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Waldstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Schmidt-Erfurth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Langs</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Information Processing in Medical Imaging</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="146" to="157" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">How to fool radiologists with generative adversarial networks? a visual turing test for lung cancer diagnosis</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Chuquicusma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hussein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Burt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Bagci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="240" to="244" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">CT image noise reduction based on adaptive Wiener filtering with wavelet packet thresholding</title>
		<author>
			<persName><forename type="first">M</forename><surname>Diwakar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kumar</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:15070114" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Parallel, Distributed and Grid Computing</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="94" to="98" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C H</forename><surname>Hoi</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:246411402" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savarese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.12597</idno>
		<title level="m">BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">An image is worth 16x16 words: Transformers for image recognition at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.11929</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of NAACL-HLT 2019</title>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<biblScope unit="page" from="4171" to="4186" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:231591445" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Align before fuse: Vision and language representation learning with momentum distillation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Selvaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Gotmare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Joty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C H</forename><surname>Hoi</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:236034189" />
	</analytic>
	<monogr>
		<title level="m">Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Oquab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darcet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Moutakanni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Vo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Szafraniec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Khalidov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fernandez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Haziza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Massa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>El-Nouby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Assran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ballas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Galuba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Howes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Misra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rabbat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Synnaeve</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jegou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mairal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Labatut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.07193</idno>
		<title level="m">Dinov2: Learning robust visual features without supervision</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Caron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Misra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jégou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mairal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.14294</idno>
		<title level="m">Emerging properties in self-supervised vision transformers</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Imagenet: A large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<ptr target="https://ieeexplore.ieee.org/abstract/document/5206848/" />
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="248" to="255" />
		</imprint>
	</monogr>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Weyand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Araujo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sim</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.01804</idno>
		<title level="m">Google Landmarks Dataset v2 - a large-scale benchmark for instance-level recognition and retrieval</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Mapillary street-level sequences: A dataset for lifelong place recognition</title>
		<author>
			<persName><forename type="first">F</forename><surname>Warburg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hauberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lopez-Antequera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gargallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Civera</surname></persName>
		</author>
		<ptr target="https://ieeexplore.ieee.org/xpl/conhome/9142308/proceeding" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of IEEE Conference on Computer Vision and Pattern Recognition 2020</title>
				<meeting>IEEE Conference on Computer Vision and Pattern Recognition 2020<address><addrLine>United States</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="19" to="25" />
		</imprint>
	</monogr>
	<note>Conference date</note>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Food-101 - Mining discriminative components with random forests</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bossard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Guillaumin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Gool</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Computer Vision</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Douze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guzhva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Szilvasy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-E</forename><surname>Mazaré</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lomeli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jégou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.08281</idno>
		<title level="m">The Faiss library</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<title level="m" type="main">FlashAttention: Fast and memory-efficient exact attention with IO-awareness</title>
		<author>
			<persName><forename type="first">T</forename><surname>Dao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ermon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rudra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ré</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.14135</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Dao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.08691</idno>
		<title level="m">FlashAttention-2: Faster attention with better parallelism and work partitioning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.09813</idno>
		<title level="m">Making monolingual sentence embeddings multilingual using knowledge distillation</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williamson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lawrence</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.13657</idno>
		<title level="m">Efficient transformer knowledge distillation: A performance review</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
