Evaluation of the Privacy of Images Generated by ImageCLEFmedical GANs 2024 Based on Similarity Methods

Notebook for the ImageCLEF Lab at CLEF 2024

Shitong Cao, Xiaobing Zhou*
School of Information Science and Engineering, Yunnan University, Kunming 650504, Yunnan, China
* Corresponding author. stcao@mail.ynu.edu.cn (S. Cao); zhouxb@ynu.edu.cn (X. Zhou); ORCID 0009-0003-1298-4166 (S. Cao), 0000-0003-1983-0971 (X. Zhou)
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France

Abstract
This paper describes our team's primary contribution to the ImageCLEFmedical GANs 2024 task, which aims to assess whether synthetic medical images generated by Generative Adversarial Networks (GANs) contain identifiable features from the training data. We employed a similarity-based classification method, categorizing real images according to their similarity to the generated images. We used several similarity calculation methods to evaluate the similarity between real and generated images, computing the similarity for the original images, for noise-added images, and for features extracted by a feature network. On the validation dataset, our similarity-based approach achieved an F1-score of 0.732 and an accuracy of 0.760. Among the submitted results, our best F1-score was 0.598. The experimental results demonstrate that our method can distinguish between images that were "used" and those that were "not used".

Keywords
GANs, Medical Images, Similarity Calculation, Magnify Differences

1. Introduction

Deep learning models have significant potential in supporting medical diagnosis and treatment, achieving remarkable results in various medical image analysis tasks. However, training these models requires vast amounts of data, which are often difficult to obtain. Deep generative models, capable of producing highly realistic medical images, have therefore been used to create large synthetic datasets that facilitate model training [1, 2]. Nevertheless, since generative models learn the probability distribution of the data, the synthetic images they produce may threaten the privacy of the patient images used in training. Recent studies have shown that medical images, such as chest X-rays and MRI scans, can be used to re-identify patients, exacerbating concerns about privacy breaches. To identify potential privacy threats associated with the use and sharing of synthetic medical data in real-world scenarios, a new challenge has been introduced as part of the ImageCLEFmedical evaluation campaign at CLEF 2024 [3, 4]. Our team's username is shitongcao.

ImageCLEF is a multimodal challenge; this task aims to verify whether images generated by Generative Adversarial Networks (GANs) are similar enough to the training data to pose a privacy risk. Specifically, given a set of synthetic images and a set of real images, the task is to identify which real images were used to train the model that generated the synthetic data. This is a binary classification task [5] in which each real image is labeled as "used" or "not used". The generated images are produced by learning the data distribution of the real images, so the synthetic images are statistically similar to the real ones: the closer the distribution of the generated images is to that of the real images, the higher the quality of the generated images and the more visually realistic they appear. In this work, our task is to perform binary classification to categorize
images as used or not used. To achieve this goal, we compute similarity scores and use them to determine the category of each image, classifying images with high similarity scores [6, 7, 8] as used and those with low similarity scores as not used.

We employed three different ways of calculating similarity. First, we directly computed the similarity between the generated images and the real images by comparing their pixel values. Second, we added noise to the images and then calculated the similarity between the noisy images and the real images, which makes the comparison more robust. Finally, we extracted high-dimensional features from the images using deep learning models and calculated the similarity between these features; such features capture deeper information about the images and therefore reflect the similarity between images more accurately. Using these three approaches, we were able to comprehensively evaluate the similarity between generated and real images and effectively accomplish the binary classification task. This multi-method approach not only improved the classification accuracy but also provided richer information and stronger guarantees for image processing and analysis.

2. Related Work

Synthetic images [9, 10, 11] offer an effective way to create representative cases, enabling researchers and clinicians to better study and understand various medical conditions, develop diagnostic tools, and explore treatment strategies. Moreover, synthetic images address privacy issues associated with patient data. Medical images often contain sensitive information, making it difficult to share or publicly release datasets. By generating synthetic images, the statistical and anatomical characteristics of real data can be preserved while specific patient information is removed, thus maintaining privacy, enabling more open collaboration, and facilitating research progress. Consequently, synthetic images are an indispensable resource in the medical field, used for data augmentation, rare-scenario simulation, and privacy protection. Their use helps researchers, clinicians, and technologists tackle critical challenges, enhance diagnostic accuracy, improve patient care, and advance medical imaging technologies.

Nataraj et al. proposed a method combining co-occurrence matrices and deep learning to detect GAN-generated fake images. The authors extracted co-occurrence matrices from the three color channels in the pixel domain and trained a deep convolutional neural network (CNN) on them. The method demonstrated good generalization when trained on one dataset and tested on another. GANs [9] have also been widely applied to medical image-to-image translation. For example, Zhu et al. introduced CycleGAN [12], a method capable of translating images from one domain to another without paired training data.

In conclusion, previous studies have demonstrated the potential of GANs for generating synthetic medical images and for image-to-image translation. However, distinguishing between synthetic and real medical images remains an active area of research, requiring robust methods to ensure the reliability and integrity of generated data.
3. Method

3.1. Data Similarity Statistics

To conduct a statistical analysis of the images, we categorized them into three groups: generated images, used images, and not used images. We observed that the similarity between the images is very high. To quantify this, we employed the Three-Component Weighted Structural Similarity Index (3-SSIM). 3-SSIM is an improved version of SSIM (Structural Similarity Index): whereas SSIM compares the entire image as a whole, 3-SSIM evaluates the similarity of edge, texture, and smooth regions separately, assigns a different weight to each component, and sums the weighted component scores to obtain the final similarity value. We performed statistical analysis on three sets of pairs: real images against real images (real-real), generated images against real images (generated-real), and generated images against generated images (generated-generated).

Table 1
Data analysis on the datasets of both tasks. The mean, minimum, and maximum 3-SSIM values between images in the development and test datasets were calculated, and the results for the two tasks were averaged.

Dataset       Subset               Average SSIM   Minimum SSIM   Maximum SSIM
Development   Real-Real            12.9           5.5            19.0
Development   Generated-Real       13.4           7.9            21.5
Development   Generated-Used       15.2           9.1            21.5
Development   Generated-Not Used   13.4           6.4            18.6
Test          Real-Real            12.8           5.6            18.8
Test          Generated-Real       13.2           7.8            21.3

The results in Table 1 show that the similarity between generated images is higher than the similarity between generated and real images, which in turn is higher than the similarity between real images. This is because the generated images are produced by the same model, which makes them comparatively more similar to one another. The similarity between generated and real images is slightly lower because the real set contains two kinds of images: the used images tend to have a higher similarity to the generated images, while the not used images have a lower similarity. The real-real similarity values are slightly lower still, although the difference is not large. When computing the values in the table, self-comparisons were excluded, since comparing an image with itself always yields the maximum similarity and carries no practical meaning; for this reason, the maximum similarity for the real-real and generated-generated pairs does not reach 30.
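To make the computation behind Table 1 concrete, the sketch below shows one way such pairwise similarity statistics can be obtained. It is a minimal sketch only: it uses the standard single-scale SSIM from scikit-image as a stand-in for 3-SSIM (which additionally separates edge, texture, and smooth regions), and the directory layout, file format, and per-image summation over the comparison set are assumptions made for illustration.

```python
# Minimal sketch: pairwise similarity statistics between two image sets.
# Plain SSIM is used here as a stand-in for 3-SSIM; paths are hypothetical.
from pathlib import Path

import numpy as np
from skimage.io import imread
from skimage.metrics import structural_similarity as ssim


def load_images(folder):
    """Load all PNG images in a folder as grayscale float arrays."""
    return [imread(p, as_gray=True).astype(np.float64)
            for p in sorted(Path(folder).glob("*.png"))]


def pairwise_stats(set_a, set_b, same_set=False):
    """Summed SSIM of each image in set_a against set_b (self-comparisons excluded)."""
    totals = []
    for i, a in enumerate(set_a):
        total = 0.0
        for j, b in enumerate(set_b):
            if same_set and i == j:   # skip the self-comparison, whose SSIM is always 1
                continue
            total += ssim(a, b, data_range=1.0)
        totals.append(total)
    totals = np.asarray(totals)
    return totals.mean(), totals.min(), totals.max()


real = load_images("data/real")             # hypothetical directory layout
generated = load_images("data/generated")

print("real-real      (avg, min, max):", pairwise_stats(real, real, same_set=True))
print("generated-real (avg, min, max):", pairwise_stats(generated, real))
```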
3.2. Similarity Calculation Methods

Similarity calculation methods play a crucial role in image processing, machine learning, and information retrieval. This section introduces the similarity measures used in this work: the Euclidean distance, cosine similarity, and the Structural Similarity Index (SSIM). Similarity calculation measures the degree of similarity between two objects, such as images, feature vectors, or strings, and is widely applied in image classification, clustering, retrieval, and recommendation systems. Each method has its own applicable scenarios and advantages, so choosing an appropriate similarity measure is critical to the performance and effectiveness of an algorithm. Figure 1 illustrates the different image similarity calculations.

Figure 1: Illustration of the different image similarity calculation methods.

When calculating image similarity, the Euclidean distance is a commonly used measure of the difference between two images at the pixel level. It assumes that an image can be represented as a point in a high-dimensional space, where each pixel value (typically an RGB value) is one dimension of that space. The Euclidean distance is the straight-line distance between the two points (i.e., the two images); a smaller distance indicates a higher similarity. The two images must have the same dimensions; in this task, the generated and real images are of identical size. The Euclidean distance is computed as

$$ d(\mathbf{A}, \mathbf{B}) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2} $$

where A and B are two n-dimensional vectors.

Cosine similarity is another commonly used measure of image similarity, particularly in content-based image retrieval (CBIR) systems. It assesses the similarity between two vectors by the cosine of the angle between them: the closer the directions of the two vectors, the more similar they are, regardless of their magnitudes. Computing cosine similarity requires converting each image into a vector, typically by flattening its pixel values, or features extracted from it (such as color histograms, texture descriptors, or shape features), into a one-dimensional vector. Cosine similarity ranges from -1 to 1, where 1 indicates identical directions (very similar), 0 indicates orthogonality (no similarity), and -1 indicates completely opposite directions. Because cosine similarity focuses on direction and ignores magnitude, two images can be very similar in terms of certain feature ratios while still differing considerably in absolute pixel values.

$$ \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\, \sqrt{\sum_{i=1}^{n} B_i^2}} $$

where A and B are two vectors, A · B is their dot product, and ||A|| and ||B|| are their magnitudes.

The Structural Similarity Index (SSIM) is a more intuitive and effective method for calculating image similarity. SSIM considers the luminance, contrast, and structural information of images, which allows it to reflect the human visual system's perception of image quality more accurately. SSIM first compares the luminance of the two images, computed from their mean values, which reflect the overall brightness. It then compares the contrast, computed from the standard deviations: the larger the standard deviation, the higher the contrast. Finally, it compares the structural information of the two images through their covariance, which reflects the linear relationship between corresponding pixels and captures the structural characteristics of the images.

$$ \mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $$

where x and y are corresponding blocks of the two images, $\mu_x$ and $\mu_y$ are the means of blocks x and y, $\sigma_x$ and $\sigma_y$ are their standard deviations, and $\sigma_{xy}$ is their covariance.
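The three measures can be computed directly on a pair of equally sized images. The following is a minimal sketch using NumPy and scikit-image; the file names are hypothetical and the images are treated as grayscale arrays.

```python
# Minimal sketch: Euclidean distance, cosine similarity, and SSIM between
# one generated image and one real image of identical size. File names are hypothetical.
import numpy as np
from skimage.io import imread
from skimage.metrics import structural_similarity as ssim

gen = imread("generated_0001.png", as_gray=True).astype(np.float64)
real = imread("real_0001.png", as_gray=True).astype(np.float64)

a, b = gen.ravel(), real.ravel()   # flatten the images into n-dimensional vectors

# Euclidean distance: smaller values mean more similar images.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Cosine similarity: values close to 1 mean the vectors point in similar directions.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# SSIM: compares luminance, contrast, and structure over local windows.
structural = ssim(gen, real, data_range=1.0)

print(f"Euclidean = {euclidean:.2f}, cosine = {cosine:.4f}, SSIM = {structural:.4f}")
```

In practice, a per-method threshold on such scores is what turns them into a used / not used decision, as described in Section 1.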
3.3. Expanding the Differences between Images

Noise can significantly affect image quality and thereby influence the results of similarity calculations. Common types of noise include Gaussian white noise and salt-and-pepper noise.

Gaussian white noise follows a normal distribution, typically with a mean of zero; its standard deviation can be set according to the application. It adds a random value to every pixel of the image, producing an overall blurring effect. Salt-and-pepper noise randomly changes pixels to white (255) or black (0) with a certain probability and commonly arises during image transmission; it produces random white or black speckles that make the image appear grainy and unclear. As shown in Figure 2, the similarity calculation is then performed on the two images after noise has been added.

Figure 2: Similarity calculation performed on two images after noise has been added to both.

Our experiments show that noise has a significant impact on image similarity values. After noise is added, the similarity between images decreases noticeably, which effectively expands the differences between images. Gaussian white noise and salt-and-pepper noise degrade image quality in different ways and therefore affect the similarity results differently. When similarity is used as a measure, it is important to account for the impact of noise on the results and, in practice, to apply appropriate noise-handling measures to keep the similarity calculation accurate.
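A minimal sketch of this noise step is given below, assuming scikit-image's random_noise utility; the noise levels shown are illustrative and do not reflect the exact settings used in our experiments.

```python
# Minimal sketch: adding Gaussian and salt-and-pepper noise to both images
# before recomputing SSIM (Section 3.3). Noise levels are illustrative only.
import numpy as np
from skimage.io import imread
from skimage.metrics import structural_similarity as ssim
from skimage.util import random_noise

gen = imread("generated_0001.png", as_gray=True).astype(np.float64)   # hypothetical files
real = imread("real_0001.png", as_gray=True).astype(np.float64)

# Gaussian white noise: zero mean, small variance.
gen_gauss = random_noise(gen, mode="gaussian", var=0.01)
real_gauss = random_noise(real, mode="gaussian", var=0.01)

# Salt-and-pepper noise: a fraction of pixels is set to black or white.
gen_sp = random_noise(gen, mode="s&p", amount=0.05)
real_sp = random_noise(real, mode="s&p", amount=0.05)

print("SSIM (original images) :", ssim(gen, real, data_range=1.0))
print("SSIM (Gaussian noise)  :", ssim(gen_gauss, real_gauss, data_range=1.0))
print("SSIM (salt-and-pepper) :", ssim(gen_sp, real_sp, data_range=1.0))
```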
4. Experiments

4.1. Evaluation Metrics

We carried out assessments in two settings. First, we split the validation set into two equal portions, designating one half as a test set to streamline our experimental analysis. We then submitted our results to the ImageCLEFmedical GANs 2024 "Identify Training Data Fingerprints" task. The challenge is posed as a binary classification problem, and its evaluation criteria comprise several performance metrics: F1-score, accuracy, precision, recall, and specificity. The F1-score was chosen as the principal metric for this year's evaluation. These metrics are defined as follows:

$$ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{1} $$

$$ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2} $$

$$ \mathrm{Specificity} = \frac{TN}{TN + FP} \tag{3} $$

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} $$

$$ \text{F1-score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5} $$

4.2. Experimental Results

In this experiment, we conducted tests on both the original images and images with added noise. We used the Euclidean distance, cosine similarity, and the Structural Similarity Index (SSIM) to calculate the similarity between images, and classified the images based on the resulting similarity scores to obtain accuracy, precision, recall, and F1-score.

To highlight the differences between images, we introduced varying degrees of noise. Specifically, we added different levels of Gaussian noise and salt-and-pepper noise to assess their impact on the similarity scores and the classification performance. This helped us understand how noise degrades image quality and affects similarity calculations. By systematically introducing different noise levels, we were able to evaluate the effectiveness and stability of the different similarity measures under various noise conditions. This experimental design allowed us to measure the impact of noise on image similarity calculations more accurately, providing useful insights for future image processing and analysis. The experimental results, summarized in Table 2, show the performance of each similarity measure without (w/o) and with (w/) added Gaussian noise.

Table 2
Classification results obtained from similarity scores calculated on the original images (w/o Gaussian noise) and on the noise-added images (w/ Gaussian noise), using the Euclidean distance, cosine similarity, and SSIM. The same procedure was applied to both tasks, and the results for the two tasks were averaged.

Method               Gaussian noise   Accuracy   Precision   Specificity   Recall   F1-score
Euclidean Distance   w/o              0.675      0.663       0.638         0.713    0.687
Cosine Similarity    w/o              0.613      0.576       0.375         0.850    0.687
SSIM                 w/o              0.678      0.682       0.702         0.689    0.704
Euclidean Distance   w/               0.650      0.650       0.650         0.650    0.650
Cosine Similarity    w/               0.705      0.683       0.720         0.696    0.717
SSIM                 w/               0.731      0.740       0.750         0.713    0.726

We made predictions for all 4,000 images generated by each model and submitted these predictions. To evaluate performance, we used the F1-score as the primary metric, since it considers both precision and recall and therefore provides a more holistic assessment, and accuracy as a secondary metric measuring correctness across all predictions. This evaluation process ensured that we could fully understand the behavior of our methods under different conditions. We submitted a total of eight different runs; the detailed scores of the three best runs are summarized in Table 3, which shows the performance and evaluation scores of each submission. These results help us further analyze and improve our methods, enhancing their effectiveness in practical applications.

As shown in Table 3, adding noise significantly increases the differences between images. We performed similarity calculations on the images with added noise and classified them based on these similarity scores. Notably, the SSIM-based similarity calculation yielded the best results on the test set.

Table 3
The three best submitted results.

Method               F1      Acc     Prec    Recall   F1      Acc     Prec    Recall   F1
SSIM                 0.598   0.614   0.599   0.689    0.641   0.504   0.503   0.622    0.556
Euclidean Distance   0.598   0.615   0.600   0.688    0.641   0.504   0.503   0.619    0.555
Cosine Similarity    0.614   0.711   0.824   0.538    0.651   0.593   0.751   0.279    0.407

5. Conclusion

In this study, we employed various similarity calculation methods to classify images. By setting similarity thresholds, we classified real images according to their similarity to the generated images. Different similarity calculation methods were used to evaluate the similarity between real and generated images, applied to both the original images and the images with added noise. In future work, we will investigate calculating similarity based on extracted features. When calculating feature similarity, it is crucial that the features contain enough detailed information to capture the subtle differences between images.

References

[1] N. K. Singh, K. Raza, Medical image generation using generative adversarial networks: A review, Health Informatics: A Computational Perspective in Healthcare (2021) 77–96.
[2] C. Han, H. Hayashi, L. Rundo, R. Araki, W. Shimoda, S. Muramatsu, Y. Furukawa, G. Mauri, H.
Nakayama, GAN-based synthetic brain MR image generation, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, 2018, pp. 734–738.
[3] A. Andrei, A. Radzhabov, D. Karpenka, Y. Prokopchuk, V. Kovalev, B. Ionescu, H. Müller, Overview of 2024 ImageCLEFmedical GANs Task – Investigating Generative Models' Impact on Biomedical Synthetic Images, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024.
[4] B. Ionescu, H. Müller, A. Drăgulinescu, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. Pakull, H. Damm, B. Bracke, C. M. Friedrich, A. Andrei, Y. Prokopchuk, D. Karpenka, A. Radzhabov, V. Kovalev, C. Macaire, D. Schwab, B. Lecouteux, E. Esperança-Rodier, W. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, S. A. Hicks, M. A. Riegler, V. Thambawita, A. Storås, P. Halvorsen, M. Heinrich, J. Kiesel, M. Potthast, B. Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science (LNCS), Grenoble, France, 2024.
[5] Y. Tokozume, Y. Ushiku, T. Harada, Between-class learning for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5486–5494.
[6] G. Palubinskas, Image similarity/distance measures: what is really behind MSE and SSIM?, International Journal of Image and Data Fusion 8 (2017) 32–53.
[7] P.-E. Danielsson, Euclidean distance mapping, Computer Graphics and Image Processing 14 (1980) 227–248.
[8] P. Xia, L. Zhang, F. Li, Learning similarity with cosine similarity ensemble, Information Sciences 307 (2015) 39–52.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (2020) 139–144.
[10] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
[11] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
[12] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.