Classification of Real and Generated Images based on Feature Similarity Notebook for ImageCLEF Lab at CLEF 2024 Huihui Tang1 , Hancheng Wang1,2 and Jining Chen1,* 1 Guangxi Key Laboratory of Digital Infrastructure, Guangxi Zhuang Autonomous Region Information Center 2 GUANGXI BEITOU IT INNOVATION TECHNOLOGY INVESTMENT GROUP CO.,LTD. Abstract Deceptive images can be shared on social network services within seconds, posing a significant risk. In the application and research of artificial intelligence on medical images, data issues have always been a challenge, including insufficient amounts of medical image data and privacy concerns. Currently, generative models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models have achieved remarkable results, generating high-quality images. Using generative models for the generation of medical images is a research focus. The essence of generative models is to learn the distribution of data, not to create something out of nothing. Therefore, we investigate whether the real data can be identified from the generated data. In this paper, based on similarity calculations, we calculate the similarity between original images and images with added perturbations. We use self-supervised Masked Autoencoders to reconstruct the images and thus achieve feature similarity calculation. By judging the similarity, we can identify the real images used for training the generative model. Our experimental results on the validation set show an accuracy of 0.743, a precision of 0.721, a recall of 0.720, and an F1 score of 0.726; the F1 score on the test set is 0.603. The experimental results indicate that generated images also pose a threat to patient privacy. Keywords GANs, Pre-trained Model, Feature Extraction, Similarity Calculation 1. 
Introduction Currently, artificial intelligence has numerous hot topics in medical research, including studies on medical imaging, medical image classification, detection, and segmentation tasks[1, 2, 3]. However, these tasks face a common challenge: the lack of data. Initially, data augmentation[4] was used as a solution, transforming small amounts of data into larger datasets through operations such as flipping and cropping. This approach relies on the translation invariance of convolutional neural networks. However, data augmentation only addresses superficial issues, and in some tasks, the increased quantity is still insufficient. Additionally, the amount of data often determines the upper limit of a model's performance. Large models require vast amounts of data for training, potentially exceeding tens of billions of data points. As models become larger, the demand for data also increases; otherwise, the models either overfit to noise or fail to fully utilize their capabilities. Medical data, in particular, presents complex challenges due to privacy concerns and the difficulty of annotation, making data acquisition difficult and resulting in consistently small datasets. The advent of generative models represents a technological breakthrough, capable of creating large amounts of high-quality data. Currently, models based on Generative Adversarial Networks (GANs)[5] and Variational Autoencoders (VAEs)[6] can generate high-quality images, while the popular diffusion models not only produce high-quality images but also exhibit diversity. Therefore, generative models are an effective method for data augmentation. However, privacy concerns must be considered for medical data. Even if the generated data differs, does it still pose a privacy risk? Can the original data be identified from the generated images? CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France * Corresponding author. tanghh@gxi.gov.cn (H. Tang); 1392207107@qq.com (H.
Wang); chenjn@gxi.gov.cn (J. Chen) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). To address this issue, this paper employs multiple methods, including similarity calculation, enhancing image details before calculating similarity, and using deep learning and Masked Autoencoders (MAE)[7] methods to calculate the similarity of extracted image features. This approach aims to determine which real data were used for training the generative models based on the generated data. 2. Related Work In recent years, Generative Adversarial Networks (GANs)[5] have garnered widespread attention in the medical field for various image generation and translation tasks. Numerous studies have explored the application of GANs in medical image synthesis and image-to-image translation, particularly in the recognition or detection of synthetic images. A substantial amount of work has investigated the use of GANs to generate synthetic medical images. For instance, Choi et al. proposed a method called “StarGAN”[8], a multi-domain image synthesis technique successfully applied to generate diverse and realistic brain MRI images. Similarly, the SliceGAN architecture[9] utilizes GANs to generate high-quality three-dimensional datasets from a single representative two-dimensional image. Synthetic images play a crucial role in the medical field as they offer significant advantages and address major challenges[10]. Firstly, the generation of synthetic images allows for the augmentation of limited or insufficient datasets. In many medical imaging applications, obtaining large and diverse annotated datasets can be challenging and time-consuming. By generating synthetic images, researchers can expand the training data, thereby enhancing the robustness and generalization of machine learning models.
Secondly, synthetic images can simulate rare or difficult-to-obtain medical scenarios. Certain conditions or diseases may have low prevalence or be challenging to capture through traditional imaging methods, making synthetic images a valuable resource in these instances. Synthetic images offer a novel approach to creating representative cases, allowing researchers and clinicians to study and understand these conditions better, develop more effective diagnostic tools, and explore various treatment strategies. Additionally, synthetic images address privacy concerns associated with patient data. Medical images often contain sensitive information, making data sharing and public release challenging. By generating synthetic images, it is possible to retain the statistical and anatomical characteristics of the data while removing specific patient information. This approach preserves privacy, facilitates more open collaboration, and advances research progress. In conclusion, synthetic images are indispensable in the medical field. They play a crucial role in data augmentation, simulation of rare conditions, and privacy protection. Their utilization empowers researchers, clinicians, and technologists to tackle key challenges, enhance diagnostic accuracy, improve patient care, and advance medical imaging technologies. 3. Method 3.1. Task Description and Dataset Analysis After training the generative model, it is possible to produce new images. In order to identify the potential privacy threats of using and sharing synthetic medical data in various real-world scenarios, a new challenge (ImageCLEFmedical GANs[11]) arose as part of the medical track of the ImageCLEF Challenge 2024[12]. Our team’s username is robot. This task aims to verify whether the generated images can leak information from the original training data. By analyzing the generated images, we can distinguish which images from the real dataset were used to train the generative model. 
The dataset used in this study is organized as follows. The data is divided into two folders: development data and test data. The development data includes both used and not used images. Tables 1 and 2 show the dataset structures for Task 1 and Task 2, respectively. Although the number of real images differs significantly between the two tasks during the development phase, the approach and methods for handling both tasks are consistent.

Table 1
Task 1 dataset structure description.

Dataset       Generated   Real
Development   10000       100 (used) + 100 (not used)
Test          5000        4000 (used and not used)

Table 2
Task 2 dataset structure description.

Dataset       Generated   Real
Development   10000       3000 (used) + 3000 (not used)
Test          7200        4000 (used and not used)

3.2. Data Visualization Analysis The quality of a generative model is evaluated based on its ability to produce high-quality and diverse images. A generative model learns the distribution of real data, and a model that can generate high-quality images indicates that it has accurately learned the data distribution of real images. Diversity reflects the creative ability of the generative model, assessing whether it can create different images based on the learned data distribution. Visualizing the data distribution allows for an intuitive understanding of the relationships between the data. In this paper, we provide histogram visualizations of the statistical data for both the generated and real images.

Figure 1: Pixel statistics of (a) generated images and (b) real images. Statistics are computed on the generated and real images of the two tasks respectively, and the results of the two tasks are averaged.

As illustrated in Figure 1, we have compared the pixel values of the generated images with those of the real images. It is evident that the pixel value distributions are quite similar.
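The pixel-statistics comparison described above can be sketched in a few lines. The toy images below are random stand-ins for the actual datasets, so the resulting counts are illustrative only:

```python
import random

def pixel_histogram(images, bins=256):
    """Accumulate a pixel-value histogram over a batch of 8-bit images
    (each image is a flat list of integer pixel values in [0, 255])."""
    counts = [0] * bins
    for img in images:
        for v in img:
            counts[v] += 1
    return counts

# Random stand-ins for the generated and real image sets (not the real data).
rng = random.Random(0)
generated = [[rng.randrange(256) for _ in range(64 * 64)] for _ in range(4)]
real = [[rng.randrange(256) for _ in range(64 * 64)] for _ in range(4)]

gen_hist = pixel_histogram(generated)
real_hist = pixel_histogram(real)
print(sum(gen_hist), sum(real_hist))  # each sums to 4 * 64 * 64 pixels
```

Plotting counts against pixel value (for example with matplotlib) reproduces the kind of histogram shown in Figure 1.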
The horizontal axis of the histogram represents the pixel values, while the vertical axis represents the number of pixels. Based on the results of the two histogram statistics, it can be seen that the generated images and the real images are highly similar. Additionally, it is evident that the real images are also highly similar to each other, even though they include two categories: used images and not used images. 3.3. Feature Extraction In this paper, we explore the inherent connection between generated images and used images, which were involved in the training process. As we know, in mainstream generative model methods, Generative Adversarial Networks (GANs) have both a generator and a discriminator that mutually enhance each other, ultimately generating the required image data through the generator. The generator's structure involves extracting features and then reconstructing them. The principle structure of Variational Autoencoders (VAEs) is similar, directly reconstructing the extracted features by building a reconstruction loss. Diffusion models work similarly, adding noise and then reconstructing the noisy features. Although these generative models employ various ingenious designs when constructing loss functions, they essentially aim to minimize a reconstruction loss. The reconstruction loss is typically calculated using the Mean Squared Error (MSE) function, which is also a method of image similarity comparison. Therefore, the overall approach adopted in this paper is reverse inference through similarity comparison. MAE (Masked Autoencoder) is a self-supervised deep learning method. We know that image similarity can be compared, and similarly, features extracted by deep learning can also be compared for similarity. The features extracted by deep learning often contain highly integrated information. Therefore, calculating the similarity of extracted features is a worthwhile approach to consider.
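As a minimal sketch of the feature-similarity idea: the vectors below stand in for features produced by a trained encoder (the values are illustrative assumptions, not real extracted features), and cosine similarity compares them:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

feat_real = [0.2, 0.9, 0.4, 0.1]
feat_generated = [0.21, 0.88, 0.41, 0.12]   # near-duplicate of feat_real
feat_unrelated = [0.9, -0.1, 0.05, 0.8]

print(cosine_similarity(feat_real, feat_generated))  # close to 1.0
print(cosine_similarity(feat_real, feat_unrelated))  # clearly lower
```

A high score between a real-image feature and a generated-image feature is the signal exploited later to flag the real image as "used".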
Feature extraction networks need to be trained to accurately extract information from images. Although we can directly use models pre-trained on ImageNet for feature extraction and similarity calculation, the results are not satisfactory because there is a significant difference between ImageNet data and medical data. To address this issue, unsupervised learning or self-supervised learning methods can be considered. Figure 2 shows the structure of our overall network model. MAE is a self-supervised learning method in which the input data serves as its own reconstruction target, so no manual labels are required. The uniqueness of MAE lies in its approach of masking part of the information and then reconstructing it. This is an excellent idea because, in data with high image similarity, masking part of the information forces the model to focus more on details. This increases the difficulty of reconstruction, allowing the model to learn to distinguish image details from a small amount of data. This approach ensures that the model can approximate the true image even with occlusions, thus emphasizing detailed information. Consequently, we can calculate the similarity of the extracted features. 3.4. Feature Similarity After completing feature extraction, similarity calculation becomes a critical step for image recognition and classification. This process aims to measure how close two images are in the feature space, and it employs common similarity measures such as Euclidean distance, cosine similarity, and Manhattan distance to achieve this. Each of these measures has its own advantages and applications depending on the nature of the data and the specific requirements of the task. The specific steps for similarity calculation using features extracted by the MAE network are as follows. First, acquire the feature vectors for each image to be compared by using a pre-trained MAE network.
This involves passing the images through the encoder part of the MAE network, which compresses the input into a latent representation capturing the essential features of the image. Next, choose an appropriate similarity measure. For instance, cosine similarity can be particularly useful in cases where the magnitude of the feature vectors is not as important as the orientation, making it ideal for assessing the angle between vectors in high-dimensional space. Alternatively, Euclidean distance might be preferred for tasks where the absolute differences in feature values are more meaningful. Once the similarity measure is selected, calculate the distance or similarity score between the feature vectors of the two images. This score quantitatively expresses how similar or different the images are based on their extracted features. Finally, based on the calculated similarity scores, perform operations such as classification, clustering, or retrieval of images. In classification tasks, images can be assigned to predefined categories based on their similarity to representative examples. In clustering, images are grouped into clusters of similar items, which can reveal inherent structures in the data without prior labeling. For image retrieval, the similarity scores can be used to rank a database of images, retrieving those that are most similar to a given query image. This comprehensive approach ensures that the images are analyzed and utilized effectively, leveraging the power of MAE-based feature extraction and similarity calculation to enhance various image processing tasks.

Figure 2: Comparative Learning Diagram. The generated image is obtained from the used image, so the two are similar in comparative learning. When comparing the generated image with the not used image for learning, it is dissimilar.

4. Experiments 4.1. Experimental Design In our experiment, we used the NVIDIA GeForce RTX 3090 graphics card to complete two tasks.
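The similarity-based classification steps of Section 3.4 can be sketched end to end: each real-image feature is matched against all generated-image features, and a threshold on the best match decides "used" versus "not used". The threshold value and the toy features below are assumptions for illustration, not the paper's tuned settings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def label_real_images(real_feats, gen_feats, threshold=0.9):
    """Label each real image 1 (used) if its best cosine similarity
    against the generated set exceeds the threshold, else 0 (not used)."""
    labels = []
    for r in real_feats:
        best = max(cosine(r, g) for g in gen_feats)
        labels.append(1 if best > threshold else 0)
    return labels

gen_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
real_feats = [[0.99, 0.05, 0.0],   # close to a generated feature -> used
              [0.0, 0.0, 1.0]]     # far from all generated -> not used
print(label_real_images(real_feats, gen_feats))  # [1, 0]
```

In practice the feature vectors would come from the trained MAE encoder rather than being hand-written, and the threshold would be tuned on the development split.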
The detailed process and time for each experiment are as follows: Task 1: To ensure the accuracy and efficiency of the experiment, we loaded all the data onto the RTX 3090 graphics card for processing. Throughout the experiment, we leveraged its powerful computational capabilities and efficient parallel processing features, significantly enhancing the data processing speed. After multiple iterations and optimizations, the total experiment time for Task 1 was approximately 15 hours. During this period, the graphics card operated efficiently, ensuring the integrity of the data and the reliability of the experimental results. Task 2: Following the completion of Task 1, we continued to utilize the RTX 3090 graphics card for the second experiment. Similar to Task 1, we performed multiple data loading and processing operations, fully exploiting the card's advantages in deep learning and large-scale data processing. Through repeated experiments and optimizations, we successfully completed Task 2 in approximately 16 hours. Throughout this period, the graphics card maintained high efficiency, ensuring the continuity of the experiment and the consistency of the results. 4.2. Evaluation Metrics This task was approached as a binary classification problem, and its evaluation involved several key performance metrics: F1-score, accuracy, precision, recall, and specificity. Among these, the F1-score has been designated as the primary metric for this year's evaluation. The definitions of these metrics are as follows:

Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)

Specificity = TN / (TN + FP)    (3)

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

F1-score = 2 · Precision · Recall / (Precision + Recall)    (5)

4.3. Experimental Results To conduct a more comprehensive and detailed experimental analysis, we divided the validation dataset into two parts: one as the validation set and the other as the test set.
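The metric definitions in Section 4.2 can be computed directly from confusion-matrix counts. The counts in this example are illustrative only, not the paper's results:

```python
def classification_metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "accuracy": accuracy, "f1": f1}

# Hypothetical counts for a balanced 200-image development split.
m = classification_metrics(tp=72, fp=28, tn=70, fn=30)
print(round(m["accuracy"], 3), round(m["f1"], 3))  # 0.71 0.713
```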
This division allows the validation set to be used for parameter tuning and performance evaluation of the model, ensuring that adjustments made during training are effective. Meanwhile, the test set is used for the final performance evaluation to assess the model's generalization ability on unseen data. This approach helps us to more accurately measure the actual performance and stability of the model, thereby obtaining more reliable and representative experimental results. The experimental results are shown in Table 3. 4.3.1. Ablation Study

Table 3
The average of the experimental results of Task 1 and Task 2 on the development dataset. We used different feature extraction models for ablation experiments.

Model             Accuracy  Precision  Specificity  Recall  F1-score
VGG[13]           0.655     0.652      0.661        0.706   0.676
InceptionNet[14]  0.616     0.637      0.590        0.850   0.677
ResNet50[15]      0.632     0.640      0.648        0.652   0.651
ResNet101[15]     0.650     0.873      0.812        0.484   0.587
MobileNetV2[16]   0.721     0.732      0.748        0.715   0.720
MobileNetV3[17]   0.722     0.717      0.744        0.694   0.708
EfficientNet[18]  0.648     0.752      0.815        0.487   0.644
MAE[7]            0.743     0.721      0.733        0.720   0.726

Table 4
Scores of the eight different submitted results.

Submission Model  F1     Acc    Prec   Recall  F1     Acc    Prec   Recall  F1
VGG               0.312  0.583  0.812  0.216   0.341  0.559  0.761  0.174   0.283
InceptionNet      0.314  0.583  0.812  0.216   0.341  0.559  0.747  0.178   0.287
ResNet50          0.503  0.503  0.504  0.409   0.451  0.504  0.503  0.619   0.555
MobileNetV2       0.350  0.615  0.600  0.688   0.641  0.504  0.579  0.031   0.058
MobileNetV3       0.429  0.503  0.504  0.409   0.451  0.593  0.751  0.279   0.407
EfficientNet      0.524  0.615  0.600  0.688   0.641  0.593  0.751  0.279   0.407
MAE               0.603  0.711  0.824  0.538   0.651  0.504  0.503  0.619   0.555

We conducted an ablation study on different feature extraction modules to evaluate their performance in image classification tasks. Specifically, we used VGG, InceptionNet, ResNet50, ResNet101, MobileNetV2, MobileNetV3, EfficientNet, and MAE pre-trained models for feature extraction.
Subsequently, we performed similarity calculations to classify the images based on these features. To comprehensively assess the effectiveness of these feature extraction modules, we evaluated multiple metrics including accuracy, precision, specificity, recall, and F1-score. By calculating and analyzing these metrics, we were able to compare the strengths and weaknesses of each model in similarity computation and image classification tasks. The data in Table 3 indicates that the MAE pre-trained model achieved the best performance, with the highest accuracy and F1-score among the evaluated models, demonstrating its robust capability in feature extraction and image classification. These results suggest that the MAE pre-trained model not only captures detailed features of images but also effectively performs classification tasks, providing strong support for future research and applications. 4.3.2. Submission We submitted results for the ImageCLEFmedical GANs 2024: Identify Training Data Fingerprints competition. Each submission file had to contain predictions (1 - used, 0 - not used) for all 4,000 images generated by each model. The evaluation method primarily used the F1-score as the evaluation metric, while accuracy was used as the secondary metric. In total, we submitted eight results, with our best submission achieving an F1-score of 0.603. As shown in Table 4, the submitted results exhibited significant score differences, which may be due to the selection of less optimal features. When classifying based on the similarity scores, the resulting score differences were considerable. However, the overall experimental results indicate that our method is capable of identifying which real images were used to train the image generation model from the generated images. The results obtained from the development dataset differed somewhat from those of the test dataset, likely due to differences between the datasets.
Despite these differences, we successfully developed methods that achieved high F1-scores and accuracy in identifying used images across both datasets. These findings reinforce the hypothesis that synthetic images generated by deep generative models can potentially expose patient identities. 5. Conclusion We performed feature extraction on the images, using advanced deep learning models to obtain high-dimensional feature representations. Subsequently, we calculated the similarity of these extracted features, using cosine similarity to evaluate the similarity scores between each pair of image features. Based on these similarity scores, we accomplished the binary classification task, categorizing the images as either used or unused. This method allows us to effectively identify and classify images, providing a solid foundation for subsequent image processing and analysis. In conclusion, this paper and the ImageCLEFmedical GANs challenge contribute to raising awareness about the potential privacy risks associated with the use and sharing of synthetic medical data in real-world applications. We underscore the importance of implementing privacy protection techniques when developing deep generative models using sensitive medical data. 6. Acknowledgements This work is supported by the Open Project Program of Guangxi Key Laboratory of Digital Infrastructure (No. GXDINB2024001). References [1] L. Cai, J. Gao, D. Zhao, A review of the application of deep learning in medical image classification and segmentation, Annals of Translational Medicine 8 (2020). [2] D. Wang, Y. Zhang, K. Zhang, L. Wang, Focalmix: Semi-supervised learning for 3d medical image detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3951–3960. [3] K. Ramesh, G. K. Kumar, K. Swapna, D. Datta, S. S. Rajest, A review of medical image segmentation algorithms, EAI Endorsed Transactions on Pervasive Health and Technology 7 (2021) e6–e6. [4] N. Salem, H. Malik, A.
Shams, Medical image enhancement based on histogram algorithms, Procedia Computer Science 163 (2019) 300–311. [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (2020) 139–144. [6] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013). [7] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009. [8] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, J. Choo, Stargan: Unified generative adversarial networks for multi-domain image-to-image translation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8789–8797. [9] H. Chung, J. C. Ye, Feature disentanglement in generating three-dimensional structure from two-dimensional slice with slicegan, arXiv preprint arXiv:2105.00194 (2021). [10] J. T. Guibas, T. S. Virdi, P. S. Li, Synthetic medical images from dual generative adversarial networks, arXiv preprint arXiv:1709.01872 (2017). [11] B. Ionescu, H. Müller, A. Drăgulinescu, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. Pakull, H. Damm, B. Bracke, C. M. Friedrich, A. Andrei, Y. Prokopchuk, D. Karpenka, A. Radzhabov, V. Kovalev, C. Macaire, D. Schwab, B. Lecouteux, E. Esperança-Rodier, W. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, S. A. Hicks, M. A. Riegler, V. Thambawita, A. Storås, P. Halvorsen, M. Heinrich, J. Kiesel, M. Potthast, B. Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science LNCS, Grenoble, France, 2024.
[12] A. Andrei, A. Radzhabov, D. Karpenka, Y. Prokopchuk, V. Kovalev, B. Ionescu, H. Müller, Overview of 2024 ImageCLEFmedical GANs Task – Investigating Generative Models' Impact on Biomedical Synthetic Images, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [13] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014). [14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9. [15] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [16] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520. [17] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., Searching for mobilenetv3, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1314–1324. [18] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in: International conference on machine learning, PMLR, 2019, pp. 6105–6114.