=Paper=
{{Paper
|id=Vol-3084/paper4
|storemode=property
|title=Exploring the Asynchronous of the Frequency Spectra of GAN-Generated Facial Images
|pdfUrl=https://ceur-ws.org/Vol-3084/paper4.pdf
|volume=Vol-3084
|authors=Le Minh Binh,Simon S. Woo
}}
==Exploring the Asynchronous of the Frequency Spectra of GAN-Generated Facial Images==
Exploring the Asynchronous of the Frequency Spectra of GAN-Generated Facial Images

Le Minh Binh¹, Simon S. Woo²
¹ Department of Software, Sungkyunkwan University, South Korea
² Department of Applied Data Science, Sungkyunkwan University, South Korea
bmle@g.skku.edu (L. M. Binh); swoo@g.skku.edu (S. S. Woo)
ORCID: 0000-0002-4344-3421 (L. M. Binh); 0000-0002-8983-1542 (S. S. Woo)

International Workshop on Safety & Security of Deep Learning, IJCAI 2021.
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract. The rapid progression of Generative Adversarial Networks (GANs) has raised concerns about their misuse for malicious purposes, especially the creation of fake face images. Although many proposed methods succeed in detecting GAN-based synthetic images, they are still limited by the need for large quantities of fake training images and by the challenge of generalizing the detector to unknown facial images. In this paper, we propose a new approach that explores the asynchrony of the frequency spectra of the color channels, which is simple but effective for training both unsupervised and supervised learning models to distinguish GAN-based synthetic images. We further investigate the transferability of a model that learns from our suggested features in one source domain and is validated on other target domains with prior knowledge of the features' distribution. Our experimental results show that the discrepancy of the spectra in the frequency domain is a practical artifact for effectively detecting various types of GAN-based generated images.

Keywords: Asynchronous of frequency, GAN-based synthetic images

1. Introduction

In recent years, there has been tremendous progress in Generative Adversarial Networks (GANs), in which two modules (a generator and a discriminator) play a minimax game to produce highly realistic data. Unfortunately, in addition to several fruitful GAN applications, attackers can exploit GANs for malicious purposes, such as spreading fake news [1] or propagating fake pornography of celebrities [2], as shown in the past. Meanwhile, several efforts have been made by researchers [3, 4, 5] to resist these misuses. Wang et al. built a deep neural network to classify GAN-based generated images and empirically demonstrated that a classifier trained on a single dataset can generalize to different GAN datasets [4]. In particular, Dzanic and Shah [6] empirically showed a systematic bias in high spatial frequencies and used this characteristic to classify real and deep-network-generated images. However, they did not explore the deeper statistical frequency features that we propose in our work, and they simply converted the RGB image to grayscale, unlike ours. The checkerboard artifacts in the spectrum produced by the up-sampling components of GAN models were also extensively investigated by Zhang et al. [3] and Frank et al. [5]. While these techniques have shown success in terms of achieving high accuracy, they typically require large quantities of training data and incur high computational complexity, which can be prohibitively expensive in many practical applications.

Figure 1: The frequency spectra differences of real vs. fake images. Top: real images and their corresponding concurrent spectra. Bottom: fake images and their corresponding (chaotic) spectra. It is difficult for human eyes to distinguish between real and fake images, but after applying the DFT to each channel of the images, vital clues to distinguish real vs. fake images can be discovered.
As a consequence, there is a strong need for detection methodologies that can achieve comparable levels of performance with limited training data, low computational requirements, and better generalization to unknown GAN-based generated images.

In this work, we first observe that the frequencies of the channels in real images are highly correlated, as shown in Fig. 1 and Fig. 3. In fact, the discriminator in a GAN can make the generated synthesized images highly realistic and close to real images. However, to the best of our knowledge, there has not been any attempt to apply a direct correlation constraint between channels on the output images in the frequency domain in GAN models. From this observation, we hypothesize that the insufficiency of channel-dependent training in most current GAN models can produce channel-wise asynchrony in the frequency domain. This asynchrony can be exposed in various GAN datasets and can be effectively exploited to distinguish GAN-based synthetic images with both unsupervised and supervised learning methods.

In order to demonstrate the effectiveness of our proposed approach, we experiment with four types of datasets: Fake Head Talker [7], StyleGAN [8], StarGAN [9], and Adversarial Latent Auto Encoder (ALAE) [10]. Our main contributions in this work are summarized as follows:

• We introduce the asynchrony in the frequency domain for detecting GAN-generated fake images.
• We propose effective unsupervised and supervised learning models that use our discriminative features to classify real and fake images.
• We demonstrate the transferability of our proposed learning models across different GAN-based synthetic datasets using our spectral features.

Figure 2: Illustration of our end-to-end pipeline. Unlike other research [6] that applies the DFT to a grayscale image, we focus on the information of each channel and utilize statistical methods to obtain discriminative features.

2. Our Methods

2.1. Fourier Spectrum Analysis

In this section, we construct our hypothesis on the channel-wise asynchrony of GAN-based generated images. The convolutional operation at the $n$-th layer of a generative model is formulated as follows:

$$A_i^{n+1} = \mathrm{Conv}_i^n(A^n) = \sigma\Big(\sum_{j=1}^{C^{(n)}} F_{ij}^n \circledast \mathrm{up}(A_j^n)\Big), \qquad (1)$$

where $C^{(n)}$ is the number of channels of the $n$-th layer's output $A^n$, and $F^n \in \mathbb{R}^{k_n \times k_n \times C^{(n+1)} \times C^{(n)}}$ is a set of $C^{(n+1)} \times C^{(n)}$ trainable 2D filters of size $k_n \times k_n$. Here, $\mathrm{up}(\cdot)$, $\circledast$ and $\sigma(\cdot)$ denote the up-sampling operator, the convolutional operator and the activation function, respectively.

According to Khayatkhoei and Elgammal [11], we can simplify Eq. 1 by restricting $\sigma(\cdot)$ to rectified linear units (ReLU), which makes $\mathrm{Conv}_i^n(A^n)$ locally piece-wise linear, and by absorbing the up-sampling $\mathrm{up}(\cdot)$ into $A_j^n$. In this way, we transform Eq. 1 to:

$$A_i^{n+1} = \mathrm{Conv}_i^n(A^n) = \sum_{j=1}^{C} F_{ij}^n \circledast A_j^n. \qquad (2)$$

By applying the 2D discrete Fourier transform (DFT) to $A_i^{n+1}$, it can be viewed in the frequency domain as follows:

$$\hat{A}_i^{n+1} = \mathcal{F}(A_i^{n+1}) = \mathcal{F}\Big(\sum_{j=1}^{C} F_{ij}^n \circledast A_j^n\Big)$$
$$= \sum_{j=1}^{C} \mathcal{F}\big(F_{ij}^n \circledast A_j^n\big) \quad \text{(linearity property of the FT)}$$
$$= \sum_{j=1}^{C} \mathcal{F}(F_{ij}^n) \times \mathcal{F}(A_j^n) \quad \text{(convolution property of the FT)}$$
$$= \sum_{j=1}^{C} \hat{F}_{ij}^n \times \hat{A}_j^n = \langle \hat{F}_i^n, \hat{A}^n \rangle, \qquad (3)$$

where $\hat{F}_i^n = (\hat{F}_{i1}^n, \ldots, \hat{F}_{iC}^n)^T$ and $\hat{A}^n = (\hat{A}_1^n, \ldots, \hat{A}_C^n)^T$.
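The step from Eq. 2 to Eq. 3 relies only on the linearity and convolution properties of the Fourier transform. The snippet below is a small numerical check we add for illustration (it is not part of the original paper); it assumes circular convolution on a single toy channel so that the DFT factorizes exactly.

```python
import numpy as np

# Toy check of the convolution property used in Eq. (3):
# DFT(F_ij (*) A_j) == DFT(F_ij) * DFT(A_j) for circular convolution.
rng = np.random.default_rng(0)
N = 8
A = rng.standard_normal((N, N))   # one channel A_j^n
K = rng.standard_normal((N, N))   # one filter F_ij^n, padded to N x N

# Explicit circular convolution computed in the spatial domain.
conv = np.zeros((N, N))
for u in range(N):
    for v in range(N):
        for x in range(N):
            for y in range(N):
                conv[u, v] += K[x, y] * A[(u - x) % N, (v - y) % N]

# The DFT of the convolution equals the element-wise product of the DFTs.
print(np.allclose(np.fft.fft2(conv), np.fft.fft2(K) * np.fft.fft2(A)))  # True
```

For the linear (zero-padded) convolutions used in actual networks, the same relation holds once both arrays are padded to the full output size, so the channel-mixing interpretation below carries over.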
Equation 3 indicates that, in the frequency domain, every channel of the next layer is decomposed into a combination of all of the previous layer's channels with different sets of coefficients. If we consider $A^{n+1}$ as the synthesized output image and fix $\hat{A}^n$, every vector $\hat{F}_i^n$, $i = 1, \ldots, C^{(n+1)}$, is trained independently to minimize the loss applied to each $\hat{A}_i^{n+1}$. When we consider $\hat{A}^n$ as a basis and $\hat{F}_i^n$ as the coordinates of $\hat{A}_i^{n+1}$ with respect to $\hat{A}^n$, then, to synthesize a new image, the generative model expects $\hat{A}^n$ to be good enough so that each independent coefficient vector $\hat{F}_i^n$ can produce its corresponding single channel. In addition, these output channels should look as natural as possible in the spatial domain after being stacked together in the order of the three color channels: Red, Green and Blue. Small shifts of $\hat{F}_i^n$, $i = 1, \ldots, C^{(n+1)}$, in the frequency domain may only change fine-grained details of the visualization, but they can produce a frequency bias when there is no direct constraint between channels.

2.2. Descriptive Features Extraction

Let $I$ be a color image with three channels, Red, Green and Blue, with width $W$ and height $H$. To create its frequency representation, we first apply the 2D DFT to each channel as follows:

$$\mathcal{F}_{I_{R/G/B}}(u, v) = \sum_{x=1}^{W} \sum_{y=1}^{H} I_{R/G/B}(x, y) \cdot e^{-i 2\pi \left(\frac{ux}{W} + \frac{vy}{H}\right)}, \qquad (4)$$

where $x$ and $y$ denote the $x$-th and $y$-th slice in the width and height dimensions of $I$. For convenience, we use the notation $\mathcal{F}_{I_{R/G/B}}$ to represent the function that is applied independently to each channel of the image. Note that $\mathcal{F}_{I_{R/G/B}}(u, v)$ is a complex number, i.e., $\mathcal{F}_{I_{R/G/B}}(u, v) \in \mathbb{C}$, and the spectrum of each channel is obtained as follows:

$$\mathrm{Spec}_{R/G/B}(u, v) = \mathrm{mod}\big(\mathcal{F}_{I_{R/G/B}}(u, v)\big), \qquad (5)$$

where $\mathrm{mod}(\cdot)$ denotes the modulus of a complex number.

Although it would be challenging for human eyes to distinguish between real and GAN-generated fake images, we believe that their frequency spectra differences can be exposed when we stack the three channels' spectra of real vs. fake images. Figure 1 presents example image spectra from the VoxCeleb2 dataset [12] and the Fake Head Talker dataset [13]. In particular, for the real images, we empirically find that the spectra of the three color channels are mostly concurrent when stacked together, whereas they become noisy for the fake images, as shown in Fig. 1.

Based on this important observation, we propose the following key statistical descriptive features to discriminate GAN images in the frequency domain: $RMean$, $RMax$, $RMin$, $RCorr_{RG}$, $RCorr_{RB}$, and $RCorr_{GB}$, with the details presented below:

• $RMean$. We take the average of the channel-wise spectrum differences:
$$RMean = \frac{D_{RG} + D_{RB} + D_{GB}}{3}, \qquad (6)$$
where $D_{RG}$ is the average spectrum difference between the spectra of the Red and Green channels of an image:
$$D_{RG} = \frac{1}{W H} \sum_{u=1}^{W} \sum_{v=1}^{H} \big|\mathrm{Spec}_R(u, v) - \mathrm{Spec}_G(u, v)\big|, \qquad (7)$$
and $D_{RB}$ and $D_{GB}$ are defined similarly.
• $RMax$ and $RMin$. We take the maximum and minimum values in $\{D_{RG}, D_{RB}, D_{GB}\}$.
• $RCorr_{RG}$. We calculate the correlation coefficient between $\mathrm{Spec}_R$ and $\mathrm{Spec}_G$ and shift it into a positive range by adding 1 to its negated value:
$$RCorr_{RG} = -\rho(\mathrm{Spec}_R, \mathrm{Spec}_G) + 1, \qquad (8)$$
where $\rho$ is the Pearson correlation coefficient, and $RCorr_{RB}$ and $RCorr_{GB}$ are defined similarly.

Our end-to-end pipeline for extracting the above frequency descriptive features from a given image is visually illustrated in Fig. 2.
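To make the feature definitions concrete, the following Python sketch computes the six descriptive features of Eqs. 4–8 for a single RGB image. It is only an illustrative implementation under our reading of the notation: the function name, the use of NumPy's FFT, and the absence of any normalization or log-scaling are assumptions, not the authors' released code.

```python
import numpy as np

def descriptive_features(img):
    """Six channel-asynchrony features from an RGB image (H x W x 3 array).

    Sketch of Section 2.2: per-channel DFT magnitude spectra (Eqs. 4-5),
    pairwise mean absolute differences D_RG, D_RB, D_GB (Eq. 7), and the
    shifted Pearson correlations RCorr_* = -rho + 1 (Eq. 8).
    """
    img = img.astype(np.float64)
    spec = {c: np.abs(np.fft.fft2(img[..., k])) for k, c in enumerate("RGB")}

    pairs = [("R", "G"), ("R", "B"), ("G", "B")]
    diffs = {a + b: np.mean(np.abs(spec[a] - spec[b])) for a, b in pairs}
    corrs = {a + b: -np.corrcoef(spec[a].ravel(), spec[b].ravel())[0, 1] + 1
             for a, b in pairs}

    d = list(diffs.values())
    return {
        "RMean": sum(d) / 3.0,
        "RMax": max(d),
        "RMin": min(d),
        "RCorr_RG": corrs["RG"],
        "RCorr_RB": corrs["RB"],
        "RCorr_GB": corrs["GB"],
    }

# Example call on a random array standing in for a loaded 224 x 224 image.
print(descriptive_features(np.random.rand(224, 224, 3)))
```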
2.3. Binary Classifier

To demonstrate the characteristic-defining ability of the spectrum disagreement, we first employ simple classifiers to separate real and fake images, as described below:

• Gaussian Mixture Model (GMM). A GMM is a probabilistic model that assumes the distribution of the observed data points is composed of a mixture of several Gaussian distributions; in our case, the two distributions of the real and fake classes. To determine the means and variances of the two Gaussian distributions, the Expectation-Maximization (EM) algorithm is used to estimate these parameters iteratively. Our descriptive features are constructed so that an image with higher spectrum agreement has lower descriptive values. Therefore, we can expect that the Gaussian distribution in the mixture model with the smaller expectation represents the real images' distribution, while the other represents the fake images' distribution. By applying EM, we can classify real and fake images in an unsupervised manner, in which the labels of the training dataset are not required.
• Support Vector Machine (SVM). The SVM is a robust supervised learning method that maximizes the margin of the hyperplanes between different classes. The samples that lie along the margins are called the support vectors. In our experiment, we use an SVM with the radial basis function (RBF) kernel trained on our six proposed features.
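As a rough illustration of how these two classifiers can be set up on the six features, the sketch below uses scikit-learn's GaussianMixture and SVC. The rule of taking the component with the smaller mean as the real class follows the description above; the synthetic feature matrix, sample sizes, and hyper-parameters are assumptions made only for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

# X: (n_samples, 6) matrix of descriptive features; y: 0 = real, 1 = fake.
# Synthetic data stands in for features extracted from a real dataset.
rng = np.random.default_rng(0)
X_real = rng.normal(loc=0.2, scale=0.05, size=(500, 6))  # higher agreement -> lower values
X_fake = rng.normal(loc=0.8, scale=0.10, size=(500, 6))
X = np.vstack([X_real, X_fake])
y = np.array([0] * 500 + [1] * 500)

# Unsupervised: two-component GMM fitted with EM; the component with the
# smaller mean is interpreted as the real class.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
real_component = np.argmin(gmm.means_.mean(axis=1))
pred_gmm = (gmm.predict(X) != real_component).astype(int)
print("GMM accuracy:", (pred_gmm == y).mean())

# Supervised: SVM with an RBF kernel on the same six features.
svm = SVC(kernel="rbf").fit(X, y)
print("SVM accuracy:", svm.score(X, y))
```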
3. Experiment

3.1. Datasets

To examine the effectiveness of our proposed frequency features, we experiment with four types of datasets: Fake Head Talker [7], StyleGAN [8], StarGAN [9], and Adversarial Latent Auto Encoder (ALAE) [10]. A brief description of each dataset is provided below as well as in Table 1:

• Fake Head Talker dataset [7]. Fake Head Talker is generated by a few-shot learning system that is pre-trained extensively on a large dataset (meta-learning). In particular, their approach includes an embedder, a generator, and a discriminator. After training on a large corpus of talking-head videos of different faces with adversarial training, their approach can transform facial landmarks from a source frame into realistic-looking personalized photographs given only a few photos of a new target person, and can further mimic the target.
• StyleGAN dataset [8]. StyleGAN is a high-level style-controlling approach that governs its generator through adaptive instance normalization (AdaIN) and the addition of Gaussian noise in each convolutional layer. Furthermore, by proposing two novel metrics, perceptual path length and linear separability, the generated images are less entangled and have different factors of variation.
• StarGAN dataset [9]. StarGAN is a unified model architecture that is able to train on multiple datasets across different domains. By proposing a simple mask vector, StarGAN is able to flexibly utilize multiple datasets containing different label sets and achieves competitive results in facial attribute transfer tasks. This approach, with only a single generator and a discriminator, addresses the scalability and robustness limitations of much previous research.
• Adversarial Latent Auto Encoder (ALAE) dataset [10]. ALAE is an autoencoder-based generative model that is capable of learning disentangled representations in the latent space with adversarial settings. The ALAE model can not only synthesize high-resolution images comparable with StyleGAN, but can also further manipulate or reconstruct new input facial images.

Details of each dataset in our experiment are summarized in Table 1. In our experiment, the numbers of real and GAN fake images are equal in both the training and test sets. We further provide histograms that visualize the distributions of the six descriptive features on these datasets in Supp. Section A.

Table 1: Details of the datasets used in our experiment.

Datasets             | Resolution  | Source dataset | Training size (real + fake) | Test size (real + fake)
Fake Head Talker [7] | 224 × 224   | VoxCeleb2 [12] | 18,800                      | 18,800
StyleGAN [8]         | 1024 × 1024 | FFHQ           | 2,000                       | 2,000
StarGAN [9]          | 256 × 256   | CelebA [14]    | 2,000                       | 1,998
ALAE [10]            | 1024 × 1024 | FFHQ           | 2,000                       | 2,000

3.2. Experimental Results

To demonstrate the discriminative power of our proposed features, we perform three different experiments.

Binary classification. Our experimental results are shown in Table 2. We can observe that both the unsupervised and supervised methods produce high performance with our newly introduced frequency features. The accuracy scores of the unsupervised method on the Fake Head Talker and ALAE datasets are competitive compared to the supervised approaches. At the same time, they are still higher than 80% on StyleGAN and StarGAN. Meanwhile, the supervised method's accuracy scores are always higher than 95% on the four datasets. We conclude that our proposed features based on the asynchrony in the frequency spectrum effectively capture the characteristics of GAN-generated images and provide a foundation for distinguishing fake from real images.
Table 2: Experimental results of the GMM and SVM on four datasets with our discriminative features.

Datasets         | GMM Accuracy | GMM Recall | GMM Precision | GMM F1 | SVM Accuracy | SVM Recall | SVM Precision | SVM F1
Fake Head Talker | 0.996        | 1.000      | 0.991         | 0.996  | 0.9972       | 0.994      | 1.000         | 0.997
StyleGAN         | 0.849        | 0.762      | 0.915         | 0.831  | 0.951        | 0.938      | 0.963         | 0.950
StarGAN          | 0.903        | 0.807      | 1.000         | 0.893  | 0.972        | 0.949      | 0.994         | 0.971
ALAE             | 0.992        | 0.984      | 1.000         | 0.992  | 0.999        | 0.997      | 1.000         | 0.998

Unbalanced training datasets. Furthermore, to study the feasibility of training with an unbalanced dataset using our features, we gradually reduce the number of fake images in each training dataset to 25%, 5%, and 1% of the total training data size. After that, we apply the SVM as our learning model. To demonstrate our approach's effectiveness, we compare our method with the FakeTalkerDetect model [13], which deploys a pre-trained AlexNet and a Siamese network trained on RGB images. The results are presented in Table 3. We can observe that our method with the hand-crafted features outperforms AlexNet and FakeTalkerDetect on both balanced and unbalanced datasets. Therefore, we conclude that our simple yet effective features are able to characterize the fake images much better in the unbalanced training dataset scenario as well.

Table 3: Comparison between our approach using the proposed descriptive features and the AlexNet and FakeTalkerDetect methods on the Fake Head Talker dataset. The precision, recall and F1 scores of AlexNet and FakeTalkerDetect are from [13], and their values are rounded to the second decimal.

Methods            | Accuracy | Recall | Precision | F1
AlexNet (50% fake) | 0.981    | 0.98   | 0.98      | 0.98
FakeTalkerDetect   | 0.984    | 0.98   | 0.98      | 0.98
SVM (ours)         | 0.997    | 0.994  | 1.00      | 0.997
AlexNet (25% fake) | 0.971    | 0.95   | 0.95      | 0.96
FakeTalkerDetect   | 0.986    | 0.98   | 0.98      | 0.98
SVM (ours)         | 0.998    | 0.995  | 1.000     | 0.997
AlexNet (5% fake)  | 0.964    | 0.98   | 0.80      | 0.87
FakeTalkerDetect   | 0.988    | 0.99   | 0.91      | 0.94
SVM (ours)         | 0.997    | 0.997  | 0.997     | 0.997
AlexNet (1% fake)  | 0.963    | 0.98   | 0.61      | 0.67
FakeTalkerDetect   | 0.988    | 0.99   | 0.74      | 0.82
SVM (ours)         | 0.992    | 0.999  | 0.986     | 0.992

Unsupervised domain adaptation. In this task, we propose an algorithm using our proposed features that allows an SVM model pre-trained on one source dataset (e.g., StyleGAN) to detect fake images in a new target dataset (e.g., ALAE) with only prior knowledge of the target feature expectations.

In particular, we first take the two Gaussian expectation values of the two mixture distributions of each feature from both the source and the target dataset. These expectation values are kept as our prior knowledge about the target dataset. We then scale the source training set features such that their two Gaussian expectation values are normalized between 0 and 1, to better fit the training dataset with the SVM model. In the testing phase, with the prior knowledge above, we scale the test features from the target dataset using the known expectation values and feed them to the pre-trained SVM model to make predictions. This adaptation learning process is summarized in Algorithm 1.

Algorithm 1: Unsupervised domain adaptation with an SVM using our proposed descriptive features.
Require: A labeled source set $\{X^s, y^s\}$ and an unlabeled target set $X^t$, where $X^s$ and $X^t$ include the six proposed features $[f_1^s, \ldots, f_6^s]$ and $[f_1^t, \ldots, f_6^t]$, respectively, and the prior knowledge of the Gaussian expectation values $\big[\mu_{i,0}^s, \mu_{i,1}^s\big]_{i=1,\ldots,6}$ and $\big[\mu_{i,0}^t, \mu_{i,1}^t\big]_{i=1,\ldots,6}$.
Step 1: Scale each feature in $X^s$ and $X^t$: $\bar{f}_i^s = \big(f_i^s - \mu_{i,0}^s\big) / \big(\mu_{i,1}^s - \mu_{i,0}^s\big)$ and $\bar{f}_i^t = \big(f_i^t - \mu_{i,0}^t\big) / \big(\mu_{i,1}^t - \mu_{i,0}^t\big)$.
Step 2: Fit the source set $\big\{[\bar{f}_1^s, \ldots, \bar{f}_6^s], y^s\big\}$ with the SVM model.
Step 3: Use the pre-trained SVM to predict the target set labels from $[\bar{f}_1^t, \ldots, \bar{f}_6^t]$.
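A compact Python rendering of Algorithm 1 could look as follows. This is a sketch under our reading of the pseudo-code: obtaining the expectation values $\mu_{i,0} < \mu_{i,1}$ by fitting a per-feature two-component GMM, and the names X_src, y_src, and X_tgt, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def gaussian_expectations(X):
    """Per-feature expectations (mu_0 < mu_1) from a two-component 1D GMM."""
    mus = []
    for i in range(X.shape[1]):
        gm = GaussianMixture(n_components=2, random_state=0).fit(X[:, i:i + 1])
        mus.append(np.sort(gm.means_.ravel()))
    return np.array(mus)                      # shape (n_features, 2)

def scale(X, mus):
    """Step 1: map each feature so that its two expectations land on 0 and 1."""
    return (X - mus[:, 0]) / (mus[:, 1] - mus[:, 0])

def domain_adapt_predict(X_src, y_src, X_tgt):
    """Steps 1-3 of Algorithm 1: scale, fit the SVM on the source, predict the target."""
    mus_src = gaussian_expectations(X_src)    # prior knowledge of the source
    mus_tgt = gaussian_expectations(X_tgt)    # prior knowledge of the target
    svm = SVC(kernel="rbf").fit(scale(X_src, mus_src), y_src)   # Step 2
    return svm.predict(scale(X_tgt, mus_tgt))                   # Step 3
```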
We experiment with the four fake datasets and present the results in Table 4. We can observe that, with our suggested features, the pre-trained SVM shows strong detection ability in the new target domain, where the detection accuracy is above 80% for any pair of source and target datasets. This preliminary experiment shows that our proposed features can be utilized in domain adaptation tasks with more complex learning models in the future.

Table 4: Experimental results of the domain adaptation task using our proposed features.

Source dataset   | Target dataset   | Accuracy | Recall | Precision | F1
Fake Head Talker | StyleGAN         | 0.800    | 0.908  | 0.745     | 0.819
Fake Head Talker | StarGAN          | 0.918    | 0.844  | 0.992     | 0.912
Fake Head Talker | ALAE             | 0.994    | 0.995  | 0.993     | 0.994
StyleGAN         | Fake Head Talker | 0.965    | 0.932  | 0.998     | 0.964
StyleGAN         | StarGAN          | 0.906    | 0.814  | 0.998     | 0.896
StyleGAN         | ALAE             | 0.991    | 0.982  | 1.000     | 0.991
StarGAN          | Fake Head Talker | 0.983    | 0.982  | 0.983     | 0.983
StarGAN          | StyleGAN         | 0.832    | 0.980  | 0.756     | 0.854
StarGAN          | ALAE             | 0.996    | 0.997  | 0.994     | 0.996
ALAE             | Fake Head Talker | 0.993    | 0.989  | 0.998     | 0.993
ALAE             | StyleGAN         | 0.890    | 0.955  | 0.845     | 0.897
ALAE             | StarGAN          | 0.929    | 0.871  | 0.985     | 0.925

4. Conclusion

Although GANs have advanced significantly in the past, we discover that there are some areas in which GANs cannot mimic real images effectively in the frequency domain. Thus, in this work, we propose a preliminary approach that reveals the asynchrony in the frequency domain between the three channels of GAN images. By mining our statistical features in the frequency domain, our simple yet effective unsupervised and supervised learning methods can easily discriminate real and GAN-based synthetic facial images without utilizing deep learning methods. Our extensive experiments demonstrate the proposed features' power in three scenarios: 1) unsupervised and supervised binary classification, 2) unbalanced training datasets, and 3) the domain adaptation task. For future work, we plan to explore and exploit these aspects of GAN-generated images further to combat misuse by attackers, and to extend our work to deepfake detection.

Acknowledgments

This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00421, AI Graduate School Support Program (Sungkyunkwan University)), (No. 2019-0-01343, Regional strategic industry convergence security core talent training business) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) grant funded by the Korea government MSIT (No. 2020R1C1C1006004). Additionally, this research was partly supported by an IITP grant funded by the Korea government MSIT (No. 2021-0-00017, Original Technology Development of Artificial Intelligence Industry) and was partly supported by the Korea government MSIT under the High-Potential Individuals Global Training Program (2019-0-01579) supervised by the IITP.

References

[1] T. Quandt, L. Frischlich, S. Boberg, T. Schatto-Eckrodt, Fake news, The International Encyclopedia of Journalism Studies (2019) 1–6.
[2] S. Cole, We are truly fucked: Everyone is making AI-generated fake porn now, 2018. URL: https://www.vice.com/en/article/bjye8a/reddit-fake-porn-app-daisy-ridley.
[3] X. Zhang, S. Karaman, S.-F. Chang, Detecting and simulating artifacts in GAN fake images, in: 2019 IEEE International Workshop on Information Forensics and Security (WIFS), IEEE, 2019, pp. 1–6.
[4] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, A. A. Efros, CNN-generated images are surprisingly easy to spot... for now, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8695–8704.
[5] J. Frank, T. Eisenhofer, L. Schönherr, A. Fischer, D. Kolossa, T. Holz, Leveraging frequency analysis for deep fake image recognition, in: International Conference on Machine Learning, PMLR, 2020, pp. 3247–3258.
[6] T. Dzanic, K. Shah, F. Witherden, Fourier spectrum discrepancies in deep network generated images, arXiv preprint arXiv:1911.06465 (2019).
[7] E. Zakharov, A. Shysheya, E. Burkov, V. Lempitsky, Few-shot adversarial learning of realistic neural talking head models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9459–9468.
[8] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
[9] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, J. Choo, StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.
[10] S. Pidhorskyi, D. A. Adjeroh, G. Doretto, Adversarial latent autoencoders, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14104–14113.
[11] M. Khayatkhoei, A. Elgammal, Spatial frequency bias in convolutional generative adversarial networks, arXiv preprint arXiv:2010.01473 (2020).
[12] A. Nagrani, J. S. Chung, W. Xie, A. Zisserman, VoxCeleb: Large-scale speaker verification in the wild, Computer Speech and Language (2019).
[13] H. Jeon, Y. Bang, S. S. Woo, FakeTalkerDetect: Effective and practical realistic neural talking head detection with a highly unbalanced dataset, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
[14] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of the International Conference on Computer Vision (ICCV), 2015.

A. Distribution of Statistical Descriptive Features

The histogram distributions of our six proposed statistical features in the frequency domain are presented in Fig. 3. We can observe that these feature distributions are highly separable between real and fake images across the four datasets.

Figure 3: The histogram distributions of our six statistical descriptive features on the four datasets: Fake Head Talker, StyleGAN, StarGAN and ALAE.

B. Example GAN-based Synthetic Images

We provide example images from the four datasets used in our experiment.