Exploring the Asynchronous of the Frequency Spectra of
GAN-Generated Facial Images
Le Minh Binh¹, Simon S. Woo²
¹ Department of Software, Sungkyunkwan University, South Korea
² Department of Applied Data Science, Sungkyunkwan University, South Korea


Abstract
The rapid progress of Generative Adversarial Networks (GANs) has raised concerns about their misuse for malicious purposes, especially in creating fake face images. Although many proposed methods succeed in detecting GAN-based synthetic images, they are still limited by the need for large quantities of fake training images and by challenges in generalizing the detector to unknown facial images. In this paper, we propose a new approach that explores the asynchrony between the frequency spectra of the color channels, which is simple but effective for training both unsupervised and supervised learning models to distinguish GAN-based synthetic images. We further investigate the transferability of a model that learns from our suggested features in one source domain and is validated on other target domains with prior knowledge of the features' distribution. Our experimental results show that the discrepancy of the spectra in the frequency domain is a practical artifact for effectively detecting various types of GAN-generated images.

Keywords
Asynchrony of frequency spectra, GAN-based synthetic images



1. Introduction
In recent years, there has been tremendous progress in
Generative Adversarial Networks (GANs), in which two
modules (a generator and a discriminator) play a minimax
game to produce highly realistic data. Unfortunately, in
addition to several fruitful GAN applications, attackers
can exploit GANs for malicious purposes, such as spread-
ing fake news [1] or propagating fake pornography of
celebrities [2] as shown in the past. Meanwhile, several
efforts have been made by researchers [3, 4, 5] to re-
sist these misuses. Wang et al. built a deep neural
network to classify GAN-generated images and
empirically demonstrated that a classifier trained on a
single dataset can generalize to different GAN datasets [4].
Notably, Dzanic and Shah [6] empirically showed a
systematic bias in the high spatial frequencies and used this
characteristic to classify real and deep-network-generated
images. However, they did not explore the deeper
statistical frequency features that we propose in this work,
and they focused on converting RGB images to grayscale,
unlike our channel-wise approach.
   Also, the checkerboard artifacts in the spectrum generated by the up-sampling components of GAN models were extensively investigated by Zhang et al. [3] and Frank et al. [5]. While these techniques have shown success in terms of achieving high accuracy, they typically require large quantities of training data and incur high computational complexity, which can be prohibitively expensive in many practical applications. As a consequence, there is a strong need for detection methodologies that can achieve comparable levels of performance with limited training data, low computational requirements, and better generalization to unknown GAN-based generated images.

Figure 1: The frequency spectra differences of real vs. fake images. Top: real images and their corresponding concurrent spectra. Bottom: fake images and their corresponding (chaotic) spectra. It is difficult for human eyes to distinguish between real and fake images, but after applying the DFT to each channel of the images, vital clues for distinguishing real vs. fake images can be discovered.

International Workshop on Safety & Security of Deep Learning, IJCAI 2021
bmle@g.skku.edu (L. M. Binh); swoo@g.skku.edu (S. S. Woo)
ORCID: 0000-0002-4344-3421 (L. M. Binh); 0000-0002-8983-1542 (S. S. Woo)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in CEUR Workshop Proceedings (CEUR-WS.org).
Figure 2: Illustration of our end-to-end pipeline (R, G, B channels → Discrete Fourier transform → feature extraction → descriptive features). Unlike other research [6] that applies the DFT to a grayscale image, we focus on each channel's information and utilize statistical methods to obtain discriminative features.



   In this work, we first observe that the frequencies of the channels in real images are highly correlated, as shown in Fig. 1 and Fig. 3. In fact, the discriminator in the GAN model pushes the generated synthesized images to be highly realistic, close to real images. However, to the best of our knowledge, there has been no attempt to apply a direct correlation constraint between channels of the output images in the frequency domain in GAN models. From this observation, we hypothesize that the insufficiency of channel-dependent training in most current GAN models can produce channel-wise asynchrony in the frequency domain. This asynchrony can be exposed in various GAN datasets and can be effectively used to distinguish GAN-based synthetic images with both unsupervised and supervised learning methods.
   In order to demonstrate the effectiveness of our proposed approach, we experiment with four types of datasets: Fake Head Talker [7], StyleGAN [8], StarGAN [9], and Adversarial Latent Auto Encoder (ALAE) [10]. Our main contributions in this work are summarized as follows:

      • We first introduce the asynchrony in the frequency domain for detecting GAN-generated fake images.
      • We propose effective unsupervised and supervised learning models using our discriminative mining features to classify real and fake images.
      • We demonstrate the transferability of our proposed learning models across different GAN-based synthetic datasets using our spectra features.

2. Our Methods

2.1. Fourier Spectrum Analysis

In this section, we construct our hypothesis on the channel-wise asynchrony of GAN-based generated images. The convolutional operation at the $l$-th layer of a generative model is formulated as follows:

$$A_i^{l+1} = \mathrm{Conv}_i^l(A^l) = \sigma\left(\sum_{c=1}^{C_{(l)}} F_{ic}^{l} \circledast \mathrm{Up}(A_c^l)\right), \qquad (1)$$

where $C_{(l)}$ is the number of channels of the $l$-th layer's output $A^l$, and $F^l \in \mathbb{R}^{k_l \times k_l \times C_{(l+1)} \times C_{(l)}}$ is a set of $C_{(l+1)} \times C_{(l)}$ trainable 2D filters of size $k_l \times k_l$. $\mathrm{Up}(\cdot)$, $\circledast$, and $\sigma(\cdot)$ denote the up-sampling operator, the convolution operator, and the activation function, respectively. According to Khayatkhoei and Elgammal [11], we can simplify Eq. 1 by restricting $\sigma(\cdot)$ to rectified linear units (ReLU), which makes $\mathrm{Conv}_i^l(A^l)$ locally piece-wise linear, and by absorbing the up-sampling $\mathrm{Up}(\cdot)$ into $A_c^l$. In this way, we transform Eq. 1 to:

$$A_i^{l+1} = \mathrm{Conv}_i^l(A^l) = \sum_{c=1}^{C} F_{ic}^{l} \circledast A_c^l. \qquad (2)$$

By applying the 2D discrete Fourier transform (DFT) to $A_i^{l+1}$, it can be viewed in the frequency domain as follows:

$$\tilde{A}_i^{l+1} = \mathcal{F}(A_i^{l+1}) = \mathcal{F}\left(\sum_{c=1}^{C} F_{ic}^{l} \circledast A_c^l\right)
= \sum_{c=1}^{C} \mathcal{F}\left(F_{ic}^{l} \circledast A_c^l\right) \quad \text{(linearity of the FT)}
= \sum_{c=1}^{C} \mathcal{F}\left(F_{ic}^{l}\right) \times \mathcal{F}\left(A_c^l\right) \quad \text{(convolution property of the FT)}
= \sum_{c=1}^{C} \tilde{F}_{ic}^{l} \times \tilde{A}_c^{l} = \langle \tilde{F}_i^{l}, \tilde{A}^{l} \rangle, \qquad (3)$$
where $\tilde{F}_i^{l} = (\tilde{F}_{i1}^{l}, ..., \tilde{F}_{iC}^{l})^T$ and $\tilde{A}^{l} = (\tilde{A}_{1}^{l}, ..., \tilde{A}_{C}^{l})^T$. Equation 3 indicates that, in the frequency domain, every channel of the next layer is decomposed into a combination of all of the previous layer's channels with different sets of coefficients. If we consider $A^{l+1}$ as the synthesized output image and fix $\tilde{A}^{l}$, every vector $\tilde{F}_i^{l}$, $i = 1, .., C_{(l+1)}$, is trained independently to minimize the loss applied to each $\tilde{A}_i^{l+1}$. When we consider $\tilde{A}^{l}$ as a basis and $\tilde{F}_i^{l}$ as the coordinates of $\tilde{A}_i^{l+1}$ with respect to $\tilde{A}^{l}$, then to synthesize a new image the generative model expects $\tilde{A}^{l}$ to be good enough that each independent coefficient vector $\tilde{F}_i^{l}$ can produce the corresponding single channel. In addition, these output channels should look as natural as possible in the spatial domain after being stacked together in the order of the three color channels: Red, Green, and Blue. Small shifts of $\tilde{F}_i^{l}$, $i = 1, .., C_{(l+1)}$, in the frequency domain may only change fine-grained visual details, but they can produce a frequency bias when there is no direct constraint between channels.
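To make the derivation in Eq. 3 concrete, here is a minimal numerical check (our own illustration with toy random tensors, not code from the paper) that the DFT of a sum of channel-wise convolutions equals the sum of element-wise products of the channel DFTs; the identity is exact when the convolution is circular.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 8, 8                      # toy channel count and spatial size
A = rng.standard_normal((C, H, W))     # previous layer's channels A^l_c
F = rng.standard_normal((C, H, W))     # filters F^l_ic, padded to H x W

def circ_conv2d(f, a):
    """2D circular convolution computed via the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(a)))

# Spatial-domain output channel (Eq. 2) and its DFT: left-hand side of Eq. 3
A_next = sum(circ_conv2d(F[c], A[c]) for c in range(C))
lhs = np.fft.fft2(A_next)

# Right-hand side of Eq. 3: sum of products of the per-channel DFTs
rhs = sum(np.fft.fft2(F[c]) * np.fft.fft2(A[c]) for c in range(C))

print(np.allclose(lhs, rhs))           # True
```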
2.2. Descriptive Features Extraction                                                  between 𝑆𝑝𝑒𝑐𝑅 and 𝑆𝑝𝑒𝑐𝐺 and transform it to
Let $\mathcal{I}$ be a color image with three channels (Red, Green, and Blue), of width $W$ and height $H$. To create its frequency representation, we first apply the 2D DFT to each channel as follows:

$$\mathcal{F}_{\mathcal{I}_{R/G/B}}(u, v) = \sum_{x=1}^{W} \sum_{y=1}^{H} \mathcal{I}_{R/G/B}(x, y) \cdot e^{-i 2\pi \left(\frac{ux}{W} + \frac{vy}{H}\right)}, \qquad (4)$$

where $x$ and $y$ denote the $x$-th and $y$-th slice in the width and height dimensions of $\mathcal{I}$. For convenience, we use the notation $\mathcal{F}_{\mathcal{I}_{R/G/B}}$ to represent the function that is applied independently to each channel of the image. Note that $\mathcal{F}_{\mathcal{I}_{R/G/B}}(u, v)$ is a complex number, i.e., $\mathcal{F}_{\mathcal{I}_{R/G/B}}(u, v) \in \mathbb{C}$, and the spectrum of each channel is obtained as follows:

$$Spec_{R/G/B}(u, v) = \mathrm{mod}\left(\mathcal{F}_{\mathcal{I}_{R/G/B}}(u, v)\right), \qquad (5)$$

where $\mathrm{mod}(\cdot)$ denotes the modulus of a complex number.
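A minimal sketch of Eqs. 4–5 with NumPy's FFT follows (our own illustration; the function name, the Pillow loader, and the absence of any shift or log scaling are assumptions, not the authors' exact implementation).

```python
import numpy as np
from PIL import Image

def channel_spectra(path):
    """Per-channel magnitude spectra Spec_R/G/B(u, v) as in Eqs. 4-5."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
    freq = np.fft.fft2(img, axes=(0, 1))   # 2D DFT applied to each channel
    spec = np.abs(freq)                    # modulus of the complex coefficients
    return spec[..., 0], spec[..., 1], spec[..., 2]

# Usage (hypothetical file name):
# spec_r, spec_g, spec_b = channel_spectra("face.png")
```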
                                                                                      of observed sampling data points is composed of
   Although it would be challenging for human eyes to
                                                                                      a mixture of many Gaussian distributions, partic-
distinguish between real and GAN-generated fake im-
                                                                                      ularly, in our case is two distributions of real and
ages, we believe that their frequency spectra differences
                                                                                      fake class. To determine the means and variances
can be possibly exposed, when we stack the three chan-
                                                                                      of the two Gaussian distributions, Expectation-
nels’ spectra of real vs. fake. Figure 1 presents our exam-
                                                                                      Maximization (EM) algorithm is used to itera-
ple of images’ spectra from VoxCeleb2 dataset [12] and
                                                                                      tively estimate these parameters. Our descriptive
Fake Head Talker dataset [13]. In particular, in the real
                                                                                      features populate in the way that the higher spec-
images, we empirically find that the spectra of three color
                                                                                      trum agreement of an image will have the lower
channels are mostly concurrent when stacking together,
                                                                                      descriptive values. Therefore, we can expect that
whereas they become noisy in the fake images, as shown
                                                                                      the Gaussian distribution in the mixture model,
in Fig. 1.
                                                                                      which has a smaller expectation will represent
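The following sketch computes the six descriptive features from the per-channel spectra, following Eqs. 6–8 (our own code; the function and variable names are illustrative, not from the paper).

```python
import numpy as np
from itertools import combinations

def descriptive_features(spec_r, spec_g, spec_b):
    """Six features [Mean, Max, Min, iCorr_RG, iCorr_RB, iCorr_GB]."""
    spectra = {"R": spec_r, "G": spec_g, "B": spec_b}
    # Pairwise mean absolute spectrum differences d_RG, d_RB, d_GB (Eq. 7)
    d = {a + b: np.mean(np.abs(spectra[a] - spectra[b]))
         for a, b in combinations("RGB", 2)}
    # Inverted Pearson correlations iCorr = -rho + 1 (Eq. 8)
    icorr = {a + b: 1.0 - np.corrcoef(spectra[a].ravel(),
                                      spectra[b].ravel())[0, 1]
             for a, b in combinations("RGB", 2)}
    diffs = list(d.values())
    return np.array([np.mean(diffs), np.max(diffs), np.min(diffs),
                     icorr["RG"], icorr["RB"], icorr["GB"]])
```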
2.3. Binary Classifier

To demonstrate the characteristic-defining ability of the spectrum disagreement, we first employ simple classifiers to separate real and fake images, as described below:

      • Gaussian Mixture Model (GMM). A GMM is a probabilistic model that assumes the distribution of the observed data points is composed of a mixture of several Gaussian distributions; in our case, the two distributions of the real and fake classes. To determine the means and variances of the two Gaussian distributions, the Expectation-Maximization (EM) algorithm is used to iteratively estimate these parameters. Our descriptive features are constructed so that an image with higher spectrum agreement has lower feature values. Therefore, we can expect that the Gaussian component of the mixture model with the smaller expectation represents the real images' distribution, and the other represents the fake images' distribution. By applying EM, we can classify real and fake images in an unsupervised manner in which the labels of the training dataset are not required.
      • Support Vector Machine (SVM). The SVM is a robust supervised learning method that maximizes the margin of the hyperplane between different classes. The samples that lie along the margins are called support vectors. In our experiment, we use an SVM with the radial basis function (RBF) kernel trained on our six proposed features.
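A sketch of both classifiers on the six descriptive features using scikit-learn (our own illustration; the synthetic toy features below only stand in for real extracted features and are not the paper's data).

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Toy (N, 6) feature matrix: real images cluster at lower values, fakes higher.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.2, 0.05, (200, 6)), rng.normal(0.8, 0.1, (200, 6))])
y = np.array([0] * 200 + [1] * 200)          # 0 = real, 1 = fake

# Unsupervised: two-component GMM fitted with EM; the component with the
# smaller mean is interpreted as the real class (higher spectrum agreement).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
real_comp = np.argmin(gmm.means_.mean(axis=1))
pred_gmm = (gmm.predict(X) != real_comp).astype(int)

# Supervised: SVM with an RBF kernel trained on the same six features.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
print((pred_gmm == y).mean(), svm.score(X, y))
```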

3. Experiment

3.1. Datasets

To examine the effectiveness of our proposed frequency features, we experiment with four types of datasets: Fake Head Talker [7], StyleGAN [8], StarGAN [9], and Adversarial Latent Auto Encoder (ALAE) [10]. A brief description of each dataset is provided below as well as in Table 1:

      • Fake Head Talker dataset [7]. Fake Head Talker is generated by a few-shot learning system that is pre-trained extensively on a large dataset (meta-learning). In particular, the approach includes an embedder, a generator, and a discriminator. After training on a large corpus of talking-head videos of different faces with adversarial training, the approach can transform facial landmarks from a source frame into realistic-looking personalized photographs given only a few photos of a new target person, and can further mimic the target.
      • StyleGAN dataset [8]. StyleGAN is a high-level style-controlling approach that governs its generator through adaptive instance normalization (AdaIN) and the addition of Gaussian noise in each convolutional layer. Furthermore, by proposing two novel metrics, perceptual path length and linear separability, the generated images are less entangled and have different factors of variation.
      • StarGAN dataset [9]. StarGAN is a unified model architecture that is able to train on multiple datasets across different domains. By proposing a simple mask vector, StarGAN is able to flexibly utilize multiple datasets containing different label sets, and it achieves competitive results in facial attribute transfer tasks. This approach, with only a single generator and a single discriminator, addresses the scalability and robustness limitations of much previous research.
      • Adversarial Latent Auto Encoder (ALAE) dataset [10]. ALAE is an autoencoder-based generative model that is capable of learning disentangled representations in the latent space with adversarial settings. The ALAE model can not only synthesize high-resolution images comparable with StyleGAN, but can also further manipulate or reconstruct new input facial images.

   Details of each dataset in our experiment are summarized in Table 1. In our experiment, the numbers of real and GAN-generated fake images are equal in both the training and test sets. We further provide histograms visualizing the distributions of the six descriptive features of these datasets in Supp. Section A.

3.2. Experimental Results

To demonstrate the discriminative power of our proposed features, we perform three different experiments.

   Binary classification. Our experimental results are shown in Table 2. We can observe that both the unsupervised and supervised methods are able to produce high performance with our newly introduced frequency features. The accuracy scores of the unsupervised method on the Fake Head Talker and ALAE datasets are competitive compared to the supervised approaches, and they are still higher than 80% on StyleGAN and StarGAN. Meanwhile, the supervised method's accuracy scores are always higher than 95% on the four datasets. We can conclude that our proposed features, based on the asynchrony in the frequency spectrum, effectively capture the characteristics of GAN-generated images and provide a foundation for distinguishing fake from real images.

   Unbalanced training datasets. Furthermore, to study the feasibility of training with an unbalanced dataset using our features, we gradually reduce the number of fake images in each training dataset to 25%, 5%, and 1% of the total training data size. After that, we apply the SVM as our learning model. To demonstrate our approach's effectiveness, we compare our method with the FakeTalkerDetect model [13], which deployed a pre-trained AlexNet and a Siamese network trained on RGB images. The results are presented in Table 3. We can observe that our method with the hand-crafted features outperforms AlexNet and FakeTalkerDetect on both the balanced and unbalanced datasets. Therefore, we can conclude that our simple yet effective features are capable of characterizing fake images much better in the unbalanced training dataset scenario as well.
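For concreteness, a sketch of this unbalanced-training setup is shown below (our own illustration; the subsampling helper and the toy stand-in features are assumptions, not the paper's code).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def fit_unbalanced(X_real, X_fake, fake_fraction, seed=0):
    """Subsample the fake class to `fake_fraction` of the training set size
    and fit an RBF-kernel SVM on the six descriptive features."""
    rng = np.random.default_rng(seed)
    n_fake = max(1, int(fake_fraction * (len(X_real) + len(X_fake))))
    keep = rng.choice(len(X_fake), size=n_fake, replace=False)
    X = np.vstack([X_real, X_fake[keep]])
    y = np.concatenate([np.zeros(len(X_real)), np.ones(n_fake)])
    return SVC(kernel="rbf").fit(X, y)

# Toy feature matrices standing in for the six descriptive features.
rng = np.random.default_rng(1)
X_real = rng.normal(0.2, 0.05, (500, 6))
X_fake = rng.normal(0.8, 0.10, (500, 6))
X_test = np.vstack([X_real[400:], X_fake[400:]])
y_test = np.concatenate([np.zeros(100), np.ones(100)])
for frac in (0.25, 0.05, 0.01):
    clf = fit_unbalanced(X_real[:400], X_fake[:400], frac)
    print(frac, f1_score(y_test, clf.predict(X_test)))
```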
   Unsupervised domain adaptation. In this task, we propose an algorithm, using our proposed features, that allows an SVM model pre-trained on one source dataset (e.g., StyleGAN) to detect fake images in a new target dataset (e.g., ALAE) with only prior knowledge of the target feature expectations.
Table 1
Details of the datasets used in our experiment.

      Datasets                 Resolution     Source datasets   Training size (real + fake)   Test size (real + fake)
      Fake Head Talker [7]     224 × 224      VoxCeleb2 [12]    18,800                        18,800
      StyleGAN [8]             1024 × 1024    FFHQ              2,000                         2,000
      StarGAN [9]              256 × 256      CelebA [14]       2,000                         1,998
      ALAE [10]                1024 × 1024    FFHQ              2,000                         2,000


Table 2
Experimental results of GMM and SVM on four datasets with our discriminative features.
                                                  GMM                                              SVM
         Datasets
                               Accuracy       Recall   Precision      F1      Accuracy      Recall     Precision      F1
         Fake Head Talker        0.996        1.000      0.991       0.996     0.9972       0.994        1.000       0.997
         StyleGAN                0.849        0.762      0.915       0.831      0.951       0.938        0.963       0.950
         StarGAN                 0.903        0.807      1.000       0.893      0.972       0.949        0.994       0.971
         ALAE                    0.992        0.984      1.000       0.992      0.999       0.997        1.000       0.998



Table 3
Comparison between our approach using the proposed descriptive features and the AlexNet and FakeTalkerDetect methods on the Fake Head Talker dataset. The precision, recall, and F1 scores of AlexNet and FakeTalkerDetect are taken from [13], and their values are rounded to two decimals.

      Methods              Accuracy   Recall   Precision   F1
      AlexNet (50% fake)   0.981      0.98     0.98        0.98
      FakeTalkerDetect     0.984      0.98     0.98        0.98
      SVM (ours)           0.997      0.994    1.00        0.997
      AlexNet (25% fake)   0.971      0.95     0.95        0.96
      FakeTalkerDetect     0.986      0.98     0.98        0.98
      SVM (ours)           0.998      0.995    1.000       0.997
      AlexNet (5% fake)    0.964      0.98     0.80        0.87
      FakeTalkerDetect     0.988      0.99     0.91        0.94
      SVM (ours)           0.997      0.997    0.997       0.997
      AlexNet (1% fake)    0.963      0.98     0.61        0.67
      FakeTalkerDetect     0.988      0.99     0.74        0.82
      SVM (ours)           0.992      0.999    0.986       0.992

Algorithm 1: Unsupervised domain adaptation with SVM using our proposed descriptive features
Require: Labeled source set $\{X^s, Y^s\}$ and unlabeled target set $X^t$, where $X^s$ and $X^t$ contain the six proposed features $[f_1^s, .., f_6^s]$ and $[f_1^t, .., f_6^t]$, respectively; the prior knowledge of the Gaussian expectation values $\left[m_{i,0}^s, m_{i,1}^s\right]_{i=1,...,6}$ and $\left[m_{i,0}^t, m_{i,1}^t\right]_{i=1,...,6}$.
1: Step 1: Scale each feature in $X^s$ and $X^t$:
   $\bar{f}_i^s = \left(f_i^s - m_{i,0}^s\right) / \left(m_{i,1}^s - m_{i,0}^s\right)$,
   $\bar{f}_i^t = \left(f_i^t - m_{i,0}^t\right) / \left(m_{i,1}^t - m_{i,0}^t\right)$
2: Step 2: Fit the source set $\left\{[\bar{f}_1^s, .., \bar{f}_6^s], Y^s\right\}$ with the SVM model.
3: Step 3: Use the pre-trained SVM to predict the target set labels from $[\bar{f}_1^t, .., \bar{f}_6^t]$.
   In particular, we first take the two Gaussian expectation values of the two mixture distributions of each feature from both the source and the target datasets. These expectation values are kept as our prior knowledge about the target dataset. We then scale the source training set features such that their two Gaussian expectation values are normalized to 0 and 1, to better fit the training dataset with the SVM model. In the testing phase, using the prior knowledge above, we scale the testing features from the target dataset with the known expectation values and feed them to the pre-trained SVM model to make predictions. This adaptation learning process is summarized in Algorithm 1.
   We experiment with the four fake datasets and present the results in Table 4. We can observe that, with our suggested features, the pre-trained SVM shows strong detection ability in the new target domain, where the detection accuracy is above 80% for every pair of source and target datasets. This preliminary experiment shows that our proposed features can be utilized in domain adaptation tasks with more complex learning models in the future.
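A sketch of Algorithm 1 with scikit-learn follows (our own illustration: the per-feature expectations are estimated here with one-dimensional two-component GMMs, and the toy arrays merely stand in for the six descriptive features of a source and a target dataset).

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def gaussian_expectations(X):
    """Two Gaussian expectations (m_i0 <= m_i1) per feature: the prior
    knowledge assumed by Algorithm 1."""
    m = [np.sort(GaussianMixture(n_components=2, random_state=0)
                 .fit(X[:, i:i + 1]).means_.ravel()) for i in range(X.shape[1])]
    return np.array(m)                                    # shape (n_features, 2)

def scale(X, m):
    """Step 1: map each feature's two expectations to 0 and 1."""
    return (X - m[:, 0]) / (m[:, 1] - m[:, 0])

# Toy labeled source and unlabeled target feature sets (stand-ins only).
rng = np.random.default_rng(0)
X_src = np.vstack([rng.normal(0.2, 0.05, (100, 6)), rng.normal(0.8, 0.1, (100, 6))])
y_src = np.array([0] * 100 + [1] * 100)
X_tgt = np.vstack([rng.normal(0.3, 0.05, (100, 6)), rng.normal(1.2, 0.1, (100, 6))])

m_src, m_tgt = gaussian_expectations(X_src), gaussian_expectations(X_tgt)
svm = SVC(kernel="rbf").fit(scale(X_src, m_src), y_src)   # Step 2
y_tgt_pred = svm.predict(scale(X_tgt, m_tgt))             # Step 3
```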
Table 4
Experimental results of the domain adaptation task using our proposed features.

      Source dataset     Target dataset     Accuracy   Recall   Precision   F1
      Fake Head Talker   StyleGAN           0.800      0.908    0.745       0.819
                         StarGAN            0.918      0.844    0.992       0.912
                         ALAE               0.994      0.995    0.993       0.994
      StyleGAN           Fake Head Talker   0.965      0.932    0.998       0.964
                         StarGAN            0.906      0.814    0.998       0.896
                         ALAE               0.991      0.982    1.000       0.991
      StarGAN            Fake Head Talker   0.983      0.982    0.983       0.983
                         StyleGAN           0.832      0.980    0.756       0.854
                         ALAE               0.996      0.997    0.994       0.996
      ALAE               Fake Head Talker   0.993      0.989    0.998       0.993
                         StyleGAN           0.890      0.955    0.845       0.897
                         StarGAN            0.929      0.871    0.985       0.925



4. Conclusion

Although GANs have advanced significantly, we discover that there are areas in which GANs cannot mimic real images effectively in the frequency domain. Thus, in this work, we propose a preliminary approach that reveals the asynchrony in the frequency domain between the three channels of GAN images. By mining statistical features in the frequency domain, our simple yet effective unsupervised and supervised learning methods can easily discriminate between real and GAN-based synthetic facial images without utilizing deep learning methods. Our extensive experiments demonstrate the proposed features' power in three scenarios: 1) unsupervised and supervised binary classification, 2) unbalanced training datasets, and 3) the domain adaptation task. For future work, we plan to explore and exploit more of these aspects of GAN-generated images to combat misuse by attackers, and to extend our work to deepfake detection.

Acknowledgments

This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00421, AI Graduate School Support Program (Sungkyunkwan University)), (No. 2019-0-01343, Regional strategic industry convergence security core talent training business) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) grant funded by the Korea government MSIT (No. 2020R1C1C1006004). Additionally, this research was partly supported by the IITP grant funded by the Korea government MSIT (No. 2021-0-00017, Original Technology Development of Artificial Intelligence Industry) and was partly supported by the Korea government MSIT under the High-Potential Individuals Global Training Program (2019-0-01579) supervised by the IITP.

References

 [1] T. Quandt, L. Frischlich, S. Boberg, T. Schatto-Eckrodt, Fake news, The International Encyclopedia of Journalism Studies (2019) 1–6.
 [2] S. Cole, We are truly fucked: Everyone is making ai-generated fake porn now, 2018. URL: https://www.vice.com/en/article/bjye8a/reddit-fake-porn-app-daisy-ridley.
 [3] X. Zhang, S. Karaman, S.-F. Chang, Detecting and simulating artifacts in gan fake images, in: 2019 IEEE International Workshop on Information Forensics and Security (WIFS), IEEE, 2019, pp. 1–6.
 [4] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, A. A. Efros, Cnn-generated images are surprisingly easy to spot... for now, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8695–8704.
 [5] J. Frank, T. Eisenhofer, L. Schönherr, A. Fischer, D. Kolossa, T. Holz, Leveraging frequency analysis for deep fake image recognition, in: International Conference on Machine Learning, PMLR, 2020, pp. 3247–3258.
 [6] T. Dzanic, K. Shah, F. Witherden, Fourier spectrum discrepancies in deep network generated images, arXiv preprint arXiv:1911.06465 (2019).
 [7] E. Zakharov, A. Shysheya, E. Burkov, V. Lempitsky, Few-shot adversarial learning of realistic neural talking head models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9459–9468.
 [8] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
 [9] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, J. Choo, Stargan: Unified generative adversarial networks for multi-domain image-to-image translation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.
[10] S. Pidhorskyi, D. A. Adjeroh, G. Doretto, Adversarial latent autoencoders, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14104–14113.
[11] M. Khayatkhoei, A. Elgammal, Spatial frequency bias in convolutional generative adversarial networks, arXiv preprint arXiv:2010.01473 (2020).
[12] A. Nagrani, J. S. Chung, W. Xie, A. Zisserman, Voxceleb: Large-scale speaker verification in the wild, Computer Speech and Language (2019).
[13] H. Jeon, Y. Bang, S. S. Woo, Faketalkerdetect: Effective and practical realistic neural talking head detection with a highly unbalanced dataset, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[14] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of International Conference on Computer Vision (ICCV), 2015.
A. Distribution of Statistical Descriptive Features
The histogram distributions of our six proposed statistical features in the frequency domain are presented in Fig. 3. We can observe that these feature distributions are highly separable between real and fake images across the four datasets.
Figure 3: The histogram distributions of our six statistical descriptive features from four datasets: Fake Head Talker, StyleGAN,
StarGAN and ALAE.
B. Example GAN-based Synthetic Images
We provide example images from four datasets used in our experiment.