Towards Image Data Hiding via Facial Stego Synthesis
With Generative Model
Li Dong1,2 , Jie Wang1,2 , Rangding Wang1,2 , Yuanman Li3 and Weiwei Sun4
1 Faculty of Electrical Engineering and Computer Science, Ningbo University, Zhejiang, China, 315211
2 Southeast Digital Economic Development Institute, Zhejiang, China, 324000
3 Shenzhen University, Guangdong, China, 518061
4 Alibaba Group, Zhejiang, China, 310052


Abstract
Stego synthesis-based data hiding aims to directly produce a plausible natural image to convey a secret message. However, most existing works neglect the possible communication degradations and forensic actions that commonly occur in practice. In this paper, we devise a generative adversarial network (GAN)-based framework to synthesize facial stego images. The framework consists of four components: a generator, an extractor, a discriminator, and a forensic network. Specifically, the generator is deployed to generate a realistic facial stego image from the secret message and key, while the extractor aims at extracting the secret message from the stego image with the provided secret key. To combat forensics, we explicitly integrate a forensic network into the proposed framework, which is responsible for guiding the update of the generator. Three degradation layers are further incorporated, enforcing the generator to characterize the communication degradations. Experimental results demonstrate that the proposed framework can accurately extract the secret message and effectively resist forensic detection and certain degradations, while attaining realistic facial stego images.

Keywords
data hiding, stego synthesis, generative adversarial network



International Workshop on Safety & Security of Deep Learning, 21st-26th August, 2021
dongli@nbu.edu.cn (L. Dong); 1811082196@nbu.edu.cn (J. Wang); wangrangding@nbu.edu.cn (R. Wang); yuanmanli@szu.edu.cn (Y. Li); sunweiwei.sww@alibaba-inc.com (W. Sun)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

Data hiding aims to embed a secret message into a cover signal without arousing the awareness of an adversary. It is widely used in many applications, e.g., covert communication [1] and multimedia data protection [2, 3]. The primitive ad-hoc Least-Significant Bit (LSB) method replaces the bit in the least significant bit-plane of each pixel with a secret bit, while modern data hiding methods attempt to eliminate the traces of the data hiding action and improve the steganographic capacity. For example, content-adaptive steganography [1] designed sophisticated distortion functions according to prior knowledge and used Syndrome-Trellis coding to embed the secret message. Recently, neural network-based data hiding has become one of the active research directions. Baluja [4] employed convolutional neural networks to hide an entire secret image into the cover image in an end-to-end fashion. SSGAN [5] attempted to exploit GAN to synthesize a cover image that is more suitable for the subsequent steganographic data embedding. ASDL-GAN [6] integrated content-adaptive steganography and GAN, in which the generator was able to produce the modification probability maps. HayersGAN [7], HiDDeN [8] and SteganoGAN [9] all designed encoder-decoder-like frameworks based on GAN. These methods could automatically learn the suitable areas for embedding the secret bitstream message.

For the last several years, adversarial examples to neural networks have met data hiding, continuously drawing extensive attention from the community. Some studies, e.g., [10, 11], found that adding slight perturbations to the input data would paralyze the prediction capability of learning-based classifiers. As the opponent of data hiding, steganalysis aims to expose the data hiding on a stego signal and usually involves machine-learning classifiers. Therefore, it is possible for data hiding methods to bypass steganalysis by borrowing strategies from adversarial example-related works. Tang et al. [12] presented the Adversarial Embedding (ADV-EMB) method, which adjusts the modification cost of image elements according to the gradients back-propagated from the target steganalytic neural network. The constructed adversarial stego could effectively fool the steganalytic network, revealing the vulnerability of deep learning-based steganalyzers.

Note that all aforementioned data hiding techniques are based on cover modification. Their common characteristic is that these methods cannot be independent of the modification of the given cover image. As such, they inevitably leave artifacts exposed to steganalysis. On the contrary, stego synthesis-based data hiding, e.g., [13, 14], refers to synthesizing the stego image directly from the
secret message. It could pose more challenges for steganalysis. Under this concept, traditional methods tried to produce stego images based on some hand-crafted designs. Although the capacity was relatively high, they were limited to synthesizing patterned images, such as textures and fingerprints. As an alternative solution, some methods [15, 16] use GAN to synthesize stego images with rich semantics, e.g., face and food. However, the accuracy of message extraction was unsatisfactory under image degradations. Moreover, the synthesized stego images can be easily identified by a well-trained forensic detector. It is thus urgent to further improve the robustness of message extraction and the anti-forensic capability of stego synthesis-based data hiding methods.

Figure 1: Overview of the proposed FSIS-GAN framework.

In this work, we propose a Facial Stego Image Synthesis method for data hiding with GAN, termed FSIS-GAN. Unlike cover modification-based data hiding methods, FSIS-GAN is designed without providing a cover image beforehand. Compared with the existing stego synthesis-based methods, FSIS-GAN can not only synthesize realistic facial stego images, but also achieve superior performance in terms of robustness and anti-forensic capability. Experimental results conducted on a public facial dataset validate these merits of our proposed method. The main contributions of this work can be summarized as follows:

     • We explicitly consider the image degradation during covert communication, and integrate multiple degradation layers into the framework. This boosts the robustness of message extraction.
     • We incorporate a forensic network during the training of FSIS-GAN. By exploiting the gradients from such a forensic network, the stego images produced by the learned generator could effectively fool the forensic network.
     • We explicitly adopt a secret key in the data hiding procedure of FSIS-GAN, which could further improve the reliability of secret message extraction.

The rest of this paper is organized as follows. Section 2 briefly reviews the related work on stego synthesis-based data hiding. Section 3 describes the proposed FSIS-GAN, including the network architecture and loss functions. Section 4 presents the experimental results, and conclusions are drawn in Section 5.


2. Stego Synthesis-based Data Hiding

The majority of data hiding methods involve modification of the given cover images. However, such cover modification would leave embedding traces that can be detected. To resist detection by a steganalyzer, stego synthesis-based data hiding methods directly produce the stego images from the given secret message. As an early attempt, Wu et al. [13] proposed a texture image synthesis-based method, which selectively distributes the source patches of the original texture image onto the synthesized stego image. The message hiding and extraction depend on the choice of source patches. Motivated by fingerprint biometrics, Li et al. [14] proposed to use the hologram phase constructed from the secret message to synthesize a fingerprint stego image. The hologram phase consists of two phases: the first spiral phase encodes the secret message into two-dimensional points with different polarities, and the second continuous phase is used to synthesize fingerprint images. It is worth noting that conventional stego image synthesis-based methods can only synthesize patterned stego images such as textures, lacking rich semantics, which limits their practical applications.

Instead, Hu et al. [15] suggested using the generator of a GAN to synthesize a facial stego image from the secret message. Meanwhile, the secret message can be extracted from the stego image by the corresponding extractor network. Similarly, Zhang et al. [16] exploited GAN to generate stego images with different semantic labels, which could improve the robustness of data extraction but significantly sacrifices the steganographic capacity. The main advantage of the GAN-based works is that they could synthesize stego images with rich semantics. However, we shall note that such stego images can still be easily identified by well-trained forensic networks. In addition, these methods offer no flexible trade-off between capacity and extraction accuracy.


3. Facial Image Data Hiding via Generative Stego Synthesis

In this section, we first give an overview of the proposed FSIS-GAN framework and then introduce each component of the framework, accompanied by a thorough discussion of the loss functions, network structure, and training procedure.
3.1. Overview of FSIS-GAN

The proposed FSIS-GAN framework is illustrated in Figure 1. In general, it is an end-to-end framework consisting of three parts, where each part is designed to achieve a specific goal. First, the part of facial stego image synthesis and message extraction contains a generator G, an extractor E, and the degradation layers N. The generator G is deployed to convert the secret message along with the secret key into a facial stego image. The degradation layers N are used to simulate common image degradations within the communication channel. The extractor E is learned to recover the secret message from the degraded stego image. Second, there is a discriminator D in the adversarial training part, which aims at distinguishing genuine data samples from the ones produced by the generator G. Third, a well-trained existing forensic network Fθ (parameterized by θ) is introduced in the anti-forensics part, which could distinguish a genuine facial image from a synthesized facial stego image. Note that this target forensic network is treated as a fixed adversary, and its network parameters are always frozen.

3.2. Stego Image Synthesis and Message Extraction

The part of facial stego image synthesis and message extraction achieves two functionalities. First, by using the generator G, one can convert the given secret message into a facial stego image. Second, the extractor E is responsible for extracting the secret message from the input stego image. Furthermore, a secret key is introduced to ensure the communication reliability and the high diversity of the generated facial stego images.

Generally, the generator G and the extractor E aim to learn two mappings, i.e., mapping the given secret message into a stego image, and vice versa. More formally, let m ∈ {0, 1}^{l_m} and k ∈ {0, 1}^{l_k} be the binary secret message and the secret key, respectively. The generator G is intended to learn the first mapping, transforming the message m along with the secret key k into a stego image:

    S = G(m, k),                                                  (1)

where S denotes the synthesized facial stego image of shape C × H × W. To recover the secret message, we next introduce the extractor E. Considering that the facial stego image S may be degraded during transmission, the second mapping should be from the degraded stego image along with the secret key k to the secret message, which can be expressed by

    m′ = E(N(S), k),                                              (2)

where N(·) models the image degradation process, and N(S) is the degraded stego image. Here, m′ ∈ (0, 1)^{l_m} denotes the extracted secret message. It shall be noted that the extracted message m′ shall be (approximately) equal to the original secret message m, and thus one can employ an error-correcting mechanism to fully correct the erroneous bits.

To measure the distortion between the original secret message m and the extracted message m′, we use the cross-entropy loss to calculate the message extraction loss L_E, which is given by

    L_E(m, m′) = −(1/l_m) Σ_{i=1}^{l_m} [m_i log(m′_i) + (1 − m_i) log(1 − m′_i)],    (3)

where m_i and m′_i are the i-th elements of m and m′, respectively.

Note that our proposed FSIS-GAN framework explicitly receives a secret key as an input, which is designed to satisfy Kerckhoffs' principle. It means that even if the extractor network E is completely exposed to an attacker, the secret message m can be recovered only if the receiver obtains both the secret key k and the facial stego image S. It is worth emphasizing that, for most of the existing GAN-based methods, e.g., [15, 16], there is no involvement of a secret key. Further notice that, as an input of the extractor E, the dimension of the secret key k is much smaller than that of the facial stego image S. Thus, the extractor E tends to discard the secret key because it carries much less information. To mitigate this issue, we propose to use a randomly generated incorrect secret key k̃ ∈ {0, 1}^{l_k}, where k̃ ≠ k, as an additional input during the training stage. Besides directly using the correct secret key and minimizing the difference between the extracted and original message, we maximize the difference between the extracted and original message when the incorrect secret key is applied. Mathematically, this loss term, the inverse loss L_Ẽ, can be expressed by the negative cross-entropy loss:

    L_Ẽ(m, m̃′) = (1/l_m) Σ_{i=1}^{l_m} [m_i log(m̃′_i) + (1 − m_i) log(1 − m̃′_i)],    (4)

where m̃′_i is the i-th element of the extracted message m̃′ obtained with the incorrect key k̃, i.e., m̃′ = E(N(S), k̃).

Enhancing robustness with degradation layers: In a practical communication channel, there often exist degradations on the synthesized stego image S when transmitting the stego to a receiver. To this end, the data hiding system requires certain robustness to ensure the accuracy of message extraction. Therefore, in this work, we take three representative degradations into account, i.e., image noise pollution, blurring, and compression. For noise pollution, we consider one of the most widely-used noise models: Gaussian noise. For blurring, Gaussian blurring is used. For signal compression, JPEG image compression is employed, which is extensively used for reducing the bandwidth of the transmission process.
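As a minimal illustration of the three degradations above, the sketch below implements a Gaussian noise layer, a (1-D) Gaussian blurring layer, and a differentiable stand-in for JPEG-style quantization in pure Python on normalized pixel lists. All function names are our own, and the cubic rounding polynomial is one common differentiable approximation, not necessarily the exact polynomial used by the authors:

```python
import math
import random

def gaussian_noise_layer(pixels, sigma=0.05, seed=0):
    """GNL: add zero-mean Gaussian noise to each normalized pixel."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, sigma) for p in pixels]

def gaussian_blur_layer(pixels, sigma=1.0, radius=2):
    """GBL: 1-D Gaussian blur; a 2-D blur applies this along rows, then columns."""
    kernel = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]  # normalize so constant signals are preserved
    out = []
    for i in range(len(pixels)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), len(pixels) - 1)  # clamp at the borders
            acc += w * pixels[idx]
        out.append(acc)
    return out

def soft_quantize(x):
    """JPEG-style quantization rounding, made differentiable: round(x) is
    replaced by x + (round(x) - x)**3, whose derivative
    1 - 3*(round(x) - x)**2 is non-zero almost everywhere."""
    r = float(round(x))
    return x + (r - x) ** 3
```

At integer inputs the approximation is exact (e.g., `soft_quantize(2.0)` returns `2.0`), while between integers it bends smoothly toward the rounded value yet keeps a useful gradient for back-propagation.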
In experiments, we implement these three types of degradation as neural network layers N that degrade the stego image, with one network layer simulating each type of degradation. The Gaussian noise layer (GNL) adds Gaussian noise to the facial stego image S, and the Gaussian blurring layer (GBL) blurs S. For JPEG compression, considering that the quantization operation is non-differentiable, we approximate it with a differentiable polynomial function. This differentiation technique can also be referred to the work HiDDeN [8].

3.3. Adversarial Training Part

As aforementioned, the hand-crafted stego synthesis-based data hiding methods [13, 14] could only synthesize patterned images such as textures and fingerprints, limiting their practical applications. Synthesizing a natural image with semantics is a challenging task. However, this problem can be alleviated with the guidance of adversarial training. In this part, the purpose of the discriminator D is to conduct adversarial training with the generator G and improve the plausibility of the synthesized facial stego images.

More specifically, let I be a genuine facial image sample of shape C × H × W from a publicly available genuine facial image dataset. The discriminator D estimates the probability that a given image sample was synthesized by the generator G, while the generator G attempts to fool the discriminator D. Through such adversarial training, the generator G is encouraged to synthesize much more realistic facial stego images. As a variant of GAN, the network structure and loss function of BEGAN [17] provide a good reference for improving training stability. Thus, in this work we employ the adversarial training loss used in BEGAN. Mathematically, the adversarial loss L_adv for the generator G can be calculated as

    L_adv(D(S), S) = (1/CHW) [|D(S) − S|],                        (5)

where the shape of the output D(S) is the same as that of the facial stego image. The adversarial loss L_D for the discriminator D is

    L_D(I, S) = (1/CHW) [|D(I) − I| − h_t · |D(S) − S|],          (6)

where h_t controls the discrimination ability of D in the t-th training step to equilibrate the adversarial training. It can be computed as

    h_{t+1} = h_t + (λ/CHW) [γ|D(I) − I| − |D(S) − S|].           (7)

Here the parameter λ is the learning rate of training, and γ is a hyper-parameter that controls the diversity of the synthesized facial images. The quality and diversity of the facial stego images can be freely adjusted by tuning the parameter γ.

3.4. Anti-forensics Part

Recall that there are no explicit cover images involved in stego synthesis-based data hiding methods. This merit enables such data hiding methods to effectively resist conventional steganalysis detection. However, as pointed out in [15], a well-trained forensic network could readily distinguish a synthesized stego image from a genuine one, even if the synthesized stego image exhibits no perceptual differences to an observer.

Although Fθ is an expert in such a detection task, some studies [10, 11] have shown that deep neural network-based classifiers are vulnerable to adversarial examples. Inspired by this, we propose to apply strategies for obtaining adversarial examples to evade the stego detection network as a way of realizing anti-forensics. In the FSIS-GAN framework, we consider a white-box scenario, i.e., we assume one has full knowledge of the target forensic network. The target forensic network Fθ is trained with genuine images from a publicly available facial dataset and synthesized images produced by BEGAN [17]. Then, we integrate the well-trained Fθ into the FSIS-GAN framework, in which Fθ receives the synthesized facial stego image S and outputs a confidence score. The gradients back-propagated through Fθ are used to update the parameters of the generator G. To measure the loss of resisting forensic detection, we define the anti-forensic loss L_Fθ as the cross-entropy between the output of Fθ and our target genuine image label:

    L_Fθ(S) = −log(1 − Fθ(S)),                                    (8)

where Fθ(S) ∈ (0, 1) is the confidence output by Fθ. Clearly, a smaller L_Fθ indicates a higher probability of S being identified as a genuine image by Fθ.

3.5. Network Structure and Training Strategy

The network architectures of the generator G and the extractor E are shown in Figure 2. For the generator G, the secret key vector k is first concatenated to the secret message vector m and then fed to the subsequent layers. Then, G applies two fully-connected (FC) layers and three transposed convolution (ConvT) layers to produce the facial stego image S. In particular, after each FC layer or ConvT layer, we apply batch normalization (BN) [18] and the ReLU activation function to process the intermediate vectors. In experiments, we found that since both m and k are composed of the binary values 0 and 1, such a form is not suitable as a direct input, and the adversarial training loss would diverge. To solve this issue, additional BN layers were added, so that the normalization operation is carried out inside the network. Experimental results show that this trick could greatly alleviate the divergence problem.
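To make the BEGAN-style balance of Eqs. (5)-(7) concrete, the toy sketch below updates the balance term h_t from scalar reconstruction errors. The per-pixel terms |D(I) − I| and |D(S) − S|, already averaged by 1/CHW, enter as precomputed scalars; the function names and default values are illustrative, not taken from the paper:

```python
def began_balance_step(h_t, rec_real, rec_fake, lam=0.001, gamma=0.5):
    """Eq. (7): h_{t+1} = h_t + lam * (gamma * rec_real - rec_fake), where
    rec_real ~ |D(I) - I| / CHW and rec_fake ~ |D(S) - S| / CHW."""
    h_next = h_t + lam * (gamma * rec_real - rec_fake)
    return min(max(h_next, 0.0), 1.0)  # BEGAN typically clips the balance term to [0, 1]

def discriminator_loss(rec_real, rec_fake, h_t):
    """Eq. (6): reconstruct genuine images well, anti-reconstruct fakes, weighted by h_t."""
    return rec_real - h_t * rec_fake

def generator_adv_loss(rec_fake):
    """Eq. (5): the generator wants its stego images to be well reconstructed by D."""
    return rec_fake
```

When γ·rec_real exceeds rec_fake, h_t grows and the discriminator pushes harder against the generator; otherwise h_t shrinks, equilibrating the adversarial training as described after Eq. (7).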
Figure 2: Network structures of the generator G and the extractor E. (a) Network structure of the generator G.


4. Experimental Results

In this section, we first introduce the experimental setup. Then, to verify the robustness of the proposed FSIS-GAN, we evaluate it under image degradation and without degradation, respectively. Finally, the anti-forensic capability of FSIS-GAN is validated.

4.1. Experimental Setup

Our experiments are conducted on the CelebA dataset [20], where the region with the face is identified and ex-
                                                                                                                                    *!+"
                                                                                                                                           # $
                                                                                                                                             "
                                                                                                                                           !, !,
                                                                                                                                                                !"/!   tracted. All images are reshaped into 3 Γ— 64 Γ— 64. The
                                                                                                                                                                       following three metrics are used for evaluation:
                                                                                                                  . .
                                                                                           # $                                                          45%$2#%".
                                                                                    !+."    "
                                                                                           - -                                                         1"++23"(#$
                                                                         # $
                          !"#"$                                       ,-" "
                                                                         + +
                 !')%*"+,-".(           &"#"$           '& ( !)"#"$
                +%"3/ ,123"(%
                                                                                                                                                                            β€’ FrΓ©chet Inception Distance (FID) [21], which
                                    (b) Network structure of the extractor E                                                                                                  is a widely-used perceptual image quality assess-
Figure 2: Network structure of the generator G and the extrac-                                                                                                                ment metric for synthesized images. FID is a de
tor E. β€œConcat”, β€œFC”, β€œConvT”, β€œBN”, β€œConv” denote the con-                                                                                                                  facto metric for assessing the image quality cre-
catenation, fully-connected layer, convtranspose layer, batch                                                                                                                 ated by generator of GANs’. Lower FID score
norm, and convolution layer, respectively.                                                                                                                                    indicates better consistency with human’s per-
                                                                                                                                                                              ception on natural images.
                                                                                                                                                                            β€’ Accuracy of message extraction (ACC) that is
   For extractor E, we shall ensure the secret key vector                                                                                                                                           𝐿
k and the facial stego image matrix S in a way such                                                                                                                           computed by ACC = 𝐿Ext , where 𝐿Ext is the length
that the extractor E would not neglect the information                                                                                                                        of correctly extracted message and 𝐿 is the length
provided by the secret key. To this end, the extractor                                                                                                                        of secret message m.
E first applies FC layer to the secret key to form the                                                                                                                      β€’ Probability of missed detection (PMD). This
intermediate matrix, i.e., 1 Γ— π‘Š Γ— 𝐻. Then, the facial stego                                                                                                                  metric can be calculated by PMD = 𝐹 𝑁𝐹+𝑇𝑁
                                                                                                                                                                                                                        𝑃
                                                                                                                                                                                                                          , where
image S and the intermediate matrix are concatenated,                                                                                                                         𝐹 𝑁 (False Negative) is the ratio for case β€œsynthe-
and then feed the fused tensor to the four convolutional                                                                                                                      sized facial image is misclassified as a genuine
(Conv) layers. Finally, the extractor E applies the FC layer                                                                                                                  one”, and 𝑇 𝑃 (True Positive) is the ratio for case
and Sigmoid activation function to produce the message                                                                                                                        β€œsynthesized facial image is correctly detected”.
vector mβ€² (or mβ€²Μƒ ) with size of 1 Γ— π‘™π‘š .                                                                                                                                     Larger PMD indicates higher resisting ability to
   For the discriminator D, we adopt the auto-encoder                                                                                                                         the forensic network.
alike structure from BEGAN [17]. For the target forensic
network F, we use Ye-Net [19], which is a widely-used           The proposed FSIS-GAN framework is implemented
steganalytic method.                                         with PyTorch and train on four NVIDIA GTX1080Ti
   The training process of the proposed FSIS-GAN frame-      GPUs with 11GB memory. The number of training
work is iteratively optimize the loss function of each       epochs is set to 400 with a mini batch-size of 64. We
network, except the well-trained forensic network Fπœƒ .       use Adam [22] as the optimizer with a learning rate of
We apply the extraction loss LE and the adversarial loss     2 Γ— 10βˆ’4 . For the hyper-parameters 𝛼 and 𝛽 in (9), with
LD as the loss function for the extractor E and the dis-     a number of trials and errors, we empirically set them
criminator D, respectively. In particular, The total loss    as 0.1 in experiments. The parameter 𝛾 in (7) is set to
LG for the generator G is a proper fusion of the four        0.7, which is expected to produce reasonably diverse fa-
losses aforementioned as follows                             cial stego images. The competing method is the most
                                                             related work [15]. We implement this work by ourselves
           LG = Ladv + 𝛼(LE + LE    Μƒ ) + 𝛽L Fπœƒ ,        (9) because there is no publicly available code. With certain
                                                             tweaking and fine-tuning, the tested results were com-
where Ladv is the adversarial loss for G, LE    Μƒ is the in- parable to the originally reported data from [15]. For a
verse loss, and LFπœƒ is the anti-forensic loss. The hyper- fair comparison, the length of the secret message π‘™π‘š and
parameters of 𝛼 and 𝛽 are used to control the relative the secret key π‘™π‘˜ are all set to 100, so as to the payload is
importance among the four losses.                            identical to that of work [15].
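As a rough sketch (not the authors' code), the loss fusion of (9) can be written as a one-line weighted sum; the function name and the scalar inputs below are illustrative. With plain floats it computes the value directly, and with PyTorch tensors the same expression would keep the computation graph for backpropagation:

```python
def generator_total_loss(l_adv, l_e, l_e_inv, l_f, alpha=0.1, beta=0.1):
    """Total generator loss of Eq. (9): L_G = L_adv + alpha*(L_E + L_E~) + beta*L_F.

    alpha and beta default to 0.1, the values chosen empirically in Section 4.1.
    """
    return l_adv + alpha * (l_e + l_e_inv) + beta * l_f

# Example: L_adv = 1.0, L_E = L_E~ = 0.5, L_F = 2.0
# gives 1.0 + 0.1 * (0.5 + 0.5) + 0.1 * 2.0 = 1.3
total = generator_total_loss(1.0, 0.5, 0.5, 2.0)
```

Only the generator is updated with this fused loss; the extractor and discriminator use L_E and L_D alone, and the pre-trained forensic network F_θ stays frozen.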
Figure 3: Comparison of exemplar synthesized stego images. Top: Hu et al. [15]; Bottom: proposed FSIS-GAN-WD.

Table 1
Comparison of message extraction accuracy (%) for the case of no communication degradations. Here, k and k̃ denote the correct and incorrect secret key, respectively. FSIS-GAN-WD is a variant of the proposed method obtained by excluding the degradation layers, and FSIS-GAN-WD (ex L_Ẽ) represents FSIS-GAN-WD trained without the inverse loss L_Ẽ.

              Hu et al.    FSIS-GAN-WD         FSIS-GAN-WD (ex L_Ẽ)
  Scheme        [15]      with k   with k̃      with k   with k̃
  Accuracy     85.23      98.76    71.50       99.41    97.01
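The accuracies in Table 1 follow the ACC and PMD definitions of Section 4.1. As a minimal illustration (the helper functions below are ours, not part of the authors' implementation), ACC is simply a bitwise comparison between the embedded and extracted messages, and PMD is a ratio over the synthesized images presented to the forensic network:

```python
def extraction_accuracy(m_true, m_extracted):
    """ACC = L_Ext / L: fraction of correctly extracted message bits."""
    assert len(m_true) == len(m_extracted) and len(m_true) > 0
    correct = sum(int(a == b) for a, b in zip(m_true, m_extracted))
    return correct / len(m_true)

def missed_detection_prob(fn, tp):
    """PMD = FN / (FN + TP): share of synthesized images missed by the forensic net."""
    return fn / (fn + tp)

# 3 of 4 bits recovered -> ACC = 0.75
acc = extraction_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
# 10 synthesized images missed, 90 detected -> PMD = 0.1
pmd = missed_detection_prob(10, 90)
```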

4.2. Performance Without Degradations

Notice that the competing method [15] does not consider image degradations. To verify the effectiveness of the proposed method under the same settings and to make a fair comparison, in this subsection we evaluate the performance without the degradation layers N. The facial stego image S is transmitted to the extractor E without any degradation. To avoid confusion, this variant of our proposed method is termed FSIS-GAN-WD (WD is an abbreviation for Without Degradations). We first compare the visual quality of the facial stego images with the competing method [15]. As can be seen from Figure 3, the proposed FSIS-GAN-WD synthesizes more realistic facial stego images than Hu et al. [15]. Upon closer inspection, one can notice that the stego images produced by FSIS-GAN-WD are more vivid and have more correct semantic structures. It is difficult for an ordinary human observer to perceive the inauthenticity of the facial stego images synthesized by FSIS-GAN-WD. In contrast, the stego images generated by Hu et al. [15] are typically blurry and severely distorted, which would apparently draw attention from a forensic analyzer. For the FID evaluation, we use 10,000 pairs of genuine images and synthesized facial stego images to compute the FID score. The FID score of FSIS-GAN-WD is 23.20, which is much smaller than the 32.07 of Hu et al. [15].
   Then, we evaluate the extraction accuracy for the case without degradation. The results are tabulated in Table 1. To demonstrate the impact of the inverse loss L_Ẽ on the extraction accuracy, ablation experiments are also conducted by excluding the inverse loss during training; this L_Ẽ-ablated version is denoted as FSIS-GAN-WD (ex L_Ẽ). From Table 1, one can draw the following conclusions. First, the extraction accuracy of FSIS-GAN-WD with the correct secret key k is 98.76%, which dramatically outperforms the 85.23% of the competing method [15]. Second, by comparing FSIS-GAN-WD and FSIS-GAN-WD (ex L_Ẽ), one can see that the extraction accuracy of FSIS-GAN-WD with a correct secret key k is slightly inferior to that of FSIS-GAN-WD (ex L_Ẽ). This suggests that the introduced inverse loss only marginally harms the extraction accuracy. However, for the case of the incorrect key k̃, the participation of the inverse loss L_Ẽ significantly reduces the extraction accuracy from 97.01% to 71.50%. This phenomenon indicates that the secret key has no effect if we exclude the inverse loss: FSIS-GAN-WD (ex L_Ẽ) with the incorrect key k̃ still attains a quite high extraction accuracy (> 97%). In short, without the inverse loss L_Ẽ, the variant FSIS-GAN-WD (ex L_Ẽ) violates Kerckhoffs's principle.

4.3. Performance With Degradations

In this subsection, we test the robustness of the proposed framework under certain image degradations, where the degradation type and level are given as prior knowledge. This scenario is common in practice because one can obtain some prior knowledge of the degradation by probing the communication channel. Thus, one can fix the degradation layers N and their associated parameters during the training stage. Specifically, in our experiments, the standard deviation σ1 of the Gaussian noise layer (GNL) is set to 0.2. The kernel width d and the standard deviation σ2 of the Gaussian blurring layer (GBL) are set to 3 and 1, respectively. The differentiable JPEG compression layer (JCL) is implemented as suggested by HiDDeN [8]. For referential simplicity, this variant is termed FSIS-GAN-FD (FD is an abbreviation for Fixed Degradation) in the sequel.
   Firstly, the stego images synthesized by FSIS-GAN-FD are shown in Figure 4. One can observe that some speckle noises emerge in the generated stego images, which can be clearly seen in the regions highlighted with red rectangles in Figure 4 (b). Quantitatively, the FID score of FSIS-GAN-FD is 41.40, which is inferior to that of FSIS-GAN-WD (23.20) and Hu et al. [15] (32.07). Nevertheless, the stego images produced by FSIS-GAN-FD are intuitively more realistic than those of Hu et al. [15].
   Secondly, in Table 2, we report the extraction accuracy under fixed degradations. Not surprisingly, one can notice that the extraction accuracies of Hu et al. [15] and FSIS-GAN-WD degrade greatly, which can be attributed to their overlooking of the degradation-resistant message extraction issue. In contrast, FSIS-GAN-FD exhibits quite promising results.
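The fixed degradation pipeline (GNL with σ1 = 0.2 followed by GBL with d = 3, σ2 = 1) can be sketched as below. This single-channel NumPy version is only illustrative, with names of our own choosing; in the actual framework these operations are differentiable layers inside the training graph, together with the JCL of [8]:

```python
import numpy as np

def gaussian_kernel(width=3, sigma=1.0):
    """Normalized 2-D Gaussian kernel (the GBL kernel for width=3, sigma=1)."""
    ax = np.arange(width) - (width - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def fixed_degradation(img, sigma_noise=0.2, width=3, sigma_blur=1.0, seed=0):
    """Apply GNL then GBL with the fixed parameters of Section 4.3.

    img: 2-D float array (one channel); values are not re-clipped here.
    """
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, sigma_noise, size=img.shape)  # GNL
    k = gaussian_kernel(width, sigma_blur)                      # GBL
    pad = width // 2
    padded = np.pad(noisy, pad, mode="edge")
    out = np.zeros_like(noisy)
    h, w = noisy.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + width, j:j + width] * k)
    return out
```

Because the parameters are fixed and known during training, the extractor sees exactly this channel at every iteration, which is what lets FSIS-GAN-FD learn a degradation-resistant embedding.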
Under the three types of degradation layers, the extraction accuracy of FSIS-GAN-FD typically exceeds 94% (though it is lower than that of FSIS-GAN-WD, which is specifically designed for the non-degradation scenario). The results verify that, for the case of known degradations, the proposed framework can learn to effectively resist the fixed degradations by employing the fixed degradation layers during training.

Figure 4: Comparison of synthesized facial stego images, where the four images in (a) are produced by FSIS-GAN-WD and the images in (b) are stego images produced by FSIS-GAN-FD. With the introduction of the degradation layers, minor speckle noises emerge (highlighted with red rectangles).

Table 2
Comparison of message extraction accuracy (%) under various degradation conditions. The bold value and the value marked with an asterisk (*) denote the highest extraction accuracy with the correct secret key k and the lowest extraction accuracy with the incorrect secret key k̃, respectively.

                     Hu et al.    FSIS-GAN-WD         FSIS-GAN-FD
  Scheme               [15]      with k   with k̃      with k   with k̃
  W/o degradation     85.23      98.76    71.50*      98.22    72.08
  Fixed GNL           52.72      59.78    56.23*      95.58    72.74
  Fixed GBL           69.68      57.52    54.68*      98.58    73.78
  Fixed JCL           65.33      61.38    58.00*      98.46    72.67

   Finally, to illustrate how the robustness of message extraction changes under different degradation levels, we test different degradation types with a variety of degradation levels. Due to the space limit, we only report the JPEG compression degradation in Figure 5. As can be seen, the extraction accuracy generally decreases with the decrease of the quality factor (QF). Although the JCL adopted from HiDDeN [8] can handle non-differentiable JPEG compression, it cannot perfectly reproduce the JPEG compression artifacts. Nevertheless, FSIS-GAN-FD still achieves superior robustness compared with the other two schemes.

Figure 5: Comparison of the message extraction accuracy (%) of Hu et al. [15], FSIS-GAN-WD, and FSIS-GAN-FD under various levels of JPEG compression degradation (quality factors from 95 down to 5).

4.4. Performance of Anti-forensics

Recall that, since no cover image is involved in data hiding, our method has relatively good undetectability when exposed to a steganalyzer. However, as pointed out in [15], a well-trained forensic network can effectively identify a synthesized image. To address this issue, we explicitly consider the anti-forensics scenario and introduce the anti-forensic loss L_Fθ.
   To demonstrate the influence of the anti-forensic loss L_Fθ, we conduct an ablation experiment by excluding the loss term L_Fθ; this variant is termed FSIS-GAN (ex L_Fθ). As a concrete example, we employ the well-trained forensic network Ye-Net [19] F_θ to detect 3000 facial stego images produced by the different methods, and record the probability of missed detection (PMD). The PMDs of Hu et al. [15], FSIS-GAN (ex L_Fθ), and FSIS-GAN are 3.23%, 8.84%, and 89.91%, respectively. As clearly shown, although the facial stego images of FSIS-GAN (ex L_Fθ) look natural to humans, they are easily exposed to the forensic network, with a PMD value lower than 10%. In contrast, by introducing the anti-forensic loss term, the PMD of FSIS-GAN reaches 89.91%. This means that the proposed FSIS-GAN can effectively bypass the existing forensic network, retaining a nice anti-forensic capability.

5. Conclusion

In this work, we proposed a stego-synthesis-based data hiding method using a generative neural network, explicitly considering image degradation and anti-forensic needs. Specifically, the generator synthesizes a facial stego image from the given secret message and secret key, while the extractor aims to recover the secret message with the secret key. Through adversarial training with the discriminator, the generator can produce realistic facial stego images. Degradation layers are introduced during training, which significantly enhance the robustness of message extraction. A forensic network is also incorporated during training, in response to possible adversarial forensic analysis in the communication channel. Experimental results verified that our approach can generate more natural facial stego images, while retaining higher message extraction accuracy and a nice anti-forensic ability.
Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61901237, in part by the Open Project Program of the State Key Laboratory of CAD&CG, Zhejiang University under Grant A2006, and in part by the Ningbo Natural Science Foundation under Grant 2019A610103. Thanks to the Southeast Digital Economic Development Institute for supporting the computing facility.

References

 [1] V. Sedighi, R. Cogranne, J. Fridrich, Content-adaptive steganography by minimizing statistical detectability, IEEE Trans. Inf. Forensics Security 11 (2015) 221–234.
 [2] J. Zhou, W. Sun, L. Dong, X. Liu, O. C. Au, Y. Y. Tang, Secure reversible image data hiding over encrypted domain via key modulation, IEEE Trans. Circuits Syst. Video Technol. 26 (2015) 441–452.
 [3] L. Dong, J. Zhou, W. Sun, D. Yan, R. Wang, First steps toward concealing the traces left by reversible image data hiding, IEEE Trans. Circuits Syst. II, Exp. Briefs 67 (2020) 951–955.
 [4] S. Baluja, Hiding images within images, IEEE Trans. Pattern Anal. Mach. Intell. 42 (2020) 1685–1697.
 [5] H. Shi, J. Dong, W. Wang, Y. Qian, X. Zhang, SSGAN: Secure steganography based on generative adversarial networks, in: Pacific Rim Conference on Multimedia, 2017, pp. 534–544.
 [6] W. Tang, S. Tan, B. Li, J. Huang, Automatic steganographic distortion learning using a generative adversarial network, IEEE Signal Process. Lett. 24 (2017) 1547–1551.
 [7] J. Hayes, G. Danezis, Generating steganographic images via adversarial training, in: Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1954–1963.
 [8] J. Zhu, R. Kaplan, J. Johnson, F. Li, HiDDeN: Hiding data with deep networks, in: Proc. Eur. Conf. Comput. Vis., 2018, pp. 657–672.
 [9] K. A. Zhang, A. Cuesta-Infante, L. Xu, K. Veeramachaneni, SteganoGAN: High capacity image steganography with GANs, arXiv preprint arXiv:1901.03892 (2019).
[10] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199 (2013).
[11] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572 (2014).
[12] W. Tang, B. Li, S. Tan, M. Barni, J. Huang, CNN-based adversarial embedding for image steganography, IEEE Trans. Inf. Forensics Security 14 (2019) 2074–2087.
[13] K. Wu, C. Wang, Steganography using reversible texture synthesis, IEEE Trans. Image Process. 24 (2014) 130–139.
[14] S. Li, X. Zhang, Toward construction-based data hiding: From secrets to fingerprint images, IEEE Trans. Image Process. 28 (2018) 1482–1497.
[15] D. Hu, L. Wang, W. Jiang, S. Zheng, B. Li, A novel image steganography method via deep convolutional generative adversarial networks, IEEE Access 6 (2018) 38303–38314.
[16] Z. Zhang, G. Fu, R. Ni, J. Liu, X. Yang, A generative method for steganography by cover synthesis with auxiliary semantics, Tsinghua Science and Technology 25 (2020) 516–527.
[17] D. Berthelot, T. Schumm, L. Metz, BEGAN: Boundary equilibrium generative adversarial networks, arXiv preprint arXiv:1703.10717 (2017).
[18] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).
[19] J. Ye, J. Ni, Y. Yi, Deep learning hierarchical representations for image steganalysis, IEEE Trans. Inf. Forensics Security 12 (2017) 2545–2557.
[20] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proc. IEEE Int. Conf. Comput. Vis., 2015.
[21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in: Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6629–6640.
[22] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).