=Paper=
{{Paper
|id=Vol-3359/paper12
|storemode=property
|title=StyleGAN-Canvas: Augmenting StyleGAN3 for Real-Time Human-AI Co-Creation
|pdfUrl=https://ceur-ws.org/Vol-3359/paper12.pdf
|volume=Vol-3359
|authors=Shuoyang Zheng
|dblpUrl=https://dblp.org/rec/conf/iui/Zheng23
}}
==StyleGAN-Canvas: Augmenting StyleGAN3 for Real-Time Human-AI Co-Creation==
StyleGAN-Canvas: Augmenting StyleGAN3 for Real-Time Human-AI Co-Creation
Shuoyang Zheng*
Creative Computing Institute, University of the Arts London, 45-65 Peckham Rd, London, UK
Abstract
Motivated by the mixed initiative generative AI interfaces (MIGAI), we propose bridging the gap between StyleGAN3 and human-AI co-creative patterns by augmenting the latent variable model with the ability of image-conditional generation. We modify the existing generator architecture in StyleGAN3, enabling it to use high-level visual ideas to guide the human-AI co-creation. The resulting model, StyleGAN-Canvas, can solve various image-to-image translation tasks while maintaining the internal behaviour of StyleGAN3. We deploy our models to a real-time graphic interface and conduct qualitative human opinion studies. We use the MIGAI framework to frame our findings and present a preliminary evaluation of our models' usability in a generic co-creative context.
Keywords
generative adversarial networks, human-AI co-creation, creativity support tools
Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia
* Corresponding author.
Email: j.zheng0320211@arts.ac.uk (S. Zheng); website: https://alaskawinter.cc/; ORCID: 0000-0002-5483-6028
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: A prototype interface encapsulating a StyleGAN-Canvas model translating paper card layout into faces. The user adjusts layouts while the model provides synchronous generation based on visual similarity. A screen recording is available at: https://youtu.be/9AsfsT8uXGY

1. Introduction
Generative adversarial networks (GANs) [1] have recently developed rapidly and become a powerful tool for creating high-quality digital artefacts. In the case of images, modern approaches to improving model quality have successfully brought the generated outcomes from coarse low-resolution interpretations to realistic portraits with high diversity [2] and visual fidelity [3]. Notably, the introduction of continuous convolution in StyleGAN3 (alias-free GAN) [4] has enabled the generative network to perform equally well regardless of pixel coordinates, paving the way for more flexible human-AI interaction. Closely following the advances in deep generative neural networks, StyleGAN models have become widely used as creativity support tools, creating unconventional visual aesthetics [5, 6, 7] and novel human-AI co-creative experiences [8, 9, 10, 11]. This motivates research on human-AI co-creative applications, offering insight into the interaction possibilities between human creators and AI enabled by GANs [12].

Muller et al. [13] adapt notations introduced by Spoto and Oleynik [14], presenting mixed initiative generative AI interfaces (MIGAI), an analytical framework with a vocabulary of 11 actions to describe a human-AI interaction process. These actions are analysed into sequences to form generic human-AI co-creative patterns. However, Grabe et al. [15] identify a gap between the MIGAI framework and latent variable models such as GANs. This is partially due to latent variable models' limited ability to interpret visual design concepts such as sketches and semantic labels [16], leading to the omission of the action ideate [15], which describes using high-level concepts to guide or shape the generation [5]. Therefore, Grabe et al. [15] suggested tailoring the MIGAI framework to fit co-creative GAN applications.

Motivated by the gap between GANs and human-AI co-creative patterns, we suggest an alternative approach to bridge the latent variable model, StyleGAN3, to the co-creative framework by modifying the model's technical functioning. Specifically, by augmenting StyleGAN3 with image-conditional generation ability, we enable it to transform visual ideas into generation. This enables a tightly-coupled human-AI interaction process that emphasises using high-level visual concepts to guide the artefact and fulfil the action ideate [13], aligning with the co-creative patterns mentioned in the MIGAI framework. We limit our study to StyleGAN3 because its introduction of continuous convolution facilitates more flexible human inputs, which is a crucial feature required by our approach.

Therefore, the primary aim of this research is to augment StyleGAN3 with image-conditional generation ability for a co-creative context. To achieve this, we adapt the existing model architecture in StyleGAN3, which takes a latent vector and a class label as the model's input [4], and propose an encoder network to extract features from the conditional image. We also adapt the architecture previously applied to various image-to-image translation models [17, 18, 19] to connect the proposed encoder and StyleGAN3's generator. The modified model, StyleGAN-Canvas, takes a latent vector and an accompanying image as inputs to guide the generation. We show results from our models trained for various image-to-image translation tasks while maintaining the internal behaviours of StyleGAN3, providing more flexible and intuitive control to ideate the co-creation.

To evaluate our model in a generic co-creative context, we build a graphic interface to facilitate real-time interaction between users and the model. We conduct qualitative human opinion studies, identify potential co-creative patterns using the MIGAI analytical framework, and present an exploratory human subject study on our model. By aligning StyleGAN-Canvas with the action set in MIGAI, we hope to bring its capability into the discussion of co-creative design processes, and provide a preliminary insight into the unexplored interaction possibilities enabled by StyleGAN.

The rest of the paper is structured as follows. We summarise related works on StyleGAN, image-conditional generation, and the co-creative pattern in Section 2. We then present our modification to the model's architecture in Section 3. We conduct experiments on our models and showcase the results and applications in Section 4. Next, we evaluate the model in a co-creative context in Section 5. Section 6 highlights limitations and future studies.
2. Related Works

2.1. Alias-Free GAN
Our model architecture is extended from StyleGAN3 (Alias-Free GAN). This section summarises the background of StyleGAN and reviews the continuous convolution approach introduced by StyleGAN3 that enables its translation and rotation equivariance.

StyleGAN [20] is a style-based generator with a regularised latent space offering high-level style control over image generation. The StyleGAN generator comprises a mapping network 𝑀 that transforms the initial latent code 𝑧 to an intermediate latent code 𝑤 ∼ 𝑊, and a synthesis network 𝐺 with a sequence of 𝑁 synthesis blocks, each comprising convolutions controlled by the intermediate latent code 𝑤, non-linearities, and upsampling, which eventually produces the output image 𝑧_𝑁 = 𝐺(𝑧_0; 𝑤). Its high-level style control is achieved by adaptive instance normalisation (AdaIN) [21], an approach that amplifies specific channels of feature maps on a per-sample basis. In practice, a learned affine transform is applied to the intermediate latent space 𝑊 to obtain style codes 𝑦, which are then used as scalar components to modulate the corresponding feature maps before each convolution layer. This architecture was later revised in StyleGAN2 [22] by replacing instance normalisation with feature map demodulation, and inherited by StyleGAN3.

In later work, the continuous convolution approach [23] implemented in StyleGAN3 (alias-free GAN) [4] dramatically changed the internal representations in the synthesis network 𝐺. Signals in 𝐺 are processed in the continuous domain rather than the discrete domain to make the network equivariant, meaning that any operation 𝑓 in the network is equivariant to a spatial transformation 𝑡 (𝑡 ∘ 𝑓 = 𝑓 ∘ 𝑡). This eliminates positional references; therefore, the model can be trained on unaligned data, and the "texture sticking" behaviour of standard GAN models is removed.

Moreover, the input of the StyleGAN3 synthesis network is a spatial map 𝑧_0 defined by Fourier features [24] to precisely model translation and rotation. The spatial map 𝑧_0 is sampled from uniformly distributed frequencies, fixed after initialisation, and spatially infinite [4]. Its translation and rotation parameters are calculated by a learned affine layer based on the intermediate latent space 𝑊. The spatial map 𝑧_0 acts as a coordinate map that later layers can grab on to, and therefore defines the global transformations in the synthesis network [25].
2.2. Image-Conditional Generation
Methods for image-conditional generation aim to generate a corresponding image given an image from the source domain, depicting objects or scenes in different styles or conditions. Current solutions are usually categorised under two approaches. The first approach uses a linear encoder-decoder architecture [26], in which the input image is encoded into a vector to match the target domain, and then decoded into an output image. This method was later extended to a generic framework for various tasks, such as sketch- and layout-to-image, facial frontalisation, inpainting, and super-resolution [27].

The second direction uses conditional GANs [17] with U-net architectures [28]. It uses a similar encoder-decoder setting, but instead of encoding the input image into a vector, it uses the residual feature maps from the encoder as spatial information and propagates them to the decoder through skip connections [29]. The propagated features are concatenated with the outputs of the corresponding decoder layers. This method aims to provide the generator with a mechanism to circumvent the bottleneck layer and allow spatial information to be shuttled from the encoder to the decoder [28]. It therefore introduces a strong locality bias [26] into the generation, which means each pixel in the output has a positional reference to the input, and the general image structure is preserved during translation. This method has been extended to various tasks such as line-conditioned generation [19], layout-to-image synthesis [30], and semantic region-adaptive synthesis [31].

2.3. Human-AI Co-Creative Pattern
Mixed Initiative Generative AI Interfaces (MIGAI) [13] describe modes of interaction in which both the human and the generative AI are engaged in a creative process. The framework aims to frame subprocesses in a creative flow with a set of 11 actions. A generic co-creative pattern is identified using the MIGAI framework, in which the AI system first learns a target domain, then the human ideates a design concept to guide the artefact; subsequently, the human and AI system take turns to evaluate and adjust, eventually producing the outcome [15]. Our study focuses on ideating, a process of conceptualising design solutions into high-level abstractions. A later exploratory study emphasises the potential of enhancing the creative process through human-AI collaborative ideation [16]. Methods for this have been advanced in diverse fields of application: co-creative game content [32], collaborative text editing [33], and text-guided image generation [34].
3. Extending StyleGAN to Image-Conditional Generation
The objective of our approach is to allow StyleGAN3 to use images as conditions to guide the generation. This section discusses our modification of the current StyleGAN3 architecture to address this objective.

As mentioned in Section 2.1, current StyleGAN3 models learn a mapping from a random noise vector 𝑧 to an output image 𝑍_𝑁 = 𝐺(𝑧). The modification aims to extend the input from a vector 𝑧 to a conditional image 𝑥 combined with 𝑧, and to generate a 𝑍_𝑁 that is close to the corresponding ground truth 𝑦. To do this, we first need a feature extraction encoder 𝐸 that extracts features from 𝑥, then adapt the generator 𝐺 to produce outputs based on the extracted features and the input vector 𝑧, i.e. 𝑍_𝑁 = 𝐺(𝐸(𝑥), 𝑧). In addition, the training objective should push the generation closer to the corresponding image 𝑦.

Figure 2: We build a residual feature encoder and adapt the StyleGAN3 generator. The main datapath consists of (i) 10 downsampling residual blocks (Section 3.1.1), each consisting of a mapped shortcut with a 1 × 1 convolutional layer and batch normalisation, and a downsampling block with two convolutional layers, an activation layer (Leaky ReLU), batch normalisations and a clamping layer, (ii) a conditional mapping network (Section 3.1.2), (iii) the StyleGAN3 mapping network, (iv) adapted StyleGAN3 synthesis blocks (Section 3.2).
3.1. Feature Extraction Encoder

3.1.1. Adapted Residual Network
The feature extraction encoder 𝐸 employs an adapted ResNet [29] architecture as the encoder backbone, which has previously been used for feature map extraction in image-to-image translation works [27]. As shown in Figure 2 (left), the encoder network downsamples feature maps to (𝑥/2⁶, 𝑦/2⁶), where 𝑥 and 𝑦 denote the width and height of the input image. In addition, as StyleGAN uses mixed precision to speed up training and inference [24], we utilise similar techniques and reduce the precision to FP16 for the first five residual blocks in the encoder. Consequently, this requires pre-normalisation and an extra clamping layer that clamps the output of the convolutional layers to ±2⁹ [35].
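A minimal sketch of one downsampling residual block from Figure 2 (left), assuming PyTorch. The channel widths, the LeakyReLU slope, and the use of autocast for the FP16 blocks are illustrative choices; the ±2⁹ activation clamp follows the text above.

```python
import torch
import torch.nn as nn

class DownResBlock(nn.Module):
    """Sketch: a 1x1-convolution shortcut with batch normalisation, plus a main
    path of two convolutions with LeakyReLU and batch normalisation, followed
    by activation clamping (Figure 2, left)."""
    def __init__(self, in_ch, out_ch, clamp=2 ** 9, use_fp16=False):
        super().__init__()
        self.clamp, self.use_fp16 = clamp, use_fp16
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=2, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        # the first five blocks run in FP16 on GPU (mixed precision, Section 3.1.1)
        with torch.cuda.amp.autocast(enabled=self.use_fp16 and x.is_cuda):
            y = self.main(x) + self.shortcut(x)
            y = y.clamp(-self.clamp, self.clamp)   # keep FP16 numerically stable
        return y.float()

block = DownResBlock(64, 128)
print(block(torch.randn(1, 64, 256, 256)).shape)  # torch.Size([1, 128, 128, 128])
```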
3.1.2. Conditional Mapping Network
Previous works on StyleGAN encoders [27] have highlighted that replacing selected layers of the extended latent space 𝑊+ with computed latent codes can facilitate multi-modal synthesis. The extended latent space 𝑊+ can be roughly divided into coarse, medium, and fine layers, corresponding to different levels of detail and editability [36]. This motivates us to add a conditional mapping network as the epilogue layer of our residual feature encoder, which uses the same mapping network architecture as StyleGAN, but takes a flattened 512-dimensional vector sampled from the encoder's bottleneck and produces a replacement latent space 𝑊+′. 𝑊+′ is then concatenated with a portion of the latent space 𝑊+ produced by the original mapping network. This aims to facilitate multi-modal generation.
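A minimal sketch of the latent mixing described above: codes predicted from the encoder bottleneck replace a subset of the 𝑊+ layers, while the remaining layers come from the original mapping network. The number of 𝑊+ layers, the split point, and the small MLP used as a stand-in mapping network are assumptions.

```python
import torch
import torch.nn as nn

def mapping_mlp(in_dim=512, w_dim=512, depth=4):
    """Small MLP standing in for a StyleGAN-style mapping network."""
    layers, dim = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(dim, w_dim), nn.LeakyReLU(0.2)]
        dim = w_dim
    return nn.Sequential(*layers)

class ConditionalMapping(nn.Module):
    """Sketch: mix W+ codes from the original mapping network (driven by z)
    with replacement codes W+' predicted from the encoder bottleneck."""
    def __init__(self, num_ws=16, num_cond_ws=6, w_dim=512):
        super().__init__()
        self.num_ws, self.num_cond_ws = num_ws, num_cond_ws
        self.mapping = mapping_mlp(w_dim, w_dim)        # z -> w
        self.cond_mapping = mapping_mlp(w_dim, w_dim)   # bottleneck -> w'

    def forward(self, z, bottleneck):
        w = self.mapping(z)[:, None].repeat(1, self.num_ws, 1)   # (B, num_ws, 512)
        w_cond = self.cond_mapping(bottleneck)[:, None].repeat(1, self.num_cond_ws, 1)
        # replace the first (coarse) layers with the conditional codes
        return torch.cat([w_cond, w[:, self.num_cond_ws:]], dim=1)

wplus = ConditionalMapping()(torch.randn(2, 512), torch.randn(2, 512))
print(wplus.shape)  # torch.Size([2, 16, 512])
```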
3.2. Adapting the Generator
As mentioned in Section 2.1, the StyleGAN3 generator consists of a mapping network and a synthesis network; our modification aims to connect its synthesis network to the information extracted by the feature extraction encoder.

We start by modifying the input layer of the synthesis network. The original network utilises a fixed-size spatial map 𝑧_0 defined by Fourier features [24] as its input to model the translation and rotation parameters. In our architecture, however, this entire input layer is left out, and we use feature maps from the last layer of the encoder directly as the synthesis network's input. This lets the translation and rotation parameters be inherited from the spatial feature maps.

Next, skip connections between the feature encoder and the synthesis network aim to provide precise structural information about the input images. A U-net [28] architecture is well-suited for propagating high-level details from the encoder to the decoder [26]. However, as mentioned in Section 2.1, experiments on StyleGAN3 have demonstrated that high-level feature maps in the synthesis network encode information in continuous domains instead of discrete domains [4]; relying on skip connections to propagate discrete features from the encoder may therefore deviate from StyleGAN3's internal generation behaviour. To tackle this, we first move the concatenation node from the end of each synthesis block to the point before the filtered non-linearities layer, as shown in Figure 2 (right). We also remove the padding layers in each synthesis block to ensure the dimensions match the skip connections. Additionally, research on U-Net and its variants has demonstrated that a simplified structure with fewer feature fusions can achieve reasonable results [37, 38]. Therefore, we reduce the number of feature fusions and limit connections to only the first 𝑁 layers. In practice, the skip connections each connect layer 𝑛 − 𝑖 of the encoder to layer 𝑖, where 𝑛 is the total number of layers in the encoder and 𝑖 is limited to 𝑖 ∈ (0, 5]. The experiment in Section 4.1 will show that 𝑁 = 5 is the configuration that leads to the most stable training and efficient generation. This reduction maintains the unification of the network's internal behaviour, while taking advantage of the efficiency of U-shaped structural models.
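A minimal sketch of the reduced skip-connection wiring described above, assuming simplified synthesis blocks (plain upsampling and convolution rather than StyleGAN3's filtered, modulated layers). Channel widths and the exact concatenation point are illustrative; only the first 𝑁 layers receive encoder features.

```python
import torch
import torch.nn as nn

class SkipConditionedSynthesis(nn.Module):
    """Sketch: encoder features from layer n-i are concatenated into synthesis
    layer i for i = 1..N only (N = 5 in the final configuration)."""
    def __init__(self, channels=(512, 512, 256, 128, 64, 64, 32), n_skips=5):
        super().__init__()
        self.n_skips = n_skips
        self.blocks = nn.ModuleList()
        for i in range(1, len(channels)):
            in_ch = channels[i - 1] * (2 if i <= n_skips else 1)
            self.blocks.append(nn.Sequential(
                nn.Upsample(scale_factor=2),
                nn.Conv2d(in_ch, channels[i], 3, padding=1),
                nn.LeakyReLU(0.2),
            ))

    def forward(self, enc_feats):
        # enc_feats: encoder activations ordered deepest-first; enc_feats[0]
        # replaces the Fourier-feature input map (Section 3.2)
        x = enc_feats[0]
        for i, block in enumerate(self.blocks, start=1):
            if i <= self.n_skips:
                x = torch.cat([x, enc_feats[i]], dim=1)   # layer n-i of the encoder
            x = block(x)
        return x

chs = (512, 512, 256, 128, 64, 64, 32)
feats = [torch.randn(1, chs[0], 4, 4)]   # deepest encoder map, replaces the input layer
feats += [torch.randn(1, chs[i - 1], 4 * 2 ** (i - 1), 4 * 2 ** (i - 1)) for i in range(1, 6)]
print(SkipConditionedSynthesis(chs)(feats).shape)  # torch.Size([1, 32, 256, 256])
```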
3.3. Loss Functions
The standard StyleGAN loss consists of the logistic GAN loss and regularisation terms (i.e., 𝑅_1). We combine the training objectives of StyleGAN with the pixel-wise distance and perceptual losses that have been used in conditional GANs.

The model is trained using different combinations of objectives in two training phases. The first phase runs from zero to the first 300k images; this is also the phase where the training images are blurred with a Gaussian filter to prevent early collapses [4]. During this phase, we use the pixel-wise 𝐿_2 distance between input images 𝑥 and target images 𝑦, the logistic loss 𝐿_GAN and the regularisation term 𝐿_reg, as follows:

𝐿_2(𝐺, 𝐸) = E_{𝑥,𝑦,𝑧}[‖𝑦 − 𝐺(𝐸(𝑥), 𝑧)‖_2]   (1)

𝐿_GAN(𝐷, 𝐺, 𝐸) = E_𝑦[log 𝐷(𝑦)] + E_{𝑥,𝑧}[log(1 − 𝐷(𝐺(𝐸(𝑥), 𝑧)))]   (2)

𝐿_reg(𝐸, 𝑀) = E_{𝑥,𝑧}[‖𝐸(𝑥) − 𝑤̄‖_2 + ‖𝑀(𝑧) − 𝑤̄‖_2]   (3)

where 𝐺 and 𝐷 denote the generator and the discriminator, 𝐸 denotes the feature encoder, and 𝑀 denotes the mapping network in the generator. The phase-1 training loss 𝐿_phase1 is then defined as:

𝐿_phase1(𝐷, 𝐺, 𝐸) = 𝜆_1 𝐿_2(𝐺, 𝐸) + 𝐿_GAN(𝐷, 𝐺, 𝐸) + 𝐿_reg(𝐸, 𝑀)   (4)

The second phase starts after the training reaches 300k images. We add a perceptual loss 𝐿_VGG utilising a pre-trained VGG19 [39] network, which has been used in the training of previous conditional GANs and leads to finer details in the resulting images [18], defined as follows:

𝐿_VGG(𝐺, 𝐸) = E_{𝑥,𝑦,𝑧}[‖𝐹(𝑦) − 𝐹(𝐺(𝐸(𝑥), 𝑧))‖_2]   (5)

The phase-2 loss is then calculated as:

𝐿_phase2(𝐷, 𝐺, 𝐸) = 𝜆_1 𝐿_2(𝐺, 𝐸) + 𝜆_2 𝐿_VGG(𝐺, 𝐸) + 𝐿_GAN(𝐷, 𝐺, 𝐸) + 𝐿_reg(𝐸, 𝑀)   (6)

where 𝐹 denotes the pre-trained VGG19 feature extractor, and 𝜆_1 and 𝜆_2 are constant weights on the loss terms, which vary across different training data and configurations.
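The phase-2 objective in Eq. (6) can be sketched from the generator's side as follows. The VGG feature extractor, the pooling of 𝐸(𝑥) inside the regulariser, and the 𝜆 weights are stand-ins and simplifications rather than the paper's exact training configuration.

```python
import torch
import torch.nn.functional as F

def phase2_generator_loss(G, E, D, M, vgg_feats, x, y, z, w_avg,
                          lambda_1=10.0, lambda_2=1.0):
    """Sketch of Eq. (6) on the generator side. `vgg_feats` is any frozen
    feature extractor standing in for VGG19; the GAN term uses the
    non-saturating softplus (logistic) form common in StyleGAN training."""
    fake = G(E(x), z)

    l2 = (y - fake).square().mean()                          # Eq. (1), pixel-wise
    lvgg = (vgg_feats(y) - vgg_feats(fake)).square().mean()  # Eq. (5), perceptual
    lgan = F.softplus(-D(fake)).mean()                       # logistic GAN term
    # Eq. (3): keep encoder / mapping outputs close to the average latent w_avg
    lreg = ((E(x).mean(dim=(2, 3)) - w_avg).square().mean()
            + (M(z) - w_avg).square().mean())

    return lambda_1 * l2 + lambda_2 * lvgg + lgan + lreg
```

Setting lambda_2 to zero for the first 300k images and to its final value afterwards reproduces the two-phase schedule described above.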
4. Experiments and Applications
In the following Section 4.1, we analyse the effectiveness of the U-net structure proposed in Section 3.2 through a set of ablation studies. Next, in Section 4.2, we demonstrate the training process of our model for several image-to-image translation tasks with different datasets and showcase their results. We also experiment with scaling the model to larger canvases in Section 4.3. Finally, we build a graphic interface that implements our models with a set of transformation filters in Section 4.4.

Figure 3: Ablating the skip connections.

4.1. Analysis of the Skip Connections
Methodology. Section 3.2 proposed reducing the number of skip connections to only the first 𝑁 layers. To verify the feasibility of this design and find the most effective configuration of 𝑁, we first trained six models on the Flickr-Faces-HQ (FFHQ) dataset [20] at 512 × 512 resolution for an ablation test [40] on the skip connections, which is a method to investigate knowledge representations in artificial neural networks by disabling specific nodes in a network. We use inversion tasks [41] as the training goal, in which the models are trained to reconstruct a given image without translation, aiming to test the efficiency of the encoder. Skip connections are installed between the encoder's last 𝑁 layers and the synthesis network's first 𝑁 layers, where 𝑁 progressively reduces from 5 to 0 in these six models. The resulting outputs are compared across the six models to decide the final configuration.

Results. Figure 3 (left) shows the results of the ablation test for the 𝑁 = 5 to 𝑁 = 2 models trained on 1680k samples. The remaining two models (𝑁 = 1 and 𝑁 = 0) were unable to converge to sensible results after 800k samples and were therefore aborted. The results show notable improvements when increasing the number of skip connections, especially in preserving details such as background, hands, unique make-up and even details in hair. Therefore, 𝑁 = 5 is chosen as the final setting.

4.2. Conditional Image Synthesis and Editing
Methodology. Conditional image synthesis uses image-conditional models to generate an image 𝑍 corresponding to the ground truth image 𝑦, given an input condition image 𝑥. We tested our architecture on three conditional image synthesis tasks: (i) generating face images from blurry images, (ii) generating face images from Canny edges, and (iii) generating realistic landscape images from blurry images.

First, the ground truth 𝑦 was pre-processed into the condition 𝑥. In the deblurring models, the pre-processing pipeline is a resizing layer that scales the resolution of 𝑦 to 256 × 256, followed by a Gaussian filter with sigma 𝜎 = 28 that turns the resized 𝑦 into a blurred image as condition 𝑥. In the edge-to-face model, 𝑦 is first resized to 256 × 256, and a Canny edge detector [42] is then applied to the resized 𝑦 to produce edges as condition 𝑥. 𝑦 and 𝑥 are provided to the training as paired data. After the models were trained, the same pipelines were applied to pre-process input 𝑥′ for inference. The training process is illustrated in Figure 4 (top).

The dataset used for the face generation models was Flickr-Faces-HQ (FFHQ) [20], and the dataset used for landscape photo generation was Landscapes High-Quality (LHQ) [43], both at 512 × 512 resolution. We used StyleGAN3-R, the translation and rotation equivariant configuration of StyleGAN3, as the generator backbone for the FFHQ dataset, and StyleGAN3-T, the translation equivariant configuration of StyleGAN3, as the generator backbone for the LHQ dataset. The training configuration was identical to StyleGAN3.

Results. The deblurring model on the FFHQ dataset was trained with 3700k samples, and the deblurring model on the LHQ dataset was trained with 6700k samples. We compare ground truth samples 𝑦, conditions 𝑥, and generation outcomes 𝑍_𝑁 in Figure 5 and Figure 6. The edge-to-face model on the FFHQ dataset was trained with 3700k samples. In Figure 7, we compare ground truth samples 𝑦, conditions 𝑥, and generation outcomes 𝑍_𝑁 with three randomly selected latent vectors for multi-modal generations.
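A minimal sketch of the pre-processing pipelines described in the methodology above, using OpenCV. The paper specifies the 256 × 256 resize, the Gaussian filter with 𝜎 = 28, and a Canny detector [42]; the Canny thresholds and the interpolation mode here are assumptions.

```python
import cv2
import numpy as np

def make_condition(y_bgr: np.ndarray, mode: str = "blur") -> np.ndarray:
    """Turn a ground-truth image y into a condition image x (Section 4.2).
    mode='blur' follows the deblurring models; mode='edges' the edge-to-face model."""
    y_small = cv2.resize(y_bgr, (256, 256), interpolation=cv2.INTER_AREA)
    if mode == "blur":
        # ksize=(0, 0) lets OpenCV derive the kernel size from sigma
        return cv2.GaussianBlur(y_small, (0, 0), sigmaX=28)
    gray = cv2.cvtColor(y_small, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                # hypothetical thresholds
    return cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR)   # keep 3 channels for the encoder

# usage: x = make_condition(cv2.imread("sample.png"), mode="edges")
```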
Figure 4: During training, target images are processed into conditions, the model generates fake images, and the fake images and the target images are used to calculate the loss.

Figure 5: Results of our model for deblurring on FFHQ 512 × 512 (left: ground truth samples, middle: processed conditions, right: generations).

Figure 6: Results of our model for deblurring on LHQ 512 × 512 (left: ground truth samples, middle: processed conditions, right: generations).

Figure 7: Results of our edge-to-face model on FFHQ 512 × 512 (ground truth samples, processed conditions, and three generations with different latent vectors).

Figure 8: Results of our edge-to-face model on FFHQ, and local editing.
In addition, the Canny edges model provides an alternative approach to local editing. Modifying edges in the condition image allows the model to alter semantic elements in the generation. As illustrated in Figure 4 (bottom), the modified conditions can be obtained by combining and adapting existing conditions from other images. For example, in Figure 8, we superimposed edges processed from other images onto the original edges to add a hair fringe, glasses and a smile; we painted on the original edges to modify the eyes and add sunglasses.

4.3. Large Canvas
As mentioned in Section 3.2, the padding layers in the synthesis network are removed, and the entire generator carries no unintentional absolute positional references [44]. We hypothesised that our modified architecture induces an extendable generation canvas that can be enlarged after training at a fixed resolution, without additional training required. Therefore, we enlarge the input resolution to test the model's ability on a larger canvas.

The model for landscape photo generation was trained on the dataset with 512 × 512 resolution, taking inputs at 256 × 256 resolution. To enlarge the generation canvas, we widened the inputs, expanding their resolution to 1024 × 256 and 768 × 256. The expanded inputs were then fed directly into the generator and convolved by each convolutional layer. Therefore, the expected output resolutions are 2048 × 512 and 1536 × 512. Additional training is not required during the experiment.

Figure 9 shows the resulting generations. While the outputs are expanded up to four times larger than the original canvas, the generation quality remains unchanged. This ensures that the input canvas can be flexibly expanded when the models are implemented into user interfaces.

Figure 9: Examples of 512 × 2048 / 512 × 1536 images generated from a model originally trained on the 512 × 512 dataset. More results can be accessed at https://github.com/jasper-zheng/StyleGAN-Canvas.
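Because both the encoder and the adapted generator are fully convolutional (and the fixed input map has been replaced by encoder features), a wider condition image simply yields a wider output. A toy sketch of this property with stand-in convolutional modules (not the trained models):

```python
import torch
import torch.nn as nn

# fully convolutional stand-ins: any conv-only E/G accepts wider inputs
E = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
G = nn.Sequential(nn.Upsample(scale_factor=4), nn.Conv2d(16, 3, 3, padding=1))

x_square = torch.randn(1, 3, 256, 256)   # training-time condition size
x_wide = torch.randn(1, 3, 256, 1024)    # widened canvas, no retraining

print(G(E(x_square)).shape)  # torch.Size([1, 3, 512, 512])
print(G(E(x_wide)).shape)    # torch.Size([1, 3, 512, 2048])
```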
4.4. User Interface
Implementation. The deblurring models were deployed to a graphic user interface. The models were running on a cloud server. We used Flask and SocketIO for bi-directional communication between the web client and the server. The generation runs in real time at roughly ten frames per second on an NVIDIA RTX A4000 GPU. The code for our implementation is available at https://github.com/jasper-zheng/realtime-flask-model.
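A minimal sketch of the server side of such a deployment with Flask and Flask-SocketIO; the event names, payload format and the run_model placeholder are assumptions made for illustration and are not taken from the linked repository.

```python
# pip install flask flask-socketio
import base64

from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app, cors_allowed_origins="*")

def run_model(frame_bytes: bytes) -> bytes:
    """Placeholder for decoding the webcam frame, running the encoder and
    generator on the GPU, and re-encoding the generated image (hypothetical)."""
    return frame_bytes

@socketio.on("frame")                      # client sends a base64-encoded frame
def handle_frame(data):
    frame = base64.b64decode(data["image"])
    generated = run_model(frame)
    emit("generation", {"image": base64.b64encode(generated).decode("ascii")})

if __name__ == "__main__":
    socketio.run(app, host="0.0.0.0", port=5000)
```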
Combining network bending. In addition to the baseline model, the generation system is implemented with network bending [45], an approach that alters a trained model's computational graph by inserting transformation filters between different convolutional layers, allowing the model to generate novel samples that diverge from the training data [46]. We used the clustering algorithm presented by Broad et al. [45] to group spatially similar feature maps in selected layers and allow the transformation filters to be inserted into specific groups. We trained softmax feature extraction CNNs for each layer and clustered each flattened layer with the k-means algorithm. However, differently from the implementation in Broad et al.'s work, we increase the length of the flattened vector from 10 to 32 (𝑣 ∈ ℝ³²) to accommodate the larger number of channels in StyleGAN3-R.

Interface design. Figure 10 shows a screenshot of the deployed system. The interface allows inputs from a webcam or locally selected image files (top left). It allows users to pause and resume generation, and to switch the input between the webcam and local files. Users can insert transformation filters into certain groups in certain layers. The list on the bottom left presents the layers available for network bending operations. Once a specific layer is selected, the system shows the current clusters of feature maps in that layer. Users can regenerate clusters according to the feature maps from the current frame. Then, users activate the 'route' button to apply transformation filters to the cluster. The system provides basic transformation filters, including erosion and dilation, multiply, translation, rotation, and scale.

Figure 10: A screenshot of our interface.
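A minimal sketch of the clustering-and-filtering step described above: the channels of a selected layer are grouped by k-means on their flattened activations, and a dilation-style filter is applied to one group before the activations continue through the network. The paper clusters features produced by per-layer softmax CNN extractors and uses 32-dimensional vectors; clustering raw activations here is a simplification.

```python
import torch
from sklearn.cluster import KMeans

def bend_layer(feats: torch.Tensor, n_clusters: int = 8, target_cluster: int = 0,
               kernel: int = 5) -> torch.Tensor:
    """feats: (C, H, W) activations of one synthesis layer for one sample.
    Group channels by spatial similarity, then apply a dilation-like max-pool
    filter to the channels in the chosen cluster (network bending)."""
    c = feats.shape[0]
    flat = feats.reshape(c, -1).detach().cpu().numpy()   # one vector per channel
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(flat)

    bent = feats.clone()
    mask = torch.as_tensor(labels == target_cluster)
    if mask.any():
        selected = bent[mask].unsqueeze(0)               # (1, k, H, W)
        dilated = torch.nn.functional.max_pool2d(
            selected, kernel_size=kernel, stride=1, padding=kernel // 2)
        bent[mask] = dilated.squeeze(0)
    return bent

# usage: bent = bend_layer(torch.randn(64, 32, 32))
```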
5. Human Subjects Study and Evaluation
This section presents a human subjects study to evaluate our models in a human-AI co-creation context. The study uses a thematic analysis approach to identify potential co-creative patterns defined by the MIGAI framework [13] that underlie the interaction.

5.1. Methodology
In the user study, we asked six participants to use the generation interface described in Section 4.4. For the webcam inputs, coloured paper cards and scissors were available to the participants. Participants were asked to arrange and lay out paper cards in front of a webcam pointing at a paper canvas, aiming to control the generation until it reached something they liked. The transformation filters were also available to fine-tune the creation further. The models described in Section 4.2 were available to the participants, including the deblurring model and the edge-to-face model trained on Flickr-Faces-HQ (FFHQ) [20], and the deblurring model trained on Landscapes High-Quality (LHQ) [43].

The experiment followed the qualitative research procedure described by Adams et al. [47]. The six participants were divided into two groups of three. We conducted the study with the first group and ran an analysis to emphasise issues raised by the participants based on their frequency and fundamentality, leading to tentative findings. The interview questions were then revised for the second group to probe and grow these findings.

The study was divided into two parts, each lasting 15 minutes. The first part asked participants to familiarise themselves with the interface and explore the system components. The second part asked the participants to create a work they liked. Observations were conducted while participants interacted with the framework, and a 5-minute semi-structured interview with open-ended questions followed each part of the experiment.

Participants were questioned on their attitudes regarding this form of interaction, the creation process, their generation outcomes, and the differences in their perceptions of the different models. We loosely followed the interview template during the interviews while ensuring these four topics were covered.

Figure 11 shows some examples of works created by the participants.

Figure 11: Examples of works created by participants.

5.2. Analysis
In this section, we use the thematic analysis [48] method to aggregate comments collected from the interviews, aiming to identify critical factors influencing the interaction. We used the MIGAI analytic framework [13] to frame the human-AI co-creative patterns.

5.2.1. Learn
The action learn in the MIGAI framework describes the AI model using training data to construct its knowledge, usually involving the choice of datasets, model architectures, and training configurations. Most participants used the model trained on the landscape dataset in the end. When questioned on the reason for this decision, they suggested that the surprising, unexpected results produced by the landscape model are more likely to be accepted by their visual aesthetic. Although both models can create unrealistic imagery, distortion and oddities on human faces may easily lead to uncanny feelings and, more importantly, negative ethical issues such as bias and offence. Therefore, the choice of training data is critical for co-creation.

5.2.2. Ideate
The action ideate describes the process of using high-level concepts to guide the generation. We collected the comments aggregated around this process. Some positive comments indicate that creating shapes using paper cards is an intuitive generation process, whereas some negative comments suggest that the lack of control over details (e.g., textures in the landscapes, details of the facial elements) leads to confusion and complaint. Most participants pointed out that they would use this form of generation as inspiration, giving them a high-level sketch developed from their ideated concept. Alternatively, they may treat it as a playful experience or simply pursue an abstract visual effect. However, if the generation aims to create a serious piece of work, they would eventually switch to more stable methods with lower-level control to refine the work, such as Photoshop or CAD software. This is because, in this case, they want to be manually in charge of finer details such as lighting and textures.
5.2.3. Adapt
The action adapt describes adjusting existing artefacts. During the study, we observed that participants primarily focused on exploring the outcome of the computational agent by arbitrarily arranging the paper cards. When they reached a layout that triggered interesting results, they iteratively adjusted the arrangement until achieving a satisfying result. Some participants tried to tackle the lack of control by utilising other pre-processing or post-processing steps. Figure 12 illustrates an example of an action that attempts to fix an oddity by slightly warping the intermediate image in other image editing software. Figure 13 illustrates a sequence of edits performed on intermediate condition images to intentionally create unrealistic and novel outcomes. The edges were edited and assembled before the generation using Photoshop, and then uploaded to the interface as inputs. The user evaluated the current outcome, and then iteratively fine-tuned the edges according to their preferences.

Figure 12: Slightly warping the intermediate image fixes the peculiar effect in the waves.

Figure 13: Editing and assembling the condition images to intentionally create unrealistic and novel outcomes.

5.2.4. IterativeLoop
The MIGAI framework uses IterativeLoop to describe humans reflectively learning from the co-creative process. We observed that some participants memorised useful patterns of shapes and colours that may trigger satisfactory results, and later used the paper cards to reproduce these patterns. Some participants described this process as a way to learn the preferences of the AI model, steering the co-creation through reflection.

5.3. Discussion
The motivation of this paper was to bring style-based GAN models into a co-creative context. By extending the StyleGAN model to image-conditional generation, these models allow human agents to ideate a high-level concept of the artefacts, such as a blurred arrangement or edges. The flexibility of input and the semantically meaningful control are critical features for effective co-creative experiences. Meanwhile, to create a sense of a "cooperating partner", the computational agent needs to maintain a certain level of unpredictability, and the AI's contribution needs to partly influence the human agents' decisions [49]. Therefore, the co-creative process balances machine autonomy and human creativity. Our current results are good indications that human agents act as a director who steers the model by organising and reusing the learned knowledge in the computational agent. While further studies on the model architecture may improve the generation quality, the current results in this research show that bridging StyleGAN3 to the co-creative context is possible. Furthermore, it could be employed for novel and unique co-creative experiences. For example, Figure 14 shows an interactive installation with webcam inputs that produces 704 × 1280 images in real time.

Figure 14: An interactive installation implementing our framework; the model runs in real time with webcam input at 10 fps.
6. Conclusion
In this work, we augmented StyleGAN3 with the ability of image-conditional generation, enabling it to turn high-level visual ideas into generation. This augmentation aligned latent variable models with the co-creative patterns mentioned in the MIGAI framework and brought StyleGAN3 into a co-creative context. We adapted the existing model architecture and proposed an encoder network to extract information from the conditional image. We demonstrated the modified architecture, StyleGAN-Canvas, on various image-to-image translation tasks with different datasets. In addition, we deployed our models to a graphic interface to facilitate real-time interaction between users and the model. To evaluate our models in a co-creative context, we conducted qualitative human opinion studies and identified potential co-creative patterns using the MIGAI analytic framework.

6.1. Limitations and Future Work
An overall criticism was the need for more interpretable control in the system. While the input image acts as a blueprint for the generation, the user also needs precise control over the details when using the model as a creative tool. This leads us to rethink the design of the intermediate representation. Our framework currently only implements deblurring models for the experiment; however, it might be more useful to use intermediate representations that encode more detailed information (e.g., boundary maps or edges), like the interactive demos in pix2pixHD.

Besides, the proposed architecture can also be improved in technical aspects. Our approach proposes an alternative for extending StyleGAN models to image-conditional generation. Although it has demonstrated its potential in solving several image-to-image translation tasks, the detailed architecture still needs further investigation and refinement to improve the generation quality. Our model architecture utilises the equivariant generator in StyleGAN3; however, our feature extractor is not yet rotation equivariant. Therefore, the generation may suffer when the rotation is not encoded. Figure 15 shows an example of failure where the encoder does not preserve the rotation. It would be beneficial to make the feature encoder equivariant in future work.

Figure 15: The encoder failed to extract the rotation.

6.2. Ethical Considerations and Energy Consumption
Potential negative societal impacts of images produced by GANs [50] were considered throughout the project. The models trained on the FFHQ dataset are for purely academic purposes, and the interactive prototype will not be publicly distributed by any means. Model training used approximately 300 hours of computation on an A100 SXM4 80GB (TDP of 400W). Total emissions are estimated to be 25.92 kg CO2, as calculated by the Machine Learning Impact calculator [51]. Paper cards in the experiments were allocated to participants in limited quantities, and reused during and after the experiments.

Acknowledgments
This research was carried out under the supervision of Prof. Mick Grierson. I sincerely appreciate and treasure his feedback and comments.

References
[1] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, 2014. arXiv:1406.2661.
[2] A. Sauer, K. Schwarz, A. Geiger, Stylegan-xl: Scaling stylegan to large diverse datasets, 2022. arXiv:2202.00273.
[3] B. Liu, Y. Zhu, K. Song, A. Elgammal, Towards faster and stabilized GAN training for high-fidelity few-shot image synthesis, CoRR abs/2101.04775 (2021). URL: https://arxiv.org/abs/2101.04775. arXiv:2101.04775.
[4] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, T. Aila, Alias-free generative adversarial networks, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 852–863. URL: https://proceedings.neurips.cc/paper/2021/file/076ccd93ad68be51f23707988e934906-Paper.pdf.
[5] S. Berns, T. Broad, C. Guckelsberger, S. Colton, Automating generative deep learning for artistic purposes: Challenges and opportunities, 2021. arXiv:2107.01858.
[6] M. Som, Mal som @errthangisalive, 2020. URL: http://www.aiartonline.com/highlights-2020/mal-som-errthangisalive/.
[7] D. Schultz, Artificial images, 2020. URL: https://artificial-images.com/project/you-are-here-machine-learning-film/.
[8] T. Park, M.-Y. Liu, T.-C. Wang, J.-Y. Zhu, Semantic image synthesis with spatially-adaptive normalization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2337–2346. doi:10.48550/ARXIV.1903.07291.
[9] J. Rafner, S. Langsford, A. Hjorth, M. Gajdacz, L. Philipsen, S. Risi, J. Simon, J. Sherson, Utopian or dystopian?: Using a ml-assisted image generation game to empower the general public to envision the future, in: Creativity and Cognition, Association for Computing Machinery, New York, NY, USA, 2021, p. 5. URL: https://doi.org/10.1145/3450741.3466815. doi:10.1145/3450741.3466815.
[10] Y. Wang, W. Zhou, J. Bao, W. Wang, L. Li, H. Li, Clip2gan: Towards bridging text with the latent space of gans, 2022. URL: https://arxiv.org/abs/2211.15045. doi:10.48550/ARXIV.2211.15045.
[11] J. Simon, Artbreeder, 2018. URL: https://www.artbreeder.com/about.
[12] S. Shahriar, Gan computers generate arts? a survey on visual arts, music, and literary text generation using generative adversarial network, 2021. URL: https://arxiv.org/abs/2108.03857. doi:10.48550/ARXIV.2108.03857.
[13] M. Muller, J. D. Weisz, W. Geyer, Mixed initiative generative ai interfaces: An analytic framework for generative ai applications, in: Proceedings of the Workshop The Future of Co-Creative Systems - A Workshop on Human-Computer Co-Creativity of the 11th International Conference on Computational Creativity (ICCC 2020), 2020.
[14] A. Spoto, N. Oleynik, Library of mixed-initiative creative interfaces, 2017. URL: http://mici.codingconduct.cc/.
[15] I. Grabe, M. G. Duque, S. Risi, J. Zhu, Towards a framework for human-ai interaction patterns in co-creative gan applications, in: Joint Proceedings of the ACM IUI Workshops 2022, March 2022, Helsinki, Finland, 2022. URL: https://ceur-ws.org/Vol-3124/paper9.pdf.
[16] J. Kim, M. L. Maher, S. Siddiqui, Collaborative ideation partner: Design ideation in human-ai co-creativity, in: Proceedings of the 5th International Conference on Computer-Human Interaction Research and Applications (CHIRA 2021), INSTICC, SciTePress, 2021, pp. 123–130. doi:10.5220/0010640800003060.
[17] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation with conditional adversarial networks, CVPR (2017).
[18] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, B. Catanzaro, High-resolution image synthesis and semantic manipulation with conditional gans, 2017. URL: https://arxiv.org/abs/1711.11585. doi:10.48550/ARXIV.1711.11585.
[19] Y. Li, X. Chen, F. Wu, Z.-J. Zha, Linestofacephoto: Face photo generation from lines with conditional self-attention generative adversarial network, 2019. URL: https://arxiv.org/abs/1910.08914. doi:10.48550/ARXIV.1910.08914.
[20] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, 2018. URL: https://arxiv.org/abs/1812.04948. doi:10.48550/ARXIV.1812.04948.
[21] X. Huang, S. Belongie, Arbitrary style transfer in real-time with adaptive instance normalization, in: ICCV, 2017.
[22] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, T. Aila, Analyzing and improving the image quality of stylegan, 2019. URL: https://arxiv.org/abs/1912.04958. doi:10.48550/ARXIV.1912.04958.
[23] N. Dey, A. Chen, S. Ghafurian, Group equivariant generative adversarial networks, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=rgFNuJHHXv.
[24] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, R. Ng, Fourier features let networks learn high frequency functions in low dimensional domains, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 7537–7547. URL: https://proceedings.neurips.cc/paper/2020/file/55053683268957697aa39fba6f231c68-Paper.pdf.
[25] Y. Alaluf, O. Patashnik, Z. Wu, A. Zamir, E. Shechtman, D. Lischinski, D. Cohen-Or, Third time's the charm? image and video editing with stylegan3, 2022. URL: https://arxiv.org/abs/2201.13433. doi:10.48550/ARXIV.2201.13433.
[26] E. Richardson, Y. Weiss, The surprising effectiveness of linear unsupervised image-to-image translation, 2020. URL: https://arxiv.org/abs/2007.12568. doi:10.48550/ARXIV.2007.12568.
[27] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, D. Cohen-Or, Encoding in style: a stylegan encoder for image-to-image translation, 2020. URL: https://arxiv.org/abs/2008.00951. doi:10.48550/ARXIV.2008.00951.
[28] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, 2015. URL: https://arxiv.org/abs/1505.04597. doi:10.48550/ARXIV.1505.04597.
[29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015. URL: https://arxiv.org/abs/1512.03385. doi:10.48550/ARXIV.1512.03385.
[30] X. Liu, G. Yin, J. Shao, X. Wang, H. Li, Learning to predict layout-to-image conditional convolutions for semantic image synthesis, 2019. URL: https://arxiv.org/abs/1910.06809. doi:10.48550/ARXIV.1910.06809.
[31] P. Zhu, R. Abdal, Y. Qin, P. Wonka, Sean: Image synthesis with semantic region-adaptive normalization, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[32] A. Liapis, G. Smith, N. Shaker, Mixed-initiative content creation, Springer International Publishing, Cham, 2016, pp. 195–214. URL: https://doi.org/10.1007/978-3-319-42716-4_11. doi:10.1007/978-3-319-42716-4_11.
[33] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[34] R. Gal, O. Patashnik, H. Maron, G. Chechik, D. Cohen-Or, Stylegan-nada: Clip-guided domain adaptation of image generators, 2021. arXiv:2108.00946.
[35] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, T. Aila, Training generative adversarial networks with limited data, in: Proc. NeurIPS, 2020.
[36] X. Mao, L. Cao, A. T. Gnanha, Z. Yang, Q. Li, R. Ji, Cycle encoding of a stylegan encoder for improved reconstruction and editability, 2022. URL: https://arxiv.org/abs/2207.09367. doi:10.48550/ARXIV.2207.09367.
[37] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-W. Chen, J. Wu, Unet 3+: A full-scale connected unet for medical image segmentation, 2020. URL: https://arxiv.org/abs/2004.08790. doi:10.48550/ARXIV.2004.08790.
[38] H. Lu, Y. She, J. Tie, S. Xu, Half-unet: A simplified u-net architecture for medical image segmentation, Frontiers in Neuroinformatics 16 (2022). URL: https://www.frontiersin.org/articles/10.3389/fninf.2022.911679. doi:10.3389/fninf.2022.911679.
[39] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014. URL: https://arxiv.org/abs/1409.1556. doi:10.48550/ARXIV.1409.1556.
[40] R. Meyes, M. Lu, C. W. de Puiseau, T. Meisen, Ablation studies in artificial neural networks, 2019. URL: https://arxiv.org/abs/1901.08644. doi:10.48550/ARXIV.1901.08644.
[41] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, A. A. Efros, Generative visual manipulation on the natural image manifold, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision – ECCV 2016, Springer International Publishing, Cham, 2016, pp. 597–613.
[42] J. Canny, A computational approach to edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8 (1986) 679–698. doi:10.1109/TPAMI.1986.4767851.
[43] I. Skorokhodov, G. Sotnikov, M. Elhoseiny, Aligning latent and image spaces to connect the unconnectable, arXiv preprint arXiv:2104.06954 (2021).
[44] R. Xu, X. Wang, K. Chen, B. Zhou, C. C. Loy, Positional encoding as spatial inductive bias in gans, in: arxiv, 2020.
[45] T. Broad, F. F. Leymarie, M. Grierson, Network bending: Expressive manipulation of deep generative models, 2020. URL: https://arxiv.org/abs/2005.12420. doi:10.48550/ARXIV.2005.12420.
[46] T. Broad, S. Berns, S. Colton, M. Grierson, Active divergence with generative deep learning – a survey and taxonomy, 2021. arXiv:2107.05599.
[47] A. Adams, P. Lunt, P. Cairns, A qualitative approach to hci research, in: P. Cairns, A. Cox (Eds.), Research Methods for Human-Computer Interaction, Cambridge University Press, Cambridge, UK, 2008, pp. 138–157. URL: http://oro.open.ac.uk/11911/.
[48] V. Braun, V. Clarke, Thematic analysis, APA handbook of research methods in psychology, Vol 2: Research designs: Quantitative, qualitative, neuropsychological, and biological (2012) 57–71. doi:10.1037/13620-004.
[49] M. T. Llano, M. d'Inverno, M. Yee-King, J. McCormack, A. Ilsar, A. Pease, S. Colton, Explainable computational creativity (2022). URL: https://arxiv.org/abs/2205.05682. doi:10.48550/ARXIV.2205.05682.
[50] V. U. Prabhu, D. A. Yap, A. Wang, J. Whaley, Covering up bias in celeba-like datasets with markov blankets: A post-hoc cure for attribute prior avoidance, 2019. URL: https://arxiv.org/abs/1907.12917. doi:10.48550/ARXIV.1907.12917.
[51] A. Lacoste, A. Luccioni, V. Schmidt, T. Dandres, Quantifying the carbon emissions of machine learning, 2019. URL: https://arxiv.org/abs/1910.09700. doi:10.48550/ARXIV.1910.09700.