StyleGAN-Canvas: Augmenting StyleGAN3 for Real-Time Human-AI Co-Creation

Shuoyang Zheng
Creative Computing Institute, University of the Arts London, 45-65 Peckham Rd, London, UK
Corresponding author: j.zheng0320211@arts.ac.uk (S. Zheng), https://alaskawinter.cc/, ORCID 0000-0002-5483-6028
Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia

Abstract
Motivated by the mixed initiative generative AI interfaces (MIGAI), we propose bridging the gap between StyleGAN3 and human-AI co-creative patterns by augmenting the latent variable model with the ability of image-conditional generation. We modify the existing generator architecture in StyleGAN3, enabling it to use high-level visual ideas to guide the human-AI co-creation. The resulting model, StyleGAN-Canvas, can solve various image-to-image translation tasks while maintaining the internal behaviour of StyleGAN3. We deploy our models to a real-time graphic interface and conduct qualitative human opinion studies. We use the MIGAI framework to frame our findings and present a preliminary evaluation of our models' usability in a generic co-creative context.

Keywords
generative adversarial networks, human-AI co-creation, creativity support tools

Figure 1: A prototype interface encapsulating a StyleGAN-Canvas model translating paper card layout into faces. The user adjusts layouts while the model provides synchronous generation based on visual similarity. A screen recording is available at: https://youtu.be/9AsfsT8uXGY

1. Introduction

Generative adversarial networks (GANs) [1] have recently developed rapidly and have become a powerful tool for creating high-quality digital artefacts. In the case of images, modern approaches to improving model quality have successfully brought the generated outcomes from coarse low-resolution interpretations to realistic portraits with high diversity [2] and visual fidelity [3]. Notably, introducing continuous convolution in StyleGAN3 (alias-free GAN) [4] has enabled the generative network to perform equally well regardless of pixel coordinates, paving the way for more flexible human-AI interaction.

Closely following the advances in deep generative neural networks, StyleGAN models have been widely used as creativity support tools, creating unconventional visual aesthetics [5, 6, 7] and novel human-AI co-creative experiences [8, 9, 10, 11]. This motivates research on human-AI co-creative applications, offering insight into interaction possibilities between human creators and AI enabled by GANs [12].

Muller et al. [13] adapt notations introduced by Spoto and Oleynik [14], presenting mixed initiative generative AI interfaces (MIGAI), an analytical framework with 11 vocabularies of actions to describe a human-AI interaction process. These actions are analysed into sequences to form generic human-AI co-creative patterns. However, Grabe et al. [15] identify a gap between the MIGAI framework and latent variable models such as GANs. This is partially due to latent variable models' deficiency in interpreting visual design concepts such as sketches and semantic labels [16], which leads to the omission of the action ideate [15], the action that describes using high-level concepts to guide or shape the generation [5]. Therefore, Grabe et al. [15] suggested tailoring the MIGAI framework to fit co-creative GAN applications.
Motivated by the gap between GANs and human-AI co-creative patterns, we suggest an alternative approach to bridge the latent variable model, StyleGAN3, to the co-creative framework by modifying the model's technical functioning. Specifically, by augmenting StyleGAN3 with image-conditional generation ability, we enable it to transform visual ideas into generation. This enables a tightly-coupled human-AI interaction process that emphasises using high-level visual concepts to guide the artefact and fulfil the action ideate [13], aligning with the co-creative patterns mentioned in the MIGAI framework. We limit our study to StyleGAN3 because its introduction of continuous convolution facilitates more flexible human inputs, which is a crucial feature required by our approach.

Therefore, the primary aim of this research is to augment StyleGAN3 with image-conditional generation ability for a co-creative context. To achieve this, we adapt the existing model architecture in StyleGAN3, which takes a latent vector and a class label as the model's input [4], and propose an encoder network to extract features from the conditional image. We also adapt the architecture previously applied to various image-to-image translation models [17, 18, 19] to connect the proposed encoder and StyleGAN3's generator. The modified model, StyleGAN-Canvas, takes a latent vector and an accompanying image as inputs to guide the generation. We show results from our models trained for various image-to-image translation tasks while maintaining the internal behaviours of StyleGAN3, providing more flexible and intuitive control to ideate the co-creation.

To evaluate our model in a generic co-creative context, we build a graphic interface to facilitate real-time interaction between users and the model. We conduct qualitative human opinion studies, identify potential co-creative patterns using the MIGAI analytical framework, and present an exploratory human subject study on our model.

By aligning StyleGAN-Canvas with the action set in MIGAI, we hope to bring its capability into the discussion of co-creative design processes, and to provide a preliminary insight into the unexplored interaction possibilities enabled by StyleGAN.

The rest of the paper is structured as follows. We summarise related works on StyleGAN, image-conditional generation, and the co-creative pattern in Section 2. Then, we present our modification to the model's architecture in Section 3. We conduct experiments on our models and showcase the results and applications in Section 4. Next, we evaluate the model in a co-creative context in Section 5. Section 6 highlights limitations and future studies.

2. Related Works

2.1. Alias-Free GAN

Our model architecture is extended from StyleGAN3 (Alias-Free GAN). This section summarises the background of StyleGAN and reviews the continuous convolution approach introduced by StyleGAN3 that enables the translation and rotation equivariant feature.

StyleGAN [20] is a style-based generator with a regularised latent space offering high-level style control over image generation. The StyleGAN generator comprises a mapping network 𝑀 that transforms the initial latent code 𝑧 to an intermediate latent code 𝑤 ∼ 𝑊, and a synthesis network 𝐺 with a sequence of 𝑁 synthesis blocks, each comprising convolutions controlled by the intermediate latent code 𝑤, non-linearities, and upsampling, which eventually produces the output image 𝑧_𝑁 = 𝐺(𝑧_0; 𝑤). Its high-level style control is achieved by adaptive instance normalisation (AdaIN) [21], an approach to amplifying specific channels of feature maps on a per-sample basis. In practice, a learned affine transform is applied to the intermediate latent space 𝑊 to obtain style codes 𝑦, which are then used as scalar components to modulate the corresponding feature maps before each convolution layer. This architecture was then revised in StyleGAN2 [22] by replacing instance normalisation with feature map demodulation, and was inherited by StyleGAN3.

In a later work, the continuous convolution approach [23] implemented in StyleGAN3 (alias-free GAN) [4] dramatically changed the internal representations in the synthesis network 𝐺. Signals in 𝐺 are operated in the continuous domain rather than the discrete domain to make the network equivariant, which means any operation 𝑓 in the network is equivariant to a spatial transformation 𝑡 (𝑡 ∘ 𝑓 = 𝑓 ∘ 𝑡). This eliminates positional references; therefore, the model can be trained on unaligned data, and the "texture sticking" behaviour of standard GAN models is removed.

Moreover, the input of the synthesis network of StyleGAN3 uses a spatial map 𝑧_0 defined by Fourier features [24] to precisely model translation and rotation. The spatial map 𝑧_0 is sampled from uniformly distributed frequencies, fixed after initialisation, and spatially infinite [4]. Its translation and rotation parameters are calculated by a learned affine layer based on the intermediate latent space 𝑊. The spatial map 𝑧_0 acts as a coordinate map that later layers can grab onto, and therefore defines the global transformations in the synthesis network [25].
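For illustration, the AdaIN-style modulation described above can be sketched in a few lines. This is a generic sketch of adaptive instance normalisation, not the exact StyleGAN2/3 mechanism, which replaces the explicit normalisation with weight (de)modulation; the dimensions are placeholders.

```python
# Generic AdaIN sketch: a learned affine transform of w produces per-channel scale and
# bias (the "style codes"), which modulate normalised feature maps on a per-sample basis.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, w_dim, channels):
        super().__init__()
        self.affine = nn.Linear(w_dim, channels * 2)   # learned affine transform of w

    def forward(self, feat, w):
        # feat: [B, C, H, W], w: [B, w_dim]
        scale, bias = self.affine(w).chunk(2, dim=1)
        mu = feat.mean(dim=(2, 3), keepdim=True)
        sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-8
        normed = (feat - mu) / sigma               # instance normalisation
        return scale[:, :, None, None] * normed + bias[:, :, None, None]

# Usage: AdaIN(512, 256)(torch.randn(2, 256, 32, 32), torch.randn(2, 512))
```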
2.2. Image-Conditional Generation

Methods for image-conditional generation aim to generate a corresponding image given an image from the source domain, depicting objects or scenes in different styles or conditions. Current solutions are usually categorised under two approaches. The first approach uses a linear encoder-decoder architecture [26], in which the input image is encoded to a vector to match the target domain, and then decoded into an output image. This method was later extended to a generic framework for various tasks, such as sketch and layout to image, facial frontalisation, inpainting, and super-resolution [27].

The second direction uses conditional GANs [17] with U-net architectures [28]. It uses a similar encoder-decoder setting, but instead of encoding the input image into a vector, it uses the residual feature maps from the encoder as spatial information and propagates them to the decoder through skip connections [29]. The propagated features are concatenated with the outputs of the corresponding decoder layers. This method aims to provide the generator with a mechanism to circumvent the bottleneck layer and allow spatial information to be shuttled from the encoder to the decoder [28]. Therefore, it introduces a strong locality bias [26] into the generation, which means each pixel in the output has a positional reference to the input, and the general image structure is preserved during translation. This method has been extended to various tasks such as line-conditioned generation [19], layout-to-image synthesis [30], and semantic region-adaptive synthesis [31].
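To make the skip-connection pattern concrete, a minimal U-Net-style encoder/decoder pair might look as follows. This is a generic illustration with placeholder channel counts, not any specific model from the cited works.

```python
# Minimal U-Net-style skip connection: an encoder feature map is kept aside and later
# concatenated with the upsampled decoder features, so spatial structure bypasses the
# bottleneck instead of being squeezed through it.
import torch
import torch.nn as nn

enc_conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
down = nn.MaxPool2d(2)
up = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)
dec_conv = nn.Conv2d(64 + 64, 64, kernel_size=3, padding=1)   # concatenated channels

x = torch.randn(1, 3, 128, 128)
skip = enc_conv(x)                 # encoder features kept for the skip connection
bottleneck = down(skip)            # information squeezed through the bottleneck
decoded = up(bottleneck)           # back to the skip's resolution
out = dec_conv(torch.cat([decoded, skip], dim=1))   # fuse local structure with decoded features
print(out.shape)                   # torch.Size([1, 64, 128, 128])
```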
2.3. Human-AI Co-Creative Pattern

Mixed Initiative Generative AI Interfaces (MIGAI) [13] describe modes of interaction in which both human and generative AI are engaged in a creative process. The framework aims to frame subprocesses in a creative flow by an action set with 11 vocabularies. A generic co-creative pattern is identified using the MIGAI framework, in which the AI system first learns a target domain, then the human ideates a design concept to guide the artefact; subsequently, the human and the AI system take turns to evaluate and adjust, and eventually produce the outcome [15]. Our study focuses on ideating, a process of conceptualising design solutions to high-level abstractions. A later exploratory study emphasises the potential of enhancing the creative process through human-AI collaborative ideation [16]. Methods for this have been advanced in diverse fields of application: co-creative game content [32], collaborative text editing [33], and text-guided image generation [34].

3. Extending StyleGAN to Image-conditional Generation

The objective of our approach is to allow StyleGAN3 to use images as conditions to guide the generation. This section discusses our modification to the current StyleGAN3 architecture to address this objective.

As mentioned in Section 2.1, current StyleGAN3 models learn a mapping from a random noise vector 𝑧 to an output image 𝑍_𝑁 = 𝐺(𝑧). The modification aims to extend the input from a vector 𝑧 to a conditional image 𝑥 combined with 𝑧, and to generate 𝑍_𝑁 that is close to the corresponding ground truth 𝑦. To do this, we first need a feature extraction encoder 𝐸 that extracts features from 𝑥, and we then adapt the generator 𝐺 to produce outputs based on the extracted features and the input vector 𝑧, i.e. 𝑍_𝑁 = 𝐺(𝐸(𝑥), 𝑧). Besides, the training objective should push the generation closer to the corresponding image 𝑦.

Figure 2: We build a residual feature encoder and adapt the StyleGAN3 generator. The main datapath consists of (i) 10 downsampling residual blocks (Section 3.1.1), each consisting of a mapped shortcut with a 1 × 1 convolutional layer and batch normalisation, and a downsampling block with two convolutional layers, an activation layer (Leaky ReLU), batch normalisations and a clamping layer, (ii) a conditional mapping network (Section 3.1.2), (iii) the StyleGAN3 mapping network, and (iv) adapted StyleGAN3 synthesis blocks (Section 3.2).

3.1. Feature Extraction Encoder

3.1.1. Adapted Residual Network

The feature extraction encoder 𝐸 employs an adapted ResNet [29] architecture as the encoder backbone, which has previously been used for feature map extraction in image-to-image translation works [27]. As shown in Figure 2 (left), the encoder network downsamples feature maps to (𝑥/2⁶, 𝑦/2⁶), where 𝑥 and 𝑦 denote the width and height of the input image. In addition, as StyleGAN uses mixed precision to speed up training and inference [24], we utilise similar techniques and reduce the precision to FP16 for the first five residual blocks in the encoder. Consequently, this requires pre-normalisation and an extra clamping layer that clamps the output of the convolutional layers to ±2⁹ [35].
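One downsampling residual block from Figure 2 can be sketched as follows. The exact kernel sizes, strides and layer ordering are assumptions for illustration, and the FP16 casting and pre-normalisation are omitted; only the mapped shortcut, the two-convolution main branch and the clamping layer are shown.

```python
# Sketch of one encoder block: a 1x1 mapped shortcut with batch normalisation, a main
# branch with two convolutions, Leaky ReLU and batch normalisation, and a clamping
# layer that bounds activations so FP16 blocks cannot overflow.
import torch
import torch.nn as nn

class DownResBlock(nn.Module):
    def __init__(self, c_in, c_out, clamp=512.0):   # 2**9, as in Section 3.1.1
        super().__init__()
        self.clamp = clamp
        # mapped shortcut: strided 1x1 convolution + batch norm
        self.skip = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=1, stride=2, bias=False),
            nn.BatchNorm2d(c_out),
        )
        # downsampling main branch: two convolutions with batch norm and Leaky ReLU
        self.main = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.2),
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        y = self.main(x) + self.skip(x)
        return y.clamp(-self.clamp, self.clamp)   # clamping layer

# Usage: DownResBlock(64, 128)(torch.randn(1, 64, 256, 256)) -> [1, 128, 128, 128]
```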
3.1.2. Conditional Mapping Network

Previous works on StyleGAN encoders [27] have highlighted that replacing selected layers of the extended latent space 𝑊+ with computed latent codes can facilitate multi-modal synthesis. The extended latent space 𝑊+ can be roughly divided into coarse, medium, and fine layers, corresponding to different levels of detail and editability [36]. This motivates us to add a conditional mapping network as the epilogue layer of our residual feature encoder. It uses the same mapping network architecture as StyleGAN, but takes a flattened 512-dimensional vector sampled from the encoder's bottleneck and produces a replacement latent space 𝑊+′. 𝑊+′ is then concatenated with a portion of the latent space 𝑊+ produced by the original mapping network. This aims to facilitate multi-modal generation.

3.2. Adapting the Generator

As mentioned in Section 2.1, the StyleGAN3 generator consists of a mapping network and a synthesis network; our modification aims to connect its synthesis network to the information extracted by the feature extraction encoder.

We start by modifying the input layer of the synthesis network. The original network utilises a fixed-size spatial map 𝑧_0 defined by Fourier features [24] as its input to model translation and rotation parameters. In our model, this input layer is left out entirely, and we use the feature maps from the last layer of the encoder directly as the synthesis network's input. This lets the translation and rotation parameters be inherited from the spatial feature maps.

Next, skip connections from the feature encoder aim to provide precise structural information about the input images. A U-net [28] architecture is well-suited for propagating high-level details from the encoder to the decoder [26]. However, as mentioned in Section 2.1, experiments on StyleGAN3 have demonstrated that high-level feature maps in the synthesis network encode information in continuous domains instead of discrete domains [4]; relying on skip connections to propagate discrete features from the encoder may therefore deviate from StyleGAN3's internal generation behaviour. To tackle this, we first move the concatenation node from the end of each synthesis block to the point before the filtered non-linearity layer, as shown in Figure 2 (right). We also remove the padding layers in each synthesis block to ensure the dimensions match the skip connections. Additionally, research on U-Net and its variants has demonstrated that a simplified structure with fewer feature fusions can achieve reasonable results [37, 38]. Therefore, we reduce the number of feature fusions and limit connections to only the first 𝑁 layers. In practice, a skip connection joins layer 𝑛 − 𝑖 of the encoder to layer 𝑖 of the synthesis network, where 𝑛 is the total number of layers in the encoder and 𝑖 is limited to 𝑖 ∈ (0, 5]. The experiment in Section 4.1 shows that 𝑁 = 5 is the best configuration, leading to more stable training and efficient generation. This reduction maintains the unification of the network's internal behaviour while taking advantage of the efficiency of U-shaped structural models.
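The latent replacement described in Section 3.1.2 can be sketched as follows. The number of style layers, the depth of the stand-in mapping networks, and the split point between replaced and retained codes are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of mixing the replacement codes W+' (computed from the encoder bottleneck)
# with the codes W+ produced by the original mapping network M.
import torch
import torch.nn as nn

NUM_WS, W_DIM, NUM_REPLACED = 16, 512, 6     # hypothetical layer counts

cond_mapping = nn.Sequential(                # conditional mapping network (Section 3.1.2)
    nn.Linear(W_DIM, W_DIM), nn.LeakyReLU(0.2),
    nn.Linear(W_DIM, NUM_REPLACED * W_DIM),
)
mapping = nn.Sequential(                     # stand-in for StyleGAN3's mapping network M
    nn.Linear(W_DIM, W_DIM), nn.LeakyReLU(0.2),
    nn.Linear(W_DIM, NUM_WS * W_DIM),
)

bottleneck = torch.randn(4, W_DIM)           # flattened 512-vector from the encoder bottleneck
z = torch.randn(4, W_DIM)                    # random latent for multi-modal variation

w_repl = cond_mapping(bottleneck).view(4, NUM_REPLACED, W_DIM)   # W+'
w_plus = mapping(z).view(4, NUM_WS, W_DIM)                       # W+
w_mixed = torch.cat([w_repl, w_plus[:, NUM_REPLACED:]], dim=1)   # condition-driven layers first,
print(w_mixed.shape)                                             # latent-driven layers after
```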
3.3. Loss Functions

The standard StyleGAN loss consists of the standard GAN loss (i.e., the logistic loss) and regularisation terms (i.e., 𝑅₁). We combine the training objectives of StyleGAN with the pixel-wise distance and perceptual loss that have been used in conditional GANs.

The model is trained using different combinations of objectives in two training phases. The first phase runs from zero to the first 300k images; this is also the phase in which the training images are blurred with a Gaussian filter to prevent early collapses [4]. During this phase, the objective combines the pixel-wise 𝐿₂ distance, the logistic loss 𝐿_GAN, and the regularisation term 𝐿_reg, defined as follows:

$L_2(G, E) = \mathbb{E}_{x,y,z}\left[\lVert y - G(E(x), z)\rVert_2\right]$  (1)

$L_{GAN}(D, G, E) = \mathbb{E}_{y}\left[\log D(y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(G(E(x), z))\right)\right]$  (2)

$L_{reg}(E, M) = \mathbb{E}_{x,z}\left[\lVert E(x) - \bar{w}\rVert_2 + \lVert M(z) - \bar{w}\rVert_2\right]$  (3)

where 𝐺 and 𝐷 denote the generator and the discriminator, 𝐸 denotes the feature encoder, and 𝑀 denotes the mapping network in the generator. The phase-one training loss 𝐿_phase1 is then defined as:

$L_{phase1}(D, G, E) = \lambda_1 L_2(G, E) + L_{GAN}(D, G, E) + L_{reg}(E, M)$  (4)

The second phase starts after the training reaches 300k images. We add a perceptual loss 𝐿_VGG utilising a pre-trained VGG19 [39] network, which has been used in the training of previous conditional GANs and leads to finer details in the resulting images [18]:

$L_{VGG}(G, E) = \mathbb{E}_{x,y,z}\left[\lVert F(y) - F(G(E(x), z))\rVert_2\right]$  (5)

The phase-two loss is then calculated as:

$L_{phase2}(D, G, E) = \lambda_1 L_2(G, E) + \lambda_2 L_{VGG}(G, E) + L_{GAN}(D, G, E) + L_{reg}(E, M)$  (6)

where 𝐹 denotes the pre-trained VGG19 feature extractor, and 𝜆₁ and 𝜆₂ are constants used to weigh the loss terms, which vary across different training data and configurations.
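A sketch of the reconstruction terms in Equations (1), (5), (4) and (6) is shown below. The adversarial and regularisation terms are assumed to come from the unchanged StyleGAN3 training loop, the lambda values are placeholders rather than the paper's settings, and ImageNet normalisation of the VGG inputs is omitted for brevity.

```python
# Two-phase reconstruction objective: pixel-wise L2 in both phases, plus a VGG19
# perceptual term after the first 300k training images.
import torch
from torchvision.models import vgg19

vgg_features = vgg19(weights="IMAGENET1K_V1").features.eval()   # F in Equation (5)
for p in vgg_features.parameters():
    p.requires_grad_(False)

def l2_loss(fake, target):                      # Equation (1)
    return (fake - target).flatten(1).norm(dim=1).mean()

def vgg_loss(fake, target):                     # Equation (5)
    return (vgg_features(fake) - vgg_features(target)).flatten(1).norm(dim=1).mean()

def reconstruction_loss(fake, target, kimg, lambda1=1.0, lambda2=1.0):
    # fake, target: [B, 3, H, W] RGB tensors; kimg: thousands of images seen so far
    loss = lambda1 * l2_loss(fake, target)      # present in both phases, Equation (4)
    if kimg >= 300:                             # perceptual term only in phase two, Equation (6)
        loss = loss + lambda2 * vgg_loss(fake, target)
    return loss
```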
4. Experiments and Applications

In Section 4.1, we analyse the effectiveness of the U-net connections proposed in Section 3.2 with a set of ablation studies. Next, in Section 4.2, we demonstrate the training process of our model for several image-to-image translation tasks with different datasets and showcase their results. We also experiment with scaling the model to larger canvases in Section 4.3. Finally, we build a graphic interface that implements our models with a set of transformation filters in Section 4.4.

4.1. Analysis of the Skip Connections

Methodology. Section 3.2 proposed reducing the number of skip connections to only the first 𝑁 layers. To verify the feasibility of this design and to find the most effective configuration of 𝑁, we first trained six models on the Flickr-Faces-HQ (FFHQ) [20] dataset at 512 × 512 resolution for an ablation test [40] on the skip connections, a method for investigating knowledge representations in artificial neural networks by disabling specific nodes in a network. We use inversion tasks [41] as the training goal, in which the models are trained to reconstruct a given image without translation, aiming to test the efficiency of the encoder. Skip connections are installed between the encoder's last 𝑁 layers and the synthesis network's first 𝑁 layers, where 𝑁 progressively reduces from 5 to 0 across the six models. The resulting outputs are compared across the six models to decide the final configuration.

Results. Figure 3 (left) shows the results of the ablation test for the 𝑁 = 5 to 𝑁 = 2 models trained on 1680k samples. The remaining two models (𝑁 = 1 and 𝑁 = 0) were unable to converge to sensible results after 800k samples and were therefore aborted. The results show notable improvements when increasing the number of skip connections, especially in preserving details such as background, hands, unique make-up and even details in the hair. Therefore, 𝑁 = 5 was chosen as the final setting.

Figure 3: Ablating the skip connections.

4.2. Conditional Image Synthesis and Editing

Methodology. Conditional image synthesis uses image-conditional models to generate an image 𝑍 corresponding to the ground truth image 𝑦, given an input condition image 𝑥. We tested our architecture on three conditional image synthesis tasks: (i) generating face images from blurry images, (ii) generating face images from Canny edges, and (iii) generating realistic landscape images from blurry images.

First, the ground truth 𝑦 was pre-processed into the condition 𝑥. In the deblurring models, the pre-processing pipeline is a resizing layer that scales the resolution of 𝑦 to 256 × 256, followed by a Gaussian filter with sigma 𝜎 = 28 that turns the resized 𝑦 into a blurred image used as condition 𝑥. In the edge-to-face model, 𝑦 is first resized to 256 × 256, and a Canny edge detector [42] is then applied to the resized 𝑦 to produce edges as condition 𝑥. 𝑦 and 𝑥 are provided to the training as paired data. After the models were trained, the same pipelines were applied to pre-process inputs 𝑥′ for inference. The training process is illustrated in Figure 4 (top).

Figure 4: During training, target images are processed into conditions, the model generates fake images, and the fake images and the target images are used to calculate the loss.

The dataset used for the face generation models was Flickr-Faces-HQ (FFHQ) [20], and the dataset used for landscape photo generation was Landscapes High-Quality (LHQ) [43], both at 512 × 512 resolution. We used StyleGAN3-R, the translational and rotational equivariant configuration of StyleGAN3, as the generator backbone for the FFHQ dataset, and StyleGAN3-T, the translational equivariant configuration, as the generator backbone for the LHQ dataset. The training configuration was identical to StyleGAN3.

Results. The deblurring model on the FFHQ dataset was trained with 3700k samples, and the deblurring model on the LHQ dataset was trained with 6700k samples. We compare ground truth samples 𝑦, conditions 𝑥, and generation outcomes 𝑍 in Figure 5 and Figure 6. The edge-to-face model on the FFHQ dataset was trained with 3700k samples. In Figure 7, we compare ground truth samples 𝑦, conditions 𝑥, and generation outcomes 𝑍 with three randomly selected latent vectors for multi-modal generations.

Figure 5: Results of our model for deblurring on FFHQ 512 × 512 (left: ground truth samples, middle: processed conditions, right: generations).

Figure 6: Results of our model for deblurring on LHQ 512 × 512 (left: ground truth samples, middle: processed conditions, right: generations).

Figure 7: Results of our edge-to-face model on FFHQ 512 × 512 (ground truth samples, processed conditions, and three generations with different latent vectors).

In addition, the Canny edges model provides an alternative approach to local editing. Modifying edges in the condition image allows the model to alter semantic elements in the generation. As illustrated in Figure 4 (bottom), the modified conditions can be obtained by combining and adapting existing conditions from other images. For example, in Figure 8, we superimposed edges processed from other images onto the original edges to add a hair fringe, glasses and a smile; we also painted on the original edges to modify eyes and add sunglasses.

Figure 8: Results of our edge-to-face model on FFHQ, and local editing.
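The condition pipelines of Section 4.2 can be sketched as follows, assuming OpenCV. The Canny thresholds are placeholders, since the paper only specifies the blur sigma (𝜎 = 28).

```python
# Sketch of the pre-processing pipelines: resize the target to 256x256, then either
# blur it (deblurring models) or extract Canny edges (edge-to-face model) as condition x.
import cv2

def make_blur_condition(y_bgr):
    y = cv2.resize(y_bgr, (256, 256), interpolation=cv2.INTER_AREA)
    return cv2.GaussianBlur(y, ksize=(0, 0), sigmaX=28)   # kernel size derived from sigma

def make_edge_condition(y_bgr, low=100, high=200):        # thresholds are assumptions
    y = cv2.resize(y_bgr, (256, 256), interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(y, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, low, high)                     # edge map used as condition x

# Usage: pair each training image with its condition, e.g.
# x = make_blur_condition(cv2.imread("sample.png"))
```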
4.3. Large Canvas

As mentioned in Section 3.2, the padding layers in the synthesis network are removed, and the generator therefore carries no unintentional absolute positional references [44]. We hypothesised that our modified architecture induces an extendable generation canvas that can be enlarged after training on a fixed resolution, without additional training required. Therefore, we enlarge the input resolution to test the model's ability on a larger canvas.

The model for landscape photo generation was trained on the dataset at 512 × 512 resolution, taking inputs at 256 × 256 resolution. To enlarge the generation canvas, we expanded the width of the inputs, enlarging their resolution to 1024 × 256 and 768 × 256. The expanded inputs were then taken directly into the generator and convolved by each convolutional layer. Therefore, the expected output resolutions are 2048 × 512 and 1536 × 512. No additional training is required during the experiment.

Figure 9 shows the resulting generations. While the outputs are expanded up to four times larger than the original canvas, the generation quality remains unchanged. This ensures that the input canvas can be flexibly expanded when the models are implemented in user interfaces.

Figure 9: Examples of results with 512 × 2048 / 512 × 1536 images generated from a model originally trained on a 512 × 512 dataset. More results can be accessed at https://github.com/jasper-zheng/StyleGAN-Canvas
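A minimal illustration of the large-canvas idea is given below: a fully convolutional generator trained at one resolution will simply accept wider inputs at inference time. The toy network stands in for the adapted synthesis network and is not StyleGAN3; it also keeps padding to keep the shapes simple, whereas the actual model removes padding (Section 3.2).

```python
# Toy demonstration: widening the input canvas widens the output without retraining.
import torch
import torch.nn as nn

toy_generator = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.LeakyReLU(0.2),
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 3, kernel_size=3, padding=1),
)

standard = toy_generator(torch.randn(1, 64, 256, 256))    # trained canvas -> 512 x 512 output
wide = toy_generator(torch.randn(1, 64, 256, 1024))       # widened condition -> 512 x 2048 output
print(standard.shape, wide.shape)
```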
4.4. User Interface

Implementation. The deblurring models were deployed to a graphic user interface. The models run on a cloud server, and we used Flask and SocketIO for bi-directional communication between the web client and the server. The generation runs in real time at roughly ten frames per second on an NVIDIA RTX A4000 GPU. The code for our implementation is available at https://github.com/jasper-zheng/realtime-flask-model.

Combining network bending. In addition to the baseline model, the generation system is implemented with network bending [45], an approach that alters a trained model's computational graph by inserting transformation filters between different convolutional layers, allowing the model to generate novel samples that diverge from the training data [46]. We used the clustering algorithm presented by Broad et al. [45] to group spatially similar feature maps in selected layers and allow the transformation filters to be inserted into specific groups. We trained softmax feature extraction CNNs for each layer and clustered each flattened descriptor with the k-means algorithm. However, different from the implementation in Broad et al.'s work, we increase the length of the flattened vector from 10 to 32 (𝑣⃗ ∈ ℝ³²) to accommodate the larger number of channels in StyleGAN3-R.

Interface design. Figure 10 shows a screenshot of the deployed system. The interface accepts inputs from a webcam or locally selected image files (top left), and allows users to pause and resume generation and to switch the input between the webcam and local files. Users can insert transformation filters into specific groups in specific layers. The list on the bottom left presents the layers available for network bending operations. Once a specific layer is selected, the system shows the current clusters of feature maps in that layer. Users can regenerate clusters according to the feature maps from the current frame, and then activate the 'route' button to apply transformation filters to a cluster. The system provides basic transformation filters, including erosion and dilation, multiply, translation, rotation, and scale.

Figure 10: A screenshot of our interface.
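A simplified sketch of the network bending step is shown below: channels of one layer's feature maps are grouped by k-means on flattened per-channel descriptors, and a transformation filter is applied to one selected group. The descriptor here is a plain spatial downsample and the filter is a simple multiply; the paper instead uses learned softmax feature extraction CNNs and a wider set of filters.

```python
# Group channels of a captured feature map and "bend" one group before it is passed on.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def bend_layer(feats, n_clusters=8, target_cluster=0, scale=1.5):
    # feats: [1, C, H, W] activations captured from a chosen synthesis layer
    c = feats.shape[1]
    desc = F.adaptive_avg_pool2d(feats[0], (8, 4)).reshape(c, -1)        # 32-dim descriptor per channel
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(desc.detach().cpu().numpy())
    bent = feats.clone()
    idx = [i for i in range(c) if labels[i] == target_cluster]
    bent[:, idx] = bent[:, idx] * scale                                  # transformation filter (multiply)
    return bent

# Usage: register a forward hook on a synthesis block and return bend_layer(output)
# so the transformed activations feed the next layer.
```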
5. Human Subjects Study and Evaluation

This section presents a human subjects study to evaluate our models in a human-AI co-creation context. The study uses a thematic analysis approach to identify potential co-creative patterns, defined by the MIGAI framework [13], that underlie the interaction.

5.1. Methodology

In the user study, we asked six participants to use the generation interface described in Section 4.4. For the webcam inputs, coloured paper cards and scissors were available to the participants. Participants were asked to arrange and lay out paper cards in front of a webcam pointing at a paper canvas, aiming to control the generation until it reached something they liked. The transformation filters were also available to fine-tune the creation further. The models used in Section 4.2 were available to the participants, including the deblurring model and the edge-to-face model trained on Flickr-Faces-HQ (FFHQ) [20], and the deblurring model trained on Landscapes High-Quality (LHQ) [43].

The experiment followed the qualitative research procedure described by Adams et al. [47]. The six participants were divided into two groups of three. We conducted the study with the first group and ran an analysis to emphasise issues raised by the participants based on their frequency and fundamentality, leading to tentative findings. The interview questions were then revised for the second group to probe and grow these findings.

The study was divided into two parts, each lasting 15 minutes. The first part asked participants to familiarise themselves with the interface and explore the system components. The second part asked the participants to create a work they liked. Observation was conducted while participants interacted with the framework, and a 5-minute semi-structured interview with open-ended questions followed each part of the experiment.

Participants were questioned on their attitudes regarding this form of interaction, the creation process, their generation outcomes, and the differences in their perceptions of the different models. We loosely followed the interview template during the interviews while ensuring these four topics were covered.

Figure 11 shows some examples of works created by the participants.

Figure 11: Examples of works created by participants.

5.2. Analysis

In this section, we use the thematic analysis [48] method to aggregate comments collected from the interviews, aiming to identify critical factors influencing the interaction. We used the MIGAI analytic framework [13] to frame the human-AI co-creative patterns.

5.2.1. Learn

The action learn in the MIGAI framework describes the AI model using training data to construct its knowledge, usually involving the choice of datasets, model architectures, and training configurations. Most participants used the model trained on the landscape dataset in the end. When questioned on the reason for this decision, they suggested that the surprising, unexpected results produced by the landscape model are more likely to be accepted by their visual aesthetic. Although both models can create unrealistic imagery, distortion and oddities on human faces may easily lead to uncanny feelings and, more importantly, negative ethical issues such as bias and offence. Therefore, the choice of training data is critical for co-creation.

5.2.2. Ideate

The action ideate describes the process of using high-level concepts to guide the generation. We collected comments aggregated around this process. Some positive comments indicate that creating shapes using paper cards is an intuitive generation process, whereas some negative comments suggest that the lack of control over details (e.g., textures in the landscapes, details of the facial elements) leads to confusion and complaint. Most participants pointed out that they would use this form of generation as inspiration, giving them a high-level sketch developed from their ideated concept. Alternatively, they may treat it as a playful experience or simply pursue an abstract visual effect. However, if the generation aims to create a serious piece of work, they would eventually switch to more stable methods with lower-level control to refine the work, such as Photoshop or CAD software. This is because, in that case, they want to be manually in charge of finer details such as lighting and textures.

5.2.3. Adapt

The action adapt describes adjusting existing artefacts. During the study, we observed that participants primarily focused on exploring the outcomes of the computational agent by arbitrarily arranging the paper cards. When they reached a layout that triggered interesting results, they iteratively adjusted the arrangement until achieving a satisfying result. Some participants tried to tackle the lack of control by utilising other pre-processing or post-processing steps. Figure 12 illustrates an example of an action that attempts to fix an oddity by slightly warping the intermediate image in other image editing software. Figure 13 illustrates a sequence of edits performed on intermediate condition images to intentionally create unrealistic and novel outcomes. The edges were edited and assembled before the generation using Photoshop, and then uploaded to the interface as inputs. The user evaluated the current outcome and iteratively fine-tuned the edges according to their preferences.

Figure 12: Slightly warping the intermediate image fixes the peculiar effect in the waves.

Figure 13: Editing and assembling the condition images to intentionally create unrealistic and novel outcomes.

5.2.4. IterativeLoop

The MIGAI framework uses IterativeLoop to describe humans reflectively learning from the co-creative process. We observed that some participants memorised useful patterns of shapes and colours that may trigger satisfactory results, and later used the paper cards to reproduce these patterns. Some participants described this process as a way to learn the preferences of the AI model and to steer the co-creation through reflection.

5.3. Discussion

The motivation of this paper was to bring style-based GAN models into a co-creative context. By extending the StyleGAN model to image-conditional generation, these models allow human agents to ideate a high-level concept of the artefact, such as a blurred arrangement or edges. The flexibility of input and the semantically meaningful control are critical features for effective co-creative experiences. Meanwhile, to create the sense of a "cooperative partner", the computational agent needs to maintain a certain level of unpredictability, and the AI's contribution needs to partly influence the human agents' decisions [49]. Therefore, the co-creative process balances machine autonomy and human creativity. Our current results are good indications that human agents act as a director who steers the model by organising and reusing the knowledge learned by the computational agent. While further studies on the model architecture may improve the generation quality, the current results in this research show that bridging StyleGAN3 to the co-creative context is possible.
Furthermore, the framework could be employed for novel and unique co-creative experiences. For example, Figure 14 shows an interactive installation with webcam inputs that produces 704 × 1280 images in real time.

Figure 14: An interactive installation implementing our framework; the model runs in real time with webcam input at 10 fps.

6. Conclusion

In this work, we augmented StyleGAN3 with the ability of image-conditional generation, enabling it to turn high-level visual ideas into generation. This augmentation aligned latent variable models with the co-creative patterns mentioned in the MIGAI framework and brought StyleGAN3 into a co-creative context. We adapted the existing model architecture and proposed an encoder network to extract information from the conditional image. We demonstrated the modified architecture, StyleGAN-Canvas, on various image-to-image translation tasks with different datasets. In addition, we deployed our models to a graphic interface to facilitate real-time interaction between users and the model. To evaluate our models in a co-creative context, we conducted qualitative human opinion studies and identified potential co-creative patterns using the MIGAI analytic framework.

6.1. Limitation and Future Works

An overall criticism was the need for more interpretable control in the system. While the input image acts as a blueprint for the generation, the user also needs precise control over the details when using the model as a creative tool. This leads us to rethink the design of the intermediate representation. Our framework currently only implements deblurring models for the experiment; however, it might be more useful to use intermediate representations that encode more detailed information (e.g., boundary maps or edges), like the interactive demos in pix2pixHD.

Besides, the proposed architecture can also be improved in technical aspects. Our approach proposes an alternative for extending StyleGAN models to image-conditional generation. Although it has demonstrated its potential in solving several image-to-image translation tasks, the detailed architecture still needs further investigation and refinement to improve the generation quality. Our model architecture utilises the equivariant generator in StyleGAN3; however, our feature extractor is not yet rotation equivariant. Therefore, the generation may suffer when the rotation is not encoded. Figure 15 shows an example of failure where the encoder does not preserve the rotation. It would be beneficial to make the feature encoder equivariant in future work.

Figure 15: The encoder failed to extract the rotation.

6.2. Ethical Considerations and Energy Consumption

Potential negative societal impacts of images produced by GANs [50] were considered throughout the project. The models trained on the FFHQ dataset are for purely academic purposes, and their interactive prototype will not be publicly distributed by any means. Model training used approximately 300 hours of computation on an A100 SXM4 80GB GPU (TDP of 400W). Total emissions are estimated to be 25.92 kg CO₂eq, as calculated by the Machine Learning Impact calculator [51]. Paper cards in the experiments were allocated to participants in limited numbers and were reused during and after the experiments.
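As a back-of-the-envelope check (assuming the GPU draws its full 400 W TDP for the entire 300 hours), the reported figure corresponds to a grid carbon intensity of roughly 0.22 kg CO₂eq per kWh:

$0.4\ \mathrm{kW} \times 300\ \mathrm{h} = 120\ \mathrm{kWh}, \qquad 25.92\ \mathrm{kg\ CO_2eq} \div 120\ \mathrm{kWh} \approx 0.216\ \mathrm{kg\ CO_2eq/kWh}$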
Acknowledgments

This research was carried out under the supervision of Prof. Mick Grierson. I sincerely appreciate and treasure his feedback and comments.

References

[1] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, 2014. arXiv:1406.2661.
[2] A. Sauer, K. Schwarz, A. Geiger, StyleGAN-XL: Scaling StyleGAN to large diverse datasets, 2022. arXiv:2202.00273.
[3] B. Liu, Y. Zhu, K. Song, A. Elgammal, Towards faster and stabilized GAN training for high-fidelity few-shot image synthesis, CoRR abs/2101.04775 (2021). arXiv:2101.04775.
[4] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, T. Aila, Alias-free generative adversarial networks, in: Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 852–863.
[5] S. Berns, T. Broad, C. Guckelsberger, S. Colton, Automating generative deep learning for artistic purposes: Challenges and opportunities, 2021. arXiv:2107.01858.
[6] M. Som, Mal Som @errthangisalive, 2020. URL: http://www.aiartonline.com/highlights-2020/mal-som-errthangisalive/.
[7] D. Schultz, Artificial images, 2020. URL: https://artificial-images.com/project/you-are-here-machine-learning-film/.
[8] T. Park, M.-Y. Liu, T.-C. Wang, J.-Y. Zhu, Semantic image synthesis with spatially-adaptive normalization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2337–2346.
[9] J. Rafner, S. Langsford, A. Hjorth, M. Gajdacz, L. Philipsen, S. Risi, J. Simon, J. Sherson, Utopian or dystopian?: Using a ML-assisted image generation game to empower the general public to envision the future, in: Creativity and Cognition, Association for Computing Machinery, New York, NY, USA, 2021. doi:10.1145/3450741.3466815.
[10] Y. Wang, W. Zhou, J. Bao, W. Wang, L. Li, H. Li, CLIP2GAN: Towards bridging text with the latent space of GANs, 2022. arXiv:2211.15045.
[11] J. Simon, Artbreeder, 2018. URL: https://www.artbreeder.com/about.
[12] S. Shahriar, GAN computers generate arts? A survey on visual arts, music, and literary text generation using generative adversarial network, 2021. arXiv:2108.03857.
[13] M. Muller, J. D. Weisz, W. Geyer, Mixed initiative generative AI interfaces: An analytic framework for generative AI applications, in: Proceedings of the Workshop The Future of Co-Creative Systems, a Workshop on Human-Computer Co-Creativity of the 11th International Conference on Computational Creativity (ICCC 2020), 2020.
[14] A. Spoto, N. Oleynik, Library of mixed-initiative creative interfaces, 2017. URL: http://mici.codingconduct.cc/.
[15] I. Grabe, M. G. Duque, S. Risi, J. Zhu, Towards a framework for human-AI interaction patterns in co-creative GAN applications, in: Joint Proceedings of the ACM IUI Workshops 2022, Helsinki, Finland, 2022. URL: https://ceur-ws.org/Vol-3124/paper9.pdf.
[16] J. Kim, M. L. Maher, S. Siddiqui, Collaborative ideation partner: Design ideation in human-AI co-creativity, in: Proceedings of the 5th International Conference on Computer-Human Interaction Research and Applications (CHIRA 2021), INSTICC, SciTePress, 2021, pp. 123–130. doi:10.5220/0010640800003060.
[17] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation with conditional adversarial networks, CVPR (2017).
[18] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, B. Catanzaro, High-resolution image synthesis and semantic manipulation with conditional GANs, 2017. arXiv:1711.11585.
[19] Y. Li, X. Chen, F. Wu, Z.-J. Zha, LinesToFacePhoto: Face photo generation from lines with conditional self-attention generative adversarial network, 2019. arXiv:1910.08914.
[20] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, 2018. arXiv:1812.04948.
[21] X. Huang, S. Belongie, Arbitrary style transfer in real-time with adaptive instance normalization, in: ICCV, 2017.
[22] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, T. Aila, Analyzing and improving the image quality of StyleGAN, 2019. arXiv:1912.04958.
[23] N. Dey, A. Chen, S. Ghafurian, Group equivariant generative adversarial networks, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=rgFNuJHHXv.
[24] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, R. Ng, Fourier features let networks learn high frequency functions in low dimensional domains, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 7537–7547.
[25] Y. Alaluf, O. Patashnik, Z. Wu, A. Zamir, E. Shechtman, D. Lischinski, D. Cohen-Or, Third time's the charm? Image and video editing with StyleGAN3, 2022. arXiv:2201.13433.
[26] E. Richardson, Y. Weiss, The surprising effectiveness of linear unsupervised image-to-image translation, 2020. arXiv:2007.12568.
[27] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, D. Cohen-Or, Encoding in style: A StyleGAN encoder for image-to-image translation, 2020. arXiv:2008.00951.
[28] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, 2015. arXiv:1505.04597.
[29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015. arXiv:1512.03385.
[30] X. Liu, G. Yin, J. Shao, X. Wang, H. Li, Learning to predict layout-to-image conditional convolutions for semantic image synthesis, 2019. arXiv:1910.06809.
[31] P. Zhu, R. Abdal, Y. Qin, P. Wonka, SEAN: Image synthesis with semantic region-adaptive normalization, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[32] A. Liapis, G. Smith, N. Shaker, Mixed-initiative content creation, Springer International Publishing, Cham, 2016, pp. 195–214. doi:10.1007/978-3-319-42716-4_11.
[33] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6.
[34] R. Gal, O. Patashnik, H. Maron, G. Chechik, D. Cohen-Or, StyleGAN-NADA: CLIP-guided domain adaptation of image generators, 2021. arXiv:2108.00946.
[35] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, T. Aila, Training generative adversarial networks with limited data, in: Proc. NeurIPS, 2020.
[36] X. Mao, L. Cao, A. T. Gnanha, Z. Yang, Q. Li, R. Ji, Cycle encoding of a StyleGAN encoder for improved reconstruction and editability, 2022. arXiv:2207.09367.
[37] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-W. Chen, J. Wu, UNet 3+: A full-scale connected UNet for medical image segmentation, 2020. arXiv:2004.08790.
[38] H. Lu, Y. She, J. Tie, S. Xu, Half-UNet: A simplified U-Net architecture for medical image segmentation, Frontiers in Neuroinformatics 16 (2022). doi:10.3389/fninf.2022.911679.
[39] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014. arXiv:1409.1556.
[40] R. Meyes, M. Lu, C. W. de Puiseau, T. Meisen, Ablation studies in artificial neural networks, 2019. arXiv:1901.08644.
[41] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, A. A. Efros, Generative visual manipulation on the natural image manifold, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision – ECCV 2016, Springer International Publishing, Cham, 2016, pp. 597–613.
[42] J. Canny, A computational approach to edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8 (1986) 679–698. doi:10.1109/TPAMI.1986.4767851.
[43] I. Skorokhodov, G. Sotnikov, M. Elhoseiny, Aligning latent and image spaces to connect the unconnectable, arXiv preprint arXiv:2104.06954 (2021).
[44] R. Xu, X. Wang, K. Chen, B. Zhou, C. C. Loy, Positional encoding as spatial inductive bias in GANs, arXiv preprint, 2020.
[45] T. Broad, F. F. Leymarie, M. Grierson, Network bending: Expressive manipulation of deep generative models, 2020. arXiv:2005.12420.
[46] T. Broad, S. Berns, S. Colton, M. Grierson, Active divergence with generative deep learning – a survey and taxonomy, 2021. arXiv:2107.05599.
[47] A. Adams, P. Lunt, P. Cairns, A qualitative approach to HCI research, in: P. Cairns, A. Cox (Eds.), Research Methods for Human-Computer Interaction, Cambridge University Press, Cambridge, UK, 2008, pp. 138–157.
[48] V. Braun, V. Clarke, Thematic analysis, in: APA Handbook of Research Methods in Psychology, Vol. 2: Research Designs: Quantitative, Qualitative, Neuropsychological, and Biological, 2012, pp. 57–71. doi:10.1037/13620-004.
[49] M. T. Llano, M. d'Inverno, M. Yee-King, J. McCormack, A. Ilsar, A. Pease, S. Colton, Explainable computational creativity, 2022. arXiv:2205.05682.
[50] V. U. Prabhu, D. A. Yap, A. Wang, J. Whaley, Covering up bias in CelebA-like datasets with Markov blankets: A post-hoc cure for attribute prior avoidance, 2019. arXiv:1907.12917.
[51] A. Lacoste, A. Luccioni, V. Schmidt, T. Dandres, Quantifying the carbon emissions of machine learning, 2019. arXiv:1910.09700.