Gestural Inputs as Control Interaction for Generative Human-AI Co-Creation

John Joon Young Chung¹, Minsuk Chang², and Eytan Adar¹
¹University of Michigan, Ann Arbor, MI
²Naver AI Lab, Seongnam, Republic of Korea
jjyc@umich.edu (J. J. Y. Chung); minsuk.chang@navercorp.com (M. Chang); eadar@umich.edu (E. Adar)
https://johnr0.github.io/ (J. J. Y. Chung); https://minsukchang.com/ (M. Chang); http://cond.org/ (E. Adar)

Abstract

While AI-powered generative systems offer new avenues for art-making, directing these algorithms remains a central challenge. Current methods for steering have focused on conventional interaction techniques (widgets, examples, etc.). This position paper argues that the intersection of user needs in creative contexts and algorithmic capabilities requires re-thinking our interactions with generative AI. We propose that rough gestural inputs, such as hand gestures or sketching, can enhance the experience of human-AI co-creation–even for text. First, the undetermined and ambiguous nature of gestural inputs corresponds to the purpose and the capabilities of generative systems. Second, rough gestural inputs can be intuitive and expressive, facilitating iterative co-creation. We discuss design dimensions for inputs of artifact-creating systems, then characterize existing and proposed input interactions with those dimensions. We highlight how gestural inputs can expand the control interaction for generative systems by analyzing existing tools and describing speculative input designs. Our hope is that gestural inputs become actively studied and adopted to support user intentions and maximize the perceived efficacy of generative algorithms.

Keywords: generation, controllability, gestural input

1. Introduction

Technologies such as generative adversarial networks (GANs) [1] and pretrained language models (PLMs) [2] have the potential to enable human-AI co-creation. These algorithms are attractive in creative contexts as the AI can generate novel creations–something the human hadn't considered. However, this process can backfire when novelty and surprise misalign with the user's intention and preference. Most commonly, produced text can suddenly turn from what the author wants. Users often have to re-run the algorithm and iterate until they get the desired results. Without control, users can only hope that the next generation will be better than the last. Thus, controllability becomes key to effective iteration. The best controls go beyond steering the behavior of the algorithm. They also manage the user's expectations of what the algorithm will produce. There are many conventional ways to provide interactive control, but they do not address these goals effectively. This paper proposes that rough gestural 'sketches' coupled with abstract representations of content (i.e., information visualizations) can facilitate control interaction for generative algorithms.

Our proposal is strongly motivated by limitations in existing control interactions for generative algorithms. Current control interactions range from inputting a simple number (e.g., have a violin play with the maximum amount of vibrato, by setting the parameter value of 1.0 [3]) to using natural language prompts (e.g., produce an image of a dragon sitting on a castle [4]) to providing examples (e.g., make this photograph look like this example from Picasso [5]). We argue that these approaches are limited in different ways, particularly in co-creative tasks. For example, numerical inputs imply an 'exact' level of control. This over-promises and sets a very high expectation for the user: the system will produce exactly what was specified. Unfortunately, this does not often match algorithmic capabilities. Natural language prompts and example-driven interfaces are problematic for other reasons. Creative work often requires iteration and experimentation with alternatives. Prompts and examples do not readily support this iteration.
Users may not understand why the algorithm did what it did, how the results can be corrected, or may simply be challenged to find or create new examples or prompts. The cost of iterative practice may make generative algorithms unappealing in practice.

In answer to these challenges for generative tools, we propose that rough gestural inputs, such as sketching, can be a sweet spot for human-AI co-creation. First, gestural input conveys imprecise and ambiguous intentions [6, 7], which corresponds to the nondeterministic nature of generative algorithms. Second, because impreciseness is allowed (e.g., simple brush strokes [8]), the interaction of specification becomes easier. With easier interactions, the end-user may not need to think carefully about the examples or prompts they generate, thus allowing for more rapid iteration.

Figure 1: TaleBrush facilitates human-AI story co-creation by allowing users to intuitively control and make sense of story generation with a line sketching interaction. Specifically, users can control the protagonist's fortune with line sketching. In TaleBrush, the writer can write a portion of a story (A1) and sketch the change of the protagonist's fortune as a control input (A2, green shaded area). In the sketched line, the x and y positions stand for the chronological story position and the protagonist's fortune, respectively. For the protagonist's fortune, the higher the position, the better the fortune. In the sketch, the width shows the possible variance in the fortune of the generated sentences. With the given line sketch, TaleBrush generates story sentences (B1, indicated with blue). These sentences are then visualized upon the original sketch (B2, the blue line and dots).

The idea of gestural or sketching inputs in the context of generation has some history. For example, low-fidelity sketches created by the end-users can guide the generation of photorealistic images [9]. Here the sketch is the input and is in the same modality as the output (e.g., take the visual dragon I scribbled and make a visual photorealistic version). The input indicates "what generation should be done." Control is implemented through more standard interactive approaches (e.g., by adjusting this slider, I am indicating how to bias color selection for the dragon). This is not to imply a clear separation between input and control, as they are often inexorably connected as mechanisms to have a system produce the desired output. However, our specific suggestion is that the input interactions–and sketching and gesture, in particular–can be expanded to also control "how the generation should be done."

One example of a 'sketch-as-control' interaction is our controllable story generation system, TaleBrush (Figure 1). Here, TaleBrush leverages an abstract visual representation of the character's fortune to control the story generation. The canvas is a 2D plane that allows for the specification of the protagonist's fortune (y-axis) and the story's progression (x-axis). In this interface, the control interaction is as simple as a single stroke of a line. This approach has several benefits. First, it is easier to interact with than alternatives (e.g., having multiple sliders for different story parts). Most importantly, the ambiguous and imprecise nature of sketching corresponds to the user's ambiguous intentions and the algorithm's uncertainty. This example also demonstrates how one input modality and representation (i.e., visual) can be used to guide a different output modality (i.e., textual).
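To make this mapping concrete, a drawn stroke can be resampled into one fortune value per sentence to be generated. The sketch below is a minimal illustration of that discretization under our own assumptions (the helper names are hypothetical); it is not TaleBrush's actual implementation.

```python
def sample_fortune(stroke, x):
    """Linearly interpolate the stroke's y value at horizontal position x."""
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        if x0 <= x <= x1:
            t = 0.0 if x1 == x0 else (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return stroke[-1][1]

def stroke_to_controls(stroke, n_sentences):
    """Resample a drawn stroke into one fortune value per sentence.

    stroke: (x, y) points in drawing order with non-decreasing x; y is
    normalized to [0, 1], where higher means better fortune (the y-axis
    of the sketching canvas). Returns one control value per sentence.
    """
    x_lo, x_hi = stroke[0][0], stroke[-1][0]
    return [sample_fortune(stroke, x_lo + (x_hi - x_lo) * (i + 0.5) / n_sentences)
            for i in range(n_sentences)]

# A rise-then-fall fortune arc discretized for five sentences:
print(stroke_to_controls([(0, 0.1), (50, 0.9), (100, 0.3)], 5))
```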
In this paper, we expand on this idea. We first introduce the design dimensions for inputs of artifact-creation systems. These include the types of support one wants with an AI tool, considerations of algorithmic uncertainty, the precision of input, and ease of iteration on algorithms and inputs. Using these design dimensions, we characterize different existing input types for generative human-AI co-creation. We specifically discuss how our example system, TaleBrush [10], adopts sketched inputs to facilitate iterative human-AI co-creation in story writing. Considering sketching and gestural inputs for control will enable new ways to support human-AI co-creation.

2. Design Dimensions of Artifact-Creation Support

We first consider possible dimensions for designing control interactions for creation support. We scope "creation support" to systems that help creatives directly implement artifacts. We exclude those tools that serve a more indirect role, such as critiquing the created artifact. This boundary is something we have previously considered in surveying the range of tools in the creative space [11]. We propose a focus on three aspects: 1) type of support, 2) algorithm, and 3) input (summarized in Figure 2).

Figure 2: Summary of the design dimensions of artifact-creation support in relation to designing generative co-creation tools. Orange items (transfer and generation) are two supports provided with generative co-creation tools. Items in green (non-deterministic, direct, and indirect) are design elements for generative co-creation tools (our focus is on indirect inputs). In indirect input design, the requirements for generative co-creation tools include: 1) easy and fast iteration and 2) algorithmic uncertainty matching the user's expectation. These align with gestural inputs, considering the ease of interaction and their ability to express the user's ambiguous intentions. Note that the two-dimensional diagram in the characteristics of input is drawn based on the qualitative analysis of different input approaches.

2.1. Type of Support

Creativity support tools (CSTs) are an extremely diverse and broad category, even when restricted to direct influence [11]. Within this category we see tools that can augment, transfer, or generate. Though the last two categories are most relevant to our proposal, augmentation is also worth considering.

CSTs that provide augmentation support often enhance a task the creative is already doing through computational means. Many direct manipulation tools fall into this category. The least 'intelligent' of these replicate existing tools in a digital format. For example, a digital painting canvas has various types of digital brushes. Other augmentation tools provide some limited automation. For example, a bucket tool will flood-fill a closed area in a sketch. Most augmentation tools are highly deterministic. They are "predictable" and more naturally correspond to the user's mental model of what the system will do. When using augmentation-focused tools, the user is firmly in control over both the idea and style of the final artifact.

The second and third types of support, transfer and generation, use variants of generative algorithms. In contrast to the augmentation category, the end-user is ceding some creative control to the tool. Though, of course, the human maintains ultimate control over what makes it into the final artifact.

Transfer tools turn one artifact into another. A common feature is that they receive some 'original' artifact (e.g., a picture, a piece of text, a sketch, etc.) as input. The tool will then act on this input to generate a variant–often some alteration of the original input. A wide range of tools fall into this category, and they are often modality-specific. For example, in the visual design/art space, we see systems that transfer one visual art piece's style to another image [5]. Other tools in the space will transform rough sketches into photorealistic images [9]. As with image-based style transfer, we find similar approaches for text where the software can transform the written input to the style of a particular author [12].
Though target styles are commonly required, not all transfer tools need them. In music, for example, there are tools that transform some input piece of music by adding effects like delay or compression [13].

Finally, we observe tools focused on generation. With these, the algorithm generates content from incomplete inputs or those of a different modality. For example, an input might be some previous part of the music, story, or some portion of drawings. The algorithm's purpose is not to change this initial input, but rather to add to it. In music and text, these algorithms continue the content from the user-provided 'start' or 'infill' when given some start and end states [14, 15, 16, 17]. In visual arts, we most often find this type of algorithm in systems that can fill empty spaces in an image [18, 19]. Note that many tools sit somewhere between transfer and generation and may depend on how the underlying task is defined. For example, we might have a tool that automatically colors a part of an image. From the perspective of the whole image, this may be transfer (especially if the input is some color palette or color model). However, because we are also generating new colors, we might treat the colorization task as generative.

Different types of tools will require different types of controls. However, there are similarities in user needs and expectations (e.g., surprise and novelty but also a willingness to cede some creative control to the software). This is in contrast to non-creative applications (e.g., predictive form filling) where ambiguity and surprise are undesirable. As we argue below, gestural and sketched inputs hold promise here.
2.2. Algorithm

2.2.1. Type: Algorithmic Uncertainty

A tool's algorithmic pipeline can differ depending on how certain we are of the pipeline's output. Deterministic algorithms are one extreme in that users can predict the result when using these algorithms. Direct manipulation implementations are, naturally, one example. When a box is dragged with a mouse cursor, the end-user knows where it will end up. Automated algorithms with clear rules are also deterministic. For example, with flood-fill (e.g., a bucket tool), the user knows the system will fill closed areas. If something goes wrong, the user can quickly isolate the problem.

On the other extreme are non-deterministic algorithms, which include many machine learning (ML) algorithms. Though powerful, the inferences made by these algorithms lead to increased uncertainty and failures. For example, in comic colorization, 'flatting' is the process of automatically creating colored polygons under different parts of the line art (e.g., one for the face, one for the shirt, etc.). The algorithm for automated flatting makes inferences about shapes even when they are not 'closed' in an expected way. For example, creases in a drawing of a shirt may lead a poorly designed algorithm to make too many polygons or not connect them appropriately. Ideally, the system will produce one polygon that encapsulates the entire shirt. However, current flatting software is imperfect and can make the wrong inference. The algorithm's uncertainty in what makes up the object can lead to unexpected bleeding [20]–a failure case.
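To ground the contrast between these two extremes: a rule-based flood fill is fully specified by its rule, so the same canvas and the same seed always yield the same result. A minimal sketch of such a deterministic fill follows (illustrative only, not any particular tool's code):

```python
from collections import deque

def flood_fill(grid, row, col, new_color):
    """Deterministically recolor the connected region containing (row, col).

    The behavior is fully specified by the rule: recolor every cell
    reachable from the seed through same-colored 4-neighbors.
    """
    old_color = grid[row][col]
    if old_color == new_color:
        return grid
    queue = deque([(row, col)])
    while queue:
        r, c = queue.popleft()
        if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == old_color:
            grid[r][c] = new_color
            queue.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return grid
```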
However, in the creative setting, and specifically for generative algorithms, uncertainty can be a feature (rather than a failure). Or, more precisely, the line between a novel, desirable result and an error is not necessarily clear-cut. There is rarely a single gold standard for what should be generated, and the user might subjectively decide whether the output fits their goal.

2.2.2. Characteristics: Ease of Algorithmic Iteration

Iterative design is important in creating artifacts. This is mainly due to the explorative nature of the task. How easy it is to iterate depends on the algorithm's properties. First among these is latency–the time taken by the algorithm for each cycle. The lower the latency, the easier the iteration. For example, when moving a box with a mouse cursor, the iteration is real-time as the box's position instantly updates with the user's movement. On the other hand, many generative algorithms take significant time to generate artifacts, considerably slowing down iteration.

A second algorithmic aspect that impacts iteration is scope. Here, we define scope as relating to what parts of the artifact are iterated on. For example, with the direct manipulation of the box, only the position is changing, but not the color or size. Style transfer algorithms [21] are at the other extreme. With every run of these algorithms, the entire image (or many parts of it) will change.

Iteration naturally connects back to algorithmic uncertainty. If the user better understands the algorithm's behavior (e.g., what the transfer algorithm changes and how), iteration may become easier. With high uncertainty, the user may need to iterate many times to get the effect or artifact they want.

2.3. Input

2.3.1. Type: Input Directness

An input method targets the artifact directly or indirectly [11]. With direct input, the end-user indicates the artifact or subject 'target.' Because of this directness, inputs are usually in the same medium as the target artifact. In some situations, a portion of the artifact can also be used as a direct input. For example, we can select a portion of the image or the story. At the other extreme are those inputs that do not directly impact the artifact but may give broad instructions on how the tool should implement something. The simplest example might be a slider control for some parameters. The user isn't touching the artifact directly (i.e., the story or image) but the change in the slider guides the tool. The modality of indirect input can be far from the medium (e.g., visual arts as artifacts and numbers as inputs). As with our introductory example, abstract visual encodings can also be used for indirect inputs. In that example, the end-user drew the character's fortune to produce text.

2.3.2. Characteristics: Input Precision

While there are numerous input approaches for artifact-creation systems, they vary on the spectrum of precision. These varying levels are helpful in different contexts. The most traditional type of widget receives one specific value. Examples include a number in a slider or a category in a dropdown box. With this precise control, users will expect the output to react precisely.

Not all inputs need to be precise. Natural language prompts are one example and can handle a wider range of input precision [22, 23]. Roughly specified language would be imprecise, but at the same time, allow a high degree of freedom in how it can be interpreted. For example, asking for a "rough texture" can mean many things–anything from Jackson Pollock's chaotic style to Van Gogh's impressionism. However, language can support finer control. For example, if we say "move the selected square 3 pixels left," this does not leave much room for misinterpretation.

At the imprecise end, we often find examples as inputs [24, 21]. While they are often used as direct material for transfer (e.g., the source of style in visual style transfer), it is up to the algorithm to determine, if it can, which aspects of the input should be followed closely and which are only suggestions. For example, when transferring the style of Van Gogh's The Starry Night, it may not be clear whether the user wants the colors or textures to be transferred. Adding more examples might make the target clearer. However, it may be hard for the user to determine which attributes overlap between the examples and which are ambiguous. The interaction with uncertain algorithms makes this problem even more complex as it is not obvious if the issue is with the input or the inherent ambiguity of the system.

As with language prompts, gestural inputs, such as sketches or hand gestures, can also have a wide range of precision. For example, gestural inputs for direct manipulation require outputs to follow the given input exactly. When resizing a box in graphics editors, users expect the box to follow the cursor they are moving. However, sketches can be used for low-precision input. For example, sketches can express flexible and lightweight ideas with their roughness, ambiguity, and uncertainty [7, 6]. Similarly, hand gestures have been used to provide imprecise but intuitive and flexible inputs, such as serving as rough scaffolds in 3D modeling [25]. As we see in these examples, gestural inputs can be designed to provide high intuitiveness and flexibility and be traded off against precision.

We note that input precision is often related to input difficulty. As we know from psychophysical properties such as Fitts's Law [26], certain input precision comes at the cost of time or difficulty. Lower precision interactions, such as gestures, can often lower interaction difficulty.
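For reference, the Shannon formulation of Fitts's law used by MacKenzie [26] models the movement time MT to acquire a target of width W at distance D as

    MT = a + b \log_2\left(\frac{D}{W} + 1\right)

where a and b are empirically fitted constants and the logarithmic term is the index of difficulty in bits. Shrinking the tolerated target width W raises the index of difficulty, which is one way to read the precision-for-time trade-off above.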
2.3.3. Characteristics: Ease of Input Iteration

Just as we consider the iterative cost at the algorithmic level, it is worth considering it at the input level as well. Though these two might be tied, a tool might have relatively small back-end iterative costs but widely diverging front-end costs. For example, the algorithm itself might run quickly, but generating good example inputs may take a long time. Thus, different input approaches vary in how well they support iteration.

Traditional input widgets, such as numerical values on sliders, are relatively easy. With a single slider, the control options given to the user are tightly restricted, and a change in value does not require much effort.
However, even with simple slider widgets, the user needs to decide if a control should be changed and then make the actual change to the correct value. As the number of input controls grows, so does iteration cost.

Other types of inputs, such as natural language prompts or examples, can increase iterative costs. This is mainly due to the vast space of options for these modalities. For natural language prompts, the user needs to come up with better wordings or more specific details for the prompts. This can be tricky if the user is to express differences in degree (e.g., how would one use language to express the level of roughness of the texture in a painting?). Similarly, iterating with examples is difficult because the user needs to search for more or better examples. If such an example can't be easily found or created, the user will struggle to iterate.

Gestural or sketch inputs can help with iteration. While these input modalities come at the cost of precision, gestural input is flexible and intuitive, with lowered cognitive demands. These properties can reduce iteration time. For example, the user can erase and redraw a portion of the sketch to quickly change the specification.

3. Designing Generative Co-Creation Tools

The design dimensions above represent a large design space. However, we can begin to consider points in the space that are either required, or are more suitable, for co-creation tools.

3.1. Requirements for Generative Co-creation Tools

As we argued above, generative algorithms are usually used to support transfer or generation. Additionally, these systems have increasingly leaned towards machine-learning-based approaches. Thus, we are largely working in the non-deterministic algorithmic space, and this implies a couple of key requirements for tools.

First, iteration should be easy and fast. In creative tasks, iteration and exploration are necessary as they expose the artist to more options and, eventually, a concretization of 'direction' [27, 28]. Thus, users of creative tools often want to be able to iterate, which is well-aligned with the reality that to use non-deterministic tools, they need to iterate. Unfortunately, sometimes the cost of iteration becomes high. Thus, tools should either act to speed up iterations or, if that is not possible, reduce the number of iterations needed. In both situations, reducing the iteration cost is critical.

Second, algorithmic uncertainty should match the user's expectations. With standard algorithms, we would only need to worry about the user's expectations of how their input and deterministic output relate. For example, dragging an icon into the trash would lead to it being deleted. However, with non-deterministic algorithms, instead of a specific output, they would need to model a range of possible outputs. Without this understanding, end-users are likely to be dissatisfied with the results. They will also find it difficult to model how their input choices will lead to a better, or more certain, output. Our advantage in creative tools is that some degree of uncertainty is actually a desired property. Our goal is not necessarily to make the tool appear deterministic, as creativity often requires 'surprise.' Thus, users both want and expect some level of (controlled) uncertainty. A user may be willing to make a rough specification. At the same time, they would understand and expect that the tool will have some degrees of freedom within that space. Note that none of this is to say that we need to force the algorithms to match the user's expectations. In some cases, users might not have well-defined expectations. In other cases, we may change their expectations.

3.2. Algorithmic Design

There are various ways to approach the iteration and uncertainty problems on the algorithmic side. For example, by adding extensive controllability features, we can provide the user with fine-grained controls for steering the behavior of the generative algorithms. However, this requires building algorithms that can actually accept all these controls.

On the positive side, detailed control may reduce the number of iterations, at least from the algorithmic perspective.
That is, fine-grained controls would reduce the ambiguity of the input and enable the generative system to produce a more targeted response. Detailed controls also work to 'teach' the end-user how to model and direct the underlying algorithm. Their expectations of system capabilities would come to be more in alignment with reality with fewer iterations. Of course, reducing the latency of the algorithm would also facilitate the ease of iteration. Clever designs, such as using smaller models before executing more costly larger ones, may help here.
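One way to read this 'smaller model first' design is as a draft-and-filter cascade: a cheap model proposes many candidates, and the costly model is only spent on the few that pass a quick screen. The sketch below is our own minimal illustration of that idea; all names and signatures are hypothetical stand-ins, not any specific system's API.

```python
def cascade_generate(prompt, cheap_model, costly_model, screen, k=8):
    """Draft candidates with a small model; spend the large model on survivors.

    cheap_model, costly_model: callables that generate text (hypothetical
    stand-ins for a small, low-latency model and a large, slow one).
    screen: a fast predicate that discards clearly-off drafts.
    """
    drafts = [cheap_model(prompt) for _ in range(k)]
    shortlist = [d for d in drafts if screen(d)] or drafts[:1]
    # Only the shortlisted drafts incur the large model's latency.
    return [costly_model(prompt, draft) for draft in shortlist]
```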
However, shifting the responsibility of satisfying our requirements to the algorithmic side exclusively is not realistic. Regardless of the algorithm, many bottlenecks for iteration are on the interaction side. Increasing the number of controls may be cognitively costly for the end-user. This is not to say that fewer controls or simpler inputs, such as examples or prompts, reduce cognitive cost. The cognitive cost of figuring out how to change or create an example can be equally bad. When coupled with the specific demands of creative applications–that we want some iteration and some surprise–achieving a 'sweet spot' through algorithmic means alone seems implausible. Put another way, simply changing the algorithm can't solve our problem if the interface costs are high or the user's requirements for a creative tool are unmet.

3.3. Input Design

To achieve our requirements, we argue that interaction is a critical factor. We focus on possible input approaches. These will naturally range based on the type of input directness.

3.3.1. Direct Inputs: A Small Space of Design

Direct inputs are usually made in the same medium as the artifact. For example, we might use low-fidelity sketches on the drawing surface when the target artifact is visual. These sketches will then be transferred to high-fidelity images on (essentially) the same surface/encoding [9]. In other cases, the algorithm may simply append elements to the sketch [18]. This type of input serves as the 'material' for the generation–where the transfer is applied or what the generative algorithms build upon.

While direct inputs may depend on the application domain, their specific type may be largely constrained to a small design space. This is largely because the representation depends on the target artifact's medium (e.g., a drawing canvas). Additionally, the interactions are constrained by the underlying algorithm. For example, we may train an algorithm to produce a photo-realistic image given a low-resolution sketch. Such datasets are more readily available and easier to produce. The user-facing input modality and form are thus constrained to something that looks like the training data. Finally, direct inputs create a set of expectations for the end-user that need to be maintained in the interactive controls. Because of these constraints, which may limit our design space options, we move to consider indirect inputs.

3.3.2. Indirect Inputs: Approaches and Their Limitations

Indirect inputs serve as instructions for both what and how to generate. Unlike direct inputs, they are not dependent on the artifact's medium. One can work in abstract spaces or through abstract representations. Thus, there is often more freedom to design with indirect inputs. The consequence of freeing ourselves from the constraints of the domain also enables us to consider additional algorithm types. This flexibility further affords a better ability to match the end user's high-level intentions rather than forcing them to work within their algorithm and interface constraints. However, this is not to say that all indirect inputs are good ones. A novel indirect interaction might be further from the target artifact's modality and thus might be harder to master when the mapping is complex. A poorly designed indirect interaction can also increase cognitive costs and reduce the ability to iterate. Altogether, indirect inputs open up a vast space of possibilities but introduce various pitfalls.

To better understand which aspects may help or hinder, we focus on three types of inputs: traditional input widgets, natural language prompts, and gestural inputs. Traditional input widgets, such as sliders for numerical inputs, represent the simplest option. If we are able to use these in the interface, it often means that we can directly map the user's expectation to the actual behavior of the system. In reality, this depends on how the end-user understands the construct represented with the widget. If the label on the slider is ambiguous (e.g., this will control the 'brightness' of the text) or novel (e.g., this will control the 'certitudeness' of the text), the user may struggle with the control. Clearly, this will improve with experience, as the user calibrates to the system. More critical, however, are situations where a user only has a rough idea of what they want to generate. Here, a standard input widget may be insufficient. As critically, the algorithms themselves may not deliver on the precision of the input. Thus, the interface is over-promising. The ease of iteration with input widgets largely depends on the complexity of the interface. A single slider is simple, but many controls and interactions will naturally become more challenging.

Recent generative co-creation systems have enabled indirect natural language prompts as input. For example, natural language prompts can steer vision-language models to generate visual images [22, 23]. As these approaches can be used with imprecision or ambiguity, they are useful for giving high-level specifications on generations. However, they would be difficult to iterate with due to the vast space of inputs.

Surprisingly, few human-AI co-creation tools have used gestural or sketch interactions for indirect control. We argue that this is a missed opportunity as there are several benefits to this approach. In the next section, we expand on this possibility and why it may be appropriate.

4. Gestural Indirect Inputs for Generative Co-Creation

We propose that gestural or sketch-based interfaces for indirect specification satisfy our requirements for co-generation tools. At the very least, this approach may complement other input controls. First, simple gestural interactions (e.g., producing a rough sketch) are easy to iterate with. This characteristic can complement more effortful controls such as prompts or examples. Moreover, the multi-dimensional characteristics of sketches and gestures can reduce the effort to interact with multiple attributes simultaneously. For example, 2D sketching coupled with pressure and speed recognition can be used to encode multiple parameters simultaneously. This flexibility also means that we can work in abstract visual encodings. Finally, as we have argued before, gestural inputs convey the sense of being rough and flexible. This strongly aligns with the non-determinism of the algorithm and the ambiguity of the user's intent. Moreover, it can convey the 'unfinished' nature of the generative process.

As a demonstration of the feasibility of this approach, we describe our system TaleBrush [10] (Figure 1). TaleBrush is a human-AI story co-creation tool that generates story sentences according to the specifications of the protagonist's fortune. For example, if we were describing Cinderella's fortune, we might say that her fortune started low (with her stepmother and sisters), improved greatly as she went to the ball, collapsed as she was forced to flee, and then improved again when the prince found her.

TaleBrush allows the user first to input a portion of the story (direct input) in the text box (Figure 1A1). Then, they can sketch out the protagonist's fortune in a 2-dimensional line sketch as in Figure 1A2. This is roughly a standard time series with the x and y axes standing for sequence position and fortune levels, respectively. Using this sketched line (which is actually represented as a sketch rendering), TaleBrush will generate a story (Figure 1B1). Because the underlying algorithm is ambiguous and may not precisely match the desired fortune sketch, the best matching generation is also displayed in the visualization (Figure 1B2). Technically, this sketch-based control is powered by steering a big pretrained language model with a smaller language model that receives the sketch position as a control code. More details can be found in Chung et al. [10].
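As a rough illustration of what such steering can look like (in the spirit of generative-discriminator approaches; this is our own simplified rendering, not the exact TaleBrush method), a small class-conditional model can reweight the large model's next-token distribution toward the desired fortune level:

```python
def steered_logits(big_logits, small_cond_logits, small_uncond_logits, omega=1.0):
    """Bias a large LM's next-token logits with a small conditional LM.

    big_logits: the pretrained LM's logits, one per vocabulary item.
    small_cond_logits / small_uncond_logits: a small LM's logits with and
    without the control code (e.g., the sketched fortune value).
    omega: steering strength; omega = 0 recovers the unsteered model.
    """
    # log p_small(token | control) - log p_small(token) acts as a
    # per-token bias toward tokens that signal the desired control.
    return [b + omega * (c - u)
            for b, c, u in zip(big_logits, small_cond_logits, small_uncond_logits)]
```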
With TaleBrush, the benefits of gestural inputs hold. First, iteration on the generation is easy. The user only needs to redraw parts of the sketch. Additionally, a single drawn line expresses two dimensions simultaneously: (1) where in the story and (2) at what fortune level the sentence should be. Notably, the first (position) is a direct input, whereas the fortune level represents an indirect one. In reality, we also use the speed at which the sketched line is drawn to indicate how much ambiguity the user will tolerate in the generated result. This is visually represented in the thickness of the line. A thinner line indicates the user wants a better match. Internally, this is implemented by regenerating the sentences multiple times and finding the one that best matches the desired fortune level. This visualized boundary further emphasizes the ambiguity and non-determinism of the algorithm. Note that a 'sketch' does not necessarily mean a 'sketchy appearance.' However, we have opted to use this rendering aesthetic to further lower the user's expectations that the algorithm should be precise [7, 6].
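This regenerate-and-pick-best loop can be summarized as rejection sampling against the sketched target, with the line's width acting as the acceptance tolerance. The following is a simplified sketch with hypothetical helper names; the actual procedure is described in Chung et al. [10].

```python
def generate_matching_sentence(generate, estimate_fortune,
                               target, tolerance, max_tries=10):
    """Regenerate until a sentence's estimated fortune is close enough.

    generate: samples one candidate sentence (nondeterministic).
    estimate_fortune: maps a sentence to a fortune score in [0, 1].
    tolerance: half-width of the sketched envelope; a slow, careful
    stroke yields a thinner line and therefore a smaller tolerance.
    """
    best, best_err = None, float("inf")
    for _ in range(max_tries):
        candidate = generate()
        err = abs(estimate_fortune(candidate) - target)
        if err < best_err:
            best, best_err = candidate, err
        if err <= tolerance:  # close enough to the sketch: accept early
            break
    return best
```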
4.1. Design Approaches for Gestural Indirect Inputs

Implementing TaleBrush has given us some insight about what may work well (and poorly) for indirect gestural inputs.

4.1.1. Ease of Iteration on Input

Combine direct and indirect input if possible. For some generation tools, indirect inputs may be sufficient. For example, if the tool generates any character biographies based on the good-evil nature of the character, then it might not require direct inputs. However, as with TaleBrush, certain tasks require control with direct input. The user needs to be able to indicate "where the generation should be done" (e.g., where in the story a certain fortune level should exist, or what does the start of the story look like?). In some cases, as we did with TaleBrush, the indirect and direct controls can be combined into a single gestural sketch. That is, with a single brushstroke, the sequential position (x position–the 'where') and the level of the protagonist's fortune (y position–the 'how') are both specified.

Complement hard-to-iterate control inputs (language, examples). Spatial positions by themselves do not necessarily convey meaning. They are meaningful when combined with semantic structures that can be put on a continuous scale. For example, TaleBrush takes a restricted set of numerical semantics: whether the character's fortune is good or ill. However, this design can be extended to receive qualitative inputs as the endpoints of the axes. For example, the user can give natural language prompts or examples on each end and explore the confined space with gestural inputs. This complements the limitations and features of different input approaches. Language prompts and examples lack the ease of iteration, which is the strength of gestural inputs. On the other hand, gestural inputs lack semantics, which language prompts and examples can convey.

4.1.2. Matching Input Precision with Algorithmic Uncertainty

Match the algorithmic precision with the input precision. To have better expectations of how the algorithm will behave, the user should ideally be aware of the precision of the algorithm. Gestural inputs can be designed to convey this information. For example, in TaleBrush, this level of precision is reflected in the width of the sketched line. This was designed to match the median error from the test dataset we used during development. Thus, the interaction and representation can be used in ways that reduce ambiguity and help to match (and manage) expectations.

Controlling the precision. When using gestural inputs, the user's intention regarding precision might vary. For example, in TaleBrush, the user might have wanted the algorithm to follow the specification more tightly when they drew the line with more care. The system can be designed to leverage other input dimensions to control the precision to reflect these intentions better. In TaleBrush, the sketching speed was used to decide how tightly generation should be done. That is, if the user drew slowly, we assumed this indicated that they wanted a better fit (represented as a thinner error envelope). In this way, gestural interactions and representations can be used to align input precision with system capabilities.

5. Conclusion

In this position paper, we have explored where generative algorithms sit in the overall design space of co-creative tools. We have further isolated those properties that are desirable and potentially required for supporting human-AI co-creation. Our focus was on how inputs (both the 'what' and the 'how') can interact with underlying algorithms. Our focus on enabling iteration and managing expectations allowed us to consider the pros and cons of different input types. Ultimately, we argued that gestural and sketch-based interactions would work well for the control of generative algorithms. We showcased the benefits of this approach with TaleBrush. We believe that there are significant possibilities opened up by using abstract visual representations when coupled with novel interaction types.

Acknowledgments

We thank our reviewers for providing helpful feedback on this work.

References

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014. URL: https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

[2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[3] Y. Wu, E. Manilow, Y. Deng, R. Swavely, K. Kastner, T. Cooijmans, A. Courville, C.-Z. A. Huang, J. Engel, MIDI-DDSP: Detailed control of musical performance via hierarchical modeling, 2021. arXiv:2112.09312.

[4] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever, Zero-shot text-to-image generation, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8821–8831. URL: https://proceedings.mlr.press/v139/ramesh21a.html.

[5] L. A. Gatys, A. S. Ecker, M. Bethge, Image style transfer using convolutional neural networks, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, USA, 2016, pp. 2414–2423. doi:10.1109/CVPR.2016.265.

[6] M. D. Gross, E. Y. Do, Ambiguous intentions: A paper-like interface for creative design, in: ACM Symposium on User Interface Software and Technology, ACM, 1996, pp. 183–192.

[7] J. Landay, B. Myers, Sketching interfaces: toward more human interface design, Computer 34 (2001) 56–64. doi:10.1109/2.910894.

[8] M. Eitz, J. Hays, M. Alexa, How do humans sketch objects?, ACM Trans. Graph. (Proc. SIGGRAPH) 31 (2012) 44:1–44:10.

[9] S.-Y. Chen, W. Su, L. Gao, S. Xia, H. Fu, DeepFaceDrawing: Deep generation of face images from sketches, ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2020) 39 (2020) 72:1–72:16.

[10] J. J. Y. Chung, W. Kim, K. M. Yoo, H. Lee, E. Adar, M. Chang, TaleBrush: Sketching stories with generative pretrained language models, Association for Computing Machinery, New York, NY, USA, 2022.

[11] J. J. Y. Chung, S. He, E. Adar, The intersection of users, roles, interactions, and technologies in creativity support tools, in: Conference on Designing Interactive Systems, ACM, 2021, pp. 1817–1833.

[12] B. Syed, G. Verma, B. V. Srinivasan, A. Natarajan, V. Varma, Adapting language models for non-parallel author-stylized rewriting, 2020. arXiv:1909.09962.

[13] C. J. Steinmetz, J. D. Reiss, Steerable discovery of neural audio effects, 2021. arXiv:2112.02926.

[14] R. Louie, A. Coenen, C. Z. Huang, M. Terry, C. J. Cai, Novice-AI music co-creation via AI-steering tools for deep generative models, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–13. URL: https://doi.org/10.1145/3313831.3376739. doi:10.1145/3313831.3376739.
[15] C.-J. Chang, C.-Y. Lee, Y.-H. Yang, Variable-length music score infilling via XLNet and musically specialized positional encoding, 2021. arXiv:2108.05064.

[16] P. Ammanabrolu, W. Cheung, W. Broniec, M. O. Riedl, Automated storytelling via causal, commonsense plot ordering, CoRR abs/2009.00829 (2020). URL: https://arxiv.org/abs/2009.00829.

[17] A. Fan, M. Lewis, Y. Dauphin, Strategies for structuring story generation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2650–2660. URL: https://aclanthology.org/P19-1254. doi:10.18653/v1/P19-1254.

[18] J. E. Fan, M. Dinculescu, D. Ha, Collabdraw: An environment for collaborative sketching with an artificial agent, in: Proceedings of the 2019 on Creativity and Cognition, C&C '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 556–561. URL: https://doi.org/10.1145/3325480.3326578. doi:10.1145/3325480.3326578.

[19] Y. Lin, J. Guo, Y. Chen, C. Yao, F. Ying, It is your turn: Collaborative ideation with a co-creative robot through sketch, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–14. URL: https://doi.org/10.1145/3313831.3376258. doi:10.1145/3313831.3376258.

[20] C. Yan, J. J. Y. Chung, K. Yoon, Y. Gingold, E. Adar, S. R. Hong, FlatMagic: Improving flat colorization through AI-driven design for digital comic professionals, Association for Computing Machinery, New York, NY, USA, 2022.

[21] L. Sheng, Z. Lin, J. Shao, X. Wang, Avatar-Net: Multi-scale zero-shot style transfer by feature decoration, in: Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018, pp. 1–9.

[22] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, M. Chen, GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, 2021. arXiv:2112.10741.

[23] F. Huang, J. F. Canny, Sketchforme: Composing sketched scenes from text descriptions for interactive applications, in: Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, UIST '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 209–220. URL: https://doi.org/10.1145/3332165.3347878. doi:10.1145/3332165.3347878.

[24] E. Frid, C. Gomes, Z. Jin, Music creation by example, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–13. URL: https://doi.org/10.1145/3313831.3376514. doi:10.1145/3313831.3376514.

[25] Y. Kim, S.-G. An, J. H. Lee, S.-H. Bae, Agile 3D sketching with air scaffolding, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1–12. URL: https://doi.org/10.1145/3173574.3173812.

[26] I. S. MacKenzie, Fitts' law as a research and design tool in human-computer interaction, Hum.-Comput. Interact. 7 (1992) 91–139. URL: https://doi.org/10.1207/s15327051hci0701_3. doi:10.1207/s15327051hci0701_3.

[27] T. M. Amabile, The social psychology of creativity: A componential conceptualization, Journal of Personality and Social Psychology 45 (1983) 357.

[28] T. M. Amabile, Componential theory of creativity (2012).