Gestural Inputs as Control Interaction for Generative Human-AI Co-Creation

John Joon Young Chung¹, Minsuk Chang², and Eytan Adar¹
¹University of Michigan, Ann Arbor, MI
²Naver AI Lab, Seongnam, Republic of Korea
jjyc@umich.edu (J. J. Y. Chung); minsuk.chang@navercorp.com (M. Chang); eadar@umich.edu (E. Adar)
https://johnr0.github.io/ (J. J. Y. Chung); https://minsukchang.com/ (M. Chang); http://cond.org/ (E. Adar)

Abstract

While AI-powered generative systems offer new avenues for art-making, directing these algorithms remains a central challenge. Current methods for steering have focused on conventional interaction techniques (widgets, examples, etc.). This position paper argues that the intersection of user needs in creative contexts and algorithmic capabilities requires re-thinking our interactions with generative AI. We propose that rough gestural inputs, such as hand gestures or sketching, can enhance the experience of human-AI co-creation–even for text. First, the undetermined and ambiguous nature of gestural inputs corresponds to the purpose and the capabilities of generative systems. Second, rough gestural inputs can be intuitive and expressive, facilitating iterative co-creation. We discuss design dimensions for inputs of artifact-creating systems, then characterize existing and proposed input interactions with those dimensions. We highlight how gestural inputs can expand the control interaction for generative systems by analyzing existing tools and describing speculative input designs. Our hope is that gestural inputs become actively studied and adopted to support user intentions and maximize the perceived efficacy of generative algorithms.

Keywords: generation, controllability, gestural input

1. Introduction

Technologies such as generative adversarial networks (GANs) [1] and pretrained language models (PLMs) [2] have the potential to enable human-AI co-creation. These algorithms are attractive in creative contexts as the AI can generate novel creations–something the human hadn't considered. However, this process can backfire when novelty and surprise misalign with the user's intention and preference. Most commonly, produced text can suddenly turn from what the author wants. Users often have to re-run the algorithm and iterate until they get the desired results. Without control, users can only hope that the next generation will be better than the last. Thus, controllability becomes key to effective iteration. The best controls go beyond steering the behavior of the algorithm. They also manage the user's expectations of what the algorithm will produce. There are many conventional ways to provide interactive control, but they do not address these goals effectively. This paper proposes that rough gestural 'sketches' coupled with abstract representations of content (i.e., information visualizations) can facilitate control interaction for generative algorithms.

Our proposal is strongly motivated by limitations in existing control interactions for generative algorithms. Current control interactions range from inputting a simple number (e.g., have a violin play with the maximum amount of vibrato, by setting the parameter value of 1.0 [3]) to using natural language prompts (e.g., produce an image of a dragon sitting on a castle [4]) to providing examples (e.g., make this photograph look like this example from Picasso [5]). We argue that these approaches are limited in different ways, particularly in co-creative tasks. For example, numerical inputs imply an 'exact' level of control. This over-promises and sets a very high expectation for the user: the system will produce exactly what was specified. Unfortunately, this does not often match algorithmic capabilities. Natural language prompts and example-driven interfaces are problematic for other reasons. Creative work often requires iteration and experimentation with alternatives. Prompts and examples do not readily support this iteration.
Users may not understand why the algorithm did what it did, how the results can be corrected, or may simply be challenged to find or create new examples or prompts. The cost of iterative practice may make generative algorithms unappealing in practice.

In answer to these challenges for generative tools, we propose that rough gestural inputs, such as sketching, can be a sweet spot for human-AI co-creation. First, gestural input conveys imprecise and ambiguous intentions [6, 7], which corresponds to the nondeterministic nature of generative algorithms. Second, because impreciseness is allowed (e.g., simple brush strokes [8]), the interaction of specification becomes easier. With easier interactions, the end-user may not need to think carefully about the examples or prompts they generate, thus allowing for more rapid iteration.

Figure 1: TaleBrush facilitates human-AI story co-creation by allowing users to intuitively control and make sense of story generation with a line sketching interaction. Specifically, users can control the protagonist's fortune with line sketching. In TaleBrush, the writer can write a portion of a story (A1) and sketch the change of the protagonist's fortune as a control input (A2, green shaded area). In the sketched line, the x and y positions stand for the chronological story position and the protagonist's fortune, respectively. For the protagonist's fortune, the higher the position, the better the fortune. In the sketch, the width shows the possible variance in the fortune of the generated sentences. With the given line sketch, TaleBrush generates story sentences (B1, indicated with blue). These sentences are then visualized upon the original sketch (B2, the blue line and dots).

The idea of gestural or sketching inputs in the context of generation has some history. For example, low-fidelity sketches created by the end-users can guide the generation of photorealistic images [9]. Here the sketch is the input and is in the same modality as the output (e.g., take the visual dragon I scribbled and make a visual photorealistic version). The input indicates "what generation should be done." Control is implemented through more standard interactive approaches (e.g., by adjusting this slider, I am indicating how to bias color selection for the dragon). This is not to imply a clear separation between input and control, as they are often inexorably connected as mechanisms to have a system produce the desired output. However, our specific suggestion is that the input interactions–and sketching and gesture, in particular–can be expanded to also control "how the generation should be done."

One example of a 'sketch-as-control' interaction is our controllable story generation system, TaleBrush (Figure 1). Here, TaleBrush leverages an abstract visual representation of the character's fortune to control the story generation. The canvas is a 2D plane that allows for the specification of the protagonist's fortune (y-axis) and the story's progression (x-axis). In this interface, the control interaction is as simple as a single stroke of a line. This approach has several benefits. First, it is easier to interact with than alternatives (e.g., having multiple sliders for different story parts). Most importantly, the ambiguous and imprecise nature of sketching corresponds to the user's ambiguous intentions and the algorithm's uncertainty. This example also demonstrates how one input modality and representation (i.e., visual) can be used to guide a different output modality (i.e., textual).
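To make this mapping concrete, a drawn stroke can be resampled into one fortune value per sentence to be generated. The sketch below is a minimal illustration of that discretization under our own assumptions (the helper names are hypothetical); it is not TaleBrush's actual implementation.

```python
def sample_fortune(stroke, x):
    """Linearly interpolate the stroke's y value at horizontal position x."""
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        if x0 <= x <= x1:
            t = 0.0 if x1 == x0 else (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return stroke[-1][1]

def stroke_to_controls(stroke, n_sentences):
    """Resample a drawn stroke into one fortune value per sentence.

    stroke: (x, y) points in drawing order with non-decreasing x; y is
    normalized to [0, 1], where higher means better fortune (the y-axis
    of the sketching canvas). Returns one control value per sentence.
    """
    x_lo, x_hi = stroke[0][0], stroke[-1][0]
    return [sample_fortune(stroke, x_lo + (x_hi - x_lo) * (i + 0.5) / n_sentences)
            for i in range(n_sentences)]

# A rise-then-fall fortune arc discretized for five sentences:
print(stroke_to_controls([(0, 0.1), (50, 0.9), (100, 0.3)], 5))
```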
In this paper, we expand on this idea. We first introduce the design dimensions for inputs of artifact-creation systems. These include the types of support one wants with an AI tool, considerations of algorithmic uncertainty, the precision of input, and ease of iteration on algorithms and inputs. Using these design dimensions, we characterize different existing input types for generative human-AI co-creation. We specifically discuss how our example system, TaleBrush [10], adopts sketched inputs to facilitate iterative human-AI co-creation in story writing. Considering sketching and gestural inputs for control will enable new ways to support human-AI co-creation.

2. Design Dimensions of Artifact-Creation Support

We first consider possible dimensions for designing control interactions for creation support. We scope "creation support" to systems that help creatives directly implement artifacts. We exclude those tools that serve a more indirect role, such as critiquing the created artifact. This boundary is something we have previously considered in surveying the range of tools in the creative space [11]. We propose a focus on three aspects: 1) type of support, 2) algorithm, and 3) input (summarized in Figure 2).

Figure 2: Summary of the design dimensions of artifact-creation support in relation to designing generative co-creation tools. Orange items (transfer and generation) are two supports provided with generative co-creation tools. Items in green (non-deterministic, direct, and indirect) are design elements for generative co-creation tools (our focus is on indirect inputs). In indirect input design, the requirements for generative co-creation tools include: 1) easy and fast iteration and 2) algorithmic uncertainty matching the user's expectation. These align with gestural inputs, considering the ease of interaction and their ability to express the user's ambiguous intentions. Note that the two-dimensional diagram in the characteristics of input is drawn based on the qualitative analysis of different input approaches.

2.1. Type of Support

Creativity support tools (CSTs) are an extremely diverse and broad category, even when restricted to direct influence [11]. Within this category we see tools that can augment, transfer, or generate. Though the last two categories are most relevant to our proposal, augmentation is also worth considering.

CSTs that provide augmentation support often enhance a task the creative is already doing through computational means. Many direct manipulation tools fall into this category. The least 'intelligent' of these replicate existing tools in a digital format. For example, a digital painting canvas has various types of digital brushes. Other augmentation tools provide some limited automation. For example, a bucket tool will flood-fill a closed area in a sketch. Most augmentation tools are highly deterministic. They are "predictable" and more naturally correspond to the user's mental model of what the system will do. When using augmentation-focused tools, the user is firmly in control over both the idea and style of the final artifact.

The second and third types of support, transfer and generation, use variants of generative algorithms. In contrast to the augmentation category, the end-user is ceding some creative control to the tool. Though, of course, the human maintains ultimate control over what makes it into the final artifact.

Transfer tools turn one artifact into another. A common feature is that they receive some 'original' artifact (e.g., a picture, a piece of text, a sketch, etc.) as input. The tool will then act on this input to generate a variant–often some alteration of the original input. A wide range of tools fall into this category, and they are often modality-specific. For example, in the visual design/art space, we see systems that transfer one visual art piece's style to another image [5]. Other tools in the space will transform rough sketches into photorealistic images [9]. As with image-based style transfer, we find similar approaches for text where the software can transform the written input to the style of a particular author [12].
Though target styles are commonly required, not all transfer tools need them. In music, for example, there are tools that transform some input piece of music by adding effects like delay or compression [13].

Finally, we observe tools focused on generation. With these, the algorithm generates content from incomplete inputs or those of a different modality. For example, an input might be some previous part of the music, story, or some portion of drawings. The algorithm's purpose is not to change this initial input, but rather to add to it. In music and text, these algorithms continue the content from the user-provided 'start' or 'infill' when given some start and end states [14, 15, 16, 17]. In visual arts, we most often find this type of algorithm in systems that can fill empty spaces in an image [18, 19]. Note that many tools sit somewhere between transfer and generation and may depend on how the underlying task is defined. For example, we might have a tool that automatically colors a part of an image. From the perspective of the whole image, this may be transfer (especially if the input is some color palette or color model). However, because we are also generating new colors, we might treat the colorization task as generative.

Different types of tools will require different types of controls. However, there are similarities in user needs and expectations (e.g., surprise and novelty but also a willingness to cede some creative control to the software). This is in contrast to non-creative applications (e.g., predictive form filling) where ambiguity and surprise are undesirable. As we argue below, gestural and sketched inputs hold promise here.
2.2. Algorithm

2.2.1. Type: Algorithmic Uncertainty

A tool's algorithmic pipeline can differ depending on how certain we are of the pipeline's output. Deterministic algorithms are one extreme in that users can predict the result when using these algorithms. Direct manipulation implementations are, naturally, one example. When a box is dragged with a mouse cursor, the end-user knows where it will end up. Automated algorithms with clear rules are also deterministic. For example, with flood-fill (e.g., a bucket tool), the user knows the system will fill closed areas. If something goes wrong, the user can quickly isolate the problem.

On the other extreme are non-deterministic algorithms, which include many machine learning (ML) algorithms. Though powerful, the inferences made by these algorithms lead to increased uncertainty and failures. For example, in comic colorization, 'flatting' is the process of automatically creating colored polygons under different parts of the line art (e.g., one for the face, one for the shirt, etc.). The algorithm for automated flatting makes inferences about shapes even when they are not 'closed' in an expected way. For example, creases in a drawing of a shirt may lead a poorly designed algorithm to make too many polygons or not connect them appropriately. Ideally, the system will produce one polygon that encapsulates the entire shirt. However, current flatting software is imperfect and can make the wrong inference. The algorithm's uncertainty in what makes up the object can lead to unexpected bleeding [20]–a failure case.
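To ground the contrast between these two extremes: a rule-based flood fill is fully specified by its rule, so the same canvas and the same seed always yield the same result. A minimal sketch of such a deterministic fill follows (illustrative only, not any particular tool's code):

```python
from collections import deque

def flood_fill(grid, row, col, new_color):
    """Deterministically recolor the connected region containing (row, col).

    The behavior is fully specified by the rule: recolor every cell
    reachable from the seed through same-colored 4-neighbors.
    """
    old_color = grid[row][col]
    if old_color == new_color:
        return grid
    queue = deque([(row, col)])
    while queue:
        r, c = queue.popleft()
        if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == old_color:
            grid[r][c] = new_color
            queue.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return grid
```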
However, in the creative setting, and specifically for generative algorithms, uncertainty can be a feature (rather than a failure). Or, more precisely, the line between a novel, desirable result and an error is not necessarily clear-cut. There is rarely a single gold standard for what should be generated, and the user might subjectively decide whether the output fits their goal.

2.2.2. Characteristics: Ease of Algorithmic Iteration

Iterative design is important in creating artifacts. This is mainly due to the explorative nature of the task. How easy it is to iterate depends on the algorithm's properties. First among these is latency–the time taken by the algorithm for each cycle. The lower the latency, the easier the iteration. For example, when moving a box with a mouse cursor, the iteration is real-time as the box's position instantly updates with the user's movement. On the other hand, many generative algorithms take significant time to generate artifacts, considerably slowing down iteration.

A second algorithmic aspect that impacts iteration is scope. Here, we define scope as relating to what parts of the artifact are iterated on. For example, with the direct manipulation of the box, only the position is changing, but not the color or size. Style transfer algorithms [21] are at the other extreme. With every run of these algorithms, the entire image (or many parts of it) will change.

Iteration naturally connects back to algorithmic uncertainty. If the user better understands the algorithm's behavior (e.g., what the transfer algorithm changes and how), iteration may become easier. With high uncertainty, the user may need to iterate many times to get the effect or artifact they want.

2.3. Input

2.3.1. Type: Input Directness

An input method targets the artifact directly or indirectly [11]. With direct input, the end-user indicates the artifact or subject 'target.' Because of this directness, inputs are usually in the same medium as the target artifact. In some situations, a portion of the artifact can also be used as a direct input. For example, we can select a portion of the image or the story. At the other extreme are those inputs that do not directly impact the artifact but may give broad instructions on how the tool should implement something. The simplest example might be a slider control for some parameters. The user isn't touching the artifact directly (i.e., the story or image) but the change in the slider guides the tool. The modality of indirect input can be far from the medium (e.g., visual arts as artifacts and numbers as inputs). As with our introductory example, abstract visual encodings can also be used for indirect inputs. In that example, the end-user drew the character's fortune to produce text.

2.3.2. Characteristics: Input Precision

While there are numerous input approaches for artifact-creation systems, they vary on the spectrum of precision. These varying levels are helpful in different contexts. The most traditional type of widget receives one specific value. Examples include a number in a slider or a category in a dropdown box. With this precise control, users will expect the output to react precisely.

Not all inputs need to be precise. Natural language prompts are one example and can handle a wider range of input precision [22, 23]. Roughly specified language would be imprecise, but at the same time, allow a high degree of freedom in how it can be interpreted. For example, asking for a "rough texture" can mean many things–anything from Jackson Pollock's chaotic style to Van Gogh's impressionism. However, language can support finer control. For example, if we say "move the selected square 3 pixels left," this does not leave much room for misinterpretation.

At the imprecise end, we often find examples as inputs [24, 21]. While they are often used as direct material for transfer (e.g., the source of style in visual style transfer), it is up to the algorithm to determine, if it can, which aspects of the input should be followed closely and which are only suggestions. For example, when transferring the style of Van Gogh's The Starry Night, it may not be clear whether the user wants the colors or textures to be transferred. Adding more examples might make the target clearer. However, it may be hard for the user to determine which attributes overlap between the examples and which are ambiguous. The interaction with uncertain algorithms makes this problem even more complex as it is not obvious if the issue is with the input or the inherent ambiguity of the system.

As with language prompts, gestural inputs, such as sketches or hand gestures, can also have a wide range of precision. For example, gestural inputs for direct manipulation require outputs to follow the given input exactly. When resizing a box in graphics editors, users expect the box to follow the cursor they are moving. However, sketches can be used for low-precision input. For example, sketches can express flexible and lightweight ideas with their roughness, ambiguity, and uncertainty [7, 6]. Similarly, hand gestures have been used to provide imprecise but intuitive and flexible inputs, such as serving as rough scaffolds in 3D modeling [25]. As we see in these examples, gestural inputs can be designed to provide high intuitiveness and flexibility and be traded off against precision.

We note that input precision is often related to input difficulty. As we know from psychophysical properties such as Fitts's Law [26], certain input precision comes at the cost of time or difficulty. Lower precision interactions, such as gestures, can often lower interaction difficulty.
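For reference, the Shannon formulation of Fitts's law used by MacKenzie [26] models the movement time MT to acquire a target of width W at distance D as

    MT = a + b \log_2\left(\frac{D}{W} + 1\right)

where a and b are empirically fitted constants and the logarithmic term is the index of difficulty in bits. Shrinking the tolerated target width W raises the index of difficulty, which is one way to read the precision-for-time trade-off above.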
2.3.3. Characteristics: Ease of Input Iteration

Just as we consider the iterative cost at the algorithmic level, it is worth considering it at the input level as well. Though these two might be tied, a tool might have relatively small back-end iterative costs but widely diverging front-end costs. For example, the algorithm itself might run quickly, but generating good example inputs may take a long time. Thus, different input approaches vary in how well they support iteration.

Traditional input widgets, such as numerical values on sliders, are relatively easy. With a single slider, the control options given to the user are tightly restricted, and a change in value does not require much effort.
However, even with simple slider widgets, the user needs to decide if a control should be changed and then make the actual change to the correct value. As the number of input controls grows, so does iteration cost.

Other types of inputs, such as natural language prompts or examples, can increase iterative costs. This is mainly due to the vast space of options for these modalities. For natural language prompts, the user needs to come up with better wordings or more specific details for the prompts. This can be tricky if the user is to express differences in degree (e.g., how would one use language to express the level of roughness of the texture in a painting?). Similarly, iterating with examples is difficult because the user needs to search for more or better examples. If such an example can't be easily found or created, the user will struggle to iterate.

Gestural or sketch inputs can help with iteration. While these input modalities come at the cost of precision, gestural input is flexible and intuitive, with lowered cognitive demands. These properties can reduce iteration time. For example, the user can erase and redraw a portion of the sketch to quickly change the specification.

3. Designing Generative Co-Creation Tools

The design dimensions above represent a large design space. However, we can begin to consider points in the space that are either required, or are more suitable, for co-creation tools.

3.1. Requirements for Generative Co-creation Tools

As we argued above, generative algorithms are usually used to support transfer or generation. Additionally, these systems have increasingly leaned towards machine-learning-based approaches. Thus, we are largely working in the non-deterministic algorithmic space, and this implies a couple of key requirements for tools.

First, iteration should be easy and fast. In creative tasks, iteration and exploration are necessary as they expose the artist to more options and, eventually, a concretization of 'direction' [27, 28]. Thus, users of creative tools often want to be able to iterate, which is well-aligned with the reality that to use non-deterministic tools, they need to iterate. Unfortunately, sometimes the cost of iteration becomes high. Thus, tools should either act to speed up iterations or, if that is not possible, reduce the number of iterations needed. In both situations, reducing the iteration cost is critical.

Second, algorithmic uncertainty should match the user's expectations. With standard algorithms, we would only need to worry about the user's expectations of how their input and deterministic output relate. For example, dragging an icon into the trash would lead to it being deleted. However, with non-deterministic algorithms, instead of a specific output, they would need to model a range of possible outputs. Without this understanding, end-users are likely to be dissatisfied with the results. They will also find it difficult to model how their input choices will lead to a better, or more certain, output. Our advantage in creative tools is that some degree of uncertainty is actually a desired property. Our goal is not necessarily to make the tool appear deterministic, as creativity often requires 'surprise.' Thus, users both want and expect some level of (controlled) uncertainty. A user may be willing to make a rough specification. At the same time, they would understand and expect that the tool will have some degrees of freedom within that space. Note that none of this is to say that we need to force the algorithms to match the user's expectations. In some cases, users might not have well-defined expectations. In other cases, we may change their expectations.

3.2. Algorithmic Design

There are various ways to approach the iteration and uncertainty problems on the algorithmic side. For example, by adding extensive controllability features, we can provide the user with fine-grained controls for steering the behavior of the generative algorithms. However, this requires building algorithms that can actually accept all these controls.

On the positive side, detailed control may reduce the number of iterations, at least from the algorithmic perspective.
That is, fine-grained controls would reduce the ambiguity of the input and enable the generative system to produce a more targeted response. Detailed controls also work to 'teach' the end-user how to model and direct the underlying algorithm. Their expectations of system capabilities would come to be more in alignment with reality with fewer iterations. Of course, reducing the latency of the algorithm would also facilitate the ease of iteration. Clever designs, such as using smaller models before executing more costly larger ones, may help here.
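One way to read this 'smaller model first' design is as a draft-and-filter cascade: a cheap model proposes many candidates, and the costly model is only spent on the few that pass a quick screen. The sketch below is our own minimal illustration of that idea; all names and signatures are hypothetical stand-ins, not any specific system's API.

```python
def cascade_generate(prompt, cheap_model, costly_model, screen, k=8):
    """Draft candidates with a small model; spend the large model on survivors.

    cheap_model, costly_model: callables that generate text (hypothetical
    stand-ins for a small, low-latency model and a large, slow one).
    screen: a fast predicate that discards clearly-off drafts.
    """
    drafts = [cheap_model(prompt) for _ in range(k)]
    shortlist = [d for d in drafts if screen(d)] or drafts[:1]
    # Only the shortlisted drafts incur the large model's latency.
    return [costly_model(prompt, draft) for draft in shortlist]
```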
However, shifting the responsibility of satisfying our requirements to the algorithmic side exclusively is not realistic. Regardless of the algorithm, many bottlenecks for iteration are on the interaction side. Increasing the number of controls may be cognitively costly for the end-user. This is not to say that fewer controls or simpler inputs, such as examples or prompts, reduce cognitive cost. The cognitive cost of figuring out how to change or create an example can be equally bad. When coupled with the specific demands of creative applications–that we want some iteration and some surprise–achieving a 'sweet spot' through algorithmic means alone seems implausible. Put another way, simply changing the algorithm can't solve our problem if the interface costs are high or the user's requirements for a creative tool are unmet.

3.3. Input Design

To achieve our requirements, we argue that interaction is a critical factor. We focus on possible input approaches. These will naturally range based on the type of input directness.

3.3.1. Direct Inputs: A Small Space of Design

Direct inputs are usually made in the same medium as the artifact. For example, we might use low-fidelity sketches on the drawing surface when the target artifact is visual. These sketches will then be transferred to high-fidelity images on (essentially) the same surface/encoding [9]. In other cases, the algorithm may simply append elements to the sketch [18]. This type of input serves as the 'material' for the generation–where the transfer is applied or what the generative algorithms build upon.

While direct inputs may depend on the application domain, their specific type may be largely constrained to a small design space. This is largely because the representation depends on the target artifact's medium (e.g., a drawing canvas). Additionally, the interactions are constrained by the underlying algorithm. For example, we may train an algorithm to produce a photo-realistic image given a low-resolution sketch. Such datasets are more readily available and easier to produce. The user-facing input modality and form are thus constrained to something that looks like the training data. Finally, direct inputs create a set of expectations for the end-user that need to be maintained in the interactive controls. Because of these constraints, which may limit our design space options, we move to consider indirect inputs.

3.3.2. Indirect Inputs: Approaches and Their Limitations

Indirect inputs serve as instructions for both what and how to generate. Unlike direct inputs, they are not dependent on the artifact's medium. One can work in abstract spaces or through abstract representations. Thus, there is often more freedom to design with indirect inputs. The consequence of freeing ourselves from the constraints of the domain also enables us to consider additional algorithm types. This flexibility further affords a better ability to match the end user's high-level intentions rather than forcing them to work within their algorithm and interface constraints. However, this is not to say that all indirect inputs are good ones. A novel indirect interaction might be further from the target artifact's modality and thus might be harder to master when the mapping is complex. A poorly designed indirect interaction can also increase cognitive costs and reduce the ability to iterate. Altogether, indirect inputs open up a vast space of possibilities but introduce various pitfalls.

To better understand which aspects may help or hinder, we focus on three types of inputs: traditional input widgets, natural language prompts, and gestural inputs. Traditional input widgets, such as sliders for numerical inputs, represent the simplest option. If we are able to use these in the interface, it often means that we can directly map the user's expectation to the actual behavior of the system. In reality, this depends on how the end-user understands the construct represented with the widget. If the label on the slider is ambiguous (e.g., this will control the 'brightness' of the text) or novel (e.g., this will control the 'certitudeness' of the text), the user may struggle with the control. Clearly, this will improve with experience, as the user calibrates to the system. More critical, however, are situations where a user only has a rough idea of what they want to generate. Here, a standard input widget may be insufficient. As critically, the algorithms themselves may not deliver on the precision of the input. Thus, the interface is over-promising. The ease of iteration with input widgets largely depends on the complexity of the interface. A single slider is simple, but many controls and interactions will naturally become more challenging.

Recent generative co-creation systems have enabled indirect natural language prompts as input. For example, natural language prompts can steer vision-language models to generate visual images [22, 23]. As these approaches can be used with imprecision or ambiguity, they are useful for giving high-level specifications on generations. However, they would be difficult to iterate with due to the vast space of inputs.

Surprisingly, few human-AI co-creation tools have used gestural or sketch interactions for indirect control. We argue that this is a missed opportunity as there are several benefits to this approach. In the next section, we expand on this possibility and why it may be appropriate.

4. Gestural Indirect Inputs for Generative Co-Creation

We propose that gestural or sketch-based interfaces for indirect specification satisfy our requirements for co-generation tools. At the very least, this approach may complement other input controls. First, simple gestural interactions (e.g., producing a rough sketch) are easy to iterate with. This characteristic can complement more effortful controls such as prompts or examples. Moreover, the multi-dimensional characteristics of sketches and gestures can reduce the effort to interact with multiple attributes simultaneously. For example, 2D sketching coupled with pressure and speed recognition can be used to encode multiple parameters simultaneously. This flexibility also means that we can work in abstract visual encodings. Finally, as we have argued before, gestural inputs convey the sense of being rough and flexible. This strongly aligns with the non-determinism of the algorithm and the ambiguity of the user's intent. Moreover, it can convey the 'unfinished' nature of the generative process.

As a demonstration of the feasibility of this approach, we describe our system TaleBrush [10] (Figure 1). TaleBrush is a human-AI story co-creation tool that generates story sentences according to the specifications of the protagonist's fortune. For example, if we were describing Cinderella's fortune, we might say that her fortune started low (with her stepmother and sisters), improved greatly as she went to the ball, collapsed as she was forced to flee, and then improved again when the prince found her.

TaleBrush allows the user first to input a portion of the story (direct input) in the text box (Figure 1A1). Then, they can sketch out the protagonist's fortune in a 2-dimensional line sketch as in Figure 1A2. This is roughly a standard time series with the x and y axes standing for sequence position and fortune levels, respectively. Using this sketched line (which is actually represented as a sketch rendering), TaleBrush will generate a story (Figure 1B1). Because the underlying algorithm is ambiguous and may not precisely match the desired fortune sketch, the best matching generation is also displayed in the visualization (Figure 1B2). Technically, this sketch-based control is powered by steering a big pretrained language model with a smaller language model that receives the sketch position as a control code. More details can be found in Chung et al. [10].
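As a rough illustration of what such steering can look like (in the spirit of generative-discriminator approaches; this is our own simplified rendering, not the exact TaleBrush method), a small class-conditional model can reweight the large model's next-token distribution toward the desired fortune level:

```python
def steered_logits(big_logits, small_cond_logits, small_uncond_logits, omega=1.0):
    """Bias a large LM's next-token logits with a small conditional LM.

    big_logits: the pretrained LM's logits, one per vocabulary item.
    small_cond_logits / small_uncond_logits: a small LM's logits with and
    without the control code (e.g., the sketched fortune value).
    omega: steering strength; omega = 0 recovers the unsteered model.
    """
    # log p_small(token | control) - log p_small(token) acts as a
    # per-token bias toward tokens that signal the desired control.
    return [b + omega * (c - u)
            for b, c, u in zip(big_logits, small_cond_logits, small_uncond_logits)]
```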
With TaleBrush, the benefits of gestural inputs hold. First, iteration on the generation is easy. The user only needs to redraw parts of the sketch. Additionally, a single drawn line expresses two dimensions simultaneously: (1) where in the story and (2) at what fortune level the sentence should be. Notably, the first (position) is a direct input, whereas the fortune level represents an indirect one. In reality, we also use the speed at which the sketched line is drawn to indicate how much ambiguity the user will tolerate in the generated result. This is visually represented in the thickness of the line. A thinner line indicates the user wants a better match. Internally, this is implemented by regenerating the sentences multiple times and finding the one that best matches the desired fortune level. This visualized boundary further emphasizes the ambiguity and non-determinism of the algorithm. Note that a 'sketch' does not necessarily mean a 'sketchy appearance.' However, we have opted to use this rendering aesthetic to further lower the user's expectations that the algorithm should be precise [7, 6].
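This regenerate-and-pick-best loop can be summarized as rejection sampling against the sketched target, with the line's width acting as the acceptance tolerance. The following is a simplified sketch with hypothetical helper names; the actual procedure is described in Chung et al. [10].

```python
def generate_matching_sentence(generate, estimate_fortune,
                               target, tolerance, max_tries=10):
    """Regenerate until a sentence's estimated fortune is close enough.

    generate: samples one candidate sentence (nondeterministic).
    estimate_fortune: maps a sentence to a fortune score in [0, 1].
    tolerance: half-width of the sketched envelope; a slow, careful
    stroke yields a thinner line and therefore a smaller tolerance.
    """
    best, best_err = None, float("inf")
    for _ in range(max_tries):
        candidate = generate()
        err = abs(estimate_fortune(candidate) - target)
        if err < best_err:
            best, best_err = candidate, err
        if err <= tolerance:  # close enough to the sketch: accept early
            break
    return best
```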
4.1. Design Approaches for Gestural Indirect Inputs

Implementing TaleBrush has given us some insight about what may work well (and poorly) for indirect gestural inputs.

4.1.1. Ease of Iteration on Input

Combine direct and indirect input if possible. For some generation tools, indirect inputs may be sufficient. For example, if the tool generates any character biographies based on the good-evil nature of the character, then it might not require direct inputs. However, as with TaleBrush, certain tasks require control with direct input. The user needs to be able to indicate "where the generation should be done" (e.g., where in the story a certain fortune level should exist, or what does the start of the story look like?). In some cases, as we did with TaleBrush, the indirect and direct controls can be combined into a single gestural sketch. That is, with a single brushstroke, the sequential position (x position–the 'where') and the level of the protagonist's fortune (y position–the 'how') are both specified.

Complement hard-to-iterate control inputs (language, examples). Spatial positions by themselves do not necessarily convey meaning. They are meaningful when combined with semantic structures that can be put on a continuous scale. For example, TaleBrush takes a restricted set of numerical semantics: whether the character's fortune is good or ill. However, this design can be extended to receive qualitative inputs as the endpoints of the axes. For example, the user can give natural language prompts or examples on each end and explore the confined space with gestural inputs. This complements the limitations and features of different input approaches. Language prompts and examples lack the ease of iteration, which is the strength of gestural inputs. On the other hand, gestural inputs lack semantics, which language prompts and examples can convey.

4.1.2. Matching Input Precision with Algorithmic Uncertainty

Match the algorithmic precision with the input precision. To have better expectations of how the algorithm will behave, the user should ideally be aware of the precision of the algorithm. Gestural inputs can be designed to convey this information. For example, in TaleBrush, this level of precision is reflected in the width of the sketched line. This was designed to match the median error from the test dataset we used during development. Thus, the interaction and representation can be used in ways that reduce ambiguity and help to match (and manage) expectations.

Controlling the precision. When using gestural inputs, the user's intention regarding precision might vary. For example, in TaleBrush, the user might have wanted the algorithm to follow the specification more tightly when they drew the line with more care. The system can be designed to leverage other input dimensions to control the precision to reflect these intentions better. In TaleBrush, the sketching speed was used to decide how tightly generation should be done. That is, if the user drew slowly, we assumed this indicated that they wanted a better fit (represented as a thinner error envelope). In this way, gestural interactions and representations can be used to align input precision with system capabilities.

5. Conclusion

In this position paper, we have explored where generative algorithms sit in the overall design space of co-creative tools. We have further isolated those properties that are desirable and potentially required for supporting human-AI co-creation. Our focus was on how inputs (both the 'what' and the 'how') can interact with underlying algorithms. Our focus on enabling iteration and managing expectations allowed us to consider the pros and cons of different input types. Ultimately, we argued that gestural and sketch-based interactions would work well for the control of generative algorithms. We showcased the benefits of this approach with TaleBrush. We believe that there are significant possibilities opened up by using abstract visual representations when coupled with novel interaction types.

Acknowledgments

We thank our reviewers for providing helpful feedback on this work.

References

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014. URL: https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

[2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[3] Y. Wu, E. Manilow, Y. Deng, R. Swavely, K. Kastner, T. Cooijmans, A. Courville, C.-Z. A. Huang, J. Engel, MIDI-DDSP: Detailed control of musical performance via hierarchical modeling, 2021. arXiv:2112.09312.

[4] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever, Zero-shot text-to-image generation, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8821–8831. URL: https://proceedings.mlr.press/v139/ramesh21a.html.

[5] L. A. Gatys, A. S. Ecker, M. Bethge, Image style transfer using convolutional neural networks, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, USA, 2016, pp. 2414–2423. doi:10.1109/CVPR.2016.265.

[6] M. D. Gross, E. Y. Do, Ambiguous intentions: A paper-like interface for creative design, in: ACM Symposium on User Interface Software and Technology, ACM, 1996, pp. 183–192.

[7] J. Landay, B. Myers, Sketching interfaces: toward more human interface design, Computer 34 (2001) 56–64. doi:10.1109/2.910894.

[8] M. Eitz, J. Hays, M. Alexa, How do humans sketch objects?, ACM Trans. Graph. (Proc. SIGGRAPH) 31 (2012) 44:1–44:10.

[9] S.-Y. Chen, W. Su, L. Gao, S. Xia, H. Fu, DeepFaceDrawing: Deep generation of face images from sketches, ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2020) 39 (2020) 72:1–72:16.

[10] J. J. Y. Chung, W. Kim, K. M. Yoo, H. Lee, E. Adar, M. Chang, TaleBrush: Sketching stories with generative pretrained language models, Association for Computing Machinery, New York, NY, USA, 2022.

[11] J. J. Y. Chung, S. He, E. Adar, The intersection of users, roles, interactions, and technologies in creativity support tools, in: Conference on Designing Interactive Systems, ACM, 2021, pp. 1817–1833.

[12] B. Syed, G. Verma, B. V. Srinivasan, A. Natarajan, V. Varma, Adapting language models for non-parallel author-stylized rewriting, 2020. arXiv:1909.09962.

[13] C. J. Steinmetz, J. D. Reiss, Steerable discovery of neural audio effects, 2021. arXiv:2112.02926.

[14] R. Louie, A. Coenen, C. Z. Huang, M. Terry, C. J. Cai, Novice-AI music co-creation via AI-steering tools for deep generative models, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–13. URL: https://doi.org/10.1145/3313831.3376739. doi:10.1145/3313831.3376739.
[15] C.-J. Chang, C.-Y. Lee, Y.-H. Yang, Variable-length music score infilling via XLNet and musically specialized positional encoding, 2021. arXiv:2108.05064.

[16] P. Ammanabrolu, W. Cheung, W. Broniec, M. O. Riedl, Automated storytelling via causal, commonsense plot ordering, CoRR abs/2009.00829 (2020). URL: https://arxiv.org/abs/2009.00829.

[17] A. Fan, M. Lewis, Y. Dauphin, Strategies for structuring story generation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2650–2660. URL: https://aclanthology.org/P19-1254. doi:10.18653/v1/P19-1254.

[18] J. E. Fan, M. Dinculescu, D. Ha, Collabdraw: An environment for collaborative sketching with an artificial agent, in: Proceedings of the 2019 on Creativity and Cognition, C&C '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 556–561. URL: https://doi.org/10.1145/3325480.3326578. doi:10.1145/3325480.3326578.

[19] Y. Lin, J. Guo, Y. Chen, C. Yao, F. Ying, It is your turn: Collaborative ideation with a co-creative robot through sketch, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–14. URL: https://doi.org/10.1145/3313831.3376258. doi:10.1145/3313831.3376258.

[20] C. Yan, J. J. Y. Chung, K. Yoon, Y. Gingold, E. Adar, S. R. Hong, FlatMagic: Improving flat colorization through AI-driven design for digital comic professionals, Association for Computing Machinery, New York, NY, USA, 2022.

[21] L. Sheng, Z. Lin, J. Shao, X. Wang, Avatar-Net: Multi-scale zero-shot style transfer by feature decoration, in: Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018, pp. 1–9.

[22] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, M. Chen, GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, 2021. arXiv:2112.10741.

[23] F. Huang, J. F. Canny, Sketchforme: Composing sketched scenes from text descriptions for interactive applications, in: Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, UIST '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 209–220. URL: https://doi.org/10.1145/3332165.3347878. doi:10.1145/3332165.3347878.

[24] E. Frid, C. Gomes, Z. Jin, Music creation by example, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–13. URL: https://doi.org/10.1145/3313831.3376514. doi:10.1145/3313831.3376514.

[25] Y. Kim, S.-G. An, J. H. Lee, S.-H. Bae, Agile 3D sketching with air scaffolding, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1–12. URL: https://doi.org/10.1145/3173574.3173812.

[26] I. S. MacKenzie, Fitts' law as a research and design tool in human-computer interaction, Hum.-Comput. Interact. 7 (1992) 91–139. URL: https://doi.org/10.1207/s15327051hci0701_3. doi:10.1207/s15327051hci0701_3.

[27] T. M. Amabile, The social psychology of creativity: A componential conceptualization, Journal of Personality and Social Psychology 45 (1983) 357.

[28] T. M. Amabile, Componential theory of creativity (2012).