Cococo: AI-Steering Tools for Music Novices Co-Creating with Generative Models

Ryan Louie, Northwestern University, Evanston, IL (ryanlouie@u.northwestern.edu)
Andy Coenen, Google Research, Mountain View, CA (andycoenen@google.com)
Cheng Zhi Huang, Mountain View, CA (chengzhiannahuang@gmail.com)
Michael Terry, Google Research, Cambridge, MA (michaelterry@google.com)
Carrie J. Cai, Google Research, Mountain View, CA (cjcai@google.com)

ABSTRACT
In this work¹, we investigate how novices co-create music with a deep generative model, and what types of interactive controls are important for an effective co-creation experience. Through a needfinding study, we found that generative AI can overwhelm novices when the AI generates too much content, and can make it hard to express creative goals when outputs appear to be random. To better match co-creation needs, we built Cococo, a music editor web interface that adds interactive capabilities via a set of AI-steering tools. These tools restrict content generation to particular voices and time measures, and help to constrain non-deterministic output to specific high-level directions. We found that the tools helped users increase their control, self-efficacy, and creative ownership, and we describe how the tools affected novices' strategies for composing and managing their interaction with AI.

CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI); User studies; Collaborative interaction.

ACM Reference Format:
Ryan Louie, Andy Coenen, Cheng Zhi Huang, Michael Terry, and Carrie J. Cai. 2020. Cococo: AI-Steering Tools for Music Novices Co-Creating with Generative Models. In IUI '20 Workshops, March 17, 2020, Cagliari, Italy. ACM, New York, NY, USA, 6 pages.

¹ This workshop paper is a shortened summary of the full CHI'20 paper [10].
² This work was completed during the first author's summer internship at Google.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1    INTRODUCTION
Recent generative music models have made it conceivable for novices to create an entire musical composition from scratch, in partnership with a generative model. For example, the widely available Bach Doodle [9] sought to enable anyone on the web to create a four-part chorale in the style of J.S. Bach by writing only a few notes, allowing an AI to fill in the rest. While this app makes it possible for even novices with no composition training to create music, it is not clear how people perceive and engage in co-creation activities like these, or what types of capabilities they might find useful.

In a need-finding study we conducted to understand the novice-AI co-creation process, we found that generative music models can sometimes be quite challenging to co-create with. Novices experienced information overload, in which they struggled to evaluate and edit the generated music because the system created too much content at once. They also struggled with the system's non-deterministic output: while the output would typically be coherent, it would not always align with users' musical goals at the moment. Having surfaced these challenges, this paper seeks to understand what interfaces and interactive controls for generative models are important in order to promote an effective co-creation experience.

As a step towards explicitly designing for music novices co-creating with generative models, we present Cococo (collaborative co-creation), a music editor web interface for novice-AI co-creation that augments standard generative music interfaces with a set of AI-steering tools: 1) Voice Lanes that allow users to define for which time-steps (e.g., measure 1) and for which voices (e.g., soprano, alto, tenor, bass) the AI generates music, before any music is created; 2) an Example-based Slider for expressing that the AI-generated music should be more or less like an existing example of music; 3) Semantic Sliders that users can adjust to direct the music toward high-level directions (e.g., happier/sadder, or more conventional/more surprising); and 4) Multiple Alternatives for the user to select between a variety of AI-generated options. To implement the sliders, we developed a soft priors approach that encodes desired qualities specified by a slider into a prior distribution; this soft prior is then used to alter a model's original sampling distribution, in turn influencing the AI's generated output.

In a summative evaluation with 21 music novices, we found that AI-steering tools not only increased users' trust, control, comprehension, and sense of collaboration with the AI, but also contributed to a greater sense of self-efficacy and ownership of the composition relative to the AI. We also reveal how AI-steering tools affected novices' co-creation process, such as by working with smaller, semantically meaningful components and reducing the non-determinism in AI-generated output. Together, these findings inform the design of future human-AI interfaces for co-creation.




Figure 1: Users of Cococo can manually write some notes (A), specify which voices and in which time range to request AI-generated music using Voice Lanes (B), click Generate (C) to infill the music given the existing notes, constrain generation along specific dimensions of interest using the Semantic Sliders (D) and Example-based Slider (E), or audition Multiple Alternatives (F) of generated output by selecting a sample thumbnail to temporarily substitute it into the music score (shown as glowing notes in this figure (G)). Users can also use the Infill Mask (H) to crop a section of notes to be infilled again using AI.


2    NOVICES' NEEDS FOR CO-CREATION
To understand challenges when composing music with generative models, we conducted a 25-minute needfinding study with 11 music composition novices. We observed novices use a tool that mirrored conventional interfaces for composing music with deep generative models [9].

Participants experienced information overload: they struggled to evaluate the generated music due to the amount of AI-generated content. Participants struggled to identify which note was causing a discordant sound after multiple generated voices were added to their original. Participants were naturally inclined to work on the composition "bar-by-bar or part-by-part"; in contrast to these expectations, the generated output felt like it "skipped a couple steps" and made it difficult to follow all at once.

Participants struggled to express desired musical objectives due to the AI's non-deterministic output. Even though the generated output often sounded harmonious, participants felt incapable of giving feedback about their goals in order to constrain the kinds of notes the model generated. Participants likened this frustrated feeling to "rolling dice" to generate a desired sound, and instead wished to control generation based on relevant musical objectives.
                                                                        to work with. This was designed to address information overload
3    COCOCO                                                             caused by Coconet’s default capabilities to infill all remaining voices
Based on identified user needs, we developed Cococo (collabo-           and sections at a time. For example, a user can request the AI to
rative co-creation), a music editor web-interface 3 for novice-AI       add a single accompanying bass line to their melody by highlight-
co-creation that augments standard generative music interfaces          ing the bass (bottom) voice lane for the duration of the melody,
                                                                        prior to clicking the generate button (Figure 1B). To support this
3 https://github.com/pair-code/cococo                                   type of request, we pass a custom generation mask to the Coconet
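As a rough illustration of this kind of request (a hypothetical helper, not the actual Cococo/Coconet API), a Voice Lanes selection can be thought of as a boolean generation mask over the 4 voices and 32 timesteps, marking only the cells the model should fill in:

```python
import numpy as np

NUM_VOICES = 4   # Soprano, Alto, Tenor, Bass
NUM_STEPS = 32   # 2 measures of sixteenth-note timesteps

def voice_lane_mask(voices, start_step, end_step):
    """Mark the (voice, timestep) cells the model should infill.

    `voices` are row indices (0=Soprano ... 3=Bass); cells outside the
    mask are treated as fixed context and left untouched.
    """
    mask = np.zeros((NUM_VOICES, NUM_STEPS), dtype=bool)
    for v in voices:
        mask[v, start_step:end_step] = True
    return mask

# Ask the AI for a single accompanying bass line under the full 2-measure melody.
bass_mask = voice_lane_mask(voices=[3], start_step=0, end_step=32)
```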


3.0.2 Multiple Alternatives. Cococo provides affordances for auditioning multiple alternatives generated by the AI. This capability was designed based on formative feedback, in which users wanted a way to cycle through several generated suggestions to decide which was the most desirable. Users first choose the number of alternatives to be generated (Figure 1C), audition each alternative by clicking on the different preview thumbnails (Figure 1F), and listen to an alternative which is substituted within the larger musical context (Figure 1G).

3.0.3 Example-based Slider. While prototyping the Multiple Alternatives feature, we found that the non-determinism inherent in Coconet could cause generated samples to be either (1) random and unfocused, or (2) too similar to each other and lacking diversity. As a solution, we developed the example-based slider for expressing that the AI-generated music should be more or less like an existing example of music. Before this slider is enabled, the user must select a reference example chunk of notes. Example-based sliders use soft priors (Section 3.0.5) to guide music generation.
3.0.4 Semantic Sliders. We implemented two semantic sliders in Cococo (Figure 1D) to constrain generated output along meaningful dimensions: a conventional vs. surprising slider, and a major (happy) vs. minor (sad) slider. Users can adjust how predictable vs. unusual notes should be using the "conventional" and "surprising" dimensions of the slider. The conventional/surprising slider adjusts the temperature (T) of the sampling distribution [4]. A lower temperature makes the distribution more "peaky," so that notes with higher probabilities in the original distribution become even more likely to be sampled (conventional), while a higher temperature makes the distribution less "peaky" and sampling more random (surprising). The major vs. minor slider constrains generated notes to a happier (major) quality or a sadder (minor) quality. This slider defines a soft prior that adjusts the sampling distribution to have higher probabilities for the most-likely major triad (for happy) or non-major triad (for sad) at each time-step.
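As a minimal sketch of the temperature adjustment just described (our own illustration on a toy distribution, not Cococo's actual implementation), the conventional/surprising slider can be realized by rescaling log-probabilities and renormalizing:

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a categorical distribution over pitches.

    temperature < 1 sharpens the distribution (more "conventional");
    temperature > 1 flattens it (more "surprising").
    """
    logits = np.log(np.asarray(probs, dtype=float) + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())   # subtract max for numerical stability
    return scaled / scaled.sum()

pitch_probs = [0.6, 0.3, 0.1]
print(apply_temperature(pitch_probs, temperature=0.5))  # peakier (conventional)
print(apply_temperature(pitch_probs, temperature=2.0))  # flatter (surprising)
```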
Figure 2: Visualization of using soft priors to adjust a model's sampling distribution. The shape of the distributions is simplified to 1 voice, 7 pitches, and 4 timesteps; in Cococo, the actual shape is 4 voices, 46 pitches, and 32 timesteps.

3.0.5 Soft Priors: a Technique for AI-Steering. The soft prior approach enables the generation of output that adheres to both the surrounding context (encoded in the model's sampling distribution) and additional desired qualities (encoded in a prior distribution). We provide visual intuition for how these distributions interact in Figure 2. More formally, we use the equation below to alter the distribution used to generate outputs:

    p_adjusted(x_{v,t} | x_C) ∝ p_coconet(x_{v,t} | x_C) · p_softprior(x_{v,t})

where p_coconet(x_{v,t} | x_C) gives the sampling distribution over pitches for voice v at time t from Coconet given the musical context x_C (C is the set of (v, t) positions constituting the context), p_softprior(x_{v,t}) encodes the distribution over pitches specified by the user or the AI-steering tool designer (serving as a soft prior), and p_adjusted(x_{v,t} | x_C) gives the resulting adjusted posterior sampling distribution over pitches. The soft priors p_softprior(x_{v,t}) are defined so that notes that should be encouraged are given a higher probability, and those discouraged are given a lower, but non-zero, probability. Since none of the note probabilities are forced to zero, very probable notes in the model's original sampling distribution can still be likely after incorporating the priors, thus making it possible for the model's output to adhere to both the original context and the additional user-desired qualities.
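To make the combination concrete, the sketch below (our own toy example with made-up probabilities, not the production code) multiplies a model distribution by a soft prior and renormalizes, as in the equation above:

```python
import numpy as np

def apply_soft_prior(model_probs, soft_prior):
    """p_adjusted ∝ p_coconet * p_softprior, renormalized over pitches."""
    adjusted = np.asarray(model_probs) * np.asarray(soft_prior)
    return adjusted / adjusted.sum()

# Toy distribution over 4 candidate pitches for one (voice, timestep) cell.
model_probs = np.array([0.50, 0.30, 0.15, 0.05])
# Encourage pitches 1 and 2, discourage (but never zero out) the others.
soft_prior  = np.array([0.10, 0.60, 0.60, 0.10])

print(apply_soft_prior(model_probs, soft_prior))
```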
The example-based and semantic sliders define a soft prior to modulate the model's generated output. When the user sets the example-based slider to more "similar," Cococo defines a soft prior with higher probabilities for notes in the example. Conversely, for a slider setting of more "different," Cococo defines a soft prior with lower probabilities for notes in the example.
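One way such an example-based prior could be constructed is sketched below (an illustrative assumption about the encoding, not the shipped implementation); a hypothetical slider value in [-1, 1] maps "different" to negative values and "similar" to positive values:

```python
import numpy as np

NUM_PITCHES = 46

def example_based_prior(example_pitches, slider, strength=0.8):
    """Soft prior derived from a user-selected example chunk of notes.

    slider > 0 boosts pitches that appear in the example ("similar");
    slider < 0 suppresses them ("different"); because strength < 1,
    no pitch is ever forced to zero probability.
    """
    prior = np.ones(NUM_PITCHES)
    for pitch in set(example_pitches):
        prior[pitch] *= 1.0 + strength * slider
    return prior / prior.sum()

# Pitches (indices) in the selected example; slider pushed toward "similar".
prior = example_based_prior(example_pitches=[10, 14, 17], slider=0.75)
```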
The minor/major slider uses a slightly more complicated approach to define the soft prior distribution. When the user sets the slider to happy (major), for example, Cococo defines the soft prior by asking what is the most likely major triad at each time slice within the model's sampling distribution. The log likelihood of a triad is computed by summing the log probability of all the notes that could be part of the triad (e.g., for the C major triad, this includes all the Cs, Es, and Gs in all octaves). We repeat this procedure for all possible major triads to determine which is the most likely for a time slice. We then repeat this procedure for all time slices to be generated, in order to create our soft prior for the most likely major triads.
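The triad-scoring step can be sketched as follows (a simplified illustration under our own assumption that pitch index i maps to pitch class i mod 12; not the production implementation): score every major triad by summing log-probabilities of its member pitches across octaves, pick the best triad for the time slice, and boost its pitch classes in the soft prior.

```python
import numpy as np

NUM_PITCHES = 46  # assumption: pitch index i has pitch class i % 12

def most_likely_major_triad(pitch_probs):
    """Return the root (0-11) of the major triad best supported by the model."""
    best_root, best_score = None, -np.inf
    for root in range(12):
        triad_classes = {root, (root + 4) % 12, (root + 7) % 12}
        # Sum log-probabilities of all pitches (in any octave) in the triad.
        score = sum(np.log(pitch_probs[i] + 1e-12)
                    for i in range(NUM_PITCHES) if i % 12 in triad_classes)
        if score > best_score:
            best_root, best_score = root, score
    return best_root

def major_triad_soft_prior(pitch_probs, boost=0.9, floor=0.1):
    """Soft prior favoring the most likely major triad, never zeroing any note."""
    root = most_likely_major_triad(pitch_probs)
    triad_classes = {root, (root + 4) % 12, (root + 7) % 12}
    prior = np.array([boost if i % 12 in triad_classes else floor
                      for i in range(NUM_PITCHES)])
    return prior / prior.sum()

# Toy usage: uniform model distribution over one time slice.
prior = major_triad_soft_prior(np.full(NUM_PITCHES, 1.0 / NUM_PITCHES))
```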
4    USER STUDY
We conducted a within-subjects study to compare the user experience of Cococo to that of the conventional interface. The conventional interface is aesthetically similar to Cococo, but does not contain the AI-steering tools. To mirror the most recent deep generative music interfaces, the conventional interface does include the infill-mask feature, which enables users to crop any region of the music and request that it be filled in by the AI [3, 5]. Through a quantitative survey study, we seek to answer RQ1: How do the AI-steering tools in Cococo affect user perceptions of the creative process and the creative artifacts made with the AI? Through qualitative interviews and observations, we seek to understand RQ2: How do music novices apply the AI-steering tools within Cococo in their creative process? What patterns of use and strategies arise?
4.0.1 Method. 21 music composition novices participated in the study. Each participant first completed an online tutorial of the two interfaces on their own (30 minutes). Then, they composed two pieces, one with Cococo and one with the conventional interface, with the order counterbalanced (15 minutes each). As a prompt, users were provided a set of images from the card game Dixit [14] and were asked to compose music that reflected the character and mood of one image of their choosing. This task is similar to image-based tasks used in prior music studies [8]. Finally, they answered a post-study questionnaire and completed a semi-structured interview (20 minutes). So that we could understand their thought process, users were encouraged to think aloud while composing.
4.0.2 Quantitative Measures. For our quantitative questionnaire, we evaluated the following outcome metrics. All items below were rated on a 7-point Likert scale (1=Strongly disagree, 7=Strongly agree) except where noted below.

The following set of metrics sought to measure users' compositional experience. Creative expression: Users rated "I was able to express my creative goals in the composition made using [System X]." Self-efficacy: Users answered two items from the Generalized Self-Efficacy scale [13] that were rephrased for music composition. Effort: Users answered the effort question of the NASA-TLX [6], where 1=very low and 7=very high. Engaging: Users rated "Using [System X] felt engaging." Learning: Users rated "After using [System X], I learned more about music composition than I knew previously." Completeness of the composition: Users rated "The composition I created using [System X] feels complete (e.g., there's nothing to be further worked on)." Uniqueness of the composition: Users rated "The composition I created using [System X] feels unique."
In addition, we evaluated users' attitudes towards the AI. AI interaction issues: Users rated the extent to which the system felt comprehensible and controllable, two key challenges of human-AI interaction raised in prior work on DNNs [12]. Trust: Participants rated the system along Mayer's dimensions of trust [11]: capability, benevolence, and integrity. Ownership: Users rated two questions, one on ownership ("I felt the composition created was mine."), and one on attribution ("The music created using [System X] was 1=totally due to the system's contributions, 7=totally due to my contributions."). Collaboration: Users rated "I felt like I was collaborating with the system."

Figure 3: Results from post-study survey comparing the conventional interface and Cococo, with standard error bars.

5    QUANTITATIVE FINDINGS
Results from the post-study questionnaire are shown in Figure 3. We conducted paired t-tests with Benjamini-Hochberg correction to account for the 15 planned comparisons, using a false discovery rate of Q = 0.05.
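As a sketch of this style of analysis (using SciPy and statsmodels on synthetic ratings, not the authors' actual analysis scripts), the paired tests and Benjamini-Hochberg correction could be run as follows:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical per-participant ratings (21 participants) for each of the
# 15 planned comparisons: (cococo_scores, conventional_scores).
rng = np.random.default_rng(0)
ratings = {f"measure_{i}": (rng.integers(1, 8, 21), rng.integers(1, 8, 21))
           for i in range(15)}

p_values = [stats.ttest_rel(cococo, conventional).pvalue
            for cococo, conventional in ratings.values()]

# Benjamini-Hochberg correction at a false discovery rate of Q = 0.05.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for name, sig in zip(ratings, reject):
    print(name, "significant" if sig else "n.s.")
```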
In regards to users' perceptions of the creative process, we found Cococo significantly improved participants' ability to express their creative goals, self-efficacy, perception of learning more about music, and engagement compared to the conventional interface. No significant difference was found in effort; participants described the two systems as requiring different kinds of effort: while Cococo required users to think about and interact with the controls, the conventional interface's lack of controls made it effortful to express creative goals. Users' perceptions of the completeness of their composition made with Cococo were significantly higher than for the conventional interface; however, no significant difference was found for uniqueness.

The comparisons for users' attitudes towards the AI were all found to be statistically significant: Cococo was more controllable, comprehensible, and collaborative than the conventional interface; participants using Cococo expressed higher trust in the AI, felt more ownership over the composition, and attributed the music to more of their own contributions relative to the AI.

6    QUALITATIVE FINDINGS
In this section, we first report how AI-steering tools supported novices' composing strategies and experience, including 1) working with smaller, semantically meaningful components and 2) reducing non-determinism through testing a variety of constrained settings for generation. We then describe 3) how novices' prior mental models shaped their interaction with the AI.

6.1    Effects of Partitioning AI Capabilities into Semantically-Meaningful Components
AI-steering tools allowed participants to build up the composition from smaller components, bit-by-bit. For example, one participant who used the Voice Lanes said, "I'm trying to get the bass right, then the tenor right, then soprano and alto right, and build bit-by-bit" (P2). Participants who worked bit-by-bit thought about their compositions in semantically-meaningful chunks, such as melody vs. background or separate musical personas. For example, one participant gave the tenor voice an "alternating [pitch] pattern" to express indecision in the main melody, then gave other voices "mysterious... dinging sounds" as a harmonic backdrop (P4).

Working bit-by-bit helped participants feel less overwhelmed and better understand their compositions. For example, those working voice-by-voice could better handle the combination of multiple voices: "As someone who cannot be thinking about all 4 voices at the same time, it's so helpful to generate one at a time" (P2). Participants then became familiar with their own composition during the creation process, which enabled them to more quickly identify the "cause" of problematic areas later on. For example, one participant indicated that "[because] I had built [each voice] independently and listened to them individually," this helped them "understand what is coming from where" (P7).

Through this bit-by-bit process, participants learned how sub-components can combine to achieve desired musical outcomes. For instance, one participant learned that "a piece can become more vivid by adding both a minor and major chord" after they applied the major/minor slider to generate two contrasting, side-by-side chunks (P12).
6.2    Effects of Constraining Non-Determinism in Generated Output
AI-steering tools helped to constrain the non-deterministic output inherent in the generative model. As a result, the tools allowed users to steer generation in desired directions when composing with AI. Multiple Alternatives reduced the uncertainty that AI-generated output would be misaligned with a user's musical goals. Participants could simply generate a range of possibilities, audition them, and choose the one closest to their goal before continuing.

During different phases of the composing process, participants used the sliders to constrain the large space of possibilities that could be generated. The Semantic Sliders were sometimes used to set an initial trajectory for generated music: "Because I was able to give more inputs to [Cococo] about what my goals were, it was able to create some things that gave me a starting point" (P8). Sliders were also used to refine what the AI had already generated: "It was... not dramatic enough. Moving the slider to more surprising, and more minor added more drama at the end" (P5).

Participants constrained generation by setting the sliders to their outer limits. This enabled them to test the boundaries of AI output. For example, one participant moved a slider to the "similar" extreme, then incrementally backed it off to understand what to expect at various levels of the slider: "On the far end of similar, I got four identical generations, and now I'm almost at the middle now, and it's making such subtle adjustments" (P18). In contrast, when using the conventional interface, participants could not as easily discern whether undesirable model outputs were due to AI limits, or a simple luck of the draw.

Participants also used the tools to consider how a specific input configuration affects the limits of AI output. For example, one participant used the Voice Lanes to generate multiple alternatives for a single-voice harmony. This enabled them to consider the limits imposed by specific voice components: "Maybe the dissonance [in the single-voice] is happening because of how I had the soprano and bass... which are limiting it... so it's hard to find something that works" (P15). The Multiple Alternatives capability further enabled this participant to systematically infer, from the observation of multiple poor results generated for the single voice, that the specific configuration of existing voice components was unlikely to produce better results.

6.3    Effects of Users' Prior Mental Models
Participants brought with them prior mental models that impacted how they interacted with the generative model. First, many participants already had a set of primitives for expressing high-level musical goals. For example, higher pitches were used to communicate a light mood, long notes to convey calmness or drawn-out emotions, and a shape of ascending pitches to communicate triumph and escalation. When participants could not find an explicit tool that mapped to their envisioned primitive, they re-purposed the tools as "proxy controls" to enact their strategy. For example, a common pattern was to set the slider to "conventional" to generate music that was "not super fast... not a strong musical intensity" (P9), and to "surprising" for generating "shorter notes... to add more interest" (P15).

In some cases, even use of the AI-steering tools did not succeed in generating the desired quality. For example, the music produced using the "similar" setting was not always similar along the user-envisioned dimension. To overcome these challenges, participants developed a strategy of "leading by example" by populating the input context with the type of content they desired from the AI. For instance, one participant manually drew an ascending pattern in the first half of the alto voice, in the hopes that the AI would continue the ascending pattern in the second half.

Second, several participants believed that the AI model was superior to their skills as novice composers. As such, when specific errors arose during the composing process, they often blamed their own efforts for these mistakes and hesitated to play an active role in the process. While we found evidence that the tools helped improve feelings of self-efficacy (see Quantitative Findings), there were also times when participants doubted their own musical abilities. Novices experienced self-doubt when poor-sounding music was generated based off of their user-composed notes as the input context. For example, one user said, "All the things it's generating sound sad, so it's probably me because of what I generated" (P11). In cases such as this, participants seemed unable to disambiguate between AI failures and their own composing flaws, and placed the blame on themselves.

In other scenarios, novices were hesitant to interfere with the AI music generation process. For instance, some assumed that the AI's global optimization would create better output than had they worked bit-by-bit: "Instead of doing [the voice lanes] one by one, I thought that the AI would know how to combine all these three [voices] in a way that would sound good" (P1). While editing content, others were worried that making local changes could interfere with the AI's global optimization and possibly "mess the whole thing up" (P3). In these cases, an incomplete mental model of how the system functions seemed to discourage experimentation and their sense of self-efficacy.


7    DISCUSSION
7.0.1 Partition AI Capabilities into Semantically-Meaningful Tools. Our results suggest that AI-steering tools played a key role in breaking the co-creation task down into understandable chunks and generating, auditioning, and editing these smaller pieces until users arrived at a satisfactory result. Unexpectedly, novices quickly became familiar with their own creations through composing bit-by-bit, which later helped them debug problematic areas. Interacting through semantically meaningful tools also helped them learn more about music composition and effective strategies for achieving particular outcomes (e.g., the effect of a minor key in the composition). Ultimately, AI-steering tools affected participants' sense of artistic ownership and competence as amateur composers, through an improved ability to express creative intent. In sum, beyond reducing information overload, tools that partition AI capabilities into semantically-meaningful components may be fundamental to one's notion of being a creator, while opening the door for users to learn effective strategies for creating in that domain.

7.0.2 Onboard Users and Divulge AI Limitations. While participants were able to develop productive strategies using AI-steering tools, they were sometimes hesitant to make local edits for fear of adversely affecting the AI's global optimization. These reactions suggest that participants could benefit from a more accurate mental model of the AI. Previous research suggests benefits of educating users about the AI and its capabilities [1], or providing onboarding materials and exercises [2]. For example, an onboarding tutorial could demonstrate contexts in which the AI can easily generate content, and situations where it is unable to function well. For instance, the system could automatically detect if the AI is overly constrained and unable to produce a wide variety of content, and display a warning sign on the tool icon. Or, semantic sliders could divulge certain variables they are correlated with but not systematically mapped to, to set proper expectations when users leverage them as proxies. This could help users better debug the AI when it produces undesirable results. It could also prevent them from incorrectly attributing themselves and their lack of experience in composing as the source of the error, rather than the AI being overly constrained.

7.0.3 Bridge Novice Primitives with Desired Creative Goals. Though
we created an initial set of dimensions for AI-steering, we were
surprised that participants already had a set of go-to primitives
to express high-level creative goals, such as long notes to convey
calmness or ascending notes to express triumph and escalation.
When the interactive dimensions did not explicitly map to these
primitives, they re-purposed the existing tools as proxy controls to
achieve the desired effect. Given this, one could imagine directly
supporting these common go-to strategies. Given a wide range of
possible semantic levers, and the technical challenges of exposing
these dimensions in DNNs, model creators should at minimum
prioritize exposing dimensions that are the most commonly relied
upon. For music novices, we found that these included pitch, note
density, shape, voice and temporal separation. Future systems could
help boost the effectiveness of novice strategies by helping them
bridge between their primitives and high-level creative goals, such
as automatically “upgrading” a series of plodding bass line notes to
create a foreboding melody.

REFERENCES
[1] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI '19). ACM, New York, NY, USA, Article 3, 13 pages. https://doi.org/10.1145/3290605.3300233
[2] Carrie J. Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. 2019. "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 104 (Nov. 2019), 24 pages. https://doi.org/10.1145/3359206
[3] Monica Dinculescu and Cheng-Zhi Anna Huang. 2019. Coucou: An expanded interface for interactive composition with Coconet, through flexible inpainting. https://coconet.glitch.me/
[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[5] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. 2017. DeepBach: a Steerable Model for Bach Chorales Generation. In International Conference on Machine Learning. 1362–1371.
[6] Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in Psychology, Vol. 52. Elsevier, 139–183.
[7] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas Eck. 2017. Counterpoint by Convolution. ISMIR (2017).
[8] Cheng-Zhi Anna Huang, David Duvenaud, and Krzysztof Z. Gajos. 2016. ChordRipple: Recommending chords to help novice composers go beyond the ordinary. In Proceedings of the 21st International Conference on Intelligent User Interfaces. ACM, 241–250.
[9] Cheng-Zhi Anna Huang, Curtis Hawthorne, Adam Roberts, Monica Dinculescu, James Wexler, Leon Hong, and Jacob Howcroft. 2019. The Bach Doodle: Approachable music composition with machine learning at scale. ISMIR (2019).
[10] Ryan Louie, Andy Coenen, Cheng Zhi Huang, Michael Terry, and Carrie J. Cai. 2020. Novice-AI Music Co-Creation via AI-Steering Tools for Deep Generative Models. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI '20). ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3313831.3376739
[11] Roger C. Mayer, James H. Davis, and F. David Schoorman. 1995. An integrative model of organizational trust. Academy of Management Review 20, 3 (1995), 709–734.
[12] Changhoon Oh, Jungwoo Song, Jinhan Choi, Seonghyeon Kim, Sungwoo Lee, and Bongwon Suh. 2018. I Lead, You Help but Only with Enough Details: Understanding User Experience of Co-Creation with Artificial Intelligence. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal, QC, Canada) (CHI '18). ACM, New York, NY, USA, Article 649, 13 pages. https://doi.org/10.1145/3173574.3174223
[13] Ralf Schwarzer and Matthias Jerusalem. 1995. Generalized self-efficacy scale. Measures in health psychology: A user's portfolio. Causal and control beliefs 1, 1 (1995), 35–37.
[14] Wikipedia contributors. 2019. Dixit (card game) — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Dixit_(card_game)&oldid=908027531. [Online; accessed 19-September-2019].