=Paper=
{{Paper
|id=Vol-3926/paper6
|storemode=property
|title=Beyond Following: Mixing Active Initiative into Computational Creativity
|pdfUrl=https://ceur-ws.org/Vol-3926/paper6.pdf
|volume=Vol-3926
|authors=Zhiyu Lin,Upol Ehsan,Rohan Agarwal,Samihan Dani,Vidushi Vashishth,Mark Riedl
|dblpUrl=https://dblp.org/rec/conf/exag/LinEADVR24
}}
==Beyond Following: Mixing Active Initiative into Computational Creativity==
Zhiyu Lin, Upol Ehsan, Rohan Agarwal, Samihan Dani, Vidushi Vashishth and Mark Riedl
Georgia Institute of Technology, Atlanta, Georgia, USA
Abstract
Generative Artificial Intelligence (AI) encounters limitations in efficiency and fairness within the realm of Procedural
Content Generation (PCG) when human creators solely drive and bear responsibility for the generative process.
Alternative setups, such as Mixed-Initiative Co-Creative (MI-CC) systems, have exhibited promise. Still, the potential
of an active mixed initiative, where AI takes a role beyond following, is understudied. This work investigates the
influence of the adaptive ability of an active and learning AI agent on creators’ expectancy of creative responsibilities
in an MI-CC setting. We built and studied a system that employs reinforcement learning (RL) methods to learn the
creative responsibility preferences of a human user during online interactions. Situated in story co-creation, we develop
a Multi-armed-bandit agent that learns from the human creator, updates its collaborative decision-making belief, and
switches between its capabilities during an MI-CC experience. In a human subject study with 39 participants,
our system’s learning capabilities were well recognized compared to the non-learning ablation, corresponding
to a significant increase in overall satisfaction with the MI-CC experience. These findings indicate a robust association
between effective MI-CC collaborative interactions, particularly the implementation of proactive AI initiatives, and
deepened understanding among all participants.
Keywords
Mixed-Initiative, Co-Creativity, Human-AI Collaboration, Procedural Content Generation
11th Experimental Artificial Intelligence in Games Workshop, November 19, 2024, Lexington, Kentucky, USA.
zhiyulin@gatech.edu (Z. Lin); ehsanu@gatech.edu (U. Ehsan); rohanagarwal@gatech.edu (R. Agarwal); sdani30@gatech.edu (S. Dani); vvashishth3@gatech.edu (V. Vashishth); riedl@cc.gatech.edu (M. Riedl)
https://zhiyulin.info/ (Z. Lin); https://www.upolehsan.com/ (U. Ehsan); https://eilab.gatech.edu/mark-riedl.html (M. Riedl)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Recent advancements in Machine Learning (ML)–powered Artificial Intelligence (AI), such as large language models (LMs) [1] and diffusion models [2], have made a new class of tools for Procedural Content Generation (PCG) available to game creators. The dominant contemporary way for creators to control such generative AI models is via prompting—the issuing of textual instructions for the model to interpret and respond to [3]. That is, the user is tasked with the responsibility of issuing clear “prompts” to contextualize the AI system and make it aware of their intents. The AI is tasked to follow and fulfill the request strictly based on it. If the system does not respond with an output that satisfies the creators’ wants or needs, it is incumbent upon the creators to modify the prompt and try again.

The paradigm of human creators working with generative AI via prompting is just one of many theoretical ways for a human creator and an AI system to interact [4]. There is evidence that prompting is not necessarily the best interaction paradigm; users indicate an appreciation for more varied ways of interacting with AI creative systems [5]. Other configurations of human-AI collaborative creative systems are possible that promise to reduce cognitive load, frustration, and system abandonment [6], and make these systems more casual and enjoyable [7]. These include Mixed-Initiative (MI) systems and Co-Creative (CC) systems. Mixed-Initiative systems are those in which both human and AI systems can initiate content changes. Co-Creative systems are those in which both human and AI systems can contribute to content creation. In particular, MI-CC systems have been demonstrated in game design [8], drawing [9], and storytelling [10], benefiting from both human and AI possessing the ability to take creative initiative. While the broadest definition of co-creative systems might include any human creator working with a generative AI, the vast majority of them have not investigated the role of mixed initiative, especially a more active AI initiative.

At the heart of MI-CC systems is the question of whether and how the AI creative agent knows and understands (a) the intentions and goals of the human creator and (b) how the user wants to work with the AI system. These questions pose significant challenges, especially within domains critical to game designers utilizing AI, such as Computational Creativity and PCG. In other domains, the goal may be provided to the AI in advance, making it easier to identify opportunities to take the initiative with respect to contributing to a solution—the extreme of which is the AI system knowing the goal and solving it completely on its own. When it comes to creating games, however, the human creators’ intent is harder to articulate completely [11]. The human creator’s goals are also non-stationary and may evolve during the creative process [12, 13]. The human creator might also have a preferred working style that the agent should conform to in order to take the initiative while minimizing disruption. Once these challenges are overcome, researchers have shown that such ambiguity and instability link to improved outcomes of the creative activity [14], thus benefiting the MI-CC interaction.

In this paper, we examine Co-Creative systems in a mixed-initiative setting and study the dynamics of managing creative responsibility between human and
Figure 1: Screenshot of our system in action.
AI initiatives. We ask: What influence does an AI agent’s ability to actively adapt to creators’ expectancy of creative responsibility in an MI-CC system have on creator experience and perception?

In particular, we make the assumption that the AI agent is capable of working in the creative domain if given explicit prompts but is unaware of the human creator’s preferences for distributing creative responsibility between humans and the AI. We explore the usage of Reinforcement Learning (RL) methods in this setting and demonstrate that the creative responsibility learning challenge in MI-CC systems can be addressed by a multi-armed bandit (MAB) algorithm that observes feedback from users iteratively, updates its beliefs, and carries out its capabilities to facilitate the MI-CC collaboration. The learning is done online in real-time during the MI-CC process, and the human creator is not expected to have previous knowledge of the AI agent or time to pre-train it with regard to their collaboration style.

Working in the domain of structured story co-creation, we invite 39 participants to a human subject study. We quantitatively measure the human creator’s perceived learning performance of the agent and the overall level of satisfaction with the collaboration. We use the Creative Support Index (CSI) [15] to study the implications of a learning and evolving AI agent. We also report on qualitative data collected from participants, using a grounded theory [16] approach in which we identify thematic patterns in users’ subjective reports of their experiences. This study reveals a higher degree of participant recognition regarding the learning capabilities of our agent, compared to the ablation, which in turn corresponded to a significant increase in overall satisfaction with our agent (code: https://github.com/xxbidiao/beyond-following-experiments).

2. Background and Related Work

The procedure of an MI-CC system learning its creative responsibilities can be described as a decision-making process, where the agent communicates with the human creator, gathers information, and chooses among its capabilities. This is not as straightforward as asking human creators to prompt AI agents because:

• Just like the Cold Start problem experienced by AI agents lacking prior preferential knowledge from their creators [17], human creators, even experts, may struggle to make inferences about the behavior of AI systems they initially face;
• The ability of human creators to effectively convey information to AI depends on their communication skills, which can be a significant obstacle even in human-to-human interactions [18];
• Enforcing this AI-centric method of input requires a profound mechanical understanding of the AI system from the human creators, where this knowledge does not necessarily intersect with their expertise. This marginalizes creators who do not possess the requisite expertise in utilizing AI.

For these reasons, relying solely on human creators for direct collaborative prompting, regardless of the capability of the AI models, has its limitations, leading to efficiency, cognitive load, fairness, and equity issues.

Alternatively, a model can be built on human feedback without users directly communicating their goals. Researchers have demonstrated the potential of such models in transferring human knowledge to AI [19, 20] and making AI learn more efficiently [21, 22]. When it comes to generating content, this is the foundation of methods such as RL from human feedback [23], which has proven to
drastically improve the quality of generated text in state-of-the-art models such as GPT-4 [1]. Yet, these methods are designed to exclusively optimize for a static, known-from-data objective. They are not designed for online implementation where pre-training is not feasible, and the system lacks prior knowledge of new creators and needs to actively probe them.

To focus on the active probing challenge, we formalize it as a Multi-Armed Bandit (MAB) problem [24] above generative abilities, where an AI agent needs to actively choose under uncertainty from its library of capabilities, based on its understanding of its human creator teammate, to minimize total regret and maximize rewards from the teammate. Multi-Armed Bandit systems have been employed in the context of resolving how to make progress in an interactive creative experience. Koch et al. [25] discussed a design ideation framework that suggests images that a designer may like by exploring and exploiting in the image embedding space with a variant of MAB; Gallotta et al. [26] applied MAB in the context of generating “in-game spaceships” by enabling creator-guided latent space walks in the feature embedding space representing such spaceships. These works focused on a single type of action in the content space, and concentrated on expanding the generative space of such content. Lin et al. [27, 5] instead explored the action space, characterized as types of Communications representing the information exchange between human and AI used in the co-creative process. As to the idea of switching between different high-level actions beyond the content level, building a model of the user has been proven to help in a CC setting, specifically in the domain of storytelling. Yu et al. [28] demonstrated its potential to generate stories that bring “an enjoyable experience for the players”; Gray et al. [29, 30] further demonstrated how MAB agents help to capture this player model. Vinogradov et al. [31] showcased a framework where the agent explores the creators’ “player” model vigorously by directly generating “distractions”, objects designed to probe into players’ preferences instead of providing utility in finishing a certain task. They proposed using MAB for this task for its promise in “balancing the act of gathering information about the payout associated with each arm (exploration) and maximizing reward given the current known information (exploitation)”, dynamically updating the model in the process towards assigning tasks that the players feel more interested in tackling. They inspire our method, as their approach of adding distractions is well comparable to the agent carrying out its initiative while directly changing the creative content.

3. Study Design

In this section, we present the study we designed to examine the AI agent we created that adapts to creators’ expectancy of creative responsibility. We seek to determine how this changes the perception of the creators toward the AI and the creative experience the system supplies to the human creators.

3.1. Task Setup

The Delegation Setup. For the experiments, we spotlight a specific but generalizable collaborative setup: learning a delegation. In this setup, both parties take a subset (or the entirety, if preferred) of responsibilities in an MI-CC activity towards the common goal. The human creator concentrates on specific parts of the creative task while not losing control of the other parts; the AI agent needs to strategically shift its focus towards the parts that the human creator is not focusing on and actively determine how to make improvements. Furthermore, as these interactions are not without cost, such as creators’ cognitive load, it is also important to minimize such costs towards learning these responsibilities. We denote the expected and delegated responsibility that the AI agent needs to learn during the interaction as the preferred work style for a particular human creator.

Domain: Storytelling. Given the mounting interest in co-creative storytelling [32] and the established research foundation within story generation, its high relevance to game development, and its inherent complexity with regard to PCG, we select story generation as a proving ground for our proposed method. The expertise of the team and advancements in open-source Large LMs readily available to us facilitated implementation; this allows us to focus on the human factors of the MI-CC experience and the AI agent itself.

For our experimental system, we use Llama2-13b-chat [33] as the LM, readily available at the time of the study while very responsive for the interactive experience.

3.2. Experimental AI System Overview

We now describe the AI system we built for the purpose of the study. The experimental system is based on the Creative Wand framework [27], containing the following four components:

3.2.1. Creative Context

The Creative Context is the abstraction of generative models for this system. In this paper, we study stories containing four components inspired by Narrative Arc theory: the beginning, development or rising action, climax, and conclusion. We design an AI framework that writes each component of the story using language models and prompt engineering (see Appendix C for more details). Both the human participant and the AI are instructed to write about 20 to 30 words per component, and the target length of the whole story is around 100 words. Once we set up the model, it will take requests from Communications.

3.2.2. Communications

Communications describes the interactions between the human creators and the AI; they also double as the capabilities the AI agent possesses. To focus on how the agents would choose their creative responsibilities, we implement a minimalistic yet complete set of
Figure 2: One round of interaction of our experimental system. Each participant will experience multiple turns per session.
capabilities for the creative experience. This allows us to focus on research questions about the creative experience while minimizing the cognitive load of the participants. Our agent possesses the following capabilities, implemented as prompts to the LM describing the responsibilities (see Appendix C for details):

• (Re)write the beginning and development;
• (Re)write the climax and conclusion;
• Write a review of the story: one sentence positive, one negative, and one suggestion for improvements.

3.2.3. Experience Manager and Frontend

These two modules manage the interactive experience and workflow. We implement a Finite State Machine to manage the experience. Figure 2 shows the states with the overall flow of interaction each participant experiences in one experiment session. One session of the MI-CC experience is separated into multiple “turns”, where both parties iteratively improve the story, sharing the same text fields in the editing process. The participants are not directly notified of the internal states of the system.

Human Initiative. During this phase, human creators contribute to the story by making edits in any of the four text fields. This phase ends when the agent decides to take the initiative. We implement a point-based heuristic based on pilot studies: the agent assigns points for the changes it observes, and takes the initiative whenever enough points are accumulated, signifying substantial edits from the human creators, according to the following criteria:

• Each new character adds 5 points;
• Each time the human creator switches between fields after any changes, 100 points are added;
• Whenever the human creator leaves a text field with 200 points accumulated (roughly one full sentence or two minor changes), the agent takes the initiative by locking the editing interface and resetting the counter.

This heuristic provides two advantages compared to other ways this decision can be made: First, this heuristic is computationally fast and enables responsive interactions; second, it additionally provides visualization for the users. As shown in Figure 1, we present this right above the text boxes for the stories, with a text hint and a progress bar representing the ideation process of the agent. We additionally provide a “skip” function that forces agent initiative.

Agent Initiative. In this phase, the agent decides which capability best fosters the collaborative experience and carries out the corresponding Communication. We build a Multi-Armed Bandit-based agent in our system that is responsible for choosing which Communication to invoke, with Thompson Sampling as the chosen algorithm for the experimental system within the AI agent. Formally, an agent A interacts with a set of K arms a_1, ..., a_K, each of which is associated with a Communication and underlying capabilities, and an unknown reward distribution. Whenever an arm is pulled, the agent seeks feedback from the human creator on the initiative, which is treated as a reward signal (see the next paragraph). The goal of the agent is to maximize the total reward obtained by repeatedly pulling arms during the session. See subsection A for more details on the design choices of the MAB agent. Once an arm is pulled, the agent executes a Communication, interacts with the user, and updates the story as needed.

Learning from human. The system asks about (Action Feedback) the way they just worked and (Content Feedback) the updates and content changes. The participants choose between “Good” (reward of 1) and “Bad” (reward of 0). “Bad” feedback on generated text leads to a reversion to the original content, though it is not used to improve the LM in any way. A weighted mean is employed to integrate both types of feedback into a singular reward signal. For the study, a weight hyperparameter of 80% is applied to the Action Feedback and 20% to the Content Feedback. This prioritizes learning action-level responsibilities rather than the preference for LM-generated text, in which the full system and the baseline share implementation. This reward signal is then used to train the agent. For this experiment, an MAB agent with Thompson Sampling is used in the experimental system. See Appendix A for a discussion and experiments related
Figure 3: Participants’ experience during the study.
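The agent-initiative loop described above, Thompson Sampling over the three Communications with “Good”/“Bad” feedback combined by an 80/20 weighted mean, can be sketched as follows. This is a minimal illustration rather than the authors’ implementation: the class and arm names are ours, and we assume the standard Beta-Bernoulli posterior treatment of binary bandit rewards.

```python
import random

class ThompsonSamplingAgent:
    """Minimal Beta-Bernoulli Thompson Sampling over a set of Communications."""

    def __init__(self, arms):
        # One Beta(alpha, beta) posterior per arm, starting from the uniform prior.
        self.posteriors = {arm: [1.0, 1.0] for arm in arms}

    def choose(self):
        # Sample a plausible reward rate from each posterior; pull the best arm.
        samples = {arm: random.betavariate(a, b)
                   for arm, (a, b) in self.posteriors.items()}
        return max(samples, key=samples.get)

    def update(self, arm, action_feedback, content_feedback, action_weight=0.8):
        # Weighted mean of the two binary signals (80% action, 20% content),
        # as in the study; treated as a soft Bernoulli observation.
        reward = (action_weight * action_feedback
                  + (1 - action_weight) * content_feedback)
        self.posteriors[arm][0] += reward
        self.posteriors[arm][1] += 1 - reward

# Hypothetical arm names standing in for the three Communications.
agent = ThompsonSamplingAgent([
    "rewrite_beginning_development",
    "rewrite_climax_conclusion",
    "write_review",
])
arm = agent.choose()
agent.update(arm, action_feedback=1, content_feedback=0)  # "Good" action, "Bad" content
```

Over repeated turns, arms whose initiatives draw “Good” feedback accumulate higher Beta means and are sampled more often, while the sampling step keeps some exploration alive.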
to this choice. Once the learning process is complete, “human initiative” starts again. To maintain user engagement, text responses are morphed each time to avoid repetitiveness, while contextual hints are also strategically provided throughout the experience. Figure 1 shows the user interface.

3.3. Study Methodology

To study the perception of human creators towards MI-CC systems equipped with these learning capabilities, we conduct a study, summarized in Figure 3, on the AI system. We compare our system, the “Full” system, with an ablation named “baseline”. The “baseline” ablation does not learn: it chooses each of the 3 Communications with a 1/3 probability at all times and provides only a reverting option when “asking for feedback”. These systems are codenamed “Echo Wand” and “Harmony Wand” respectively, so as not to reveal the details of the systems to the participants during the study. We recruited 39 United States participants (counting only participants who finished the whole study with valid sessions and responses) on Prolific (prolific.co) with adequate English proficiency. Each experiment session lasted approximately 40 minutes, and we paid the participants $15 per hour for perfect completion of the study.

Pre-study. Before the experience, participants answer four 5-point Likert-scale questions on (Q1) Expertise in Computer-Assisted Designing (CAD), (Q2) Expertise in writing stories, (Q3) Frequency of using AI, and (Q4) Understanding of AI (see Appendix B for the full question text). We then present instructions to familiarize the participants with our systems by providing annotated screenshots of the interface, which is a copy of Figure 1, but with additional numeric overlays, descriptions of components, and a brief introduction to the workflow of co-creating a story. They are then assigned the delegation task: to focus on writing the beginning and the development of the story while leaving the other parts of the story to the AI as much as possible. They are also made aware that the AI does not know this setup in advance.

Experience. Participants are assigned to interact with the full system and the baseline ablation, presented in random order, counter-balanced. They are given 10 turns per each of the 2 sessions.

Post-study. After participants finished two sessions using our systems, they were asked about the process they had just experienced. Inspired by the Creative Support Index (CSI) [15] used in previous studies, we ask questions based on dimensions related to creative support perception and the overall collaborative experience, grouped to facilitate richer responses from the participants while maintaining their engagement in the survey. Specifically, we ask which system(s) are (Q5, Learning, Collaboration) learning to collaborate, (Q6, Enjoyment, Immersion) more capable and easy to work with, and (Q7, Expressiveness, Exploration, Results worth effort) enabling better stories. For Q5 through Q7, participants can choose either system, both systems, or neither, leading to a potential total exceeding 100%. We ask one final question (Q8) on which system they would recommend more, framed in a win-draw-lose format. Although these questions are presented in the same order for all participants, the order of the options is randomized to reduce bias towards any system. All questions are followed by an open-text question prepared to collect justifications from the participants.

4. Quantitative Results

4.1. Creative background

Table 1 shows a summary of the creative backgrounds of the participants. Although a median of 4 on all questions implies that participants are familiar with the recent advancement of AI, when specifically asking whether they can build one, only 1 participant answered “yes” (5 in Q4), meaning that most of the participants do not have a technical background. However, compared to the 26% reported in [5], we observed 87% of the participants being at least “somewhat familiar” (3+) with recent AI technologies, and 51% being “familiar” (4+); the experience of using commercially available Large LM-based agents may have a profound effect on how participants, in general, would collaborate with AI systems.

4.2. Quantitative Results

We commence by presenting the quantitative results of the study through the choices made by the participants in the multiple-choice questions. When asked which system(s) learned to collaborate with them under the delegation arrangement (Q5), the “Full” system is chosen 69% (n = 39) of the times, compared to 51% for the baseline (p < 0.018, under a
Q (See Appendix B for full questions) 1 2 3 4 5 Average Median
Q1: CAD skills 1 1 2 19 16 4.23 4
Q2: Writing skills 1 0 7 20 11 4.03 4
Q3: Frequency of using AI 0 0 16 11 12 3.90 4
Q4: Understanding of AI Tech. 0 5 14 19 1 3.41 4
Table 1: Creative background of the participants. 1 = Most Negative, 5 = Most Positive.
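The p-values in this section come from binomial tests against a 50/50 null of no difference; a minimal stdlib sketch of such an exact two-sided test follows. The 27-of-39 split is purely illustrative, not a figure from the study, and the exact pairing of participant responses is left to the paper.

```python
from math import comb

def binomial_two_sided_p(successes, n, p0=0.5):
    """Exact two-sided binomial test: total probability, under
    H0: X ~ Binomial(n, p0), of outcomes no more likely than the observed one."""
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # Sum the probability of every outcome at most as likely as the observation.
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))

# Illustrative: 27 of 39 forced-choice preferences for one system.
p = binomial_two_sided_p(27, 39)
```

With 27 of 39 hypothetical votes the test rejects the null at the 5% level; a perfectly even split (e.g. 5 of 10) yields p = 1.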
binomial test where H0 := no observable difference in distribution; the same holds for all p-values in this section). We clearly see the “Full” system with learning capabilities enabled being perceived as significantly better at learning the delegation than the baseline, demonstrating, from the human creator’s perspective, the effectiveness of the MAB-based model learning from their feedback.

When asked which system to recommend, this trend also persists: our system is preferred (wins) 43.6% of the time, versus 20.5% (loses) for the baseline (p < 0.001); 35.9% of the participants do not have a preference (draw). The “Full” system differs from the baseline system only in the learning capabilities and corresponding frontend elements, yet we see a statistically significant improvement in preference towards our “Full” system, illustrating the potential of our method in enhancing the MI-CC experience and making such systems better for human creators.

When it comes to which system(s) gave a good story (Q7), 72% of the participants agree that the “Full” system made a good story, while 69% selected the baseline system (p > 0.05). We were unable to statistically determine whether an agent learning the delegation would produce a better story; this is expected, as we focused on studying the sharing of responsibilities and enforced a delegation setting. In an actual MI-CC experience, without such a prior, a human creator would utilize the agent’s learning capability to promote their strengths and discourage their weaknesses, and an improvement in perceived performance is more likely to be observed in that setting.

Finally, when queried about the collaboration itself (Q6), 62% of the participants think the “Full” system is capable and made the collaboration easy, while 56% voted for the baseline system (p > 0.05). We also were unable to statistically determine whether the “Full” system is more enjoyable and immersive. Although the difference between the “Full” system and the baseline is substantial enough both implementation-wise and towards the perception of learning, from the angle of the user interface, the only difference is 10 additional questions from the “Full” system per session. Previously, Larsson et al. [34] reported that “there was a clear trend that the visual ... was rather important to the subject’s relationship towards the MI-CC,” while these “relationships” are directly linked to creators’ perception of immersion of the experience; Ehsan et al. [35] additionally pointed out that even when an AI system presents the same underlying information, how it is presented influences the perceptions of human users. We may have observed this effect from a different angle, where a lack of differences in presentation may have caused the indifference of the participants. To that end, the difference between the two systems on these creative support dimensions may be too minor when it comes to how they are presented visually; the effect of the user interface used to present the results in an MI-CC system is out of the scope of this work, though these findings illuminate a potential path for future research.

5. Qualitative Results

We now show the results from the open-ended questions following each multiple-choice question. Open-ended justifications participants provided for each of the four questions are evaluated with thematic analysis [36], based on grounded theory [16]. Taking an inductive approach, we started the process with an open-coding scheme and iteratively produced in-vivo codes (generating codes directly from the data). Next, we analyzed the data using axial codes, which involves finding relationships between the open codes and clustering them into different emergent themes. Through an iterative process performed until consensus was reached, we share the most salient themes that emerged from the axial codes.

An MI-CC system that understands the intents of the human creators and follows them by learning is overall favored and collaborates well with the creators. Participants demonstrated their observation of the learning capabilities of the “Full” system, identifying it as “better about learning that I specifically wanted help with” (P34) and having “listened to my feedback” (P39). In comparison, the baseline system is identified as one that “did less of the work ... did not necessarily learn what its role was expected to be” (P19). This resulted in a preference for the Full system for P32, as the Full system is quoted as a “more useful helper”. This aligns with the quantitative observations.

Good content suggestions may give people the feeling that the system is learning how to collaborate with them, regardless of how the AI is actually doing so. Despite specifically asking participants to discuss whether the agent has “learned to collaborate with you under that arrangement” (Q5), participants are also rating the system based on the generated content:

(P25, emphasis added) This one learned from me because it was able to build off of my original foundation of my story that I typed.

P18, who rated their familiarity with AI as Familiar (4 out of 5) and AI usage as “Always / as much as possible” (5 out of 5), wrote that the “Full” system is learning from them:
I could see Echo Wand adding more detail and building out more creatively than with Harmony Wand.

This participant is familiar with recent generative AI and mentions “adding details” and “building,” which are traits that these AI are optimized for. As both the “Full” and the baseline use the same underlying generative AI capabilities, P18 could not distinguish between the “improvements” on generated contents and the performance of the MAB-based agent. The apparent improvements of generated stories may result from a wide range of reasons, such as participants providing different input and the LM sampling differently, unrelated to both the underlying LM and the learner, creating noise in the perception of participants.

Diversity is also important: it may not be the best strategy for a learning agent to pick the “best options”, and sometimes the agent may want to intentionally surprise their teammates. P23 was impressed by the range of capabilities both agents possess, seeing “They were both impressive, being able to take my story and to word it better, or even add things to change it to make it better”. When asked about the generated story, P39 mentioned that “Both of them gave bad stories.” and “I need much more control and options”. Curiously, this is the same participant that enjoyed the agent that “listened to my feedback.”. P36 preferred the baseline system that executes random actions:

I did all of the work with Echo, despite my best efforts to get it to collaborate with me. Harmony had much more interesting suggestions and rightfully pointed out when a section became too dense. It balanced the second two sections to match my intro and build up, unlike Echo who almost refused to work on them.

For this study, we assigned delegation tasks to the participants. This is only a subset of possible respon-

... I was in control of the final text to accept changes or not, or to make my own.

In a system involving a creator who wishes to create content to their liking, it is expected that the creator wishes to solicit as much control as possible. However, if the AI agent does not have any final say on the contents, should we expect it to take any creative responsibilities? Although we acknowledge that this is more of a philosophical question, well out of the scope of our work, what if the agent could understand what their counterpart is actually seeking and use this information to determine what contribution they should stick to by understanding what human creators are thinking?

6. Discussions

Distilling from these findings, ranging across the perception of collaboration, good writing skills, diversity in capabilities, and creators’ need for control, a common implication surfaces: getting the mental model of the creators right, the system will succeed; getting it wrong, failure cases would surface. A mental model is described by Kieras et al. [37] as “understanding ... that describes the internal mechanism” of the system a human is operating; Leslie et al. [38] further point out that a theory of mind is a mechanism that humans express naturally, towards an understanding of thinking, in our context, of their teammate AI. The success of our “Full” system of learning rises from its ability to learn a model of how the creators wish to collaborate with them, and the reward given by a teammate can be otherwise treated as a reward for correctly understanding their model. The need for diversified responses and more respect for the control signals users impose also falls into this paradigm, but goes beyond it; understanding how these reward signals should be used beyond “picking the best”, and how to capture hints for new actions or capabilities needed, can greatly improve collaborations with MI-CC systems. This falls into the subfield of “novelty detection and
sibilities that the AI agent can take and the human adaptation” [39] situated in RL, which is known to be
creators may expect. Lin et al. [5] have shown that a challenging, if solvable at all with ML methods, as ML
system with more coverage of the design space, provid- models can only rely on their extrapolation capabilities
ing more diversified options, is preferred. Our study towards the “unknowns”, that may not hold for all
design, which is more focused on studying the learning novelties; This will be a rewarding pathway towards
process, limited the variety of capabilities the agent better MI-CC systems if not agentic AI overall.
may perform. To that end, once such an MI-CC sys- We start to see a consistent narrative: creators are
tem is put into use beyond research, it is necessary interpreting the capabilities of our AI agent learning
to diversify both the capability pool and the process as an attempt the AI agent made to learn a mental
of the AI agent choosing them, potentially providing model of themselves; Because our agent determines
surprise and unpredictability to further inspire the which Communication to use and the effect of it on the
users. contents being collaborated on, We observe the par-
ticipants treating proper learning of Communication
Creator control is important, and creators may want choices (expected) and the content generated (emerg-
their ideas to be included even when AI can pro- ing) as both evidence that the agent is learning from
vide better candidates. Beyond the need for control them and traits leading to their preferences towards
mentioned by P39, P28 mentioned that they were im- these systems. This also, to some extent, explains
pressed by the capabilities of both systems in “finish the placebo effect we observe on the baseline system:
the story that I started with.” (Emphasis added). P27 around half of the participants believe that the base-
mentioned further on their justification: line system is learning from them, significantly more
than 0, despite the baseline system only making deci-
sions randomly. In this controlled comparative study,
In this controlled comparative study, to avoid a bias towards either of the systems, we intentionally did not disclose any difference between the “Full” system and the baseline. This perception may have arisen from the capability of our agent to generate parts of stories that follow the context the participants provided. Although we acknowledge that these factors are hard to decouple, this finding also hints at the potential of our methods for understanding the human creator holistically. Ehsan et al. [35] pointed out that the background of human users determines their cognitive heuristics, which play a role in their expectations beyond what the designer of the systems expected in the first place. They also observed that, if not treated carefully, AI systems can actually introduce such placebo effects as a pitfall [40], by misleading the human users into appreciating their trustworthiness and power without the development of underlying AI capabilities. Standing on these findings, a promising direction of research is to carefully identify the effect of the expectations of both parties involved in the MI-CC process, and how they dynamically change during the collaboration.

7. Conclusions

In this paper, we showcased how an MI-CC system is capable of listening to human feedback and improving itself towards a better understanding of how it should collaborate with human creators in a storytelling domain. Inviting 39 participants and comparing two such systems, with and without these learning capabilities, we found that this capability was well recognized by the participants and led to better satisfaction overall. To this end, we further encourage the designers of MI-CC systems to pay attention to both the human creators and the AI agent, and to study how each party should be, or already is, adapting to and creating mental models of their counterpart, based on the creative roles taken, their previous experience and capabilities, and, most importantly, the wishes of the human creators.

References

[1] OpenAI, GPT-4 Technical Report, 2023. URL: http://arxiv.org/abs/2303.08774. doi:10.48550/arXiv.2303.08774, arXiv:2303.08774 [cs].
[2] P. Dhariwal, A. Nichol, Diffusion Models Beat GANs on Image Synthesis, Advances in Neural Information Processing Systems 34 (2021) 8780–8794. arXiv:2105.05233.
[3] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023) 1–35. arXiv:2107.13586.
[4] Z. Lin, M. Riedl, An Ontology of Co-Creative AI Systems, arXiv preprint arXiv:2310.07472 (2023).
[5] Z. Lin, U. Ehsan, R. Agarwal, S. Dani, V. Vashishth, M. Riedl, Beyond Prompts: Exploring the Design Space of Mixed-Initiative Co-Creativity Systems, Proceedings of the 14th International Conference on Computational Creativity (2023) 64–73. URL: http://arxiv.org/abs/2305.07465. doi:10.48550/arXiv.2305.07465, arXiv:2305.07465 [cs].
[6] J. Sweller, Cognitive load theory, in: Psychology of Learning and Motivation, volume 55, Elsevier, 2011, pp. 37–76.
[7] K. Compton, M. Mateas, Casual creators, in: Proceedings of the Sixth International Conference on Computational Creativity, 2015, p. 228.
[8] A. Liapis, G. N. Yannakakis, C. Alexopoulos, P. Lopes, Can computers foster human users’ creativity? Theory and praxis of mixed-initiative co-creativity, DCE (2016). URL: https://www.um.edu.mt/library/oar/handle/123456789/29476.
[9] N. Davis, C.-P. Hsiao, K. Y. Singh, L. Li, S. Moningi, B. Magerko, Drawing Apprentice: An Enactive Co-Creative Agent for Artistic Collaboration, in: Proceedings of the 2015 ACM SIGCHI Conference on Creativity and Cognition, C&C ’15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 185–186. URL: https://doi.org/10.1145/2757226.2764555. doi:10.1145/2757226.2764555.
[10] A. Alvarez, J. Font, J. Togelius, Story Designer: Towards a Mixed-Initiative Tool to Create Narrative Structures, Proceedings of the 17th International Conference on the Foundations of Digital Games (2022) 1–9. URL: http://arxiv.org/abs/2210.09294, arXiv:2210.09294 [cs].
[11] M. O. Riedl, Human-centered artificial intelligence and machine learning, Human Behavior and Emerging Technologies 1 (2019) 33–36. Publisher: Wiley Online Library.
[12] N. Davis, C.-P. Hsiao, Y. Popova, B. Magerko, An enactive model of creativity for computational collaboration and co-creation, Creativity in the Digital Age (2015) 109–133. Publisher: Springer.
[13] M. Guzdial, N. Liao, M. Riedl, Co-Creative Level Design via Machine Learning, Fifth Experimental AI in Games Workshop (2018). URL: http://arxiv.org/abs/1809.09420, arXiv:1809.09420.
[14] F. Zenasni, M. Besançon, T. Lubart, Creativity and tolerance of ambiguity: An empirical study, The Journal of Creative Behavior 42 (2008) 61–73. Publisher: Wiley Online Library.
[15] E. Cherry, C. Latulipe, Quantifying the creativity support of digital tools through the creativity support index, ACM Transactions on Computer-Human Interaction (TOCHI) 21 (2014) 1–25. Publisher: ACM New York, NY, USA.
[16] B. Glaser, A. Strauss, Discovery of Grounded Theory: Strategies for Qualitative Research, Routledge, 2017.
[17] J. Bobadilla, F. Ortega, A. Hernando, J. Bernal, A collaborative filtering approach to mitigate the new user cold start problem, Knowledge-Based Systems 26 (2012) 225–238. Publisher: Elsevier.
[18] S. M. Grover, Shaping effective communication skills and therapeutic relationships at work: The foundation of collaboration, AAOHN Journal 53 (2005) 177–182. Publisher: SAGE Publications Sage CA: Los Angeles, CA.
[19] W. Bradley Knox, P. Stone, TAMER: Training an Agent Manually via Evaluative Reinforcement, in: 2008 7th IEEE International Conference on Development and Learning, IEEE, Monterey, CA, 2008, pp. 292–297. URL: http://ieeexplore.ieee.org/document/4640845/. doi:10.1109/DEVLRN.2008.4640845.
[20] G. Warnell, N. Waytowich, V. Lawhern, P. Stone, Deep TAMER: Interactive agent shaping in high-dimensional state spaces, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[21] Z. Lin, B. Harrison, A. Keech, M. O. Riedl, Explore, Exploit or Listen: Combining Human Feedback and Policy Model to Speed up Deep Reinforcement Learning in 3D Worlds, arXiv:1709.03969 [cs] (2017). URL: http://arxiv.org/abs/1709.03969.
[22] D. Arumugam, J. K. Lee, S. Saskin, M. L. Littman, Deep Reinforcement Learning from Policy-Dependent Human Feedback, 2019. URL: http://arxiv.org/abs/1902.04257, arXiv:1902.04257 [cs].
[23] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, G. Irving, Fine-tuning language models from human preferences, arXiv preprint arXiv:1909.08593 (2019).
[24] J. Vermorel, M. Mohri, Multi-armed bandit algorithms and empirical evaluation, in: European Conference on Machine Learning, Springer, 2005, pp. 437–448.
[25] J. Koch, A. Lucero, L. Hegemann, A. Oulasvirta, May AI? Design Ideation with Cooperative Contextual Bandits, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1–12. URL: https://doi.org/10.1145/3290605.3300863.
[26] R. Gallotta, K. Arulkumaran, L. B. Soros, Preference-Learning Emitters for Mixed-Initiative Quality-Diversity Algorithms, IEEE Transactions on Games (2023) 1–14. doi:10.1109/TG.2023.3264457.
[27] Z. Lin, R. Agarwal, M. Riedl, Creative Wand: A System to Study Effects of Communications in Co-creative Settings, Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18 (2022) 45–52. URL: https://ojs.aaai.org/index.php/AIIDE/article/view/21946. doi:10.1609/aiide.v18i1.21946.
[28] H. Yu, M. Riedl, Data-driven personalized drama management, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 9, 2013, pp. 191–197.
[29] R. C. Gray, J. Zhu, D. Arigo, E. Forman, S. Ontañón, Player modeling via multi-armed bandits, in: Proceedings of the 15th International Conference on the Foundations of Digital Games, 2020, pp. 1–8.
[30] R. C. Gray, J. Zhu, S. Ontañón, Multiplayer Modeling via Multi-Armed Bandits, in: 2021 IEEE Conference on Games (CoG), IEEE, 2021, pp. 01–08.
[31] A. Vinogradov, B. Harrison, Using Multi-Armed Bandits to Dynamically Update Player Models in an Experience Managed Environment, Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18 (2022) 207–214. URL: https://ojs.aaai.org/index.php/AIIDE/article/view/21965. doi:10.1609/aiide.v18i1.21965.
[32] M. Behrooz, Y. Tian, W. Ngan, Y. Yungster, J. Wong, D. Zax, Holding the Line: A Study of Writers’ Attitudes on Co-creativity with AI, 2024. URL: http://arxiv.org/abs/2404.13165, arXiv:2404.13165 [cs].
[33] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. arXiv:2307.09288.
[34] G. Larsson, V. Lindecrantz, How an AI colleague affect the experiance of content creation, 2023. URL: https://www.diva-portal.org/smash/get/diva2:1780852/FULLTEXT02.
[35] U. Ehsan, S. Passi, Q. V. Liao, L. Chan, I.-H. Lee, M. Muller, M. O. Riedl, The who in explainable AI: How AI background shapes perceptions of AI explanations, in: Proceedings of the CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–32. arXiv:2107.13509 [cs.HC].
[36] J. Aronson, A pragmatic view of thematic analysis, The Qualitative Report 2 (1994) 1–3.
[37] D. E. Kieras, S. Bovair, The role of a mental model in learning to operate a device, Cognitive Science 8 (1984) 255–273. URL: https://www.sciencedirect.com/science/article/pii/S0364021384800038. doi:10.1016/S0364-0213(84)80003-8.
[38] A. M. Leslie, O. Friedman, T. P. German, Core mechanisms in ‘theory of mind’, Trends in Cognitive Sciences 8 (2004) 528–533. Publisher: Elsevier.
[39] J. Balloch, Z. Lin, M. Hussain, A. Srinivas, R. Wright, X. Peng, J. Kim, M. Riedl, NovGrid: A flexible grid world for evaluating agent response to novelty, arXiv preprint arXiv:2203.12117 (2022).
[40] U. Ehsan, M. O. Riedl, Explainability pitfalls: Beyond dark patterns in explainable AI, Patterns 5 (2024). Publisher: Elsevier.
[41] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[42] R. Agrawal, Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, Advances in Applied Probability 27 (1995) 1054–1078. Publisher: Cambridge University Press.
[43] W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika 25 (1933) 285–294. Publisher: Oxford University Press.

A. Choosing a MAB algorithm

In this section, we provide more information on the design choice of the MAB agent. Following results from Vinogradov et al. [31], we looked into three representative MAB algorithms: ε-greedy, UCB1, and Thompson Sampling.

ε-greedy [41], widely used in RL, works on a simple principle: the agent has probability ε (a hyperparameter) of choosing a random action (explore) instead of performing the best action from its policy (exploit).

UCB1, or Upper Confidence Bound 1 [42], instead takes a more deterministic approach: it calculates an “Upper Confidence Bound” for each arm, considering both the current running average of the rewards and the uncertainty due to lack of sampling:

    a = argmaxₐ ( x̄ₐ + √(2 log t / nₐ) )    (1)

where x̄ₐ represents the average reward received from arm a, nₐ the number of times arm a was pulled, and t the total number of times all arms were pulled. This makes UCB1 aware of the uncertainty of the reward from each arm when the agent makes its decisions. Although probability distributions are used to calculate these bounds, the algorithm does not sample at all and provides a deterministic choice for a given system state.

Finally, Thompson Sampling is a robust Bayesian approach first introduced by Thompson [43]. It maintains a probability distribution over the possible values of each arm’s reward and uses this distribution to make decisions. To determine which arm to pull, it draws a sample from a Beta (ℬ) distribution over the number of successes and failures of each arm, choosing the arm whose sample has the maximum value, with rewards between 0 and 1:

    a = argmaxₐ ℬ(αₐ, βₐ)    (2)

αₐ increases by the reward received, and βₐ by 1 minus the reward received. Initially, both α and β for each arm are set to 1 to establish a uniform prior distribution. Thompson Sampling is designed to effortlessly transition from primarily exploring in the initial stages to a more exploitation-oriented strategy as it acquires more information.

We carried out an Oracle-based experiment to determine the MAB algorithm of choice for the study. Using an oracle, which simulates a human creator interacting with the system, gives us total control of its behaviour. We measure the performance of the agents at various levels of human feedback accuracy, seeking an agent that performs well across all accuracy levels so that it serves a wide variety of human creators well.

We study four different agents and baselines: ε-greedy, UCB1, Thompson Sampling, and a Random Baseline, for which a uniformly random arm is chosen each time. We give the agents 3 arms to pull, of which one is “liked” and the two others are “unliked”. When pulled, each arm yields from the oracle a reward of either 1 (if liked) or 0 (otherwise). We define human feedback accuracy as the probability of the oracle giving a reward of 1 on pulling the “liked” arm and a 0 on pulling a “not liked” arm. As this value gets lower, closer to 50%, the simulated oracle becomes less clear about which arm it likes and becomes a less efficient feedback provider. We simulated 5 levels of this accuracy, from 60% to 100% in equal intervals.

ε-greedy is highly sensitive to the chosen ε parameter, and we report the best-performing ε-greedy agent, with ε = 0.2. We report the “normalized reward”, which is the agent’s reward relative to the theoretical maximum of always choosing the “liked” arm. We repeat each experiment condition 100 times and report the mean normalized reward after 10 steps, to simulate a scenario where the MI-CC agent has to learn quickly from its human counterpart, similar to our actual study.

Figure 4: Oracle experiment results on MAB algorithms, showing the agents performing at various feedback accuracy levels. Upper Bound performance, where the liked arm is always pulled, and Lower Bound performance, where a not-liked arm is always pulled, are also presented for reference.

Figure 4 summarizes the results from the Oracle experiments. As we only gave these agents 10 steps to learn the arms, the agents may not have converged yet; this is expected in a quick-learning scenario. ε-greedy performed poorly, even worse than the random baseline, likely due to its inability to quickly shift focus between exploration and exploitation. UCB1 and Thompson Sampling perform at similar levels, demonstrating their capability to calculate an upper-bound reward and use it in their decision-making process.

Although UCB1 and Thompson Sampling performed similarly, Thompson Sampling is preferred because of its sampling behavior. UCB1 schedules its exploration over a very long session in a deterministic way (exploring once after exploiting n times). As we aim for quick learning and adaptation, UCB1, without sampling, risks showing “stubbornness” towards a suboptimal arm, with no probability of unsticking itself, a behavior that is less preferred from an MI-CC perspective.
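To make the oracle protocol and the agents above concrete, here is a minimal Python sketch of the three algorithms, the random baseline, and the simulated-feedback oracle (3 arms, rewards in {0, 1}, 10 steps, 100 repetitions, ε = 0.2). This is our illustrative reconstruction, not the authors’ code; in particular, normalizing by the expected reward of always pulling the liked arm, the tie-breaking, and the initialization details are our own assumptions.

```python
import math
import random

def oracle_reward(arm, liked_arm, accuracy, rng):
    # Reward 1 for the liked arm and 0 for an unliked arm, reported
    # correctly with probability `accuracy` and flipped otherwise.
    correct = 1 if arm == liked_arm else 0
    return correct if rng.random() < accuracy else 1 - correct

def run_episode(policy, accuracy, steps=10, n_arms=3, seed=0, epsilon=0.2):
    rng = random.Random(seed)
    liked = rng.randrange(n_arms)
    counts = [0] * n_arms            # n_a: number of pulls per arm
    sums = [0.0] * n_arms            # cumulative reward per arm
    alpha = [1.0] * n_arms           # Beta(1, 1) uniform prior for Thompson
    beta = [1.0] * n_arms
    total = 0
    for t in range(1, steps + 1):
        means = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(n_arms)]
        if policy == "random":
            a = rng.randrange(n_arms)
        elif policy == "egreedy":    # explore with probability epsilon
            if rng.random() < epsilon:
                a = rng.randrange(n_arms)
            else:
                a = max(range(n_arms), key=lambda i: means[i])
        elif policy == "ucb1":       # Eq. (1); pull each unseen arm once first
            if 0 in counts:
                a = counts.index(0)
            else:
                a = max(range(n_arms),
                        key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        else:                        # "thompson", Eq. (2): arm with the largest Beta sample
            a = max(range(n_arms), key=lambda i: rng.betavariate(alpha[i], beta[i]))
        r = oracle_reward(a, liked, accuracy, rng)
        counts[a] += 1
        sums[a] += r
        alpha[a] += r                # alpha grows with the reward,
        beta[a] += 1 - r             # beta with (1 - reward)
        total += r
    return total / steps             # mean per-step reward of this episode

def mean_normalized(policy, accuracy, repeats=100):
    # Mean reward relative to the expected reward of always pulling the liked arm.
    runs = [run_episode(policy, accuracy, seed=s) for s in range(repeats)]
    return sum(runs) / (len(runs) * accuracy)
```

With this sketch, the qualitative findings above can be reproduced in miniature: over 10-step episodes, Thompson Sampling and UCB1 clearly beat the random baseline, while ε-greedy’s fixed exploration rate wastes many of its few steps.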
Thompson Sampling, on the other hand, exhibits the capability to dynamically change its exploration aggressiveness based on previous observations, while using a Bayesian prior instead of greedy sampling, both of which benefit its application in our experimental MI-CC setup. This results in both an effectively dynamic “epsilon”, compared to ε-greedy, and some randomness instead of full greediness at each step, compared to UCB1.

We chose Thompson Sampling as the MAB algorithm used in the experimental system.

B. Questionnaires used in the study

Pre-study. Four 5-point Likert scale questions are asked:

• Q1: Do you agree that you are familiar with the process of creating content, such as writing articles, drawing pictures, or creating a video game stage, using a computer? (Strongly Disagree → Strongly Agree)
• Q2: Do you agree that you are good at writing or telling a story, either real or fictional? (Strongly Disagree / Never attempted in the past 5 years → Strongly Agree)
• Q3: How frequently do you use or interface with artificial intelligence? For example, using map services to find a route to your destination, playing a game with a computer-controlled character, or using a chatbot. (Never used → Always / For as many things as possible)
• Q4: How much understanding do you have of the recent developments in Artificial Intelligence technologies? (Very unfamiliar → Very familiar / I can build one)

Post-study. Four questions are asked regarding the systems used during the study.

• Q5 (Learning, Collaboration): You were assigned a specific way to collaborate with the assistant Wands, and the assistant is not informed of this arrangement in advance. Which assistant wand learned to collaborate with you under that arrangement? If you have chosen at least one of the assistant wands, how did you know they learned from you?
• Q6 (Enjoyment, Immersion): Which assistant wand is more capable and made the collaboration easy for you? If you have chosen at least one of the assistant wands, how did the assistant(s) impress you with their capabilities?
• Q7 (Expressiveness, Exploration, Results worth effort): With these assistant wands, which collaborative experience ended up in a good story? If you have chosen at least one of the assistant wands, what do you think helped? If you chose neither, what went wrong?
• Q8: Lastly, which assistant wand would you recommend more to a friend or a colleague story writer? Please let us know if you have any other message or comment to share.

For Q5 to Q7, participants may select one, both, or neither system; for Q8, as it is a comparative question, the option of “neither” is not available. All questions are followed by an open-text question prepared to collect justifications from the participants.

C. Prompting details

Prompts for Communications start with

    “You are an AI writing assistant, collaborating with a human on the task of writing a story. You are very concise, and answer only what is absolutely necessary, without any explanations or introductions. You make sure that all your answers are surrounded by an underscore, such as _My answer_ .”

and are followed by a few examples of the tasks, along with the constraints, formed in a question-answering format; the final question does not come with an answer, and the continuation is treated as the response.
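The prompting scheme above can be sketched as follows. The helper names, the question/answer framing of the examples, and the regex-based extraction of the underscore-delimited answer are our own illustration of the described format, not the authors’ implementation.

```python
# Illustrative sketch of the Appendix C prompt format: a fixed preamble,
# a few example question/answer pairs with answers wrapped in underscores,
# and a final unanswered question whose continuation is the response.
import re

PREAMBLE = (
    "You are an AI writing assistant, collaborating with a human on the task of "
    "writing a story. You are very concise, and answer only what is absolutely "
    "necessary, without any explanations or introductions. You make sure that all "
    "your answers are surrounded by an underscore, such as _My answer_ ."
)

def build_prompt(examples, question):
    """examples: list of (question, answer) pairs; answers get underscore delimiters."""
    lines = [PREAMBLE, ""]
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: _{a}_")
    lines.append(f"Q: {question}")
    lines.append("A:")  # the LM's continuation after this line is the response
    return "\n".join(lines)

def extract_answer(continuation):
    """Pull the first underscore-delimited span out of the LM continuation."""
    m = re.search(r"_([^_]+)_", continuation)
    return m.group(1) if m else None
```

A Communication would then call `build_prompt` with task-specific examples, send the result to the underlying LM, and run `extract_answer` on the continuation to recover the constrained answer.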