Beyond Following: Mixing Active Initiative into Computational Creativity

Zhiyu Lin, Upol Ehsan, Rohan Agarwal, Samihan Dani, Vidushi Vashishth and Mark Riedl
Georgia Institute of Technology, Atlanta, Georgia, USA

11th Experimental Artificial Intelligence in Games Workshop, November 19, 2024, Lexington, Kentucky, USA.
zhiyulin@gatech.edu (Z. Lin); ehsanu@gatech.edu (U. Ehsan); rohanagarwal@gatech.edu (R. Agarwal); sdani30@gatech.edu (S. Dani); vvashishth3@gatech.edu (V. Vashishth); riedl@cc.gatech.edu (M. Riedl)
https://zhiyulin.info/ (Z. Lin); https://www.upolehsan.com/ (U. Ehsan); https://eilab.gatech.edu/mark-riedl.html (M. Riedl)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Generative Artificial Intelligence (AI) encounters limitations in efficiency and fairness within the realm of Procedural Content Generation (PCG) when human creators solely drive and bear responsibility for the generative process. Alternative setups, such as Mixed-Initiative Co-Creative (MI-CC) systems, have shown promise. Still, the potential of an active mixed initiative, where the AI takes a role beyond following, is understudied. This work investigates the influence of the adaptive ability of an active, learning AI agent on creators' expectancy of creative responsibilities in an MI-CC setting. We built and studied a system that employs reinforcement learning (RL) methods to learn the creative responsibility preferences of a human user during online interactions. Situated in story co-creation, we develop a multi-armed bandit agent that learns from the human creator, updates its collaborative decision-making beliefs, and switches between its capabilities during an MI-CC experience. In a human subject study with 39 participants, our system's learning capabilities were well recognized compared to the non-learning ablation, corresponding to a significant increase in overall satisfaction with the MI-CC experience. These findings indicate a robust association between effective MI-CC collaborative interactions, particularly the implementation of proactive AI initiatives, and deepened understanding among all participants.

Keywords
Mixed-Initiative, Co-Creativity, Human-AI Collaboration, Procedural Content Generation

1. Introduction

Recent advancements in Machine Learning (ML)-powered Artificial Intelligence (AI), such as large language models (LMs) [1] and diffusion models [2], have made a new class of tools for Procedural Content Generation (PCG) available to game creators. The dominant contemporary way for creators to control such generative AI models is via prompting: issuing textual instructions for the model to interpret and respond to [3]. That is, the user is tasked with the responsibility of issuing clear "prompts" to contextualize the AI system and make it aware of their intents. The AI is tasked to follow and fulfill the request strictly as stated. If the system does not respond with an output that satisfies the creator's wants or needs, it is incumbent upon the creator to modify the prompt and try again.

The paradigm of human creators working with generative AI via prompting is just one of many theoretical ways for a human creator and an AI system to interact [4]. There is evidence that prompting is not necessarily the best interaction paradigm; users indicate an appreciation for more varied ways of interacting with AI creative systems [5]. Other configurations of human-AI collaborative creative systems are possible that promise to reduce cognitive load, frustration, and system abandonment [6], and to make these systems more casual and enjoyable [7]. These include Mixed-Initiative (MI) systems and Co-Creative (CC) systems. Mixed-Initiative systems are those in which both human and AI systems can initiate content changes. Co-Creative systems are those in which both human and AI systems can contribute to content creation. In particular, MI-CC systems have been demonstrated in game design [8], drawing [9], and storytelling [10], and benefit from both human and AI possessing the ability to take creative initiative. While the broadest definition of co-creative systems might include any human creator working with a generative AI, the vast majority of such systems have not investigated the role of mixed initiative, especially a more active AI initiative.

At the heart of MI-CC systems is the question of whether and how the AI creative agent knows and understands (a) the intentions and goals of the human creator and (b) how the user wants to work with the AI system. These questions pose significant challenges, especially within domains critical to game designers utilizing AI, such as Computational Creativity and PCG. In other domains, the goal may be provided to the AI in advance, making it easier to identify opportunities to take the initiative with respect to contributing to a solution, the extreme of which is the AI system knowing the goal and solving it completely on its own. When it comes to creating games, however, the human creators' intent is harder to articulate completely [11]. The human creator's goals are also non-stationary and may evolve during the creative process [12, 13]. The human creator might also have a preferred working style that the agent should conform to in order to take the initiative while minimizing disruption. Once these challenges are overcome, researchers have shown that resolving such ambiguity and instability links to improved outcomes of the creative activity [14], thus benefiting the MI-CC interaction.

In this paper, we examine Co-Creative systems in a mixed-initiative setting and study the dynamics of managing creative responsibility between human and AI initiatives. We ask: What influence does an AI agent's ability to actively adapt to creators' expectancy of creative responsibility in an MI-CC system have on creator experience and perception?

In particular, we make the assumption that the AI agent is capable of working in the creative domain if given explicit prompts but is unaware of the human creator's preferences for distributing creative responsibility between the human and the AI. We explore the usage of Reinforcement Learning (RL) methods in this setting and demonstrate that the creative responsibility learning challenge in MI-CC systems can be addressed by a multi-armed bandit (MAB) algorithm that observes feedback from users iteratively, updates its beliefs, and carries out its capabilities to facilitate the MI-CC collaboration. The learning is done online, in real time, during the MI-CC process, and the human creator is not expected to have previous knowledge of the AI agent or time to pre-train it with regard to their collaboration style.

Working in the domain of structured story co-creation, we invited 39 participants to a human subject study. We quantitatively measure the human creators' perceived learning performance of the agent and their overall level of satisfaction with the collaboration. We use the Creativity Support Index (CSI) [15] to study the implications of a learning and evolving AI agent. We also report on qualitative data collected from participants, using a grounded theory [16] approach in which we identify thematic patterns in users' subjective reports of their experiences. This study reveals a higher degree of participant recognition regarding the learning capabilities of our agent, compared to the ablation, which in turn corresponded to a significant increase in overall satisfaction with our agent.¹

¹ https://github.com/xxbidiao/beyond-following-experiments

[Figure 1: Screenshot of our system in action.]

2. Background and Related Work

The procedure of an MI-CC system learning its creative responsibilities can be described as a decision-making process in which the agent communicates with the human creator, gathers information, and chooses among its capabilities. This is not as straightforward as asking human creators to prompt AI agents, because:

• Just like the Cold Start problem experienced by AI agents lacking prior preferential knowledge from their creators [17], human creators, even experts, may struggle to make inferences about the behavior of AI systems they face for the first time;
• The ability of human creators to effectively convey information to the AI depends on their communication skills, which can be a significant obstacle even in human-to-human interactions [18];
• Enforcing this AI-centric method of input requires a profound mechanical understanding of the AI system from the human creators, knowledge that does not necessarily intersect with their expertise. This marginalizes creators who do not possess the requisite expertise in utilizing AI.

For these reasons, relying solely on human creators for direct collaborative prompting, regardless of the capability of the AI models, has its limitations, leading to efficiency, cognitive load, fairness, and equity issues.

Alternatively, a model can be built from human feedback without users directly communicating their goals. Researchers have demonstrated its potential in transferring human knowledge to AI [19, 20] and in making AI learn more efficiently [21, 22]. When it comes to generating content, this is the foundation of methods such as RL from human feedback [23], which has proven to drastically improve the quality of generated text in state-of-the-art models such as GPT-4 [1].
Yet, these methods are designed to exclusively optimize for a static, known-from-data objective. They are not designed for online use, where pre-training is not feasible and the system lacks prior knowledge of new creators and needs to actively probe them.

To focus on the active probing challenge, we formalize it as a Multi-Armed Bandit (MAB) problem [24] on top of generative abilities: an AI agent needs to actively choose, under uncertainty, from its library of capabilities, based on its understanding of its human creator teammate, to minimize total regret and maximize rewards from the teammate. Multi-Armed Bandit systems have been employed in the context of resolving how to make progress in an interactive creative experience. Koch et al. [25] discussed a design ideation framework that suggests images a designer may like by exploring and exploiting an image embedding space with a variant of MAB; Gallotta et al. [26] applied MAB to generating "in-game spaceships" by enabling creator-guided walks in the latent feature space representing such spaceships. These works focused on a single type of action in the content space and concentrated on expanding the generative space of such content. Lin et al. [27, 5] instead explored the action space, characterized as types of Communications representing the information exchanges between human and AI in the co-creative process. As for the idea of switching between different high-level actions beyond the content level, building a model of the user has been proven to help in a CC setting, specifically in the domain of storytelling. Yu et al. [28] demonstrated its potential to generate stories that bring "an enjoyable experience for the players"; Gray et al. [29, 30] further demonstrated how MAB agents help to capture this player model. Vinogradov et al. [31] showcased a framework where the agent vigorously explores the creators' "player" model by directly generating "distractions", objects designed to probe players' preferences instead of providing utility toward finishing a certain task. They proposed using MAB for this task for its promise in "balancing the act of gathering information about the payout associated with each arm (exploration) and maximizing reward given the current known information (exploitation)", dynamically updating the model toward assigning tasks that the players feel more interested in tackling. These works inspire our method, as their approach of adding distractions is comparable to our agent carrying out its initiative while directly changing the creative content.

3. Study Design

In this section, we present the study we designed to examine the AI agent we created that adapts to creators' expectancy of creative responsibility. We seek to determine how this changes the perception of the creators toward the AI and the creative experience the system supplies to the human creators.

3.1. Task Setup

The Delegation Setup. For the experiments, we spotlight a specific but generalizable collaborative setup: learning a delegation. In this setup, both parties take on a subset (or the entirety, if preferred) of the responsibilities in an MI-CC activity toward the common goal. The human creator concentrates on specific parts of the creative task while not losing control of the other parts; the AI agent needs to strategically shift its focus toward the parts that the human creator is not focusing on and actively determine how to make improvements. Furthermore, as these interactions are not without cost, such as creators' cognitive load, it is also important to minimize such costs while learning these responsibilities. We denote the expected and delegated responsibility that the AI agent needs to learn during the interaction as the preferred work style for a particular human creator.

Domain: Storytelling. Given the mounting interest in co-creative storytelling [32], the established research foundation within story generation, its high relevance to game development, and its inherent complexity with regard to PCG, we select story generation as a proving ground for our proposed method. The expertise of the team and the advancements in open-source Large LMs readily available to us facilitated implementation; this allows us to focus on the human factors of the MI-CC experience and the AI agent itself. For our experimental system, we use Llama2-13b-chat [33] as the LM, readily available at the time of the study while responsive enough for the interactive experience.

3.2. Experimental AI System Overview

We now describe the AI system we built for the purpose of the study. The experimental system is based on the Creative Wand framework [27] and contains the following four components:

3.2.1. Creative Context

The Creative Context is the abstraction of generative models for this system. In this paper, we study stories containing four components inspired by Narrative Arc theory: the beginning, the development (or rising action), the climax, and the conclusion. We design an AI framework that writes each component of the story using language models and prompt engineering (see Appendix C for more details). Both the human participant and the AI are instructed to write about 20 to 30 words per component, and the target length of the whole story is around 100 words. Once the model is set up, it takes requests from Communications.

3.2.2. Communications

Communications describe the interactions between the human creators and the AI; they also double as the capabilities the AI agent possesses.
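The per-component generation described in subsubsection 3.2.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `call_llm` is a hypothetical stand-in for the Llama2-13b-chat endpoint, and the prompt wording is invented rather than taken from Appendix C.

```python
# Minimal sketch of the four-component Creative Context described above.
# Hypothetical: `call_llm` stands in for the Llama2-13b-chat endpoint, and
# the prompt wording is illustrative, not the actual Appendix C prompts.

COMPONENTS = ["beginning", "development", "climax", "conclusion"]

def call_llm(prompt: str) -> str:
    """Placeholder for a language-model call; returns canned text here."""
    return "<generated text, 20 to 30 words>"

def write_component(story: dict, component: str) -> dict:
    """(Re)write one narrative-arc component, leaving the rest untouched."""
    context = " ".join(story[c] for c in COMPONENTS if story.get(c))
    prompt = (
        f"Here is a story in progress: {context!r}\n"
        f"Write the {component} of this story in 20 to 30 words."
    )
    updated = dict(story)  # do not mutate the caller's copy
    updated[component] = call_llm(prompt)
    return updated

story = {c: "" for c in COMPONENTS}
story = write_component(story, "beginning")
```

Keeping each component in its own field matches the study setup, where human and AI share the same four text boxes and either party may rewrite a single component without disturbing the others.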
To focus on how the agent chooses its creative responsibilities, we implement a minimalistic yet complete set of capabilities for the creative experience. This allows us to focus on research questions about the creative experience while minimizing the cognitive load of the participants. Our agent possesses the following capabilities, implemented as prompts to the LM describing the responsibilities (see Appendix C for details):

• (Re)write the beginning and development;
• (Re)write the climax and conclusion;
• Write a review of the story: one sentence positive, one negative, and one suggestion for improvement.

3.2.3. Experience Manager and Frontend

These two modules manage the interactive experience and workflow. We implement a Finite State Machine to manage the experience. Figure 2 shows the states and the overall flow of interaction each participant experiences in one experiment session.

[Figure 2: One round of interaction of our experimental system. Each participant experiences multiple turns per session.]

One session of the MI-CC experience is separated into multiple "turns", in which both parties iteratively improve the story, sharing the same text fields in the editing process. The participants are not directly notified of the internal states of the system.

Human Initiative. During this phase, human creators contribute to the story by making edits in any of the four text fields. This phase ends when the agent decides to take the initiative. We implement a point-based heuristic based on pilot studies: the agent assigns points for the changes it observes and takes the initiative whenever enough points are accumulated, signifying substantial edits from the human creator, using the following criteria:

• Each new character adds 5 points;
• Each time the human creator switches between fields after any changes, 100 points are added;
• Whenever the human creator leaves a text field with 200 points accumulated (roughly one full sentence or two minor changes), the agent takes the initiative by locking the editing interface and resetting the counter.

This heuristic provides two advantages over other ways this decision could be made: first, it is computationally fast and enables responsive interactions; second, it additionally provides visualization for the users. As shown in Figure 1, we present this right above the text boxes for the stories, with a text hint and a progress bar representing the ideation process of the agent. We additionally provide a "skip" function that forces agent initiative.

Agent Initiative. In this phase, the agent decides which capability best fosters the collaborative experience and carries out the corresponding Communication. We build a Multi-Armed Bandit-based agent responsible for choosing which Communication to invoke, with Thompson Sampling as the chosen algorithm for the experimental system. Formally, an agent A interacts with a set of K arms a_1, ..., a_K, each of which is associated with a Communication (and its underlying capabilities) and an unknown reward distribution. Whenever an arm is pulled, the agent seeks feedback from the human creator on the initiative, which is treated as a reward signal (see the next paragraph). The goal of the agent is to maximize the total reward obtained by repeatedly pulling arms during the session. See Appendix A for more details on the design choices of the MAB agent. Once an arm is pulled, the agent executes a Communication, interacts with the user, and updates the story as needed.

Learning from the human. The system asks participants about (Action Feedback) the way they just worked together and (Content Feedback) the updates and content changes. The participants choose between "Good" (reward of 1) and "Bad" (reward of 0). "Bad" feedback on generated text leads to a reversion to the original content, though it is not used to improve the LM in any way. A weighted mean is employed to integrate both types of feedback into a single reward signal. For the study, a weight of 80% is applied to the Action Feedback and 20% to the Content Feedback.
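The agent-initiative loop above, Thompson Sampling over the three Communications with the 80/20 weighted feedback, can be sketched as a Beta-Bernoulli bandit. This is an illustrative reconstruction, assuming Beta(1, 1) priors and a fractional Beta update for the blended reward; the paper's actual design choices are discussed in its Appendix A, and the arm names here are shorthand.

```python
import random

# Sketch of a Beta-Bernoulli Thompson Sampling agent for the three
# Communications. Assumptions (not from the paper): Beta(1, 1) priors and
# a fractional Beta update for the weighted reward in [0, 1].

ARMS = ["rewrite_beginning_development", "rewrite_climax_conclusion", "review"]
W_ACTION, W_CONTENT = 0.8, 0.2  # feedback weights used in the study

class ThompsonAgent:
    def __init__(self, arms):
        # Beta(1, 1) priors: uniform belief over each arm's success rate.
        self.params = {arm: [1.0, 1.0] for arm in arms}

    def select(self):
        # Sample a plausible success rate per arm; pull the best sample.
        samples = {a: random.betavariate(al, be)
                   for a, (al, be) in self.params.items()}
        return max(samples, key=samples.get)

    def update(self, arm, action_fb, content_fb):
        # Blend the two binary feedbacks (Good=1, Bad=0) into one reward.
        reward = W_ACTION * action_fb + W_CONTENT * content_fb
        alpha, beta = self.params[arm]
        # Fractional Beta update for a non-binary reward in [0, 1].
        self.params[arm] = [alpha + reward, beta + (1.0 - reward)]
        return reward

agent = ThompsonAgent(ARMS)
arm = agent.select()
r = agent.update(arm, action_fb=1, content_fb=0)  # reward = 0.8*1 + 0.2*0
```

Because arms with little evidence keep wide Beta posteriors, the agent naturally keeps probing underused Communications while concentrating on the ones the creator rewards, which is the exploration-exploitation balance the paper relies on.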
This prioritizes learning action-level responsibilities over preferences about the LM-generated text, for which the full system and the baseline share an implementation. The reward signal is then used to train the agent. For this experiment, an MAB agent with Thompson Sampling is used in the experimental system; see Appendix A for a discussion and experiments related to this choice.

Once the learning process is complete, "human initiative" starts again. To maintain user engagement, text responses are varied each time to avoid repetitiveness, while contextual hints are also strategically provided throughout the experience. Figure 1 shows the user interface.

3.3. Study Methodology

To study the perception of human creators toward MI-CC systems equipped with these learning capabilities, we conduct a study, summarized in Figure 3, on the AI system.

[Figure 3: Participants' experience during the study.]

We compare our system, the "Full" system, with an ablation named "baseline". The "baseline" ablation does not learn: it chooses each of the 3 Communications with a 1/3 probability at all times and provides only a reverting option when "asking for feedback". These systems are codenamed "Echo Wand" and "Harmony Wand" respectively, so as not to reveal the details of the systems to the participants during the study.

We recruited 39 United States participants² on Prolific³ with adequate English proficiency. Each experiment session lasted approximately 40 minutes, and we paid the participants $15 per hour for full completion of the study.

Pre-study. Before the experience, participants answer four 5-point Likert-scale questions on (Q1) expertise in Computer-Assisted Designing (CAD), (Q2) expertise in writing stories, (Q3) frequency of using AI, and (Q4) understanding of AI.⁴ We then present instructions to familiarize the participants with our systems by providing annotated screenshots of the interface: a copy of Figure 1 with additional numeric overlays, descriptions of components, and a brief introduction to the workflow of co-creating a story. They are then assigned the delegation task: to focus on writing the beginning and the development of the story while leaving the other parts of the story to the AI as much as possible. They are also made aware that the AI does not know this setup in advance.

Experience. Participants are assigned to interact with the full system and the baseline ablation, presented in random order, counter-balanced. They are given 10 turns in each of the 2 sessions.

Post-study. After participants finished the two sessions, they were asked about the process they had just experienced. Inspired by the Creativity Support Index (CSI) [15] used in previous studies, we ask questions based on dimensions related to creative support perception and the overall collaborative experience, grouped to facilitate richer responses from the participants while maintaining their engagement in the survey. Specifically, we ask which system(s) are (Q5, Learning, Collaboration) learning to collaborate, (Q6, Enjoyment, Immersion) more capable and easy to work with, and (Q7, Expressiveness, Exploration, Results worth effort) enabling better stories. For Q5 through Q7, participants can choose either system, both systems, or neither, so totals can exceed 100%. We ask one final question (Q8) on which system they would recommend more, framed in a win-draw-lose format. Although these questions are presented in the same order for all participants, the order of the options is randomized to reduce bias toward any system. All questions are followed by an open-text question prepared to collect justifications from the participants.

² Only counting participants who finished the whole study with valid sessions and responses.
³ prolific.co
⁴ See Appendix B for the full question text.

4. Quantitative Results

4.1. Creative Background

Table 1 shows a summary of the creative backgrounds of the participants.

Q (see Appendix B for full questions)   1   2   3   4   5   Average   Median
Q1: CAD skills                          1   1   2  19  16   4.23      4
Q2: Writing skills                      1   0   7  20  11   4.03      4
Q3: Frequency of using AI               0   0  16  11  12   3.90      4
Q4: Understanding of AI Tech.           0   5  14  19   1   3.41      4

Table 1: Creative background of the participants. 1 = Most Negative, 5 = Most Positive.

Although a median of 4 on all questions implies that participants are familiar with recent advancements in AI, when specifically asked whether they could build such a system, only 1 participant answered "yes" (5 on Q4), meaning that most of the participants do not have a technical background. However, compared to the 26% reported in [5], we observed 87% of the participants being at least "somewhat familiar" (3+) with recent AI technologies, and 51% being "familiar" (4+); the experience of using commercially available Large LM-based agents may have a profound effect on how participants, in general, collaborate with AI systems.

4.2. Quantitative Results

We begin by presenting the quantitative results of the study through the choices made by the participants in the multiple-choice questions.

When asked which system(s) learned to collaborate with them under the delegation arrangement (Q5), the "Full" system is chosen 69% of the time (n = 39), compared to 51% for the baseline (p < 0.018, under a binomial test where H0 := no observable difference in distribution; the same holds for all p-values in this section). We clearly see the "Full" system, with learning capabilities enabled, being perceived as significantly better at learning the delegation than the baseline, demonstrating, from the human creators' perspective, the effectiveness of the MAB-based model at learning from their feedback.

When asked which system they would recommend (Q8), this trend persists: our system is preferred (wins) 43.6% of the time, versus 20.5% (loses) for the baseline (p < 0.001); 35.9% of the participants have no preference (draw). The "Full" system differs from the baseline only in the learning capabilities and the corresponding frontend elements, yet we see a statistically significant improvement in preference toward our "Full" system, illustrating the potential of our method in enhancing the MI-CC experience and making such systems better for human creators.
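The binomial tests reported above can be computed as an exact binomial tail probability. The sketch below uses only the standard library; the counts in the example are hypothetical, since the per-participant contingencies behind each reported p-value are not reproduced in this section.

```python
from math import comb

# Exact one-sided binomial test, as a sketch of the kind of test reported
# above. Under H0 (no difference between systems), each informative
# response favors either system with probability 0.5.

def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """One-sided P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i))
               for i in range(k, n + 1))

# Hypothetical example (not the paper's data): of 22 participants whose
# answers favored exactly one system, 17 favored the "Full" system.
p_value = binomial_tail(17, 22)
```

In practice the same computation is available as `scipy.stats.binomtest`, which also provides two-sided alternatives; the stdlib version is shown here only to make the null hypothesis explicit.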
Next, method in enhancing MI-CC experience and making we analyzed the data using axial codes, which in- such system better for human creators. volves finding relationships between the open codes When it comes to which system(s) gave a good story and clustering them into different emergent themes. (Q7), 72% of the participants agree that the “Full” sys- Through an iterative process performed until consen- tem made a good story, while 69% selected the baseline sus was reached, we share the most salient themes that system (𝑝 > 0.05). We were unable to statistically emerged from axial codes. determine whether an agent learning the delegation would produce a better story; This is expected, We focused on studying the sharing of responsibilities and A MI-CC system that understands the intents of enforced a delegation setting. In an actual MI-CC the human creators and follows them by learning is experience, without such a prior, A human creator overall favored and collaborates well with the cre- would utilize the agent’s learning capability to pro- ators. Participants demonstrated their observation mote their strengths and discourage their weaknesses, of the learning capabilities of the “full” system, identi- and an improvement in perceived performance is more fying them as “better about learning that I specifically likely to be observed in that setting. wanted help with” (P34) and “listened to my feed- Finally, when queried about the collaboration itself back.”(P39). In comparison, the baseline system is (Q6), 62% of the participants think the “Full” system identified as “did less of the work ... did not necessarily is capable and made the collaboration easy, while 56% learn what its role was expected to be” (P19). this voted for the baseline system (p>0.05). We also were resulted in a preference for the Full system for P32, unable to statistically determine whether the “Full” as the Full system is quoted as a “more useful helper". system is more enjoyable and immersive. 
Although the This aligns with the quantitative observations. difference between the “Full” system and the baseline is substantial enough both implementation-wise and Good content suggestions may give people the feel- towards the perception of learning, from the angle of ing that the system is learning how to collaborate the user interface, the only difference is 10 additional with them, regardless of how AI is actually doing questions from the “Full” system per session. Previ- so. Despite specifically asking participants to discuss ously, Larsson et al. [34] reported that “there was a whether the agent has “learned to collaborate with clear trend that the visual ... was rather important to you under that arrangement” (Q5), Participants are the subject’s relationship towards the MI-CC.” while also rating the system based on the generated content: these “relationships” are directly linked to creators’ (P25, emphasis asked) This one learned perception of immersion of the experience; Ehsan et from me because it was able to build off al. [35] additionally pointed out that even when an AI of my original foundation of my story that system presents the same underlying information, how I typed. it is presented influences the perceptions of human users. We may have observed this effect from a differ- P18, who rated their familiarity with AI as Familiar ent angle, where a lack of differences in presentation (4 out of 5) and AI usage as “Always / as much as may have caused the indifference of the participants. possible” (5 out of 5), wrote that the “Full” system is To that end, the difference between the two systems on learning from them: I could see Echo Wand adding more de- ... I was in control of the final text to tail and building out more creatively accept changes or not, or to make my than with Harmony Wand. own. 
This participant is familiar with recent generative AI and mentions "adding details" and "building," traits that these AI models are optimized for. As both the "Full" system and the baseline use the same underlying generative AI capabilities, P18 could not distinguish between the "improvements" in the generated contents and the performance of the MAB-based agent. Apparent improvements in the generated stories may result from a wide range of causes, such as participants providing different input or the LM sampling differently, unrelated to both the underlying LM and the learner, creating noise in the perception of participants.

Diversity is also important; it may not be the best strategy for a learning agent to always pick the "best options", and sometimes the agent may want to intentionally surprise its teammates. P23 was impressed by the range of capabilities both agents possess: "They were both impressive, being able to take my story and to word it better, or even add things to change it to make it better". When asked about the generated story, P39 mentioned that "Both of them gave bad stories." and "I need much more control and options". Curiously, this is the same participant who enjoyed the agent that "listened to my feedback". P36 preferred the baseline system, which executes random actions:

I did all of the work with Echo, despite my best efforts to get it to collaborate with me. Harmony had much more interesting suggestions and rightfully pointed out when a section became too dense. It balanced the second two sections to match my intro and build up, unlike Echo who almost refused to work on them.

Creator control is important, and creators may want their ideas to be included even when the AI can provide better candidates. Beyond the need for control mentioned by P39, P28 mentioned that they were impressed by the capabilities of both systems to "finish the story that I started with." (emphasis added). P27 elaborated in their justification: "... I was in control of the final text to accept changes or not, or to make my own." In a system involving a creator who wishes to create content to their liking, it is expected that the creator wishes to retain as much control as possible. However, if the AI agent does not have any final say on the contents, should we expect it to take on any creative responsibilities? Although we acknowledge that this is more of a philosophical question, well beyond the scope of our work, what if the agent could understand what its counterpart is actually seeking and use this information to determine which contributions it should commit to?

6. Discussions

Distilling these findings, ranging from the perception of collaboration, good writing skills, and diversity in capabilities to creators' need for control, a common implication surfaces: get the mental model of the creators right and the system will succeed; get it wrong and failure cases will surface. A mental model is described by Kieras et al. [37] as an "understanding ... that describes the internal mechanism" of the system a human is operating; Leslie et al. [38] further point out that a theory of mind is a mechanism that humans express naturally toward an understanding of thinking, in our context, of their teammate AI. The success of our "Full" system at learning arises from its ability to learn a model of how the creators wish to collaborate with it, and the reward given by a teammate can be treated as a reward for correctly understanding their model. The need for diversified responses and for more respect for the control signals users impose also falls into this paradigm, but goes beyond it: understanding how these reward signals should be used beyond "picking the best", and how to capture hints that new actions or capabilities are needed, could greatly improve collaborations with MI-CC systems. This falls into the subfield of "novelty detection and adaptation" [39] situated in RL, which is known to be challenging, if solvable at all with ML methods, as ML models can only rely on their extrapolation capabilities toward the "unknowns", which may not hold for all novelties; this will be a rewarding pathway toward better MI-CC systems, if not agentic AI overall.

We begin to see a consistent narrative: creators interpret the capabilities of our learning AI agent as an attempt by the agent to build a mental model of them. Because our agent determines which Communication to use and its effect on the contents being collaborated on, we observe the participants treating both proper learning of Communication choices (expected) and the content generated (emerging) as evidence that the agent is learning from them, and as traits leading to their preferences toward these systems. This also, to some extent, explains the placebo effect we observe with the baseline system: around half of the participants believe that the baseline system is learning from them, significantly more than zero, despite the baseline system only making decisions randomly.

For this study, we assigned delegation tasks to the participants. This is only a subset of the possible responsibilities that the AI agent can take and the human creators may expect. Lin et al. [5] have shown that a system with more coverage of the design space, providing more diversified options, is preferred. Our study design, which focused on studying the learning process, limited the variety of capabilities the agent may perform. To that end, once such an MI-CC system is put into use beyond research, it will be necessary to diversify both the capability pool and the process by which the AI agent chooses from it, potentially providing surprise and unpredictability to further inspire the users.
In this controlled comparative study, to avoid a bias towards either of the systems, we intentionally did not disclose any difference between the "Full" system and the baseline. This perception may also have arisen from the capability of our agent to generate parts of stories that follow the context the participants provided. Although we acknowledge that these factors are hard to decouple, this finding also hints at the potential of our methods for understanding the human creator holistically. Ehsan et al. [35] pointed out that the background of human users determines their cognitive heuristics, which play a role in their expectations beyond what the designers of the systems expected in the first place. They also observed that, if not treated carefully, AI systems can actually introduce such placebo effects as a pitfall [40], misleading human users into appreciating their trustworthiness and power without the development of underlying AI capabilities. Standing on these findings, a promising direction of research is to carefully identify the effect of the expectations of both parties involved in the MI-CC process, and how they dynamically change during the collaboration.

7. Conclusions

In this paper, we showcased how an MI-CC system is capable of listening to human feedback and improving itself towards a better understanding of how it should collaborate with human creators in a storytelling domain. Inviting 39 participants and comparing two such systems with and without these learning capabilities, we found that this capability was well recognized by the participants and led to better satisfaction overall. To this end, we further encourage the designers of MI-CC systems to pay attention to both the human creators and the AI agent, and to study how each party should, or already does, adapt to and create mental models of its counterpart, based on the creative roles taken, their previous experience and capabilities, and, most importantly, the wishes of the human creators.

References

[1] OpenAI, GPT-4 Technical Report, 2023. URL: http://arxiv.org/abs/2303.08774. doi:10.48550/arXiv.2303.08774, arXiv:2303.08774 [cs].
[2] P. Dhariwal, A. Nichol, Diffusion Models Beat GANs on Image Synthesis, Advances in Neural Information Processing Systems 34 (2021) 8780–8794. arXiv:2105.05233.
[3] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023) 1–35. arXiv:2107.13586.
[4] Z. Lin, M. Riedl, An Ontology of Co-Creative AI Systems, arXiv preprint arXiv:2310.07472 (2023).
[5] Z. Lin, U. Ehsan, R. Agarwal, S. Dani, V. Vashishth, M. Riedl, Beyond Prompts: Exploring the Design Space of Mixed-Initiative Co-Creativity Systems, Proceedings of the 14th International Conference on Computational Creativity (2023) 64–73. URL: http://arxiv.org/abs/2305.07465. doi:10.48550/arXiv.2305.07465, arXiv:2305.07465 [cs].
[6] J. Sweller, Cognitive load theory, in: Psychology of Learning and Motivation, volume 55, Elsevier, 2011, pp. 37–76.
[7] K. Compton, M. Mateas, Casual creators, in: Proceedings of the Sixth International Conference on Computational Creativity, 2015, p. 228.
[8] A. Liapis, G. N. Yannakakis, C. Alexopoulos, P. Lopes, Can computers foster human users' creativity? Theory and praxis of mixed-initiative co-creativity, DCE (2016). URL: https://www.um.edu.mt/library/oar/handle/123456789/29476. Publisher: DCE.
[9] N. Davis, C.-P. Hsiao, K. Y. Singh, L. Li, S. Moningi, B. Magerko, Drawing Apprentice: An Enactive Co-Creative Agent for Artistic Collaboration, in: Proceedings of the 2015 ACM SIGCHI Conference on Creativity and Cognition, C&C '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 185–186. URL: https://doi.org/10.1145/2757226.2764555. doi:10.1145/2757226.2764555.
[10] A. Alvarez, J. Font, J. Togelius, Story Designer: Towards a Mixed-Initiative Tool to Create Narrative Structures, Proceedings of the 17th International Conference on the Foundations of Digital Games (2022) 1–9. URL: http://arxiv.org/abs/2210.09294, arXiv:2210.09294 [cs].
[11] M. O. Riedl, Human-centered artificial intelligence and machine learning, Human Behavior and Emerging Technologies 1 (2019) 33–36. Publisher: Wiley Online Library.
[12] N. Davis, C.-P. Hsiao, Y. Popova, B. Magerko, An enactive model of creativity for computational collaboration and co-creation, Creativity in the Digital Age (2015) 109–133. Publisher: Springer.
[13] M. Guzdial, N. Liao, M. Riedl, Co-Creative Level Design via Machine Learning, Fifth Experimental AI in Games Workshop (2018). URL: http://arxiv.org/abs/1809.09420, arXiv:1809.09420.
[14] F. Zenasni, M. Besançon, T. Lubart, Creativity and tolerance of ambiguity: An empirical study, The Journal of Creative Behavior 42 (2008) 61–73. Publisher: Wiley Online Library.
[15] E. Cherry, C. Latulipe, Quantifying the creativity support of digital tools through the creativity support index, ACM Transactions on Computer-Human Interaction (TOCHI) 21 (2014) 1–25. Publisher: ACM New York, NY, USA.
[16] B. Glaser, A. Strauss, Discovery of Grounded Theory: Strategies for Qualitative Research, Routledge, 2017.
[17] J. Bobadilla, F. Ortega, A. Hernando, J. Bernal, A collaborative filtering approach to mitigate the new user cold start problem, Knowledge-Based Systems 26 (2012) 225–238. Publisher: Elsevier.
[18] S. M. Grover, Shaping effective communication skills and therapeutic relationships at work: The foundation of collaboration, AAOHN Journal 53 (2005) 177–182. Publisher: SAGE Publications Sage CA: Los Angeles, CA.
[19] W. Bradley Knox, P. Stone, TAMER: Training an Agent Manually via Evaluative Reinforcement, in: 2008 7th IEEE International Conference on Development and Learning, IEEE, Monterey, CA, 2008, pp. 292–297. URL: http://ieeexplore.ieee.org/document/4640845/. doi:10.1109/DEVLRN.2008.4640845.
[20] G. Warnell, N. Waytowich, V. Lawhern, P. Stone, Deep TAMER: Interactive agent shaping in high-dimensional state spaces, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. Issue: 1.
[21] Z. Lin, B. Harrison, A. Keech, M. O. Riedl, Explore, Exploit or Listen: Combining Human Feedback and Policy Model to Speed up Deep Reinforcement Learning in 3D Worlds, arXiv:1709.03969 [cs] (2017). URL: http://arxiv.org/abs/1709.03969.
[22] D. Arumugam, J. K. Lee, S. Saskin, M. L. Littman, Deep Reinforcement Learning from Policy-Dependent Human Feedback, 2019. URL: http://arxiv.org/abs/1902.04257, arXiv:1902.04257 [cs].
[23] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, G. Irving, Fine-tuning language models from human preferences, arXiv preprint arXiv:1909.08593 (2019).
[24] J. Vermorel, M. Mohri, Multi-armed bandit algorithms and empirical evaluation, in: European Conference on Machine Learning, Springer, 2005, pp. 437–448.
[25] J. Koch, A. Lucero, L. Hegemann, A. Oulasvirta, May AI? Design Ideation with Cooperative Contextual Bandits, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1–12. URL: https://doi.org/10.1145/3290605.3300863.
[26] R. Gallotta, K. Arulkumaran, L. B. Soros, Preference-Learning Emitters for Mixed-Initiative Quality-Diversity Algorithms, IEEE Transactions on Games (2023) 1–14. doi:10.1109/TG.2023.3264457.
[27] Z. Lin, R. Agarwal, M. Riedl, Creative Wand: A System to Study Effects of Communications in Co-creative Settings, Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18 (2022) 45–52. URL: https://ojs.aaai.org/index.php/AIIDE/article/view/21946. doi:10.1609/aiide.v18i1.21946.
[28] H. Yu, M. Riedl, Data-driven personalized drama management, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 9, 2013, pp. 191–197. Issue: 1.
[29] R. C. Gray, J. Zhu, D. Arigo, E. Forman, S. Ontañón, Player modeling via multi-armed bandits, in: Proceedings of the 15th International Conference on the Foundations of Digital Games, 2020, pp. 1–8.
[30] R. C. Gray, J. Zhu, S. Ontañón, Multiplayer Modeling via Multi-Armed Bandits, in: 2021 IEEE Conference on Games (CoG), IEEE, 2021, pp. 01–08.
[31] A. Vinogradov, B. Harrison, Using Multi-Armed Bandits to Dynamically Update Player Models in an Experience Managed Environment, Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18 (2022) 207–214. URL: https://ojs.aaai.org/index.php/AIIDE/article/view/21965. doi:10.1609/aiide.v18i1.21965.
[32] M. Behrooz, Y. Tian, W. Ngan, Y. Yungster, J. Wong, D. Zax, Holding the Line: A Study of Writers' Attitudes on Co-creativity with AI, 2024. URL: http://arxiv.org/abs/2404.13165, arXiv:2404.13165 [cs].
[33] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. arXiv:2307.09288.
[34] G. Larsson, V. Lindecrantz, How an AI colleague affect the experiance of content creation, 2023. URL: https://www.diva-portal.org/smash/get/diva2:1780852/FULLTEXT02.
[35] U. Ehsan, S. Passi, Q. V. Liao, L. Chan, I.-H. Lee, M. Muller, M. O. Riedl, The who in explainable AI: How AI background shapes perceptions of AI explanations, in: Proceedings of the CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–32. arXiv:2107.13509 [cs.HC].
[36] J. Aronson, A pragmatic view of thematic analysis, The Qualitative Report 2 (1994) 1–3.
[37] D. E. Kieras, S. Bovair, The role of a mental model in learning to operate a device, Cognitive Science 8 (1984) 255–273. URL: https://www.sciencedirect.com/science/article/pii/S0364021384800038. doi:10.1016/S0364-0213(84)80003-8.
[38] A. M. Leslie, O. Friedman, T. P. German, Core mechanisms in 'theory of mind', Trends in Cognitive Sciences 8 (2004) 528–533. Publisher: Elsevier.
[39] J. Balloch, Z. Lin, M. Hussain, A. Srinivas, R. Wright, X. Peng, J. Kim, M. Riedl, NovGrid: A flexible grid world for evaluating agent response to novelty, arXiv preprint arXiv:2203.12117 (2022).
[40] U. Ehsan, M. O. Riedl, Explainability pitfalls: Beyond dark patterns in explainable AI, Patterns 5 (2024). Publisher: Elsevier.
[41] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[42] R. Agrawal, Sample mean based index policies by O(log n) regret for the multi-armed bandit problem, Advances in Applied Probability 27 (1995) 1054–1078.
Publisher: Cambridge University Press.
[43] W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika 25 (1933) 285–294. Publisher: Oxford University Press.

A. Choosing a MAB algorithm

In this section, we provide more information on the design choice of the MAB agent. Following results from Vinogradov et al. [31], we looked into three representative MAB algorithms: $\epsilon$-greedy, UCB1, and Thompson Sampling.

$\epsilon$-greedy [41], widely used in RL, works on a simple principle: the agent has probability $\epsilon$ (a hyperparameter) of choosing a random action (explore) instead of performing the best action from its policy (exploit).

UCB1, or Upper Confidence Bound 1 [42], instead takes a more deterministic approach: this algorithm calculates an "Upper Confidence Bound" for each arm, considering both the current running average of the rewards and the uncertainty due to lack of sampling:

$a = \mathrm{argmax}_a\left(\bar{x}_a + \sqrt{2 \log t / n_a}\right)$   (1)

where $\bar{x}_a$ represents the average reward received from arm $a$, $n_a$ the number of times arm $a$ was pulled, and $t$ the total number of times all arms were pulled. This makes UCB1 aware of the uncertainty of the rewards from each arm when the agent makes its decisions. Although probability distributions are used to calculate these bounds, the algorithm does not sample at all and provides a deterministic choice for a given system state.

Finally, Thompson Sampling is a robust Bayesian approach first introduced by Thompson [43]. It maintains a probability distribution over the possible values of each arm's reward, with rewards between 0 and 1, and uses this distribution to make decisions. To determine which arm to pull, it draws one sample per arm from a Beta ($\mathcal{B}$) distribution over the number of successes and failures for that arm, choosing the arm with the maximum sample:

$a = \mathrm{argmax}_a\left(\mathcal{B}(\alpha_a, \beta_a)\right)$   (2)

$\alpha_a$ increases by the reward received, and $\beta_a$ increases by 1 minus the reward received. Initially, both $\alpha$ and $\beta$ for each arm are set to 1 to establish a uniform prior distribution. Thompson Sampling is designed to transition smoothly from primarily exploring in the initial stages to a more exploitation-oriented strategy as it acquires more information.

We carried out an oracle-based experiment to determine the MAB algorithm of choice for the study. Using an oracle, which simulates a human creator interacting with the system, gives us total control of its behaviour. We measure the performance of the agents at various levels of human feedback accuracy, seeking an agent that performs well at all accuracy levels so that it serves a wider variety of human creators well.

We study four different agents and baselines: $\epsilon$-greedy, UCB1, Thompson Sampling, and a Random Baseline, where a uniformly random arm is chosen each time. We give the agents 3 arms to pull, where one is "liked" and the two others are "unliked". When pulled, each arm receives from the oracle either a reward of 1 if liked or 0 otherwise; we define human feedback accuracy as the probability of the oracle giving a reward of 1 on pulling the "liked" arm and a 0 on pulling a "not liked" arm. As this value gets lower, closer to 50%, the simulated oracle becomes less clear about which arm it liked and becomes a less efficient feedback provider. We simulated 5 levels of this accuracy, from 60% to 100% in equal intervals.

$\epsilon$-greedy is highly sensitive to the $\epsilon$ parameter chosen, and we report the best-performing $\epsilon$-greedy agent, with $\epsilon = 0.2$. We report "normalized rewards", which is the agent's reward relative to the theoretical maximum of always choosing the "liked" arm. We repeat each experiment condition 100 times and report the mean normalized rewards after 10 steps, to simulate a scenario where the MI-CC agent has to quickly learn from its human counterparts, similar to our actual study.

Figure 4 summarizes the results from the oracle experiments. (Figure 4: Oracle experiment results on MAB algorithms of the agents performing at various feedback accuracy levels. Upper Bound performance, where the liked arm is always pulled, and Lower Bound performance, where a not-liked arm is always pulled, are also presented for reference.) As we only gave these agents 10 steps to learn the arms, the agents may not have converged yet; this is expected in a quick-learning scenario. $\epsilon$-greedy performed poorly, even worse than the random baseline, likely due to its inability to quickly shift focus between exploration and exploitation; UCB1 and Thompson Sampling perform at similar levels, demonstrating their capability to calculate an upper-bound reward and use it in their decision-making process.

Although UCB1 and Thompson Sampling performed similarly, Thompson Sampling is preferred because of its sampling behavior. UCB1 schedules its exploration over a very long session in a deterministic way (exploring once after exploiting $n$ times). As we aim for quick learning and adaptation, without sampling, UCB1 risks showing "stubbornness" towards a suboptimal arm without any probability of unsticking itself, a behavior that is less preferred from an MI-CC perspective. Thompson Sampling, on the other hand, exhibits the capability to dynamically change its exploration aggressiveness based on previous observations, while using a Bayesian prior instead of greedy sampling, both benefiting its application in our MI-CC setup. This results in both an effectively dynamic "epsilon" compared to $\epsilon$-greedy and some randomness, instead of being fully greedy at each step, compared to UCB1.
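To make the three policies concrete, the selection and update rules above can be sketched in a few lines of Python together with a simplified oracle. This is a minimal sketch for exposition, not the study's code; the class names and the Bernoulli-reward oracle are our illustrative assumptions.

```python
import math
import random


class EpsilonGreedyAgent:
    """Epsilon-greedy: explore uniformly with probability epsilon."""

    def __init__(self, n_arms, epsilon=0.2):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        means = [s / c if c else 0.0 for s, c in zip(self.sums, self.counts)]
        return means.index(max(means))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


class UCB1Agent:
    """Deterministic UCB1 index policy (Eq. 1)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0

    def select(self):
        for arm, c in enumerate(self.counts):
            if c == 0:  # pull every arm once before using the index
                return arm
        bounds = [
            self.sums[a] / self.counts[a]
            + math.sqrt(2.0 * math.log(self.t) / self.counts[a])
            for a in range(len(self.counts))
        ]
        return bounds.index(max(bounds))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
        self.t += 1


class ThompsonAgent:
    """Beta-Bernoulli Thompson Sampling (Eq. 2), uniform Beta(1, 1) prior."""

    def __init__(self, n_arms):
        self.alpha = [1.0] * n_arms  # grows with reward received
        self.beta = [1.0] * n_arms   # grows with (1 - reward)

    def select(self):
        # One draw per arm from its posterior; pull the largest draw.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward


def run_oracle(agent, liked_arm=0, accuracy=0.9, steps=10):
    """Simulated creator: gives correct feedback with probability `accuracy`.

    Returns the normalized reward, i.e. total reward relative to the
    expected reward of always pulling the liked arm.
    """
    total = 0.0
    for _ in range(steps):
        arm = agent.select()
        correct = random.random() < accuracy
        if arm == liked_arm:
            reward = 1.0 if correct else 0.0
        else:
            reward = 0.0 if correct else 1.0
        agent.update(arm, reward)
        total += reward
    return total / (steps * accuracy)
```

With perfect feedback (accuracy 1.0), both UCB1 and Thompson Sampling concentrate on the liked arm within a handful of steps, while epsilon-greedy keeps spending a fixed fraction of pulls on exploration, mirroring the trend in Figure 4.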
We chose Thompson Sampling as the MAB algorithm used in the experimental system.

B. Questionnaires used in the study

Pre-study. Four 5-point Likert scale questions are asked:

• Q1: Do you agree that you are familiar with the process of creating content, such as writing articles, drawing pictures or creating a video game stage, using a computer? (Strongly Disagree → Strongly Agree)
• Q2: Do you agree that you are good at writing or telling a story, either real or fictional? (Strongly Disagree / Never attempted in the past 5 years → Strongly Agree)
• Q3: How frequently do you use or interface with artificial intelligence? For example, using map services to find a route to your destination, playing a game with a computer-controlled character, or using a chatbot. (Never used → Always / For as many things as possible)
• Q4: How much understanding do you have of the recent developments in Artificial Intelligence technologies? (Very unfamiliar → Very familiar / I can build one)

Post-study. Four questions are asked regarding the systems the participants used during the study:

• Q5 (Learning, Collaboration): You were assigned a specific way to collaborate with the assistant Wands, and the assistant is not informed of this arrangement in advance. Which assistant wand learned to collaborate with you under that arrangement? If you have chosen at least one of the assistant wands, how did you know they learned from you?
• Q6 (Enjoyment, Immersion): Which assistant wand is more capable and made the collaboration easy for you? If you have chosen at least one of the assistant wands, how did the assistant(s) impress you with their capabilities?
• Q7 (Expressiveness, Exploration, Results worth effort): With these assistant wands, which collaborative experience ended up in a good story? If you have chosen at least one of the assistant wands, what do you think helped? If you chose neither, what went wrong?
• Q8: Lastly, which assistant wand would you recommend more to a friend or a colleague story writer? Please let us know if you have any other message or comment to share.

For Q5 to Q7, participants may select one, both, or neither system; for Q8, as it is a comparative question, the option of "neither" is not available. All questions are followed by an open-text question prepared to collect justifications from the participants.

C. Prompting details

Prompts for Communications start with

"You are an AI writing assistant, collaborating with a human on the task of writing a story. You are very concise, and answer only what is absolutely necessary, without any explanations or introductions. You make sure that all your answers are surrounded by an underscore, such as _My answer_ ."

and are followed by a few examples of the tasks, along with the constraints, formed in a question-answering format; the final question does not come with an answer, and the continuation is treated as the response.
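The prompt structure described above can be sketched as follows. Only the system preamble is quoted from this appendix; the Q/A layout, the function names, and the regex-based answer extraction are our illustrative assumptions, as the exact formatting and parsing code are not specified here.

```python
import re

# Preamble quoted from Appendix C.
SYSTEM_PREAMBLE = (
    "You are an AI writing assistant, collaborating with a human on the "
    "task of writing a story. You are very concise, and answer only what "
    "is absolutely necessary, without any explanations or introductions. "
    "You make sure that all your answers are surrounded by an underscore, "
    "such as _My answer_ ."
)


def build_prompt(examples, final_question):
    """Assemble a few-shot prompt: preamble, worked Q/A examples, then the
    final question left unanswered; the LM continuation is the response."""
    parts = [SYSTEM_PREAMBLE, ""]
    for question, answer in examples:
        parts.append(f"Q: {question}")
        parts.append(f"A: _{answer}_")
    parts.append(f"Q: {final_question}")
    parts.append("A:")
    return "\n".join(parts)


def extract_answer(continuation):
    """Recover the underscore-delimited answer from the LM continuation."""
    match = re.search(r"_(.+?)_", continuation, re.DOTALL)
    return match.group(1).strip() if match else None
```

For example, `extract_answer("sure: _The knight fled._ done")` strips the surrounding chatter and keeps only the delimited answer, which is the purpose of the underscore convention in the preamble.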