=Paper=
{{Paper
|id=Vol-3926/paper6
|storemode=property
|title=Beyond Following: Mixing Active Initiative into Computational Creativity
|pdfUrl=https://ceur-ws.org/Vol-3926/paper6.pdf
|volume=Vol-3926
|authors=Zhiyu Lin,Upol Ehsan,Rohan Agarwal,Samihan Dani,Vidushi Vashishth,Mark Riedl
|dblpUrl=https://dblp.org/rec/conf/exag/LinEADVR24
}}
==Beyond Following: Mixing Active Initiative into Computational Creativity==
Zhiyu Lin, Upol Ehsan, Rohan Agarwal, Samihan Dani, Vidushi Vashishth and Mark Riedl
Georgia Institute of Technology, Atlanta, Georgia, USA
Abstract
Generative Artificial Intelligence (AI) encounters limitations in efficiency and fairness within the realm of Procedural
Content Generation (PCG) when human creators solely drive and bear responsibility for the generative process.
Alternative setups, such as Mixed-Initiative Co-Creative (MI-CC) systems, have exhibited promise. Still, the potential
of an active mixed initiative, where AI takes a role beyond following, is understudied. This work investigates the
influence of the adaptive ability of an active and learning AI agent on creators’ expectancy of creative responsibilities
in an MI-CC setting. We built and studied a system that employs reinforcement learning (RL) methods to learn the
creative responsibility preferences of a human user during online interactions. Situated in story co-creation, we develop
a Multi-armed-bandit agent that learns from the human creator, updates its collaborative decision-making belief, and
switches between its capabilities during an MI-CC experience. In a human subject study with 39 participants,
our system’s learning capabilities were well recognized compared to the non-learning ablation, corresponding
to a significant increase in overall satisfaction with the MI-CC experience. These findings indicate a robust association
between effective MI-CC collaborative interactions, particularly the implementation of proactive AI initiatives, and
deepened understanding among all participants.
Keywords
Mixed-Initiative, Co-Creativity, Human-AI Collaboration, Procedural Content Generation
11th Experimental Artificial Intelligence in Games Workshop, November 19, 2024, Lexington, Kentucky, USA.
zhiyulin@gatech.edu (Z. Lin); ehsanu@gatech.edu (U. Ehsan); rohanagarwal@gatech.edu (R. Agarwal); sdani30@gatech.edu (S. Dani); vvashishth3@gatech.edu (V. Vashishth); riedl@cc.gatech.edu (M. Riedl)
https://zhiyulin.info/ (Z. Lin); https://www.upolehsan.com/ (U. Ehsan); https://eilab.gatech.edu/mark-riedl.html (M. Riedl)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Recent advancements in Machine Learning (ML)–powered Artificial Intelligence (AI), such as large language models (LMs) [1] and diffusion models [2], have made a new class of tools for Procedural Content Generation (PCG) available to game creators. The dominant contemporary way for creators to control such generative AI models is via prompting—the issuing of textual instructions for the model to interpret and respond to [3]. That is, the user is tasked with the responsibility of issuing clear “prompts” to contextualize the AI system and make it aware of their intents. The AI is tasked to follow and fulfill the request strictly based on it. If the system does not respond with an output that satisfies the creators’ wants or needs, it is incumbent upon the creators to modify the prompt and try again.

The paradigm of human creators working with generative AI via prompting is just one of many theoretical ways for a human creator and an AI system to interact [4]. There is evidence that prompting is not necessarily the best interaction paradigm; users indicate an appreciation for more varied ways of interacting with AI creative systems [5]. Other configurations of human-AI collaborative creative systems are possible that promise to reduce cognitive load, frustration, and system abandonment [6], and make these systems more casual and enjoyable [7]. These include Mixed-Initiative (MI) systems and Co-Creative (CC) systems. Mixed-Initiative systems are those in which both human and AI systems can initiate content changes. Co-Creative systems are those in which both human and AI systems can contribute to content creation. In particular, MI-CC systems have been demonstrated in game design [8], drawing [9], and storytelling [10], benefiting from both human and AI possessing the ability to take creative initiative. While the broadest definition of co-creative systems might include any human creator working with a generative AI, the vast majority of them have not investigated the role of mixed initiative, especially a more active AI initiative.

At the heart of MI-CC systems is the question of whether and how the AI creative agent knows and understands (a) the intentions and goals of the human creator and (b) how the user wants to work with the AI system. These questions pose significant challenges, especially within domains critical to game designers utilizing AI, such as Computational Creativity and PCG. In other domains, the goal may be provided to the AI in advance, making it easier to identify opportunities to take the initiative with respect to contributing to a solution—the extreme of which is the AI system knowing the goal and solving it completely on its own. When it comes to creating games, however, the human creators’ intent is harder to articulate completely [11]. The human creator’s goals are also non-stationary and may evolve during the creative process [12, 13]. The human creator might also have a preferred working style that the agent should conform to in order to take the initiative while minimizing disruption. Once these challenges are overcome, researchers have shown that such ambiguity and instability link to improved outcomes of the creative activity [14], thus benefiting the MI-CC interaction.

In this paper, we examine Co-Creative systems in a mixed-initiative setting and study the dynamics of managing creative responsibility between human and
Figure 1: Screenshot of our system in action.
AI initiatives. We ask: What influence does an AI agent’s ability to actively adapt to creators’ expectancy of creative responsibility in an MI-CC system have on creator experience and perception?

In particular, we make the assumption that the AI agent is capable of working in the creative domain if given explicit prompts but is unaware of the human creator’s preferences for distributing creative responsibility between humans and the AI. We explore the usage of Reinforcement Learning (RL) methods in this setting and demonstrate that the creative responsibility learning challenge in MI-CC systems can be addressed by a multi-armed bandit (MAB) algorithm that observes feedback from users iteratively, updates its beliefs, and carries out its capabilities to facilitate the MI-CC collaboration. The learning is done online in real-time during the MI-CC process, and the human creator is not expected to have previous knowledge of the AI agent or time to pre-train it with regard to their collaboration style.

Working in the domain of structured story co-creation, we invite 39 participants to a human subject study. We quantitatively measure the human creator’s perceived learning performance of the agent and the overall level of satisfaction with the collaboration. We use the Creative Support Index (CSI) [15] to study the implications of a learning and evolving AI agent. We also report on qualitative data collected from participants, using a grounded theory [16] approach in which we identify thematic patterns in users’ subjective reports of their experiences. This study reveals a higher degree of participant recognition regarding the learning capabilities of our agent, compared to the ablation, which in turn corresponded to a significant increase in overall satisfaction with our agent (code: https://github.com/xxbidiao/beyond-following-experiments).

2. Background and Related Work

The procedure of an MI-CC system learning its creative responsibilities can be described as a decision-making process, where the agent communicates with the human creator, gathers information, and chooses among its capabilities. This is not as straightforward as asking human creators to prompt AI agents because:

• Just like the Cold Start problem experienced by AI agents lacking prior preferential knowledge from their creators [17], human creators, even experts, may struggle to make inferences about the behavior of AI systems they initially face;
• The ability of human creators to effectively convey information to AI depends on their communication skills, which can be a significant obstacle even in human-to-human interactions [18];
• Enforcing this AI-centric method of input requires a profound mechanical understanding of the AI system from the human creators, where this knowledge does not necessarily intersect with their expertise. This marginalizes creators who do not possess the requisite expertise in utilizing AI.

For these reasons, relying solely on human creators for direct collaborative prompting, regardless of the capability of the AI models, has its limitations, leading to efficiency, cognitive load, fairness, and equity issues.

Alternatively, a model can be built on human feedback without users directly communicating their goals. Researchers have demonstrated the potential of such models in transferring human knowledge to AI [19, 20] and making AI learn more efficiently [21, 22]. When it comes to generating content, this is the foundation of methods such as RL from human feedback [23], which has proven to
drastically improve the quality of generated text in state-of-the-art models such as GPT-4 [1]. Yet, these methods are designed to exclusively optimize for a static, known-from-data objective. They are not designed for online implementation where pre-training is not feasible, and the system lacks prior knowledge of new creators and needs to actively probe them.

To focus on the active probing challenge, we formalize it as a Multi-Armed Bandit (MAB) problem [24] above generative abilities, where an AI agent needs to actively choose under uncertainty from its library of capabilities, based on its understanding of its human creator teammate, to minimize total regret and maximize rewards from the teammate. Multi-Armed Bandit systems have been employed in the context of resolving how to make progress in an interactive creative experience. Koch et al. [25] discussed a design ideation framework that suggests images that a designer may like by exploring and exploiting in the image embedding space with a variant of MAB; Gallotta et al. [26] applied MAB in the context of generating “in-game spaceships” by enabling creator-guided latent space walks in the feature embedding space representing such spaceships. These works focused on a single type of action in the content space, and concentrated on expanding the generative space of such content. Lin et al. [27, 5] instead explored the action space, characterized as types of Communications representing the information exchange between human and AI used in the co-creative process. As to the idea of switching between different high-level actions beyond the content level, building a model of the user has been proven to help in a CC setting, specifically in the domain of storytelling. Yu et al. [28] demonstrated its potential to generate stories that bring “an enjoyable experience for the players”; Gray et al. [29, 30] further demonstrated how MAB agents help to capture this player model. Vinogradov et al. [31] showcased a framework where the agent explores the creators’ “player” model vigorously by directly generating “distractions”, objects designed to probe into players’ preferences instead of providing utility in finishing a certain task. They proposed using MAB for this task for its promise in “balancing the act of gathering information about the payout associated with each arm (exploration) and maximizing reward given the current known information (exploitation)”, dynamically updating the model in the process towards assigning tasks that the players feel more interested in tackling. They inspire our method, as their approach of adding distractions is well comparable to the agent carrying out its initiative while directly changing the creative content.

3. Study Design

In this section, we present the study we designed to examine the AI agent we created that adapts to creators’ expectancy of creative responsibility. We seek to determine how this changes the perception of the creators toward the AI and the creative experience the system supplies to the human creators.

3.1. Task Setup

The Delegation Setup. For the experiments, we spotlight a specific but generalizable collaborative setup: learning a delegation. In this setup, both parties take a subset (or the entirety, if preferred) of responsibilities in an MI-CC activity towards the common goal. The human creator concentrates on specific parts of the creative task while not losing control of the other parts; the AI agent needs to strategically shift its focus towards the parts that the human creator is not focusing on and actively determine how to make improvements. Furthermore, as these interactions are not without cost, such as creators’ cognitive load, it is also important to minimize such costs towards learning these responsibilities. We denote the expected and delegated responsibility that the AI agent needs to learn during the interaction as the preferred work style for a particular human creator.

Domain: Storytelling. Given the mounting interest in co-creative storytelling [32] and the established research foundation within story generation, its high relevance to game development, and its inherent complexity with regard to PCG, we select story generation as a proving ground for our proposed method. The expertise of the team and advancements in open-source Large LMs readily available to us facilitated implementation; this allows us to focus on the human factors of the MI-CC experience and the AI agent itself.

For our experimental system, we use Llama2-13b-chat [33] as the LM, readily available at the time of the study while very responsive for the interactive experience.

3.2. Experimental AI System Overview

We now describe the AI system we built for the purpose of the study. The experimental system is based on the Creative Wand framework [27], containing the following four components:

3.2.1. Creative Context

The Creative Context is the abstraction of generative models for this system. In this paper, we study stories containing four components inspired by Narrative Arc theory: the beginning, development or rising action, climax, and conclusion. We design an AI framework that writes each component of the story using language models and prompt engineering (see Appendix C for more details). Both the human participant and the AI are instructed to write about 20 to 30 words per component, and the target length of the whole story is around 100 words. Once we set up the model, it will take requests from Communications.

3.2.2. Communications

Communications describes the interactions between the human creators and the AI; they also double as the capabilities the AI agent possesses. To focus on how the agents would choose their creative responsibilities, we implement a minimalistic yet complete set of
Figure 2: One round of interaction of our experimental system. Each participant will experience multiple turns per session.
capabilities for the creative experience. This allows us to focus on research questions about the creative experience while minimizing the cognitive load of the participants. Our agent possesses the following capabilities, implemented as prompts to the LM describing the responsibilities (see Appendix C for details):

• (Re)write the beginning and development;
• (Re)write the climax and conclusion;
• Write a review of the story: one sentence positive, one negative, and one suggestion for improvements.

3.2.3. Experience Manager and Frontend

These two modules manage the interactive experience and workflow. We implement a Finite State Machine to manage the experience. Figure 2 shows the states with the overall flow of interaction each participant experiences in one experiment session. One session of the MI-CC experience is separated into multiple “turns”, where both parties iteratively improve the story, sharing the same text fields in the editing process. The participants are not directly notified of the internal states of the system.

Human Initiative. During this phase, human creators contribute to the story by making edits in any of the four text fields. This phase ends when the agent decides to take the initiative. We implement a point-based heuristic based on pilot studies: the agent assigns points for the changes it observes, and takes the initiative whenever enough points are accumulated, signifying substantial edits from the human creators, according to the following criteria:

• Each new character adds 5 points;
• Each time the human creator switches between fields after any changes, 100 points are added;
• Whenever the human creator leaves a text field with 200 points accumulated (roughly one full sentence or two minor changes), the agent takes the initiative by locking the editing interface and resetting the counter.

This heuristic provides two advantages compared to other ways this decision can be made: First, this heuristic is computationally fast and enables responsive interactions; second, it additionally provides visualization for the users. As shown in Figure 1, we present this right above the text boxes for the stories, with a text hint and a progress bar representing the ideation process of the agent. We additionally provide a “skip” function that forces agent initiative.

Agent Initiative. In this phase, the agent decides which capability best fosters the collaborative experience and carries out the corresponding Communication. We build a Multi-Armed Bandit-based agent in our system that is responsible for choosing which Communication to invoke, with Thompson Sampling as the chosen algorithm for the experimental system within the AI agent. Formally, an agent A interacts with a set of K arms a_1, ..., a_K, each of which is associated with a Communication and underlying capabilities, and an unknown reward distribution. Whenever an arm is pulled, the agent seeks feedback from the human creator on the initiative, which is treated as a reward signal (see the next paragraph). The goal of the agent is to maximize the total reward obtained by repeatedly pulling arms during the session. See subsection A for more details on the design choices of the MAB agent. Once an arm is pulled, the agent executes a Communication, interacts with the user, and updates the story as needed.

Learning from human. The system asks about (Action Feedback) the way they just worked and (Content Feedback) the updates and content changes. The participants choose between “Good” (reward of 1) and “Bad” (reward of 0). “Bad” feedback on generated text leads to a reversion to the original content, though it is not used to improve the LM in any way. A weighted mean is employed to integrate both types of feedback into a singular reward signal. For the study, a weight hyperparameter of 80% is applied to the Action Feedback and 20% to the Content Feedback. This prioritizes learning action-level responsibilities rather than the preference for LM-generated text, in which the full system and the baseline share implementation. This reward signal is then used to train the agent. For this experiment, an MAB agent with Thompson Sampling is used in the experimental system. See Appendix A for a discussion and experiments related
Figure 3: Participants’ experience during the study.
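The agent-initiative loop described above, Thompson Sampling over the three Communications with “Good”/“Bad” feedback combined by an 80/20 weighted mean, can be sketched as follows. This is a minimal illustration rather than the authors’ implementation: the class and arm names are ours, and we assume the standard Beta-Bernoulli posterior treatment of binary bandit rewards.

```python
import random

class ThompsonSamplingAgent:
    """Minimal Beta-Bernoulli Thompson Sampling over a set of Communications."""

    def __init__(self, arms):
        # One Beta(alpha, beta) posterior per arm, starting from the uniform prior.
        self.posteriors = {arm: [1.0, 1.0] for arm in arms}

    def choose(self):
        # Sample a plausible reward rate from each posterior; pull the best arm.
        samples = {arm: random.betavariate(a, b)
                   for arm, (a, b) in self.posteriors.items()}
        return max(samples, key=samples.get)

    def update(self, arm, action_feedback, content_feedback, action_weight=0.8):
        # Weighted mean of the two binary signals (80% action, 20% content),
        # as in the study; treated as a soft Bernoulli observation.
        reward = (action_weight * action_feedback
                  + (1 - action_weight) * content_feedback)
        self.posteriors[arm][0] += reward
        self.posteriors[arm][1] += 1 - reward

# Hypothetical arm names standing in for the three Communications.
agent = ThompsonSamplingAgent([
    "rewrite_beginning_development",
    "rewrite_climax_conclusion",
    "write_review",
])
arm = agent.choose()
agent.update(arm, action_feedback=1, content_feedback=0)  # "Good" action, "Bad" content
```

Over repeated turns, arms whose initiatives draw “Good” feedback accumulate higher Beta means and are sampled more often, while the sampling step keeps some exploration alive.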
to this choice. Once the learning process is complete, “human initiative” starts again. To maintain user engagement, text responses are morphed each time to avoid repetitiveness, while contextual hints are also strategically provided throughout the experience. Figure 1 shows the user interface.

3.3. Study Methodology

To study the perception of human creators towards MI-CC systems equipped with these learning capabilities, we conduct a study, summarized in Figure 3, on the AI system. We compare our system, the “Full” system, with an ablation named “baseline”. The “baseline” ablation does not learn: it chooses each of the 3 Communications with a 1/3 probability at all times and provides only a reverting option when “asking for feedback”. These systems are codenamed “Echo Wand” and “Harmony Wand” respectively, so as not to reveal the details of the systems to the participants during the study. We recruited 39 United States participants (counting only participants who finished the whole study with valid sessions and responses) on Prolific (prolific.co) with adequate English proficiency. Each experiment session lasted approximately 40 minutes, and we paid the participants $15 per hour for perfect completion of the study.

Pre-study. Before the experience, participants answer four 5-point Likert-scale questions on (Q1) Expertise in Computer-Assisted Designing (CAD), (Q2) Expertise in writing stories, (Q3) Frequency of using AI, and (Q4) Understanding of AI (see Appendix B for the full question text). We then present instructions to familiarize the participants with our systems by providing annotated screenshots of the interface, which is a copy of Figure 1, but with additional numeric overlays, descriptions of components, and a brief introduction to the workflow of co-creating a story. They are then assigned the delegation task: to focus on writing the beginning and the development of the story while leaving the other parts of the story to the AI as much as possible. They are also made aware that the AI does not know this setup in advance.

Experience. Participants are assigned to interact with the full system and the baseline ablation, presented in random order, counter-balanced. They are given 10 turns per each of the 2 sessions.

Post-study. After participants finished two sessions using our systems, they were asked about the process they had just experienced. Inspired by the Creative Support Index (CSI) [15] used in previous studies, we ask questions based on dimensions related to creative support perception and the overall collaborative experience, grouped to facilitate richer responses from the participants while maintaining their engagement in the survey. Specifically, we ask which system(s) are (Q5, Learning, Collaboration) learning to collaborate, (Q6, Enjoyment, Immersion) more capable and easy to work with, and (Q7, Expressiveness, Exploration, Results worth effort) enabling better stories. For Q5 through Q7, participants can choose either system, both systems, or neither, leading to a potential total exceeding 100%. We ask one final question (Q8) on which system they would recommend more, framed in a win-draw-lose format. Although these questions are presented in the same order for all participants, the order of the options is randomized to reduce bias towards any system. All questions are followed by an open-text question prepared to collect justifications from the participants.

4. Quantitative Results

4.1. Creative background

Table 1 shows a summary of the creative backgrounds of the participants. Although a median of 4 on all questions implies that participants are familiar with the recent advancement of AI, when specifically asking whether they can build one, only 1 participant answered “yes” (5 in Q4), meaning that most of the participants do not have a technical background. However, compared to the 26% reported in [5], we observed 87% of the participants being at least “somewhat familiar” (3+) with recent AI technologies, and 51% being “familiar” (4+); the experience of using commercially available Large LM-based agents may have a profound effect on how participants, in general, would collaborate with AI systems.

4.2. Quantitative Results

We commence by presenting the quantitative results of the study through the choices made by the participants in the multiple-choice questions. When asked which system(s) learned to collaborate with them under the delegation arrangement (Q5), the “Full” system is chosen 69% (n = 39) of the times, compared to 51% for the baseline (p < 0.018, under a
Q (See Appendix B for full questions) 1 2 3 4 5 Average Median
Q1: CAD skills 1 1 2 19 16 4.23 4
Q2: Writing skills 1 0 7 20 11 4.03 4
Q3: Frequency of using AI 0 0 16 11 12 3.90 4
Q4: Understanding of AI Tech. 0 5 14 19 1 3.41 4
Table 1: Creative background of the participants. 1 = Most Negative, 5 = Most Positive.
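The p-values in this section come from binomial tests against a 50/50 null of no difference; a minimal stdlib sketch of such an exact two-sided test follows. The 27-of-39 split is purely illustrative, not a figure from the study, and the exact pairing of participant responses is left to the paper.

```python
from math import comb

def binomial_two_sided_p(successes, n, p0=0.5):
    """Exact two-sided binomial test: total probability, under
    H0: X ~ Binomial(n, p0), of outcomes no more likely than the observed one."""
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # Sum the probability of every outcome at most as likely as the observation.
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))

# Illustrative: 27 of 39 forced-choice preferences for one system.
p = binomial_two_sided_p(27, 39)
```

With 27 of 39 hypothetical votes the test rejects the null at the 5% level; a perfectly even split (e.g. 5 of 10) yields p = 1.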
binomial test where H0 := no observable difference in distribution; the same holds for all p-values in this section). We clearly see the “Full” system with learning capabilities enabled being perceived as significantly better at learning the delegation than the baseline, demonstrating, from the human creator’s perspective, the effectiveness of the MAB-based model learning from their feedback.

When asked which system to recommend, this trend also persists: our system is preferred (wins) 43.6% of the time, versus 20.5% (loses) for the baseline (p < 0.001); 35.9% of the participants do not have a preference (draw). The “Full” system differs from the baseline system only in the learning capabilities and corresponding frontend elements, yet we see a statistically significant improvement in preference towards our “Full” system, illustrating the potential of our method in enhancing the MI-CC experience and making such systems better for human creators.

When it comes to which system(s) gave a good story (Q7), 72% of the participants agree that the “Full” system made a good story, while 69% selected the baseline system (p > 0.05). We were unable to statistically determine whether an agent learning the delegation would produce a better story; this is expected, as we focused on studying the sharing of responsibilities and enforced a delegation setting. In an actual MI-CC experience, without such a prior, a human creator would utilize the agent’s learning capability to promote their strengths and discourage their weaknesses, and an improvement in perceived performance is more likely to be observed in that setting.

Finally, when queried about the collaboration itself (Q6), 62% of the participants think the “Full” system is capable and made the collaboration easy, while 56% voted for the baseline system (p > 0.05). We also were unable to statistically determine whether the “Full” system is more enjoyable and immersive. Although the difference between the “Full” system and the baseline is substantial enough both implementation-wise and towards the perception of learning, from the angle of the user interface, the only difference is 10 additional questions from the “Full” system per session. Previously, Larsson et al. [34] reported that “there was a clear trend that the visual ... was rather important to the subject’s relationship towards the MI-CC,” while these “relationships” are directly linked to creators’ perception of immersion of the experience; Ehsan et al. [35] additionally pointed out that even when an AI system presents the same underlying information, how it is presented influences the perceptions of human users. We may have observed this effect from a different angle, where a lack of differences in presentation may have caused the indifference of the participants. To that end, the difference between the two systems on these creative support dimensions may be too minor when it comes to how they are presented visually; the effect of the user interface used to present the results in an MI-CC system is out of the scope of this work, though these findings illuminate a potential path for future research.

5. Qualitative Results

We now show the results from the open-ended questions following each multiple-choice question. Open-ended justifications participants provided for each of the four questions are evaluated with thematic analysis [36], based on grounded theory [16]. Taking an inductive approach, we started the process with an open-coding scheme and iteratively produced in-vivo codes (generating codes directly from the data). Next, we analyzed the data using axial codes, which involves finding relationships between the open codes and clustering them into different emergent themes. Through an iterative process performed until consensus was reached, we share the most salient themes that emerged from the axial codes.

An MI-CC system that understands the intents of the human creators and follows them by learning is overall favored and collaborates well with the creators. Participants demonstrated their observation of the learning capabilities of the “Full” system, identifying it as “better about learning that I specifically wanted help with” (P34) and having “listened to my feedback” (P39). In comparison, the baseline system is identified as one that “did less of the work ... did not necessarily learn what its role was expected to be” (P19). This resulted in a preference for the Full system for P32, as the Full system is quoted as a “more useful helper”. This aligns with the quantitative observations.

Good content suggestions may give people the feeling that the system is learning how to collaborate with them, regardless of how the AI is actually doing so. Despite specifically asking participants to discuss whether the agent has “learned to collaborate with you under that arrangement” (Q5), participants are also rating the system based on the generated content:

(P25, emphasis added) This one learned from me because it was able to build off of my original foundation of my story that I typed.

P18, who rated their familiarity with AI as Familiar (4 out of 5) and AI usage as “Always / as much as possible” (5 out of 5), wrote that the “Full” system is learning from them:
I could see Echo Wand adding more detail and building out more creatively than with Harmony Wand.

This participant is familiar with recent generative AI and mentions “adding details” and “building,” which are traits that these AI are optimized for. As both the “Full” and the baseline use the same underlying generative AI capabilities, P18 could not distinguish between the “improvements” on generated contents and the performance of the MAB-based agent. The apparent improvements of generated stories may result from a wide range of reasons, such as participants providing different input and the LM sampling differently, unrelated to both the underlying LM and the learner, creating noise in the perception of participants.

Diversity is also important: it may not be the best strategy for a learning agent to pick the “best options”, and sometimes the agent may want to intentionally surprise their teammates. P23 was impressed by the range of capabilities both agents possess, seeing “They were both impressive, being able to take my story and to word it better, or even add things to change it to make it better”. When asked about the generated story, P39 mentioned that “Both of them gave bad stories.” and “I need much more control and options”. Curiously, this is the same participant that enjoyed the agent that “listened to my feedback.”. P36 preferred the baseline system that executes random actions:

I did all of the work with Echo, despite my best efforts to get it to collaborate with me. Harmony had much more interesting suggestions and rightfully pointed out when a section became too dense. It balanced the second two sections to match my intro and build up, unlike Echo who almost refused to work on them.

For this study, we assigned delegation tasks to the participants. This is only a subset of possible respon-

... I was in control of the final text to accept changes or not, or to make my own.

In a system involving a creator who wishes to create content to their liking, it is expected that the creator wishes to solicit as much control as possible. However, if the AI agent does not have any final say on the contents, should we expect it to take any creative responsibilities? Although we acknowledge that this is more of a philosophical question, well out of the scope of our work, what if the agent could understand what their counterpart is actually seeking and use this information to determine what contribution they should stick to by understanding what human creators are thinking?

6. Discussions

Distilling from these findings, ranging across the perception of collaboration, good writing skills, diversity in capabilities, and creators’ need for control, a common implication surfaces: getting the mental model of the creators right, the system will succeed; getting it wrong, failure cases would surface. A mental model is described by Kieras et al. [37] as “understanding ... that describes the internal mechanism” of the system a human is operating; Leslie et al. [38] further point out that a theory of mind is a mechanism that humans express naturally, towards an understanding of thinking, in our context, of their teammate AI. The success of our “Full” system of learning rises from its ability to learn a model of how the creators wish to collaborate with them, and the reward given by a teammate can be otherwise treated as a reward for correctly understanding their model. The need for diversified responses and more respect for the control signals users impose also falls into this paradigm, but goes beyond it; understanding how these reward signals should be used beyond “picking the best”, and how to capture hints for new actions or capabilities needed, can greatly improve collaborations with MI-CC systems. This falls into the subfield of “novelty detection and
sibilities that the AI agent can take and the human adaptation” [39] situated in RL, which is known to be
creators may expect. Lin et al. [5] have shown that a challenging, if solvable at all with ML methods, as ML
system with more coverage of the design space, provid- models can only rely on their extrapolation capabilities
ing more diversified options, is preferred. Our study towards the “unknowns”, that may not hold for all
design, which is more focused on studying the learning novelties; This will be a rewarding pathway towards
process, limited the variety of capabilities the agent better MI-CC systems if not agentic AI overall.
may perform. To that end, once such an MI-CC sys- We start to see a consistent narrative: creators are
tem is put into use beyond research, it is necessary interpreting the capabilities of our AI agent learning
to diversify both the capability pool and the process as an attempt the AI agent made to learn a mental
of the AI agent choosing them, potentially providing model of themselves; Because our agent determines
surprise and unpredictability to further inspire the which Communication to use and the effect of it on the
users. contents being collaborated on, We observe the par-
ticipants treating proper learning of Communication
Creator control is important, and creators may want choices (expected) and the content generated (emerg-
their ideas to be included even when AI can pro- ing) as both evidence that the agent is learning from
vide better candidates. Beyond the need for control them and traits leading to their preferences towards
mentioned by P39, P28 mentioned that they were im- these systems. This also, to some extent, explains
pressed by the capabilities of both systems in “finish the placebo effect we observe on the baseline system:
the story that I started with.” (Emphasis added). P27 around half of the participants believe that the base-
mentioned further on their justification: line system is learning from them, significantly more
than 0, despite the baseline system only making deci-
sions randomly. In this controlled comparative study,
In this controlled comparative study, to avoid a bias towards either of the systems, we intentionally did not disclose any difference between the “Full” system and the baseline. This perception may have arisen from the capability of our agent to generate parts of stories that follow the context the participants provided. Although we acknowledge that these factors are hard to decouple, this finding also hints at the potential of our methods for understanding the human creator holistically. Ehsan et al. [35] pointed out that the background of human users determines their cognitive heuristics, which play a role in their expectations beyond what the designer of the systems expected in the first place. They also observed that, if not treated carefully, AI systems can actually introduce such placebo effects as a pitfall [40], by misleading the human users into appreciating their trustworthiness and power without the development of underlying AI capabilities. Standing on these findings, a promising direction of research is to carefully identify the effect of the expectations of both parties involved in the MI-CC process, and how they dynamically change during the collaboration.

7. Conclusions

In this paper, we showcased how an MI-CC system is capable of listening to human feedback and improving itself towards a better understanding of how it should collaborate with human creators in a storytelling domain. Inviting 39 participants and comparing two such systems, with and without these learning capabilities, we found that this capability was well recognized by the participants and led to better satisfaction overall. To this end, we further encourage the designers of MI-CC systems to pay attention to both the human creators and the AI agent, and to study how each party should be, or already is, adapting to and creating mental models of their counterpart, based on the creative roles taken, their previous experience and capabilities, and, most importantly, the wishes of the human creators.

References

[1] OpenAI, GPT-4 Technical Report, 2023. URL: http://arxiv.org/abs/2303.08774. doi:10.48550/arXiv.2303.08774, arXiv:2303.08774 [cs].
[2] P. Dhariwal, A. Nichol, Diffusion Models Beat GANs on Image Synthesis, Advances in Neural Information Processing Systems 34 (2021) 8780–8794. arXiv:2105.05233.
[3] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023) 1–35. arXiv:2107.13586.
[4] Z. Lin, M. Riedl, An Ontology of Co-Creative AI Systems, arXiv preprint arXiv:2310.07472 (2023).
[5] Z. Lin, U. Ehsan, R. Agarwal, S. Dani, V. Vashishth, M. Riedl, Beyond Prompts: Exploring the Design Space of Mixed-Initiative Co-Creativity Systems, Proceedings of the 14th International Conference on Computational Creativity (2023) 64–73. URL: http://arxiv.org/abs/2305.07465. doi:10.48550/arXiv.2305.07465, arXiv:2305.07465 [cs].
[6] J. Sweller, Cognitive load theory, in: Psychology of Learning and Motivation, volume 55, Elsevier, 2011, pp. 37–76.
[7] K. Compton, M. Mateas, Casual creators, in: Proceedings of the Sixth International Conference on Computational Creativity, 2015, p. 228.
[8] A. Liapis, G. N. Yannakakis, C. Alexopoulos, P. Lopes, Can computers foster human users’ creativity? Theory and praxis of mixed-initiative co-creativity, DCE (2016). URL: https://www.um.edu.mt/library/oar/handle/123456789/29476.
[9] N. Davis, C.-P. Hsiao, K. Y. Singh, L. Li, S. Moningi, B. Magerko, Drawing Apprentice: An Enactive Co-Creative Agent for Artistic Collaboration, in: Proceedings of the 2015 ACM SIGCHI Conference on Creativity and Cognition, C&C ’15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 185–186. URL: https://doi.org/10.1145/2757226.2764555. doi:10.1145/2757226.2764555.
[10] A. Alvarez, J. Font, J. Togelius, Story Designer: Towards a Mixed-Initiative Tool to Create Narrative Structures, Proceedings of the 17th International Conference on the Foundations of Digital Games (2022) 1–9. URL: http://arxiv.org/abs/2210.09294, arXiv:2210.09294 [cs].
[11] M. O. Riedl, Human-centered artificial intelligence and machine learning, Human Behavior and Emerging Technologies 1 (2019) 33–36. Publisher: Wiley Online Library.
[12] N. Davis, C.-P. Hsiao, Y. Popova, B. Magerko, An enactive model of creativity for computational collaboration and co-creation, Creativity in the Digital Age (2015) 109–133. Publisher: Springer.
[13] M. Guzdial, N. Liao, M. Riedl, Co-Creative Level Design via Machine Learning, Fifth Experimental AI in Games Workshop (2018). URL: http://arxiv.org/abs/1809.09420, arXiv:1809.09420.
[14] F. Zenasni, M. Besançon, T. Lubart, Creativity and tolerance of ambiguity: An empirical study, The Journal of Creative Behavior 42 (2008) 61–73. Publisher: Wiley Online Library.
[15] E. Cherry, C. Latulipe, Quantifying the creativity support of digital tools through the creativity support index, ACM Transactions on Computer-Human Interaction (TOCHI) 21 (2014) 1–25. Publisher: ACM New York, NY, USA.
[16] B. Glaser, A. Strauss, Discovery of Grounded Theory: Strategies for Qualitative Research, Routledge, 2017.
[17] J. Bobadilla, F. Ortega, A. Hernando, J. Bernal, A collaborative filtering approach to mitigate the new user cold start problem, Knowledge-Based Systems 26 (2012) 225–238. Publisher: Elsevier.
[18] S. M. Grover, Shaping effective communication skills and therapeutic relationships at work: The foundation of collaboration, AAOHN Journal 53 (2005) 177–182. Publisher: SAGE Publications Sage CA: Los Angeles, CA.
[19] W. Bradley Knox, P. Stone, TAMER: Training an Agent Manually via Evaluative Reinforcement, in: 2008 7th IEEE International Conference on Development and Learning, IEEE, Monterey, CA, 2008, pp. 292–297. URL: http://ieeexplore.ieee.org/document/4640845/. doi:10.1109/DEVLRN.2008.4640845.
[20] G. Warnell, N. Waytowich, V. Lawhern, P. Stone, Deep TAMER: Interactive agent shaping in high-dimensional state spaces, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[21] Z. Lin, B. Harrison, A. Keech, M. O. Riedl, Explore, Exploit or Listen: Combining Human Feedback and Policy Model to Speed up Deep Reinforcement Learning in 3D Worlds, arXiv:1709.03969 [cs] (2017). URL: http://arxiv.org/abs/1709.03969.
[22] D. Arumugam, J. K. Lee, S. Saskin, M. L. Littman, Deep Reinforcement Learning from Policy-Dependent Human Feedback, 2019. URL: http://arxiv.org/abs/1902.04257, arXiv:1902.04257 [cs].
[23] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, G. Irving, Fine-tuning language models from human preferences, arXiv preprint arXiv:1909.08593 (2019).
[24] J. Vermorel, M. Mohri, Multi-armed bandit algorithms and empirical evaluation, in: European Conference on Machine Learning, Springer, 2005, pp. 437–448.
[25] J. Koch, A. Lucero, L. Hegemann, A. Oulasvirta, May AI? Design Ideation with Cooperative Contextual Bandits, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1–12. URL: https://doi.org/10.1145/3290605.3300863.
[26] R. Gallotta, K. Arulkumaran, L. B. Soros, Preference-Learning Emitters for Mixed-Initiative Quality-Diversity Algorithms, IEEE Transactions on Games (2023) 1–14. doi:10.1109/TG.2023.3264457.
[27] Z. Lin, R. Agarwal, M. Riedl, Creative Wand: A System to Study Effects of Communications in Co-creative Settings, Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18 (2022) 45–52. URL: https://ojs.aaai.org/index.php/AIIDE/article/view/21946. doi:10.1609/aiide.v18i1.21946.
[28] H. Yu, M. Riedl, Data-driven personalized drama management, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 9, 2013, pp. 191–197.
[29] R. C. Gray, J. Zhu, D. Arigo, E. Forman, S. Ontañón, Player modeling via multi-armed bandits, in: Proceedings of the 15th International Conference on the Foundations of Digital Games, 2020, pp. 1–8.
[30] R. C. Gray, J. Zhu, S. Ontañón, Multiplayer Modeling via Multi-Armed Bandits, in: 2021 IEEE Conference on Games (CoG), IEEE, 2021, pp. 01–08.
[31] A. Vinogradov, B. Harrison, Using Multi-Armed Bandits to Dynamically Update Player Models in an Experience Managed Environment, Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18 (2022) 207–214. URL: https://ojs.aaai.org/index.php/AIIDE/article/view/21965. doi:10.1609/aiide.v18i1.21965.
[32] M. Behrooz, Y. Tian, W. Ngan, Y. Yungster, J. Wong, D. Zax, Holding the Line: A Study of Writers’ Attitudes on Co-creativity with AI, 2024. URL: http://arxiv.org/abs/2404.13165, arXiv:2404.13165 [cs].
[33] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. arXiv:2307.09288.
[34] G. Larsson, V. Lindecrantz, How an AI colleague affect the experiance of content creation, 2023. URL: https://www.diva-portal.org/smash/get/diva2:1780852/FULLTEXT02.
[35] U. Ehsan, S. Passi, Q. V. Liao, L. Chan, I.-H. Lee, M. Muller, M. O. Riedl, The who in explainable AI: How AI background shapes perceptions of AI explanations, in: Proceedings of the CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–32. arXiv:2107.13509 [cs.HC].
[36] J. Aronson, A pragmatic view of thematic analysis, The Qualitative Report 2 (1994) 1–3.
[37] D. E. Kieras, S. Bovair, The role of a mental model in learning to operate a device, Cognitive Science 8 (1984) 255–273. URL: https://www.sciencedirect.com/science/article/pii/S0364021384800038. doi:10.1016/S0364-0213(84)80003-8.
[38] A. M. Leslie, O. Friedman, T. P. German, Core mechanisms in ‘theory of mind’, Trends in Cognitive Sciences 8 (2004) 528–533. Publisher: Elsevier.
[39] J. Balloch, Z. Lin, M. Hussain, A. Srinivas, R. Wright, X. Peng, J. Kim, M. Riedl, NovGrid: A flexible grid world for evaluating agent response to novelty, arXiv preprint arXiv:2203.12117 (2022).
[40] U. Ehsan, M. O. Riedl, Explainability pitfalls: Beyond dark patterns in explainable AI, Patterns 5 (2024). Publisher: Elsevier.
[41] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[42] R. Agrawal, Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, Advances in Applied Probability 27 (1995) 1054–1078. Publisher: Cambridge University Press.
[43] W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika 25 (1933) 285–294. Publisher: Oxford University Press.

A. Choosing a MAB algorithm

In this section, we provide more information on the design choice of the MAB agent. Following results from Vinogradov et al. [31], we looked into three representative MAB algorithms: ε-greedy, UCB1, and Thompson Sampling.

ε-greedy [41], widely used in RL, works on a simple principle: the agent has probability ε (a hyperparameter) of choosing a random action (explore) instead of performing the best action from its policy (exploit).

UCB1, or Upper Confidence Bound 1 [42], instead takes a more deterministic approach: it calculates an “Upper Confidence Bound” for each arm, considering both the current running average of the rewards and the uncertainty due to lack of sampling:

    a = argmaxₐ ( x̄ₐ + √(2 log t / nₐ) )    (1)

where x̄ₐ represents the average reward received from arm a, nₐ the number of times arm a was pulled, and t the total number of times all arms were pulled. This makes UCB1 aware of the uncertainty of the reward from each arm when the agent makes its decisions. Although probability distributions are used to calculate these bounds, the algorithm does not sample at all and provides a deterministic choice for a given system state.

Finally, Thompson Sampling is a robust Bayesian approach first introduced by Thompson [43]. It maintains a probability distribution over the possible values of each arm’s reward and uses this distribution to make decisions. To determine which arm to pull, it draws a sample from a Beta (ℬ) distribution over the number of successes and failures of each arm, choosing the arm whose sample has the maximum value, with rewards between 0 and 1:

    a = argmaxₐ ℬ(αₐ, βₐ)    (2)

αₐ increases by the reward received, and βₐ by 1 minus the reward received. Initially, both α and β for each arm are set to 1 to establish a uniform prior distribution. Thompson Sampling is designed to effortlessly transition from primarily exploring in the initial stages to a more exploitation-oriented strategy as it acquires more information.

We carried out an Oracle-based experiment to determine the MAB algorithm of choice for the study. Using an oracle, which simulates a human creator interacting with the system, gives us total control of its behaviour. We measure the performance of the agents at various levels of human feedback accuracy, seeking an agent that performs well across all accuracy levels so that it serves a wide variety of human creators well.

We study four different agents and baselines: ε-greedy, UCB1, Thompson Sampling, and a Random Baseline, for which a uniformly random arm is chosen each time. We give the agents 3 arms to pull, of which one is “liked” and the two others are “unliked”. When pulled, each arm yields from the oracle a reward of either 1 (if liked) or 0 (otherwise). We define human feedback accuracy as the probability of the oracle giving a reward of 1 on pulling the “liked” arm and a 0 on pulling a “not liked” arm. As this value gets lower, closer to 50%, the simulated oracle becomes less clear about which arm it likes and becomes a less efficient feedback provider. We simulated 5 levels of this accuracy, from 60% to 100% in equal intervals.

ε-greedy is highly sensitive to the chosen ε parameter, and we report the best-performing ε-greedy agent, with ε = 0.2. We report the “normalized reward”, which is the agent’s reward relative to the theoretical maximum of always choosing the “liked” arm. We repeat each experiment condition 100 times and report the mean normalized reward after 10 steps, to simulate a scenario where the MI-CC agent has to learn quickly from its human counterpart, similar to our actual study.

Figure 4: Oracle experiment results on MAB algorithms, showing the agents performing at various feedback accuracy levels. Upper Bound performance, where the liked arm is always pulled, and Lower Bound performance, where a not-liked arm is always pulled, are also presented for reference.

Figure 4 summarizes the results from the Oracle experiments. As we only gave these agents 10 steps to learn the arms, the agents may not have converged yet; this is expected in a quick-learning scenario. ε-greedy performed poorly, even worse than the random baseline, likely due to its inability to quickly shift focus between exploration and exploitation. UCB1 and Thompson Sampling perform at similar levels, demonstrating their capability to calculate an upper-bound reward and use it in their decision-making process.

Although UCB1 and Thompson Sampling performed similarly, Thompson Sampling is preferred because of its sampling behavior. UCB1 schedules its exploration over a very long session in a deterministic way (exploring once after exploiting n times). As we aim for quick learning and adaptation, UCB1, without sampling, risks showing “stubbornness” towards a suboptimal arm, with no probability of unsticking itself, a behavior that is less preferred from an MI-CC perspective.
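To make the oracle protocol and the agents above concrete, here is a minimal Python sketch of the three algorithms, the random baseline, and the simulated-feedback oracle (3 arms, rewards in {0, 1}, 10 steps, 100 repetitions, ε = 0.2). This is our illustrative reconstruction, not the authors’ code; in particular, normalizing by the expected reward of always pulling the liked arm, the tie-breaking, and the initialization details are our own assumptions.

```python
import math
import random

def oracle_reward(arm, liked_arm, accuracy, rng):
    # Reward 1 for the liked arm and 0 for an unliked arm, reported
    # correctly with probability `accuracy` and flipped otherwise.
    correct = 1 if arm == liked_arm else 0
    return correct if rng.random() < accuracy else 1 - correct

def run_episode(policy, accuracy, steps=10, n_arms=3, seed=0, epsilon=0.2):
    rng = random.Random(seed)
    liked = rng.randrange(n_arms)
    counts = [0] * n_arms            # n_a: number of pulls per arm
    sums = [0.0] * n_arms            # cumulative reward per arm
    alpha = [1.0] * n_arms           # Beta(1, 1) uniform prior for Thompson
    beta = [1.0] * n_arms
    total = 0
    for t in range(1, steps + 1):
        means = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(n_arms)]
        if policy == "random":
            a = rng.randrange(n_arms)
        elif policy == "egreedy":    # explore with probability epsilon
            if rng.random() < epsilon:
                a = rng.randrange(n_arms)
            else:
                a = max(range(n_arms), key=lambda i: means[i])
        elif policy == "ucb1":       # Eq. (1); pull each unseen arm once first
            if 0 in counts:
                a = counts.index(0)
            else:
                a = max(range(n_arms),
                        key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        else:                        # "thompson", Eq. (2): arm with the largest Beta sample
            a = max(range(n_arms), key=lambda i: rng.betavariate(alpha[i], beta[i]))
        r = oracle_reward(a, liked, accuracy, rng)
        counts[a] += 1
        sums[a] += r
        alpha[a] += r                # alpha grows with the reward,
        beta[a] += 1 - r             # beta with (1 - reward)
        total += r
    return total / steps             # mean per-step reward of this episode

def mean_normalized(policy, accuracy, repeats=100):
    # Mean reward relative to the expected reward of always pulling the liked arm.
    runs = [run_episode(policy, accuracy, seed=s) for s in range(repeats)]
    return sum(runs) / (len(runs) * accuracy)
```

With this sketch, the qualitative findings above can be reproduced in miniature: over 10-step episodes, Thompson Sampling and UCB1 clearly beat the random baseline, while ε-greedy’s fixed exploration rate wastes many of its few steps.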
Thompson Sampling, on the other hand, exhibits the capability to dynamically change its exploration aggressiveness based on previous observations, while using a Bayesian prior instead of greedy sampling, both of which benefit its application in our experimental MI-CC setup. This results in both an effectively dynamic “epsilon”, compared to ε-greedy, and some randomness instead of full greediness at each step, compared to UCB1.

We chose Thompson Sampling as the MAB algorithm used in the experimental system.

B. Questionnaires used in the study

Pre-study. Four 5-point Likert scale questions are asked:

• Q1: Do you agree that you are familiar with the process of creating content, such as writing articles, drawing pictures, or creating a video game stage, using a computer? (Strongly Disagree → Strongly Agree)
• Q2: Do you agree that you are good at writing or telling a story, either real or fictional? (Strongly Disagree / Never attempted in the past 5 years → Strongly Agree)
• Q3: How frequently do you use or interface with artificial intelligence? For example, using map services to find a route to your destination, playing a game with a computer-controlled character, or using a chatbot. (Never used → Always / For as many things as possible)
• Q4: How much understanding do you have of the recent developments in Artificial Intelligence technologies? (Very unfamiliar → Very familiar / I can build one)

Post-study. Four questions are asked regarding the systems used during the study.

• Q5 (Learning, Collaboration): You were assigned a specific way to collaborate with the assistant Wands, and the assistant is not informed of this arrangement in advance. Which assistant wand learned to collaborate with you under that arrangement? If you have chosen at least one of the assistant wands, how did you know they learned from you?
• Q6 (Enjoyment, Immersion): Which assistant wand is more capable and made the collaboration easy for you? If you have chosen at least one of the assistant wands, how did the assistant(s) impress you with their capabilities?
• Q7 (Expressiveness, Exploration, Results worth effort): With these assistant wands, which collaborative experience ended up in a good story? If you have chosen at least one of the assistant wands, what do you think helped? If you chose neither, what went wrong?
• Q8: Lastly, which assistant wand would you recommend more to a friend or a colleague story writer? Please let us know if you have any other message or comment to share.

For Q5 to Q7, participants may select one, both, or neither system; for Q8, as it is a comparative question, the option of “neither” is not available. All questions are followed by an open-text question prepared to collect justifications from the participants.

C. Prompting details

Prompts for Communications start with

    “You are an AI writing assistant, collaborating with a human on the task of writing a story. You are very concise, and answer only what is absolutely necessary, without any explanations or introductions. You make sure that all your answers are surrounded by an underscore, such as _My answer_ .”

and are followed by a few examples of the tasks, along with the constraints, formed in a question-answering format; the final question does not come with an answer, and the continuation is treated as the response.
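The prompting scheme above can be sketched as follows. The helper names, the question/answer framing of the examples, and the regex-based extraction of the underscore-delimited answer are our own illustration of the described format, not the authors’ implementation.

```python
# Illustrative sketch of the Appendix C prompt format: a fixed preamble,
# a few example question/answer pairs with answers wrapped in underscores,
# and a final unanswered question whose continuation is the response.
import re

PREAMBLE = (
    "You are an AI writing assistant, collaborating with a human on the task of "
    "writing a story. You are very concise, and answer only what is absolutely "
    "necessary, without any explanations or introductions. You make sure that all "
    "your answers are surrounded by an underscore, such as _My answer_ ."
)

def build_prompt(examples, question):
    """examples: list of (question, answer) pairs; answers get underscore delimiters."""
    lines = [PREAMBLE, ""]
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: _{a}_")
    lines.append(f"Q: {question}")
    lines.append("A:")  # the LM's continuation after this line is the response
    return "\n".join(lines)

def extract_answer(continuation):
    """Pull the first underscore-delimited span out of the LM continuation."""
    m = re.search(r"_([^_]+)_", continuation)
    return m.group(1) if m else None
```

A Communication would then call `build_prompt` with task-specific examples, send the result to the underlying LM, and run `extract_answer` on the continuation to recover the constrained answer.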