=Paper=
{{Paper
|id=Vol-3847/paper5
|storemode=property
|title=Using Combinatorial Multi-Armed Bandits to Dynamically Update Player Models in an
              Experience Managed Environment
|pdfUrl=https://ceur-ws.org/Vol-3847/paper5.pdf
|volume=Vol-3847
|authors=Anton Vinogradov,Brent Harrison
|dblpUrl=https://dblp.org/rec/conf/int-ws/VinogradovH24
}}
==Using Combinatorial Multi-Armed Bandits to Dynamically Update Player Models in an
              Experience Managed Environment==
<pdf width="1500px">https://ceur-ws.org/Vol-3847/paper5.pdf</pdf>
<pre>
                         Using Combinatorial Multi-Armed Bandits to Dynamically
                         Update Player Models in an Experience Managed Environment
                         Anton Vinogradov, Brent Harrison
                         University of Kentucky, Lexington, KY 40506


                                          Abstract
                                          Designers often treat players as having static play styles, but this is shown to not necessarily be always true. This is not an issue with
                                          games that create a relatively static experience for the player but can cause problems for games that attempt to model the player and
                                          adapt themselves to better suit the player such as those with Experience Managers (ExpMs). When an ExpM makes changes to the world
                                          it necessarily biases the game environment to the better match with what it believes that the player wants. This process limits what
                                          sorts of observations that the ExpM can take and leads to problems if and when a player suddenly shifts their own preferences leading
                                          to an outdated player model that can be slow to recover. Previously it has been shown that detecting a preference shift is possible and
                                          that the Multi-Armed Bandit (MAB) framework can be used to recover the player model, but this method had limits in how much
                                          information it could gather about the player. In this paper, we offer an improvement on recovering a player model after a preference
                                          shift after one is detected by using Combinatorial MABs (CMAB). To evaluate these claims we test our method in a text-based game
                                          environment on artificial agents and find that CMABs have a significant gain in how well they can recover a model. We also validate
                                          that our artificial agents perform similarly to how humans would by testing the task on human subjects.


                         1. Introduction                                                                                             able, they may be difficult to find and the majority will
                                                                                                                                     be focused on combat. With their choices largely limited
                         Experience management is the study of automated systems                                                     the player may even continue to do combat actions leading
                         that guide players through a more interesting and tailored                                                  the ExpM to observe the player engaging with combat and
                         experience than that which could normally be achieved.                                                      incorrectly strengthening its now outdated player model,
                         A game that features an experience manager (ExpM) can                                                       continuing to show that the player prefers combat. For the
                         automatically adapt the player’s experience to better serve                                                 player model to be properly updated, the player themselves
                         their specific goals and play style while also balancing the                                                would need to seek out and find appropriate content, which
                         wants of the author, thereby guiding the player towards                                                     may be difficult and cause the player to disengage from the
                         an optimal gameplay path [1, 2]. An ExpM does this by                                                       experience.
                         observing details about the player, such as the actions that                                                   Because of this, many experience managers assume that
                         they take within the game, and extrapolating from them to                                                   player preferences remain static during gameplay. Our pre-
                         make decisions about which content to serve in the future                                                   vious work shows that a player preference shift can be de-
                         and by what means.                                                                                          tected [5] and that by framing the problem as a Multi-Armed
                            ExpMs often make use of player models, which are persis-                                                 Bandit (MAB) it is possible to find the player’s preference
                         tent models used to represent the player’s internal state that                                              and quickly recover the player model, though this has only
                         the experience manager can update. These models are built                                                   been shown to work with artificial agents [6]. These works
                         over time and can be used to make more complex and long-                                                    argue that since the game environment is biased by the
                         term decisions by the ExpM, allowing for a better balance                                                   removal of possible methods of interaction, then one can
                         between personalizing the game to the player and ensuring                                                   attempt to learn a player’s updated preferences by adding
                         the author’s intent is carried out [3].                                                                     actions back into the environment. This is done by intro-
                            Since the player model is only an approximation of the                                                   ducing a new form of game object called a distraction, an
                         player’s internal thoughts and preferences it is not always                                                 object that is used by the ExpM to gain information about
                         completely accurate as it is difficult to predict how humans                                                the player while also minimally disrupting the game.
                         will behave. This can be problematic for ExpMs. When the                                                       These distractions need to be deliberately designed to
                         ExpM takes actions it necessarily biases the world towards                                                  entice players that are not engaged while also being ignored
                         being better suited to its model of the player. At the same                                                 by players that are engaged. This had been accomplished
                         time, this biasing of the environment can influence the sorts                                               naturally by the limitations of adapting the problem as an
                         of observations that the ExpM is likely to see. Take, for                                                   MAB as MABs play rounds sequentially but only allow for
                         example, a player who has previously shown to take combat                                                   a single arm pull in each round. This limits the feedback to
                         options in a game, the ExpM observes this and changes the                                                   only the single distraction added in that round but is only a
                         environment to better suit this sort of play style, removing                                                limitation of the adaption, not of distractions themselves.
                         other possible types of actions and adding in more combat-                                                     In this paper, we improve on this method by extending the
                         focused ones. If the player suddenly shifts their preferences,                                              MAB framework to a Combinatorial MAB (CMAB), which
                         which players have been shown to do [4], to prefer a more                                                   allows us to gather more information from the player and
                         diplomatic approach, then the environment may not offer                                                     recover the player model more quickly. We make use of
                         suitable affordances for the player’s current preferences.                                                  the CUCB algorithm [7] which allows us to use more than
                         While there may be some diplomacy-oriented actions avail-                                                   one distraction at a time though can potentially be more
                                                                                                                                     disruptive to the player due to using more distractions. We
                          AIIDE Workshop on Intelligent Narrative Technologies, November 18, 2024,                                   additionally create an improvement that lessens the amount
                          University of Kentucky Lexington, KY, USA
                                                                                                                                     of distractions needed by reusing part of the environment
                          $ Anton.Vinogradov@uky.edu (A. Vinogradov);
                          Harrison@cs.uky.edu (B. Harrison)                                                                          when it is available. Both these methods are shown to out-
                           0009-0000-4312-4896 (A. Vinogradov); 0000-0002-1301-5928                                                 perform previous methods in automated tests with artificial
                          (B. Harrison)                                                                                              agents. We also conduct a human study to validate that
                                      © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
                                      Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
humans perform similarly to how these artificial agents do.         We make use of Combinatorial Multi-Armed Bandits
                                                                 (CMABs) an extension on MABs [19]. Similarly to MABs, a
                                                                 CMAB contains a set of 𝑚 base arms that are played over
2. Related Works                                                 some number of rounds. Each arm is associated with a ran-
                                                                 dom variable 𝑋𝑖,𝑡 for 1 ≤ 𝑖 ≤ 𝑚 and 𝑡 ≥ 1 which denotes
Experience management is the generalization of drama man-
                                                                 the outcome for arm 𝑖 at round 𝑡. The set of random variables
agement [8] to not only encompass entertainment and man-
                                                                 {𝑋𝑖,𝑡 |𝑡 ≥ 1} associated with base arm 𝑖 are iid with some
aging narrative drama, but also more serious contexts such
                                                                 unknown expected mean value 𝜇𝑖 . 𝜇 = {𝜇1 , 𝜇2 , ..., 𝜇𝑖 } is
as education and training, and managing a player’s overall
                                                                 the expected mean of all arms. Instead of playing a single
experience in the game. It does this by observing the player
                                                                 arm, as one would do with a MAB, a super arm 𝑆 is played
and manipulating the virtual world with a set of experience
                                                                 from the set of all super arms 𝑆. We consider 𝑆 to be the
manager actions that allow it to modify the game and its
                                                                 subsets of the set of all arms 𝑚 of size 𝑘, for 𝑘 ∈ {2, 3}.
environment to coerce the player’s experience according
                                                                 At the end of the round, the reward 𝑅𝑡 (𝑆) is revealed for
to some goal. Early work on drama management focused
                                                                 the super arm and is given to the contributing arm 𝑖 ∈ 𝑆.
on balancing the intent of the author with allowing for a
                                                                 We observe the reward per arm played and only reward a
breadth of actions to the player, without necessarily forming
                                                                 single arm, a type of feedback belonging to semi-bandits
an explicit persistent player model and instead modeling it
                                                                 [20]. 𝑇𝑡 (𝑖) is the number of times arm 𝑖 has been played up
as an optimization problem [9, 10, 11, 8]. These are limited
                                                                 to round 𝑡.
in how they can represent and understand the player but do
                                                                    To learn an optimal policy, we make use of the CUCB
serve to allow the player a wider breadth of actions, even
                                                                 algorithm [19]. This algorithm first takes an initial 𝑚 rounds
repairing the narrative after the player takes an unexpected
                                                                 of playing a super arm that contains one of each of the
action as such killing an important NPC.
                                                                 arms. For our implementation of the algorithm we choose
   Our focus is on ExpMs that make use of a player model,
                                                                 to always play a super arm that contains the lowest played
as having a persistent model of the player can allow for the
                                                                 arms thus far, ensuring that each arm is played 𝑘 times
ExpM to make longer term decisions and allow for more
                                                                 in this time. Afterwards   we calculate the adjusted means
intelligent actions in single [1, 12, 3] and multiplayer games                      √︁
[13, 14]. These works assume that the player is static in        with 𝜇
                                                                      ¯𝑖 = 𝜇ˆ 𝑖 + 2𝑇3 ln 𝑡
                                                                                      𝑡 (𝑖)
                                                                                            , and select the super arm with the
their preferences and that further observation will lead to      highest valued adjusted mean.
a more accurate model of the player. This assumption has
been challenged more recently [4] and the need for dynamic       3.2. CMAB Adaptation
updates to ExpM player models has been acknowledged [15].
These approaches focus on categorizing players according         We extend on the work we introduced in [6] and thus use
to some pre-existing set of play styles.                         a similar formulation to their MAB adaptation. One of the
   In our previous work we have shown that player pref-          core innovations to adapting the player model recovery
erence shifts can be detected [5] and recovered [6]. These       system to a MAB are distractions. Distractions are a type of
have been shown to work in automated agents that mimic           game object that are used to gain more information from
human-like behavior and utilize MAB algorithms to accom-         the player. These are intended to entice players that are
plish the player model recovery process. MABs have been          currently not engaged with the game from their current
used in player modeling [16, 13] and CMABs in experience         task, but ideally would not be noticed or interacted with by
management [17] before but these approaches still assume         players that are engaged. This way they can be used as a
that the player is static in their preferences. We extend on     safe means to test whether a player is currently engaged,
this work to show that humans instructed to emulate a pref-      and more importantly whether they are taking actions that
erence shift do act similarly to these automated agents and      align well with the ExpM’s player model. Since distractions
additionally offer an improvement on player model recov-         are meant to balance this thin line of engagement they have
ery by expanding the ExpM’s actions and allowing it to pull      a few extra requirements. Distractions should:
combinatorial super arms.                                            1. be largely irrelevant to important parts of the game
                                                                        like quests as to not tread on any authorial goals
3. Methods                                                           2. represent a type of action or style of play in the game
                                                                        (the distractions action type)
In this section, we will start with a brief overview of CMABs        3. be easily recognizable by the player as belonging to
and the CUCB algorithm before detailing how we adapt our                that action type
environment to make use of this framework.
                                                                 This way distractions can serve as a means to measure the
                                                                 player’s preference by being clear and understandable while
3.1. Combinatorial Multi-Armed Bandits                           also not being too intrusive to normal gameplay.
Multi-armed bandits (MABs) are single state Markov Deci-            These distractions are utilized with a set of ExpM actions,
sion Processes where agents choose to take one of several        which we call distraction actions, which put into play a new
actions (called arms) [18]. Each action has a reward distri-     distraction for the player to interact with. In terms of MABs,
bution and the task of solving comes in the form of finding      these distraction actions form the arm pulls, where adding a
an optimal policy that maximizes the total reward while          distraction with a specific type of action to the environment
minimizing losses. This policy needs to be able to balance       is a pull of an arm for that type of action. For example: if a
between exploring each of the arms while gaining informa-        valid action type is crafting, a distraction action for crafting
tion on their underlying reward distribution and exploiting      would be spawning a low level crafting ingredient like a
this information to maximize the reward from all the arms.       torn sack for cloth as a distraction. A player that is well
                                                                 engaged with the game would ignore this and continue on
with their current task, but a player that is not engaged may
investigate further.
   In adapting the player model recovery system to a MAB
framework, previously we were were restricted to pulling a
single arm at a time[6], which aligned well with the goal of
reducing the number of distractions used. Since the amount
of distractions introduced per round is flexible, we found
that further adapting to CMAB framework is natural and
allows us to gain more information per round since super
arms can be played. Pulling a super arm is a distraction ac-
tion that adds more than one distraction per turn and each
individual arm is an individual distraction. The super arm
formed from the touch and read arms would add 2 distrac-
tions, one of touch and the other of read action types. We
also follow our previous method of giving a reward of 1 to
the arm for the distraction that the agent has interacted with
and a reward of 0 for all other arms. This means that super
arms are not rewarded together, only a single arm within
the super arm gets a reward. If the agent has not interacted     Figure 1: The expanded map for the human environment. The
with any distractions in that turn then no reward is given to    participant starts in the Village Outskirts (VO) and is tasked with
any arms that round. This allows for more specific rewards       retrieving a gem from the Secret Mining Shaft (SM) in the upper
to the action types that are interacted with compared to         right, finding a way to bypass a group of miners blocking the
rewarding the super arm as a whole.                              path.
   We have also previously found mixed success in purely
reducing the bias of the environment by only adding a single
distraction per round, selecting the distraction from those      the same actions and explore the same rooms without get-
that have action types that are not in the current area. Tak-    ting bored or disengaging with the experience, as a human
ing that as inspiration we have developed an additional          in a similar situation would. This environment is written in
improvement: replace-with-environment-action. With this          Inform7, a language and integrated engine used to create
improvement active, a super arm that contains a distraction      interactive fiction that is played with natural language syn-
with an action type that is already present within the current   tax. Inform7 is well suited to be used with artificial agents
area will have the distraction replaced with the matching        and is easily understandable to humans.
environment action. Instead of adding a distraction of that          In this environment, we use seven areas, called rooms,
overlapping type, the algorithm considers any action taken       which are all traversable using cardinal directions. Within
of that overlapping type in that round to count towards          these rooms are several objects that the player agents can
that distraction. This has the effect of adding some of the      interact with, with each object having a corresponding ac-
player’s interaction with the environment to the CMAB            tion type. We make use of five different action types: look,
reward model. This has the effect of reducing the amount of      talk, touch, read, and eat. There is a sixth type of action,
distractions added early on to around 1.7 per round in the       move, but since it is so prevalent and necessary for game-
first 10 turns, though as the algorithm gains more informa-      play we do not consider this something that a player can
tion it starts to rarely need to give an environment action      prefer. Of these, we consider talk to be the Primary action
and thus this decreases to an average of 1.95 after the first    type since it is the one most prevalent in the environment
10 turns.                                                        and is the focus of the quest that the player agent is tasked
                                                                 with. Other types of actions are also present within the
                                                                 environment but are not the focus, namely look and touch.
4. Experiments and Results                                       It is often not possible to create an environment or quest in
In this section, we will go over both our tests with automated   a game that only uses a single type of action so we include
agents and human subjects as well as their results. While        these to mimic that. These, along with the Primary action
these two experiments share many features, humans require        type, are considered Environmental action types since they
much more than artificial agents do and as such we will note     are present in the environment. The last two, read and eat,
where there are differences.                                     are missing from the environment and thus are considered
                                                                 Missing action types. Our inclusion of these two types of
                                                                 actions that are completely missing from the environment
4.1. Automated Experiments                                       is to simulate the effects of the ExpM previously biasing
The goal of these experiments is to evaluate our CUCB            environment due to the player not preferring read or eat
method in a controlled situation. Thus, we follow our pre-       actions.
vious experimental outline in which we use autonomous                These five action types are also the basis for our player
agents rather than humans. In these experiments, we com-         model, with the player agents preferring to primarily inter-
pare our CUCB methods against their best performing              act with one of these action types. We expect that players
method 𝜖-greedy.                                                 will have a strong preference towards one of these, but
                                                                 also that it is often not possible to complete quests without
4.1.1. Environment                                               engaging with others. Because of this, our player agent pref-
                                                                 erences are set to primarily interact with one type of action
Our automated agents use a smaller environment with sim-         (11/15ths of the time), but have a low chance of interacting
pler action types as the artificial agents are free to perform
with the other 4 (1/15th of the time each).                        include 20 different scenarios, switching from primarily pre-
                                                                   ferring one action type to a different action type. These
4.1.2. Agents                                                      are grouped together into 4 groups that represent whether
                                                                   the preferred action type is considered to be Environmental
To test our method in our environment we make use of the           or Missing: Environment to Environment, Missing to Envi-
same 3 automated agents used in our previous work. These           ronment, Environment to Missing, and Missing to Missing.
agents are known as the: Exploration Focused Agent, Goal Fo-       We previously identified that the most relevant scenario
cused Agent, and Novelty Focused Agent. These agents work          group for analysis is Environment to Missing so we will be
in different ways and are intended to attempt to resemble dif-     focusing on it, but for completeness we include the results
ferent aspects of how humans would play the game. Two of           for all scenario groups in the appendix.
these, the Exploration and Goal focused agents, are inspired
by the Bartle taxonomy of player types [21], representing
                                                                   4.1.5. Results
the explorers and achievers quadrants respectively. Since
there is no social aspect to our game we do not consider           For each combination of the 5 managers, 3 agents, and the
the other two quadrants, instead for our last agent we take        preference switch scenarios we run 100 trials. Each trial
inspiration from literature on user engagement that states         consists of 100 turns of history that is shared between all
that more novel objects are likely to increase engagement          the agents but differ for each preference switch scenario.
[22].                                                              Since this history is shared between the agents it is run with
   Each of these three player agents have an internal prefer-      the Goal focused agent. This history consists of 90 turns in
ence distribution that they use to decide which game objects       which the agent follows its initial preference, followed by
to interact with, but each agent uses it slightly differently.     10 turns of its switched preference in which the managers
The Exploration Focused Agent has a 90% chance to interact         is recording but not actively giving distractions. These 10
with an object that it sees in the room, with a 10% chance to      turns are used to simulate the time it would take to detect
wander to a different room. For all the objects that are avail-    that a player preference shift has happened. At turn 100 the
able it randomly chooses one, drawing the probability of           state of the quest is reset to allow agents to continue exhibit-
interaction from its preference distribution to choose which       ing their quest completing behavior and the managers starts
one, and if there are no objects it moves to a different room.     taking distraction actions and it and the agent continue un-
The Goal Focused Agent on the other hand first chooses a           til turn 199. We compare the agent’s internal preference
type of action to interact with according to its preference        to the managers calculated player model (which it starts to
distribution. If it fails to find a suitable object to interact    measure from 90 onwards) with the Jensen-Shannon (JS) dis-
with that is compatible with that type of action it will instead   tance. A lower value corresponds to a closer match between
take a step towards completing the quest goal. Most actions        the two models and indicates a better result.
needed to complete the quest goal are either moving to a              The results of our tests can be seen in Figure 2 focusing on
different room or talking, the primary action type. Lastly,        what has been identified as the most realistic scenario, when
the Novelty Focused Agent puts equal importance on both            the player agent switches from preferring Environmental
the novelty of an object and its own preference distribution,      action types to Missing ones.
and likewise first chooses a type of action to interact with
before finding the object in the environment matching that         4.2. Human Study
action type. If there are multiple objects of that type it will
prefer the one it has interacted with the least, and if there      We found that our method performs better than previous
are no objects of that type it falls back on taking a step         methods with automated agents but has not been tested
towards completing the quest goal.                                 against human subjects. In this section, we will go over
                                                                   the modifications that we made to the environment and the
4.1.3. Managers                                                    distraction to make this compatible with human players. We
                                                                   will also detail the specifics of the human task. This study
Alongside our three agents, we tested each against five dif-       was reviewed by our University’s Institutional Review Board
ferent managers. As these do not implement full experience         (IRB).
managers we use the term manager to distinguish them. We
compare against a lower baseline which takes advantage of
                                                                   4.2.1. Environment
how our player agents work by always providing 5 different
types of distractions, which we call the One-of-each man-          To better accommodate the complexities of human behavior
ager. We also compare against our previous best, 𝜖-greedy.         we have heavily modified the Inform7 environment for hu-
The last three of these are dedicated to testing variants on       man use. We expect that, unlike agents, humans will grow
CUCB, one where a super arm consists of 2 distractions             bored and disengaged when attempting to traverse the same
(𝑘 = 2), another with a super arm of 3 distractions (𝑘 = 3),       7 rooms and the single quest for 75 turns. Thus, we have ex-
and the last where a super arm only has 2 or fewer distrac-        panded the number of rooms available to the player from 7
tions using our replace-with-environment-action strategy           to 36 rooms in a 6 by 6 grid. These rooms are all traversable
(𝑘 = 2, rwea). All of these managers calculate the player          with only cardinal directions for ease of navigation, but
model the same way, by measuring the frequency of types            form a more complex map which can be seen in Figure 1.
of actions that the player takes starting when the managers        Along with the expanded amount of rooms, we have added
thinks that the player has shifted their preference.               2 tasks for players to engage in, each more complex than
                                                                   the task in our automated tests, to better simulate a normal
4.1.4. Preference Scenarios                                        interactive fiction game.
                                                                      We have also used different action types to better rep-
We also tested a number of preference switch scenarios             resent the types of actions that humans would normally
for a full understanding of how our methods perform. We
          Figure 2: Mean JS Distance between Player Model and Agent Preferences vs. Turn in the Environment to Missing Scenario


engage with in games: Diplomacy, Crafting, Combat, Stealth,        found in human players. To do this we set up the task to
and Magic. Since we have fewer human participants than             mimic the Environment to Missing scenario group, focusing
we can with automated agents we have also modified the             only on a single type of preference shift for simplicity. To
distributions of the action types in the environment. For          this end we recruited 30 anonymous human participants
these tests we consider Diplomacy the Primary action type,         on Prolific to take part in this study, only restricting the
Combat to be an additional Environmental action type, and          location to English speakers in the United States. Of these,
the rest to be missing since they are not normally available       we had to remove a single user. This user was able to finish
in the environment.                                                the task but the majority of actions they tried to take caused
                                                                   errors as they did not use the syntax necessary for Inform7
4.2.2. Distractions                                                games, leaving only 22 valid actions, all but one of which
                                                                   was just moving around the map.
This environment also includes distractions, which are nor-           We task the human player to start with a preference for
mally not present in the world but can be added to the room        Diplomacy type actions and after playing for 20 turns to
the player is currently in with a manager action. Since we         switch their preference to Crafting. We simulate that it
suspected that players would be either need to be alerted to       will take some time for the manager to detect that a player
objects added to a room or otherwise they may not notice           is distracted so have added a 5 turn offset from when the
that they are added, we have opted to only add distractions        participant is asked to switch their preference to when the
when the player moves to a different room as this will align       manager can start taking actions. This 5 turn offset is op-
with when the room description is printed out, listing the         timistic and previous results have indicated that it should
items. These distractions are removed from the room when           take longer [5]. Nonetheless, this is left at 5 turns so as to
the player leaves the room.                                        not frustrate the participant and to minimize the amount of
   This change does affect how our manager handles ob-             time they need to spend on the task.
serving to player actions. Now the observations happen on             Afterward, the manager continues to observe the partici-
room change and all distractions that the player has inter-        pant’s actions and add new distractions for 50 more turns.
acted with before moving onto a new room are counted as            Due to the way that the CUCB algorithm works the first 5
being interacted with. Distractions resolve themselves in a        of these turns contain a super arm with one of the five arms,
single turn and are subsequently removed as to only allow          which we have chosen to always play the two distractions
for the player to interact with each once. A sample of the         that have been least played so far, which results in each type
distractions for each type can be seen in Table 1.                 of distraction appearing twice. After those 50 turns are over,
   Distractions also were originally designed to end with          on turn 75, we automatically end the play session.
a failure outcome, a monster turns out to be a bunch of
branches shaped like a monster, or a deer runs away before
                                                                   4.2.4. Processing
it is hit. This is due to our focus on the distraction being
irrelevant to the larger narrative. We found that prelimi-         In serving the Inform7 environment to humans we were
nary participants found these discouraging and recognized          presented with a number of limitations, both from the game
that such objects would end in failure, which led to many          due to using a web interpreter (Quixe, bundled with In-
to simply not interact with any distractions. This led us          form7) and due to the changes that were needed to make
to change most distractions to resolve themselves with a           this playable by humans. Unlike the automated tests, we
small success instead, giving the player a small reward such       no longer had access to the internal state information that
as XP, rations, or crafting materials, though these values         reports what kind of action was made and had to classify
were never actually tracked. These changes contributed to          it manually. This was done based on keyword matching,
participants continuing to interact with distractions though       generally based on the verb used, and iteratively continuing
may have made them too enticing. For future work, we               to add keywords until all user commands were classified.
consider that these sorts of rewards need to be finely tuned       We classified these into several categories, one for each of
and contextualized so as to not make the act of interacting        the five action types and additionally move for movement
with distractions too enticing to the player.                      actions, and none for all the rest. The five action type cat-
                                                                   egories mostly used the verbs related to the proper way
4.2.3. Task                                                        of interacting with the distraction, but we also added ex-
                                                                   tra keywords when the intent was clear (e.g. the invalid
Our goal in this task is to show that the same trends that are     verbs "repair" and "craft" which are clearly an attempt to
visible in the automated agents reflect the sorts of behaviors
          Figure 3: Mean JS Distance between Player Model and 5 Primary Preference Models.


craft). The largest of these categories is none and accounts            For our CUCB based managers we find that giving only
for all invalid verbs due to spelling mistakes, commands            two distractions at a time already provides significant ben-
without verbs, and valid verbs like "look" and "examine"            efits over our previous best, 𝜖-greedy (𝜖 = 0.2). This is
which simply do not correspond to an action type. For our           expected as giving more distractions allows us to gain more
player model calculation, we only used the 5 action type            information, taking advantage of not just if the player agent
categories.                                                         interacts with a distraction, but also which of the distrac-
                                                                    tions it chooses to interact with. Previously an effect was
4.2.5. Results                                                      found where the distance between the agent’s internal pref-
                                                                    erence distribution and the measured preference distribu-
The results of our human experiments can be seen in Figure          tion hits a minimum value and then starts to increase. This
3. Since we do not expect to know what the final prefer-            is attributed to the MAB gaining enough information on the
ence is in normal gameplay, we have opted to show the JS            preferences that it started to give the highest valued action
distance of the measured player model compared to five              type almost exclusively, though was often only seen in less
different preference distributions. These five distributions        realistic scenarios like when the agent shifts to an action
correspond to preferring one of the five different action           type that is already well represented by the environment.
types and are distributed similarly to our agent’s internal             In our tests we see that this effect is also present even
preferences, 11/15 for the preferred action type and 1/15           in the Environment to Missing scenario in the Exploration
for the others. The results also depend on what turn we             focused agent for everything but the 𝜖-greedy manager. This
start measuring the player model. Ideally detecting a prefer-       suggests that for this agent the strategies that exhibit this
ence shift would be able to find which turn a player shifted        effect are capable of identifying the new preference within
their preference, but to cover all cases we show what the JS        20 turns of the manager activating, and is likewise because
distance looks when starting from the beginning (20 turns           one of the distractions given is almost always the preferred
before a shift), when the preference shift occurs, and when         distraction. Other agents do not exhibit this effect, which
the manager starts to give distractions (5 turns after the          we take to mean that they are significantly more difficult
shift).                                                             to recover the player’s preferences for. For both the Goal
                                                                    and Novelty focused agents this is likely because they will
5. Discussion                                                       default to environment actions when they are not given an
                                                                    action that they wish to interact with, thus skewing the data
We found that CUCB outperforms the previous best of 𝜖-              in favor of environment action types.
greedy with the ability to get surprisingly close to One-               For CUCB both 𝑘 = 3 and 𝑘 = 2 when replacing a dis-
Of-Each. Additionally we found that human results show              traction with an environment action are capable of getting
similarities to the artificial agents, but that creating distrac-   surprisingly close to One-Of-Each in all agents. Replacing a
tions that are well suited for human players requires careful       distraction with an environment action was developed to
balancing. We will discuss further implications of these            reduce the number of distractions that were shown each
experiments below.                                                  turn, which reduced the number of distractions from 2 to an
                                                                    average of 1.93, though this value is around 1.3 for the first
                                                                    couple turns. Later on the manager has enough information
5.1. Automated Agents                                               about actions in the environment that it does not need to
The best performing manager is the One-Of-Each manager              play them as often so replacement only occurs rarely. We
which serves as our lower baseline, specifically taking ad-         expected that this would result in a hit to this strategy’s
vantage of our agent’s behavior. This manager provides one          ability to recover the new preferences but found that instead
of each of the five distractions on every turn, which means         it increased its ability. Without replacement when the agent
that all possible action types are always represented. Since        interacts with the environment no reward is given to any
our agents do not take into account how many distractions           of the actions in the environment. In replacing a distraction
there are like humans would, this serves as an approxima-           with an environment action we allow any object that the
tion to the lower limit to how quickly a player model can           agent interacts with to count as a reward for the CMAB algo-
possibly be recovered. For this reason, we do not consider          rithm, which allows it to watch for more actions than before
this to be a valid strategy to test on humans since it would        and naturally fill in action types that are underrepresented
quickly overwhelm them with options, and would require a            due to the biasing of the environment.
large amount of varying distractions as to not be repetitive.
     Distraction type       Distraction Action        Resolution
                                                      As you approach the figure it turns out to be several branches vaguely
                          Attack Strange Goblin
             Combat                                   in the shape of a goblin.
                               Attack Deer            You quickly take down the deer. +10xp, +3 rations.
                            Take Meteoric Iron        You have acquired +1 iron.
             Crafting
                             Take cotton cloth        You have acquired +1 cloth.
                                                      You give some food to the beggar, who eats it gratefully and blesses
                           Help Hungry Beggar
                                                      you for your kindness. +1xp.
           Diplomacy
                                                      You carefully free the frog from the crevice, and it hops away with a
                            Help Trapped Frog
                                                      thankful croak. +2xp.
                                                      Your heart is warmed by the god’s blessing and you are filled with
                            Pray at idol statue
               Magic                                  peace and courage. +2mp.
                           Touch mystical vine        The vine’s light pulses stronger, and you feel a rush of vitality. +1hp.
                                                      You decide to take the whole thing +10gp, though you discard the coin
                              Steal coin purse
                                                      purse itself after you extract the contents.
              Stealth
                                                      The man reeks of alcohol and is absolutely conked out, but unfortunately
                         Pickpocket snoozing man
                                                      you do not find anything of use on him.
    Table 1
    An example of distractions used in the human study. Distractions are designed to resolve themselves after a single interaction
    and to sometimes give the player a small reward. These rewards are bolded in the resolution and were displayed as bolded for
    the player though their values are not tracked for this experiment.


5.2. Human Study                                                      recognizably a Crafting action, the only distinguishable dif-
                                                                      ference between the two would be the outcome after the
In our results, we find that we see a pattern that is similar
                                                                      player has already interacted with it. The current version
to our agents. Our agents (Figure 2) measure the JS distance
                                                                      of Combat distractions simplifies this, always presenting
against their own internal preference distribution, and the
                                                                      animals to be hunted or threatening monsters, and always
analogous curve in our human data (Figure 3) would cor-
                                                                      interacting by attacking. An ideal implementation of the
respond to Vs. Crafting, as that is what we instructed the
                                                                      system would allow for distractions to be independent of the
participants to prefer after the switch. We expect that as
                                                                      action, allowing for a generic distraction with multiple types
soon as distractions are available (turn 25) we should see
                                                                      of interaction. In the future we plan on investigating how
a sharp drop in the distance between the measured player
                                                                      these sorts of combo-distractions can be used, especially
model from a primarily Crafting model, while the distance
                                                                      if they can themselves be formed as a super arm, thereby
to other models steadily rises starting from when the pref-
                                                                      allowing for using fewer distractions at a time while still
erence shift occurred (turn 20). This trend exists when
                                                                      gaining the same amount of information on the player’s
measuring from the beginning, but we additionally see a
                                                                      preferences.
small immediate drop. This is due to some players attempt-
                                                                         We find that the distance measurement is sensitive to
ing to take crafting actions immediately even though the
                                                                      when we start measuring. When we start measuring at the
environment does not allow for it. We also expect the trend
                                                                      beginning, 20 turns before the preference shift, we find that
of the measured distance to eventually flatten out, which is
                                                                      the extra Diplomacy moves make it more difficult to find
observed in all three plots.
                                                                      what the preference is immediately after. Starting at later
   The behavior of users attempting to do actions that are
                                                                      points makes the trend is easier to see, but may be throwing
not possible is due to how we have to manually classify
                                                                      out relevant data. Only starting from when the manager
what type of action a human commands. Instead of directly
                                                                      detects a preference shift may not perform as well as having
looking at the type of action as reported by Inform like
                                                                      a good estimate for when the preference shift occurred.
we can for our automated tests, we instead classify actions
based on the keyword that was used in the command. This
allows us to classify actions that are not actually possible          6. Conclusion
in the game, so it still counts as a crafting action when a
participant attempts to "craft" something. We found in pre-           Experience managed games help guide players through a
liminary tests that many participants struggled to navigate           more interesting and tailored experience, but this process
the interface early in the experiment though we still wanted          of customizing the experience does not take into account
to capture the intent of the commands entered.                        the unpredictable nature of people. Player experience may
   This confusion on how to navigate did not affect all users,        degrade if this unpredictability is not taken into account and
but in response, we clarified the instructions and guided             accommodated. Previously it has been shown that recover-
users to use the "help" command if they needed it, but we             ing a player model after a preference shift is possible, but
also found that the issue may have been more fundamental              only in automated agents. In this paper we further improve
to the distractions. An early version of the distractions had         on this process by modeling it as a combinatorial multi-
a larger variety of interactions with each type of distrac-           armed bandit and making use of the existing environment
tion, and we had not yet considered the importance of the             to supplement distractions. In addition we demonstrate that
distraction’s action type being easily recognizable by the            humans behave similarly to how the artificial agents behave.
player. Take the Combat distractions as an example. Early             In the future we plan on expanding this system to combine
formulations of these had either equipment you could pick             detection of preference shifts and player model recovery
up or monsters you could attack. These could be confus-               after a shift has been detected.
ing to players as picking up items to disassemble them is
References                                                        [17] K. Y. Kristen, M. Guzdial, N. R. Sturtevant, M. Cseli-
                                                                       nacz, C. Corfe, I. H. Lyall, C. Smith, Adventures of
 [1] M. Sharma, S. Ontanón, C. R. Strong, M. Mehta, A. Ram,            ai directors early in the development of nightingale,
     Towards player preference modeling for drama man-                 in: Proceedings of the AAAI Conference on Artifi-
     agement in interactive stories., in: FLAIRS, 2007, pp.            cial Intelligence and Interactive Digital Entertainment,
     571–576.                                                          volume 18, 2022, pp. 70–77.
 [2] D. J. Thue, Generalized experience management                [18] A. Slivkins, et al., Introduction to multi-armed bandits,
     (2015).                                                           Foundations and Trends® in Machine Learning 12
 [3] H. Yu, M. Riedl, Data-driven personalized drama man-              (2019) 1–286.
     agement, in: Proceedings of the AAAI Conference              [19] W. Chen, Y. Wang, Y. Yuan, Combinatorial multi-
     on Artificial Intelligence and Interactive Digital Enter-         armed bandit: General framework and applications, in:
     tainment, volume 9, 2013, pp. 191–197.                            International conference on machine learning, PMLR,
 [4] J. Valls-Vargas, S. Ontanón, J. Zhu, Exploring player             2013, pp. 151–159.
     trace segmentation for dynamic play style prediction,        [20] J.-Y. Audibert, S. Bubeck, G. Lugosi, Minimax policies
     in: Proceedings of the AAAI conference on artificial              for combinatorial prediction games, in: Proceedings
     intelligence and interactive digital entertainment, vol-          of the 24th Annual Conference on Learning Theory,
     ume 11, 2015, pp. 93–99.                                          JMLR Workshop and Conference Proceedings, 2011,
 [5] A. Vinogradov, B. Harrison, Detecting player prefer-              pp. 107–132.
     ence shifts in an experience managed environment,            [21] R. Bartle, Hearts, clubs, diamonds, spades: Players
     in: International Conference on Interactive Digital               who suit muds, Journal of MUD research 1 (1996) 19.
     Storytelling, Springer, 2023, pp. 517–531.                   [22] H. L. O’Brien, E. G. Toms, The development and evalu-
 [6] A. Vinogradov, B. Harrison, Using multi-armed ban-                ation of a survey to measure user engagement, Journal
     dits to dynamically update player models in an ex-                of the American Society for Information Science and
     perience managed environment, in: Proceedings of                  Technology 61 (2010) 50–69.
     the AAAI Conference on Artificial Intelligence and
     Interactive Digital Entertainment, volume 18, 2022, pp.
     207–214.
 [7] W. Chen, W. Hu, F. Li, J. Li, Y. Liu, P. Lu, Combinatorial
     multi-armed bandit with general reward functions,
     Advances in Neural Information Processing Systems
     29 (2016).
 [8] M. O. Riedl, A. Stern, D. Dini, J. Alderman, Dynamic
     experience management in virtual worlds for enter-
     tainment, education, and training, International Trans-
     actions on Systems Science and Applications, Special
     Issue on Agent Based Systems for Human Learning 4
     (2008) 23–42.
 [9] M. Mateas, A. Stern, Integrating plot, character and
     natural language processing in the interactive drama
     façade, in: Proceedings of the 1st International Con-
     ference on Technologies for Interactive Digital Story-
     telling and Entertainment (TIDSE-03), volume 2, 2003.
[10] A. Lamstein, M. Mateas, Search-based drama manage-
     ment, in: Proceedings of the AAAI-04 Workshop on
     Challenges in Game AI, 2004, pp. 103–107.
[11] M. J. Nelson, M. Mateas, Another look at search-based
     drama management., in: AAMAS (3), 2008, pp. 1293–
     1298.
[12] M. Sharma, S. Ontañón, M. Mehta, A. Ram, Drama
     management and player modeling for interactive fic-
     tion games, Computational Intelligence 26 (2010) 183–
     211.
[13] R. C. Gray, J. Zhu, S. Ontañón, Multiplayer modeling
     via multi-armed bandits, in: 2021 IEEE Conference on
     Games (CoG), IEEE, 2021, pp. 01–08.
[14] J. Zhu, S. Ontañón, Experience management in multi-
     player games, in: 2019 IEEE Conference on Games
     (CoG), IEEE, 2019, pp. 1–6.
[15] R. Khoshkangini, S. Ontanón, A. Marconi, J. Zhu, Dy-
     namically extracting play style in educational games,
     EUROSIS proceedings, GameOn (2018).
[16] R. C. Gray, J. Zhu, D. Arigo, E. Forman, S. Ontañón,
     Player modeling via multi-armed bandits, in: Pro-
     ceedings of the 15th international conference on the
     foundations of digital games, 2020, pp. 1–8.

</pre>