Using Combinatorial Multi-Armed Bandits to Dynamically Update Player Models in an Experience Managed Environment

Anton Vinogradov, Brent Harrison
University of Kentucky, Lexington, KY 40506
Anton.Vinogradov@uky.edu, Harrison@cs.uky.edu

AIIDE Workshop on Intelligent Narrative Technologies, November 18, 2024, University of Kentucky, Lexington, KY, USA

Abstract

Designers often treat players as having static play styles, but this has been shown to not always be true. This is not an issue for games that create a relatively static experience for the player, but it can cause problems for games that attempt to model the player and adapt themselves to better suit the player, such as those with Experience Managers (ExpMs). When an ExpM makes changes to the world it necessarily biases the game environment to better match what it believes the player wants. This process limits the sorts of observations the ExpM can make and leads to problems if and when a player suddenly shifts their preferences, leaving an outdated player model that can be slow to recover. Previously it has been shown that detecting a preference shift is possible and that the Multi-Armed Bandit (MAB) framework can be used to recover the player model, but this method is limited in how much information it can gather about the player. In this paper, we offer an improvement on recovering a player model after a preference shift has been detected by using Combinatorial MABs (CMABs). To evaluate these claims we test our method in a text-based game environment on artificial agents and find that CMABs show a significant gain in how well they can recover a model. We also validate that our artificial agents perform similarly to how humans would by testing the task on human subjects.

1. Introduction

Experience management is the study of automated systems that guide players through a more interesting and tailored experience than could normally be achieved. A game that features an experience manager (ExpM) can automatically adapt the player's experience to better serve their specific goals and play style while also balancing the wants of the author, thereby guiding the player towards an optimal gameplay path [1, 2]. An ExpM does this by observing details about the player, such as the actions that they take within the game, and extrapolating from them to make decisions about which content to serve in the future and by what means.

ExpMs often make use of player models, which are persistent models used to represent the player's internal state that the experience manager can update. These models are built over time and can be used to make more complex and long-term decisions by the ExpM, allowing for a better balance between personalizing the game to the player and ensuring the author's intent is carried out [3].

Since the player model is only an approximation of the player's internal thoughts and preferences it is not always completely accurate, as it is difficult to predict how humans will behave. This can be problematic for ExpMs. When the ExpM takes actions it necessarily biases the world towards being better suited to its model of the player. At the same time, this biasing of the environment can influence the sorts of observations that the ExpM is likely to see. Take, for example, a player who has previously been shown to take combat options in a game. The ExpM observes this and changes the environment to better suit this style of play, removing other possible types of actions and adding in more combat-focused ones. If the player suddenly shifts their preferences, which players have been shown to do [4], to prefer a more diplomatic approach, then the environment may not offer suitable affordances for the player's current preferences. While there may be some diplomacy-oriented actions available, they may be difficult to find and the majority will be focused on combat. With their choices largely limited, the player may even continue to take combat actions, leading the ExpM to observe the player engaging with combat and incorrectly strengthening its now outdated player model, continuing to show that the player prefers combat. For the player model to be properly updated, the player themselves would need to seek out and find appropriate content, which may be difficult and cause the player to disengage from the experience.

Because of this, many experience managers assume that player preferences remain static during gameplay. Our previous work shows that a player preference shift can be detected [5] and that by framing the problem as a Multi-Armed Bandit (MAB) it is possible to find the player's preference and quickly recover the player model, though this has only been shown to work with artificial agents [6]. These works argue that since the game environment is biased by the removal of possible methods of interaction, one can attempt to learn a player's updated preferences by adding actions back into the environment. This is done by introducing a new form of game object called a distraction, an object that is used by the ExpM to gain information about the player while also minimally disrupting the game.

These distractions need to be deliberately designed to entice players that are not engaged while also being ignored by players that are engaged. This had been accomplished naturally by the limitations of adapting the problem as an MAB: MABs play rounds sequentially but only allow for a single arm pull in each round. This limits the feedback to only the single distraction added in that round, but this is a limitation of the adaptation, not of distractions themselves.

In this paper, we improve on this method by extending the MAB framework to a Combinatorial MAB (CMAB), which allows us to gather more information from the player and recover the player model more quickly. We make use of the CUCB algorithm [7], which allows us to use more than one distraction at a time, though it can potentially be more disruptive to the player due to using more distractions. We additionally create an improvement that lessens the number of distractions needed by reusing part of the environment when it is available. Both these methods are shown to outperform previous methods in automated tests with artificial agents. We also conduct a human study to validate that humans perform similarly to how these artificial agents do.
2. Related Works

Experience management is the generalization of drama management [8] to not only encompass entertainment and managing narrative drama, but also more serious contexts such as education and training, and managing a player's overall experience in the game. It does this by observing the player and manipulating the virtual world with a set of experience manager actions that allow it to modify the game and its environment to coerce the player's experience according to some goal. Early work on drama management focused on balancing the intent of the author with allowing for a breadth of actions to the player, without necessarily forming an explicit persistent player model and instead modeling it as an optimization problem [9, 10, 11, 8]. These are limited in how they can represent and understand the player but do serve to allow the player a wider breadth of actions, even repairing the narrative after the player takes an unexpected action such as killing an important NPC.

Our focus is on ExpMs that make use of a player model, as having a persistent model of the player can allow the ExpM to make longer term decisions and take more intelligent actions in single-player [1, 12, 3] and multiplayer games [13, 14]. These works assume that the player is static in their preferences and that further observation will lead to a more accurate model of the player. This assumption has been challenged more recently [4] and the need for dynamic updates to ExpM player models has been acknowledged [15]. These approaches focus on categorizing players according to some pre-existing set of play styles.

In our previous work we have shown that player preference shifts can be detected [5] and recovered [6]. These have been shown to work with automated agents that mimic human-like behavior and utilize MAB algorithms to accomplish the player model recovery process. MABs have been used in player modeling [16, 13] and CMABs in experience management [17] before, but these approaches still assume that the player is static in their preferences. We extend on this work to show that humans instructed to emulate a preference shift do act similarly to these automated agents, and we additionally offer an improvement on player model recovery by expanding the ExpM's actions and allowing it to pull combinatorial super arms.
3. Methods

In this section, we will start with a brief overview of CMABs and the CUCB algorithm before detailing how we adapt our environment to make use of this framework.

3.1. Combinatorial Multi-Armed Bandits

Multi-armed bandits (MABs) are single state Markov Decision Processes where agents choose to take one of several actions (called arms) [18]. Each action has a reward distribution, and solving the bandit comes in the form of finding an optimal policy that maximizes the total reward while minimizing losses. This policy needs to balance exploring each of the arms, to gain information on their underlying reward distributions, against exploiting this information to maximize the reward from all the arms.

We make use of Combinatorial Multi-Armed Bandits (CMABs), an extension of MABs [19]. Similarly to MABs, a CMAB contains a set of $m$ base arms that are played over some number of rounds. Each arm is associated with a random variable $X_{i,t}$ for $1 \le i \le m$ and $t \ge 1$, which denotes the outcome for arm $i$ at round $t$. The set of random variables $\{X_{i,t} \mid t \ge 1\}$ associated with base arm $i$ is iid with some unknown expected mean $\mu_i$, and $\mu = (\mu_1, \mu_2, \ldots, \mu_m)$ is the vector of expected means of all arms. Instead of playing a single arm, as one would do with a MAB, a super arm $S$ is played from the set of all super arms $\mathcal{S}$. We consider $\mathcal{S}$ to be the set of subsets of the $m$ arms of size $k$, for $k \in \{2, 3\}$. At the end of the round, the reward $R_t(S)$ is revealed for the super arm and is given to the contributing arm $i \in S$. We observe the reward per arm played and only reward a single arm, a type of feedback belonging to semi-bandits [20]. $T_t(i)$ is the number of times arm $i$ has been played up to round $t$.

To learn an optimal policy, we make use of the CUCB algorithm [19]. This algorithm first takes an initial $m$ rounds of playing super arms that together contain each of the arms. For our implementation of the algorithm we choose to always play the super arm containing the least played arms thus far, ensuring that each arm is played $k$ times during this phase. Afterwards we calculate the adjusted means with $\bar{\mu}_i = \hat{\mu}_i + \sqrt{\frac{3 \ln t}{2 T_t(i)}}$, and select the super arm with the highest valued adjusted mean.
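The selection step, as we use it, can be sketched in a few lines of Python. This is a minimal illustration under our reading of the description above, not the paper's code: the function and variable names are ours, and we assume that with per-arm rewards the highest-valued super arm of size k is simply the k arms with the largest adjusted means.

```python
import math

def select_super_arm(means, plays, t, k):
    """Pick a super arm of k base arms using CUCB-style adjusted means.

    means -- empirical mean reward of each arm (mu_hat_i)
    plays -- number of times each arm has been played (T_t(i))
    t     -- current round number (t >= 1)
    k     -- super arm size (2 or 3 in the paper)
    """
    # Initialization phase: while some arm is unplayed, play the k
    # least-played arms so every arm ends up played k times.
    if min(plays) == 0:
        least_played = sorted(range(len(plays)), key=lambda i: plays[i])
        return least_played[:k]

    # Adjusted mean: mu_bar_i = mu_hat_i + sqrt(3 * ln(t) / (2 * T_t(i)))
    adjusted = [
        means[i] + math.sqrt(3.0 * math.log(t) / (2.0 * plays[i]))
        for i in range(len(means))
    ]
    # With per-arm rewards, the best super arm of size k is the top-k arms.
    ranked = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
    return ranked[:k]

# Five distraction arms after the initialization rounds; arm 3 looks most promising.
print(select_super_arm([0.1, 0.2, 0.0, 0.6, 0.1], [4, 4, 4, 4, 4], t=21, k=2))
```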
3.2. CMAB Adaptation

We extend the work we introduced in [6] and thus use a similar formulation to its MAB adaptation. One of the core innovations in adapting the player model recovery system to a MAB is the distraction. Distractions are a type of game object used to gain more information from the player. They are intended to entice players that are currently not engaged with the game away from their current task, but ideally would not be noticed or interacted with by players that are engaged. This way they can be used as a safe means to test whether a player is currently engaged, and more importantly whether they are taking actions that align well with the ExpM's player model. Since distractions are meant to balance this thin line of engagement they have a few extra requirements. Distractions should:

1. be largely irrelevant to important parts of the game like quests so as to not tread on any authorial goals
2. represent a type of action or style of play in the game (the distraction's action type)
3. be easily recognizable by the player as belonging to that action type

This way distractions can serve as a means to measure the player's preference by being clear and understandable while also not being too intrusive to normal gameplay.

These distractions are utilized with a set of ExpM actions, which we call distraction actions, which put into play a new distraction for the player to interact with. In terms of MABs, these distraction actions form the arm pulls, where adding a distraction with a specific type of action to the environment is a pull of the arm for that type of action. For example: if a valid action type is crafting, a distraction action for crafting would be spawning a low level crafting ingredient, like a torn sack for cloth, as a distraction. A player that is well engaged with the game would ignore this and continue on with their current task, but a player that is not engaged may investigate further.

In adapting the player model recovery system to a MAB framework, we were previously restricted to pulling a single arm at a time [6], which aligned well with the goal of reducing the number of distractions used. Since the number of distractions introduced per round is flexible, we found that further adapting to the CMAB framework is natural and allows us to gain more information per round since super arms can be played. Pulling a super arm is a distraction action that adds more than one distraction per turn, and each individual arm is an individual distraction. The super arm formed from the touch and read arms would add 2 distractions, one of the touch and the other of the read action type.

We also follow our previous method of giving a reward of 1 to the arm for the distraction that the agent has interacted with and a reward of 0 for all other arms. This means that super arms are not rewarded together; only a single arm within the super arm gets a reward. If the agent has not interacted with any distractions in that turn then no reward is given to any arms that round. This allows for more specific rewards to the action types that are interacted with, compared to rewarding the super arm as a whole.
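A sketch of the per-round bookkeeping implied by this reward scheme is below. The data layout and names are our own; the paper only states that the interacted-with arm receives a reward of 1, the other played arms 0, and that no arm is rewarded when nothing is interacted with. We also assume a played arm still counts toward its play count in that case, which the text does not state explicitly.

```python
def update_arm_stats(plays, rewards, played_arms, interacted_arm=None):
    """Apply one round of semi-bandit feedback to per-arm tallies.

    plays, rewards -- dicts keyed by action type (e.g. "touch", "read")
    played_arms    -- action types of the distractions placed this round
    interacted_arm -- action type the player interacted with, or None
    """
    for arm in played_arms:
        plays[arm] = plays.get(arm, 0) + 1           # every arm in the super arm was played
        if arm == interacted_arm:
            rewards[arm] = rewards.get(arm, 0) + 1   # reward of 1 for the chosen distraction
        # the remaining arms in the super arm implicitly receive 0

# Example: the super arm {touch, read} is played and the agent reads the note.
plays, rewards = {}, {}
update_arm_stats(plays, rewards, ["touch", "read"], interacted_arm="read")
```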
We have also previously found mixed success in purely reducing the bias of the environment by only adding a single distraction per round, selecting the distraction from those that have action types that are not in the current area. Taking that as inspiration we have developed an additional improvement: replace-with-environment-action. With this improvement active, a super arm that contains a distraction with an action type that is already present within the current area will have that distraction replaced with the matching environment action. Instead of adding a distraction of that overlapping type, the algorithm considers any action taken of that overlapping type in that round to count towards that distraction. This has the effect of adding some of the player's interaction with the environment to the CMAB reward model, and it reduces the number of distractions added early on to around 1.7 per round in the first 10 turns; as the algorithm gains more information it rarely needs to substitute an environment action, and the average rises to 1.95 after the first 10 turns.
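The replacement rule can be thought of as a small filter over the selected super arm, sketched below. Splitting the result into distractions to spawn and environment action types to watch is our framing of the description above, and the names are hypothetical.

```python
def apply_replace_with_environment_action(super_arm, room_action_types):
    """Split a selected super arm into distractions to add and
    environment action types to observe instead.

    super_arm         -- action types chosen by the CMAB this round
    room_action_types -- action types already afforded by the current area
    """
    to_spawn = []   # distractions that still need to be added
    to_watch = []   # types already covered by existing objects in the room
    for action_type in super_arm:
        if action_type in room_action_types:
            # Do not add a redundant distraction; any action of this type
            # taken this round counts as interacting with this arm.
            to_watch.append(action_type)
        else:
            to_spawn.append(action_type)
    return to_spawn, to_watch

# Example: the CMAB picks {"talk", "read"} but the room already supports talk.
spawn, watch = apply_replace_with_environment_action(["talk", "read"], {"talk", "look"})
```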
4. Experiments and Results

In this section, we will go over both our tests with automated agents and human subjects as well as their results. While these two experiments share many features, humans require much more than artificial agents do, and as such we will note where there are differences.

4.1. Automated Experiments

The goal of these experiments is to evaluate our CUCB method in a controlled situation. Thus, we follow our previous experimental outline in which we use autonomous agents rather than humans. In these experiments, we compare our CUCB methods against the best performing method from that work, ε-greedy.

4.1.1. Environment

Our automated agents use a smaller environment with simpler action types, as the artificial agents are free to perform the same actions and explore the same rooms without getting bored or disengaging with the experience, as a human in a similar situation would. This environment is written in Inform7, a language and integrated engine used to create interactive fiction that is played with natural language syntax. Inform7 is well suited to be used with artificial agents and is easily understandable to humans.

In this environment, we use seven areas, called rooms, which are all traversable using cardinal directions. Within these rooms are several objects that the player agents can interact with, with each object having a corresponding action type. We make use of five different action types: look, talk, touch, read, and eat. There is a sixth type of action, move, but since it is so prevalent and necessary for gameplay we do not consider it something that a player can prefer. Of these, we consider talk to be the Primary action type since it is the one most prevalent in the environment and is the focus of the quest that the player agent is tasked with. Other types of actions are also present within the environment but are not the focus, namely look and touch. It is often not possible to create an environment or quest in a game that only uses a single type of action, so we include these to mimic that. These, along with the Primary action type, are considered Environmental action types since they are present in the environment. The last two, read and eat, are missing from the environment and thus are considered Missing action types. We include these two types of actions, which are completely missing from the environment, to simulate the effects of the ExpM having previously biased the environment due to the player not preferring read or eat actions.

These five action types are also the basis for our player model, with the player agents preferring to primarily interact with one of these action types. We expect that players will have a strong preference towards one of these, but also that it is often not possible to complete quests without engaging with the others. Because of this, our player agent preferences are set to primarily interact with one type of action (11/15ths of the time), but have a low chance of interacting with the other 4 (1/15th of the time each).
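For illustration, a preference distribution with the stated 11/15 and 1/15 weights can be built and sampled as follows. The helper names are ours and this is not the agents' actual implementation.

```python
import random

ACTION_TYPES = ["look", "talk", "touch", "read", "eat"]

def make_preference(preferred):
    """11/15 weight on the preferred action type, 1/15 on each of the rest."""
    return {a: (11 / 15 if a == preferred else 1 / 15) for a in ACTION_TYPES}

def sample_action_type(preference, rng=random):
    types, weights = zip(*preference.items())
    return rng.choices(types, weights=weights, k=1)[0]

# An agent that primarily prefers talk but occasionally does anything else.
pref = make_preference("talk")
print(sample_action_type(pref))
```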
4.1.2. Agents

To test our method in our environment we make use of the same 3 automated agents used in our previous work. These agents are known as the Exploration Focused Agent, Goal Focused Agent, and Novelty Focused Agent. These agents work in different ways and are intended to resemble different aspects of how humans would play the game. Two of these, the Exploration and Goal focused agents, are inspired by the Bartle taxonomy of player types [21], representing the explorers and achievers quadrants respectively. Since there is no social aspect to our game we do not consider the other two quadrants; instead, for our last agent we take inspiration from literature on user engagement that states that more novel objects are likely to increase engagement [22].

Each of these three player agents has an internal preference distribution that it uses to decide which game objects to interact with, but each agent uses it slightly differently. The Exploration Focused Agent has a 90% chance to interact with an object that it sees in the room, with a 10% chance to wander to a different room. From all the objects that are available it randomly chooses one, drawing the probability of interaction from its preference distribution, and if there are no objects it moves to a different room. The Goal Focused Agent, on the other hand, first chooses a type of action to interact with according to its preference distribution. If it fails to find a suitable object that is compatible with that type of action it will instead take a step towards completing the quest goal. Most actions needed to complete the quest goal are either moving to a different room or talking, the Primary action type. Lastly, the Novelty Focused Agent puts equal importance on both the novelty of an object and its own preference distribution, and likewise first chooses a type of action to interact with before finding the object in the environment matching that action type. If there are multiple objects of that type it will prefer the one it has interacted with the least, and if there are no objects of that type it falls back on taking a step towards completing the quest goal.
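To make the behavioral differences concrete, here is a rough sketch of how we read the Goal Focused Agent's decision rule. The helpers (objects_in_room, quest_step) are placeholders for state the real agent gets from the Inform7 environment, and the other two agents differ as described above.

```python
import random

def goal_focused_turn(preference, objects_in_room, quest_step, rng=random):
    """One decision of the Goal Focused Agent.

    preference      -- dict of action type -> probability
    objects_in_room -- dict of action type -> list of interactable objects
    quest_step      -- fallback action that advances the quest (move or talk)
    """
    # First choose an action type according to the preference distribution.
    types, weights = zip(*preference.items())
    chosen_type = rng.choices(types, weights=weights, k=1)[0]

    # Interact with a matching object if one is available...
    candidates = objects_in_room.get(chosen_type, [])
    if candidates:
        return ("interact", rng.choice(candidates))
    # ...otherwise fall back on making progress towards the quest goal.
    return ("quest", quest_step)

# Example: if the chosen type has no object nearby, the agent advances the quest.
pref = {"talk": 11/15, "look": 1/15, "touch": 1/15, "read": 1/15, "eat": 1/15}
print(goal_focused_turn(pref, {"talk": ["miner"], "look": ["poster"]}, "go north"))
```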
4.1.3. Managers

Alongside our three agents, we tested each against five different managers. As these do not implement full experience managers, we use the term manager to distinguish them. We compare against a lower baseline which takes advantage of how our player agents work by always providing 5 different types of distractions, which we call the One-of-each manager. We also compare against our previous best, ε-greedy. The last three managers are dedicated to testing variants of CUCB: one where a super arm consists of 2 distractions (k = 2), another with a super arm of 3 distractions (k = 3), and the last where a super arm only has 2 or fewer distractions using our replace-with-environment-action strategy (k = 2, rwea). All of these managers calculate the player model the same way, by measuring the frequency of the types of actions that the player takes starting from when the manager thinks that the player has shifted their preference.
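For reference, we understand the ε-greedy baseline to be a standard ε-greedy arm selection; a sketch of that standard rule follows, with names of our choosing. It is not the original baseline's code.

```python
import random

def epsilon_greedy_arm(means, epsilon=0.2, rng=random):
    """Standard epsilon-greedy selection over distraction action types.

    means   -- dict of action type -> empirical mean reward
    epsilon -- exploration rate (0.2 for the baseline manager above)
    """
    if rng.random() < epsilon:
        return rng.choice(list(means))     # explore: a random distraction type
    return max(means, key=means.get)       # exploit: the current best estimate

print(epsilon_greedy_arm({"look": 0.1, "talk": 0.0, "touch": 0.2, "read": 0.5, "eat": 0.0}))
```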
4.1.4. Preference Scenarios

We also tested a number of preference switch scenarios for a full understanding of how our methods perform. We include 20 different scenarios, switching from primarily preferring one action type to a different action type. These are grouped into 4 groups that represent whether the preferred action type is considered to be Environmental or Missing: Environment to Environment, Missing to Environment, Environment to Missing, and Missing to Missing. We previously identified that the most relevant scenario group for analysis is Environment to Missing, so we will be focusing on it, but for completeness we include the results for all scenario groups in the appendix.

4.1.5. Results

For each combination of the 5 managers, 3 agents, and the preference switch scenarios we run 100 trials. Each trial consists of 100 turns of history that is shared between all the agents but differs for each preference switch scenario. Since this history is shared between the agents it is run with the Goal focused agent. This history consists of 90 turns in which the agent follows its initial preference, followed by 10 turns of its switched preference in which the managers are recording but not actively giving distractions. These 10 turns are used to simulate the time it would take to detect that a player preference shift has happened. At turn 100 the state of the quest is reset to allow agents to continue exhibiting their quest completing behavior, the managers start taking distraction actions, and the manager and agent continue until turn 199. We compare the agent's internal preference to the manager's calculated player model (which it starts to measure from turn 90 onwards) with the Jensen-Shannon (JS) distance. A lower value corresponds to a closer match between the two models and indicates a better result.
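The metric can be computed directly from the observed actions. The sketch below builds the measured player model as normalized action-type frequencies and computes the base-2 Jensen-Shannon distance; the function names and the toy data are ours.

```python
import math
from collections import Counter

def player_model(observed_actions, action_types):
    """Normalized frequency of each action type observed since measurement began."""
    counts = Counter(observed_actions)
    total = sum(counts[a] for a in action_types) or 1
    return [counts[a] / total for a in action_types]

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS divergence)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

types = ["look", "talk", "touch", "read", "eat"]
internal = [1/15, 1/15, 1/15, 11/15, 1/15]          # agent now prefers read
measured = player_model(["read", "talk", "read", "read"], types)
print(js_distance(internal, measured))
```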
The results of our tests can be seen in Figure 2, focusing on what has been identified as the most realistic scenario, when the player agent switches from preferring Environmental action types to Missing ones.

Figure 2: Mean JS Distance between Player Model and Agent Preferences vs. Turn in the Environment to Missing Scenario.

4.2. Human Study

We found that our method performs better than previous methods with automated agents, but it had not been tested with human subjects. In this section, we will go over the modifications that we made to the environment and the distractions to make this compatible with human players. We will also detail the specifics of the human task. This study was reviewed by our University's Institutional Review Board (IRB).

4.2.1. Environment

To better accommodate the complexities of human behavior we have heavily modified the Inform7 environment for human use. We expect that, unlike agents, humans will grow bored and disengaged when attempting to traverse the same 7 rooms and the single quest for 75 turns. Thus, we have expanded the number of rooms available to the player from 7 to 36 rooms in a 6 by 6 grid. These rooms are all traversable with only cardinal directions for ease of navigation, but form a more complex map which can be seen in Figure 1. Along with the expanded number of rooms, we have added 2 tasks for players to engage in, each more complex than the task in our automated tests, to better simulate a normal interactive fiction game.

Figure 1: The expanded map for the human environment. The participant starts in the Village Outskirts (VO) and is tasked with retrieving a gem from the Secret Mining Shaft (SM) in the upper right, finding a way to bypass a group of miners blocking the path.

We have also used different action types to better represent the types of actions that humans would normally engage with in games: Diplomacy, Crafting, Combat, Stealth, and Magic. Since we have fewer human participants than we can have automated agents, we have also modified the distributions of the action types in the environment. For these tests we consider Diplomacy the Primary action type, Combat to be an additional Environmental action type, and the rest to be Missing since they are not normally available in the environment.

4.2.2. Distractions

This environment also includes distractions, which are normally not present in the world but can be added to the room the player is currently in with a manager action. Since we suspected that players would either need to be alerted to objects added to a room or would otherwise not notice that they were added, we have opted to only add distractions when the player moves to a different room, as this aligns with when the room description is printed out, listing the items. These distractions are removed from the room when the player leaves the room.

This change does affect how our manager handles observing player actions. Now the observations happen on room change, and all distractions that the player has interacted with before moving on to a new room are counted as being interacted with. Distractions resolve themselves in a single turn and are subsequently removed so as to only allow the player to interact with each once. A sample of the distractions for each type can be seen in Table 1.

Table 1: An example of distractions used in the human study. Distractions are designed to resolve themselves after a single interaction and to sometimes give the player a small reward. These rewards were displayed in bold to the player, though their values are not tracked for this experiment.

Combat
  Attack Strange Goblin: As you approach the figure it turns out to be several branches vaguely in the shape of a goblin.
  Attack Deer: You quickly take down the deer. +10xp, +3 rations.
Crafting
  Take Meteoric Iron: You have acquired +1 iron.
  Take cotton cloth: You have acquired +1 cloth.
Diplomacy
  Help Hungry Beggar: You give some food to the beggar, who eats it gratefully and blesses you for your kindness. +1xp.
  Help Trapped Frog: You carefully free the frog from the crevice, and it hops away with a thankful croak. +2xp.
Magic
  Pray at idol statue: Your heart is warmed by the god's blessing and you are filled with peace and courage. +2mp.
  Touch mystical vine: The vine's light pulses stronger, and you feel a rush of vitality. +1hp.
Stealth
  Steal coin purse: You decide to take the whole thing +10gp, though you discard the coin purse itself after you extract the contents.
  Pickpocket snoozing man: The man reeks of alcohol and is absolutely conked out, but unfortunately you do not find anything of use on him.

Distractions were also originally designed to end with a failure outcome: a monster turns out to be a bunch of branches shaped like a monster, or a deer runs away before it is hit. This is due to our focus on the distraction being irrelevant to the larger narrative. We found that preliminary participants found these discouraging and recognized that such objects would end in failure, which led many to simply not interact with any distractions. This led us to change most distractions to resolve themselves with a small success instead, giving the player a small reward such as XP, rations, or crafting materials, though these values were never actually tracked. These changes contributed to participants continuing to interact with distractions, though they may have made them too enticing. For future work, we consider that these sorts of rewards need to be finely tuned and contextualized so as to not make the act of interacting with distractions too enticing to the player.

4.2.3. Task

Our goal in this task is to show that the same trends that are visible in the automated agents reflect the sorts of behaviors found in human players. To do this we set up the task to mimic the Environment to Missing scenario group, focusing only on a single type of preference shift for simplicity. To this end we recruited 30 anonymous human participants on Prolific to take part in this study, only restricting the location to English speakers in the United States. Of these, we had to remove a single user. This user was able to finish the task, but the majority of actions they tried to take caused errors as they did not use the syntax necessary for Inform7 games, leaving only 22 valid actions, all but one of which was just moving around the map.

We task the human player to start with a preference for Diplomacy type actions and, after playing for 20 turns, to switch their preference to Crafting. We simulate that it will take some time for the manager to detect that a player's preference has shifted, so we have added a 5 turn offset from when the participant is asked to switch their preference to when the manager can start taking actions. This 5 turn offset is optimistic, and previous results have indicated that it should take longer [5]. Nonetheless, this is left at 5 turns so as to not frustrate the participant and to minimize the amount of time they need to spend on the task.

Afterward, the manager continues to observe the participant's actions and add new distractions for 50 more turns. Due to the way that the CUCB algorithm works, the first 5 of these turns are initialization rounds whose super arms together cover the five arms; we have chosen to always play the two distractions that have been least played so far, which results in each type of distraction appearing twice. After those 50 turns are over, on turn 75, we automatically end the play session.
4.2.4. Processing

In serving the Inform7 environment to humans we were presented with a number of limitations, both from the game due to using a web interpreter (Quixe, bundled with Inform7) and due to the changes that were needed to make this playable by humans. Unlike the automated tests, we no longer had access to the internal state information that reports what kind of action was made and had to classify it manually. This was done based on keyword matching, generally based on the verb used, iteratively continuing to add keywords until all user commands were classified. We classified these into several categories: one for each of the five action types, additionally move for movement actions, and none for all the rest. The five action type categories mostly used the verbs related to the proper way of interacting with the distraction, but we also added extra keywords when the intent was clear (e.g. the invalid verbs "repair" and "craft", which are clearly an attempt to craft). The largest of these categories is none, which accounts for all invalid verbs due to spelling mistakes, commands without verbs, and valid verbs like "look" and "examine" which simply do not correspond to an action type. For our player model calculation, we only used the 5 action type categories.
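A simplified version of this keyword classification is sketched below. The keyword lists shown are illustrative stand-ins: the real mapping was grown iteratively until every logged command was covered, and we do not reproduce its exact contents here.

```python
# Illustrative keyword lists only; the real mapping differs from this sketch.
KEYWORDS = {
    "diplomacy": {"talk", "help", "give", "ask"},
    "crafting":  {"take", "craft", "repair"},
    "combat":    {"attack", "hit", "kill"},
    "stealth":   {"steal", "pickpocket", "sneak"},
    "magic":     {"pray", "cast"},
    "move":      {"go", "north", "south", "east", "west"},
}

def classify_command(command):
    """Map a raw player command to an action type by its leading verb."""
    words = command.lower().split()
    if not words:
        return "none"
    for category, verbs in KEYWORDS.items():
        if words[0] in verbs:
            return category
    return "none"   # misspellings, missing verbs, or verbs like "look"/"examine"

print(classify_command("craft a rope from the vines"))   # -> crafting
```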
4.2.5. Results

The results of our human experiments can be seen in Figure 3. Since we do not expect to know what the final preference is in normal gameplay, we have opted to show the JS distance of the measured player model compared to five different preference distributions. These five distributions correspond to preferring one of the five different action types and are distributed similarly to our agents' internal preferences: 11/15 for the preferred action type and 1/15 for the others. The results also depend on what turn we start measuring the player model. Ideally, detecting a preference shift would be able to find which turn a player shifted their preference, but to cover all cases we show what the JS distance looks like when starting from the beginning (20 turns before a shift), when the preference shift occurs, and when the manager starts to give distractions (5 turns after the shift).

Figure 3: Mean JS Distance between Player Model and 5 Primary Preference Models.
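To illustrate how the five curves in Figure 3 can be produced, the sketch below compares a measured model against each reference distribution using SciPy's Jensen-Shannon distance. The reference construction mirrors the 11/15 and 1/15 weights described above; the measured vector and the function names are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

TYPES = ["diplomacy", "crafting", "combat", "stealth", "magic"]

def reference_model(preferred):
    """Reference distribution preferring one type 11/15, the rest 1/15 each."""
    return np.array([11/15 if t == preferred else 1/15 for t in TYPES])

def compare_to_references(measured):
    """JS distance from the measured model to each 'Vs. <type>' reference."""
    return {t: jensenshannon(measured, reference_model(t), base=2) for t in TYPES}

# e.g. a participant whose post-shift commands were mostly crafting actions
measured = np.array([0.15, 0.60, 0.10, 0.10, 0.05])
print(compare_to_references(measured))
```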
5. Discussion

We found that CUCB outperforms the previous best of ε-greedy, with the ability to get surprisingly close to One-Of-Each. Additionally, we found that human results show similarities to the artificial agents, but that creating distractions that are well suited for human players requires careful balancing. We will discuss further implications of these experiments below.

5.1. Automated Agents

The best performing manager is the One-Of-Each manager, which serves as our lower baseline by specifically taking advantage of our agents' behavior. This manager provides one of each of the five distractions on every turn, which means that all possible action types are always represented. Since our agents do not take into account how many distractions there are, as humans would, this serves as an approximation of the lower limit on how quickly a player model can possibly be recovered. For this reason, we do not consider this a valid strategy to test on humans, since it would quickly overwhelm them with options and would require a large variety of distractions so as to not be repetitive.

For our CUCB based managers we find that giving only two distractions at a time already provides significant benefits over our previous best, ε-greedy (ε = 0.2). This is expected, as giving more distractions allows us to gain more information, taking advantage of not just whether the player agent interacts with a distraction, but also which of the distractions it chooses to interact with. Previously an effect was found where the distance between the agent's internal preference distribution and the measured preference distribution hits a minimum value and then starts to increase. This is attributed to the MAB gaining enough information on the preferences that it started to give the highest valued action type almost exclusively, though this was often only seen in less realistic scenarios, like when the agent shifts to an action type that is already well represented by the environment.

In our tests we see that this effect is also present even in the Environment to Missing scenario in the Exploration focused agent for everything but the ε-greedy manager. This suggests that for this agent the strategies that exhibit this effect are capable of identifying the new preference within 20 turns of the manager activating, and this is likely because one of the distractions given is almost always the preferred distraction. Other agents do not exhibit this effect, which we take to mean that it is significantly more difficult to recover the player's preferences for them. For both the Goal and Novelty focused agents this is likely because they will default to environment actions when they are not given an action that they wish to interact with, thus skewing the data in favor of environment action types.

For CUCB, both k = 3 and k = 2 when replacing a distraction with an environment action are capable of getting surprisingly close to One-Of-Each in all agents. Replacing a distraction with an environment action was developed to reduce the number of distractions that were shown each turn, which reduced the number of distractions from 2 to an average of 1.93, though this value is around 1.3 for the first couple of turns. Later on the manager has enough information about actions in the environment that it does not need to play them as often, so replacement only occurs rarely. We expected that this would result in a hit to this strategy's ability to recover the new preferences, but found that it instead increased that ability. Without replacement, when the agent interacts with the environment no reward is given to any of the actions in the environment. In replacing a distraction with an environment action we allow any object that the agent interacts with to count as a reward for the CMAB algorithm, which allows it to watch for more actions than before and naturally fill in action types that are underrepresented due to the biasing of the environment.

5.2. Human Study

In our results, we find a pattern that is similar to our agents. Our agents (Figure 2) measure the JS distance against their own internal preference distribution, and the analogous curve in our human data (Figure 3) would correspond to Vs. Crafting, as that is what we instructed the participants to prefer after the switch. We expect that as soon as distractions are available (turn 25) we should see a sharp drop in the distance between the measured player model and a primarily Crafting model, while the distance to the other models steadily rises starting from when the preference shift occurred (turn 20). This trend exists when measuring from the beginning, but we additionally see a small immediate drop. This is due to some players attempting to take crafting actions immediately even though the environment does not allow for it. We also expect the trend of the measured distance to eventually flatten out, which is observed in all three plots.

The behavior of users attempting to do actions that are not possible is due to how we have to manually classify what type of action a human commands. Instead of directly looking at the type of action as reported by Inform, as we can for our automated tests, we instead classify actions based on the keyword that was used in the command. This allows us to classify actions that are not actually possible in the game, so it still counts as a crafting action when a participant attempts to "craft" something. We found in preliminary tests that many participants struggled to navigate the interface early in the experiment, though we still wanted to capture the intent of the commands entered.

This confusion on how to navigate did not affect all users, but in response we clarified the instructions and guided users to use the "help" command if they needed it. However, we also found that the issue may have been more fundamental to the distractions. An early version of the distractions had a larger variety of interactions with each type of distraction, and we had not yet considered the importance of the distraction's action type being easily recognizable by the player. Take the Combat distractions as an example. Early formulations of these had either equipment you could pick up or monsters you could attack. These could be confusing to players, as picking up items to disassemble them is recognizably a Crafting action; the only distinguishable difference between the two would be the outcome after the player has already interacted with it. The current version of Combat distractions simplifies this, always presenting animals to be hunted or threatening monsters, and always interacting by attacking. An ideal implementation of the system would allow for distractions to be independent of the action, allowing for a generic distraction with multiple types of interaction. In the future we plan on investigating how these sorts of combo-distractions can be used, especially if they can themselves be formed as a super arm, thereby allowing for the use of fewer distractions at a time while still gaining the same amount of information on the player's preferences.

We find that the distance measurement is sensitive to when we start measuring. When we start measuring at the beginning, 20 turns before the preference shift, we find that the extra Diplomacy moves make it more difficult to find what the preference is immediately after the shift. Starting at later points makes the trend easier to see, but may throw out relevant data. Only starting from when the manager detects a preference shift may not perform as well as having a good estimate for when the preference shift occurred.

6. Conclusion

Experience managed games help guide players through a more interesting and tailored experience, but this process of customizing the experience does not take into account the unpredictable nature of people. Player experience may degrade if this unpredictability is not taken into account and accommodated. Previously it has been shown that recovering a player model after a preference shift is possible, but only with automated agents. In this paper we further improve on this process by modeling it as a combinatorial multi-armed bandit and making use of the existing environment to supplement distractions. In addition, we demonstrate that humans behave similarly to how the artificial agents behave. In the future we plan on expanding this system to combine detection of preference shifts and player model recovery after a shift has been detected.

References
[1] M. Sharma, S. Ontanón, C. R. Strong, M. Mehta, A. Ram, Towards player preference modeling for drama management in interactive stories, in: FLAIRS, 2007, pp. 571–576.
[2] D. J. Thue, Generalized experience management (2015).
[3] H. Yu, M. Riedl, Data-driven personalized drama management, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 9, 2013, pp. 191–197.
[4] J. Valls-Vargas, S. Ontanón, J. Zhu, Exploring player trace segmentation for dynamic play style prediction, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 11, 2015, pp. 93–99.
[5] A. Vinogradov, B. Harrison, Detecting player preference shifts in an experience managed environment, in: International Conference on Interactive Digital Storytelling, Springer, 2023, pp. 517–531.
[6] A. Vinogradov, B. Harrison, Using multi-armed bandits to dynamically update player models in an experience managed environment, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 18, 2022, pp. 207–214.
[7] W. Chen, W. Hu, F. Li, J. Li, Y. Liu, P. Lu, Combinatorial multi-armed bandit with general reward functions, Advances in Neural Information Processing Systems 29 (2016).
[8] M. O. Riedl, A. Stern, D. Dini, J. Alderman, Dynamic experience management in virtual worlds for entertainment, education, and training, International Transactions on Systems Science and Applications, Special Issue on Agent Based Systems for Human Learning 4 (2008) 23–42.
[9] M. Mateas, A. Stern, Integrating plot, character and natural language processing in the interactive drama Façade, in: Proceedings of the 1st International Conference on Technologies for Interactive Digital Storytelling and Entertainment (TIDSE-03), volume 2, 2003.
[10] A. Lamstein, M. Mateas, Search-based drama management, in: Proceedings of the AAAI-04 Workshop on Challenges in Game AI, 2004, pp. 103–107.
[11] M. J. Nelson, M. Mateas, Another look at search-based drama management, in: AAMAS (3), 2008, pp. 1293–1298.
[12] M. Sharma, S. Ontañón, M. Mehta, A. Ram, Drama management and player modeling for interactive fiction games, Computational Intelligence 26 (2010) 183–211.
[13] R. C. Gray, J. Zhu, S. Ontañón, Multiplayer modeling via multi-armed bandits, in: 2021 IEEE Conference on Games (CoG), IEEE, 2021, pp. 01–08.
[14] J. Zhu, S. Ontañón, Experience management in multiplayer games, in: 2019 IEEE Conference on Games (CoG), IEEE, 2019, pp. 1–6.
[15] R. Khoshkangini, S. Ontanón, A. Marconi, J. Zhu, Dynamically extracting play style in educational games, EUROSIS Proceedings, GameOn (2018).
[16] R. C. Gray, J. Zhu, D. Arigo, E. Forman, S. Ontañón, Player modeling via multi-armed bandits, in: Proceedings of the 15th International Conference on the Foundations of Digital Games, 2020, pp. 1–8.
[17] K. Y. Kristen, M. Guzdial, N. R. Sturtevant, M. Cselinacz, C. Corfe, I. H. Lyall, C. Smith, Adventures of AI directors early in the development of Nightingale, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 18, 2022, pp. 70–77.
[18] A. Slivkins, et al., Introduction to multi-armed bandits, Foundations and Trends in Machine Learning 12 (2019) 1–286.
[19] W. Chen, Y. Wang, Y. Yuan, Combinatorial multi-armed bandit: General framework and applications, in: International Conference on Machine Learning, PMLR, 2013, pp. 151–159.
[20] J.-Y. Audibert, S. Bubeck, G. Lugosi, Minimax policies for combinatorial prediction games, in: Proceedings of the 24th Annual Conference on Learning Theory, JMLR Workshop and Conference Proceedings, 2011, pp. 107–132.
[21] R. Bartle, Hearts, clubs, diamonds, spades: Players who suit MUDs, Journal of MUD Research 1 (1996) 19.
[22] H. L. O'Brien, E. G. Toms, The development and evaluation of a survey to measure user engagement, Journal of the American Society for Information Science and Technology 61 (2010) 50–69.