<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-Character Text Role-Playing Game Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christopher Cui</string-name>
          <email>ccui46@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangyu Peng</string-name>
          <email>xpeng62@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Riedl</string-name>
          <email>riedl@cc.gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Reinforcement Learning, Text Role Playing Games, Open-endedness, Few-shot learning</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AIIDE Workshop on Experimental Artificial Intelligence in Games</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>Atlanta, GA, 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Text-adventure games and text role-playing games are grand challenges for reinforcement learning game playing agents. Text role-playing games are open-ended environments where an agent must faithfully play a particular character. We consider the distinction between characters and actors, where an actor agent has the ability to play multiple characters. We present a framework we call a thespian agent that can learn to emulate multiple characters along with a soft prompt that can be used to direct it as to which character to play at any time. We further describe an attention mechanism that allows the agent to learn new characters that are based on previously learned characters in a few-shot fashion. We show that our agent outperforms the state-of-the-art agent framework in multi-character learning and few-shot learning.</p>
      </abstract>
      <kwd-group>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Text Role Playing Games</kwd>
        <kwd>Open-endedness</kwd>
        <kwd>Few-shot learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Text adventure games are those in which a player can
only interact with an interactive environment through
reading text descriptions of the environment and acting
by typing descriptions of actions. Text games present
a grand challenge for AI because they (a) are partially
observable; (b) have combinatorially large state spaces
consisting of all possible descriptive text strings; (c) have
combinatorially large action spaces in the order of
billions of possible text commands; (d) require reasoning
about long-horizon causal dependencies; and (e) require
commonsense and narrative trope reasoning [1]. Text
adventure game playing has become a benchmark challenge
for reinforcement learning (RL) agents [1, 2, 3, 4, 5, 6],
which play by exploring the environment and receiving
a score based on how far they make it through the game.</p>
      <p>Relatedly, table-top role playing games, such as
Dungeons &amp; Dragons, involve multiple players that interact
with textual descriptions of the environment as well as
dialogue with other players. While players may be
motivated by a quest or mission, table-top role playing games
are fundamentally open-ended, meaning that players can
interact with the environment and with each other in
ways that are not strictly dictated by a quest, mission,
or set of puzzles. Open-ended role-playing extends the
same challenges of text adventure games but removes the
environmentally-dictated reward structure. The predominant
question for open-ended role-playing is whether an
agent acts consistently with a given character definition.</p>
      <p>Because there may be no explicit reward associated
with progression in open-ended role playing games, an
agent must instead be trained to, at least, emulate
particular character types such as a “thief” or an “adventurer”,
each of which has different preferences for different
actions depending on the situation.<sup>1</sup></p>
      <p><sup>1</sup>This is a simplification of table-top role-playing games that can
also feature distinct character personalities and back-stories.</p>
      <sec id="sec-2-2">
        <title>Workshop Proceedings (CEUR-WS.org)</title>
        <p>environmentally-dictated reward structure. The predom- produces a probability distribution across all actions for each</p>
        <sec id="sec-2-2-1">
          <title>Dungeon</title>
          <p>...</p>
          <p>You are in the Dungeon. It is dark and gloomy...
There is a skeleton here.</p>
          <p>There is a path south, a path east, and a path west.
You have a sword, a shield, and a tattered map.
Previous action: Go west</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Prompt:</title>
          <p>You are an adventurer
pickpocket skeleton</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>Thief</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>Adventurer Rogue</title>
          <p>As an adventurer, the best action
for me to take is hit skeleton
You hit the skeleton! The skeleton died!
of diferent characters by being provided a prompt that
indicates which character it should emulate at the time. The agent
prompt. Then, for the current character, the corresponding
distribution is then sampled to produce the chosen action.
tions depending on the situation.1
agent must instead be trained to, at least, emulate
particular character types such as “thief” or an “adventurer”,
each of which has diferent preferences for diferent
ac1This is a simplification of table-top role-playing games that can also</p>
      <p>In this paper we consider the distinction between character
agents and actor agents. A character agent is trained
to act like one specific character; for all intents and
purposes it is that character and knows nothing else but
how to be that character. In contrast, an actor agent has
knowledge about how to play many different characters
and can receive instruction from an external source (for
example a movie director or a dungeon master) about
which character type to play. Furthermore, an actor can
leverage the character knowledge to learn to blend characters
with only a small amount of additional practice
(e.g., few-shot learning) without exhaustively re-training
from scratch. We will refer to actor agents as thespian
agents to distinguish agents that learn to enact multiple
characters from the actor-critic reinforcement learning
architecture.</p>
      <p>This paper considers two challenges. The first is to
train a single reinforcement learning agent model that
can switch between character types with a simple
instruction. We present a new RL agent that can learn
to emulate multiple characters simultaneously with an
updated policy model that generates |C| sets of action
distributions, where C is a set of character classes. The
agent also learns a soft prompt that can later be provided
as a cue to emulate a specific character.</p>
      <p>The second challenge is to be able to train a thespian
agent to learn new characters in a fraction of the training
time while maintaining performance in the previously
trained characters. We achieve this by adding an attention
mechanism to the outputs of the thespian agent,
which can learn how to blend the action probabilities of
different characters, thus learning a new character and a
new soft prompt.</p>
      <p>To return to our character vs. actor metaphor, we now
have a thespian model that can simultaneously generate
different actions for different characters. This is equivalent
to a thespian thinking about how different characters
will respond to the same situation. The thespian agent
receives direction in the form of a prompt indicating what
character to play. If the thespian needs to play a new
character that it has never played before, it can learn a
new prompt for the new character much faster than if it
had to learn from scratch, by leveraging what it already
knows about playing other characters.</p>
      <p>We conduct experiments across two original character
types, a “thief” and an “adventurer”, and demonstrate
the ability of a single thespian agent trained on both
characters to perform as well as separate baseline models
trained to emulate individual characters. We show
that we can use a novel attention mechanism to learn a
third character that is a blend of the previously trained
characters in a few-shot fashion. This few-shot character
learning is 10x faster than baseline alternatives and
doesn’t degrade the performance of the original characters.</p>
    </sec>
    <sec id="sec-2-2">
      <title>2. Background and Related Work</title>
      <p>The distinction between characters and actors has been
made before. Louchart and Aylett [7] consider an actor
agent one that makes a secondary assessment of its own
cognitive and emotional state. Riedl [8] considers an actor
agent one that doesn’t just reason about the best action
to convey a character but also incorporates directorial
goals. Si et al. [9] consider an actor agent one that reasons
about the cognitive state of other interlocutors in an
interactive game; they also referred to their agent as a
thespian. These prior works looked at acting as
meta-cognition, but agents could not represent more than one
character without retraining or reprogramming. While
our work can also be considered a form of meta-cognition,
our focus is on a single model trained to be able to reason
about and enact different characters.</p>
      <sec id="sec-2-2-1">
        <title>2.1. Text Adventure Game Playing Agents</title>
        <p>Text adventures are games in which the player must read
textual descriptions of the environment and describe their
actions with short text commands. Most text adventure
games have a narrative progression through puzzles toward
an ultimate goal or conclusion. Text-based games
have shown great potential for use as reinforcement
learning benchmark environments [1, 2]. Ammanabrolu
and Riedl [3] proposed augmenting reinforcement learning
with knowledge graphs as external memory about
world state. Ammanabrolu and Hausknecht [4] proposed
KG-A2C, which integrates knowledge graphs into the
actor-critic [10] RL framework. The Q*BERT agent [5]
further extended KG-A2C to incorporate the BERT [11]
language model into the model architecture. We build
on top of the KG-A2C family of models since they have
shown state-of-the-art performance. Other techniques
for playing text games include GATA [6], which builds
a knowledge-graph based representation of the world
on top of a transformer-based agent, training through a
combination of RL and self-supervised learning.</p>
      </sec>
      <sec id="sec-2-2-2">
        <title>2.2. Text-based Role Playing Agents</title>
        <p>Whereas text adventure games have pre-defined progression
toward a goal state, table-top role playing games
involve open-ended game play. We refer to text-based
environments that support open-ended game play as
text-based role playing to signify the interaction with the
environment through reading and writing text instead of
verbal interactions with other players and game masters.
The LIGHT environment [12] is a crowdsourced text-based
role playing game with a rich environment with
interactable NPCs, objects and locations, each with a
short paragraph description, demonstrating the value
of grounding in training agents that can not only act
but also converse successfully. Ammanabrolu et al. [13]
propose agents that can switch seamlessly between generating
natural language and action declarations. These
agents can learn to play different characters when given
a motivation that includes character type and goal as part
of the input world state. This work is most similar to ours,
except our agents do not require explicit motivations or
goals beyond a learned character prompt.</p>
        <p>Story Shaping [14] is a technique for training RL agents
to play text role-playing games wherein a story is converted
into a rich reward signal. The technique can be
used to train different characters, but can only train a
single agent to emulate a single character. Our character-based
reward strategy is related, but our rewards are
manually crafted instead of inferred from stories.</p>
      </sec>
      <sec id="sec-2-2-3">
        <title>2.3. Few-Shot Adaptation</title>
        <p>Large pre-trained language models have emerged as
extremely powerful tools for NLP tasks [15, 16, 17]. However,
a limitation of these powerful models is their size,
some with parameters numbering in the billions [17].
This makes them prohibitively expensive when it comes
to further training or fine-tuning. Low-Rank Adaptation
(LoRA) circumvents this by keeping the model frozen and
introducing trainable rank decomposition matrices. Our
proposed technique also freezes the core model and trains
additional layers on top, though the specific mechanics
needed for reinforcement learning are different.</p>
        <p>Prompt-tuning also avoids the need to do further training
on the model itself by introducing trainable, soft
prompts that learn an ideal input based on the desired
output [18]. [19] proposes pairing soft prompts with
an attention module to induce language models to perform
different tasks. Using knowledge from a previously
trained task to improve learning on a new task has also
been explored by [20], though their approach is more
focused on generalization across simpler objectives and
adaptation to unseen environments.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminaries</title>
      <p>3.1. Textworlds as RL Testbeds
A text-adventure or text-based role playing game can be
modeled as a partially-observable Markov decision
process (POMDP) M = ⟨S, T, A, Ω, O, R, γ⟩, where S is the set
of ground truth world states, A is the set of actions, T is
the probability of transitioning from one state to another
given an executed action, R is a reward function, Ω is
the set of possible observations, O is the probability of
observations given the ground truth world state, and γ
is a parameter estimating the reward horizon [1]. In our
setting, we will use a deterministic transition function T,
which is common in text-based games. However, nothing
in our proposed technique strictly requires it. The
objective of reinforcement learning is to learn a policy
π : S → A that maps states to actions, such that taking
the action mapped to the current state and following the
policy henceforth maximizes expected reward.</p>
      <p>3.2. LIGHT
Our agent is trained in the LIGHT environment [12], a
text world environment with a database of 1775 Non-Player
Characters (NPCs), 663 locations, and 3462 objects
with rich text descriptions. Game maps can also
be handcrafted with specifically placed NPCs, locations
and objects. We create a map for our experiments such
that multiple character types can have relevant activities
to perform, including interacting with objects and
NPCs. For example there are dragons for an “adventurer”
character to slay, and armor to don, whereas a “thief”
character can take money from the donations receptacle
in a sanctuary.</p>
      <p>Our experiments use base character types of “Thief”
and “Adventurer”. We also associate rewards with different
actions for each character type. For example, a “Thief”
character agent is rewarded for obtaining a hidden dagger,
stealing, and other thief-like actions. Likewise, an
“Adventurer” character agent is rewarded for obtaining
a sword and armor from the armory and killing monsters,
and other adventurer-like actions. There is no
requirement that an agent do particular actions and no
prescribed order. This is equivalent to the Story Shaping
technique [14], except the rewards are manual, which is
done to make more controlled experiments. Regardless
of character type, all games terminate when the agent
enters a particular, preset “goal room”, at which time
the agent receives a final reward that is smaller than the
others. The entire game map is provided in the appendix.</p>
      <p>3.3. KG-A2C
We build off of the KG-A2C agent framework [4], an
advantage actor-critic architecture augmented with a
knowledge-graph based attention. KG-A2C’s space of
observations includes (a) a text description of the room
the agent is in via the “look” command, (b) a text description
of the character’s inventory via the “inventory”
command, (c) the agent’s last command, and (d) feedback
from the last command. The state observations are
concatenated and embedded using a recurrent GRU.</p>
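      <p>As a rough illustration only (a minimal sketch, not the KG-A2C
implementation), the four observation strings can be tokenized,
concatenated, and embedded with a GRU; the tokenization, vocabulary
size, and dimensions below are assumptions.</p>
      <preformat>
# Minimal sketch: encode the four KG-A2C observation strings with a shared
# embedding and a GRU. All hyperparameters here are illustrative.
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, room_ids, inventory_ids, last_cmd_ids, feedback_ids):
        # Concatenate the four token-id sequences into one observation sequence.
        obs_ids = torch.cat([room_ids, inventory_ids, last_cmd_ids, feedback_ids], dim=1)
        _, h = self.gru(self.embed(obs_ids))   # h has shape (1, batch, hidden_dim)
        return h.squeeze(0)                    # one embedding vector per observation

# Toy usage with dummy token ids standing in for the tokenized text.
enc = ObservationEncoder()
dummy = lambda n: torch.randint(0, 1000, (1, n))
o = enc(dummy(20), dummy(8), dummy(4), dummy(6))
print(o.shape)  # torch.Size([1, 128])
      </preformat>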
      <p>Simultaneously, the state observation is used to
update a knowledge graph of facts about the world that
have been observed to date. This includes facts and relations
about rooms, objects in rooms, inventory items, etc.</p>
      <p>Figure 2: Complete thespian agent architecture. [The diagram shows
the embedded inputs (room description, inventory, game feedback,
previous action, and knowledge graph) combined with the character
prompts (thief prompt, adventurer prompt, other pre-trained prompts,
and a learnable prompt) feeding the RL agent; feedforward networks
produce per-character action logits, object logits, and values, which
are softmaxed and sampled, and per-character attention scores are used
to select the most influential persona.] The agent produces one set of
outputs per character for each observation. When training the Thespian
Attention, the blue-shaded boxes indicate frozen modules with
red-shaded boxes being trainable modules.</p>
      <p>This knowledge graph is then embedded using a graph
attention mechanism [21].</p>
      <p>Advantage actor-critic networks [22] have two heads.
The actor head generates logit scores, one for each
possible action, which can be converted to a probability
distribution via softmax and sampled to determine which
action the agent takes. The critic head estimates the utility
of the state. Actions are made up of verbs and optional
object names. The KG-A2C agent generates a verb, which
maps to a pre-defined template, and the generated object
name is used to populate the template.</p>
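      <p>A sketch of this template-plus-object action decoding is shown
below; it is an illustrative stand-in rather than the KG-A2C source,
and the template and object vocabularies are invented for the example.</p>
      <preformat>
# Illustrative sketch: a verb/template head and an object head over a shared
# state vector, plus a critic head. The sampled template is filled with the
# sampled object name to form the text command.
import torch
import torch.nn as nn
from torch.distributions import Categorical

TEMPLATES = ["go {}", "take {}", "hit {}", "look"]      # invented examples
OBJECTS = ["north", "sword", "skeleton", "map"]          # invented examples

class ActorCritic(nn.Module):
    def __init__(self, state_dim=128):
        super().__init__()
        self.template_head = nn.Linear(state_dim, len(TEMPLATES))
        self.object_head = nn.Linear(state_dim, len(OBJECTS))
        self.critic_head = nn.Linear(state_dim, 1)

    def act(self, state):
        t = Categorical(logits=self.template_head(state)).sample().item()
        o = Categorical(logits=self.object_head(state)).sample().item()
        value = self.critic_head(state)                  # estimated state utility
        template = TEMPLATES[t]
        command = template.format(OBJECTS[o]) if "{}" in template else template
        return command, value

state = torch.randn(128)
print(ActorCritic().act(state))   # e.g. ("hit skeleton", tensor([...]))
      </preformat>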
    </sec>
    <sec id="sec-4">
      <title>4. The Thespian Agent</title>
      <p>Building off the basic framework of KG-A2C, we describe
how a single agent policy model can learn to emulate
multiple characters. To train a single model to emulate
different characters, it must be rewarded differently for
each character, which can confuse an agent unless it
has a way of disentangling the characters. Our thespian
agent architecture addresses this challenge in two ways.
First, we provide a means to learn soft character prompts.
These are unique codes that are associated with different
characters and can be provided as input to indicate which
character the agent should emulate. Second, we change
the actor and critic heads to generate sets of logit scores
for all learned characters. Thus the agent can reason
about which actions are best for each character, and we
can sample from the set of logits for whichever character
we want to execute. Figure 2 (left, green box) shows the
thespian agent, focusing on these two aspects.</p>
      <p>4.1. Character Prompts
First, we allow for a soft character prompt to be learned.
Each prompt is associated with a different character the
model has been trained to emulate and induces the agent
to generate behavior that is consistent with the associated
character. This is similar to the notion of the soft
prompt [18], which is like a regular prompt for LLMs but
given as an embedding instead of natural language. The
soft character prompt vector of values can be interpreted
as an instruction analogous to saying “I am in state x
and I am a Thief. My next action would be...” at the
embedding level.</p>
      <p>Let P = [p_1, ..., p_n] be a set of soft character prompts,
one for each character c_i ∈ C, and let o be the embedded
current state observation. Initially, the prompts p_i are
empty, i.e., initialized with random numbers. The internal
state representation s_i for character c_i is:
s_i = W × cat(o, p_i)     (1)
where W is a set of trainable weights.</p>
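      <p>A minimal sketch of Equation 1, assuming a single linear map W and
randomly initialized prompts (the dimensions are illustrative):</p>
      <preformat>
# Sketch of Equation 1: each character has a learnable soft prompt p_i, and the
# internal state for character i is W applied to cat(o, p_i).
import torch
import torch.nn as nn

n_characters, obs_dim, prompt_dim, state_dim = 2, 128, 32, 128

# Randomly initialized ("empty") soft prompts, one row per character.
prompts = nn.Parameter(torch.randn(n_characters, prompt_dim))
W = nn.Linear(obs_dim + prompt_dim, state_dim, bias=False)   # trainable weights

def internal_state(o, character_idx):
    # Equation 1: s_i = W x cat(o, p_i)
    return W(torch.cat([o, prompts[character_idx]], dim=-1))

o = torch.randn(obs_dim)                  # embedded observation
s_thief = internal_state(o, 0)            # condition on the "thief" prompt
s_adventurer = internal_state(o, 1)       # condition on the "adventurer" prompt
print(s_thief.shape, s_adventurer.shape)  # torch.Size([128]) torch.Size([128])
      </preformat>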
      <p>The soft character prompts are learned as follows.
During training, the agent will engage in reinforcement
learning games as normal. In each game, the agent will be
provided with a different reward function for each character.
That is, a thief will be rewarded for certain actions
and an adventurer will be rewarded for different actions.
The character, corresponding character reward function,
and character prompt are rotated each game to balance
the training of multiple characters. Over time, each soft
prompt is updated via gradient flow through W such that
each unique prompt is associated with a particular way
in which the agent is rewarded.</p>
      <p>4.2. Character-Specific Action Scores
We also modify the agent model’s actor and critic modules.
The standard A2C framework produces logit scores
for each action. This vector of logit scores is traditionally
converted to a probability distribution with a softmax
layer and sampled to determine which action the agent
takes. Our thespian agent model instead produces a stack
of action logit scores. A softmax over this stack of logits
produces |C| probability distributions, one for each of
the |C| characters. The critic head is likewise modified
to produce |C| predicted utility scores, one for each
character. Thus, the agent is simultaneously determining
which action is best for each character and how good the
current state is from the perspective of each character.</p>
      <p>At training time, the characters are rotated each game
and the i-th set of logit scores is sampled to determine
the agent’s action, and the i-th utility value is used to
compute character-specific advantage loss. The loss is
backpropagated through only the logits and utility used.</p>
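      <p>The stacked per-character heads can be sketched as follows; the
shapes and module layout are assumptions for illustration, not the
released implementation.</p>
      <preformat>
# Minimal sketch: the actor emits one row of action logits per character and the
# critic emits one utility estimate per character; only the prompted character's
# row is turned into a distribution and sampled.
import torch
import torch.nn as nn
from torch.distributions import Categorical

n_characters, state_dim, n_actions = 2, 128, 10

actor = nn.Linear(state_dim, n_characters * n_actions)
critic = nn.Linear(state_dim, n_characters)

def act(state, character_idx):
    logits = actor(state).view(n_characters, n_actions)  # stack of per-character logits
    values = critic(state)                                # one utility per character
    dist = Categorical(logits=logits[character_idx])      # prompted character's distribution
    action = dist.sample()
    # The log-prob and value returned here are the only outputs the loss flows through.
    return action, dist.log_prob(action), values[character_idx]

state = torch.randn(state_dim)
print(act(state, character_idx=0))
      </preformat>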
    </sec>
    <sec id="sec-4b">
      <title>5. Thespian Agent Experiments</title>
      <p>In this section we evaluate the thespian agent without
the additional few-shot learning attention mechanism
to determine the extent to which the agent can learn
more than one character at a time. We train a single
agent to emulate two characters: thief and adventurer.</p>
      <p>We execute the agent in the same general environment
that has multiple opportunities for thief-specific actions
and adventurer-specific actions. The environment (see
Figure 5 in the Appendix) has a common starting room
and an exit room that terminates the game when the
agent enters it. There is a cluster of thief-specific and
adventurer-specific rewards near the starting room. The
environment then branches, with one branch heading to
areas that only contain thief-specific rewards and another
branch heading to areas that only contain adventurer-specific
actions.</p>
      <p>The thespian agent is trained as follows. We create
empty prompts for thief and adventurer. We train on one
character reward, accompanied by the character prompt,
for two games, then switch to the next character reward
and character prompt for two more games. A game
completes when the agent navigates to the exit room as
described in Section 3.2. We train for a total of 10,000 games
and use the checkpoint with the highest performance on
20 test game runs, split equally between each character.
We evaluate the agent in the same environment, executing
the agent with each character prompt one at a time.
We measure the percentage of total character-specific
action opportunities the agent takes. We run each
character prompt for 100 games with different initialization
seeds and take the average result.</p>
      <p>We compare to a baseline KG-A2C trained with the
same training method (but without the prompts, since the
base KG-A2C architecture would not understand them),
as well as the thespian agent with a prompt made of
random numbers.</p>
      <p>Table 1 shows the results. The base KG-A2C, when
trained only on thief rewards or adventurer rewards, is
able to achieve most of the character-specific score. The
base agent trained on one character rarely attempts to
perform actions specific to another character, which is to
be expected and demonstrates that the environment setting
is fair if the objective were to only train one character
at a time. However, when the base KG-A2C is trained
with both character rewards, the agent’s performance in
one character suffers. The resulting agent also attempts
to get all rewards, regardless of character, thus failing to
differentiate between characters.</p>
      <p>In comparison, the thespian agent uses a single model, and
that single model scores a high thief score when given the
thief prompt and a high adventurer score when given the
adventurer prompt. The thespian agent rarely attempts
actions that are specific to a non-prompted character.
Despite being trained on multiple character rewards, the
thespian agent achieves performance equivalent to the
base model trained on only one character. Figure 3 shows
the learning curve of the single thespian agent training
on both characters versus a single base KG-A2C training
on both characters using the same character rotation
scheme; KG-A2C gets trapped in a local maximum.</p>
      <p>When the thespian agent is given a random prompt, it
scores poorly as either character. There may be a bias in
the environment that leads the agent to prefer the branch
that contains more adventurer score, explaining why the
agent obtains more adventurer rewards.</p>
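      <p>The character rotation schedule used in this training protocol can
be summarized with a small sketch; the two-game rotation below follows
the description above, while the environment interaction and the A2C
update are abstracted away.</p>
      <preformat>
# Sketch of the rotation schedule: the trained character, its reward function,
# and its soft prompt all switch every two games.
characters = ["thief", "adventurer"]
GAMES_PER_CHARACTER = 2

def active_character(game_index):
    return characters[(game_index // GAMES_PER_CHARACTER) % len(characters)]

# The first eight games alternate: thief, thief, adventurer, adventurer, ...
print([active_character(g) for g in range(8)])

# Within each game, only the active character's logits and utility value would
# be used to compute the advantage-actor-critic loss (see Section 4.2).
      </preformat>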
    </sec>
    <sec id="sec-5">
      <title>6. Few-Shot Learning with</title>
    </sec>
    <sec id="sec-6">
      <title>Thespian Attention</title>
      <p>The thespian agent is a single agent that can be trained
to emulate many different characters by providing one
of the learned prompts as a cue for how to behave in an
open-ended fashion. In this section we consider the question
of whether a pre-trained thespian agent can learn
a new character that draws on knowledge of previously
learned characters.</p>
      <p>[Figure: learning curves, game score versus training episodes.]</p>
      <p>Continuing to train the core model on a new character
reward runs the risk that the agent forgets the previous
characters. We instead freeze the pre-trained thespian agent
and add a module (see Figure 2 right, yellow) with learnable
weights that operate on the original, frozen model’s outputs.
Since we seek to teach the agent a new character that is a
blend of existing characters, we apply an attention mechanism
over the frozen model’s per-character outputs.<sup>2</sup> Attention
modules of this kind can learn the influence of different parts
of the observation; the difference here is that the thespian
attention learns the optimal weights for blending the outputs
of previously learned characters.</p>
      <p><sup>2</sup>In place of the embedded token sequence, we use the embedded
observation tensors, but do not perform a maxpool over the embedded
observations as they are much smaller than the token sequences
used in Peng et al.’s model ensemble.</p>
      <p>The traditional actor-critic loss is computed as the
difference between the agent’s predicted value of an action and
the true expected value. However, the thespian agent
produces a real-numbered utility value prediction for
each character. Rather than perform a weighted average
with the attention scores as we did for the action logits,
we take the average of the predicted values of the state
from the new character’s perspective and the predicted
value of the most influential pre-trained character. This
is the pre-existing character that the agent thinks has the
best chance of receiving reward even though the reward
function is for a new character. Thus loss is a function
of how much better the thespian attention can pick an
action for the new character over the best chance if it
had to play a pre-existing character.</p>
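      <p>A sketch of this blending is given below, under the following
assumptions: attention scores over the frozen per-character raw
action logits produce the blended logits for the new character, and
the value used in the loss averages the new character’s predicted
value with that of the most influential pre-trained character. The
module structure and dimensions are illustrative.</p>
      <preformat>
# Illustrative sketch of the thespian attention. Only this module (and the new
# character's prompt) would be trainable; the per-character logits and values
# come from the frozen core agent.
import torch
import torch.nn as nn

n_pretrained, n_actions = 2, 10          # e.g. thief and adventurer

class ThespianAttention(nn.Module):
    def __init__(self, state_dim=128):
        super().__init__()
        # Feed-forward network producing one attention score per pre-trained character.
        self.scores = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_pretrained))

    def forward(self, state, frozen_logits, frozen_values, new_value):
        attn = torch.softmax(self.scores(state), dim=-1)   # weight per character
        blended_logits = attn @ frozen_logits               # weighted average of raw logits
        most_influential = attn.argmax()                    # most attended pre-trained character
        # Loss value: average of the new character's prediction and the most
        # influential pre-trained character's prediction.
        value = 0.5 * (new_value + frozen_values[most_influential])
        return blended_logits, value

attention = ThespianAttention()
state = torch.randn(128)
frozen_logits = torch.randn(n_pretrained, n_actions)   # from the frozen thespian agent
frozen_values = torch.randn(n_pretrained)
blended, value = attention(state, frozen_logits, frozen_values, new_value=torch.tensor(0.3))
print(blended.shape, value)
      </preformat>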
      <p>The thespian agent can now be trained as before, by
providing a new character reward and an empty prompt.</p>
      <p>With the core thespian agent weights frozen, the agent
will retain the ability to respond to existing character
prompts. The thespian agent will learn new weights
in the feed-forward networks that combine the existing
characters’ action logits. We no longer need to specify
which set of character action logits to sample from. It
will also learn a new prompt for the new character.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Few-shot Experiments</title>
      <p>7.1. Baselines
We compare two agents:
• Thespian attention agent: a pre-trained
thespian agent with frozen weights and the few-shot
attention mechanism.
• Unfrozen thespian agent: the same pre-trained
thespian agent but with unfrozen weights and no
attention mechanism.</p>
      <p>Both agents are trained on a new “Rogue” reward,
which rewards the agent for a subset of thief-specific
and adventurer-specific actions.</p>
      <p>For the thespian attention agent, we measure the
total “rogue” game score after each step. For the unfrozen
agent, we measure the total “rogue” game score as well
as just the thief score and just the adventurer score.</p>
      <p>Whereas the thespian attention agent is frozen and
cannot lose its ability to emulate a thief or adventurer
(character prompt and internal weights are unchanged), the
unfrozen agent may lose its ability to emulate the thief
and adventurer as it trains on the “rogue” reward.</p>
      <p>7.2. Results
The thespian attention uses far fewer parameters than
the core agent. Therefore we test the ability to train the
thespian attention module to learn a new character in
fewer training steps versus training from scratch. Given a
frozen thespian agent pre-trained to respond to the thief
and adventurer prompts, we train a new character—a
“Rogue”—that excels at both thieving and adventuring. To
demonstrate few-shot learning, we limit the total training
steps to 3,000.</p>
      <p>We created three variations of the environment, differing in
which character-specific rewards the agent encounters first (for
example, a thief-first map).</p>
      <p>[Figure: “rogue” game score over training steps for the thespian
attention agent and the unfrozen agent (its thief, adventurer, and
rogue scores), with the thief, adventurer, and rogue maximum scores
marked.]</p>
      <p>The unfrozen agent must also continue to be able
to emulate the plain thief and plain adventurer. The
unfrozen agent can be trained using a rotation of games
for all three characters. When this is done it takes in
excess of 40,000 steps before it converges on a model that
can play all three characters.</p>
      <p>The reason the thespian attention agent does not do
as well on the thief-first map as the others is because of
bias introduced in the pre-training. Because the training
regimen alternates characters, it trains on the “thief” character
last. This makes the thespian agent slightly overfit
to the thief character (relative to the adventurer). While
this might seem like it would give it an advantage on the
thief-first map, it means that it takes longer to encounter
non-thief “rogue” rewards; the encounter of early thief
rewards reinforces this by placing more attention weight
on thief action logits. We see a similar behavior when
we allow the thespian agent to complete an additional
round of training on the “adventurer” character.</p>
    </sec>
    <sec id="sec-7b">
      <title>8. Ablation Studies</title>
      <p>We investigate three alternative ways to incorporate attention
into the thespian agent:
• Attention over a direct weighted average of character prompts.
• Attention over a weighted average of the soft character prompt
plus state observation.
• Attention over action probabilities vs. raw logits.</p>
      <p>The first two, which focused on attention over the soft
character prompts in various ways, resulted in agents
that failed to learn a new character. The agent would
choose actions that went with the most attended prompt
and would never achieve blending. This is because the
attention layer would just act as a scalar on the inputs.
The third alternative would have used a softmax layer
to convert action logits to a probability distribution before
being fed into the attention mechanism. In all cases,
this variation was inferior to operating on raw logits. The
softmax conversion of raw logits to a probability distribution
smooths the values, making it harder to discriminate
between actions. Manipulating the logits allows for the
biases of the individual character prompts to be more
faithfully preserved.</p>
    </sec>
    <sec id="sec-7c">
      <title>9. Conclusions</title>
      <p>We make the distinction between character agents and
actor agents. A character agent learns a model of a single
character. An actor, or thespian, agent learns a model of
multiple characters and can take direction through a soft
prompt about which character to emulate. Our formulation
of a thespian agent is further able to reason about
which actions would be appropriate to each character.
The production of different action logit scores for different
characters allows us to add an additional attention
mechanism that learns new characters that remix previously
known characters in a few-shot fashion. This is
shown by training a new character that can take on the
behavioral characteristics of previously known characters
to respond to new circumstances in the environment.</p>
      <p>In the context of text role-playing games, a grand
challenge for AI [24], this work presents a step toward
open-ended agents with disentanglable behavior policies.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Appendix</title>
      <p>A.1. LIGHT Map
Figure 5 [game map]: the handcrafted map used in our experiments.
Rooms include a graveyard (NPCs: graveyard keeper, ghost; objects:
family heirlooms), a Dungeon (NPCs: living skeleton), a Lower Dungeon
(NPCs: huge dragon, captured knight), a Forest Entrance (NPCs: wolf,
snake; objects: trees), a Fishing Store (NPCs: fisherman; objects:
pearl), a Marketplace (NPCs: some customer, some man; objects: the
finest wines, skinning knife), and a Tavern (NPCs: some individual,
old homeless man; objects: hidden dagger, grimy stools, beer keg).</p>
      <p>A.2. Training Details
While most other hyperparameters are kept the same, we
increase the learning rate while decreasing the value loss
for the thespian attention. Because the new prompt and
the attention module have a comparably small number
of trainable parameters, we also train over a much smaller
number of steps to emulate few-shot training. Whereas the
thespian agent is allowed to train to completion over
10,000 games, we constrain the thespian attention to only
3,000 steps, which for a well-performing agent could be
roughly 150 games but could also be as few as 40 games
for a poorly performing agent, depending on the number
of steps the agent takes within a game. While we found
that a higher learning rate hinders the thespian agent,
for the thespian attention the higher learning rate benefited
the agent because the agent has already learned
and is constrained to a smaller, more optimal set of
actions.</p>
      <p>We also lower the coefficient of the value loss and change
how the value is calculated. As the critic is frozen, we
know it will always output the wrong reward value for
any “Adventurer” or “Thief” action that isn’t included in
the new character. This results in a large amount of
unnecessary loss that throws off the thespian attention
agent during training. However, the value loss cannot
be removed completely, as it comprises the vast majority
of the loss due to the pre-training of the thespian agent
prior to the thespian attention.</p>
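      <p>As a purely hypothetical summary of these changes (the field names
and values below are illustrative, not the settings actually used):</p>
      <preformat>
# Hypothetical configuration sketch contrasting the two training phases.
thespian_agent_cfg = dict(
    learning_rate=1e-3,        # illustrative value
    value_loss_coef=0.5,       # illustrative value
    budget_games=10_000,       # core agent trained to completion
)

thespian_attention_cfg = dict(
    learning_rate=3e-3,        # higher learning rate suits the small trainable module
    value_loss_coef=0.1,       # down-weight the partly mismatched (frozen-critic) value loss
    budget_steps=3_000,        # few-shot budget: roughly 40 to 150 games
)
      </preformat>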
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref22"><mixed-citation>[22] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, PMLR, 2016.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016).</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] C. Callison-Burch, G. Singh Tomar, L. J. Martin, D. Ippolito, S. Bailis, D. Reitter, Dungeons and Dragons as a Dialogue Challenge for Artificial Intelligence, in: Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL, Abu Dhabi, UAE, 2022. arXiv:2210.07109.</mixed-citation></ref>
    </ref-list>
  </back>
</article>