<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-Character Text Role-Playing Game Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christopher Cui</string-name>
          <email>ccui46@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangyu Peng</string-name>
          <email>xpeng62@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Riedl</string-name>
          <email>riedl@cc.gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Reinforcement Learning, Text Role Playing Games, Open-endedness, Few-shot learning</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AIIDE Workshop on Experimental Artificial Intelligence in Games</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>Atlanta, GA, 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Text-adventure games and text role-playing games are grand challenges for reinforcement learning game playing agents. Text role-playing games are open-ended environments where an agent must faithfully play a particular character. We consider the distinction between characters and actors, where an actor agent has the ability to play multiple characters. We present a framework we call a thespian agent that can learn to emulate multiple characters along with a soft prompt that can be used to direct it as to which character to play at any time. We further describe an attention mechanism that allows the agent to learn new characters that are based on previously learned characters in a few-shot fashion. We show that our agent outperforms the state-of-the-art agent framework in multi-character learning and few-shot learning.</p>
      </abstract>
      <kwd-group>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Text Role Playing Games</kwd>
        <kwd>Open-endedness</kwd>
        <kwd>Few-shot learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Text adventure games are those in which a player can
only interact with an interactive environment through
reading text descriptions of the environment and acting
by typing descriptions of actions. Text games present
a grand challenge for AI because they (a) are partially
observable; (b) have combinatorially large state spaces
consisting of all possible descriptive text strings; (c) have
combinatorially large action spaces in the order of
billions of possible text commands; (d) require reasoning
about long-horizon causal dependencies; and (e) require
commonsense and narrative trope reasoning [1]. Text
adventure game playing has become a benchmark challenge
for reinforcement learning (RL) agents [1, 2, 3, 4, 5, 6],
which play by exploring the environment and receiving
a score based on how far they make it through the game.</p>
      <p>Relatedly, table-top role playing games, such as
Dungeons &amp; Dragons, involve multiple players that interact
with textual descriptions of the environment as well as
dialogue with other players. While players may be
motivated by a quest or mission, table-top role playing games
are fundamentally open-ended, meaning that players can
interact with the environment and with each other in
ways that are not strictly dictated by a quest, mission,
or set of puzzles. Open-ended role-playing extends the
same challenges of text adventure games but removes the
environmentally-dictated reward structure. The predominant
question for open-ended role-playing is whether an
agent acts consistently with a given character definition.</p>
      <p>Because there may be no explicit reward associated
with progression in open-ended role playing games, an
agent must instead be trained to, at least, emulate
particular character types such as a “thief” or an “adventurer”,
each of which has different preferences for different
actions depending on the situation.<sup>1</sup></p>
      <p><sup>1</sup>This is a simplification of table-top role-playing games that can
also feature distinct character personalities and back-stories.</p>
      <sec id="sec-2-2">
        <title>Workshop Proceedings (CEUR-WS.org)</title>
        <p>environmentally-dictated reward structure. The predom- produces a probability distribution across all actions for each</p>
        <sec id="sec-2-2-1">
          <title>Dungeon</title>
          <p>...</p>
          <p>You are in the Dungeon. It is dark and gloomy...
There is a skeleton here.</p>
          <p>There is a path south, a path east, and a path west.
You have a sword, a shield, and a tattered map.
Previous action: Go west</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Prompt:</title>
          <p>You are an adventurer
pickpocket skeleton</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>Thief</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>Adventurer Rogue</title>
          <p>As an adventurer, the best action
for me to take is hit skeleton
You hit the skeleton! The skeleton died!
of diferent characters by being provided a prompt that
indicates which character it should emulate at the time. The agent
prompt. Then, for the current character, the corresponding
distribution is then sampled to produce the chosen action.
tions depending on the situation.1
agent must instead be trained to, at least, emulate
particular character types such as “thief” or an “adventurer”,
each of which has diferent preferences for diferent
ac1This is a simplification of table-top role-playing games that can also</p>
      <p>In this paper we consider the distinction between character
agents and actor agents. A character agent is trained
to act like one specific character; for all intents and
purposes it is that character and knows nothing else but
how to be that character. In contrast, an actor agent has
knowledge about how to play many different characters
and can receive instruction from an external source (for
example a movie director or a dungeon master) about
which character type to play. Furthermore, an actor can
leverage the character knowledge to learn to blend characters
with only a small amount of additional practice
(e.g., few-shot learning) without exhaustively re-training
from scratch. We will refer to actor agents as thespian
agents to distinguish agents that learn to enact multiple
characters from the actor-critic reinforcement learning
architecture.</p>
      <p>This paper considers two challenges. The first is to
train a single reinforcement learning agent model that
can switch between character types with a simple
instruction. We present a new RL agent that can learn
to emulate multiple characters simultaneously with an
updated policy model that generates |C| sets of action
distributions, where C is a set of character classes. The
agent also learns a soft prompt that can later be provided
as a cue to emulate a specific character.</p>
      <p>The second challenge is to be able to train a thespian
agent to learn new characters in a fraction of the training
time while maintaining performance in the previously
trained characters. We achieve this by adding an attention
mechanism to the outputs of the thespian agent,
which can learn how to blend the action probabilities of
different characters, thus learning a new character and a
new soft prompt.</p>
      <p>To return to our character vs. actor metaphor, we now
have a thespian model that can simultaneously generate
different actions for different characters. This is equivalent
to a thespian thinking about how different characters
will respond to the same situation. The thespian agent
receives direction in the form of a prompt indicating what
character to play. If the thespian needs to play a new
character that it has never played before, it can learn a
new prompt for the new character much faster than if it
had to learn from scratch, by leveraging what it already
knows about playing other characters.</p>
      <p>We conduct experiments across two original character
types, a “thief” and an “adventurer”, and demonstrate
the ability of a single thespian agent trained on both
characters to perform as well as separate baseline models
trained to emulate individual characters. We show
that we can use a novel attention mechanism to learn a
third character that is a blend of the previously trained
characters in a few-shot fashion. This few-shot character
learning is 10x faster than baseline alternatives and
doesn’t degrade the performance of the original characters.</p>
    </sec>
    <sec id="sec-2-2">
      <title>2. Background and Related Work</title>
      <p>The distinction between characters and actors has been
made before. Louchart and Aylett [7] consider an actor
agent one that makes a secondary assessment of its own
cognitive and emotional state. Riedl [8] considers an actor
agent one that doesn’t just reason about the best action
to convey a character but also incorporates directorial
goals. Si et al. [9] consider an actor agent one that reasons
about the cognitive state of other interlocutors in an
interactive game; they also referred to their agent as a
thespian. These prior works looked at acting as
meta-cognition, but agents could not represent more than one
character without retraining or reprogramming. While
our work can also be considered a form of meta-cognition,
our focus is on a single model trained to be able to reason
about and enact different characters.</p>
      <sec id="sec-2-2-1">
        <title>2.1. Text Adventure Game Playing Agents</title>
        <p>Text adventures are games in which the player must read
textual descriptions of the environment and describe their
actions with short text commands. Most text adventure
games have a narrative progression through puzzles toward
an ultimate goal or conclusion. Text-based games
have shown great potential for use as reinforcement
learning benchmark environments [1, 2]. Ammanabrolu
and Riedl [3] proposed augmenting reinforcement learning
with knowledge graphs as external memory about
world state. Ammanabrolu and Hausknecht [4] proposed
KG-A2C, which integrates knowledge graphs into the
actor-critic [10] RL framework. The Q*BERT agent [5]
further extended KG-A2C to incorporate the BERT [11]
language model into the model architecture. We build
on top of the KG-A2C family of models since they have
shown state-of-the-art performance. Other techniques
for playing text games include GATA [6], which builds
a knowledge-graph based representation of the world
on top of a transformer-based agent, training through a
combination of RL and self-supervised learning.</p>
      </sec>
      <sec id="sec-2-2-2">
        <title>2.2. Text-based Role Playing Agents</title>
        <p>Whereas text adventure games have pre-defined progression
toward a goal state, table-top role playing games
involve open-ended game play. We refer to text-based
environments that support open-ended game play as
text-based role playing to signify the interaction with the
environment through reading and writing text instead of
verbal interactions with other players and game masters.
The LIGHT environment [12] is a crowdsourced text-based
role playing game with a rich environment with
interactable NPCs, objects and locations, each with a
short paragraph description, demonstrating the value
of grounding in training agents that can not only act
but also converse successfully. Ammanabrolu et al. [13]
propose agents that can switch seamlessly between generating
natural language and action declarations. These
agents can learn to play different characters when given
a motivation that includes character type and goal as part
of the input world state. This work is most similar to ours,
except our agents do not require explicit motivations or
goals beyond a learned character prompt.</p>
        <p>Story Shaping [14] is a technique for training RL agents
to play text role-playing games wherein a story is converted
into a rich reward signal. The technique can be
used to train different characters, but can only train a
single agent to emulate a single character. Our character-based
reward strategy is related, but our rewards are
manually crafted instead of inferred from stories.</p>
      </sec>
      <sec id="sec-2-2-3">
        <title>2.3. Few-Shot Adaptation</title>
        <p>Large pre-trained language models have emerged as
extremely powerful tools for NLP tasks [15, 16, 17]. However,
a limitation of these powerful models is their size,
some with parameters numbering in the billions [17].
This makes them prohibitively expensive when it comes
to further training or fine-tuning. Low-Rank Adaptation
(LoRA) circumvents this by keeping the model frozen and
introducing trainable rank decomposition matrices. Our
proposed technique also freezes the core model and trains
additional layers on top, though the specific mechanics
needed for reinforcement learning are different.</p>
        <p>Prompt-tuning also avoids the need to do further training
on the model itself by introducing trainable, soft
prompts that learn an ideal input based on the desired
output [18]. [19] proposes pairing soft prompts with
an attention module to induce language models to perform
different tasks. Using knowledge from a previously
trained task to improve learning on a new task has also
been explored by [20], though their approach is more
focused on generalization across simpler objectives and
adaptation to unseen environments.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminaries</title>
      <p>3.1. Textworlds as RL Testbeds
A text-adventure or text-based role playing game can be
modeled as a partially-observable Markov decision
process (POMDP) M = ⟨S, T, A, Ω, O, R, γ⟩, where S is the set
of ground truth world states, A is the set of actions, T is
the probability of transitioning from one state to another
given an executed action, R is a reward function, Ω is
the set of possible observations, O is the probability of
observations given the ground truth world state, and γ
is a parameter estimating the reward horizon [1]. In our
setting, we will use a deterministic transition function T,
which is common in text-based games. However, nothing
in our proposed technique strictly requires it. The
objective of reinforcement learning is to learn a policy
π : S → A that maps states to actions, such that taking
the action mapped to the current state and following the
policy henceforth maximizes expected reward.</p>
      <p>3.2. LIGHT
Our agent is trained in the LIGHT environment [12], a
text world environment with a database of 1775 Non-Player
Characters (NPCs), 663 locations, and 3462 objects
with rich text descriptions. Game maps can also
be handcrafted with specifically placed NPCs, locations
and objects. We create a map for our experiments such
that multiple character types can have relevant activities
to perform, including interacting with objects and
NPCs. For example there are dragons for an “adventurer”
character to slay, and armor to don, whereas a “thief”
character can take money from the donations receptacle
in a sanctuary.</p>
      <p>Our experiments use base character types of “Thief”
and “Adventurer”. We also associate rewards with different
actions for each character type. For example, a “Thief”
character agent is rewarded for obtaining a hidden dagger,
stealing, and other thief-like actions. Likewise, an
“Adventurer” character agent is rewarded for obtaining
a sword and armor from the armory and killing monsters,
and other adventurer-like actions. There is no
requirement that an agent do particular actions and no
prescribed order. This is equivalent to the Story Shaping
technique [14], except the rewards are manual, which is
done to make more controlled experiments. Regardless
of character type, all games terminate when the agent
enters a particular, preset “goal room”, at which time
the agent receives a final reward that is smaller than the
others. The entire game map is provided in the appendix.</p>
      <p>3.3. KG-A2C
We build off of the KG-A2C agent framework [4], an
advantage actor-critic architecture augmented with a
knowledge-graph based attention. KG-A2C’s space of
observations includes (a) a text description of the room
the agent is in via the “look” command, (b) a text description
of the character’s inventory via the “inventory”
command, (c) the agent’s last command, and (d) feedback
from the last command. The state observations are
concatenated and embedded using a recurrent GRU.</p>
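      <p>As a rough illustration only (a minimal sketch, not the KG-A2C
implementation), the four observation strings can be tokenized,
concatenated, and embedded with a GRU; the tokenization, vocabulary
size, and dimensions below are assumptions.</p>
      <preformat>
# Minimal sketch: encode the four KG-A2C observation strings with a shared
# embedding and a GRU. All hyperparameters here are illustrative.
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, room_ids, inventory_ids, last_cmd_ids, feedback_ids):
        # Concatenate the four token-id sequences into one observation sequence.
        obs_ids = torch.cat([room_ids, inventory_ids, last_cmd_ids, feedback_ids], dim=1)
        _, h = self.gru(self.embed(obs_ids))   # h has shape (1, batch, hidden_dim)
        return h.squeeze(0)                    # one embedding vector per observation

# Toy usage with dummy token ids standing in for the tokenized text.
enc = ObservationEncoder()
dummy = lambda n: torch.randint(0, 1000, (1, n))
o = enc(dummy(20), dummy(8), dummy(4), dummy(6))
print(o.shape)  # torch.Size([1, 128])
      </preformat>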
      <p>Simultaneously, the state observation is used to
update a knowledge graph of facts about the world that
have been observed to date. This includes facts and relations
about rooms, objects in rooms, inventory items, etc.</p>
      <p>Figure 2: Complete thespian agent architecture. [The diagram shows
the embedded inputs (room description, inventory, game feedback,
previous action, and knowledge graph) combined with the character
prompts (thief prompt, adventurer prompt, other pre-trained prompts,
and a learnable prompt) feeding the RL agent; feedforward networks
produce per-character action logits, object logits, and values, which
are softmaxed and sampled, and per-character attention scores are used
to select the most influential persona.] The agent produces one set of
outputs per character for each observation. When training the Thespian
Attention, the blue-shaded boxes indicate frozen modules with
red-shaded boxes being trainable modules.</p>
      <p>This knowledge graph is then embedded using a graph
attention mechanism [21].</p>
      <p>Advantage actor-critic networks [22] have two heads.
The actor head generates logit scores, one for each
possible action, which can be converted to a probability
distribution via softmax and sampled to determine which
action the agent takes. The critic head estimates the utility
of the state. Actions are made up of verbs and optional
object names. The KG-A2C agent generates a verb, which
maps to a pre-defined template, and the generated object
name is used to populate the template.</p>
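      <p>A sketch of this template-plus-object action decoding is shown
below; it is an illustrative stand-in rather than the KG-A2C source,
and the template and object vocabularies are invented for the example.</p>
      <preformat>
# Illustrative sketch: a verb/template head and an object head over a shared
# state vector, plus a critic head. The sampled template is filled with the
# sampled object name to form the text command.
import torch
import torch.nn as nn
from torch.distributions import Categorical

TEMPLATES = ["go {}", "take {}", "hit {}", "look"]      # invented examples
OBJECTS = ["north", "sword", "skeleton", "map"]          # invented examples

class ActorCritic(nn.Module):
    def __init__(self, state_dim=128):
        super().__init__()
        self.template_head = nn.Linear(state_dim, len(TEMPLATES))
        self.object_head = nn.Linear(state_dim, len(OBJECTS))
        self.critic_head = nn.Linear(state_dim, 1)

    def act(self, state):
        t = Categorical(logits=self.template_head(state)).sample().item()
        o = Categorical(logits=self.object_head(state)).sample().item()
        value = self.critic_head(state)                  # estimated state utility
        template = TEMPLATES[t]
        command = template.format(OBJECTS[o]) if "{}" in template else template
        return command, value

state = torch.randn(128)
print(ActorCritic().act(state))   # e.g. ("hit skeleton", tensor([...]))
      </preformat>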
    </sec>
    <sec id="sec-4">
      <title>4. The Thespian Agent</title>
      <p>Building off the basic framework of KG-A2C, we describe
how a single agent policy model can learn to emulate
multiple characters. To train a single model to emulate
different characters, it must be rewarded differently for
each character, which can confuse an agent unless it
has a way of disentangling the characters. Our thespian
agent architecture addresses this challenge in two ways.
First, we provide a means to learn soft character prompts.
These are unique codes that are associated with different
characters and can be provided as input to indicate which
character the agent should emulate. Second, we change
the actor and critic heads to generate sets of logit scores
for all learned characters. Thus the agent can reason
about which actions are best for each character, and we
can sample from the set of logits for whichever character
we want to execute. Figure 2 (left, green box) shows the
thespian agent, focusing on these two aspects.</p>
      <p>4.1. Character Prompts
First, we allow for a soft character prompt to be learned.
Each prompt is associated with a different character the
model has been trained to emulate and induces the agent
to generate behavior that is consistent with the associated
character. This is similar to the notion of the soft
prompt [18], which is like a regular prompt for LLMs but
given as an embedding instead of natural language. The
soft character prompt vector of values can be interpreted
as an instruction analogous to saying “I am in state x
and I am a Thief. My next action would be...” at the
embedding level.</p>
      <p>Let P = [p_1, ..., p_n] be a set of soft character prompts,
one for each character c_i ∈ C, and let o be the embedded
current state observation. Initially, the prompts p_i are
empty, i.e., initialized with random numbers. The internal
state representation s_i for character c_i is:
s_i = W × cat(o, p_i)     (1)
where W is a set of trainable weights.</p>
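      <p>A minimal sketch of Equation 1, assuming a single linear map W and
randomly initialized prompts (the dimensions are illustrative):</p>
      <preformat>
# Sketch of Equation 1: each character has a learnable soft prompt p_i, and the
# internal state for character i is W applied to cat(o, p_i).
import torch
import torch.nn as nn

n_characters, obs_dim, prompt_dim, state_dim = 2, 128, 32, 128

# Randomly initialized ("empty") soft prompts, one row per character.
prompts = nn.Parameter(torch.randn(n_characters, prompt_dim))
W = nn.Linear(obs_dim + prompt_dim, state_dim, bias=False)   # trainable weights

def internal_state(o, character_idx):
    # Equation 1: s_i = W x cat(o, p_i)
    return W(torch.cat([o, prompts[character_idx]], dim=-1))

o = torch.randn(obs_dim)                  # embedded observation
s_thief = internal_state(o, 0)            # condition on the "thief" prompt
s_adventurer = internal_state(o, 1)       # condition on the "adventurer" prompt
print(s_thief.shape, s_adventurer.shape)  # torch.Size([128]) torch.Size([128])
      </preformat>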
      <p>The soft character prompts are learned as follows.
During training, the agent will engage in reinforcement
learning games as normal. In each game, the agent will be
provided with a different reward function for each character.
That is, a thief will be rewarded for certain actions
and an adventurer will be rewarded for different actions.
The character, corresponding character reward function,
and character prompt are rotated each game to balance
the training of multiple characters. Over time, each soft
prompt is updated via gradient flow through W such that
each unique prompt is associated with a particular way
in which the agent is rewarded.</p>
      <p>4.2. Character-Specific Action Scores
We also modify the agent model’s actor and critic modules.
The standard A2C framework produces logit scores
for each action. This vector of logit scores is traditionally
converted to a probability distribution with a softmax
layer and sampled to determine which action the agent
takes. Our thespian agent model instead produces a stack
of action logit scores. A softmax over this stack of logits
produces |C| probability distributions, one for each of
the |C| characters. The critic head is likewise modified
to produce |C| predicted utility scores, one for each
character. Thus, the agent is simultaneously determining
which action is best for each character and how good the
current state is from the perspective of each character.</p>
      <p>At training time, the characters are rotated each game
and the i-th set of logit scores is sampled to determine
the agent’s action, and the i-th utility value is used to
compute character-specific advantage loss. The loss is
backpropagated through only the logits and utility used.</p>
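      <p>The stacked per-character heads can be sketched as follows; the
shapes and module layout are assumptions for illustration, not the
released implementation.</p>
      <preformat>
# Minimal sketch: the actor emits one row of action logits per character and the
# critic emits one utility estimate per character; only the prompted character's
# row is turned into a distribution and sampled.
import torch
import torch.nn as nn
from torch.distributions import Categorical

n_characters, state_dim, n_actions = 2, 128, 10

actor = nn.Linear(state_dim, n_characters * n_actions)
critic = nn.Linear(state_dim, n_characters)

def act(state, character_idx):
    logits = actor(state).view(n_characters, n_actions)  # stack of per-character logits
    values = critic(state)                                # one utility per character
    dist = Categorical(logits=logits[character_idx])      # prompted character's distribution
    action = dist.sample()
    # The log-prob and value returned here are the only outputs the loss flows through.
    return action, dist.log_prob(action), values[character_idx]

state = torch.randn(state_dim)
print(act(state, character_idx=0))
      </preformat>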
    </sec>
    <sec id="sec-4b">
      <title>5. Thespian Agent Experiments</title>
      <p>In this section we evaluate the thespian agent without
the additional few-shot learning attention mechanism
to determine the extent to which the agent can learn
more than one character at a time. We train a single
agent to emulate two characters: thief and adventurer.</p>
      <p>We execute the agent in the same general environment
that has multiple opportunities for thief-specific actions
and adventurer-specific actions. The environment (see
Figure 5 in the Appendix) has a common starting room
and an exit room that terminates the game when the
agent enters it. There is a cluster of thief-specific and
adventurer-specific rewards near the starting room. The
environment then branches, with one branch heading to
areas that only contain thief-specific rewards and another
branch heading to areas that only contain adventurer-specific
actions.</p>
      <p>The thespian agent is trained as follows. We create
empty prompts for thief and adventurer. We train on one
character reward, accompanied by the character prompt,
for two games, then switch to the next character reward
and character prompt for two more games. A game
completes when the agent navigates to the exit room as
described in Section 3.2. We train for a total of 10,000 games
and use the checkpoint with the highest performance on
20 test game runs, split equally between each character.
We evaluate the agent in the same environment, executing
the agent with each character prompt one at a time.
We measure the percentage of total character-specific
action opportunities the agent takes. We run each
character prompt for 100 games with different initialization
seeds and take the average result.</p>
      <p>We compare to a baseline KG-A2C trained with the
same training method (but without the prompts, since the
base KG-A2C architecture would not understand them),
as well as the thespian agent with a prompt made of
random numbers.</p>
      <p>Table 1 shows the results. The base KG-A2C, when
trained only on thief rewards or adventurer rewards, is
able to achieve most of the character-specific score. The
base agent trained on one character rarely attempts to
perform actions specific to another character, which is to
be expected and demonstrates that the environment setting
is fair if the objective were to only train one character
at a time. However, when the base KG-A2C is trained
with both character rewards, the agent’s performance in
one character suffers. The resulting agent also attempts
to get all rewards, regardless of character, thus failing to
differentiate between characters.</p>
      <p>In comparison, the thespian agent uses a single model, and
that single model scores a high thief score when given the
thief prompt and a high adventurer score when given the
adventurer prompt. The thespian agent rarely attempts
actions that are specific to a non-prompted character.
Despite being trained on multiple character rewards, the
thespian agent achieves performance equivalent to the
base model trained on only one character. Figure 3 shows
the learning curve of the single thespian agent training
on both characters versus a single base KG-A2C training
on both characters using the same character rotation
scheme; KG-A2C gets trapped in a local maximum.</p>
      <p>When the thespian agent is given a random prompt, it
scores poorly as either character. There may be a bias in
the environment that leads the agent to prefer the branch
that contains more adventurer score, explaining why the
agent obtains more adventurer rewards.</p>
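      <p>The character rotation schedule used in this training protocol can
be summarized with a small sketch; the two-game rotation below follows
the description above, while the environment interaction and the A2C
update are abstracted away.</p>
      <preformat>
# Sketch of the rotation schedule: the trained character, its reward function,
# and its soft prompt all switch every two games.
characters = ["thief", "adventurer"]
GAMES_PER_CHARACTER = 2

def active_character(game_index):
    return characters[(game_index // GAMES_PER_CHARACTER) % len(characters)]

# The first eight games alternate: thief, thief, adventurer, adventurer, ...
print([active_character(g) for g in range(8)])

# Within each game, only the active character's logits and utility value would
# be used to compute the advantage-actor-critic loss (see Section 4.2).
      </preformat>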
    </sec>
    <sec id="sec-5">
      <title>6. Few-Shot Learning with</title>
    </sec>
    <sec id="sec-6">
      <title>Thespian Attention</title>
      <p>The thespian agent is a single agent that can be trained
to emulate many different characters by providing one
of the learned prompts as a cue for how to behave in an
open-ended fashion. In this section we consider the question
of whether a pre-trained thespian agent can learn
a new character that draws on knowledge of previously
learned characters.</p>
      <p>[Figure: learning curves, game score versus training episodes.]</p>
      <p>Continuing to train the core model on a new character
reward runs the risk that the agent forgets the previous
characters. We instead freeze the pre-trained thespian agent
and add a module (see Figure 2 right, yellow) with learnable
weights that operate on the original, frozen model’s outputs.
Since we seek to teach the agent a new character that is a
blend of existing characters, we apply an attention mechanism
over the frozen model’s per-character outputs.<sup>2</sup> Attention
modules of this kind can learn the influence of different parts
of the observation; the difference here is that the thespian
attention learns the optimal weights for blending the outputs
of previously learned characters.</p>
      <p><sup>2</sup>In place of the embedded token sequence, we use the embedded
observation tensors, but do not perform a maxpool over the embedded
observations as they are much smaller than the token sequences
used in Peng et al.’s model ensemble.</p>
      <p>The traditional actor-critic loss is computed as the
difference between the agent’s predicted value of an action and
the true expected value. However, the thespian agent
produces a real-numbered utility value prediction for
each character. Rather than perform a weighted average
with the attention scores as we did for the action logits,
we take the average of the predicted values of the state
from the new character’s perspective and the predicted
value of the most influential pre-trained character. This
is the pre-existing character that the agent thinks has the
best chance of receiving reward even though the reward
function is for a new character. Thus loss is a function
of how much better the thespian attention can pick an
action for the new character over the best chance if it
had to play a pre-existing character.</p>
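      <p>A sketch of this blending is given below, under the following
assumptions: attention scores over the frozen per-character raw
action logits produce the blended logits for the new character, and
the value used in the loss averages the new character’s predicted
value with that of the most influential pre-trained character. The
module structure and dimensions are illustrative.</p>
      <preformat>
# Illustrative sketch of the thespian attention. Only this module (and the new
# character's prompt) would be trainable; the per-character logits and values
# come from the frozen core agent.
import torch
import torch.nn as nn

n_pretrained, n_actions = 2, 10          # e.g. thief and adventurer

class ThespianAttention(nn.Module):
    def __init__(self, state_dim=128):
        super().__init__()
        # Feed-forward network producing one attention score per pre-trained character.
        self.scores = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_pretrained))

    def forward(self, state, frozen_logits, frozen_values, new_value):
        attn = torch.softmax(self.scores(state), dim=-1)   # weight per character
        blended_logits = attn @ frozen_logits               # weighted average of raw logits
        most_influential = attn.argmax()                    # most attended pre-trained character
        # Loss value: average of the new character's prediction and the most
        # influential pre-trained character's prediction.
        value = 0.5 * (new_value + frozen_values[most_influential])
        return blended_logits, value

attention = ThespianAttention()
state = torch.randn(128)
frozen_logits = torch.randn(n_pretrained, n_actions)   # from the frozen thespian agent
frozen_values = torch.randn(n_pretrained)
blended, value = attention(state, frozen_logits, frozen_values, new_value=torch.tensor(0.3))
print(blended.shape, value)
      </preformat>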
      <p>The thespian agent can now be trained as before, by
providing a new character reward and an empty prompt.</p>
      <p>With the core thespian agent weights frozen, the agent
will retain the ability to respond to existing character
prompts. The thespian agent will learn new weights
in the feed-forward networks that combine the existing
characters’ action logits. We no longer need to specify
which set of character action logits to sample from. It
will also learn a new prompt for the new character.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Few-shot Experiments</title>
      <p>7.1. Baselines
We compare two agents:
• Thespian attention agent: a pre-trained
thespian agent with frozen weights and the few-shot
attention mechanism.
• Unfrozen thespian agent: the same pre-trained
thespian agent but with unfrozen weights and no
attention mechanism.</p>
      <p>Both agents are trained on a new “Rogue” reward,
which rewards the agent for a subset of thief-specific
and adventurer-specific actions.</p>
      <p>For the thespian attention agent, we measure the
total “rogue” game score after each step. For the unfrozen
agent, we measure the total “rogue” game score as well
as just the thief score and just the adventurer score.</p>
      <p>Whereas the thespian attention agent is frozen and
cannot lose its ability to emulate a thief or adventurer
(character prompt and internal weights are unchanged), the
unfrozen agent may lose its ability to emulate the thief
and adventurer as it trains on the “rogue” reward.</p>
      <p>7.2. Results
The thespian attention uses far fewer parameters than
the core agent. Therefore we test the ability to train the
thespian attention module to learn a new character in
fewer training steps versus training from scratch. Given a
frozen thespian agent pre-trained to respond to the thief
and adventurer prompts, we train a new character—a
“Rogue”—that excels at both thieving and adventuring. To
demonstrate few-shot learning, we limit the total training
steps to 3,000.</p>
      <p>We created three variations of the environment, differing in
which character-specific rewards the agent encounters first (for
example, a thief-first map).</p>
      <p>[Figure: “rogue” game score over training steps for the thespian
attention agent and the unfrozen agent (its thief, adventurer, and
rogue scores), with the thief, adventurer, and rogue maximum scores
marked.]</p>
      <p>The unfrozen agent must also continue to be able
to emulate the plain thief and plain adventurer. The
unfrozen agent can be trained using a rotation of games
for all three characters. When this is done it takes in
excess of 40,000 steps before it converges on a model that
can play all three characters.</p>
      <p>The reason the thespian attention agent does not do
as well on the thief-first map as the others is because of
bias introduced in the pre-training. Because the training
regimen alternates characters, it trains on the “thief” character
last. This makes the thespian agent slightly overfit
to the thief character (relative to the adventurer). While
this might seem like it would give it an advantage on the
thief-first map, it means that it takes longer to encounter
non-thief “rogue” rewards; the encounter of early thief
rewards reinforces this by placing more attention weight
on thief action logits. We see a similar behavior when
we allow the thespian agent to complete an additional
round of training on the “adventurer” character.</p>
    </sec>
    <sec id="sec-7b">
      <title>8. Ablation Studies</title>
      <p>We investigate three alternative ways to incorporate attention
into the thespian agent:
• Attention over a direct weighted average of character prompts.
• Attention over a weighted average of the soft character prompt
plus state observation.
• Attention over action probabilities vs. raw logits.</p>
      <p>The first two, which focused on attention over the soft
character prompts in various ways, resulted in agents
that failed to learn a new character. The agent would
choose actions that went with the most attended prompt
and would never achieve blending. This is because the
attention layer would just act as a scalar on the inputs.
The third alternative would have used a softmax layer
to convert action logits to a probability distribution before
being fed into the attention mechanism. In all cases,
this variation was inferior to operating on raw logits. The
softmax conversion of raw logits to a probability distribution
smooths the values, making it harder to discriminate
between actions. Manipulating the logits allows for the
biases of the individual character prompts to be more
faithfully preserved.</p>
    </sec>
    <sec id="sec-7c">
      <title>9. Conclusions</title>
      <p>We make the distinction between character agents and
actor agents. A character agent learns a model of a single
character. An actor, or thespian, agent learns a model of
multiple characters and can take direction through a soft
prompt about which character to emulate. Our formulation
of a thespian agent is further able to reason about
which actions would be appropriate to each character.
The production of different action logit scores for different
characters allows us to add an additional attention
mechanism that learns new characters that remix previously
known characters in a few-shot fashion. This is
shown by training a new character that can take on the
behavioral characteristics of previously known characters
to respond to new circumstances in the environment.</p>
      <p>In the context of text role-playing games, a grand
challenge for AI [24], this work presents a step toward
open-ended agents with disentanglable behavior policies.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Appendix</title>
      <p>A.1. LIGHT Map
Figure 5 [game map]: the handcrafted map used in our experiments.
Rooms include a graveyard (NPCs: graveyard keeper, ghost; objects:
family heirlooms), a Dungeon (NPCs: living skeleton), a Lower Dungeon
(NPCs: huge dragon, captured knight), a Forest Entrance (NPCs: wolf,
snake; objects: trees), a Fishing Store (NPCs: fisherman; objects:
pearl), a Marketplace (NPCs: some customer, some man; objects: the
finest wines, skinning knife), and a Tavern (NPCs: some individual,
old homeless man; objects: hidden dagger, grimy stools, beer keg).</p>
      <p>A.2. Training Details
While most other hyperparameters are kept the same, we
increase the learning rate while decreasing the value loss
for the thespian attention. Because the new prompt and
the attention module have a comparably small number
of trainable parameters, we also train over a much smaller
number of steps to emulate few-shot training. Whereas the
thespian agent is allowed to train to completion over
10,000 games, we constrain the thespian attention to only
3,000 steps, which for a well-performing agent could be
roughly 150 games but could also be as few as 40 games
for a poorly performing agent, depending on the number
of steps the agent takes within a game. While we found
that a higher learning rate hinders the thespian agent,
for the thespian attention the higher learning rate benefited
the agent because the agent has already learned
and is constrained to a smaller, more optimal set of
actions.</p>
      <p>We also lower the coefficient of the value loss and change
how the value is calculated. As the critic is frozen, we
know it will always output the wrong reward value for
any “Adventurer” or “Thief” action that isn’t included in
the new character. This results in a large amount of
unnecessary loss that throws off the thespian attention
agent during training. However, the value loss cannot
be removed completely, as it comprises the vast majority
of the loss due to the pre-training of the thespian agent
prior to the thespian attention.</p>
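      <p>As a purely hypothetical summary of these changes (the field names
and values below are illustrative, not the settings actually used):</p>
      <preformat>
# Hypothetical configuration sketch contrasting the two training phases.
thespian_agent_cfg = dict(
    learning_rate=1e-3,        # illustrative value
    value_loss_coef=0.5,       # illustrative value
    budget_games=10_000,       # core agent trained to completion
)

thespian_attention_cfg = dict(
    learning_rate=3e-3,        # higher learning rate suits the small trainable module
    value_loss_coef=0.1,       # down-weight the partly mismatched (frozen-critic) value loss
    budget_steps=3_000,        # few-shot budget: roughly 40 to 150 games
)
      </preformat>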
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref22"><mixed-citation>[22] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, PMLR, 2016.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016).</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] C. Callison-Burch, G. Singh Tomar, L. J. Martin, D. Ippolito, S. Bailis, D. Reitter, Dungeons and Dragons as a Dialogue Challenge for Artificial Intelligence, in: Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL, Abu Dhabi, UAE, 2022. arXiv:2210.07109.</mixed-citation></ref>
    </ref-list>
  </back>
</article>