        From Demonstrations and Knowledge Engineering to a DNN Agent in a
                        Modern Open-World Video Game
                             Igor Borovikov                                        Ahmad Beirami
                     EA Digital Platform – Data & AI                        EA Digital Platform – Data & AI
                       209 Redwood Shores Pkwy                                209 Redwood Shores Pkwy
                        Redwood City, CA 94065                                 Redwood City, CA 94065
                        iborovikov@ea.com                                       abeirami@ea.com


                              Abstract

   In video games, there is a high demand for non-player characters (NPCs) whose behavior is believable and human-like. The traditional hand-crafted AI driving NPCs is hard to scale up in modern open-world multiplayer games and often leads to the uncanny valley of robotic behavior. We discuss a novel approach to solving this problem based on imitation learning. We combine demonstrations, programmed rules, and bootstrapping in the game environment to train a Deep Neural Network (DNN) defining the NPC behavior. Unlike Reinforcement Learning (RL), where the objective is optimal performance, we aim to reproduce a human player's style from a few demonstrations. We embed the implicit knowledge of basic gameplay rules, which is hard to learn via self-play or infer from a few demonstrations but straightforward to capture with simple programmed logic. We build a composite model that interacts with the game to bootstrap the human demonstrations and provide sufficient training data for a more complex DNN model capturing stylized gameplay from the demonstrations and enhanced with rules. We show that the method is computationally fast and delivers promising results in a game production cycle.
Introduction and Problem Statement

The leading advances in the field of RL applied to playing computer games, e.g., (OpenAI Five 2018; Mnih et al. 2015; Vinyals et al. 2017; Harmer et al. 2018), strive to train an optimal artificial agent ("agent" for short) maximizing clearly defined rewards, while the game itself remains fixed for the foreseeable future. In contrast, during game development the objectives and the settings are entirely different. The agents can play a variety of roles with rewards that are hard to define formally, e.g., the goal of an agent exploring a game level differs from foraging, defeating all adversaries, or solving a puzzle. Also, the game environment frequently changes between game builds. In such settings, it is desirable to quickly train agents that can work as NPCs or as bots for automated testing and game balancing. Throwing computational resources and substantial engineering effort at training agents in such conditions is not practical and calls for different approaches.

The preferred solution would utilize only a few relatively short episodes played by the developers. The time allowed for augmenting these demonstrations by autoplay would be limited, especially if the game engine doesn't support a dramatic speedup. Thus, the solution has to be sample-efficient and train the agents offline. Using the frame buffer for training is problematic due to frequent changes in the game's look during development. On the contrary, the conceptual types of the game environment rarely change during its production. That allows engineering a substantial part of the prior knowledge about the game states into a low-dimensional feature vector replacing the frame buffer. The core gameplay also rarely changes, which allows describing the desired behavior as a compact model depending on core game features.

In this position paper, we train a model of an NPC with stylistic traits provided by demonstrations. We include the aspects of behavior that are hard to infer from demonstrations as engineered rules. We use low-dimensional engineered features to provide information similar to what is available in organic gameplay. Finally, we train a composite DNN model, which we show to be effective while inexpensive.

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Proposed Approach

The methodology we explore assumes modest domain knowledge and an intuitive understanding of the game mechanics of a First Person Shooter (FPS). Such a level of understanding is usually widely available to the game designers and software engineers working on the game. We aim at a high level of abstraction to avoid an overly elaborate or game-specific formalization of that knowledge. With that in mind, the three components we engineer are:

• state space (features),
• action space,
• rules capturing implicit human knowledge.

We complement the rules with explicit human demonstrations to build a Markov ensemble, which provides basic generalization. We use this aggregate model to drive an agent in the game to automate the generation of additional bootstrap data. The bootstrapped data set allows us to train a composite DNN model combining the demonstrations and the engineered rules. As a case study, we explore an FPS game,
which is conceptually similar to the one investigated by (Harmer et al. 2018). Its core mechanics are generic enough to make our approach applicable to other games in the same or a similar genre.

State Space (Features)

We instrument the game and expose its current state to the agent as a low-dimensional normalized vector s = (c_1, ..., c_n, d_1, ..., d_m). Its components, c and d, describe n continuous and m discrete features correspondingly. In an FPS game, the continuous features could include the distance to the adversary, the angle between the target Line of Sight (LoS) and the player orientation, velocity components, ammo, and health. We map all continuous components to the range |c_i| ∈ [0, 1]. A mapping function for unbounded components like distance could be arctan or any similarly behaving smooth function. For variables with a naturally defined range, like ammo or health, we use linear normalization. The discrete components d may include binary values like the presence of an LoS or sprinting on-off, and one-hot encoded values of non-binary attributes, such as the current animation type. By construction, all d_i ∈ {0, 1}.

The total number of features for a single adversary in our proof-of-concept implementation is quite small: n ∼ 20 and m ∼ 20, depending on the experiment. In particular, we explore only on-foot action and remain in the same game modality, e.g., we exclude getting into and driving vehicles. The locations and objects of interest, like cover spots, medical packs and ammo clips, can appear in the state vector in different forms, but we leave their detailed description to future, more detailed publications, where we intend to explore more complex gameplay. Finally, we expose only the features that are observable in organic gameplay to achieve human-like gameplay.
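As an illustration (not the production implementation), one possible encoding of such a state vector is sketched below; the field names, ranges, and the observation accessor are hypothetical assumptions for the example.

```python
import math

# Illustrative ranges for bounded features (assumed values, not from the paper).
MAX_HEALTH = 100.0
MAX_AMMO = 30.0
NUM_ANIM_TYPES = 8  # hypothetical number of animation types

def squash_unbounded(x):
    """Map an unbounded non-negative value (e.g., distance) into [0, 1)."""
    return (2.0 / math.pi) * math.atan(x)

def one_hot(index, size):
    """One-hot encode a discrete value with `size` possible categories."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def encode_state(obs):
    """Build the normalized feature vector s = (c_1..c_n, d_1..d_m)
    from a raw observation dictionary exposed by the instrumented game."""
    continuous = [
        squash_unbounded(obs["distance_to_adversary"]),  # unbounded -> arctan
        obs["angle_to_los"] / math.pi,                   # angle in [-pi, pi] -> [-1, 1]
        squash_unbounded(abs(obs["speed"])),
        obs["ammo"] / MAX_AMMO,                          # bounded -> linear scaling
        obs["health"] / MAX_HEALTH,
    ]
    discrete = [
        1.0 if obs["has_line_of_sight"] else 0.0,
        1.0 if obs["is_sprinting"] else 0.0,
    ]
    discrete += one_hot(obs["animation_type"], NUM_ANIM_TYPES)
    return continuous + discrete
```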
Action Space

The agent's actions naturally map to the game controller input encoded with a few variables. Actions a_1, ..., a_k comprise k continuous and discrete values. There are six analog inputs from the controller: two sticks normalized to [−1, 1] and two triggers normalized to [0, 1] by design. There are also a few binary inputs from buttons. While all possible combinations of the inputs may look intractable for learning, as stated in (Harmer et al. 2018), our goal of imitating a human player doesn't require covering all possible combinations of inputs. Instead, we extract only those combinations that occur in organic gameplay and encode them as one-hot vectors, which drastically reduces the action space complexity. The resulting dimensionality of the action space in our experiments also turned out to be in the low two digits, k ∼ 15, depending on the experiment.

The proposed approach to the game state-action space serves two purposes: it keeps the action space dimensionality under control and eliminates actions that humans never take. Thus, the model we train is "fair" (i.e., it does not use more inputs at the same time than human anatomy allows).
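A minimal sketch of this encoding is shown below, assuming demonstration frames that carry stick, trigger, and button states under hypothetical field names: the one-hot vocabulary contains only the button combinations actually observed in the demonstrations.

```python
from collections import OrderedDict

def build_action_vocabulary(demo_frames):
    """Collect the distinct button combinations seen in demonstrations.
    Each frame is assumed to carry a tuple of binary button states."""
    vocab = OrderedDict()
    for frame in demo_frames:
        combo = tuple(frame["buttons"])     # e.g., (fire, jump, sprint, crouch)
        if combo not in vocab:
            vocab[combo] = len(vocab)       # assign the next one-hot index
    return vocab

def encode_action(frame, vocab):
    """Encode a frame's controller input: the analog sticks (x, y each) and
    triggers stay continuous, the button combination becomes a one-hot vector."""
    analog = list(frame["sticks"]) + list(frame["triggers"])  # 4 + 2 values
    combo_one_hot = [0.0] * len(vocab)
    combo_one_hot[vocab[tuple(frame["buttons"])]] = 1.0
    return analog + combo_one_hot
```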
Capturing Gameplay Style with Markov Ensembles

Learning the stylistic elements of a policy is challenging. One of the main difficulties comes from style evaluation, which is highly subjective and not easily quantifiable. While inverse reinforcement learning aims at finding reward functions that promote a certain style and behavior, we don't have sufficient data at hand to solve this ill-posed problem. Instead, we incorporate style directly from the demonstrations using imitation learning. Our demonstration data consists of per-frame records of the already described engineered states and the player actions encoded as game controller inputs.

Notice that we inherently deal with a partially observable Markov decision process (POMDP), as not all the attributes in the state space are observable (Borovikov and Beirami 2018). To support a Markov model capturing player style, we include the history of the N most recent actions as part of the state space as well. The depth N of the history affects the expressiveness of style capture. In our experiments, we limit the depth to less than a second of gameplay (N < 60), which still preserves the visual style of a simple reactive policy.

To build the Markov ensemble, we use an approach to style reproduction inspired by the natural language processing literature (see the review (Zhai 2008)). We encode the demonstrations as symbolic sequences utilizing a hierarchy of multi-resolution quantization schemes ranging from detailed to complete information loss for the continuous and the discrete channels. The most detailed quantization and higher-order Markov models can reproduce sequences of human actions in similar situations with high accuracy, thus capturing gameplay style. The coarsest level corresponds to a Markov agent blindly sampling actions from the demonstrations. The hierarchical ensemble of Markov models provides minimal generalization from the demonstration data. The ensemble of such models is straightforward to build, and the inference is a lookup process.

We intend to provide a complete description of the Markov ensemble by publishing a preprint of our internal technical report (Borovikov and Harder 2018).
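A minimal sketch of such an ensemble, under simplified assumptions about quantization and key construction, is given below: each level quantizes the state at a different resolution and looks up the actions observed after the same quantized key, falling back to coarser levels when a key is missing. The full ensemble described in (Borovikov and Harder 2018) is richer than this.

```python
import random
from collections import defaultdict

class MarkovLevel:
    """One level of the ensemble: a lookup table from a quantized
    (state, recent action history) key to the actions that followed it."""
    def __init__(self, num_bins, history_depth):
        self.num_bins = num_bins          # 0 bins collapses the state entirely
        self.history_depth = history_depth
        self.table = defaultdict(list)

    def _key(self, state, action_history):
        quantized = tuple(round(x * self.num_bins) for x in state)
        history = tuple(action_history[-self.history_depth:]) if self.history_depth else ()
        return quantized + history

    def observe(self, state, action_history, action):
        self.table[self._key(state, action_history)].append(action)

    def sample(self, state, action_history):
        candidates = self.table.get(self._key(state, action_history))
        return random.choice(candidates) if candidates else None


class MarkovEnsemble:
    """Hierarchy of levels ordered from most to least detailed quantization;
    inference is a lookup that falls back to coarser levels."""
    def __init__(self, levels):
        self.levels = levels

    def train(self, episodes):
        for episode in episodes:          # episode: list of (state, action) pairs
            history = []
            for state, action in episode:
                for level in self.levels:
                    level.observe(state, history, action)
                history.append(action)

    def act(self, state, action_history):
        for level in self.levels:
            action = level.sample(state, action_history)
            if action is not None:
                return action
        return None                       # unhandled state: defer to the rules
```

For example, MarkovEnsemble([MarkovLevel(16, 30), MarkovLevel(4, 8), MarkovLevel(0, 1)]) would span detailed quantization down to near-blind sampling; trimming the coarsest levels yields the incomplete policy discussed in the next section.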
Capturing Implicit Human Knowledge with Embedded Rules

The outlined Markov ensemble can formally generalize to previously unobserved states. However, at the coarsest level of quantization, blind sampling from the observed actions easily breaks the illusion of intelligent, goal-driven behavior. To address that issue, we trim the coarser levels of quantization from the ensemble, leaving some states unhandled by the resulting incomplete Markov policy. To handle such states, we augment the Markov ensemble with a small number of heuristics captured as rules. The rules provide a reasonable response to states never observed in the demonstrations and not covered by the simple generalization of the Markov models. To illustrate the possible types of rules, we briefly examine two of them.

One type of rule illustrates implicit short-term goal setting for the model, eliminating Inverse Reinforcement Learning from the problem formulation. An obvious top-level goal in the FPS genre is to find, attack and defeat the adversary. A human player would not stand still or wander while receiving damage from the adversary. Such a state would rarely, if ever, appear in demonstrations. Instead, more often than not, the player would face the adversary and engage in combat. The corresponding rule we propose boils down to a simple "Turn to and approach the target whenever there is nothing else to do", captured with only a couple of lines of code. The target here can be the adversary, a cover spot, or another object of interest. The rule eventually transitions the agent into a state that it can handle from demonstrations.

For the second type of rule, an example could be as simple as "Do not run indefinitely in the same direction if moving in that direction is not possible". Humans proactively avoid blocked states, and such states may never occur in organic gameplay. Hence, learning such a rule directly from the demonstrations is not possible since the data for blocked states is not present.

In both cases, discovering the desired behavior via exploration would require a substantial amount of time, computational resources and a hand-crafted reward function. The costs of such exploration are disproportionate to the simplicity of the decisions the agent needs to learn.

To summarize, the engineered rules capture simple human knowledge and complement the ensemble model in the states unobserved in the demonstrations. When the trimmed ensemble model fails to produce an action, the script checks for conditions like the blocked one to generate a fallback action using the rules. The proposed combination of the Markov ensemble and the programmed heuristics provides a segue to the next step, which addresses the linear growth of the ensemble with the number of demonstrations.
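The sketch below illustrates how such a composite policy could be wired together: the trimmed Markov ensemble is queried first, and the two example rules act as a fallback. The observation fields and the symbolic action labels are hypothetical placeholders; in practice the fallback would emit the same controller-level actions as the rest of the pipeline.

```python
def fallback_rules(obs, last_action):
    """Hand-coded heuristics for states the trimmed ensemble cannot handle."""
    # "Do not run indefinitely in the same direction if moving there is not possible."
    if obs.get("movement_blocked") and last_action == "move_forward":
        return "turn_left"
    # "Turn to and approach the target whenever there is nothing else to do."
    if abs(obs.get("angle_to_target", 0.0)) > 0.1:
        return "turn_to_target"
    return "move_to_target"


def composite_policy(ensemble, features, obs, action_history):
    """Query the trimmed Markov ensemble first; fall back to the rules."""
    action = ensemble.act(features, action_history)
    if action is None:
        last = action_history[-1] if action_history else None
        action = fallback_rules(obs, last)
    return action
```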
DNN Model Trained with Bootstrapped Demonstrations and Rules

Traditional RL requires thousands of episodes to learn useful policies. Further, it is hard to engineer a reward that achieves the desired style. Instead, we resort to imitation learning, where we treat the demonstrations as a training set for a supervised learning problem. The model predicts the next action from a sequence of observed state-action pairs. This approach has proved useful in pre-training self-driving cars (Montemerlo et al. 2006) and is the subject of analysis in more recent literature, e.g., (Ross and Bagnell 2010). The main argument against casting imitation learning as a supervised learning problem is the inability to learn from new situations and to recover from mistakes. The rules and feature engineering we present above are intended to address these issues by incorporating prior human knowledge and bootstrapping the demonstrations to make it part of the supervised learning data. We achieve this by augmenting our small set of demonstrations with bootstrap: we construct an agent controlled by the Markov ensemble enhanced with the rules and let it interact with the game to generate new episodes. The generated, augmented data set feeds into training a DNN, described next.
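A bootstrap loop of this kind could look roughly like the sketch below, reusing the encode_state and composite_policy helpers from the earlier sketches; the game client interface (reset/step), the episode and step limits, and the history length are assumptions for illustration.

```python
def bootstrap_episodes(game, ensemble, num_episodes, max_steps=2000):
    """Let the composite (Markov ensemble + rules) agent play the game to
    generate additional state-action pairs for supervised DNN training.
    `game.reset()` / `game.step()` stand in for the instrumented game client."""
    dataset = []
    for _ in range(num_episodes):
        obs = game.reset()
        action_history = []
        for _ in range(max_steps):
            features = encode_state(obs)                      # earlier sketch
            action = composite_policy(ensemble, features, obs, action_history)
            dataset.append((features, tuple(action_history[-30:]), action))
            action_history.append(action)
            obs, done = game.step(action)
            if done:
                break
    return dataset
```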
The trained DNN model predicts the action from the already observed state-action pairs, including those previously handled by the scripts. The low dimensionality of the feature space results in fast training across a wide range of model architectures, allowing a quick experimentation loop. We converged on a simple model with a single "wide" hidden layer for the motion control channels and a DNN model for the discrete channels toggling actions like sprinting, firing, and climbing.
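One possible shape of such a model is sketched below in PyTorch; the framework choice, layer widths, and dimensions are our assumptions here, and the two heads could equally be trained as two separate models as in the description above.

```python
import torch
import torch.nn as nn

class StyledController(nn.Module):
    """Two heads over the engineered features plus the recent action history:
    a single wide hidden layer regressing the continuous motion channels and a
    small classifier over the one-hot discrete action combinations."""
    def __init__(self, in_dim, continuous_dim=6, num_discrete=15, wide=512, hidden=64):
        super().__init__()
        self.motion = nn.Sequential(
            nn.Linear(in_dim, wide), nn.ReLU(),
            nn.Linear(wide, continuous_dim), nn.Tanh(),  # sticks in [-1, 1];
        )                                                # triggers could use a sigmoid
        self.discrete = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_discrete),             # logits over observed combos
        )

    def forward(self, x):
        return self.motion(x), self.discrete(x)

# Training would minimize, e.g., a mean-squared error on the continuous channels
# plus a cross-entropy loss on the discrete targets over the bootstrapped data set.
```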
While winning is not everything (Borovikov et al. 2019), we would like the agent to perform at a reasonable level on FPS genre metrics, e.g., demonstrate a good kill-death ratio or defeat the adversary within an allowed budget of health and ammo. As we observe in our experiments, the kill-death ratio of the model we train can vary over a wide range and is at best around 10-40% of the teacher's performance. It remains an open problem how to improve the performance metrics of a trained model using a limited amount of additional training while preserving the style.

Table 1: Comparison between OpenAI 1V1 Dota 2 Bot (OpenAI Five 2018) training metrics and training an agent from human demonstrations, programmed rules, and bootstrap in a proprietary open-world first-person shooter. While the objectives of training are different, the environments are somewhat comparable. The metrics illustrate the practical advantages of the proposed technique.

                                        OpenAI 1V1 Bot              Bootstrapped Agent
  Experience                            ∼300 years (per day)        ∼5 min of human demonstrations
  Bootstrap using game client           N/A                         ×5-20
  CPU                                   60,000 CPU cores on Azure   1 local CPU
  GPU                                   256 K80 GPUs on Azure       N/A
  Size of observation                   ∼3.3 kB                     ∼0.5 kB
  Observations per second of gameplay   10                          33

Conclusion and Future Work

We tested our approach on a proprietary open-world first-person shooter game, which resulted in an agent behaving similarly to a human player with minimal training costs. Table 1 illustrates the significant computational advantages gained from adding engineered knowledge to the training process of practically useful agents. However, when comparing our approach to mainstream RL, we need to emphasize the difference between the training objectives, which makes such a comparison only illustrative.

The focus of our research is the practical, cost-efficient development of human-like behavior in games. Keeping the performance of the model within certain limits is a secondary objective for future work. Obtaining theoretical guarantees for the style and performance of the trained agents would require substantial additional work. Applying our approach to multi-agent policies and covering multi-modal gameplay is a logical next step. We plan to extend the encouraging results shown here to other games in development.
                       References
Borovikov, I., and Beirami, A. 2018. Imitation learning via
bootstrapped demonstrations in an open-world video game.
In NeurIPS 2018 Workshop on Reinforcement Learning under
Partial Observability.
Borovikov, I., and Harder, J. 2018. Learning models to
imitate personal behavior style with applications in video
gaming. Technical report, Electronic Arts, Digital Platforms
Data and AI. To be published as a preprint.
Borovikov, I.; Zhao, Y.; Beirami, A.; Harder, J.; Kolen, J.;
Pestrak, J.; Pinto, J.; Pourabolghasem, R.; Chaput, H.; Sar-
dari, M.; Lin, L.; Aghdaie, N.; and Zaman, K. 2019. Win-
ning Isn’t Everything: Training Agents to Playtest Modern
Games. In AAAI Workshop on Reinforcement Learning in
Games.
Harmer, J.; Gisslen, L.; del Val, J.; Holst, H.; Bergdahl, J.;
Olsson, T.; Sjoo, K.; and Nordin, M. 2018. Imitation learn-
ing with concurrent actions in 3D games.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Ve-
ness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.;
Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-
level control through deep reinforcement learning. Nature
518(7540):529.
Montemerlo, M.; Thrun, S.; Dahlkamp, H.; and Stavens, D.
2006. Winning the DARPA Grand Challenge with an AI robot.
In Proceedings of the AAAI National Conference on Artificial
Intelligence, 17–20.
OpenAI Five. 2018. [Online, June 2018]
https://openai.com/five.
Ross, S., and Bagnell, D. 2010. Efficient reductions for
imitation learning. In Proceedings of the Thirteenth Inter-
national Conference on Artificial Intelligence and Statistics,
AISTATS 2010, Sardinia, Italy, May 2010, 661–668.
Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezh-
nevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Aga-
piou, J.; Schrittwieser, J.; et al. 2017. StarCraft II: A
new challenge for reinforcement learning. arXiv preprint
arXiv:1708.04782.
Zhai, C. 2008. Statistical language models for information
retrieval: a critical review. Foundations and Trends in Infor-
mation Retrieval 2(3):137–213.