        From Demonstrations and Knowledge Engineering to a DNN Agent in a
                        Modern Open-World Video Game
                             Igor Borovikov                                        Ahmad Beirami
                     EA Digital Platform – Data & AI                        EA Digital Platform – Data & AI
                       209 Redwood Shores Pkwy                                209 Redwood Shores Pkwy
                        Redwood City, CA 94065                                 Redwood City, CA 94065
                        iborovikov@ea.com                                       abeirami@ea.com


                              Abstract

   In video games, there is a high demand for non-player characters (NPCs) whose behavior is believable and human-like. The traditional hand-crafted AI driving NPCs is hard to scale up in modern open-world multiplayer games and often leads to the uncanny valley of robotic behavior. We discuss a novel approach to solving this problem based on imitation learning. We combine demonstrations, programmed rules, and bootstrapping in the game environment to train a Deep Neural Network (DNN) defining the NPC behavior. Unlike Reinforcement Learning (RL), where the objective is optimal performance, we aim to reproduce a human player's style from a few demonstrations. We embed the implicit knowledge of basic gameplay rules, which is hard to learn via self-play or infer from a few demonstrations but straightforward to capture with simple programmed logic. We build a composite model that interacts with the game to bootstrap the human demonstrations and provide sufficient training data for a more complex DNN model capturing stylized gameplay from the demonstrations and enhanced with rules. We show that the method is computationally fast and delivers promising results in a game production cycle.
Introduction and Problem Statement

The leading advances in the field of RL applied to playing computer games, e.g., (OpenAI Five 2018; Mnih et al. 2015; Vinyals et al. 2017; Harmer et al. 2018), strive to train an optimal artificial agent ("agent" for short) maximizing clearly defined rewards, while the game itself remains fixed for the foreseeable future. In contrast, during game development the objectives and the settings are entirely different. The agents can play a variety of roles with rewards that are hard to define formally, e.g., the goal of an agent exploring a game level differs from foraging, defeating all adversaries, or solving a puzzle. Also, the game environment frequently changes between game builds. In such settings, it is desirable to quickly train agents that can work as NPCs or as bots for automated testing and game balancing. Throwing computational resources and substantial engineering effort at training agents in such conditions is not practical and calls for different approaches.

The preferred solution would utilize only a few relatively short episodes played by the developers. The time allowed for augmenting these demonstrations by autoplay would be limited, especially if the game engine doesn't support a dramatic speedup. Thus, the solution has to be sample-efficient and train the agents offline. Using the frame buffer for training is problematic due to frequent changes in the game's look during development. On the contrary, the conceptual types of the game environment rarely change during its production. That allows engineering a substantial part of the prior knowledge about the game states into a low-dimensional feature vector replacing the frame buffer. The core gameplay also rarely changes, which allows describing the desired behavior as a compact model depending on core game features.

In this position paper, we train a model of an NPC with stylistic traits provided by demonstrations. We include the aspects of behavior that are hard to infer from demonstrations as engineered rules. We use low-dimensional engineered features to provide information similar to what is available in organic gameplay. Finally, we train a composite DNN model, which we show to be effective while inexpensive.

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Proposed Approach

The methodology we explore assumes modest domain knowledge and an intuitive understanding of the game mechanics of a First Person Shooter (FPS). Such a level of understanding is usually widely available to the game designers and software engineers working on the game. We aim at a high level of abstraction to avoid an overly elaborate or game-specific formalization of that knowledge. With that in mind, the three components we engineer are:

• state space (features),
• action space,
• rules capturing implicit human knowledge.

We complement the rules with explicit human demonstrations to build a Markov ensemble, which provides basic generalization. We use this aggregate model to drive an agent in the game to automate the generation of additional bootstrap data. The bootstrapped data set allows us to train a composite DNN model combining the demonstrations and the engineered rules. As a case study, we explore an FPS game,
which is conceptually similar to the one investigated by (Harmer et al. 2018). Its core mechanics are generic enough to make our approach applicable to other games in the same or a similar genre.

State Space (Features)

We instrument the game and expose its current state to the agent as a low-dimensional normalized vector s = (c_1, ..., c_n, d_1, ..., d_m). Its components, c and d, describe n continuous and m discrete features correspondingly. In an FPS game, the continuous features could include the distance to the adversary, the angle between the target Line of Sight (LoS) and the player orientation, velocity components, ammo, and health. We map all continuous components to the range |c_i| ∈ [0, 1]. A mapping function for unbounded components like distance could be arctan or any similarly behaving smooth function. For variables with a naturally defined range, like ammo or health, we use linear normalization. The discrete components d may include binary values like the presence of an LoS or sprinting on-off, and one-hot encoded values of non-binary attributes, such as the current animation type. By construction, all d_i ∈ {0, 1}.

The total number of features for a single adversary in our proof-of-concept implementation is quite small: n ∼ 20 and m ∼ 20, depending on the experiment. In particular, we explore only on-foot action and remain in the same game modality, e.g., we exclude getting into and driving vehicles. The locations and objects of interest, like cover spots, medical packs and ammo clips, can appear in the state vector in different forms, but we leave their detailed description to future, more detailed publications, where we intend to explore more complex gameplay. Finally, we expose only the features that are observable in organic gameplay to achieve human-like gameplay.
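As an illustration (not the production implementation), one possible encoding of such a state vector is sketched below; the field names, ranges, and the observation accessor are hypothetical assumptions for the example.

```python
import math

# Illustrative ranges for bounded features (assumed values, not from the paper).
MAX_HEALTH = 100.0
MAX_AMMO = 30.0
NUM_ANIM_TYPES = 8  # hypothetical number of animation types

def squash_unbounded(x):
    """Map an unbounded non-negative value (e.g., distance) into [0, 1)."""
    return (2.0 / math.pi) * math.atan(x)

def one_hot(index, size):
    """One-hot encode a discrete value with `size` possible categories."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def encode_state(obs):
    """Build the normalized feature vector s = (c_1..c_n, d_1..d_m)
    from a raw observation dictionary exposed by the instrumented game."""
    continuous = [
        squash_unbounded(obs["distance_to_adversary"]),  # unbounded -> arctan
        obs["angle_to_los"] / math.pi,                   # angle in [-pi, pi] -> [-1, 1]
        squash_unbounded(abs(obs["speed"])),
        obs["ammo"] / MAX_AMMO,                          # bounded -> linear scaling
        obs["health"] / MAX_HEALTH,
    ]
    discrete = [
        1.0 if obs["has_line_of_sight"] else 0.0,
        1.0 if obs["is_sprinting"] else 0.0,
    ]
    discrete += one_hot(obs["animation_type"], NUM_ANIM_TYPES)
    return continuous + discrete
```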
Action Space

The agent's actions naturally map to the game controller input encoded with a few variables. Actions a_1, ..., a_k comprise k continuous and discrete values. There are six analog inputs from the controller: two sticks normalized to [−1, 1] and two triggers normalized to [0, 1] by design. There are also a few binary inputs from buttons. While all possible combinations of the inputs may look intractable for learning, as stated in (Harmer et al. 2018), our goal of imitating a human player doesn't require covering all possible combinations of inputs. Instead, we extract only those combinations that occur in organic gameplay and encode them as one-hot vectors, which drastically reduces the action space complexity. The resulting dimensionality of the action space in our experiments also turned out to be in the low two digits, k ∼ 15, depending on the experiment.

The proposed approach to the game state-action space serves two purposes: it keeps the action space dimensionality under control and eliminates actions that humans never take. Thus, the model we train is "fair" (i.e., it does not use more inputs at the same time than human anatomy allows).
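A minimal sketch of this encoding is shown below, assuming demonstration frames that carry stick, trigger, and button states under hypothetical field names: the one-hot vocabulary contains only the button combinations actually observed in the demonstrations.

```python
from collections import OrderedDict

def build_action_vocabulary(demo_frames):
    """Collect the distinct button combinations seen in demonstrations.
    Each frame is assumed to carry a tuple of binary button states."""
    vocab = OrderedDict()
    for frame in demo_frames:
        combo = tuple(frame["buttons"])     # e.g., (fire, jump, sprint, crouch)
        if combo not in vocab:
            vocab[combo] = len(vocab)       # assign the next one-hot index
    return vocab

def encode_action(frame, vocab):
    """Encode a frame's controller input: the analog sticks (x, y each) and
    triggers stay continuous, the button combination becomes a one-hot vector."""
    analog = list(frame["sticks"]) + list(frame["triggers"])  # 4 + 2 values
    combo_one_hot = [0.0] * len(vocab)
    combo_one_hot[vocab[tuple(frame["buttons"])]] = 1.0
    return analog + combo_one_hot
```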
Capturing Gameplay Style with Markov Ensembles

Learning the stylistic elements of a policy is challenging. One of the main difficulties comes from style evaluation, which is highly subjective and not easily quantifiable. While inverse reinforcement learning aims at finding reward functions that promote a certain style and behavior, we don't have sufficient data at hand to solve this ill-posed problem. Instead, we incorporate style directly from the demonstrations using imitation learning. Our demonstration data consists of per-frame records of the already described engineered states and the player actions encoded as game controller inputs.

Notice that we inherently deal with a partially observable Markov decision process (POMDP), as not all the attributes in the state space are observable (Borovikov and Beirami 2018). To support a Markov model capturing player style, we include the history of the N most recent actions as part of the state space as well. The depth N of the history affects the expressiveness of style capture. In our experiments, we limit the depth to less than a second of gameplay (N < 60), which still preserves the visual style of a simple reactive policy.

To build the Markov ensemble, we use an approach to style reproduction inspired by the natural language processing literature (see the review (Zhai 2008)). We encode the demonstrations as symbolic sequences utilizing a hierarchy of multi-resolution quantization schemes ranging from detailed to complete information loss for the continuous and the discrete channels. The most detailed quantization and higher-order Markov models can reproduce sequences of human actions in similar situations with high accuracy, thus capturing gameplay style. The coarsest level corresponds to a Markov agent blindly sampling actions from the demonstrations. The hierarchical ensemble of Markov models provides minimal generalization from the demonstration data. The ensemble of such models is straightforward to build, and the inference is a lookup process.

We intend to provide a complete description of the Markov ensemble by publishing a preprint of our internal technical report (Borovikov and Harder 2018).
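A minimal sketch of such an ensemble, under simplified assumptions about quantization and key construction, is given below: each level quantizes the state at a different resolution and looks up the actions observed after the same quantized key, falling back to coarser levels when a key is missing. The full ensemble described in (Borovikov and Harder 2018) is richer than this.

```python
import random
from collections import defaultdict

class MarkovLevel:
    """One level of the ensemble: a lookup table from a quantized
    (state, recent action history) key to the actions that followed it."""
    def __init__(self, num_bins, history_depth):
        self.num_bins = num_bins          # 0 bins collapses the state entirely
        self.history_depth = history_depth
        self.table = defaultdict(list)

    def _key(self, state, action_history):
        quantized = tuple(round(x * self.num_bins) for x in state)
        history = tuple(action_history[-self.history_depth:]) if self.history_depth else ()
        return quantized + history

    def observe(self, state, action_history, action):
        self.table[self._key(state, action_history)].append(action)

    def sample(self, state, action_history):
        candidates = self.table.get(self._key(state, action_history))
        return random.choice(candidates) if candidates else None


class MarkovEnsemble:
    """Hierarchy of levels ordered from most to least detailed quantization;
    inference is a lookup that falls back to coarser levels."""
    def __init__(self, levels):
        self.levels = levels

    def train(self, episodes):
        for episode in episodes:          # episode: list of (state, action) pairs
            history = []
            for state, action in episode:
                for level in self.levels:
                    level.observe(state, history, action)
                history.append(action)

    def act(self, state, action_history):
        for level in self.levels:
            action = level.sample(state, action_history)
            if action is not None:
                return action
        return None                       # unhandled state: defer to the rules
```

For example, MarkovEnsemble([MarkovLevel(16, 30), MarkovLevel(4, 8), MarkovLevel(0, 1)]) would span detailed quantization down to near-blind sampling; trimming the coarsest levels yields the incomplete policy discussed in the next section.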
Capturing Implicit Human Knowledge with Embedded Rules

The outlined Markov ensemble can formally generalize to previously unobserved states. However, at the coarsest level of quantization, blind sampling from the observed actions easily breaks the illusion of intelligent, goal-driven behavior. To address that issue, we trim the coarser levels of quantization from the ensemble, leaving some states unhandled by the resulting incomplete Markov policy. To handle such states, we augment the Markov ensemble with a small number of heuristics captured as rules. The rules provide a reasonable response to states never observed in the demonstrations and not covered by the simple generalization of the Markov models. To illustrate the possible types of rules, we briefly examine two of them.

One type of rule illustrates implicit short-term goal setting for the model, eliminating Inverse Reinforcement Learning from the problem formulation. An obvious top-level goal in the FPS genre is to find, attack and defeat the adversary. A human player would not stand still or wander while receiving damage from the adversary. Such a state would rarely, if ever, appear in demonstrations. Instead, more often than not, the player would face the adversary and engage in combat. The corresponding rule we propose boils down to a simple "Turn to and approach the target whenever there is nothing else to do", captured with only a couple of lines of code. The target here can be the adversary, a cover spot, or another object of interest. The rule eventually transitions the agent into a state that it can handle from demonstrations.

For the second type of rule, an example could be as simple as "Do not run indefinitely in the same direction if moving in that direction is not possible". Humans proactively avoid blocked states, and such states may never occur in organic gameplay. Hence, learning such a rule directly from the demonstrations is not possible since the data for blocked states is not present.

In both cases, discovering the desired behavior via exploration would require a substantial amount of time, computational resources and a hand-crafted reward function. The costs of such exploration are disproportionate to the simplicity of the decisions the agent needs to learn.

To summarize, the engineered rules capture simple human knowledge and complement the ensemble model in the states unobserved in the demonstrations. When the trimmed ensemble model fails to produce an action, the script checks for conditions like the blocked one to generate a fallback action using the rules. The proposed combination of the Markov ensemble and the programmed heuristics provides a segue to the next step, which addresses the linear growth of the ensemble with the number of demonstrations.
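The sketch below illustrates how such a composite policy could be wired together: the trimmed Markov ensemble is queried first, and the two example rules act as a fallback. The observation fields and the symbolic action labels are hypothetical placeholders; in practice the fallback would emit the same controller-level actions as the rest of the pipeline.

```python
def fallback_rules(obs, last_action):
    """Hand-coded heuristics for states the trimmed ensemble cannot handle."""
    # "Do not run indefinitely in the same direction if moving there is not possible."
    if obs.get("movement_blocked") and last_action == "move_forward":
        return "turn_left"
    # "Turn to and approach the target whenever there is nothing else to do."
    if abs(obs.get("angle_to_target", 0.0)) > 0.1:
        return "turn_to_target"
    return "move_to_target"


def composite_policy(ensemble, features, obs, action_history):
    """Query the trimmed Markov ensemble first; fall back to the rules."""
    action = ensemble.act(features, action_history)
    if action is None:
        last = action_history[-1] if action_history else None
        action = fallback_rules(obs, last)
    return action
```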
DNN Model Trained with Bootstrapped Demonstrations and Rules

Traditional RL requires thousands of episodes to learn useful policies. Further, it is hard to engineer a reward that achieves the desired style. Instead, we resort to imitation learning, where we treat the demonstrations as a training set for a supervised learning problem. The model predicts the next action from a sequence of observed state-action pairs. This approach has proved useful in pre-training self-driving cars (Montemerlo et al. 2006) and is the subject of analysis in more recent literature, e.g., (Ross and Bagnell 2010). The main argument against casting imitation learning as a supervised learning problem is the inability to learn from new situations and to recover from mistakes. The rules and feature engineering we present above are intended to address these issues by incorporating prior human knowledge and bootstrapping the demonstrations to make it part of the supervised learning data. We achieve this by augmenting our small set of demonstrations with bootstrap: we construct an agent controlled by the Markov ensemble enhanced with the rules and let it interact with the game to generate new episodes. The generated, augmented data set feeds into training a DNN, described next.
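A bootstrap loop of this kind could look roughly like the sketch below, reusing the encode_state and composite_policy helpers from the earlier sketches; the game client interface (reset/step), the episode and step limits, and the history length are assumptions for illustration.

```python
def bootstrap_episodes(game, ensemble, num_episodes, max_steps=2000):
    """Let the composite (Markov ensemble + rules) agent play the game to
    generate additional state-action pairs for supervised DNN training.
    `game.reset()` / `game.step()` stand in for the instrumented game client."""
    dataset = []
    for _ in range(num_episodes):
        obs = game.reset()
        action_history = []
        for _ in range(max_steps):
            features = encode_state(obs)                      # earlier sketch
            action = composite_policy(ensemble, features, obs, action_history)
            dataset.append((features, tuple(action_history[-30:]), action))
            action_history.append(action)
            obs, done = game.step(action)
            if done:
                break
    return dataset
```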
The trained DNN model predicts the action from the already observed state-action pairs, including those previously handled by the scripts. The low dimensionality of the feature space results in fast training across a wide range of model architectures, allowing a quick experimentation loop. We converged on a simple model with a single "wide" hidden layer for the motion control channels and a DNN model for the discrete channels toggling actions like sprinting, firing, and climbing.
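One possible shape of such a model is sketched below in PyTorch; the framework choice, layer widths, and dimensions are our assumptions here, and the two heads could equally be trained as two separate models as in the description above.

```python
import torch
import torch.nn as nn

class StyledController(nn.Module):
    """Two heads over the engineered features plus the recent action history:
    a single wide hidden layer regressing the continuous motion channels and a
    small classifier over the one-hot discrete action combinations."""
    def __init__(self, in_dim, continuous_dim=6, num_discrete=15, wide=512, hidden=64):
        super().__init__()
        self.motion = nn.Sequential(
            nn.Linear(in_dim, wide), nn.ReLU(),
            nn.Linear(wide, continuous_dim), nn.Tanh(),  # sticks in [-1, 1];
        )                                                # triggers could use a sigmoid
        self.discrete = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_discrete),             # logits over observed combos
        )

    def forward(self, x):
        return self.motion(x), self.discrete(x)

# Training would minimize, e.g., a mean-squared error on the continuous channels
# plus a cross-entropy loss on the discrete targets over the bootstrapped data set.
```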
While winning is not everything (Borovikov et al. 2019), we would like the agent to perform at a reasonable level on FPS genre metrics, e.g., demonstrate a good kill-death ratio or defeat the adversary within an allowed budget of health and ammo. As we observe in our experiments, the kill-death ratio of the model we train can vary over a wide range and is at best around 10-40% of the teacher's performance. It remains an open problem how to improve the performance metrics of a trained model using a limited amount of additional training while preserving the style.

Table 1: Comparison between OpenAI 1V1 Dota 2 Bot (OpenAI Five 2018) training metrics and training an agent from human demonstrations, programmed rules, and bootstrap in a proprietary open-world first-person shooter. While the objectives of training are different, the environments are somewhat comparable. The metrics illustrate the practical advantages of the proposed technique.

                                        OpenAI 1V1 Bot              Bootstrapped Agent
  Experience                            ∼300 years (per day)        ∼5 min of human demonstrations
  Bootstrap using game client           N/A                         ×5-20
  CPU                                   60,000 CPU cores on Azure   1 local CPU
  GPU                                   256 K80 GPUs on Azure       N/A
  Size of observation                   ∼3.3 kB                     ∼0.5 kB
  Observations per second of gameplay   10                          33

Conclusion and Future Work

We tested our approach on a proprietary open-world first-person shooter game, which resulted in an agent behaving similarly to a human player with minimal training costs. Table 1 illustrates the significant computational advantages gained from adding engineered knowledge to the training process of practically useful agents. However, when comparing our approach to mainstream RL, we need to emphasize the difference between the training objectives, which makes such a comparison only illustrative.

The focus of our research is the practical, cost-efficient development of human-like behavior in games. Keeping the performance of the model within certain limits is a secondary objective for future work. Obtaining theoretical guarantees for the style and performance of the trained agents would require substantial additional work. Applying our approach to multi-agent policies and covering multi-modal gameplay is a logical next step. We plan to extend the encouraging results shown here to other games in development.
                       References
Borovikov, I., and Beirami, A. 2018. Imitation learning via
bootstrapped demonstrations in an open-world video game.
In NeurIPS 2018 Workshop on Reinforcement Learning under
Partial Observability.
Borovikov, I., and Harder, J. 2018. Learning models to
imitate personal behavior style with applications in video
gaming. Technical report, Electronic Arts, Digital Platforms
Data and AI. To be published as a preprint.
Borovikov, I.; Zhao, Y.; Beirami, A.; Harder, J.; Kolen, J.;
Pestrak, J.; Pinto, J.; Pourabolghasem, R.; Chaput, H.; Sar-
dari, M.; Lin, L.; Aghdaie, N.; and Zaman, K. 2019. Win-
ning Isn’t Everything: Training Agents to Playtest Modern
Games. In AAAI Workshop on Reinforcement Learning in
Games.
Harmer, J.; Gisslen, L.; del Val, J.; Holst, H.; Bergdahl, J.;
Olsson, T.; Sjoo, K.; and Nordin, M. 2018. Imitation learn-
ing with concurrent actions in 3D games.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Ve-
ness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.;
Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-
level control through deep reinforcement learning. Nature
518(7540):529.
Montemerlo, M.; Thrun, S.; Dahlkamp, H.; and Stavens, D.
2006. Winning the DARPA Grand Challenge with an AI robot.
In Proceedings of the AAAI National Conference on Artificial
Intelligence, 17–20.
OpenAI Five. 2018. [Online, June 2018]
https://openai.com/five.
Ross, S., and Bagnell, D. 2010. Efficient reductions for
imitation learning. In Proceedings of the Thirteenth Inter-
national Conference on Artificial Intelligence and Statistics,
AISTATS 2010, Sardinia, Italy, May 2010, 661–668.
Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezh-
nevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Aga-
piou, J.; Schrittwieser, J.; et al. 2017. StarCraft II: A
new challenge for reinforcement learning. arXiv preprint
arXiv:1708.04782.
Zhai, C. 2008. Statistical language models for information
retrieval: a critical review. Foundations and Trends in Infor-
mation Retrieval 2(3):137–213.