From Demonstrations and Knowledge Engineering to a DNN Agent in a Modern Open-World Video Game

Igor Borovikov (iborovikov@ea.com) and Ahmad Beirami (abeirami@ea.com)
EA Digital Platform – Data & AI, 209 Redwood Shores Pkwy, Redwood City, CA 94065

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Abstract

In video games, there is a high demand for non-player characters (NPCs) whose behavior is believable and human-like. The traditional hand-crafted AI driving NPCs is hard to scale up in modern open-world multiplayer games and often leads to the uncanny valley of robotic behavior. We discuss a novel approach to solving this problem based on imitation learning. We combine demonstrations, programmed rules, and bootstrapping in the game environment to train a Deep Neural Network (DNN) defining the NPC behavior. Unlike Reinforcement Learning (RL), where the objective is optimal performance, we aim to reproduce a human player's style from a few demonstrations. We embed the implicit knowledge of the basic gameplay rules, which are hard to learn via self-play or infer from a few demonstrations but are straightforward to capture with simple programmed logic. We build a composite model that interacts with the game to bootstrap the human demonstrations and provide sufficient training data for a more complex DNN model capturing stylized gameplay from demonstrations and enhanced with rules. We show that the method is computationally fast and delivers promising results in a game production cycle.

Introduction and Problem Statement

The leading advances in the field of RL applied to playing computer games, e.g., (OpenAI Five 2018; Mnih et al. 2015; Vinyals et al. 2017; Harmer et al. 2018), strive to train an optimal artificial agent ("agent" for short) maximizing clearly defined rewards, with the game itself remaining fixed for the foreseeable future. In contrast, during game development the objectives and the settings are entirely different. The agents can play a variety of roles with rewards that are hard to define formally, e.g., the goal of an agent exploring a game level is different from foraging, defeating all adversaries, or solving a puzzle. Also, the game environment changes frequently between game builds. In such settings, it is desirable to quickly train agents that can work as NPCs or as bots for automated testing and game balancing. Throwing computational resources and substantial engineering effort at training agents in such conditions is not practical and calls for different approaches.

The preferred solution would utilize only a few relatively short episodes played by the developers. The time allowed for augmenting these demonstrations by autoplay would be limited, especially if the game engine does not support a dramatic speedup. Thus, the solution has to be sample-efficient and train the agents offline. Using the frame buffer for training is problematic due to frequent changes of the game's look during development. On the contrary, the conceptual types of the game environment rarely change during its production. That allows for engineering a substantial part of the prior knowledge about the game states into a low-dimensional feature vector replacing the frame buffer. The core gameplay also rarely changes, which allows us to describe the desired behavior as a compact model depending on core game features.

In this position paper, we train a model of an NPC with stylistic traits provided by demonstrations. We include the aspects of behavior that are hard to infer from demonstrations as engineered rules. We use low-dimensional engineered features to provide information similar to what is available in organic gameplay. Finally, we train a composite DNN model, which we show to be effective while inexpensive.

Proposed Approach

The methodology we explore assumes modest domain knowledge and an intuitive understanding of the game mechanics of a First Person Shooter (FPS). Such a level of understanding is usually widely available to the game designers and software engineers working on the game. We aim at a high level of abstraction to avoid an overly elaborate or game-specific formalization of that knowledge. With that in mind, the three components we engineer are:

• state space (features),
• action space,
• rules capturing implicit human knowledge.

We complement the rules with explicit human demonstrations to build a Markov ensemble, which provides basic generalization. We use this aggregate model to drive an agent in the game to automate the generation of additional bootstrap data. The bootstrapped data set allows us to train a composite DNN model combining the demonstrations and the engineered rules. As a case study, we explore an FPS game, which is conceptually similar to the one investigated by (Harmer et al. 2018). Its core mechanics are generic enough to make our approach applicable to other games in the same or a similar genre.
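Before detailing each component, the overall flow can be summarized with a short sketch. The Python fragment below is only an illustration of this loop under simplifying assumptions: the game, ensemble, and rules arguments, together with the reset, step, sample methods and the order attribute, are hypothetical stand-ins for our internal interfaces rather than production code.

def bootstrap_dataset(demonstrations, game, ensemble, rules, n_episodes):
    """Augment a few human demonstrations by letting a Markov-ensemble-plus-rules
    agent play the instrumented game; the result feeds supervised DNN training.
    All arguments are stand-ins for internal interfaces (illustrative only)."""
    data = [pair for episode in demonstrations for pair in episode]  # (state, action) pairs
    for _ in range(n_episodes):
        state, history, done = game.reset(), [], False
        while not done:
            action = ensemble.sample(history, state)   # imitate the demonstrations when possible
            if action is None:                         # state not covered by the trimmed ensemble
                action = rules(state)                  # engineered fallback behavior
            data.append((state, action))
            history = (history + [action])[-ensemble.order:]
            state, done = game.step(action)
    return data

# A DNN policy is then fit to `data` with ordinary supervised learning,
# predicting the next action from the recent state-action history.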
State Space (Features)

We instrument the game and expose its current state s to the agent as a low-dimensional normalized vector s = (c1, ..., cn, d1, ..., dm). Its components, c and d, describe n continuous and m discrete features, respectively. In an FPS game, the continuous features could include the distance to the adversary, the angle between the target Line of Sight (LoS) and the player orientation, velocity components, ammo, and health. We map all continuous components to the range |ci| ∈ [0, 1]. A mapping function for unbounded components like distance could be arctan or any similarly behaving smooth function. For variables with a naturally defined range, like ammo or health, we use linear normalization. The discrete components d may include binary values like the presence of an LoS or sprinting on-off, and one-hot encoded values of non-binary quantities, such as the current animation type. By construction, all di ∈ {0, 1}.

The total number of features for a single adversary in our proof-of-concept implementation is quite small: n ∼ 20 and m ∼ 20, depending on the experiment. In particular, we explore only on-foot action and remain in the same game modality, e.g., we exclude getting into and driving vehicles. Locations and objects of interest, like cover spots, medical packs, and ammo clips, can appear in the state vector in different forms, but we leave their detailed description to future, more detailed publications, where we intend to explore more complex gameplay. Finally, we expose only the features that are observable in organic gameplay to achieve human-like gameplay.
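As a concrete illustration, the following Python sketch assembles such a state vector for a handful of features. The scale constants, argument names, and maxima (e.g., 300 rounds of ammo, 100 health, 8 animation types) are hypothetical choices for this example and would be game-specific in practice.

import math

def normalize_unbounded(x, scale=1.0):
    """Map an unbounded non-negative quantity (e.g., a distance) into [0, 1) via arctan."""
    return math.atan(x / scale) * 2.0 / math.pi

def normalize_bounded(x, x_max):
    """Linear normalization for quantities with a natural range, such as ammo or health."""
    return min(max(x / x_max, 0.0), 1.0)

def one_hot(index, size):
    """One-hot encoding for non-binary discrete values such as the animation type."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def encode_state(distance, angle, ammo, health, has_los, sprinting, anim_id, n_anims=8):
    """Assemble the normalized state vector s = (c1, ..., cn, d1, ..., dm)."""
    continuous = [
        normalize_unbounded(distance, scale=10.0),  # distance to the adversary
        angle / math.pi,                            # angle to the target LoS, given in [-pi, pi]
        normalize_bounded(ammo, 300.0),             # illustrative maxima; game-specific in practice
        normalize_bounded(health, 100.0),
    ]
    discrete = [float(has_los), float(sprinting)] + one_hot(anim_id, n_anims)
    return continuous + discrete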
Action Space

The agent's actions naturally map to the game controller input encoded with a few variables. Actions a1, ..., ak comprise k continuous and discrete values. There are six analog inputs from the controller: two sticks normalized to [−1, 1] and two triggers normalized to [0, 1] by design. There are also a few binary inputs from buttons. While all possible combinations of the inputs may look intractable for learning, as stated in (Harmer et al. 2018), our goal of imitating a human player does not require covering all possible combinations of inputs. Instead, we extract only those combinations that occur in organic gameplay and encode them as one-hot vectors, which drastically reduces the action space complexity. The resulting dimensionality of the action space in our experiments also turned out to be in the low two digits, k ∼ 15, depending on the experiment.

The proposed approach to the game state-action space serves two purposes: it keeps the action space dimensionality under control and eliminates actions that humans never take. Thus, the model we train is "fair", i.e., it does not use more simultaneous inputs than human anatomy allows.

Capturing Gameplay Style with Markov Ensembles

Learning stylistic elements of a policy is challenging. One of the main difficulties comes from style evaluation, which is highly subjective and not easily quantifiable. While inverse reinforcement learning aims at finding reward functions that promote a certain style and behavior, we do not have sufficient data at hand to solve this ill-posed problem. Instead, we incorporate style directly from the demonstrations using imitation learning. Our demonstration data consists of per-frame records of the already described engineered states and the player actions encoded as the game controller inputs. Notice that we inherently deal with a partially observable Markov decision process (POMDP), as not all the attributes in the state space are observable (Borovikov and Beirami 2018). To support a Markov model capturing the player's style, we include the history of the N most recent actions as part of the state space as well. The depth N of the history affects the expressiveness of the style capture. In our experiments, we limit the depth to less than a second of gameplay (N < 60), which still preserves the visual style of a simple reactive policy.

To build the Markov ensemble, we use an approach to style reproduction inspired by the natural language processing literature (see the review (Zhai 2008)). We encode the demonstrations as symbolic sequences utilizing a hierarchy of multi-resolution quantization schemes, ranging from detailed to complete information loss for the continuous and the discrete channels. The most detailed quantization and higher-order Markov models can reproduce sequences of human actions in similar situations with high accuracy, thus capturing gameplay style. The coarsest level corresponds to a Markov agent blindly sampling actions from the demonstrations. The hierarchical ensemble of Markov models provides minimal generalization from the demonstration data. The ensemble of such models is straightforward to build, and the inference is a lookup process. We intend to provide a complete description of the Markov ensemble by publishing a preprint of our internal technical report (Borovikov and Harder 2018).
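To make the lookup nature of the inference concrete, the following Python sketch shows one possible organization of such a hierarchy. The class name, the quantizer interface, and the default history depth are illustrative assumptions rather than our production implementation; trimming the coarser levels simply means passing a shorter list of quantizers.

import random
from collections import defaultdict

class MarkovEnsemble:
    """A hierarchy of Markov models over quantized state-action histories.

    quantizers: functions ordered from finest to coarsest; each maps a
    (history, state) pair to a hashable key."""

    def __init__(self, quantizers, order=4):
        self.quantizers = quantizers
        self.order = order                 # depth N of the action history
        # One lookup table per quantization level: key -> observed actions.
        self.tables = [defaultdict(list) for _ in quantizers]

    def fit(self, episodes):
        """episodes: iterables of per-frame (state, action) demonstration records."""
        for episode in episodes:
            history = []
            for state, action in episode:
                for table, quantize in zip(self.tables, self.quantizers):
                    table[quantize(history, state)].append(action)
                history = (history + [action])[-self.order:]

    def sample(self, history, state):
        """Return an action from the finest level that recognizes the state, else None."""
        for table, quantize in zip(self.tables, self.quantizers):
            candidates = table.get(quantize(history, state))
            if candidates:
                return random.choice(candidates)
        return None   # unhandled state: defer to the engineered rules

Returning None for an unrecognized state is what triggers the engineered rules described in the next section.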
Capturing Implicit Human Knowledge with Embedded Rules

The outlined Markov ensemble can formally generalize to previously unobserved states. However, at the coarsest level of quantization, blind sampling from the observed actions easily breaks the illusion of intelligent, goal-driven behavior. To address that issue, we trim the coarser levels of quantization from the ensemble, leaving some states unhandled by the resulting incomplete Markov policy. To handle such states, we augment the Markov ensemble with a small number of heuristics captured as rules. The rules provide a reasonable response to states never observed in the demonstrations and not covered by the simple generalization of the Markov models. To illustrate the possible types of rules, we briefly examine two of them.

One type of rule provides implicit short-term goal setting for the model, eliminating Inverse Reinforcement Learning from the problem formulation. An obvious top-level goal in the FPS genre is to find, attack, and defeat the adversary. A human player would not stand still or wander while receiving damage from the adversary, so such a state would rarely if ever appear in demonstrations. Instead, more often than not, the player would face the adversary and engage in combat. The corresponding rule we propose boils down to a simple "turn to and approach the target whenever there is nothing else to do", captured with only a couple of lines of code. The target here can be the adversary, a cover spot, or other objects of interest. The rule eventually transitions the agent into a state that it can handle from demonstrations.

For the second type of rule, an example could be as simple as "do not run indefinitely in the same direction if moving in that direction is not possible". Humans proactively avoid blocked states, and they may never occur in organic gameplay. Hence, learning such a rule directly from the demonstrations is not possible since the data for such blocked states is not present.

In both cases, discovering the desired behavior via exploration would require a substantial amount of time, computational resources, and a hand-crafted reward function. The costs of such exploration are disproportionate to the simplicity of the decisions the agent needs to learn.

To summarize, the engineered rules capture simple human knowledge and complement the ensemble model in the states unobserved in the demonstrations. When the trimmed ensemble model fails to produce an action, the script checks for conditions like the blocked state to generate a fallback action using the rules. The proposed combination of the Markov ensemble and the programmed heuristics provides a segue to the next step, which addresses the linear growth of the ensemble with the number of demonstrations.
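Both rules are indeed only a few lines of code. The Python sketch below is a hypothetical illustration: the Action fields, the turning threshold, and the state attributes blocked and angle_to_target are assumed names for the engineered features described earlier, not our actual interface.

from dataclasses import dataclass

@dataclass
class Action:
    turn: float   # stick x-axis in [-1, 1]; negative turns left, positive turns right
    move: float   # stick y-axis in [-1, 1]; 1.0 is full speed forward

def clamp(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))

def fallback_action(state):
    """Fallback for states the trimmed ensemble leaves unhandled (illustrative only)."""
    # "Do not run indefinitely in the same direction if moving there is not possible."
    if state.blocked:
        return Action(turn=1.0, move=0.0)                       # steer away from the obstruction
    # "Turn to and approach the target whenever there is nothing else to do."
    if abs(state.angle_to_target) > 0.1:                        # not yet facing the target
        return Action(turn=clamp(state.angle_to_target), move=0.0)
    return Action(turn=0.0, move=1.0)                           # facing the target: close the distance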
DNN Model Trained with Bootstrapped Demonstrations and Rules

Traditional RL requires thousands of episodes to learn useful policies. Further, it is hard to engineer a reward that achieves the desired style. Instead, we resort to imitation learning, where we treat the demonstrations as a training set for a supervised learning problem. The model predicts the next action from a sequence of observed state-action pairs. This approach has proved useful in pre-training of self-driving cars (Montemerlo et al. 2006) and is the subject of analysis in more recent literature, e.g., (Ross and Bagnell 2010). The main argument against casting imitation learning as a supervised learning problem is the inability to learn from new situations and to recover from mistakes. The rules and feature engineering we present above are intended to address these issues by incorporating prior human knowledge and bootstrapping the demonstrations to make them part of the supervised learning data. We achieve this by augmenting our small set of demonstrations with bootstrap: we construct an agent controlled by the Markov ensemble enhanced with the rules and let it interact with the game to generate new episodes. The generated, augmented data set feeds into training the DNN described next.

The trained DNN model predicts an action from the already observed state-action pairs, including those previously handled by the scripts. The low dimensionality of the feature space results in fast training in a wide range of model architectures, allowing a quick experimentation loop. We converged on a simple model with a single "wide" hidden layer for the motion control channels and a DNN model for the discrete channels toggling actions like sprinting, firing, and climbing.

While winning is not everything (Borovikov et al. 2019), we would like the agent to perform at a reasonable level on FPS genre metrics, e.g., demonstrate a good kill-death ratio or defeat the adversary within the allowed limits of health and ammo. As we observe in our experiments, the kill-death ratio for the model we train can vary over a wide range and is at best around 10-40% of the teacher's performance. It remains an open problem how to improve the performance metrics of a trained model using a limited amount of additional training while preserving the style.

Table 1: Comparison between the OpenAI 1V1 Dota 2 Bot (OpenAI Five 2018) training metrics and training an agent from human demonstrations, programmed rules, and bootstrap in a proprietary open-world first-person shooter. While the objectives of training are different, the environments are somewhat comparable. The metrics illustrate the practical advantages of the proposed technique.

                                        OpenAI 1V1 Bot              Bootstrapped Agent
Experience (per day)                    ∼300 years                  ∼5 min of human demonstrations
Bootstrap using game client             N/A                         ×5-20
CPU                                     60,000 CPU cores on Azure   1 local CPU
GPU                                     256 K80 GPUs on Azure       N/A
Size of observation                     ∼3.3 kB                     ∼0.5 kB
Observations per second of gameplay     10                          33

Conclusion and Future Work

We tested our approach on a proprietary open-world first-person shooter game, which resulted in an agent behaving similarly to a human player with minimal training costs. Table 1 illustrates the significant computational advantages gained from adding engineered knowledge to the training process of practically useful agents. However, when comparing our approach to mainstream RL, we need to emphasize the difference between the training objectives, which makes such a comparison only illustrative.

The focus of our research is the practical, cost-efficient development of human-like behavior in games; keeping the performance of the model within certain limits is a secondary, future objective. Obtaining theoretical guarantees for the style and performance of the trained agents would require substantial additional work. Applying our approach to multi-agent policies and covering multi-modal gameplay is a logical next step. We plan to extend the encouraging results shown here to other games in development.
References

Borovikov, I., and Beirami, A. 2018. Imitation learning via bootstrapped demonstrations in an open-world video game. In NeurIPS 2018 Workshop on Reinforcement Learning under Partial Observability.

Borovikov, I., and Harder, J. 2018. Learning models to imitate personal behavior style with applications in video gaming. Technical report, Electronic Arts, Digital Platforms Data and AI. To be published as a preprint.

Borovikov, I.; Zhao, Y.; Beirami, A.; Harder, J.; Kolen, J.; Pestrak, J.; Pinto, J.; Pourabolghasem, R.; Chaput, H.; Sardari, M.; Lin, L.; Aghdaie, N.; and Zaman, K. 2019. Winning Isn't Everything: Training Agents to Playtest Modern Games. In AAAI Workshop on Reinforcement Learning in Games.

Harmer, J.; Gisslen, L.; del Val, J.; Holst, H.; Bergdahl, J.; Olsson, T.; Sjoo, K.; and Nordin, M. 2018. Imitation learning with concurrent actions in 3D games.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.

Montemerlo, M.; Thrun, S.; Dahlkamp, H.; and Stavens, D. 2006. Winning the DARPA Grand Challenge with an AI robot. In Proceedings of the AAAI National Conference on Artificial Intelligence, 17–20.

OpenAI Five. 2018. [Online, June 2018] https://openai.com/five.

Ross, S., and Bagnell, D. 2010. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Sardinia, Italy, 661–668.

Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; et al. 2017. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.

Zhai, C. 2008. Statistical language models for information retrieval: a critical review. Foundations and Trends in Information Retrieval 2(3):137–213.