Tutor4RL: Guiding Reinforcement Learning with External Knowledge

Mauricio Fadel Argerich, Jonathan Fürst, Bin Cheng
NEC Laboratories Europe
Kurfürsten-Anlage 36, 69115 Heidelberg, Germany
{mauricio.fadel@neclab.eu, jonathan.fuerst@neclab.eu, bin.cheng@neclab.eu}

Abstract

We introduce Tutor4RL, a method that improves reinforcement learning (RL) performance during training by using external knowledge to guide the agent's decisions and experience. Current RL approaches need extensive experience to deliver good performance, which is not acceptable in many real systems where no simulation environment or considerable previous data is available. In Tutor4RL, external knowledge, such as expert or domain knowledge, is expressed as programmable functions that are fed to the RL agent. During its first steps, the agent uses these knowledge functions to decide the best action, guiding its exploration and providing better performance from the start. As the agent gathers experience, it increasingly exploits its learned policy, eventually leaving its tutor behind. We demonstrate Tutor4RL with a DQN agent. In our tests, Tutor4RL achieves more than 3 times higher reward at the beginning of its training than an agent with no external knowledge.

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Introduction

Reinforcement Learning (RL) has achieved great success in fields such as robotics (Kober, Bagnell, and Peters 2013), recommender systems (Theocharous, Thomas, and Ghavamzadeh 2015) and video games, in which RL has even surpassed human performance (Mnih et al. 2015). However, before achieving this, RL often delivers poor performance for an extended time, possibly for millions of iterations, until it has gathered enough experience. In practice, this problem is approached mainly with two RL techniques: (1) training via simulation and (2) learning from historical data.

Recently, researchers have applied RL to computer systems tasks, such as database management system configuration (Schaarschmidt et al. 2018) or container orchestration for Big Data systems (Fadel Argerich, Cheng, and Fürst 2019). Here, experience data for the specific task might not be available (e.g., the complexity of a data analytics task depends on the processed dataset, which is not known beforehand) and the context of the agent changes drastically with each deployment (e.g., the execution environment is different). Because of this, training on previous data or in simulation is not feasible. Thus, the agent needs to gather its experience online, from the performance of a live system where each action the agent explores has a real cost that impacts the system. To apply RL in such scenarios, an agent needs to provide "good-enough" performance from the start and act safely throughout the learning (Dulac-Arnold, Mankowitz, and Hester 2019), a fundamental problem that must be solved to make RL applicable to real-world use cases.

However, in our application of RL to container orchestration in Big Data systems, we have experienced that some knowledge of the environment is usually available before the deployment of the agent. This knowledge includes common heuristics or notions, well known to domain experts (e.g., DevOps engineers). In our work, we aim to make this knowledge accessible to the RL agent. Our intuition is that, just as a person prepares for a task beforehand by gathering knowledge from sources such as other people, manuals, etc., we can provide external knowledge that the RL agent can use when not enough experience has been gathered.

To realize this concept, we introduce Tutor4RL, a method that guides RL via external knowledge. In Tutor4RL, external knowledge, such as expert or domain knowledge, is expressed as programmable functions that the agent uses during its training phase, and especially during its initial steps, to guide its behavior and learning. Thanks to Tutor4RL, the agent can perform in a reliable way from the start and achieve higher performance in a shorter training time. As the RL agent gathers more experience, it learns its policy and leaves the tutor behind, improving on its results thanks to its empirical knowledge. We take some inspiration from the recently proposed Data Programming concept for supervised learning, in which functions developed by experts are used to encode weak supervision sources (Ratner et al. 2017).

We test our approach with a DQN agent in the Atari game Breakout and compare it to a plain DQN agent. In our tests, Tutor4RL achieves more than a 3 times higher reward than the plain agent at the beginning of its training, and keeps up its performance even after leaving the tutor behind, i.e., when only using its learned policy.
Background and Motivation

As illustrated in Figure 1 (Sutton, Barto, and others 1998), in Reinforcement Learning (RL) an agent learns to control an unknown environment to achieve a certain goal while interacting with it. The agent interacts with the environment in discrete time steps t = 0, 1, .... In each step, the agent receives a representation s_t ∈ S of the state of the environment and a numerical signal r_t called the reward, and performs an action a_t ∈ A that leads to the state s_{t+1} and reward r_{t+1}, perceived by the agent in the next time step. S and A represent the sets of states and actions, respectively.

Figure 1: The RL framework and its elements. The agent observes state s_t and reward r_t from the environment and performs action a_t.

The agent's behaviour is defined by its policy π, which provides a mapping from states S to actions A. The value function q_π(s, a) represents the expected future reward received when taking action a in state s and following policy π. The goal of the agent is to find a policy that maximizes cumulative reward in the long run.

To do this, the agent learns from its experience: for each performed action a_t, it uses the collected observation (s_{t+1}, r_{t+1}) to optimize its policy π based on different models of the value function, such as a tabular model (Sutton, Barto, and others 1998) or a deep neural network model (Mnih et al. 2015). Existing studies show that RL can lead to a reasonable model for determining which action to take in each state after learning from a large amount of experience data. However, the biggest problem is how the RL agent can learn fast and efficiently from experience data.
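To make this learning loop concrete, the following minimal tabular Q-learning sketch implements the interaction of Figure 1 together with the tabular value model mentioned above. It is illustrative only and not part of Tutor4RL; it assumes a Gym-style environment with discrete, hashable states, where reset() returns a state and step(action) returns (next_state, reward, done, info).

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # q[(s, a)] approximates the value function q_pi(s, a).
    q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore a random action with probability epsilon.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # Update the tabular estimate from the observed transition (s, a, r, s').
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

With a large state/action space, filling such a table by random exploration is exactly the slow and costly process discussed next.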
There have been different approaches to addressing this problem in the state of the art. A simple approach is to explore the state space randomly, but this is usually time-consuming and costly when the state/action space is large. The drawback of this approach has been reported in our previous study (Fadel Argerich, Cheng, and Fürst 2019), where RL was used to automatically decide the configuration and deployment actions of a data processing pipeline in a cloud and edge environment.

Another approach is to gain experience via simulation. With enough computational resources, we can easily produce a large amount of experience data in a short time, but it is difficult to ensure that the simulated experiences are realistic enough to reflect the actual situations in the observed system.

Recently, a new trend has emerged that leverages external knowledge to improve the exploration efficiency of RL agents. For example, in (Moreno et al. 2004) and (Hester et al. 2018), prior knowledge such as pre-trained models and policies is used to bootstrap the exploration phase of an RL agent. However, this type of prior knowledge still originates in previous training and is limited by the availability of such data.

Instead of relying on any pre-trained model, we explore how to utilize a set of programmable knowledge functions to guide the exploration of an RL agent so that it can quickly make effective decisions, even after just a few exploration steps. We call our method Tutor4RL. Unlike existing approaches, Tutor4RL requires no previous training and is therefore a more practical approach for the use of RL in real systems. To the best of our knowledge, Tutor4RL is the first to apply programmable knowledge functions to RL for improving training performance and sample efficiency.

Tutor4RL

Figure 2 shows the overall design of Tutor4RL. Compared to traditional RL, we add a new component called the Tutor, which guides the agent to make informed decisions during training. The tutor can guide the agent because it directly leverages a set of knowledge functions defined by domain experts. In this way, Tutor4RL helps the agent avoid blind decisions at the beginning.

Figure 2: Overall working of Tutor4RL. Domain experts provide knowledge functions to the tutor, which advises the agent as it observes states and rewards from the environment and selects actions.

The tutor possesses external knowledge and interacts with the agent during training. The tutor takes as input the state of the environment and outputs the action to take, in a similar way to the agent's policy. However, the tutor is implemented as programmable functions in which external knowledge is used to decide the mapping between states and actions; e.g., for Atari Breakout, the tutor takes the frame from the video game as input and outputs the direction in which the bar should be moved. At every time step, the tutor interacts with the agent and gives it advice for making better decisions, based on all provided knowledge functions.

One issue for the agent to consider is when and how often it should ask the tutor for advice. In a similar fashion to epsilon-greedy exploration, we define τ as a threshold parameter that controls when the agent takes the actions suggested by the tutor instead of following its own decisions. τ is a parameter of our model and the best value to initialize it with depends on the use case; its initial value is therefore left to be decided during implementation.
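For concreteness, the evaluation below anneals τ linearly from 1 to 0 over the first 1.5M training steps (see Table 1). A minimal sketch of such a schedule, with a function name of our own choosing, could look as follows.

def linear_tau(step, tau_start=1.0, tau_end=0.0, decay_steps=1_500_000):
    # Linearly anneal the tutor threshold tau over the first decay_steps
    # training steps, then keep it at tau_end (matches the schedule in Table 1).
    if step >= decay_steps:
        return tau_end
    return tau_start + (tau_end - tau_start) * (step / decay_steps)

At each training step, the agent would compare a uniform random draw against this value to decide whether to query the tutor, as shown in Listing 2 in the Evaluation.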
Knowledge functions must be programmed by domain experts and allow them to easily bring different types of domain knowledge into the tutor component. Currently, Tutor4RL considers two types of knowledge functions: constrain functions and guide functions.

Constrain functions are programmable functions that constrain the behavior of the agent. At each time step t, a constrain function takes the state of the environment as input and returns a vector indicating, with the value 1 or 0, whether each action in the action space may be taken: 1 means the action is enabled, while 0 means the action is disabled and cannot be performed in this state. Therefore, constrain functions provide a mask that avoids unnecessary actions in certain states.
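The evaluation in this paper uses only a guide function (constrain functions are discussed as future work), but to make the interface concrete, a hedged sketch of a constrain function for Breakout is shown below. The helper find_bar_edges_x and the action ordering (no-op, fire, right, left) are the same assumptions used in Listing 1 of the Evaluation; the function itself is ours, not part of the paper's implementation.

def constrain_function(obs):
    # Illustrative constrain function (not part of the paper's evaluation).
    # Returns a 0/1 mask over the actions (no-op, fire, right, left);
    # a 0 disables an action in the current state.
    bar_x_left, bar_x_right = find_bar_edges_x(obs)
    mask = [1, 1, 1, 1]
    if bar_x_left <= 0:                   # bar already touches the left wall
        mask[3] = 0                       # disable "left"
    if bar_x_right >= obs.shape[1] - 1:   # bar already touches the right wall
        mask[2] = 0                       # disable "right"
    return mask

Such a mask could be applied on top of a guide function's output (or the policy's action values) by zeroing out disabled actions before taking the argmax.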
Guide functions are programmable functions that express domain heuristics the agent uses to guide its decisions, especially in moments of high uncertainty, e.g., at the start of the learning process. Each guide function takes the current RL state and reward as input and outputs a vector representing the weight of each preferred action according to the encoded domain knowledge.

The benefit of Tutor4RL is twofold:

1. During training, the tutor enables reasonable performance, as opposed to the unreliable performance of an inexperienced agent, while generating experience for training. Furthermore, the experience generated by the tutor is important because it provides examples of good behavior.

2. The knowledge of the tutor does not need to be perfect or extensive. The tutor might have only partial knowledge about the environment, i.e., know what to do in certain cases only, or its knowledge might not be perfectly accurate. The tutor provides some "rules of thumb" that the agent can follow during training; based on experience, the agent can improve upon these decisions, achieving a higher reward than the tutor.

Evaluation

We implement Tutor4RL by modifying the DQN agent (Mnih et al. 2015), using the Keras-RL library (Plappert 2016) along with TensorFlow. In order to make our evaluation reproducible, we choose a well-known workload for RL: playing Atari games. In particular, we select the Atari game Breakout, using the BreakoutDeterministic-v4 environment from OpenAI Gym (Brockman et al. 2016). We compare our approach to a standard DQN agent as implemented by Keras-RL and use the same set of parameters for both the DQN agent with Tutor4RL and the one without. The parameters used for the agents are detailed in Table 1.

Parameter     | DQN                                                        | DQN + Tutoring
Policy        | Epsilon greedy                                             | Epsilon greedy
Epsilon       | [0.3-0.1], decreasing linearly from step 0 to step 1.75M   | [0.3-0.1], decreasing linearly from step 0 to step 1.75M
Gamma         | 0.99                                                       | 0.99
Warmup steps  | 50000                                                      | 50000
Optimizer     | Adam with lr=0.00025                                       | Adam with lr=0.00025
Tau           | -                                                          | [1-0], decreasing linearly from step 0 to step 1.5M

Table 1: Parameters used in evaluation.

In the BreakoutDeterministic-v4 environment, the observation is an RGB image of the screen, an array of shape (210, 160, 3), and four actions are available: no operation, fire (which starts the game by "throwing" the ball), right and left. Each action is repeated for k = 4 frames. In order to simplify the state space of our agent, we pre-process each frame, converting it to greyscale and reducing its resolution to (105, 105, 1).
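As an illustration of this preprocessing step, the following is a minimal sketch; it is not the paper's code, and it assumes the Pillow library and simple bilinear resizing as one plausible way to obtain the (105, 105, 1) state shape described above.

import numpy as np
from PIL import Image

def preprocess_frame(frame):
    # Convert an RGB Breakout frame of shape (210, 160, 3) to greyscale
    # and resize it to (105, 105, 1), the state shape used by the agent.
    img = Image.fromarray(frame).convert("L")   # greyscale
    img = img.resize((105, 105))                # reduce resolution
    return np.asarray(img, dtype=np.uint8).reshape(105, 105, 1)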
We implement a simple guide function that takes the pre-processed frame, locates the ball and the bar on the X axis, and returns "fire" if no ball is found, or moves the bar in the direction of the ball (left or right) if the ball is not above the bar. The simplified code for this function can be seen in Listing 1. In addition, Listing 2 shows the simplified code the agent uses in each step of its training to choose between the tutor's decision and the policy's decision.

def guide_function(obs):
    # Find bar and ball in frame.
    bar_x_left, bar_x_right = find_bar_edges_x(obs)
    ball_x = find_ball(obs)
    if ball_x is not None:
        # Where to move bar.
        if bar_x_left > ball_x:
            return [0, 0, 0, 1]  # left
        elif bar_x_right < ball_x:
            return [0, 0, 1, 0]  # right
    return [0, 1, 0, 0]  # fire

Listing 1: Our implementation of a guide function for Breakout. We check the position of the ball and the bar in the current frame and move the bar towards the ball.

def select_action(obs, tau, guide_function):
    if numpy.random.uniform() < tau:
        # Use tutor.
        tf_output = guide_function(obs)
        action = numpy.argmax(tf_output)
    else:
        # Use policy normally.
        action = policy.select_action()
    return action

Listing 2: Selection of an action when training the agent.

Figure 3 depicts the mean reward per episode of the plain DQN agent and the DQN agent with Tutor4RL during training. During the beginning of training and until step 500,000, the plain DQN agent shows the expected low reward (< 15 points) because it starts with no knowledge, while the DQN agent with Tutor4RL, thanks to its tutor knowledge, achieves a mean reward between 15 and 35 points, roughly double the maximum of the plain DQN agent. From step 500,000 the plain DQN agent starts to improve, but it is not until step 1.1M that it shows results as good as the tutored one. From there on we see a similar reward for both agents, with the DQN agent with Tutor4RL achieving a slightly higher mean reward in most cases. Because τ decreases linearly throughout training, the tutor is used less as training progresses. Finally, at step 1.5M, τ = 0 and the tutor is no longer used. It is important to note that from this point on the reward does not decrease but keeps improving with the agent's learning. Moreover, we test both agents after 1.75M steps: the plain DQN agent achieves an average reward of 40.75 points while Tutor4RL achieves a reward of 43. Note that this reward comes only from the learned policy of the agents, keeping ε = 0.05, i.e., no tutor knowledge is used.

Figure 3: Average mean reward per episode achieved by the plain DQN agent and the DQN agent with Tutor4RL during training. Data was averaged over 4 tests for each agent with a rolling mean of 20 episodes; bands show the 0.90 confidence interval.

Discussion and Future Work

Tutor4RL is ongoing research and as such we plan several improvements. First, the decision about when to use the tutor is an important aspect of our approach. Currently, we use τ as a parameter that decreases as the agent gathers more experience in each step. However, this can be improved by taking into account the actual learning progress of the agent, i.e., how "good" the actions selected by the policy are, or how certain the policy is about the action to be taken in a given state. Second, as discussed before, we plan to evaluate constrain functions to limit the behavior of the agent. This can help in situations in which an action does not make sense, e.g., in Breakout, moving the bar to the left when it is already in its leftmost position. Last, the accuracy of guide functions can vary; thus, the decisions of the tutor can be improved by weighting the decisions of the functions according to their accuracy.

Conclusion

We have demonstrated Tutor4RL, a method that uses external knowledge functions to improve the initial performance of reinforcement learning agents. Tutor4RL targets deployment scenarios where historical data for training is not available and building a simulator is impractical. Our results show that Tutor4RL achieves 3 times higher reward than an agent without external knowledge in its initial training stage.

Acknowledgments

The research leading to these results has received funding from the European Community's Horizon 2020 research and innovation programme under grant agreement no 779747.

References

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.

Dulac-Arnold, G.; Mankowitz, D.; and Hester, T. 2019. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901.

Fadel Argerich, M.; Cheng, B.; and Fürst, J. 2019. Reinforcement learning based orchestration for elastic services. arXiv preprint arXiv:1904.12676.

Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Horgan, D.; Quan, J.; Sendonaris, A.; Osband, I.; et al. 2018. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Kober, J.; Bagnell, J. A.; and Peters, J. 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32(11):1238-1274.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518.

Moreno, D. L.; Regueiro, C. V.; Iglesias, R.; and Barro, S. 2004. Using prior knowledge to improve reinforcement learning in mobile robotics. Proc. Towards Autonomous Robotics Systems. Univ. of Essex, UK.

Plappert, M. 2016. keras-rl. https://github.com/keras-rl/keras-rl.

Ratner, A.; Bach, S. H.; Ehrenberg, H.; Fries, J.; Wu, S.; and Ré, C. 2017. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment 11(3).

Schaarschmidt, M.; Kuhnle, A.; Ellis, B.; Fricke, K.; Gessert, F.; and Yoneki, E. 2018. LIFT: Reinforcement learning in computer systems by learning from demonstrations. arXiv preprint arXiv:1808.07903.

Sutton, R. S.; Barto, A. G.; et al. 1998. Introduction to Reinforcement Learning, volume 2. MIT Press, Cambridge.

Theocharous, G.; Thomas, P. S.; and Ghavamzadeh, M. 2015. Personalized ad recommendation systems for life-time value optimization with guarantees. In Twenty-Fourth International Joint Conference on Artificial Intelligence.