Tutor4RL: Guiding Reinforcement Learning with External Knowledge

Mauricio Fadel Argerich, Jonathan Fürst, Bin Cheng
NEC Laboratories Europe
Kurfürsten-Anlage 36, 69115 Heidelberg, Germany
{mauricio.fadel@neclab.eu, jonathan.fuerst@neclab.eu, bin.cheng@neclab.eu}

Abstract

We introduce Tutor4RL, a method that improves reinforcement learning (RL) performance during training by using external knowledge to guide the agent's decisions and experience. Current RL approaches need extensive experience to deliver good performance, which is not acceptable in many real systems where no simulation environment or considerable previous data is available. In Tutor4RL, external knowledge, such as expert or domain knowledge, is expressed as programmable functions that are fed to the RL agent. During its first steps, the agent uses these knowledge functions to decide the best action, guiding its exploration and providing better performance from the start. As the agent gathers experience, it increasingly exploits its learned policy, eventually leaving its tutor behind. We demonstrate Tutor4RL with a DQN agent. In our tests, Tutor4RL achieves more than 3 times higher reward at the beginning of its training than an agent with no external knowledge.

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Introduction

Reinforcement Learning (RL) has achieved great success in fields such as robotics (Kober, Bagnell, and Peters 2013), recommender systems (Theocharous, Thomas, and Ghavamzadeh 2015) and video games, in which RL has even surpassed human performance (Mnih et al. 2015). However, before achieving this, RL often delivers poor performance for an extended time, possibly for millions of iterations, until it has gathered enough experience. In practice, this problem is approached mainly with two RL techniques: (1) training via simulation and (2) learning from historical data.

Recently, researchers have applied RL to computer systems tasks, such as database management system configuration (Schaarschmidt et al. 2018) or container orchestration for Big Data systems (Fadel Argerich, Cheng, and Fürst 2019). Here, experience data for the specific task might not be available (e.g., the complexity of a data analytics task depends on the processed dataset, which is not known beforehand) and the context of the agent changes drastically with each deployment (e.g., the execution environment is different). Because of this, training on previous data or in simulation is not feasible. Thus, the agent needs to gather its experience online, from the performance of a live system where each action the agent explores has a real cost that impacts the system. To apply RL in such scenarios, an agent needs to provide "good-enough" performance from the start and act safely throughout the learning (Dulac-Arnold, Mankowitz, and Hester 2019), a fundamental problem that must be solved to make RL applicable to real-world use cases.

However, in our application of RL to container orchestration in Big Data systems, we have experienced that some knowledge of the environment is usually available before the deployment of the agent. This knowledge includes common heuristics or notions, well known to domain experts (e.g., DevOps engineers). In our work, we aim to make this knowledge accessible to the RL agent. Our intuition is that, just as a person prepares for a task beforehand by gathering knowledge from sources such as other people, manuals, etc., we can provide external knowledge that the RL agent can use when not enough experience has been gathered.

To realize this concept, we introduce Tutor4RL, a method that guides RL via external knowledge. In Tutor4RL, external knowledge, such as expert or domain knowledge, is expressed as programmable functions that the agent uses during its training phase, and especially during its initial steps, to guide its behavior and learning. Thanks to Tutor4RL, the agent can perform in a reliable way from the start and achieve higher performance in a shorter training time. As the RL agent gathers more experience, it learns its policy and leaves the tutor behind, improving on its results thanks to its empirical knowledge. We take some inspiration from the recently proposed Data Programming concept for supervised learning, in which functions developed by experts are used to encode weak supervision sources (Ratner et al. 2017).

We test our approach with a DQN agent in the Atari game Breakout and compare it to a plain DQN agent. In our tests, Tutor4RL achieves more than a 3 times higher reward than the plain agent at the beginning of its training, and keeps up its performance even after leaving the tutor behind, i.e., when only using its learned policy.
Background and Motivation

As illustrated in Figure 1 (Sutton, Barto, and others 1998), in Reinforcement Learning (RL) an agent learns to control an unknown environment to achieve a certain goal while interacting with it. The agent interacts with the environment in discrete time steps t = 0, 1, .... In each step, the agent receives a representation s_t ∈ S of the state of the environment and a numerical signal r_t called the reward, and performs an action a_t ∈ A that leads to the state s_{t+1} and reward r_{t+1}, perceived by the agent in the next time step. S and A represent the sets of states and actions, respectively.

Figure 1: The RL framework and its elements. The agent observes state s_t and reward r_t from the environment and performs action a_t.

The agent's behaviour is defined by its policy π, which provides a mapping from states S to actions A. The value function q_π(s, a) represents the expected future reward received when taking action a in state s and following policy π. The goal of the agent is to find a policy that maximizes cumulative reward in the long run.

To do this, the agent learns from its experience: for each performed action a_t, it uses the collected observation (s_{t+1}, r_{t+1}) to optimize its policy π based on different models of the value function, such as a tabular model (Sutton, Barto, and others 1998) or a deep neural network model (Mnih et al. 2015). Existing studies show that RL can lead to a reasonable model for determining which action to take in each state after learning from a large amount of experience data. However, the biggest problem is how the RL agent can learn fast and efficiently from experience data.
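To make this learning loop concrete, the following minimal tabular Q-learning sketch implements the interaction of Figure 1 together with the tabular value model mentioned above. It is illustrative only and not part of Tutor4RL; it assumes a Gym-style environment with discrete, hashable states, where reset() returns a state and step(action) returns (next_state, reward, done, info).

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # q[(s, a)] approximates the value function q_pi(s, a).
    q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore a random action with probability epsilon.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # Update the tabular estimate from the observed transition (s, a, r, s').
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

With a large state/action space, filling such a table by random exploration is exactly the slow and costly process discussed next.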
There have been different approaches to addressing this problem in the state of the art. A simple approach is to explore the state space randomly, but this is usually time-consuming and costly when the state/action space is large. The drawback of this approach has been reported in our previous study (Fadel Argerich, Cheng, and Fürst 2019), where RL was used to automatically decide the configuration and deployment actions of a data processing pipeline in a cloud and edge environment.

Another approach is to gain experience via simulation. With enough computational resources, we can easily produce a large amount of experience data in a short time, but it is difficult to ensure that the simulated experiences are realistic enough to reflect the actual situations in the observed system.

Recently, a new trend has emerged that leverages external knowledge to improve the exploration efficiency of RL agents. For example, in (Moreno et al. 2004) and (Hester et al. 2018), prior knowledge such as pre-trained models and policies is used to bootstrap the exploration phase of an RL agent. However, this type of prior knowledge still originates in previous training and is limited by the availability of such data.

Instead of relying on any pre-trained model, we explore how to utilize a set of programmable knowledge functions to guide the exploration of an RL agent so that it can quickly make effective decisions, even after just a few exploration steps. We call our method Tutor4RL. Unlike existing approaches, Tutor4RL requires no previous training and is therefore a more practical approach for the use of RL in real systems. To the best of our knowledge, Tutor4RL is the first to apply programmable knowledge functions to RL for improving training performance and sample efficiency.

Tutor4RL

Figure 2 shows the overall design of Tutor4RL. Compared to traditional RL, we add a new component called the Tutor, which guides the agent to make informed decisions during training. The tutor can guide the agent because it directly leverages a set of knowledge functions defined by domain experts. In this way, Tutor4RL helps the agent avoid blind decisions at the beginning.

Figure 2: Overall working of Tutor4RL. Domain experts provide knowledge functions to the tutor, which advises the agent as it observes states and rewards from the environment and selects actions.

The tutor possesses external knowledge and interacts with the agent during training. The tutor takes as input the state of the environment and outputs the action to take, in a similar way to the agent's policy. However, the tutor is implemented as programmable functions in which external knowledge is used to decide the mapping between states and actions; e.g., for Atari Breakout, the tutor takes the frame from the video game as input and outputs the direction in which the bar should be moved. At every time step, the tutor interacts with the agent and gives it advice for making better decisions, based on all provided knowledge functions.

One issue for the agent to consider is when and how often it should ask the tutor for advice. In a similar fashion to epsilon-greedy exploration, we define τ as a threshold parameter that controls when the agent takes the actions suggested by the tutor instead of following its own decisions. τ is a parameter of our model and the best value to initialize it with depends on the use case; its initial value is therefore left to be decided during implementation.
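For concreteness, the evaluation below anneals τ linearly from 1 to 0 over the first 1.5M training steps (see Table 1). A minimal sketch of such a schedule, with a function name of our own choosing, could look as follows.

def linear_tau(step, tau_start=1.0, tau_end=0.0, decay_steps=1_500_000):
    # Linearly anneal the tutor threshold tau over the first decay_steps
    # training steps, then keep it at tau_end (matches the schedule in Table 1).
    if step >= decay_steps:
        return tau_end
    return tau_start + (tau_end - tau_start) * (step / decay_steps)

At each training step, the agent would compare a uniform random draw against this value to decide whether to query the tutor, as shown in Listing 2 in the Evaluation.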
Knowledge functions must be programmed by domain experts and allow them to easily bring different types of domain knowledge into the tutor component. Currently, Tutor4RL considers two types of knowledge functions: constrain functions and guide functions.

Constrain functions are programmable functions that constrain the behavior of the agent. At each time step t, a constrain function takes the state of the environment as input and returns a vector indicating, with the value 1 or 0, whether each action in the action space may be taken: 1 means the action is enabled, while 0 means the action is disabled and cannot be performed in this state. Therefore, constrain functions provide a mask that avoids unnecessary actions in certain states.
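The evaluation in this paper uses only a guide function (constrain functions are discussed as future work), but to make the interface concrete, a hedged sketch of a constrain function for Breakout is shown below. The helper find_bar_edges_x and the action ordering (no-op, fire, right, left) are the same assumptions used in Listing 1 of the Evaluation; the function itself is ours, not part of the paper's implementation.

def constrain_function(obs):
    # Illustrative constrain function (not part of the paper's evaluation).
    # Returns a 0/1 mask over the actions (no-op, fire, right, left);
    # a 0 disables an action in the current state.
    bar_x_left, bar_x_right = find_bar_edges_x(obs)
    mask = [1, 1, 1, 1]
    if bar_x_left <= 0:                   # bar already touches the left wall
        mask[3] = 0                       # disable "left"
    if bar_x_right >= obs.shape[1] - 1:   # bar already touches the right wall
        mask[2] = 0                       # disable "right"
    return mask

Such a mask could be applied on top of a guide function's output (or the policy's action values) by zeroing out disabled actions before taking the argmax.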
Guide functions are programmable functions that express domain heuristics the agent uses to guide its decisions, especially in moments of high uncertainty, e.g., at the start of the learning process. Each guide function takes the current RL state and reward as input and outputs a vector representing the weight of each preferred action according to the encoded domain knowledge.

The benefit of Tutor4RL is twofold:

1. During training, the tutor enables reasonable performance, as opposed to the unreliable performance of an inexperienced agent, while generating experience for training. Furthermore, the experience generated by the tutor is important because it provides examples of good behavior.

2. The knowledge of the tutor does not need to be perfect or extensive. The tutor might have only partial knowledge about the environment, i.e., know what to do in certain cases only, or its knowledge might not be perfectly accurate. The tutor provides some "rules of thumb" that the agent can follow during training; based on experience, the agent can improve upon these decisions, achieving a higher reward than the tutor.

Evaluation

We implement Tutor4RL by modifying the DQN agent (Mnih et al. 2015), using the Keras-RL library (Plappert 2016) along with TensorFlow. In order to make our evaluation reproducible, we choose a well-known workload for RL: playing Atari games. In particular, we select the Atari game Breakout, using the BreakoutDeterministic-v4 environment from OpenAI Gym (Brockman et al. 2016). We compare our approach to a standard DQN agent as implemented by Keras-RL and use the same set of parameters for both the DQN agent with Tutor4RL and the one without. The parameters used for the agents are detailed in Table 1.

Parameter     | DQN                                                        | DQN + Tutoring
Policy        | Epsilon greedy                                             | Epsilon greedy
Epsilon       | [0.3-0.1], decreasing linearly from step 0 to step 1.75M   | [0.3-0.1], decreasing linearly from step 0 to step 1.75M
Gamma         | 0.99                                                       | 0.99
Warmup steps  | 50000                                                      | 50000
Optimizer     | Adam with lr=0.00025                                       | Adam with lr=0.00025
Tau           | -                                                          | [1-0], decreasing linearly from step 0 to step 1.5M

Table 1: Parameters used in evaluation.

In the BreakoutDeterministic-v4 environment, the observation is an RGB image of the screen, an array of shape (210, 160, 3), and four actions are available: no operation, fire (which starts the game by "throwing" the ball), right and left. Each action is repeated for k = 4 frames. In order to simplify the state space of our agent, we pre-process each frame, converting it to greyscale and reducing its resolution to (105, 105, 1).
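As an illustration of this preprocessing step, the following is a minimal sketch; it is not the paper's code, and it assumes the Pillow library and simple bilinear resizing as one plausible way to obtain the (105, 105, 1) state shape described above.

import numpy as np
from PIL import Image

def preprocess_frame(frame):
    # Convert an RGB Breakout frame of shape (210, 160, 3) to greyscale
    # and resize it to (105, 105, 1), the state shape used by the agent.
    img = Image.fromarray(frame).convert("L")   # greyscale
    img = img.resize((105, 105))                # reduce resolution
    return np.asarray(img, dtype=np.uint8).reshape(105, 105, 1)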
We implement a simple guide function that takes the pre-processed frame, locates the ball and the bar on the X axis, and returns "fire" if no ball is found, or moves the bar in the direction of the ball (left or right) if the ball is not above the bar. The simplified code for this function can be seen in Listing 1. In addition, Listing 2 shows the simplified code the agent uses in each step of its training to choose between the tutor's decision and the policy's decision.

def guide_function(obs):
    # Find bar and ball in frame.
    bar_x_left, bar_x_right = find_bar_edges_x(obs)
    ball_x = find_ball(obs)
    if ball_x is not None:
        # Where to move bar.
        if bar_x_left > ball_x:
            return [0, 0, 0, 1]  # left
        elif bar_x_right < ball_x:
            return [0, 0, 1, 0]  # right
    return [0, 1, 0, 0]  # fire

Listing 1: Our implementation of a guide function for Breakout. We check the position of the ball and the bar in the current frame and move the bar towards the ball.

def select_action(obs, tau, guide_function):
    if numpy.random.uniform() < tau:
        # Use tutor.
        tf_output = guide_function(obs)
        action = numpy.argmax(tf_output)
    else:
        # Use policy normally.
        action = policy.select_action()
    return action

Listing 2: Selection of an action when training the agent.

Figure 3 depicts the mean reward per episode of the plain DQN agent and the DQN agent with Tutor4RL during training. During the beginning of training and until step 500,000, the plain DQN agent shows the expected low reward (< 15 points) because it starts with no knowledge, while the DQN agent with Tutor4RL, thanks to its tutor knowledge, achieves a mean reward between 15 and 35 points, roughly double the maximum of the plain DQN agent. From step 500,000 the plain DQN agent starts to improve, but it is not until step 1.1M that it shows results as good as the tutored one. From there on we see a similar reward for both agents, with the DQN agent with Tutor4RL achieving a slightly higher mean reward in most cases. Because τ decreases linearly throughout training, the tutor is used less as training progresses. Finally, at step 1.5M, τ = 0 and the tutor is no longer used. It is important to note that from this point on the reward does not decrease but keeps improving with the agent's learning. Moreover, we test both agents after 1.75M steps: the plain DQN agent achieves an average reward of 40.75 points while Tutor4RL achieves a reward of 43. Note that this reward comes only from the learned policy of the agents, keeping ε = 0.05, i.e., no tutor knowledge is used.

Figure 3: Average mean reward per episode achieved by the plain DQN agent and the DQN agent with Tutor4RL during training. Data was averaged over 4 tests for each agent with a rolling mean of 20 episodes; bands show the 0.90 confidence interval.

Discussion and Future Work

Tutor4RL is ongoing research and as such we plan several improvements. First, the decision about when to use the tutor is an important aspect of our approach. Currently, we use τ as a parameter that decreases as the agent gathers more experience in each step. However, this can be improved by taking into account the actual learning progress of the agent, i.e., how "good" the actions selected by the policy are, or how certain the policy is about the action to be taken in a given state. Second, as discussed before, we plan to evaluate constrain functions to limit the behavior of the agent. This can help in situations in which an action does not make sense, e.g., in Breakout, moving the bar to the left when it is already in its leftmost position. Last, the accuracy of guide functions can vary; thus, the decisions of the tutor can be improved by weighting the decisions of the functions according to their accuracy.

Conclusion

We have demonstrated Tutor4RL, a method that uses external knowledge functions to improve the initial performance of reinforcement learning agents. Tutor4RL targets deployment scenarios where historical data for training is not available and building a simulator is impractical. Our results show that Tutor4RL achieves 3 times higher reward than an agent without external knowledge in its initial training stage.

Acknowledgments

The research leading to these results has received funding from the European Community's Horizon 2020 research and innovation programme under grant agreement no 779747.

References

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.

Dulac-Arnold, G.; Mankowitz, D.; and Hester, T. 2019. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901.

Fadel Argerich, M.; Cheng, B.; and Fürst, J. 2019. Reinforcement learning based orchestration for elastic services. arXiv preprint arXiv:1904.12676.

Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Horgan, D.; Quan, J.; Sendonaris, A.; Osband, I.; et al. 2018. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Kober, J.; Bagnell, J. A.; and Peters, J. 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32(11):1238-1274.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518.

Moreno, D. L.; Regueiro, C. V.; Iglesias, R.; and Barro, S. 2004. Using prior knowledge to improve reinforcement learning in mobile robotics. Proc. Towards Autonomous Robotics Systems. Univ. of Essex, UK.

Plappert, M. 2016. keras-rl. https://github.com/keras-rl/keras-rl.

Ratner, A.; Bach, S. H.; Ehrenberg, H.; Fries, J.; Wu, S.; and Ré, C. 2017. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment 11(3).

Schaarschmidt, M.; Kuhnle, A.; Ellis, B.; Fricke, K.; Gessert, F.; and Yoneki, E. 2018. LIFT: Reinforcement learning in computer systems by learning from demonstrations. arXiv preprint arXiv:1808.07903.

Sutton, R. S.; Barto, A. G.; et al. 1998. Introduction to Reinforcement Learning, volume 2. MIT Press, Cambridge.

Theocharous, G.; Thomas, P. S.; and Ghavamzadeh, M. 2015. Personalized ad recommendation systems for life-time value optimization with guarantees. In Twenty-Fourth International Joint Conference on Artificial Intelligence.