         Tutor4RL: Guiding Reinforcement Learning with External Knowledge

                                 Mauricio Fadel Argerich, Jonathan Fürst, Bin Cheng
                                                    NEC Laboratories Europe
                                       Kurfürsten-Anlage 36, 69115 Heidelberg, Germany
                          {mauricio.fadel@neclab.eu, jonathan.fuerst@neclab.eu, bin.cheng@neclab.eu}




Abstract

We introduce Tutor4RL, a method to improve reinforcement learning (RL) performance during training, using external knowledge to guide the agent's decisions and experience. Current approaches to RL need extensive experience to deliver good performance, which is not acceptable in many real systems where no simulation environment or considerable previous data are available. In Tutor4RL, external knowledge, such as expert or domain knowledge, is expressed as programmable functions that are fed to the RL agent. During its first steps, the agent uses these knowledge functions to decide the best action, guiding its exploration and providing better performance from the start. As the agent gathers experience, it increasingly exploits its learned policy, eventually leaving its tutor behind. We demonstrate Tutor4RL with a DQN agent. In our tests, Tutor4RL achieves more than 3 times higher reward in the beginning of its training than an agent with no external knowledge.

Introduction

Reinforcement Learning (RL) has achieved great success in fields such as robotics (Kober, Bagnell, and Peters 2013), recommender systems (Theocharous, Thomas, and Ghavamzadeh 2015) and video games, in which RL has even surpassed human performance (Mnih et al. 2015). However, before achieving this, RL often delivers poor performance for an extended time, possibly for millions of iterations, until it has gathered enough experience. In practice, this problem is approached mainly with two RL techniques: (1) training via simulation and (2) learning from historical data.

Recently, researchers have applied RL to computer systems tasks, such as database management system configuration (Schaarschmidt et al. 2018) or container orchestration for Big Data systems (Fadel Argerich, Cheng, and Fürst 2019). Here, experience data for the specific task might not be available (e.g., the complexity of a data analytics task depends on the processed dataset, which is not known beforehand) and the context of the agent changes drastically with each deployment (e.g., the execution environment is different). Because of this, training on previous data or in simulation is not feasible. Thus, the agent needs to gather its experience online, from the performance of a live system where each action the agent explores has a real cost that impacts the system. To apply RL in such scenarios, an agent needs to provide "good-enough" performance from the start and act safely throughout the learning (Dulac-Arnold, Mankowitz, and Hester 2019), a fundamental problem that must be solved to make RL applicable to real-world use cases.

However, in our application of RL to container orchestration in Big Data systems, we have experienced that some knowledge of the environment is usually available before the deployment of the agent. This knowledge includes common heuristics or notions, well known to domain experts (e.g., DevOps engineers). In our work, we aim to make this knowledge accessible to the RL agent. Our intuition is that just as a person prepares for a task beforehand by gathering knowledge from sources such as other human beings, manuals, etc., we can provide external knowledge that the RL agent can use when not enough experience has been gathered.

To realize this concept, we introduce Tutor4RL, a method that guides RL via external knowledge. In Tutor4RL, external knowledge, such as expert or domain knowledge, is expressed as programmable functions that the agent uses during its training phase, and especially in its initial steps, to guide its behavior and learning. Thanks to Tutor4RL, the agent can perform in a reliable way from the start and achieve higher performance in a shorter training time. As the RL agent gathers more experience, it learns its policy and leaves the tutor behind, improving on its results thanks to its empirical knowledge. We take some inspiration from the recently proposed Data Programming concept for supervised learning, in which functions developed by experts are used to encode weak supervision sources (Ratner et al. 2017).

We test our approach with a DQN agent in the Atari game Breakout and compare it to a plain DQN agent. In our tests, Tutor4RL achieves more than 3 times higher reward than the plain agent in the beginning of its training, and keeps up its performance even after leaving the tutor behind, i.e. only using its learned policy.

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Background and Motivation

As illustrated in Figure 1 (Sutton, Barto, and others 1998), in Reinforcement Learning (RL) an agent learns to control an unknown environment to achieve a certain goal while interacting with it. The agent interacts with the environment in discrete time steps t = 0, 1, .... In each step, the agent receives a representation s_t ∈ S of the state of the environment and a numerical signal r_t called reward, and performs an action a_t ∈ A that leads to the state s_{t+1} and reward r_{t+1}, perceived by the agent in the next time step. S and A represent the sets of states and actions, respectively.

Figure 1: The RL framework and its elements (the agent and the environment exchanging state s_t, reward r_t and action a_t).

The agent's behaviour is defined by its policy π, which provides a mapping from states S to actions A. The value function q_π(s, a) represents the expected future reward received when taking action a in state s under policy π. The goal of the agent is to find a policy that maximizes cumulative reward in the long run.
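For reference, in the standard formulation (Sutton, Barto, and others 1998), with discount factor γ (the Gamma parameter in Table 1), this value function can be written as

    q_\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \;\middle|\; s_t = s,\, a_t = a \right]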
To do this, the agent learns from its experience: for each performed action a_t, it uses the collected observations (s_{t+1}, r_{t+1}) to optimize its policy π based on different models of the value function, such as a tabular model (Sutton, Barto, and others 1998) or a deep neural network model (Mnih et al. 2015). Existing studies show that RL can lead to a reasonable model for determining which action to take in each state after learning from a large amount of experience data. However, the biggest problem is how the RL agent can learn fast and efficiently from experience data.

There have been different approaches to addressing this problem in the state of the art. A simple approach is to explore the state space randomly, but this is usually time-consuming and costly when the state/action space is large. The drawback of this approach has been reported by our previous study (Fadel Argerich, Cheng, and Fürst 2019) in the case of leveraging RL to automatically decide the configuration and deployment actions of a data processing pipeline in a cloud and edge environment.

Another approach is to gain experience via simulation. With enough computational resources, we can easily produce lots of experience data in a short time, but it is difficult to ensure that the simulated experiences are realistic enough to reflect the actual situations in the observed system.

Recently, a new trend of leveraging external knowledge to improve the exploration efficiency of RL agents has emerged. For example, in (Moreno et al. 2004) and (Hester et al. 2018), prior knowledge such as pre-trained models and policies is used to bootstrap the exploration phase of an RL agent. However, this type of prior knowledge still originates in previous training and is limited by the availability of such data.

Instead of relying on any pre-trained model, we explore how to utilize a set of programmable knowledge functions to guide the exploration of an RL agent so that it can quickly make effective decisions, even after just a few exploration steps. We call our method Tutor4RL. Unlike existing approaches, Tutor4RL requires no previous training and is therefore a more practical approach for the use of RL in real systems. To the best of our knowledge, Tutor4RL is the first to apply programmable knowledge functions to RL to improve training performance and sample efficiency.

Tutor4RL

Figure 2 shows the overall design of Tutor4RL. Compared to traditional RL, we add a new component called Tutor to guide the agent to make informed decisions during training. The tutor is able to guide the agent because it can directly leverage a set of knowledge functions defined by domain experts. In this way, Tutor4RL can help the agent avoid blind decisions at the beginning. The tutor possesses external knowledge and interacts with the agent during training. The tutor takes as input the state of the environment and outputs the action to take, in a similar way to the agent's policy. However, the tutor is implemented as programmable functions, in which external knowledge is used to decide the mapping between states and actions; e.g., for Atari Breakout, the tutor takes the frame from the video game as input and outputs in which direction the bar should be moved. At every time step, the tutor interacts with the agent and gives it advice for making better decisions, based on all provided knowledge functions.

One issue for the agent to consider is when and how often it should ask the tutor for advice. In a similar fashion to epsilon-greedy exploration, we define τ as the threshold parameter with which the agent controls when it takes the suggested actions from the tutor instead of its own decisions. τ is a parameter of our model and the best value to initialize it depends on the use case; thus its initial value is left to be decided during implementation.

Figure 2: Overall Working of Tutor4RL. Domain experts provide knowledge functions to the tutor, which advises the agent while it interacts with the environment.
Knowledge functions must be programmed by domain experts and allow them to easily bring different types of domain knowledge into the tutor component. Currently, Tutor4RL considers the following two types of knowledge functions: Constrain Functions and Guide Functions.

Constrain Functions are programmable functions that constrain the behavior of the agent. At each time step t, a constrain function takes the state of the environment as input and returns a vector that indicates, with the value 1 or 0, whether each action in the action space can be taken: 1 means the action is enabled, while 0 means the action is disabled and cannot be performed in this state. Therefore, constrain functions provide a mask to avoid unnecessary actions in certain states.
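As an illustration only (constrain functions are described here but not used in the evaluation below), a hypothetical constrain function for Breakout could mask out moves that would push the bar past a screen edge. The sketch below assumes the same action order and the same find_bar_edges_x helper as Listing 1:

def constrain_function(obs):
    # Hypothetical sketch: disable movements past the screen edges.
    # Action order as in Listing 1: [no-op, fire, right, left].
    bar_x_left, bar_x_right = find_bar_edges_x(obs)  # helper assumed from Listing 1
    frame_width = obs.shape[1]
    mask = [1, 1, 1, 1]                 # all actions enabled by default
    if bar_x_left <= 0:                 # bar already at the left edge
        mask[3] = 0                     # disable "left"
    if bar_x_right >= frame_width - 1:  # bar already at the right edge
        mask[2] = 0                     # disable "right"
    return mask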
Guide Functions are programmable functions that express domain heuristics that the agent uses to guide its decisions, especially in moments of high uncertainty, e.g. the start of the learning process. Each guide function takes the current RL state and reward as input and outputs a vector representing the weight of each preferred action according to the encoded domain knowledge.

The benefit of Tutor4RL is twofold:

1. During training, the tutor enables reasonable performance, as opposed to the unreliable performance of an inexperienced agent, while generating experience for training. Furthermore, the experience generated by the tutor is important because it provides examples of good behavior.

2. The knowledge of the tutor does not need to be perfect or extensive. The tutor might have partial knowledge about the environment, i.e. know what to do in certain cases only, or its knowledge might not be perfectly accurate. The tutor provides some "rules of thumb" that the agent can follow during training and, based on experience, the agent can improve upon these decisions, achieving a higher reward than the tutor.

Evaluation

We implement Tutor4RL by modifying the DQN (Mnih et al. 2015) agent, using the library Keras-RL (Plappert 2016) along with TensorFlow. In order to make our evaluation reproducible, we choose a well-known workload for RL: playing Atari games. In particular, we select the Atari game Breakout, using the environment BreakoutDeterministic-v4 from OpenAI Gym (Brockman et al. 2016). We compare our approach to a standard DQN agent as implemented by Keras-RL and use the same set of parameters for both the DQN agent with Tutor4RL and the one without. The parameters used for the agents are detailed in Table 1.

Parameter      DQN                                               DQN + Tutoring
Policy         Epsilon greedy                                    Epsilon greedy
Epsilon        [0.3-0.1], decreasing linearly from step 0 to 1.75M   [0.3-0.1], decreasing linearly from step 0 to 1.75M
Gamma          0.99                                              0.99
Warmup steps   50,000                                            50,000
Optimizer      Adam with lr=0.00025                              Adam with lr=0.00025
Tau            -                                                 [1-0], decreasing linearly from step 0 to 1.5M

Table 1: Parameters used in evaluation.
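As a minimal sketch of the linear schedules listed in Table 1 (the exact decay code is not shown in the paper), both epsilon and tau can be computed per step as follows:

def linear_schedule(step, start, end, final_step):
    # Linearly interpolate from `start` at step 0 to `end` at `final_step`,
    # then hold the final value.
    fraction = min(step / float(final_step), 1.0)
    return start + fraction * (end - start)

# Example values following Table 1:
tau_at_750k = linear_schedule(750000, 1.0, 0.0, 1500000)      # -> 0.5
epsilon_at_750k = linear_schedule(750000, 0.3, 0.1, 1750000)  # -> ~0.214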
In the BreakoutDeterministic-v4 environment, the observation is an RGB image of the screen, i.e. an array of shape (210, 160, 3), and four actions are available: no operation, fire (which starts the game by "throwing the ball"), right and left. Each action is repeatedly performed for a duration of k = 4 frames. In order to simplify the state space of our agent, we pre-process each frame, converting it to greyscale and reducing its resolution to (105, 105, 1).
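A minimal sketch of this pre-processing step, assuming Pillow for the resizing (the paper does not specify the image library used):

import numpy
from PIL import Image

def preprocess_frame(frame):
    # frame: RGB screen of shape (210, 160, 3) from BreakoutDeterministic-v4.
    img = Image.fromarray(frame).convert('L')  # convert to greyscale
    img = img.resize((105, 105))               # reduce the resolution
    # Add a trailing channel axis to obtain shape (105, 105, 1).
    return numpy.expand_dims(numpy.asarray(img, dtype=numpy.uint8), axis=-1)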
We implement a simple guide function that takes the pre-processed frame, locates the ball and the bar on the X axis, and returns "fire" if no ball is found, or moves the bar in the direction of the ball (left or right) if the ball is not above the bar. The simplified code for this function can be seen in Listing 1. In addition, Listing 2 shows the simplified code the agent uses in each step of its training to choose between the tutor decision and the policy decision.

def guide_function(obs):
    # Find bar and ball in frame.
    bar_x_left, bar_x_right = find_bar_edges_x(obs)
    ball_x = find_ball(obs)
    if ball_x is not None:
        # Where to move bar.
        if bar_x_left > ball_x:
            return [0, 0, 0, 1]  # left
        elif bar_x_right < ball_x:
            return [0, 0, 1, 0]  # right
    return [0, 1, 0, 0]  # fire

Listing 1: Our implementation of a guide function for Breakout. We check the position of the ball and the bar in the current frame and move the bar towards the ball.

import numpy

def select_action(obs, tau, guide_function):
    if numpy.random.uniform() < tau:
        # Use tutor.
        tf_output = guide_function(obs)
        action = numpy.argmax(tf_output)
    else:
        # Use policy normally (`policy` is the agent's learned policy).
        action = policy.select_action()
    return action

Listing 2: Selection of action when training the agent.

Figure 3: Average mean reward per episode achieved by the plain DQN agent and the DQN agent with Tutor4RL during training. Data was averaged over 4 tests for each agent with a rolling mean of 20 episodes; bands show the 0.90 confidence interval.


Figure 3 depicts the mean reward per episode of the plain DQN agent and the DQN agent with Tutor4RL during training. From the beginning of training until step 500,000, the plain DQN agent shows an expectedly low reward (< 15 points) because it starts with no knowledge, while the DQN agent with Tutor4RL, thanks to its tutor's knowledge, manages to achieve a mean reward between 15 and 35 points, ca. double the maximum of the plain DQN agent. From step 500,000 we see how the plain DQN agent starts to improve, but it is not until step 1.1M that the plain DQN agent shows results as good as the tutored one. From there on we see a similar reward for both agents, with the DQN agent with Tutor4RL achieving a slightly higher mean reward in most cases. Because τ is decreased uniformly throughout training, the tutor is used less as training progresses. Finally, at step 1.5M, τ = 0 and the tutor is no longer used. It is important to note that from this point on the reward does not decrease but keeps improving with the agent's learning. Moreover, we test both agents after 1.75M steps: the plain DQN agent achieves an average reward of 40.75 points while Tutor4RL achieves a reward of 43. Note that this reward comes only from the learned policy of the agents, keeping ε = 0.05, i.e. no tutor knowledge is used.

Discussion and Future Work

Tutor4RL is ongoing research and as such we plan to develop several improvements. First, the decision about when to use the tutor is an important aspect of our approach. Currently, we use τ as a parameter that decreases as the agent gathers more experience in each step. However, this can be improved by taking into account the actual learning of the agent, i.e. how "good" the actions selected by the policy are, or how certain the policy is about the action to be taken in the given state. Second, as discussed before, we plan to evaluate constrain functions to limit the behavior of the agent. This can help in situations in which an action does not make sense, e.g. in Breakout, moving the bar to the left when it is already in its leftmost position. Last, the accuracy of guide functions can vary and thus the decisions of the tutor can be improved by weighting the decisions of the functions according to their accuracy.
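Purely as a sketch of this last idea (it is not part of the current implementation), the tutor could combine the preference vectors returned by several guide functions, weighted by per-function accuracy estimates, and pick the action with the highest combined weight:

import numpy

def combine_guide_functions(obs, guide_functions, weights, num_actions=4):
    # guide_functions: list of functions as in Listing 1, each returning a
    # weight vector over the action space.
    # weights: hypothetical per-function accuracy estimates.
    combined = numpy.zeros(num_actions)
    for f, w in zip(guide_functions, weights):
        combined += w * numpy.asarray(f(obs), dtype=float)
    return int(numpy.argmax(combined))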
Conclusion

We have demonstrated Tutor4RL, a method that uses external knowledge functions to improve the initial performance of reinforcement learning agents. Tutor4RL targets deployment scenarios where historical data for training is not available and building a simulator is impractical. Our results show that Tutor4RL achieves 3 times higher reward than an agent without external knowledge in its initial stage.

Acknowledgments

The research leading to these results has received funding from the European Community's Horizon 2020 research and innovation programme under grant agreement no 779747.

References

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.

Dulac-Arnold, G.; Mankowitz, D.; and Hester, T. 2019. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901.

Fadel Argerich, M.; Cheng, B.; and Fürst, J. 2019. Reinforcement learning based orchestration for elastic services. arXiv preprint arXiv:1904.12676.

Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Horgan, D.; Quan, J.; Sendonaris, A.; Osband, I.; et al. 2018. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Kober, J.; Bagnell, J. A.; and Peters, J. 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32(11):1238–1274.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518.

Moreno, D. L.; Regueiro, C. V.; Iglesias, R.; and Barro, S. 2004. Using prior knowledge to improve reinforcement learning in mobile robotics. Proc. Towards Autonomous Robotic Systems, Univ. of Essex, UK.

Plappert, M. 2016. keras-rl. https://github.com/keras-rl/keras-rl.

Ratner, A.; Bach, S. H.; Ehrenberg, H.; Fries, J.; Wu, S.; and Ré, C. 2017. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment 11(3).

Schaarschmidt, M.; Kuhnle, A.; Ellis, B.; Fricke, K.; Gessert, F.; and Yoneki, E. 2018. LIFT: Reinforcement learning in computer systems by learning from demonstrations. arXiv preprint arXiv:1808.07903.

Sutton, R. S.; Barto, A. G.; et al. 1998. Introduction to Reinforcement Learning, volume 2. MIT Press, Cambridge.

Theocharous, G.; Thomas, P. S.; and Ghavamzadeh, M. 2015. Personalized ad recommendation systems for life-time value optimization with guarantees. In Twenty-Fourth International Joint Conference on Artificial Intelligence.