Unsupervised Methods For Subgoal Discovery During Intrinsic Motivation in Model-Free Hierarchical Reinforcement Learning

Jacob Rafati and David C. Noelle
{jrafatiheravi,dnoelle}@ucmerced.edu
Electrical Engineering and Computer Science
Computational Cognitive Neuroscience Laboratory
University of California, Merced
5200 North Lake Road, Merced, CA 95343, USA

Abstract

Common approaches to Reinforcement Learning (RL) are seriously challenged by large-scale applications involving huge state spaces and sparse delayed reward feedback. Hierarchical Reinforcement Learning (HRL) methods attempt to address this scalability issue by learning action selection policies at multiple levels of temporal abstraction. Abstraction can be had by identifying a relatively small set of states that are likely to be useful as subgoals, in concert with the learning of corresponding skill policies to achieve those subgoals. Many approaches to subgoal discovery in HRL depend on the analysis of a model of the environment, but the need to learn such a model introduces its own problems of scale. Once subgoals are identified, skills may be learned through intrinsic motivation, introducing an internal reward signal marking subgoal attainment. In this paper, we present a novel model-free method for subgoal discovery using incremental unsupervised learning over a small memory of the most recent experiences (trajectories) of the agent. When combined with an intrinsic motivation learning mechanism, this method learns both subgoals and skills, based on experiences in the environment. Thus, we offer an original approach to HRL that does not require the acquisition of a model of the environment, suitable for large-scale applications. We demonstrate the efficiency of our method on two RL problems with sparse delayed feedback: a variant of the rooms environment and the first screen of the ATARI 2600 Montezuma's Revenge game.
Introduction

The reinforcement learning problem suffers from serious scaling issues. Hierarchical Reinforcement Learning (HRL) is an important computational approach intended to tackle problems of scale by learning to operate over different levels of temporal abstraction (Sutton, Precup, and Singh 1999). The acquisition of hierarchies of reusable skills is one of the distinguishing characteristics of biological intelligence, and the learning of such hierarchies is an important open problem in computational reinforcement learning. Also, in the context of games, the development of robust HRL methods will provide a means to acquire relevant knowledge at multiple levels of abstraction, potentially speeding learning and supporting generalization.

A number of approaches to HRL have been suggested. One approach focuses on action sequences, subpolicies, or "options" that appear repeatedly during the learning of a set of tasks. Such frequently reused subpolicies can be abstracted into skills that can be treated as individual actions at a higher level of abstraction. A somewhat different approach to temporal abstraction involves identifying a set of states that make for useful subgoals. This introduces a major open problem in HRL: that of subgoal discovery.

A variety of researchers have proposed approaches to identifying useful subpolicies and reifying them as skills (Pickett and Barto 2002; Thrun and Schwartz 1995). For example, Sutton, Precup, and Singh (1999) proposed the options framework, which involves abstractions over the space of actions. At each step, the agent chooses either a one-step "primitive" action or a "multi-step" action policy (an option). Each option defines a policy over actions (either primitive or other options) and comes to completion according to a termination condition. Other researchers have focused on identifying subgoals (states that are generally useful to attain) and learning a collection of skills that allow the agent to efficiently reach those subgoals. Some approaches to subgoal discovery maintain the value function in a large look-up table (Sutton, Precup, and Singh 1999; Goel and Huber 2003; Şimşek, Wolfe, and Barto 2005), and most of these methods require building the state transition graph, providing a model of the environment and the agent's possible interactions with it (Machado, Bellemare, and Bowling 2017; Şimşek, Wolfe, and Barto 2005; Goel and Huber 2003).

Formally, the state transition graph is a directed graph G = (V, E) with a set of vertices, V ⊆ S, and a set of edges, E ⊆ A(S), where S is the set of states and A(S) is the set of allowable actions. Since actions typically modify the state of the agent, each directed edge, (s, s') ∈ E, indicates an action that takes the agent from state s to state s'. In nondeterministic environments, a probability distribution over subsequent states, given the current state and an action, p(s'|s, a), is maintained as part of the model of the environment. One method of this kind that was applied to a somewhat larger scale task (the first screen of the ATARI 2600 game called Montezuma's Revenge) is that of Machado & Bowling (2016). This method constructs the combinatorial transition graph Laplacian matrix, and an eigen-decomposition of that matrix produces candidate subgoals. While it was shown that some of these candidates make for useful subgoals, only heuristic domain-sensitive methods have been reported for identifying useful subgoals from the large set of candidates (e.g., thousands).

Thus, previously proposed subgoal discovery methods have provided useful insights and have been demonstrated to improve learning, but there continue to be challenges with regard to scalability and generalization. Scaling to large state spaces will generally mandate the use of some form of nonlinear function approximator to encode the value function, rather than a look-up table. More importantly, as the scale of the reinforcement learning problem increases, the tractability of obtaining a good model of the environment, capturing all relevant state transition probabilities, precipitously decreases.
Once useful subgoals are discovered, an HRL agent should be able to learn the skills to attain those subgoals through the use of intrinsic motivation: artificially rewarding the agent for attaining selected subgoals. The nature and origin of "good" intrinsic reward functions is an open question in reinforcement learning, however, and a number of approaches have been proposed. Singh et al. (2010) explored agents with intrinsic reward structures in order to learn generic options that can apply to a wide variety of tasks. Value functions have also been generalized to consider goals along with states (Vezhnevets et al. 2017). Such a parameterized universal value function, q(s, g, a; w), integrates the value functions for multiple skills into a single function taking the current subgoal, g, as an argument.

Recently, Kulkarni et al. (2016) proposed a scheme for temporal abstraction that involves simultaneously learning options and a hierarchical control policy in a deep reinforcement learning framework. Their approach does not use separate Q-functions for each option, but, instead, treats the option as an argument. This method lacks a technique for automatic subgoal discovery, however, forcing the system designer to specify a set of promising subgoal candidates in advance. The approach proposed in this paper is inspired by Kulkarni et al. (2016), which has advantages in terms of scalability and generalization, but it incorporates automatic subgoal discovery.

It is important to note that model-free HRL, which does not require a model of the environment, still often requires the learning of useful internal representations of states. When learning the value function using a nonlinear function approximator, such as a deep neural network, relevant features of states must be extracted in order to support generalization at scale. A number of methods have been explored for learning such internal representations during model-free reinforcement learning (Tesauro 1995; Rafati and Noelle 2017; Mnih et al. 2015).

In this paper, we seek to address major open problems in the integration of internal representation learning, temporal abstraction, automatic subgoal discovery, and intrinsic motivation learning, all within the model-free HRL framework. We propose and implement efficient and general methods for subgoal discovery using unsupervised learning and anomaly (outlier) detection. These methods do not require information beyond that which is typically collected by the agent during model-free reinforcement learning, such as a small memory of recent experiences (agent trajectories). Our methods are fundamentally constrained in three ways, by design. First, we remain faithful to a model-free reinforcement learning framework, eschewing any approach that requires the learning or use of an environment model. Second, we are devoted to integrating subgoal discovery with intrinsic motivation learning. Specifically, we conjecture that intrinsic motivation learning can increase appropriate state space coverage, supporting more efficient subgoal discovery. Lastly, we focus on subgoal discovery algorithms that are likely to scale to large reinforcement learning tasks. The result is a unified model-free HRL algorithm that incorporates the learning of useful internal representations of states, automatic subgoal discovery, intrinsic motivation learning of skills, and the learning of subgoal selection by a "meta-controller". We demonstrate the effectiveness of this algorithm by applying it to a variant of the rooms task (illustrated in Figure 2(a)), as well as the initial screen of the ATARI 2600 game called Montezuma's Revenge (illustrated in Figure 3(a)).

Reinforcement Learning Problem

The Reinforcement Learning (RL) problem is learning through interaction with an environment (Sutton and Barto 1998). At each time step the agent receives a representation of the environment's state, s ∈ S, where S is the set of all possible states. On that basis, the agent selects an action, a ∈ A, where A is the set of all available actions. One time step later, as a consequence of the agent's action, the agent receives a reward, r ∈ R, and also an update on the agent's new state, s', from the environment. Each cycle of interaction is called a transition experience, e = (s, a, r, s'). At each time step, the agent implements a mapping from states to possible actions, π : S → A, called its policy. The goal of the RL agent is to find an optimal policy that maximizes the expected value of the return, i.e. the cumulative sum of future rewards, G_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'+1}, where γ ∈ (0, 1] is the discount factor and T is a final step. The Temporal Difference (TD) learning approach is a class of model-free RL methods that attempt to learn a policy without learning a model of the environment. It is often useful to define a value function q_π : S × A → R to estimate the expected value of the return, following policy π. When the state space is large, or not all states are observable, we can use a nonlinear function approximator, such as an artificial neural network (Mnih et al. 2015), or a linear approximation (Liang et al. 2016), to calculate Q(s, a; w), an estimate of the value function q_π, parameterized by w. Q-learning is a TD algorithm that attempts to find the optimal value function by minimizing the loss function L(w), which is defined as the expectation of squared TD error over a recent transition experience memory, D:

    L(w) ≜ E_{e∼D} [ ( r + γ max_{a'} Q(s', a'; w) − Q(s, a; w) )^2 ].
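To make this loss concrete, the following is a minimal sketch (not the implementation used in our experiments) of one gradient step on L(w), using a minibatch sampled from an experience memory D and a small fully connected Q-network written in PyTorch. The network sizes, the plain Python list used for D, and the omission of terminal-state handling are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

# Illustrative sizes: 4-dimensional state features and 3 actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))   # Q(s, .; w)
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99
D = []  # experience memory of (s, a, r, s_next) tuples

def q_learning_step(batch_size=32):
    """One minimization step on the empirical estimate of
    L(w) = E[(r + gamma * max_a' Q(s', a'; w) - Q(s, a; w))^2]."""
    batch = random.sample(D, min(batch_size, len(D)))
    s = torch.tensor([e[0] for e in batch], dtype=torch.float32)
    a = torch.tensor([e[1] for e in batch], dtype=torch.int64)
    r = torch.tensor([e[2] for e in batch], dtype=torch.float32)
    s_next = torch.tensor([e[3] for e in batch], dtype=torch.float32)

    with torch.no_grad():                       # TD target treated as a constant
        target = r + gamma * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; w)

    loss = ((target - q_sa) ** 2).mean()        # squared TD error over the minibatch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```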
A Unified Model-Free HRL Framework

In Hierarchical Reinforcement Learning (HRL), a central goal is to support the learning of representations at multiple levels of abstraction. As a simple example, consider the task of navigation in the 4-room environment with a key and a lock in Figure 2(a). This is a variant of the rooms task (Sutton, Precup, and Singh 1999). The 4-room is a grid-world environment consisting of 4 rooms. Each grid square is a state, and the agent has access to the Cartesian location of each grid square. Actions allow the agent to move to an adjacent grid square. The 4 rooms are connected through doorways. The agent is rewarded for entering the grid square containing the key, and it is more substantially rewarded for entering the grid square with the lock after obtaining the key. Learning this task based on sparse delayed feedback is challenging for a reinforcement learning agent.

Our intuition, shared with other researchers, is that hierarchies of abstraction will be critical for successfully solving problems of this kind. To be successful, the agent should represent knowledge at multiple levels of spatial and temporal abstraction. Appropriate abstraction can be had by identifying a relatively small set of states that are likely to be useful as subgoals and jointly learning the corresponding skills of achieving these subgoals, using intrinsic motivation. In this section, we introduce a unified method for model-free HRL. The major components of our framework, and the information flow between them, are sketched in Figure 1. Before describing the unified method, we introduce the various components of our framework.

Figure 1: The information flow in the unified Model-Free Hierarchical Reinforcement Learning framework.
Meta-Controller and Controller Framework

Inspired by Kulkarni et al. (2016), we start by using two levels of hierarchy (Figure 1). The more abstract level of this hierarchy is managed by a meta-controller which guides the action selection processes of the lower level controller. Separate value functions are learned for the meta-controller and the controller. At time step t, the meta-controller receives a state observation, s = s_t, from the environment. It has a policy for selecting a subgoal, g = g_t, from a set of subgoals, G. In our implementation, the policy arises from estimating the value of each subgoal, Q(s, g; W), and selecting the goal of highest estimated value (except when performing random exploration). With the current subgoal selected, the controller uses its policy to select an action, a ∈ A, based on the current state, s, and the current subgoal, g. In our implementation, this policy involves selecting the action that results in the highest estimate of the controller's value function, q(s, g, a; w). Actions continue to be selected by the controller while an internal critic monitors the current state, comparing it to the current subgoal, and delivering an appropriate intrinsic reward, r̃, to the controller on each time step. Each transition experience, (s, g, a, r̃, s'), is recorded in the controller's experience memory set, D1, to support learning. When the subgoal is attained, or a maximum amount of time has passed, the meta-controller observes the resulting state, s_t' = s_{t+T+1}, and selects another subgoal, g' = g_{t+T+1}, at which time the process repeats, but not before recording a transition experience for the meta-controller, (s, g, G, s_t'), in the meta-controller's experience memory set, D2. The parameters of the value function approximators are adjusted based on the collections of recent experiences. For training the meta-controller value function, we minimize a loss function based on the reward received from the environment:

    L_i(W) ≜ E_{(s,g,G,s_t')∼D2} [ (Y − Q(s, g; W))^2 ],    (1)

where G = Σ_{t'=t}^{t+T} γ^{t'−t} r_{t'} is the accumulated external reward (return) between the selection of consecutive subgoals. The target value for the expected return at the time that the meta-controller selected subgoal g is Y = G + γ max_{g'} Q(s', g'; W). The controller improves its subpolicy, π(a|s, g), by learning its value function, q(s, g, a; w), over the set of recorded transition experiences. The controller updates its value function approximator parameters, w, so as to minimize its loss function:

    L_i(w) ≜ E_{(s,g,a,r̃,s')∼D1} [ (y − q(s, g, a; w))^2 ],    (2)

where y = r̃ + γ max_{a'} q(s', g, a'; w) is the target expected intrinsic return value.

Intrinsic Motivation Learning

Intrinsic motivation learning is the core idea behind the learning of value functions in the meta-controller and the controller. In some tasks with sparse delayed feedback, a standard RL agent cannot effectively explore the state space so as to have a sufficient number of rewarding experiences to learn how to maximize rewards. In contrast, the intrinsic critic in our HRL framework can send much more regular feedback to the controller, since it is based on attaining subgoals, rather than ultimate goals. As an example, our implementation typically awards an intrinsic reward of +1 when the agent attains the current subgoal, g, and −1 for any other state transition. Successfully solving a difficult task not only depends on such an intrinsic motivation learning mechanism, but also on the meta-controller's ability to learn how to choose the right subgoal for any given state, s, from a set of candidate subgoals. Indeed, identifying a good set of candidate subgoals is an additional prerequisite for success, and it is discussed next.
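The internal critic and the two TD targets used in Equations (1) and (2) can be written in a few lines. The sketch below is illustrative only; the predicate subgoal_reached and the value arrays q_next and Q_next are assumed to be supplied by the surrounding implementation.

```python
import numpy as np

def intrinsic_reward(s_next, g, subgoal_reached):
    # Internal critic used here: +1 when the current subgoal g is attained,
    # -1 for any other state transition.
    return 1.0 if subgoal_reached(s_next, g) else -1.0

def controller_target(r_tilde, q_next, gamma=0.99):
    # y = r~ + gamma * max_a' q(s', g, a'; w), the target in Equation (2);
    # q_next is the vector of controller values q(s', g, .; w).
    return r_tilde + gamma * np.max(q_next)

def meta_controller_target(G, Q_next, gamma=0.99):
    # Y = G + gamma * max_g' Q(s', g'; W), the target in Equation (1), where G
    # is the external return accumulated between consecutive subgoal selections.
    return G + gamma * np.max(Q_next)
```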
Unsupervised Subgoal Discovery

The performance of the meta-controller/controller framework depends critically on selecting good candidate subgoals for the meta-controller to consider. What is a subgoal? In our framework, a subgoal is a state, or a set of related states, that satisfies at least one of these conditions:

1. It is close (in terms of actions) to a rewarding state. For example, in the rooms task in Figure 2(a), the key and lock are rewarding states.
2. It represents a set of states, at least some of which tend to be along a state transition path to a rewarding state.

For example, in the rooms task, the red room should be visited to move from the purple room to the blue room in order to pick up the key. Thus any state in the red room is a reasonably good subgoal for an agent currently in the purple room. Similarly, the states in the blue room are all reasonably good subgoals for an agent currently in the red room. The doorways between rooms can also be considered as good subgoals, since entering these states allows for the transition to a set of states that may be closer to rewarding states.

Our strategy involves leveraging the set of recent transition experiences that must be recorded for value function learning, regardless. Unsupervised learning methods applied to sets of experiences can be used to identify sets of states that may be good subgoal candidates. We focus specifically on two kinds of analysis that can be performed on the set of transition experiences. We hypothesize that good subgoals might be found by (1) attending to the states associated with anomalous transition experiences and (2) clustering experiences based on a similarity measure and collecting the set of associated states into a potential subgoal. Thus, our proposed method merges anomaly (outlier) detection with the K-means clustering of experiences.

Anomaly Detection

The anomaly (outlier) detection process identifies states associated with experiences that differ significantly from the others. In the context of subgoal discovery, a relevant anomalous experience would be one that includes a substantial positive reward in an environment in which reward is sparse. We propose that the states associated with these experiences make for good candidate subgoals. For example, in the rooms task, transitions that arrive at the key or the lock are quite dissimilar to most transitions, due to the large positive reward that is received at that point. Since the goal of RL is maximizing accumulated (discounted) reward, these anomalous experiences, involving large rewards, are ideal as subgoal candidates. (Experiences involving large negative rewards are also anomalous, but make for poor subgoals. As long as these sorts of anomalies do not greatly outnumber others, we expect that the meta-controller can efficiently learn to avoid poor subgoal choices.) Large changes in state features can also be marked as anomalous. In some computer games, like Montezuma's Revenge, each screen represents a room, and the screen changes quickly when the agent moves from one room to another. This produces a large distance between two consecutive states. Such a transition can be recognized simply by the large instantaneous change in state features, marking the associated states as reasonable candidate subgoals. There is a large literature on anomaly detection (Hodge and Austin 2004), in general, offering methods for applying this insight. Heuristic meta-parameter thresholds can be used to identify dissimilarities that warrant special attention, or unsupervised machine learning methods can be used to model the joint probability distribution of state variables, with low probability states seen as anomalous.

K-Means Clustering

The idea behind using a clustering algorithm is "spatial" state space abstraction and dimensionality reduction with regard to the internal representations of states. If the transition experiences in a collection are very similar to each other, this might suggest that the associated states are all roughly equally good as subgoals. Thus, rather than considering all of those states, the learning process might be made faster by considering a representative state (or smaller set of states), such as the centroid of a cluster, as a subgoal. Furthermore, using a simple clustering technique like K-means clustering to find a small number of centroids in the space of experiences is likely to produce centroid subgoals that are dissimilar from each other. Since rewards are sparse, this dissimilarity will be dominated by state features. For example, in the rooms task, the centroids of K-means clusters, with K = 4, lie close to the geometric center of each room, with the states within each room coming to belong to the corresponding subgoal's cluster. In this way, the clustering of transition experiences can approximately produce a coarser representation of state space, in this case replacing the fine grained "grid square location" with the coarser "room location".
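A compact sketch of this combination of anomaly detection and clustering, under the simplifying assumptions that states are feature vectors and that the anomaly test is a plain reward threshold (one of the heuristic options mentioned above), might look like the following; scikit-learn's KMeans stands in for the clustering step, and the threshold value is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_subgoals(D, K=4, reward_threshold=1.0, prev_centroids=None):
    # D is a list of transition experiences (s, a, r, s_next) with states as
    # 1-D feature arrays; reward_threshold is a heuristic meta-parameter.
    anomalies = [s_next for (s, a, r, s_next) in D if r > reward_threshold]
    rest = np.array([s_next for (s, a, r, s_next) in D if r <= reward_threshold])

    # Warm-start from the previous centroids so the subgoal set changes
    # smoothly as the experience memory is refreshed (as in Algorithm 1).
    init = prev_centroids if prev_centroids is not None else "k-means++"
    n_init = 1 if prev_centroids is not None else 10
    centroids = KMeans(n_clusters=K, init=init, n_init=n_init).fit(rest).cluster_centers_

    return anomalies + list(centroids)
```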
A Unified Framework

These conceptual components can be unified into a single model-free HRL framework. The major components of this framework, and the information flow between these components, are schematically displayed in Figure 1. At time t, the meta-controller observes the state, s = s_t, from the environment and chooses a subgoal, g = g_t, either from the discovered subgoals or from a random set of states (to promote exploration). The controller receives an input tuple, (s, g), and is expected to learn to implement a subpolicy, π(a|s, g), that solves the subtask of reaching from s to g. The controller selects an action, a, based on its policy, in our case directly derived from its value function, q(s, g, a; w). After one step, the environment updates the state to s' and sends a reward r. The transition experience (s, g, a, r̃, s') is then stored in the experience memory for the controller, D1. If the internal critic detects that the resulting state, s', is the current goal, g, the experience (s_t, g, G, s_t') is stored in the meta-controller experience memory, D2, where s_t is the state that prompted the selection of the current subgoal, and s_t' = s_{t+T} is the state when the meta-controller assigns the next subgoal, g' = g_t'. The experience memory sets are typically used to train the value function approximators for the meta-controller and the controller by sampling a random minibatch of recent experiences. The subgoal discovery mechanism exploits the underlying structure in the experience memory sets using unsupervised anomaly detection and experience clustering. A detailed description of this process is outlined in Algorithm 1.

Algorithm 1: Unified Model-Free HRL Algorithm

    Initialize discovered subgoals set G ← ∅
    Initialize experience memories D, D1, and D2
    Initialize parameters w and W randomly
    Inputs: learning rate α, exploration rate ε
    Choose Phase I or Phase II
    for episode = 1, ..., M do
        Initialize state s0 ∈ S, s ← s0
        Initialize episode return G ← 0
        if Phase I of learning then
            Choose a subgoal g randomly from S
        else if Phase II of learning then
            g ← EPSILON-GREEDY(Q(s, G; W), ε)
        end if
        Intrinsic Motivation Learning Algorithm:
        repeat for each step t = 1, ..., T
            Compute q(s, g, a; w)
            a ← EPSILON-GREEDY(q(s, g, A; w), ε)
            Take action a, observe s' and external reward r
            Compute intrinsic reward r̃ from the internal critic
            Store (s, g, a, r̃, s') to D1
            Store (s, a, r, s') to D
            Sample J1 ⊂ D1 and compute ∇L
            Update w, w ← w − α∇L
            if Phase II of learning then
                Sample J2 ⊂ D2 and compute ∇L
                Update W, W ← W − α∇L
            end if
            s ← s', G ← G + r
            Decay exploration rate ε
        until s is terminal or subgoal g is attained
        Store (s0, g, G, s') to D2
        if Phase I of learning then
            Run Unsupervised Subgoal Discovery
        end if
    end for

    Unsupervised Subgoal Discovery Algorithm:
    for each e = (s, a, r, s') stored in D do
        if experience e is an outlier (anomaly) then
            Store s' to the subgoals set G
            Remove e from D
        end if
    end for
    Fit a K-means clustering algorithm on D, using the previous centroids as initial points
    Store the updated centroids to the subgoals set G
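As a complement to the pseudocode, the following is a rough Python sketch of a single episode of the meta-controller/controller interaction (Phase II flavour). The gradient updates on w and W are omitted, and env, meta_Q, controller_q, and critic are hypothetical interfaces assumed for illustration, not part of the implementation described in this paper.

```python
import random
import numpy as np

def run_episode(env, meta_Q, controller_q, critic, subgoals,
                D, D1, D2, epsilon=0.1, max_steps=200):
    # Assumed interfaces: env.reset() -> s; env.step(a) -> (s_next, r, done);
    # meta_Q(s) -> values over the subgoal set; controller_q(s, g) -> values
    # over actions; critic(s_next, g) -> (r_tilde, attained).
    s0 = env.reset()
    s, G_return = s0, 0.0                        # G_return: external return G

    # Meta-controller: epsilon-greedy subgoal selection over discovered subgoals.
    g = random.choice(subgoals) if random.random() < epsilon \
        else subgoals[int(np.argmax(meta_Q(s)))]

    for t in range(max_steps):
        # Controller: epsilon-greedy action selection given (s, g).
        q = controller_q(s, g)
        a = random.randrange(len(q)) if random.random() < epsilon else int(np.argmax(q))
        s_next, r, done = env.step(a)
        r_tilde, attained = critic(s_next, g)    # internal critic
        D1.append((s, g, a, r_tilde, s_next))    # controller experience memory
        D.append((s, a, r, s_next))              # memory mined for subgoal discovery
        s, G_return = s_next, G_return + r
        if done or attained:
            break

    D2.append((s0, g, G_return, s))              # meta-controller experience memory
    return G_return
```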
Experiments

We conducted simulation experiments in order to investigate the ability of our unsupervised subgoal discovery method to discover useful subgoals, as well as the efficiency of our unified model-free hierarchical reinforcement learning framework. The simulations were conducted in two environments with sparse delayed feedback: a variant of the rooms task, shown in Figure 2(a), and the Montezuma's Revenge game, shown in Figure 3(a).

In these simulations, learning occurred in two phases. In Phase I of learning, the controller learned how to navigate in the state space, from any arbitrary state to any other state, using intrinsic motivation learning. The agent's trajectories, (s, a, r, s'), during this pretraining phase were stored in D, and our unsupervised subgoal discovery method extracted the structure of D by performing K-means clustering and anomaly detection. The discovered subgoals were stored in G. In Phase II of learning, the meta-controller learned meta-policies over G, and the controller had opportunities to refine its ability to navigate from any arbitrary state, s, to any assigned subgoal, g ∈ G. We separated the learning process into phases for two reasons: (1) to focus on the knowledge extracted from game environments using our approach to unsupervised subgoal discovery during intrinsic motivation learning; (2) to avoid complications that arise in meta-controller learning when the set of subgoals, G, is not fixed. Integrating these phases into a uniform and incremental HRL process is a central aspect of our future work.

4-Room Task with Key and Lock

Consider the task of navigation in the 4-room environment with a key and a lock, as shown in Figure 2(a). This task was inspired by the rooms environment introduced by Sutton et al. (1999), but it is much more complex. The agent not only needs to learn how to navigate from any arbitrary state to any other state, but it also needs to visit some states in a specific temporal order. At the beginning of each episode, the agent is initialized in an arbitrary location in an arbitrary room. The agent has four possible move actions, A = {North, South, East, West}, on each time step. The agent receives a reward of r = +10 for reaching the key and r = +40 if it moves to the lock while carrying the key (i.e., any time after visiting the key location during the same episode). Bumping into a wall boundary is punished with a reward of r = −2. There is no reward for just exploring the space. Learning in this environment with sparse delayed feedback is challenging for a reinforcement learning agent. To successfully solve the task, the agent should represent knowledge at multiple levels of spatial and temporal abstraction. The agent should also learn to explore the environment efficiently.
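A minimal sketch of such an environment is given below. The reward scheme (+10 for the key, +40 for the lock after the key, −2 for bumping into a wall) follows the description above, but the grid dimensions, door placements, and default key/lock locations are illustrative assumptions, since the exact layout is only shown in Figure 2(a).

```python
import numpy as np

class FourRoomsKeyLock:
    """Illustrative 4-room grid-world with a key and a lock (not the exact
    environment used in the experiments)."""
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}  # North, South, East, West

    def __init__(self, key=(1, 1), lock=(9, 9)):
        self.walls = self._build_walls(size=11)
        self.key, self.lock = key, lock
        self.reset()

    def _build_walls(self, size):
        walls = np.zeros((size, size), dtype=bool)
        walls[0, :] = walls[-1, :] = walls[:, 0] = walls[:, -1] = True
        walls[size // 2, :] = walls[:, size // 2] = True
        for r, c in [(size // 2, 2), (size // 2, 8), (2, size // 2), (8, size // 2)]:
            walls[r, c] = False                   # doorways between the four rooms
        return walls

    def reset(self):
        free = np.argwhere(~self.walls)
        self.pos = tuple(free[np.random.randint(len(free))])  # arbitrary start
        self.has_key = False
        return self.pos

    def step(self, a):
        dr, dc = self.MOVES[a]
        nxt = (self.pos[0] + dr, self.pos[1] + dc)
        if self.walls[nxt]:
            return self.pos, -2.0, False          # bumped into a wall boundary
        self.pos, r, done = nxt, 0.0, False
        if nxt == self.key and not self.has_key:
            self.has_key, r = True, 10.0          # picked up the key
        elif nxt == self.lock and self.has_key:
            r, done = 40.0, True                  # reached the lock with the key
        return self.pos, r, done
```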
We first examined the unsupervised subgoal discovery algorithm over the course of a random walk. The agent was allowed to explore the 4-room environment for 10,000 episodes. Each episode ended either when the task was completed or after reaching a maximum time step limit of 200. The agent's experiences, e = (s, a, r, s'), were collected in an experience memory, D. The stream of external rewards for each transition was used to detect anomalous subgoals (Figure 2(c)). We applied a heuristic anomaly detection method for the streaming rewards that was able to differentiate between the rare large positive rewards and the regular small ones. These peaks, as shown in Figure 2(c), corresponded to the experiences in which the key was reached (r = +10) or the experience of reaching the lock after obtaining the key.

We also applied a K-means clustering algorithm to the experience memory. (See Algorithm 1.) The centroids of the K-means clusters (with K = 4) are plotted in Figure 2(b). We found these centroids to roughly correspond to the centers of the rooms. (We experimented with K = 8 and saw equally useful clusters, with each room containing two cluster centroids. We will investigate methods for choosing K in future work.) The clusters, along with the anomalous states, were collected into the set of subgoals.

Figure 2: (a) The 4-room task with a key and a lock. (b) The results of the unsupervised subgoal discovery algorithm, with anomalies marked with black Xs and centroids with colored ones. (c) Reward over an episode, with anomalous points corresponding to the key (r = +10) and the lock (r = +40). (d) The average success of the controller in reaching subgoals during intrinsic motivation learning in the pretraining phase. (e) The average episode return. (f) The rate of solving the 4-room task.

Phase I learning consisted of 100,000 episodes. Value function approximators were implemented as multi-layer artificial neural networks augmented to encourage the learning of sparse internal representations of states (Rafati and Noelle 2015). The controller network, q(s, g, a; w), took the state, s, and the goal, g, as inputs. States were presented to the network as Cartesian coordinates, with separate pools of inputs for each of the two dimensions. During this learning phase, the subgoal was initially chosen randomly from the state space. After unsupervised subgoal discovery, the subgoal was chosen randomly from the subgoals set, G. In this way, the controller was trained to navigate from any arbitrary state, s, to any subgoal state. When a centroid was selected as a subgoal, if the agent entered any state in the corresponding cluster, the subgoal was considered attained. Thus, the controller essentially learned how to navigate from any location to any state cluster (room) and also to any of the anomalous subgoals (key and lock). The learning rate was α = 0.001, the discount factor was γ = 0.99, and the exploration rate was set to ε = 0.2. The average success rate (over 10 consecutive episodes) for the first phase of intrinsic motivation learning is shown in Figure 2(d).

Phase II learning involved training both the meta-controller and the controller, together, using the discovered subgoals, G. (See Algorithm 1.) The subgoal set regularly came to contain a total of 6 subgoals: 2 anomalous ones and 4 centroids. The system was trained for 100,000 episodes. The meta-controller's value function approximation network consisted of two layers. The first layer was a one-hot encoding of the conjunction of the current subgoal and the state, computed by converting the state to the index of the corresponding subgoal. This was connected directly to the output layer. The average return, over 10 consecutive episodes, is shown in Figure 2(e). The agent very quickly converged on the optimal policies and collected the maximum reward (+50). The high exploration rate, ε = 0.2, caused high stochasticity, but the meta-controller and controller could robustly solve the task on more than 90% of the episodes very early in training. After about 40,000 episodes, the success rate was 100%, as shown in Figure 2(f).
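One way to read the two-layer meta-controller description above is as a lookup table over (state region, subgoal) pairs: the state is reduced to the index of its nearest discovered subgoal region, that index is one-hot encoded, and a single linear layer maps the encoding to a value per candidate subgoal. The sketch below follows that reading; the exact conjunctive encoding and update details are assumptions, not the implementation used here.

```python
import numpy as np

class MetaControllerQ:
    """Illustrative table-style realization of Q(s, g; W) for the rooms task."""

    def __init__(self, n_subgoals, lr=0.001, gamma=0.99):
        self.W = np.zeros((n_subgoals, n_subgoals))   # rows: state region, cols: subgoal
        self.lr, self.gamma = lr, gamma

    def region(self, s, centroids):
        # Convert the state to the index of the corresponding subgoal region.
        return int(np.argmin([np.linalg.norm(np.subtract(s, c)) for c in centroids]))

    def values(self, s, centroids):
        return self.W[self.region(s, centroids)]       # Q(s, .; W)

    def update(self, s, g, G, s_next, centroids):
        # One TD step toward the Equation (1) target Y = G + gamma * max_g' Q(s', g'; W);
        # g is the index of the subgoal that was selected in state s.
        i = self.region(s, centroids)
        Y = G + self.gamma * np.max(self.values(s_next, centroids))
        self.W[i, g] += self.lr * (Y - self.W[i, g])
```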
We compared the learning efficiency of our unified HRL method with the performance resulting from training a value approximation network with a regular, non-hierarchical RL algorithm, TD SARSA (Sutton and Barto 1998). The function approximator that we used for Q(s, a; w) matched that of the controller, equating for computational resources, and we used the same values for the training hyper-parameters. The regular RL agent could only reach the key before becoming stuck in that region, due to the high local reward. Despite the very high exploration rate used, the regular RL agent was not motivated to explore the rest of the state space to reach the lock and solve the task. Results are shown in Figure 2(e) and (f) (red plots).

It is worth noting that this task involves a partially observable Markov decision process (POMDP), because information about whether or not the agent has the key is not visible in the state. This hidden state information poses a serious problem for standard RL algorithms, but our HRL agent was able to overcome this obstacle. Through Phase II learning, the hidden information became implicit in the selected subgoal, with the meta-controller changing the current subgoal once the key is obtained. In this way, our HRL framework is able to succeed in task environments that are effectively outside of the scope of standard RL approaches.

Montezuma's Revenge

We applied our HRL approach to the first room of the game Montezuma's Revenge. (See Figure 3(a).) The game is well-known as a challenge for RL agents because it requires solving many subtasks while avoiding traps. Having only sparse delayed reward feedback to drive learning makes this RL problem extremely difficult. The agent should learn to navigate the man in red to the key by: (1) climbing the middle ladder, (2) climbing the bottom right ladder, (3) climbing the bottom left ladder, and (4) moving to the key. After picking up the key (r = +100), the agent should return, reversing the previous action sequence, and attempt to reach the door (r = +300) and exit the room. The moving skull at the bottom of the screen, which ends an episode upon contact, makes obtaining the key extremely difficult. The episode also ends unsuccessfully if the man falls off of a platform.

DeepMind's Deep Q-Learning (DQN) algorithm (Mnih et al. 2015), which surpassed human performance on many ATARI 2600 games, failed to learn this game since the agent did not reach any rewarding state during exploration. In this problem, the agent requires the skills arising from intrinsic motivation learning in order to explore the environment in a more efficient way (Kulkarni et al. 2016). Our HRL approach supports the learning of such skills. As before, the meta-controller and the controller were trained in two phases. In Phase I, the controller was trained to move the man from any location in the given frame, s, to any other location specified in a subgoal frame, g. An initial set of "interesting" subgoal locations was identified using a custom edge detection algorithm, avoiding empty regions as subgoals. Unsupervised object detection using computer vision algorithms can be challenging (Kulkarni et al. 2016; Fragkiadaki et al. 2015). We made the simplifying assumption that, in many games, edges were suggestive of objects, and the locations of objects made for good initial subgoals. These locations were used in Phase I of training to train the controller through intrinsic motivation. Note that edge detection was only performed to identify Phase I subgoals. Specifically, it was not used to change or augment the state representation in any way.
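The custom edge detection algorithm is not specified here, so the following is only a rough sketch of the general idea that edge-dense (and hence non-empty) regions of a grayscale frame make reasonable initial subgoal locations. The finite-difference gradient detector, patch size, and thresholds are all illustrative assumptions.

```python
import numpy as np

def candidate_subgoal_locations(frame, patch=8, density_threshold=0.1):
    # frame: 2-D grayscale array (e.g., an 84x84 game frame).
    gray = frame.astype(np.float32)
    gx = np.abs(np.diff(gray, axis=1, prepend=gray[:, :1]))
    gy = np.abs(np.diff(gray, axis=0, prepend=gray[:1, :]))
    edges = (gx + gy) > 32.0                      # crude edge map (threshold is arbitrary)

    locations = []
    h, w = edges.shape
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            # Keep patch centers with enough edge pixels, skipping empty regions.
            if edges[r:r + patch, c:c + patch].mean() > density_threshold:
                locations.append((r + patch // 2, c + patch // 2))
    return locations
```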
We used a variant of the DQN deep Convolutional Neural Network (CNN) architecture (Figure 3(b)) for approximation of the controller's value function, q(s, g, a; w). The input to the controller network consisted of four consecutive frames of size 84 × 84, encoding the state, s, and an additional binary mask frame encoding the subgoal, g. The concatenated state and subgoal frames were passed to the network, and the controller then selected one of 18 different joystick actions based on a policy derived from q(s, g, a; w).
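A sketch of one plausible network of this kind is shown below in PyTorch: the four state frames plus the binary subgoal mask are stacked as five input channels, and the convolutional and fully connected sizes follow the standard DQN architecture (Mnih et al. 2015). Treating the subgoal mask as a fifth channel is an assumption about the exact concatenation, not a description of the network actually trained here.

```python
import torch
import torch.nn as nn

class ControllerCNN(nn.Module):
    """Illustrative DQN-style network for q(s, g, a; w)."""

    def __init__(self, n_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(5, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 feature maps from 84x84 input
            nn.Linear(512, n_actions),               # one value per joystick action
        )

    def forward(self, frames, subgoal_mask):
        # frames: (batch, 4, 84, 84); subgoal_mask: (batch, 1, 84, 84)
        x = torch.cat([frames, subgoal_mask], dim=1)
        return self.head(self.features(x))
```

For example, ControllerCNN()(torch.zeros(1, 4, 84, 84), torch.zeros(1, 1, 84, 84)) yields a (1, 18) tensor of action values.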
During intrinsic motivation learning, the recent experiences were saved in an experience memory, D, with a size of 10^6. In order to support comparison to previously published results, we used the same learning parameters as DeepMind's DQN (Mnih et al. 2015). Specifically, the learning rate, α, was set to 0.00025, with a discount rate of γ = 0.99. During Phase I learning, we trained the network for a total of 2.5 × 10^6 time steps. The exploration probability parameter, ε, decreased from 1.0 to 0.1 in the first million steps and remained fixed after that. After every 100,000 time steps, we applied our unsupervised subgoal discovery method to the contents of the experience memory in order to find new subgoals, both anomalies and centroids, using K-means clustering with K = 10. As shown in Figure 3(d), the unsupervised learning algorithm managed to discover the locations of the key and the doors in this way. It also identified useful objects such as ladders, platforms, and the rope.

In Phase II, we trained the meta-controller and the controller jointly. We used an architecture based on the DQN CNN (Mnih et al. 2015), as shown in Figure 3(c), for the meta-controller's value function, Q(s, g; W). We used the non-overlapping discovered subgoals, which resulted in a set of 11 subgoals, G. At the beginning of each episode, the meta-controller assigned a subgoal, g ∈ G, based on an epsilon-greedy policy derived from Q(s, g; W). The controller then attempted to reach these subgoals. The controller's experience memory, D1, had a size of 10^6, and the size of the meta-controller's experience memory, D2, was 5 × 10^4. The cumulative reward for the game episodes is shown in Figure 3(e). After about 1.5 million time steps, the controller managed to reach the key subgoal more frequently. The success of the intrinsic motivation learning is depicted in Figure 3(f). At the end of the second phase of learning (after 2.5 million learning steps), the meta-controller regularly chose the proper subgoals for collecting the maximum reward (+400).

Figure 3: (a) A sample screen from the ATARI 2600 game Montezuma's Revenge. (b) The CNN architecture for the controller's value function. (c) The CNN architecture for the meta-controller's value function. (d) The results of the unsupervised subgoal discovery algorithm; the blue circles are the discovered anomalous subgoals, and the red ones are the centroid subgoals. (e) The average return over 10 episodes during the second phase of learning. (f) The success of the controller in reaching the subgoals during the second phase of learning.

Conclusion

We have proposed and demonstrated a novel model-free HRL method for subgoal discovery using unsupervised learning over a small memory of the most recent experiences (trajectories) of the agent. When combined with an intrinsic motivation learning mechanism, this method learns subgoals and skills together, based on experiences in the environment. Thus, we offer an HRL approach that does not require a model of the environment, making it suitable for larger-scale applications.

Acknowledgments

We acknowledge the valuable comments of the anonymous reviewers of this paper. The computations were conducted on the MERCED cluster at UC Merced, which was funded by National Science Foundation Grant No. ACI-1429783.

References

Şimşek, O.; Wolfe, A. P.; and Barto, A. G. 2005. Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the 22nd International Conference on Machine Learning, 816–823.

Fragkiadaki, K.; Arbelaez, P.; Felsen, P.; and Malik, J. 2015. Learning to segment moving objects in videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4083–4090.

Goel, S., and Huber, M. 2003. Subgoal discovery for hierarchical reinforcement learning using learned policies. In FLAIRS Conference, 346–350. AAAI Press.

Hodge, V. J., and Austin, J. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22(2):85–126.
Kulkarni, T. D.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 3675–3683.

Liang, Y.; Machado, M. C.; Talvitie, E.; and Bowling, M. H. 2016. State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 17–25.

Machado, M. C., and Bowling, M. 2016. Learning purposeful behaviour in the absence of rewards. Presented at the ICML-16 Workshop on Abstraction in Reinforcement Learning.

Machado, M. C.; Bellemare, M. G.; and Bowling, M. H. 2017. A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2295–2304.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.

Pickett, M., and Barto, A. G. 2002. PolicyBlocks: An algorithm for creating useful macro-actions in reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, 506–513.

Rafati, J., and Noelle, D. C. 2015. Lateral inhibition overcomes limits of temporal difference learning. In 37th Annual Cognitive Science Society Meeting, Pasadena, CA, USA.

Rafati, J., and Noelle, D. C. 2017. Sparse coding of learned state representations in reinforcement learning. In Conference on Cognitive Computational Neuroscience, New York City, NY, USA.

Singh, S.; Lewis, R. L.; Barto, A. G.; and Sorg, J. 2010. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development 2(2):70–82.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1st edition.

Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1):181–211.

Tesauro, G. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM 38(3).

Thrun, S., and Schwartz, A. 1995. Finding structure in reinforcement learning. In Advances in Neural Information Processing Systems 7, 385–392. MIT Press.

Vezhnevets, A. S.; Osindero, S.; Schaul, T.; Heess, N.; Jaderberg, M.; Silver, D.; and Kavukcuoglu, K. 2017. Feudal networks for hierarchical reinforcement learning. CoRR abs/1703.01161.