Unsupervised Methods For Subgoal Discovery During Intrinsic Motivation in Model-Free Hierarchical Reinforcement Learning

Jacob Rafati and David C. Noelle
{jrafatiheravi,dnoelle}@ucmerced.edu
Electrical Engineering and Computer Science
Computational Cognitive Neuroscience Laboratory
University of California, Merced
5200 North Lake Road, Merced, CA 95343, USA

Abstract

Common approaches to Reinforcement Learning (RL) are seriously challenged by large-scale applications involving huge state spaces and sparse delayed reward feedback. Hierarchical Reinforcement Learning (HRL) methods attempt to address this scalability issue by learning action selection policies at multiple levels of temporal abstraction. Abstraction can be had by identifying a relatively small set of states that are likely to be useful as subgoals, in concert with the learning of corresponding skill policies to achieve those subgoals. Many approaches to subgoal discovery in HRL depend on the analysis of a model of the environment, but the need to learn such a model introduces its own problems of scale. Once subgoals are identified, skills may be learned through intrinsic motivation, introducing an internal reward signal marking subgoal attainment. In this paper, we present a novel model-free method for subgoal discovery using incremental unsupervised learning over a small memory of the most recent experiences (trajectories) of the agent. When combined with an intrinsic motivation learning mechanism, this method learns both subgoals and skills, based on experiences in the environment. Thus, we offer an original approach to HRL that does not require the acquisition of a model of the environment, suitable for large-scale applications. We demonstrate the efficiency of our method on two RL problems with sparse delayed feedback: a variant of the rooms environment and the first screen of the ATARI 2600 Montezuma's Revenge game.
Introduction

The reinforcement learning problem suffers from serious scaling issues. Hierarchical Reinforcement Learning (HRL) is an important computational approach intended to tackle problems of scale by learning to operate over different levels of temporal abstraction (Sutton, Precup, and Singh 1999). The acquisition of hierarchies of reusable skills is one of the distinguishing characteristics of biological intelligence, and the learning of such hierarchies is an important open problem in computational reinforcement learning. Also, in the context of games, the development of robust HRL methods will provide a means to acquire relevant knowledge at multiple levels of abstraction, potentially speeding learning and supporting generalization.

A number of approaches to HRL have been suggested. One approach focuses on action sequences, subpolicies, or "options" that appear repeatedly during the learning of a set of tasks. Such frequently reused subpolicies can be abstracted into skills that can be treated as individual actions at a higher level of abstraction. A somewhat different approach to temporal abstraction involves identifying a set of states that make for useful subgoals. This introduces a major open problem in HRL: that of subgoal discovery.

A variety of researchers have proposed approaches to identifying useful subpolicies and reifying them as skills (Pickett and Barto 2002; Thrun and Schwartz 1995). For example, Sutton, Precup, and Singh (1999) proposed the options framework, which involves abstractions over the space of actions. At each step, the agent chooses either a one-step "primitive" action or a "multi-step" action policy (an option). Each option defines a policy over actions (either primitive or other options) and comes to completion according to a termination condition. Other researchers have focused on identifying subgoals (states that are generally useful to attain) and learning a collection of skills that allow the agent to efficiently reach those subgoals. Some approaches to subgoal discovery maintain the value function in a large look-up table (Sutton, Precup, and Singh 1999; Goel and Huber 2003; Şimşek, Wolfe, and Barto 2005), and most of these methods require building the state transition graph, providing a model of the environment and the agent's possible interactions with it (Machado, Bellemare, and Bowling 2017; Şimşek, Wolfe, and Barto 2005; Goel and Huber 2003).

Formally, the state transition graph is a directed graph G = (V, E) with a set of vertices, V ⊆ S, and a set of edges, E ⊆ A(S), where S is the set of states and A(S) is the set of allowable actions. Since actions typically modify the state of the agent, each directed edge, (s, s') ∈ E, indicates an action that takes the agent from state s to state s'. In nondeterministic environments, a probability distribution over subsequent states, given the current state and an action, p(s'|s, a), is maintained as part of the model of the environment. One method of this kind that was applied to a somewhat larger scale task (the first screen of the ATARI 2600 game called Montezuma's Revenge) is that of Machado & Bowling (2016). This method constructs the combinatorial transition graph Laplacian matrix, and an eigen-decomposition of that matrix produces candidate subgoals. While it was shown that some of these candidates make for useful subgoals, only heuristic domain-sensitive methods have been reported for identifying useful subgoals from the large set of candidates (e.g., thousands).

Thus, previously proposed subgoal discovery methods have provided useful insights and have been demonstrated to improve learning, but there continue to be challenges with regard to scalability and generalization. Scaling to large state spaces will generally mandate the use of some form of nonlinear function approximator to encode the value function, rather than a look-up table. More importantly, as the scale of the reinforcement learning problem increases, the tractability of obtaining a good model of the environment, capturing all relevant state transition probabilities, precipitously decreases.
Once useful subgoals are discovered, an HRL agent should be able to learn the skills to attain those subgoals through the use of intrinsic motivation: artificially rewarding the agent for attaining selected subgoals. The nature and origin of "good" intrinsic reward functions is an open question in reinforcement learning, however, and a number of approaches have been proposed. Singh et al. (2010) explored agents with intrinsic reward structures in order to learn generic options that can apply to a wide variety of tasks. Value functions have also been generalized to consider goals along with states (Vezhnevets et al. 2017). Such a parameterized universal value function, q(s, g, a; w), integrates the value functions for multiple skills into a single function taking the current subgoal, g, as an argument.

Recently, Kulkarni et al. (2016) proposed a scheme for temporal abstraction that involves simultaneously learning options and a hierarchical control policy in a deep reinforcement learning framework. Their approach does not use separate Q-functions for each option, but, instead, treats the option as an argument. This method lacks a technique for automatic subgoal discovery, however, forcing the system designer to specify a set of promising subgoal candidates in advance. The approach proposed in this paper is inspired by Kulkarni et al. (2016), which has advantages in terms of scalability and generalization, but it incorporates automatic subgoal discovery.

It is important to note that model-free HRL, which does not require a model of the environment, still often requires the learning of useful internal representations of states. When learning the value function using a nonlinear function approximator, such as a deep neural network, relevant features of states must be extracted in order to support generalization at scale. A number of methods have been explored for learning such internal representations during model-free reinforcement learning (Tesauro 1995; Rafati and Noelle 2017; Mnih et al. 2015).

In this paper, we seek to address major open problems in the integration of internal representation learning, temporal abstraction, automatic subgoal discovery, and intrinsic motivation learning, all within the model-free HRL framework. We propose and implement efficient and general methods for subgoal discovery using unsupervised learning and anomaly (outlier) detection. These methods do not require information beyond that which is typically collected by the agent during model-free reinforcement learning, such as a small memory of recent experiences (agent trajectories). Our methods are fundamentally constrained in three ways, by design. First, we remain faithful to a model-free reinforcement learning framework, eschewing any approach that requires the learning or use of an environment model. Second, we are devoted to integrating subgoal discovery with intrinsic motivation learning. Specifically, we conjecture that intrinsic motivation learning can increase appropriate state space coverage, supporting more efficient subgoal discovery. Lastly, we focus on subgoal discovery algorithms that are likely to scale to large reinforcement learning tasks. The result is a unified model-free HRL algorithm that incorporates the learning of useful internal representations of states, automatic subgoal discovery, intrinsic motivation learning of skills, and the learning of subgoal selection by a "meta-controller". We demonstrate the effectiveness of this algorithm by applying it to a variant of the rooms task (illustrated in Figure 2(a)), as well as the initial screen of the ATARI 2600 game called Montezuma's Revenge (illustrated in Figure 3(a)).

Reinforcement Learning Problem

The Reinforcement Learning (RL) problem is learning through interaction with an environment (Sutton and Barto 1998). At each time step the agent receives a representation of the environment's state, s ∈ S, where S is the set of all possible states. On that basis, the agent selects an action, a ∈ A, where A is the set of all available actions. One time step later, as a consequence of the agent's action, the agent receives a reward, r ∈ R, and also an update on the agent's new state, s', from the environment. Each cycle of interaction is called a transition experience, e = (s, a, r, s'). At each time step, the agent implements a mapping from states to possible actions, π : S → A, called its policy. The goal of the RL agent is to find an optimal policy that maximizes the expected value of the return, i.e. the cumulative sum of future rewards, G_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'+1}, where γ ∈ (0, 1] is the discount factor and T is a final step. The Temporal Difference (TD) learning approach is a class of model-free RL methods that attempt to learn a policy without learning a model of the environment. It is often useful to define a value function q_π : S × A → R to estimate the expected value of the return, following policy π. When the state space is large, or not all states are observable, we can use a nonlinear function approximator, such as an artificial neural network (Mnih et al. 2015), or a linear approximation (Liang et al. 2016), to calculate Q(s, a; w), an estimate of the value function q_π, parameterized by w. Q-learning is a TD algorithm that attempts to find the optimal value function by minimizing the loss function L(w), which is defined as the expectation of squared TD error over a recent transition experience memory, D:

    L(w) ≜ E_{e∼D} [ ( r + γ max_{a'} Q(s', a'; w) − Q(s, a; w) )^2 ].
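To make this loss concrete, the following is a minimal sketch (not the implementation used in our experiments) of one gradient step on L(w), using a minibatch sampled from an experience memory D and a small fully connected Q-network written in PyTorch. The network sizes, the plain Python list used for D, and the omission of terminal-state handling are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

# Illustrative sizes: 4-dimensional state features and 3 actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))   # Q(s, .; w)
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99
D = []  # experience memory of (s, a, r, s_next) tuples

def q_learning_step(batch_size=32):
    """One minimization step on the empirical estimate of
    L(w) = E[(r + gamma * max_a' Q(s', a'; w) - Q(s, a; w))^2]."""
    batch = random.sample(D, min(batch_size, len(D)))
    s = torch.tensor([e[0] for e in batch], dtype=torch.float32)
    a = torch.tensor([e[1] for e in batch], dtype=torch.int64)
    r = torch.tensor([e[2] for e in batch], dtype=torch.float32)
    s_next = torch.tensor([e[3] for e in batch], dtype=torch.float32)

    with torch.no_grad():                       # TD target treated as a constant
        target = r + gamma * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; w)

    loss = ((target - q_sa) ** 2).mean()        # squared TD error over the minibatch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```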
A Unified Model-Free HRL Framework

In Hierarchical Reinforcement Learning (HRL), a central goal is to support the learning of representations at multiple levels of abstraction. As a simple example, consider the task of navigation in the 4-room environment with a key and a lock in Figure 2(a). This is a variant of the rooms task (Sutton, Precup, and Singh 1999). The 4-room is a grid-world environment consisting of 4 rooms. Each grid square is a state, and the agent has access to the Cartesian location of each grid square. Actions allow the agent to move to an adjacent grid square. The 4 rooms are connected through doorways. The agent is rewarded for entering the grid square containing the key, and it is more substantially rewarded for entering the grid square with the lock after obtaining the key. Learning this task based on sparse delayed feedback is challenging for a reinforcement learning agent.

Our intuition, shared with other researchers, is that hierarchies of abstraction will be critical for successfully solving problems of this kind. To be successful, the agent should represent knowledge at multiple levels of spatial and temporal abstraction. Appropriate abstraction can be had by identifying a relatively small set of states that are likely to be useful as subgoals and jointly learning the corresponding skills of achieving these subgoals, using intrinsic motivation. In this section, we introduce a unified method for model-free HRL. The major components of our framework, and the information flow between them, are sketched in Figure 1. Before describing the unified method, we introduce the various components of our framework.

Figure 1: The information flow in the unified Model-Free Hierarchical Reinforcement Learning framework.
Meta-Controller and Controller Framework

Inspired by Kulkarni et al. (2016), we start by using two levels of hierarchy (Figure 1). The more abstract level of this hierarchy is managed by a meta-controller which guides the action selection processes of the lower level controller. Separate value functions are learned for the meta-controller and the controller. At time step t, the meta-controller receives a state observation, s = s_t, from the environment. It has a policy for selecting a subgoal, g = g_t, from a set of subgoals, G. In our implementation, the policy arises from estimating the value of each subgoal, Q(s, g; W), and selecting the goal of highest estimated value (except when performing random exploration). With the current subgoal selected, the controller uses its policy to select an action, a ∈ A, based on the current state, s, and the current subgoal, g. In our implementation, this policy involves selecting the action that results in the highest estimate of the controller's value function, q(s, g, a; w). Actions continue to be selected by the controller while an internal critic monitors the current state, comparing it to the current subgoal, and delivering an appropriate intrinsic reward, r̃, to the controller on each time step. Each transition experience, (s, g, a, r̃, s'), is recorded in the controller's experience memory set, D1, to support learning. When the subgoal is attained, or a maximum amount of time has passed, the meta-controller observes the resulting state, s_t' = s_{t+T+1}, and selects another subgoal, g' = g_{t+T+1}, at which time the process repeats, but not before recording a transition experience for the meta-controller, (s, g, G, s_t'), in the meta-controller's experience memory set, D2. The parameters of the value function approximators are adjusted based on the collections of recent experiences. For training the meta-controller value function, we minimize a loss function based on the reward received from the environment:

    L_i(W) ≜ E_{(s,g,G,s_t')∼D2} [ (Y − Q(s, g; W))^2 ],    (1)

where G = Σ_{t'=t}^{t+T} γ^{t'−t} r_{t'} is the accumulated external reward (return) between the selection of consecutive subgoals. The target value for the expected return at the time that the meta-controller selected subgoal g is Y = G + γ max_{g'} Q(s', g'; W). The controller improves its subpolicy, π(a|s, g), by learning its value function, q(s, g, a; w), over the set of recorded transition experiences. The controller updates its value function approximator parameters, w, so as to minimize its loss function:

    L_i(w) ≜ E_{(s,g,a,r̃,s')∼D1} [ (y − q(s, g, a; w))^2 ],    (2)

where y = r̃ + γ max_{a'} q(s', g, a'; w) is the target expected intrinsic return value.

Intrinsic Motivation Learning

Intrinsic motivation learning is the core idea behind the learning of value functions in the meta-controller and the controller. In some tasks with sparse delayed feedback, a standard RL agent cannot effectively explore the state space so as to have a sufficient number of rewarding experiences to learn how to maximize rewards. In contrast, the intrinsic critic in our HRL framework can send much more regular feedback to the controller, since it is based on attaining subgoals, rather than ultimate goals. As an example, our implementation typically awards an intrinsic reward of +1 when the agent attains the current subgoal, g, and −1 for any other state transition. Successfully solving a difficult task not only depends on such an intrinsic motivation learning mechanism, but also on the meta-controller's ability to learn how to choose the right subgoal for any given state, s, from a set of candidate subgoals. Indeed, identifying a good set of candidate subgoals is an additional prerequisite for success, and it is discussed next.
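The internal critic and the two TD targets used in Equations (1) and (2) can be written in a few lines. The sketch below is illustrative only; the predicate subgoal_reached and the value arrays q_next and Q_next are assumed to be supplied by the surrounding implementation.

```python
import numpy as np

def intrinsic_reward(s_next, g, subgoal_reached):
    # Internal critic used here: +1 when the current subgoal g is attained,
    # -1 for any other state transition.
    return 1.0 if subgoal_reached(s_next, g) else -1.0

def controller_target(r_tilde, q_next, gamma=0.99):
    # y = r~ + gamma * max_a' q(s', g, a'; w), the target in Equation (2);
    # q_next is the vector of controller values q(s', g, .; w).
    return r_tilde + gamma * np.max(q_next)

def meta_controller_target(G, Q_next, gamma=0.99):
    # Y = G + gamma * max_g' Q(s', g'; W), the target in Equation (1), where G
    # is the external return accumulated between consecutive subgoal selections.
    return G + gamma * np.max(Q_next)
```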
Unsupervised Subgoal Discovery

The performance of the meta-controller/controller framework depends critically on selecting good candidate subgoals for the meta-controller to consider. What is a subgoal? In our framework, a subgoal is a state, or a set of related states, that satisfies at least one of these conditions:

1. It is close (in terms of actions) to a rewarding state. For example, in the rooms task in Figure 2(a), the key and lock are rewarding states.
2. It represents a set of states, at least some of which tend to be along a state transition path to a rewarding state.

For example, in the rooms task, the red room should be visited to move from the purple room to the blue room in order to pick up the key. Thus any state in the red room is a reasonably good subgoal for an agent currently in the purple room. Similarly, the states in the blue room are all reasonably good subgoals for an agent currently in the red room. The doorways between rooms can also be considered as good subgoals, since entering these states allows for the transition to a set of states that may be closer to rewarding states.

Our strategy involves leveraging the set of recent transition experiences that must be recorded for value function learning, regardless. Unsupervised learning methods applied to sets of experiences can be used to identify sets of states that may be good subgoal candidates. We focus specifically on two kinds of analysis that can be performed on the set of transition experiences. We hypothesize that good subgoals might be found by (1) attending to the states associated with anomalous transition experiences and (2) clustering experiences based on a similarity measure and collecting the set of associated states into a potential subgoal. Thus, our proposed method merges anomaly (outlier) detection with the K-means clustering of experiences.

Anomaly Detection

The anomaly (outlier) detection process identifies states associated with experiences that differ significantly from the others. In the context of subgoal discovery, a relevant anomalous experience would be one that includes a substantial positive reward in an environment in which reward is sparse. We propose that the states associated with these experiences make for good candidate subgoals. For example, in the rooms task, transitions that arrive at the key or the lock are quite dissimilar to most transitions, due to the large positive reward that is received at that point. Since the goal of RL is maximizing accumulated (discounted) reward, these anomalous experiences, involving large rewards, are ideal as subgoal candidates. (Experiences involving large negative rewards are also anomalous, but make for poor subgoals. As long as these sorts of anomalies do not greatly outnumber others, we expect that the meta-controller can efficiently learn to avoid poor subgoal choices.) Large changes in state features can also be marked as anomalous. In some computer games, like Montezuma's Revenge, each screen represents a room, and the screen changes quickly when the agent moves from one room to another. This produces a large distance between two consecutive states. Such a transition can be recognized simply by the large instantaneous change in state features, marking the associated states as reasonable candidate subgoals. There is a large literature on anomaly detection (Hodge and Austin 2004), in general, offering methods for applying this insight. Heuristic meta-parameter thresholds can be used to identify dissimilarities that warrant special attention, or unsupervised machine learning methods can be used to model the joint probability distribution of state variables, with low probability states seen as anomalous.

K-Means Clustering

The idea behind using a clustering algorithm is "spatial" state space abstraction and dimensionality reduction with regard to the internal representations of states. If the transition experiences in a collection are very similar to each other, this might suggest that the associated states are all roughly equally good as subgoals. Thus, rather than considering all of those states, the learning process might be made faster by considering a representative state (or smaller set of states), such as the centroid of a cluster, as a subgoal. Furthermore, using a simple clustering technique like K-means clustering to find a small number of centroids in the space of experiences is likely to produce centroid subgoals that are dissimilar from each other. Since rewards are sparse, this dissimilarity will be dominated by state features. For example, in the rooms task, the centroids of K-means clusters, with K = 4, lie close to the geometric center of each room, with the states within each room coming to belong to the corresponding subgoal's cluster. In this way, the clustering of transition experiences can approximately produce a coarser representation of state space, in this case replacing the fine grained "grid square location" with the coarser "room location".
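A compact sketch of this combination of anomaly detection and clustering, under the simplifying assumptions that states are feature vectors and that the anomaly test is a plain reward threshold (one of the heuristic options mentioned above), might look like the following; scikit-learn's KMeans stands in for the clustering step, and the threshold value is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_subgoals(D, K=4, reward_threshold=1.0, prev_centroids=None):
    # D is a list of transition experiences (s, a, r, s_next) with states as
    # 1-D feature arrays; reward_threshold is a heuristic meta-parameter.
    anomalies = [s_next for (s, a, r, s_next) in D if r > reward_threshold]
    rest = np.array([s_next for (s, a, r, s_next) in D if r <= reward_threshold])

    # Warm-start from the previous centroids so the subgoal set changes
    # smoothly as the experience memory is refreshed (as in Algorithm 1).
    init = prev_centroids if prev_centroids is not None else "k-means++"
    n_init = 1 if prev_centroids is not None else 10
    centroids = KMeans(n_clusters=K, init=init, n_init=n_init).fit(rest).cluster_centers_

    return anomalies + list(centroids)
```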
A Unified Framework

These conceptual components can be unified into a single model-free HRL framework. The major components of this framework, and the information flow between these components, are schematically displayed in Figure 1. At time t, the meta-controller observes the state, s = s_t, from the environment and chooses a subgoal, g = g_t, either from the discovered subgoals or from a random set of states (to promote exploration). The controller receives an input tuple, (s, g), and is expected to learn to implement a subpolicy, π(a|s, g), that solves the subtask of reaching from s to g. The controller selects an action, a, based on its policy, in our case directly derived from its value function, q(s, g, a; w). After one step, the environment updates the state to s' and sends a reward r. The transition experience (s, g, a, r̃, s') is then stored in the experience memory for the controller, D1. If the internal critic detects that the resulting state, s', is the current goal, g, the experience (s_t, g, G, s_t') is stored in the meta-controller experience memory, D2, where s_t is the state that prompted the selection of the current subgoal, and s_t' = s_{t+T} is the state when the meta-controller assigns the next subgoal, g' = g_t'. The experience memory sets are typically used to train the value function approximators for the meta-controller and the controller by sampling a random minibatch of recent experiences. The subgoal discovery mechanism exploits the underlying structure in the experience memory sets using unsupervised anomaly detection and experience clustering. A detailed description of this process is outlined in Algorithm 1.

Algorithm 1: Unified Model-Free HRL Algorithm

    Initialize discovered subgoals set G ← ∅
    Initialize experience memories D, D1, and D2
    Initialize parameters w and W randomly
    Inputs: learning rate α, exploration rate ε
    Choose Phase I or Phase II
    for episode = 1, ..., M do
        Initialize state s0 ∈ S, s ← s0
        Initialize episode return G ← 0
        if Phase I of learning then
            Choose a subgoal g randomly from S
        else if Phase II of learning then
            g ← EPSILON-GREEDY(Q(s, G; W), ε)
        end if
        Intrinsic Motivation Learning Algorithm:
        repeat for each step t = 1, ..., T
            Compute q(s, g, a; w)
            a ← EPSILON-GREEDY(q(s, g, A; w), ε)
            Take action a, observe s' and external reward r
            Compute intrinsic reward r̃ from the internal critic
            Store (s, g, a, r̃, s') to D1
            Store (s, a, r, s') to D
            Sample J1 ⊂ D1 and compute ∇L
            Update w, w ← w − α∇L
            if Phase II of learning then
                Sample J2 ⊂ D2 and compute ∇L
                Update W, W ← W − α∇L
            end if
            s ← s', G ← G + r
            Decay exploration rate ε
        until s is terminal or subgoal g is attained
        Store (s0, g, G, s') to D2
        if Phase I of learning then
            Run Unsupervised Subgoal Discovery
        end if
    end for

    Unsupervised Subgoal Discovery Algorithm:
    for each e = (s, a, r, s') stored in D do
        if experience e is an outlier (anomaly) then
            Store s' to the subgoals set G
            Remove e from D
        end if
    end for
    Fit a K-means clustering algorithm on D, using the previous centroids as initial points
    Store the updated centroids to the subgoals set G
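As a complement to the pseudocode, the following is a rough Python sketch of a single episode of the meta-controller/controller interaction (Phase II flavour). The gradient updates on w and W are omitted, and env, meta_Q, controller_q, and critic are hypothetical interfaces assumed for illustration, not part of the implementation described in this paper.

```python
import random
import numpy as np

def run_episode(env, meta_Q, controller_q, critic, subgoals,
                D, D1, D2, epsilon=0.1, max_steps=200):
    # Assumed interfaces: env.reset() -> s; env.step(a) -> (s_next, r, done);
    # meta_Q(s) -> values over the subgoal set; controller_q(s, g) -> values
    # over actions; critic(s_next, g) -> (r_tilde, attained).
    s0 = env.reset()
    s, G_return = s0, 0.0                        # G_return: external return G

    # Meta-controller: epsilon-greedy subgoal selection over discovered subgoals.
    g = random.choice(subgoals) if random.random() < epsilon \
        else subgoals[int(np.argmax(meta_Q(s)))]

    for t in range(max_steps):
        # Controller: epsilon-greedy action selection given (s, g).
        q = controller_q(s, g)
        a = random.randrange(len(q)) if random.random() < epsilon else int(np.argmax(q))
        s_next, r, done = env.step(a)
        r_tilde, attained = critic(s_next, g)    # internal critic
        D1.append((s, g, a, r_tilde, s_next))    # controller experience memory
        D.append((s, a, r, s_next))              # memory mined for subgoal discovery
        s, G_return = s_next, G_return + r
        if done or attained:
            break

    D2.append((s0, g, G_return, s))              # meta-controller experience memory
    return G_return
```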
Experiments

We conducted simulation experiments in order to investigate the ability of our unsupervised subgoal discovery method to discover useful subgoals, as well as the efficiency of our unified model-free hierarchical reinforcement learning framework. The simulations were conducted in two environments with sparse delayed feedback: a variant of the rooms task, shown in Figure 2(a), and the Montezuma's Revenge game, shown in Figure 3(a).

In these simulations, learning occurred in two phases. In Phase I of learning, the controller learned how to navigate in the state space, from any arbitrary state to any other state, using intrinsic motivation learning. The agent's trajectories, (s, a, r, s'), during this pretraining phase were stored in D, and our unsupervised subgoal discovery method extracted the structure of D by performing K-means clustering and anomaly detection. The discovered subgoals were stored in G. In Phase II of learning, the meta-controller learned meta-policies over G, and the controller had opportunities to refine its ability to navigate from any arbitrary state, s, to any assigned subgoal, g ∈ G. We separated the learning process into phases for two reasons: (1) to focus on the knowledge extracted from game environments using our approach to unsupervised subgoal discovery during intrinsic motivation learning; (2) to avoid complications that arise in meta-controller learning when the set of subgoals, G, is not fixed. Integrating these phases into a uniform and incremental HRL process is a central aspect of our future work.

4-Room Task with Key and Lock

Consider the task of navigation in the 4-room environment with a key and a lock, as shown in Figure 2(a). This task was inspired by the rooms environment introduced by Sutton et al. (1999), but it is much more complex. The agent not only needs to learn how to navigate from any arbitrary state to any other state, but it also needs to visit some states in a specific temporal order. At the beginning of each episode, the agent is initialized in an arbitrary location in an arbitrary room. The agent has four possible move actions, A = {North, South, East, West}, on each time step. The agent receives a reward of r = +10 for reaching the key and r = +40 if it moves to the lock while carrying the key (i.e., any time after visiting the key location during the same episode). Bumping into a wall boundary is punished with a reward of r = −2. There is no reward for just exploring the space. Learning in this environment with sparse delayed feedback is challenging for a reinforcement learning agent. To successfully solve the task, the agent should represent knowledge at multiple levels of spatial and temporal abstraction. The agent should also learn to explore the environment efficiently.
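A minimal sketch of such an environment is given below. The reward scheme (+10 for the key, +40 for the lock after the key, −2 for bumping into a wall) follows the description above, but the grid dimensions, door placements, and default key/lock locations are illustrative assumptions, since the exact layout is only shown in Figure 2(a).

```python
import numpy as np

class FourRoomsKeyLock:
    """Illustrative 4-room grid-world with a key and a lock (not the exact
    environment used in the experiments)."""
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}  # North, South, East, West

    def __init__(self, key=(1, 1), lock=(9, 9)):
        self.walls = self._build_walls(size=11)
        self.key, self.lock = key, lock
        self.reset()

    def _build_walls(self, size):
        walls = np.zeros((size, size), dtype=bool)
        walls[0, :] = walls[-1, :] = walls[:, 0] = walls[:, -1] = True
        walls[size // 2, :] = walls[:, size // 2] = True
        for r, c in [(size // 2, 2), (size // 2, 8), (2, size // 2), (8, size // 2)]:
            walls[r, c] = False                   # doorways between the four rooms
        return walls

    def reset(self):
        free = np.argwhere(~self.walls)
        self.pos = tuple(free[np.random.randint(len(free))])  # arbitrary start
        self.has_key = False
        return self.pos

    def step(self, a):
        dr, dc = self.MOVES[a]
        nxt = (self.pos[0] + dr, self.pos[1] + dc)
        if self.walls[nxt]:
            return self.pos, -2.0, False          # bumped into a wall boundary
        self.pos, r, done = nxt, 0.0, False
        if nxt == self.key and not self.has_key:
            self.has_key, r = True, 10.0          # picked up the key
        elif nxt == self.lock and self.has_key:
            r, done = 40.0, True                  # reached the lock with the key
        return self.pos, r, done
```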
We first examined the unsupervised subgoal discovery algorithm over the course of a random walk. The agent was allowed to explore the 4-room environment for 10,000 episodes. Each episode ended either when the task was completed or after reaching a maximum time step limit of 200. The agent's experiences, e = (s, a, r, s'), were collected in an experience memory, D. The stream of external rewards for each transition was used to detect anomalous subgoals (Figure 2(c)). We applied a heuristic anomaly detection method for the streaming rewards that was able to differentiate between the rare large positive rewards and the regular small ones. These peaks, as shown in Figure 2(c), corresponded to the experiences in which the key was reached (r = +10) or the experience of reaching the lock after obtaining the key.

We also applied a K-means clustering algorithm to the experience memory. (See Algorithm 1.) The centroids of the K-means clusters (with K = 4) are plotted in Figure 2(b). We found these centroids to roughly correspond to the centers of the rooms. (We experimented with K = 8 and saw equally useful clusters, with each room containing two cluster centroids. We will investigate methods for choosing K in future work.) The clusters, along with the anomalous states, were collected into the set of subgoals.

Figure 2: (a) The 4-room task with a key and a lock. (b) The results of the unsupervised subgoal discovery algorithm, with anomalies marked with black Xs and centroids with colored ones. (c) Reward over an episode, with anomalous points corresponding to the key (r = +10) and the lock (r = +40). (d) The average success of the controller in reaching subgoals during intrinsic motivation learning in the pretraining phase. (e) The average episode return. (f) The rate of solving the 4-room task.

Phase I learning consisted of 100,000 episodes. Value function approximators were implemented as multi-layer artificial neural networks augmented to encourage the learning of sparse internal representations of states (Rafati and Noelle 2015). The controller network, q(s, g, a; w), took the state, s, and the goal, g, as inputs. States were presented to the network as Cartesian coordinates, with separate pools of inputs for each of the two dimensions. During this learning phase, the subgoal was initially chosen randomly from the state space. After unsupervised subgoal discovery, the subgoal was chosen randomly from the subgoals set, G. In this way, the controller was trained to navigate from any arbitrary state, s, to any subgoal state. When a centroid was selected as a subgoal, if the agent entered any state in the corresponding cluster, the subgoal was considered attained. Thus, the controller essentially learned how to navigate from any location to any state cluster (room) and also to any of the anomalous subgoals (key and lock). The learning rate was α = 0.001, the discount factor was γ = 0.99, and the exploration rate was set to ε = 0.2. The average success rate (over 10 consecutive episodes) for the first phase of intrinsic motivation learning is shown in Figure 2(d).

Phase II learning involved training both the meta-controller and the controller, together, using the discovered subgoals, G. (See Algorithm 1.) The subgoal set regularly came to contain a total of 6 subgoals: 2 anomalous ones and 4 centroids. The system was trained for 100,000 episodes. The meta-controller's value function approximation network consisted of two layers. The first layer was a one-hot encoding of the conjunction of the current subgoal and the state, computed by converting the state to the index of the corresponding subgoal. This was connected directly to the output layer. The average return, over 10 consecutive episodes, is shown in Figure 2(e). The agent very quickly converged on the optimal policies and collected the maximum reward (+50). The high exploration rate, ε = 0.2, caused high stochasticity, but the meta-controller and controller could robustly solve the task on more than 90% of the episodes very early in training. After about 40,000 episodes, the success rate was 100%, as shown in Figure 2(f).
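One way to read the two-layer meta-controller description above is as a lookup table over (state region, subgoal) pairs: the state is reduced to the index of its nearest discovered subgoal region, that index is one-hot encoded, and a single linear layer maps the encoding to a value per candidate subgoal. The sketch below follows that reading; the exact conjunctive encoding and update details are assumptions, not the implementation used here.

```python
import numpy as np

class MetaControllerQ:
    """Illustrative table-style realization of Q(s, g; W) for the rooms task."""

    def __init__(self, n_subgoals, lr=0.001, gamma=0.99):
        self.W = np.zeros((n_subgoals, n_subgoals))   # rows: state region, cols: subgoal
        self.lr, self.gamma = lr, gamma

    def region(self, s, centroids):
        # Convert the state to the index of the corresponding subgoal region.
        return int(np.argmin([np.linalg.norm(np.subtract(s, c)) for c in centroids]))

    def values(self, s, centroids):
        return self.W[self.region(s, centroids)]       # Q(s, .; W)

    def update(self, s, g, G, s_next, centroids):
        # One TD step toward the Equation (1) target Y = G + gamma * max_g' Q(s', g'; W);
        # g is the index of the subgoal that was selected in state s.
        i = self.region(s, centroids)
        Y = G + self.gamma * np.max(self.values(s_next, centroids))
        self.W[i, g] += self.lr * (Y - self.W[i, g])
```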
We compared the learning efficiency of our unified HRL method with the performance resulting from training a value approximation network with a regular, non-hierarchical RL algorithm, TD SARSA (Sutton and Barto 1998). The function approximator that we used for Q(s, a; w) matched that of the controller, equating for computational resources, and we used the same values for the training hyper-parameters. The regular RL agent could only reach the key before becoming stuck in that region, due to the high local reward. Despite the very high exploration rate used, the regular RL agent was not motivated to explore the rest of the state space to reach the lock and solve the task. Results are shown in Figure 2(e) and (f) (red plots).

It is worth noting that this task involves a partially observable Markov decision process (POMDP), because information about whether or not the agent has the key is not visible in the state. This hidden state information poses a serious problem for standard RL algorithms, but our HRL agent was able to overcome this obstacle. Through Phase II learning, the hidden information became implicit in the selected subgoal, with the meta-controller changing the current subgoal once the key is obtained. In this way, our HRL framework is able to succeed in task environments that are effectively outside of the scope of standard RL approaches.

Montezuma's Revenge

We applied our HRL approach to the first room of the game Montezuma's Revenge. (See Figure 3(a).) The game is well-known as a challenge for RL agents because it requires solving many subtasks while avoiding traps. Having only sparse delayed reward feedback to drive learning makes this RL problem extremely difficult. The agent should learn to navigate the man in red to the key by: (1) climbing the middle ladder, (2) climbing the bottom right ladder, (3) climbing the bottom left ladder, and (4) moving to the key. After picking up the key (r = +100), the agent should return, reversing the previous action sequence, and attempt to reach the door (r = +300) and exit the room. The moving skull at the bottom of the screen, which ends an episode upon contact, makes obtaining the key extremely difficult. The episode also ends unsuccessfully if the man falls off of a platform.

DeepMind's Deep Q-Learning (DQN) algorithm (Mnih et al. 2015), which surpassed human performance on many ATARI 2600 games, failed to learn this game since the agent did not reach any rewarding state during exploration. In this problem, the agent requires the skills arising from intrinsic motivation learning in order to explore the environment in a more efficient way (Kulkarni et al. 2016). Our HRL approach supports the learning of such skills. As before, the meta-controller and the controller were trained in two phases. In Phase I, the controller was trained to move the man from any location in the given frame, s, to any other location specified in a subgoal frame, g. An initial set of "interesting" subgoal locations was identified using a custom edge detection algorithm, avoiding empty regions as subgoals. Unsupervised object detection using computer vision algorithms can be challenging (Kulkarni et al. 2016; Fragkiadaki et al. 2015). We made the simplifying assumption that, in many games, edges were suggestive of objects, and the locations of objects made for good initial subgoals. These locations were used in Phase I of training to train the controller through intrinsic motivation. Note that edge detection was only performed to identify Phase I subgoals. Specifically, it was not used to change or augment the state representation in any way.
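The custom edge detection algorithm is not specified here, so the following is only a rough sketch of the general idea that edge-dense (and hence non-empty) regions of a grayscale frame make reasonable initial subgoal locations. The finite-difference gradient detector, patch size, and thresholds are all illustrative assumptions.

```python
import numpy as np

def candidate_subgoal_locations(frame, patch=8, density_threshold=0.1):
    # frame: 2-D grayscale array (e.g., an 84x84 game frame).
    gray = frame.astype(np.float32)
    gx = np.abs(np.diff(gray, axis=1, prepend=gray[:, :1]))
    gy = np.abs(np.diff(gray, axis=0, prepend=gray[:1, :]))
    edges = (gx + gy) > 32.0                      # crude edge map (threshold is arbitrary)

    locations = []
    h, w = edges.shape
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            # Keep patch centers with enough edge pixels, skipping empty regions.
            if edges[r:r + patch, c:c + patch].mean() > density_threshold:
                locations.append((r + patch // 2, c + patch // 2))
    return locations
```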
We used a variant of the DQN deep Convolutional Neural Network (CNN) architecture (Figure 3(b)) for approximation of the controller's value function, q(s, g, a; w). The input to the controller network consisted of four consecutive frames of size 84 × 84, encoding the state, s, and an additional binary mask frame encoding the subgoal, g. The concatenated state and subgoal frames were passed to the network, and the controller then selected one of 18 different joystick actions based on a policy derived from q(s, g, a; w).
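A sketch of one plausible network of this kind is shown below in PyTorch: the four state frames plus the binary subgoal mask are stacked as five input channels, and the convolutional and fully connected sizes follow the standard DQN architecture (Mnih et al. 2015). Treating the subgoal mask as a fifth channel is an assumption about the exact concatenation, not a description of the network actually trained here.

```python
import torch
import torch.nn as nn

class ControllerCNN(nn.Module):
    """Illustrative DQN-style network for q(s, g, a; w)."""

    def __init__(self, n_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(5, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 feature maps from 84x84 input
            nn.Linear(512, n_actions),               # one value per joystick action
        )

    def forward(self, frames, subgoal_mask):
        # frames: (batch, 4, 84, 84); subgoal_mask: (batch, 1, 84, 84)
        x = torch.cat([frames, subgoal_mask], dim=1)
        return self.head(self.features(x))
```

For example, ControllerCNN()(torch.zeros(1, 4, 84, 84), torch.zeros(1, 1, 84, 84)) yields a (1, 18) tensor of action values.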
During intrinsic motivation learning, the recent experiences were saved in an experience memory, D, with a size of 10^6. In order to support comparison to previously published results, we used the same learning parameters as DeepMind's DQN (Mnih et al. 2015). Specifically, the learning rate, α, was set to 0.00025, with a discount rate of γ = 0.99. During Phase I learning, we trained the network for a total of 2.5 × 10^6 time steps. The exploration probability parameter, ε, decreased from 1.0 to 0.1 in the first million steps and remained fixed after that. After every 100,000 time steps, we applied our unsupervised subgoal discovery method to the contents of the experience memory in order to find new subgoals, both anomalies and centroids, using K-means clustering with K = 10. As shown in Figure 3(d), the unsupervised learning algorithm managed to discover the locations of the key and the doors in this way. It also identified useful objects such as ladders, platforms, and the rope.

In Phase II, we trained the meta-controller and the controller jointly. We used an architecture based on the DQN CNN (Mnih et al. 2015), as shown in Figure 3(c), for the meta-controller's value function, Q(s, g; W). We used the non-overlapping discovered subgoals, which resulted in a set of 11 subgoals, G. At the beginning of each episode, the meta-controller assigned a subgoal, g ∈ G, based on an epsilon-greedy policy derived from Q(s, g; W). The controller then attempted to reach these subgoals. The controller's experience memory, D1, had a size of 10^6, and the size of the meta-controller's experience memory, D2, was 5 × 10^4. The cumulative reward for the game episodes is shown in Figure 3(e). After about 1.5 million time steps, the controller managed to reach the key subgoal more frequently. The success of the intrinsic motivation learning is depicted in Figure 3(f). At the end of the second phase of learning (after 2.5 million learning steps), the meta-controller regularly chose the proper subgoals for collecting the maximum reward (+400).

Figure 3: (a) A sample screen from the ATARI 2600 game Montezuma's Revenge. (b) The CNN architecture for the controller's value function. (c) The CNN architecture for the meta-controller's value function. (d) The results of the unsupervised subgoal discovery algorithm; the blue circles are the discovered anomalous subgoals, and the red ones are the centroid subgoals. (e) The average return over 10 episodes during the second phase of learning. (f) The success of the controller in reaching the subgoals during the second phase of learning.

Conclusion

We have proposed and demonstrated a novel model-free HRL method for subgoal discovery using unsupervised learning over a small memory of the most recent experiences (trajectories) of the agent. When combined with an intrinsic motivation learning mechanism, this method learns subgoals and skills together, based on experiences in the environment. Thus, we offer an HRL approach that does not require a model of the environment, making it suitable for larger-scale applications.

Acknowledgments

We acknowledge the valuable comments of the anonymous reviewers of this paper. The computations were conducted on the MERCED cluster at UC Merced, which was funded by National Science Foundation Grant No. ACI-1429783.

References

Şimşek, O.; Wolfe, A. P.; and Barto, A. G. 2005. Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the 22nd International Conference on Machine Learning, 816–823.

Fragkiadaki, K.; Arbelaez, P.; Felsen, P.; and Malik, J. 2015. Learning to segment moving objects in videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4083–4090.

Goel, S., and Huber, M. 2003. Subgoal discovery for hierarchical reinforcement learning using learned policies. In FLAIRS Conference, 346–350. AAAI Press.

Hodge, V. J., and Austin, J. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22(2):85–126.
Kulkarni, T. D.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 3675–3683.

Liang, Y.; Machado, M. C.; Talvitie, E.; and Bowling, M. H. 2016. State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 17–25.

Machado, M. C., and Bowling, M. 2016. Learning purposeful behaviour in the absence of rewards. Presented at the ICML-16 Workshop on Abstraction in Reinforcement Learning.

Machado, M. C.; Bellemare, M. G.; and Bowling, M. H. 2017. A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2295–2304.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.

Pickett, M., and Barto, A. G. 2002. PolicyBlocks: An algorithm for creating useful macro-actions in reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, 506–513.

Rafati, J., and Noelle, D. C. 2015. Lateral inhibition overcomes limits of temporal difference learning. In 37th Annual Cognitive Science Society Meeting, Pasadena, CA, USA.

Rafati, J., and Noelle, D. C. 2017. Sparse coding of learned state representations in reinforcement learning. In Conference on Cognitive Computational Neuroscience, New York City, NY, USA.

Singh, S.; Lewis, R. L.; Barto, A. G.; and Sorg, J. 2010. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development 2(2):70–82.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1st edition.

Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1):181–211.

Tesauro, G. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM 38(3).

Thrun, S., and Schwartz, A. 1995. Finding structure in reinforcement learning. In Advances in Neural Information Processing Systems 7, 385–392. MIT Press.

Vezhnevets, A. S.; Osindero, S.; Schaul, T.; Heess, N.; Jaderberg, M.; Silver, D.; and Kavukcuoglu, K. 2017. Feudal networks for hierarchical reinforcement learning. CoRR abs/1703.01161.