=Paper=
{{Paper
|id=Vol-2600/short1
|storemode=property
|title=PODNet: A Neural Network for Discovery of Plannable Options
|pdfUrl=https://ceur-ws.org/Vol-2600/short1.pdf
|volume=Vol-2600
|authors=Ritwik Bera,Vinicius G. Goecks,Gregory M. Gremillion,John Valasek,Nicholas R. Waytowich
|dblpUrl=https://dblp.org/rec/conf/aaaiss/BeraGGVW20
}}
==PODNet: A Neural Network for Discovery of Plannable Options==
Ritwik Bera,¹ Vinicius G. Goecks,¹ Gregory M. Gremillion,² John Valasek,¹ and Nicholas R. Waytowich²,³

¹Texas A&M University, ²Army Research Laboratory, ³Columbia University

{ritwik, vinicius.goecks, valasek}@tamu.edu, {gregory.m.gremillion, nicholas.r.waytowich}.civ@mail.mil

===Abstract===
Learning from demonstration has been widely studied in machine learning but becomes challenging when the demonstrated trajectories are unstructured and follow different objectives. This short paper proposes PODNet, the Plannable Option Discovery Network, addressing how to segment an unstructured set of demonstrated trajectories for option discovery. This enables learning from demonstration to perform multiple tasks and to plan high-level trajectories based on the discovered option labels. PODNet combines a custom categorical variational autoencoder, a recurrent option inference network, an option-conditioned policy network, and an option dynamics model in an end-to-end learning architecture. Due to the concurrently trained option-conditioned policy network and option dynamics model, the proposed architecture has implications in multi-task and hierarchical learning, explainable and interpretable artificial intelligence, and applications where the agent is required to learn only from observations.

===Introduction===
Learning from demonstrations to perform a single task has been widely studied in the machine learning literature (Argall et al. 2009; Ross, Gordon, and Bagnell 2011; Ross et al. 2013; Bojarski et al. 2016; Goecks et al. 2018). In these approaches, demonstrations are carefully curated in order to exemplify a specific task to be carried out by the learning agent. The challenge arises when the demonstrator is performing more than one task, or multiple hierarchical sub-tasks of a complex objective, also called options, where the same set of observations can be mapped to different sets of actions depending on the option being performed (Sutton, Precup, and Singh 1999; Stolle and Precup 2002). This is a challenge for traditional behavior cloning techniques that focus on learning a single mapping between observations and actions in a single-option scenario.

This paper presents the Plannable Option Discovery Network (PODNet), which attempts to enable agents to learn the semantic structure behind those complex demonstrated tasks by using a meta-controller operating in the option-space instead of directly operating in the action-space. The main hypothesis is that a meta-controller operating in the option-space can achieve much faster convergence on imitation learning and reinforcement learning benchmarks than an action-space policy network due to the significantly smaller size of the option-space. Our contribution, PODNet, is a custom categorical variational autoencoder (Jang, Gu, and Poole 2016) composed of several constituent networks that not only segment demonstrated trajectories into options, but also concurrently train an option dynamics model that can be used for downstream planning tasks and for training on simulated rollouts to minimize interaction with the environment while the policy is maturing. Unlike previous imitation-learning-based approaches to option discovery, our approach does not require the agent to interact with the environment during its option discovery process, as it trains offline on behavior cloning data alone. Moreover, being able to infer the option label for the current behavior executed by the learning agent, essentially allowing the agent to broadcast the option it is currently pursuing, has implications in explainable and interpretable artificial intelligence.

===Related Work===
This work addresses how to segment an unstructured set of demonstrated trajectories for option discovery. The one-shot imitation architecture developed by Wang et al. (Wang et al. 2017) using conditional GAIL (cGAIL) maps trajectories into a set of latent codes that capture the semantics and context of the trajectories. This is analogous to word2vec (Mikolov et al. 2013) in natural language processing (NLP), where words are embedded into a vector space that preserves linguistic relationships.

In InfoGAN (Chen et al. 2016), a generative adversarial network (GAN) maximizes the mutual information between the latent variables and the observation, learning a discriminator that confidently predicts the observation labels. InfoRL (Hayat, Singh, and Namboodiri 2019) and InfoGAIL (Li, Song, and Ermon 2017) utilized the concept of mutual information maximization to map latent variables to solution trajectories (generated by RL) and expert demonstrations, respectively. Directed-InfoGAIL (Sharma et al. 2018) introduced the concept of directed information: it maximized the mutual information between the trajectory observed so far and the consequent option label. This modification to the InfoGAIL architecture allowed it to segment demonstrations and reproduce options. However, it assumed prior knowledge of the number of options to be discovered. Diversity Is All You Need (DIAYN) (Eysenbach et al. 2018) recovers distinctive sub-behaviors from random exploration by generating random trajectories and maximizing the mutual information between the states and the behavior label.

Variational Autoencoding Learning of Options by Reinforcement (VALOR) (Achiam et al. 2018) used β-VAEs (Higgins et al. 2017) to encode labels into trajectories, thus also implicitly maximizing the mutual information between behavior labels and corresponding trajectories. DIAYN's mutual information maximization objective is also implicitly solved in a β-VAE setting. Both VAEs and InfoGANs maximize mutual information between latent states and the input data; the difference is that VAEs have access to the true data distribution, while InfoGANs also have to learn to model the true data distribution. More recently, CompILE (Kipf et al. 2019) employed a VAE-based approach to infer not only option labels at every trajectory step but also option start and termination points in the given trajectory. However, once inferred to be completed, options are masked out; thus, while inferring options in the future, the agent loses track of critical options that might have happened in the past.

Most of the related works mentioned so far do not learn a dynamics model, and as a result the discovered options cannot be used for downstream planning via model-based RL techniques. In our work, we utilize the fact that the demonstration data has state-transition information embedded within the demonstration trajectories and thus can be used to learn a dynamics model while simultaneously learning options. We also present a technique to identify the number of distinguishable options to be discovered from the demonstration data.

===Plannable Option Discovery Network===
Our proposed approach, the Plannable Option Discovery Network (PODNet), is a custom categorical variational autoencoder (Jang, Gu, and Poole 2016) which consists of several constituent networks: a recurrent option inference network, an option-conditioned policy network, and an option dynamics model, as seen in Figure 1. The categorical VAE allows the network to map each trajectory segment into a latent code and intrinsically perform soft k-means clustering on the inferred option labels. The following subsections explain the constituent components of PODNet.

Figure 1: Proposed encoder-decoder architecture. Note that the Policy Network decoder could also be a recurrent neural network (RNN) if we wish to make the behavior label dependent on all preceding states and labels instead of just the previous state and corresponding behavior label.

====Constituent Neural Networks====

'''Recurrent option inference network.''' In a complex task, the choice of an option at any time depends on both the current state and a history of the current and previous options that have been executed. For example, in a door-opening task, an agent would decide to open a door only if it had already fetched the key earlier. We utilize a recurrent encoder using long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) to ensure that the current option's dependence on both the current state and the preceding options is captured. This helps overcome the problem where different options that contain similar or overlapping states are mapped to the same option label, as was observed in DIAYN (Eysenbach et al. 2018). Our option inference network P is an LSTM that takes as input the current state s_t as well as the previous option label c_{t-1} and predicts the option label c_t for time step t.
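To make the interface concrete, here is a minimal sketch, assuming a PyTorch implementation with illustrative layer sizes and names (none of which are specified in the paper), of an option inference network that consumes the current state and the previous one-hot option label and emits logits over K option labels.

<syntaxhighlight lang="python">
# Illustrative sketch of the recurrent option inference network P
# (assumed hyperparameters and names; not the authors' code).
import torch
import torch.nn as nn

class OptionInferenceNetwork(nn.Module):
    """LSTM that maps (s_t, c_{t-1}) to logits over K option labels."""

    def __init__(self, state_dim: int, num_options: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim + num_options, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_options)

    def forward(self, states, prev_options, hidden=None):
        # states: (batch, T, state_dim); prev_options: (batch, T, num_options), one-hot
        x = torch.cat([states, prev_options], dim=-1)
        out, hidden = self.lstm(x, hidden)
        logits = self.head(out)  # (batch, T, num_options)
        return logits, hidden
</syntaxhighlight>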
'''Option-conditioned policy network.''' Approaches such as InfoGAIL (Li, Song, and Ermon 2017) achieve the disentanglement into latent variables by imitating the demonstrated trajectories while having access only to the inferred latent variable and not the demonstrator actions. We achieve this goal by concurrently training an option-conditioned policy network π that takes in the current predicted option c_t as well as the current state s_t and predicts the action a_t that minimizes the behavior cloning loss L_BC on the demonstration trajectories.

'''Option dynamics model.''' The main novelty of PODNet is the inclusion of an option dynamics model. The option dynamics model Q takes as input the current state s_t and option label c_t and predicts the next state s_{t+1}. In other words, the option dynamics model is an option-conditioned state-transition function, i.e., a dynamics model that depends on the current option being executed instead of on the current action, as traditional state-transition models do. The option dynamics model is trained simultaneously with the policy and option inference networks by adding the option dynamics consistency loss to the overall training objective. The benefit of training an option dynamics model in this way is twofold: first, it ensures that the system dynamics can be completely defined by the option label, potentially allowing for easier recovery of option labels; second, it ensures that the recovered option labels c_t allow for modeling the environment dynamics in terms of the options themselves. This not only provides the ability to incorporate planning, but also allows planning to be performed at the option level instead of the action level, which allows for more efficient planning on longer time-scales.
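The option-conditioned policy network π and the option dynamics model Q can likewise be sketched as small feed-forward networks conditioned on the one-hot option label; the architectures and sizes below are illustrative assumptions, not the authors' implementation.

<syntaxhighlight lang="python">
# Illustrative sketches of the option-conditioned policy network (pi) and the
# option dynamics model (Q); layer widths and names are assumed, not taken from the paper.
import torch
import torch.nn as nn

class OptionConditionedPolicy(nn.Module):
    """pi(s_t, c_t) -> a_t, trained with the behavior cloning loss L_BC."""

    def __init__(self, state_dim: int, num_options: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_options, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, option):
        return self.net(torch.cat([state, option], dim=-1))

class OptionDynamicsModel(nn.Module):
    """Q(s_t, c_t) -> s_{t+1}: option-conditioned state-transition model."""

    def __init__(self, state_dim: int, num_options: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_options, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, option):
        return self.net(torch.cat([state, option], dim=-1))
</syntaxhighlight>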
===Training===
The training process occurs offline and starts by collecting a dataset D consisting of unstructured demonstrated trajectories, which can be generated from any source, for example, human experts, optimal controllers, or pre-trained reinforcement learning agents. The overall training loss function is given as

\[
\mathcal{L}(\theta,\phi,\psi) = \mathbb{E}_{\pi_E}\Big[\,\mathbb{E}_{c_t \sim P_\psi(\cdot \mid s_t, c_{t-1})}\big[(s_{t+1} - Q_\phi(s_t, c_t))^2 + (a_t - \pi_\theta(s_t, c_t))^2\big] - \beta\, D_{\mathrm{KL}}\big(P_\psi(c_t \mid s_t, c_{t-1}) \,\|\, p(c)\big)\Big].
\]

Hence,

\[
\mathcal{L}(\theta,\phi,\psi) = \mathcal{L}_{\mathrm{ODC}} + \mathcal{L}_{\mathrm{BC}} + \mathcal{L}_{\mathrm{entropy}},
\]

where \(\mathcal{L}_{\mathrm{ODC}}\) is the option dynamics consistency loss, \(\mathcal{L}_{\mathrm{BC}}\) is the behavior cloning loss, and \(\mathcal{L}_{\mathrm{entropy}}\) is the entropy regularization term.
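The sketch below shows how this combined objective could be assembled for a batch of transitions, assuming the hypothetical networks sketched earlier, mean-squared-error terms for L_ODC and L_BC, and a uniform categorical prior p(c); adding the β-weighted KL-to-uniform term follows the entropy-regularization description given later in this section, and all names and defaults are illustrative.

<syntaxhighlight lang="python">
# Illustrative computation of the PODNet training objective for one batch
# (assumed shapes, helper networks, and weighting; not the authors' code).
import torch
import torch.nn.functional as F

def podnet_loss(option_logits, option_onehot, states, actions, next_states,
                policy, dynamics, beta: float = 1e-3):
    """L = L_ODC + L_BC + beta * KL(P_psi(c_t | s_t, c_{t-1}) || uniform prior)."""
    # Option dynamics consistency loss: (s_{t+1} - Q(s_t, c_t))^2
    pred_next = dynamics(states, option_onehot)
    l_odc = F.mse_loss(pred_next, next_states)

    # Behavior cloning loss: (a_t - pi(s_t, c_t))^2
    pred_action = policy(states, option_onehot)
    l_bc = F.mse_loss(pred_action, actions)

    # KL divergence of the inferred categorical distribution to a uniform prior p(c)
    log_p = F.log_softmax(option_logits, dim=-1)
    k = option_logits.shape[-1]
    kl_to_uniform = (log_p.exp() * (log_p + torch.log(torch.tensor(float(k))))).sum(-1).mean()

    return l_odc + l_bc + beta * kl_to_uniform
</syntaxhighlight>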
'''Ensuring smooth backpropagation.''' To ensure that gradients flow through differentiable functions only during backpropagation, c_t is represented by a Gumbel-Softmax distribution, as illustrated in the literature on categorical VAEs (Jang, Gu, and Poole 2016). Using argmax to select the option with the highest conditional probability would introduce a discrete operation into the neural network and prohibit backpropagation in PODNet. Softmax is therefore only used during the backward pass to allow backpropagation; for the forward pass, the softmax output is further subjected to the argmax operator to obtain a one-hot encoded label vector.
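A straight-through Gumbel-Softmax sampler of the kind described above might look as follows; this is a generic sketch with an assumed temperature and naming, not code from the paper. PyTorch also provides torch.nn.functional.gumbel_softmax(logits, hard=True), which implements the same estimator.

<syntaxhighlight lang="python">
# Illustrative straight-through Gumbel-Softmax sampling of the option label c_t:
# hard one-hot in the forward pass, soft gradients in the backward pass.
import torch
import torch.nn.functional as F

def sample_option(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    soft = F.softmax((logits + gumbel) / temperature, dim=-1)  # used for gradients
    hard = F.one_hot(soft.argmax(dim=-1), logits.shape[-1]).type_as(soft)
    # Forward pass sees the one-hot label; gradients flow through `soft`.
    return hard + soft - soft.detach()
</syntaxhighlight>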
'''Entropy regularization.''' The categorical distribution arising from the encoder network is forced to have minimal KL divergence with a uniform categorical distribution. This is done to ensure that all inputs are not encoded into the same sub-behavior cluster but are meaningfully separated into distinct clusters. Entropy-driven regularization encourages exploration of the label space, and this exploration can be modulated by tuning the hyperparameter β.

'''Downsampling of demonstration data.''' For accurate prediction of option labels that concur with human intuition, it is important to downsample the state sequences, since high-level dynamic changes occur at low frequency. Downsampling also decreases training time because fewer samples are processed.

'''Prediction horizon.''' To ensure that the option dynamics model does not simply learn an identity projection, the dynamics model is made to predict more than one time step ahead. This prediction-horizon hyperparameter can be tuned manually depending on the situation.

'''Discovery of the number of options.''' The number of options can be obtained by holding out part of the demonstrations and evaluating the behavior cloning loss L_BC on it, similar to how a validation loss is used. We start with an initial number of options, K, to be discovered and increment or decrement it to move towards decreasing L_BC.

===Planning Option Sequences===
Figure 2: Complete PODNet diagram illustrating how the option dynamics model is integrated with meta-controllers to plan trajectories.

Although the main motivation for PODNet is to segment unstructured trajectories, the learned option dynamics model combined with the option-conditioned policy network can be used for planning option sequences. As shown in Figure 2, the option dynamics model learned with PODNet can be integrated with meta-controllers to plan trajectories. Given a goal state s_goal, the meta-controller simulates trajectories using the option dynamics model and outputs the best estimated sequence of options to achieve the goal state. This sequence is then passed to the option-conditioned policy network, which outputs the sequence of estimated actions required to follow the planned option sequence.
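As one hypothetical realization of such a meta-controller, a simple random-shooting planner could roll candidate option sequences through the option dynamics model and keep the sequence whose predicted terminal state lands closest to the goal; the search strategy, horizon, and distance metric below are assumptions for illustration and are not prescribed by the paper.

<syntaxhighlight lang="python">
# Illustrative random-shooting meta-controller over option sequences
# (hypothetical planner; the paper does not specify a particular search method).
import torch
import torch.nn.functional as F

def plan_option_sequence(dynamics, s0, s_goal, num_options, horizon=10, num_candidates=256):
    """Return the candidate option sequence whose predicted final state is closest to s_goal."""
    best_seq, best_dist = None, float("inf")
    for _ in range(num_candidates):
        seq = torch.randint(num_options, (horizon,))
        state = s0
        for c in seq:  # roll out Q(s_t, c_t) -> s_{t+1} for each option in the sequence
            onehot = F.one_hot(c, num_options).float().unsqueeze(0)
            state = dynamics(state, onehot)
        dist = torch.norm(state - s_goal).item()
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq
</syntaxhighlight>

The selected sequence would then be handed, option by option, to the option-conditioned policy network to generate the corresponding actions, as described above.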
===Conclusion===
In this paper we presented PODNet, a neural network architecture for the discovery of plannable options. Our approach combines a custom categorical variational autoencoder, a recurrent option inference network, an option-conditioned policy network, and an option dynamics model for end-to-end training and segmentation of an unstructured set of demonstrated trajectories for option discovery. PODNet's architecture implicitly utilizes prior knowledge that options are dynamically consistent (plannable and representable by a skill dynamics model), temporally extended, and definitive of the agent's actions at a particular state (as enforced by the option-conditioned policy network). This leads to the discovery of plannable options that enable predictable behavior in AI agents when they adapt to newer tasks in a transfer learning setting. The proposed architecture has implications in multi-task and hierarchical learning, and in explainable and interpretable artificial intelligence.

===References===
* Achiam, J.; Edwards, H.; Amodei, D.; and Abbeel, P. 2018. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299.
* Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5):469–483.
* Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L. D.; Monfort, M.; Muller, U.; Zhang, J.; et al. 2016. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
* Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2172–2180.
* Eysenbach, B.; Gupta, A.; Ibarz, J.; and Levine, S. 2018. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070.
* Goecks, V. G.; Gremillion, G. M.; Lawhern, V. J.; Valasek, J.; and Waytowich, N. R. 2018. Efficiently combining human demonstrations and interventions for safe training of autonomous systems in real-time. CoRR abs/1810.11545.
* Hayat, A.; Singh, U.; and Namboodiri, V. P. 2019. InfoRL: Interpretable reinforcement learning using information maximization.
* Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR 2(5):6.
* Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
* Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with Gumbel-Softmax.
* Kipf, T.; Li, Y.; Dai, H.; Zambaldi, V.; Sanchez-Gonzalez, A.; Grefenstette, E.; Kohli, P.; and Battaglia, P. 2019. CompILE: Compositional imitation learning and execution. In International Conference on Machine Learning, 3418–3428.
* Li, Y.; Song, J.; and Ermon, S. 2017. InfoGAIL: Interpretable imitation learning from visual demonstrations.
* Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
* Ross, S.; Gordon, G.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 627–635.
* Ross, S.; Melik-Barkhudarov, N.; Shankar, K. S.; Wendel, A.; Dey, D.; Bagnell, J. A.; and Hebert, M. 2013. Learning monocular reactive UAV control in cluttered natural environments. In 2013 IEEE International Conference on Robotics and Automation, 1765–1772. IEEE.
* Sharma, A.; Sharma, M.; Rhinehart, N.; and Kitani, K. M. 2018. Directed-Info GAIL: Learning hierarchical policies from unsegmented demonstrations using directed information.
* Stolle, M., and Precup, D. 2002. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, 212–223. Springer.
* Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.
* Wang, Z.; Merel, J.; Reed, S.; Wayne, G.; de Freitas, N.; and Heess, N. 2017. Robust imitation of diverse behaviors.