=Paper=
{{Paper
|id=Vol-2600/short1
|storemode=property
|title=PODNet: A Neural Network for Discovery of Plannable Options
|pdfUrl=https://ceur-ws.org/Vol-2600/short1.pdf
|volume=Vol-2600
|authors=Ritwik Bera,Vinicius G. Goecks,Gregory M. Gremillion,John Valasek,Nicholas R. Waytowich
|dblpUrl=https://dblp.org/rec/conf/aaaiss/BeraGGVW20
}}
==PODNet: A Neural Network for Discovery of Plannable Options==
Ritwik Bera,¹ Vinicius G. Goecks,¹ Gregory M. Gremillion,² John Valasek,¹ and Nicholas R. Waytowich²,³

¹Texas A&M University, ²Army Research Laboratory, ³Columbia University

{ritwik, vinicius.goecks, valasek}@tamu.edu, {gregory.m.gremillion, nicholas.r.waytowich}.civ@mail.mil

===Abstract===
Learning from demonstration has been widely studied in machine learning but becomes challenging when the demonstrated trajectories are unstructured and follow different objectives. This short paper proposes PODNet, the Plannable Option Discovery Network, addressing how to segment an unstructured set of demonstrated trajectories for option discovery. This enables learning from demonstration to perform multiple tasks and to plan high-level trajectories based on the discovered option labels. PODNet combines a custom categorical variational autoencoder, a recurrent option inference network, an option-conditioned policy network, and an option dynamics model in an end-to-end learning architecture. Due to the concurrently trained option-conditioned policy network and option dynamics model, the proposed architecture has implications in multi-task and hierarchical learning, explainable and interpretable artificial intelligence, and applications where the agent is required to learn only from observations.

===Introduction===
Learning from demonstrations to perform a single task has been widely studied in the machine learning literature (Argall et al. 2009; Ross, Gordon, and Bagnell 2011; Ross et al. 2013; Bojarski et al. 2016; Goecks et al. 2018). In these approaches, demonstrations are carefully curated in order to exemplify a specific task to be carried out by the learning agent. The challenge arises when the demonstrator is performing more than one task, or multiple hierarchical sub-tasks of a complex objective, also called options, where the same set of observations can be mapped to different sets of actions depending on the option being performed (Sutton, Precup, and Singh 1999; Stolle and Precup 2002). This is a challenge for traditional behavior cloning techniques that focus on learning a single mapping between observations and actions in a single-option scenario.

This paper presents the Plannable Option Discovery Network (PODNet), which attempts to enable agents to learn the semantic structure behind those complex demonstrated tasks by using a meta-controller operating in the option-space instead of directly operating in the action-space. The main hypothesis is that a meta-controller operating in the option-space can achieve much faster convergence on imitation learning and reinforcement learning benchmarks than an action-space policy network due to the significantly smaller size of the option-space. Our contribution, PODNet, is a custom categorical variational autoencoder (Jang, Gu, and Poole 2016) composed of several constituent networks that not only segment demonstrated trajectories into options, but also concurrently train an option dynamics model that can be used for downstream planning tasks and for training on simulated rollouts to minimize interaction with the environment while the policy is maturing. Unlike previous imitation-learning-based approaches to option discovery, our approach does not require the agent to interact with the environment during its option discovery process, as it trains offline on behavior cloning data alone. Moreover, being able to infer the option label for the current behavior executed by the learning agent, essentially allowing the agent to broadcast the option it is currently pursuing, has implications in explainable and interpretable artificial intelligence.

===Related Work===
This work addresses how to segment an unstructured set of demonstrated trajectories for option discovery. The one-shot imitation architecture developed by Wang et al. (Wang et al. 2017) using conditional GAIL (cGAIL) maps trajectories into a set of latent codes that capture the semantics and context of the trajectories. This is analogous to word2vec (Mikolov et al. 2013) in natural language processing (NLP), where words are embedded into a vector space that preserves linguistic relationships.

In InfoGAN (Chen et al. 2016), a generative adversarial network (GAN) maximizes the mutual information between the latent variables and the observation, learning a discriminator that confidently predicts the observation labels. InfoRL (Hayat, Singh, and Namboodiri 2019) and InfoGAIL (Li, Song, and Ermon 2017) utilized the concept of mutual information maximization to map latent variables to solution trajectories (generated by RL) and expert demonstrations, respectively. Directed-InfoGAIL (Sharma et al. 2018) introduced the concept of directed information: it maximized the mutual information between the trajectory observed so far and the consequent option label. This modification to the InfoGAIL architecture allowed it to segment demonstrations and reproduce options. However, it assumed prior knowledge of the number of options to be discovered. Diversity Is All You Need (DIAYN) (Eysenbach et al. 2018) recovers distinctive sub-behaviors from random exploration by generating random trajectories and maximizing the mutual information between the states and the behavior label.

Variational Autoencoding Learning of Options by Reinforcement (VALOR) (Achiam et al. 2018) used β-VAEs (Higgins et al. 2017) to encode labels into trajectories, thus also implicitly maximizing the mutual information between behavior labels and corresponding trajectories. DIAYN's mutual information maximization objective is also implicitly solved in a β-VAE setting. Both VAEs and InfoGANs maximize mutual information between latent states and the input data; the difference is that VAEs have access to the true data distribution, while InfoGANs also have to learn to model the true data distribution. More recently, CompILE (Kipf et al. 2019) employed a VAE-based approach to infer not only option labels at every trajectory step but also option start and termination points in the given trajectory. However, once inferred to be completed, options are masked out; thus, while inferring options in the future, the agent loses track of critical options that might have happened in the past.

Most of the related works mentioned so far do not learn a dynamics model, and as a result the discovered options cannot be used for downstream planning via model-based RL techniques. In our work, we utilize the fact that the demonstration data has state-transition information embedded within the demonstration trajectories and thus can be used to learn a dynamics model while simultaneously learning options. We also present a technique to identify the number of distinguishable options to be discovered from the demonstration data.

===Plannable Option Discovery Network===
Our proposed approach, the Plannable Option Discovery Network (PODNet), is a custom categorical variational autoencoder (Jang, Gu, and Poole 2016) which consists of several constituent networks: a recurrent option inference network, an option-conditioned policy network, and an option dynamics model, as seen in Figure 1. The categorical VAE allows the network to map each trajectory segment into a latent code and intrinsically perform soft k-means clustering on the inferred option labels. The following subsections explain the constituent components of PODNet.

Figure 1: Proposed encoder-decoder architecture. Note that the Policy Network decoder could also be a recurrent neural network (RNN) if we wish to make the behavior label dependent on all preceding states and labels instead of just the previous state and corresponding behavior label.

====Constituent Neural Networks====

'''Recurrent option inference network.''' In a complex task, the choice of an option at any time depends on both the current state and a history of the current and previous options that have been executed. For example, in a door-opening task, an agent would decide to open a door only if it had already fetched the key earlier. We utilize a recurrent encoder using long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) to ensure that the current option's dependence on both the current state and the preceding options is captured. This helps overcome the problem where different options that contain similar or overlapping states are mapped to the same option label, as was observed in DIAYN (Eysenbach et al. 2018). Our option inference network P is an LSTM that takes as input the current state s_t as well as the previous option label c_{t-1} and predicts the option label c_t for time step t.
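To make the interface concrete, here is a minimal sketch, assuming a PyTorch implementation with illustrative layer sizes and names (none of which are specified in the paper), of an option inference network that consumes the current state and the previous one-hot option label and emits logits over K option labels.

<syntaxhighlight lang="python">
# Illustrative sketch of the recurrent option inference network P
# (assumed hyperparameters and names; not the authors' code).
import torch
import torch.nn as nn

class OptionInferenceNetwork(nn.Module):
    """LSTM that maps (s_t, c_{t-1}) to logits over K option labels."""

    def __init__(self, state_dim: int, num_options: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim + num_options, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_options)

    def forward(self, states, prev_options, hidden=None):
        # states: (batch, T, state_dim); prev_options: (batch, T, num_options), one-hot
        x = torch.cat([states, prev_options], dim=-1)
        out, hidden = self.lstm(x, hidden)
        logits = self.head(out)  # (batch, T, num_options)
        return logits, hidden
</syntaxhighlight>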
'''Option-conditioned policy network.''' Approaches such as InfoGAIL (Li, Song, and Ermon 2017) achieve the disentanglement into latent variables by imitating the demonstrated trajectories while having access only to the inferred latent variable and not the demonstrator actions. We achieve this goal by concurrently training an option-conditioned policy network π that takes in the current predicted option c_t as well as the current state s_t and predicts the action a_t that minimizes the behavior cloning loss L_BC on the demonstration trajectories.

'''Option dynamics model.''' The main novelty of PODNet is the inclusion of an option dynamics model. The option dynamics model Q takes as input the current state s_t and option label c_t and predicts the next state s_{t+1}. In other words, the option dynamics model is an option-conditioned state-transition function, i.e., a dynamics model that depends on the current option being executed instead of on the current action, as traditional state-transition models do. The option dynamics model is trained simultaneously with the policy and option inference networks by adding the option dynamics consistency loss to the overall training objective. The benefit of training an option dynamics model in this way is twofold: first, it ensures that the system dynamics can be completely defined by the option label, potentially allowing for easier recovery of option labels; second, it ensures that the recovered option labels c_t allow for modeling the environment dynamics in terms of the options themselves. This not only provides the ability to incorporate planning, but also allows planning to be performed at the option level instead of the action level, which allows for more efficient planning on longer time-scales.
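The option-conditioned policy network π and the option dynamics model Q can likewise be sketched as small feed-forward networks conditioned on the one-hot option label; the architectures and sizes below are illustrative assumptions, not the authors' implementation.

<syntaxhighlight lang="python">
# Illustrative sketches of the option-conditioned policy network (pi) and the
# option dynamics model (Q); layer widths and names are assumed, not taken from the paper.
import torch
import torch.nn as nn

class OptionConditionedPolicy(nn.Module):
    """pi(s_t, c_t) -> a_t, trained with the behavior cloning loss L_BC."""

    def __init__(self, state_dim: int, num_options: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_options, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, option):
        return self.net(torch.cat([state, option], dim=-1))

class OptionDynamicsModel(nn.Module):
    """Q(s_t, c_t) -> s_{t+1}: option-conditioned state-transition model."""

    def __init__(self, state_dim: int, num_options: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_options, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, option):
        return self.net(torch.cat([state, option], dim=-1))
</syntaxhighlight>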
===Training===
The training process occurs offline and starts by collecting a dataset D consisting of unstructured demonstrated trajectories, which can be generated from any source, for example, human experts, optimal controllers, or pre-trained reinforcement learning agents. The overall training loss function is given as

\[
\mathcal{L}(\theta,\phi,\psi) = \mathbb{E}_{\pi_E}\Big[\,\mathbb{E}_{c_t \sim P_\psi(\cdot \mid s_t, c_{t-1})}\big[(s_{t+1} - Q_\phi(s_t, c_t))^2 + (a_t - \pi_\theta(s_t, c_t))^2\big] - \beta\, D_{\mathrm{KL}}\big(P_\psi(c_t \mid s_t, c_{t-1}) \,\|\, p(c)\big)\Big].
\]

Hence,

\[
\mathcal{L}(\theta,\phi,\psi) = \mathcal{L}_{\mathrm{ODC}} + \mathcal{L}_{\mathrm{BC}} + \mathcal{L}_{\mathrm{entropy}},
\]

where \(\mathcal{L}_{\mathrm{ODC}}\) is the option dynamics consistency loss, \(\mathcal{L}_{\mathrm{BC}}\) is the behavior cloning loss, and \(\mathcal{L}_{\mathrm{entropy}}\) is the entropy regularization term.
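The sketch below shows how this combined objective could be assembled for a batch of transitions, assuming the hypothetical networks sketched earlier, mean-squared-error terms for L_ODC and L_BC, and a uniform categorical prior p(c); adding the β-weighted KL-to-uniform term follows the entropy-regularization description given later in this section, and all names and defaults are illustrative.

<syntaxhighlight lang="python">
# Illustrative computation of the PODNet training objective for one batch
# (assumed shapes, helper networks, and weighting; not the authors' code).
import torch
import torch.nn.functional as F

def podnet_loss(option_logits, option_onehot, states, actions, next_states,
                policy, dynamics, beta: float = 1e-3):
    """L = L_ODC + L_BC + beta * KL(P_psi(c_t | s_t, c_{t-1}) || uniform prior)."""
    # Option dynamics consistency loss: (s_{t+1} - Q(s_t, c_t))^2
    pred_next = dynamics(states, option_onehot)
    l_odc = F.mse_loss(pred_next, next_states)

    # Behavior cloning loss: (a_t - pi(s_t, c_t))^2
    pred_action = policy(states, option_onehot)
    l_bc = F.mse_loss(pred_action, actions)

    # KL divergence of the inferred categorical distribution to a uniform prior p(c)
    log_p = F.log_softmax(option_logits, dim=-1)
    k = option_logits.shape[-1]
    kl_to_uniform = (log_p.exp() * (log_p + torch.log(torch.tensor(float(k))))).sum(-1).mean()

    return l_odc + l_bc + beta * kl_to_uniform
</syntaxhighlight>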
'''Ensuring smooth backpropagation.''' To ensure that gradients flow through differentiable functions only during backpropagation, c_t is represented by a Gumbel-Softmax distribution, as illustrated in the literature on categorical VAEs (Jang, Gu, and Poole 2016). Using argmax to select the option with the highest conditional probability would introduce a discrete operation into the neural network and prohibit backpropagation in PODNet. Softmax is therefore only used during the backward pass to allow backpropagation; for the forward pass, the softmax output is further subjected to the argmax operator to obtain a one-hot encoded label vector.
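A straight-through Gumbel-Softmax sampler of the kind described above might look as follows; this is a generic sketch with an assumed temperature and naming, not code from the paper. PyTorch also provides torch.nn.functional.gumbel_softmax(logits, hard=True), which implements the same estimator.

<syntaxhighlight lang="python">
# Illustrative straight-through Gumbel-Softmax sampling of the option label c_t:
# hard one-hot in the forward pass, soft gradients in the backward pass.
import torch
import torch.nn.functional as F

def sample_option(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    soft = F.softmax((logits + gumbel) / temperature, dim=-1)  # used for gradients
    hard = F.one_hot(soft.argmax(dim=-1), logits.shape[-1]).type_as(soft)
    # Forward pass sees the one-hot label; gradients flow through `soft`.
    return hard + soft - soft.detach()
</syntaxhighlight>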
'''Entropy regularization.''' The categorical distribution arising from the encoder network is forced to have minimal KL divergence with a uniform categorical distribution. This is done to ensure that all inputs are not encoded into the same sub-behavior cluster but are meaningfully separated into distinct clusters. Entropy-driven regularization encourages exploration of the label space, and this exploration can be modulated by tuning the hyperparameter β.

'''Downsampling of demonstration data.''' For accurate prediction of option labels that concur with human intuition, it is important to downsample the state sequences, since high-level dynamic changes occur at low frequency. Downsampling also decreases training time because fewer samples are processed.

'''Prediction horizon.''' To ensure that the option dynamics model does not simply learn an identity projection, the dynamics model is made to predict more than one time step ahead. This prediction-horizon hyperparameter can be tuned manually depending on the situation.

'''Discovery of the number of options.''' The number of options can be obtained by holding out part of the demonstrations and evaluating the behavior cloning loss L_BC on it, similar to how a validation loss is used. We start with an initial number of options, K, to be discovered and increment or decrement it to move towards decreasing L_BC.

===Planning Option Sequences===
Figure 2: Complete PODNet diagram illustrating how the option dynamics model is integrated with meta-controllers to plan trajectories.

Although the main motivation for PODNet is to segment unstructured trajectories, the learned option dynamics model combined with the option-conditioned policy network can be used for planning option sequences. As shown in Figure 2, the option dynamics model learned with PODNet can be integrated with meta-controllers to plan trajectories. Given a goal state s_goal, the meta-controller simulates trajectories using the option dynamics model and outputs the best estimated sequence of options to achieve the goal state. This sequence is then passed to the option-conditioned policy network, which outputs the sequence of estimated actions required to follow the planned option sequence.
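As one hypothetical realization of such a meta-controller, a simple random-shooting planner could roll candidate option sequences through the option dynamics model and keep the sequence whose predicted terminal state lands closest to the goal; the search strategy, horizon, and distance metric below are assumptions for illustration and are not prescribed by the paper.

<syntaxhighlight lang="python">
# Illustrative random-shooting meta-controller over option sequences
# (hypothetical planner; the paper does not specify a particular search method).
import torch
import torch.nn.functional as F

def plan_option_sequence(dynamics, s0, s_goal, num_options, horizon=10, num_candidates=256):
    """Return the candidate option sequence whose predicted final state is closest to s_goal."""
    best_seq, best_dist = None, float("inf")
    for _ in range(num_candidates):
        seq = torch.randint(num_options, (horizon,))
        state = s0
        for c in seq:  # roll out Q(s_t, c_t) -> s_{t+1} for each option in the sequence
            onehot = F.one_hot(c, num_options).float().unsqueeze(0)
            state = dynamics(state, onehot)
        dist = torch.norm(state - s_goal).item()
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq
</syntaxhighlight>

The selected sequence would then be handed, option by option, to the option-conditioned policy network to generate the corresponding actions, as described above.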
===Conclusion===
In this paper we presented PODNet, a neural network architecture for the discovery of plannable options. Our approach combines a custom categorical variational autoencoder, a recurrent option inference network, an option-conditioned policy network, and an option dynamics model for end-to-end training and segmentation of an unstructured set of demonstrated trajectories for option discovery. PODNet's architecture implicitly utilizes prior knowledge that options are dynamically consistent (plannable and representable by a skill dynamics model), temporally extended, and definitive of the agent's actions at a particular state (as enforced by the option-conditioned policy network). This leads to the discovery of plannable options that enable predictable behavior in AI agents when they adapt to newer tasks in a transfer learning setting. The proposed architecture has implications in multi-task and hierarchical learning, and in explainable and interpretable artificial intelligence.

===References===
* Achiam, J.; Edwards, H.; Amodei, D.; and Abbeel, P. 2018. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299.
* Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5):469–483.
* Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L. D.; Monfort, M.; Muller, U.; Zhang, J.; et al. 2016. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
* Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2172–2180.
* Eysenbach, B.; Gupta, A.; Ibarz, J.; and Levine, S. 2018. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070.
* Goecks, V. G.; Gremillion, G. M.; Lawhern, V. J.; Valasek, J.; and Waytowich, N. R. 2018. Efficiently combining human demonstrations and interventions for safe training of autonomous systems in real-time. CoRR abs/1810.11545.
* Hayat, A.; Singh, U.; and Namboodiri, V. P. 2019. InfoRL: Interpretable reinforcement learning using information maximization.
* Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR 2(5):6.
* Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
* Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with Gumbel-Softmax.
* Kipf, T.; Li, Y.; Dai, H.; Zambaldi, V.; Sanchez-Gonzalez, A.; Grefenstette, E.; Kohli, P.; and Battaglia, P. 2019. CompILE: Compositional imitation learning and execution. In International Conference on Machine Learning, 3418–3428.
* Li, Y.; Song, J.; and Ermon, S. 2017. InfoGAIL: Interpretable imitation learning from visual demonstrations.
* Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
* Ross, S.; Gordon, G.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 627–635.
* Ross, S.; Melik-Barkhudarov, N.; Shankar, K. S.; Wendel, A.; Dey, D.; Bagnell, J. A.; and Hebert, M. 2013. Learning monocular reactive UAV control in cluttered natural environments. In 2013 IEEE International Conference on Robotics and Automation, 1765–1772. IEEE.
* Sharma, A.; Sharma, M.; Rhinehart, N.; and Kitani, K. M. 2018. Directed-Info GAIL: Learning hierarchical policies from unsegmented demonstrations using directed information.
* Stolle, M., and Precup, D. 2002. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, 212–223. Springer.
* Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.
* Wang, Z.; Merel, J.; Reed, S.; Wayne, G.; de Freitas, N.; and Heess, N. 2017. Robust imitation of diverse behaviors.