V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, pp. 21-27, http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 R. Brunetto, M. Vomlelová

Acting and Bayesian reinforcement structure learning of partially observable environment

Robert Brunetto, Marta Vomlelová
Charles University, Czech Republic
robert@brunetto.cz

Abstract: This article shows how to learn both the structure and the parameters of a partially observable environment simultaneously while also performing, online, a near-optimal sequence of actions that takes the exploration-exploitation tradeoff into account. It combines two results of recent research: the former extends model-based Bayesian reinforcement learning of fully observable environments to larger domains by learning the structure; the latter shows how a known structure can be exploited for model-based Bayesian reinforcement learning of partially observable domains. This article shows that merging both approaches is possible without an excessive increase in computational complexity.

1 Introduction

Partially observable Markov decision processes (POMDPs) are a well-known framework for modeling and planning in partially observable stochastic environments [13], [8]. Such plans can be used in robotics, in dialogue management, or elsewhere. However, the range of their practical applicability is limited by several difficulties. Firstly, their usage is limited to small state spaces only (the curse of dimensionality), even when several approximations are applied [9]. Secondly, it can be difficult for an expert to specify the probabilities exactly when applying a POMDP to some domain. There have been several attempts to overcome this problem by learning the model [7], [3], [4] or by applying other techniques [2], [6]. Following recent research, this article proposes an approach that can address both problems at the same time.

Learning the model by interacting with the environment (reinforcement learning) has been researched extensively [1]. During the last decade, Jaulmes and Pineau learned unfactorized POMDPs [7] by assuming a prior probability distribution over all possible models. Poupart and Vlassis [10] then used a similar Bayesian reinforcement learning technique to learn the parameters of a dynamic Bayesian network (DBN) with a known structure. If the structure is not known, learning it is a challenging problem even in fully observable environments; Ross and Pineau [11] addressed it and proposed an online algorithm for it.

This article addresses a generalization of the tasks mentioned above. We show how to navigate an agent in an unknown stochastic, dynamic, partially observable environment using near-optimal actions. To do so, we combine the algorithm of [10] for learning the parameters of a POMDP with a known structure with the algorithm of [11] for learning the structure of an MDP.

We take the best ideas of each of them. After explaining the basic preliminaries in Section 2 and demonstrating them on an example in Section 3, we present an analytical representation of the belief over parameters (analogous to [10]) in Section 4. Similarly to [11], the belief over structures is maintained by a Markov chain Monte Carlo (MCMC) algorithm described in Section 5. This allows us to use the online action selection procedure described in Section 6. In Section 7 we review the approximations from [10] which can also be applied in our setting, making our model more tractable and compact.
2 Preliminaries

POMDPs usually model unknown environments as follows. At each time step the environment with the agent is in an unobserved state s ∈ S. At this time step the agent receives an observation o ∈ O and a reward r. Then the agent is free to choose an action a ∈ A. As time advances, the state changes to a new state s′ that stochastically depends on the previous state and the action.

We are interested in the special case of POMDPs where the state space factorizes along several state variables X ∈ X, hence each state s is a vector s = (s_1, ..., s_|X|) ∈ S = ∏_{X∈X} dom(X). Rewards and observations behave analogously: r ∈ ∏_{R∈R} dom(R) and o ∈ ∏_{O∈O} dom(O). Moreover, we can assume without loss of generality that O ⊆ X and R ⊆ X. This implies that the observation is the state restricted to the observation variables; it will be denoted o = s_O. Other restriction operations will be denoted analogously; for instance, the values of the reward variables in state s are denoted s_R.

We will always restrict vectors denoted by bold lower-case letters to sets of variables denoted by bold upper-case letters. A non-bold capital letter denotes a single variable and a non-bold lower-case letter denotes its value. In particular, when G is the graph of a Bayesian network, PA_G X denotes the set of variables that are parents of the variable X in graph G, and pa_G X denotes the values of the parents of X, where X ∈ X. (A variable V is called a parent of a variable W when there is an arrow from V to W in the graph of the Bayesian network; this is explained in greater detail in Section 3.)

When time advances to the next step, the state s changes to s′ according to an unknown transition probability P(s′|s, a). We assume that this transition probability factorizes according to a dynamic Bayesian network with unknown structure G and unknown parameters:

  P(s′|s, a) = ∏_{X∈X} P(s′_X | (s′, s, a)_{PA_G X})

A dynamic Bayesian network with structure G is described as a Bayesian network which has X′ ∪ X ∪ {A} as its nodes. By (s′, s, a)_{PA_G X} we denote (s′, s, a) restricted to the variables which are parents of the variable X according to the structure G.

To emphasize that we take P(s′_X | (s′, s, a)_{PA_G X}) as an unknown parameter, we denote it

  θ_X^{(s′,s,a)_{PA_G X}}.   (1)

It is a vector of real values between 0 and 1 summing to 1, containing for each value s′_X the probability P(s′_X | (s′, s, a)_{PA_G X}). As the indexes suggest, we have such a set of parameters for each state variable X and for each combination of values of its parents.
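To make the factorization concrete, the following minimal Python sketch evaluates P(s′|s, a) as a product of parameter entries for one fixed structure G. The variable names, the toy structure and the numbers are illustrative assumptions only, not part of the model definition above.

    # Minimal sketch (illustrative names and numbers): the factored transition
    # probability P(s'|s,a) for one fixed structure G, as a product over variables.

    # Structure G: for each next-step variable X', the ordered list of its parents.
    parents = {
        "Tiger'": ["Tiger", "A"],
        "Heard'": ["Tiger", "A"],
    }

    # Parameters: for each variable X' and combination of parent values, a
    # distribution over the values of X' -- the vector denoted in equation (1).
    theta = {
        ("Tiger'", ("left", "listen")): {"left": 1.0, "right": 0.0},
        ("Heard'", ("left", "listen")): {"left": 0.9, "right": 0.1},
        # ... one entry for every parent-value combination
    }

    def transition_prob(s_next, s, a, parents, theta):
        """P(s'|s,a) = prod_X P(s'_X | (s',s,a) restricted to PA_G X)."""
        joint = {**s, "A": a}
        for var, value in s_next.items():
            joint[var + "'"] = value
        p = 1.0
        for var, pa in parents.items():
            pa_values = tuple(joint[parent] for parent in pa)
            p *= theta[(var, pa_values)][s_next[var.rstrip("'")]]
        return p

    s = {"Tiger": "left", "Heard": "left"}
    print(transition_prob({"Tiger": "left", "Heard": "left"}, s, "listen", parents, theta))
    # 1.0 * 0.9 = 0.9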
Even though we do not assume the structure G to be known, we still assume that a prior probability distribution P(G) over structures G is known. This probability distribution can express the expert's prior knowledge, or it can simply prefer simple structures over more complicated ones. The transition probability can then be expressed as

  P(s′|s, a) = ∑_G P(G) · P(s′|s, a, G).   (2)

We assume that the parameters θ_X^{(s′,s,a)_{PA_G X}} follow the Dirichlet distribution, which is conjugate to the multinomial distribution. The hyperparameters are initially set to one, or set to values that preserve the likelihood of equivalent parameters in equivalent Bayesian networks (see [5]).

Each unknown parameter θ_X^{(s′,s,a)_{PA_G X}} can be regarded as an additional state feature. This way a POMDP with unknown parameters can be converted into a bigger POMDP without unknown parameters. A similar remark also holds for the unknown structure: it can also be regarded as a part of the state, hence a POMDP with unknown structure and parameters can be regarded as a bigger POMDP with known dynamics.

A POMDP agent usually maintains a so-called belief state b(s̃), which represents a probability distribution over possible states. In our case s̃ consists not only of the current state s but also of the graph structure and its parameters.

Surprisingly, as Poupart and Vlassis showed [10], even though there is an infinite number of possible parameters, the belief for a given structure can be maintained in a closed form. We review it in Section 4. There is only a finite number of possible graphical structures, implying that the whole belief b(s̃) can be maintained in a closed form. But the number of possible graphs may be large; that is why an approximation is introduced in Section 5.

The belief is a sufficient statistic for taking a decision. The agent's overall algorithm is summarized in Algorithm 1.

Algorithm 1: Agent's life cycle
  Data: b — initial belief over the current state, the structures and their parameters θ
  1 while not TerminationCondition() do
  2   a := selectAction(b).action;
  3   performAction(a);
  4   o := receiveObservation();
  5   b := updateBelief(b, a, o);
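As a reading aid, here is a minimal Python sketch mirroring Algorithm 1; the environment object and the helper functions are hypothetical placeholders for whatever implementations of selectAction and updateBelief are used.

    # Minimal sketch of Algorithm 1 (agent's life cycle).
    # env, select_action, update_belief and terminated are hypothetical placeholders.

    def run_agent(belief, env, select_action, update_belief, terminated):
        while not terminated():
            a = select_action(belief).action      # line 2: pick a (near-)optimal action
            env.perform_action(a)                 # line 3: act in the environment
            o = env.receive_observation()         # line 4: observe the visible variables
            belief = update_belief(belief, a, o)  # line 5: Bayesian belief update
        return belief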
3 Example

For better readability we illustrate the terms on a problem taken from [8]. Imagine the following scenario. The agent stands before two doors. Behind one of the doors is a fortune which would give the agent a big reward, but behind the second door is a tiger which would cause the agent an even higher negative reward. The agent does not know which door is which. It has three possibilities: it can walk through one door, walk through the second door, or wait and listen behind which door there is a bigger noise, and then face the same decision with the newly gained information.

This corresponds to Algorithm 1. Choosing and performing an action is on its lines 2 and 3. After the agent listens or opens a door, it knows what it has heard or whether it met a tiger or found a fortune; this is shown on line 4 of Algorithm 1. The agent then uses this observation to update its belief on line 5. The belief contains all the information the agent knows, i.e. the agent should count how many times it has heard the tiger on each side and estimate where the tiger is. It should also learn the usual tiger position and other facts about the environment's behaviour. This information is also stored and updated with the belief. We assume that after opening a door and finding a tiger or a fortune the experiment restarts itself and the tiger is again placed behind a random door. Everything repeats until a termination condition is reached (line 1). This unspecified condition allows, for example, infinite repetition of the experiment until the user terminates it.

This scenario can be modelled as follows. The set of actions A contains the actions for going left, going right and listening:

  A = {left, right, listen}

The set of reward variables R contains only one variable, namely the variable Reward:

  R = {Reward}

We assume that the domains of all variables are known, hence the domain of the variable Reward is also assumed to be known. Let us say that the cost of listening is 1, the cost of facing a tiger is 100 and the reward for finding a fortune is 10. Then Reward ∈ {−1, −100, 10}.

The set of state variables X contains the variables Tiger, Heard and Reward:

  X = {Tiger, Heard, Reward}

Tiger is the position of the tiger and Heard is a variable containing what the agent has heard. Of these three variables, Tiger is unobserved because the agent does not know the position of the tiger. The other two variables are observed:

  O = {Heard, Reward}

The agent knows neither the state transition probabilities nor the structure between the variables; it has to learn them first. The real graphical structure could be, for example, the one in Figure 1.

[Figure 1: The tiger example. A two-slice dynamic Bayesian network with the nodes A_t, Tiger_t, Heard_t, Reward_t and A_{t+1}, Tiger_{t+1}, Heard_{t+1}, Reward_{t+1}; only the arrows pointing into slice t+1 are drawn.]

The node shapes have their usual meaning: squares denote the decision (action) variables, circles denote random variables and diamonds denote the reward variable. As we are modelling a dynamic environment, we have a copy of each variable for each time step. The figure shows only two time slices, for times t and t+1, and only the arrows which point to time t+1; all other arrows are omitted.

An arrow starts in a node called the parent and goes to a node called the child. As each node is a variable, we use the terms parent node and parent variable interchangeably. The intuitive meaning of arrows is this: the child node (variable) X stochastically depends only on the combination of values of its parent variables. The more rigorous meaning of an arrow is that the child variable X is conditionally independent of all other variables given the values pa_G X of its parents PA_G X.

Let G denote the graph from Figure 1. Then PA_G Reward_{t+1} = {A_t, Tiger_t}, which is quite natural: the reward the agent will receive depends on the combination of where it decided to go and where the tiger has been. In our example this dependence is even deterministic; however, this is not general. For example, the variable Heard_{t+1} depends on its parents stochastically. The reason is simple: the agent could, independently of the current state, randomly hear incorrectly. PA_G Heard_{t+1} = {A_t, Tiger_t} because what the agent hears depends on where the tiger is and on whether the agent even listened.

In order to make it possible for the agent to learn the environment, it is necessary to let the agent interact with the environment repeatedly. So the experiment with the tiger restarts itself every time the agent opens a door, and the tiger is then again placed behind a random door. That is why PA_G Tiger_{t+1} = {A_t, Tiger_t}.

Each variable in this example has two parents: the ternary action variable and one binary variable. Hence, to specify the probability distribution over a child variable, we need to specify the probability of each of its values for all 2 · 3 = 6 combinations of values of the parent variables, i.e. we need to specify P(s′_X | (s′, s, a)_{PA_G X}) by six probability distributions over the values of the child variable. For a binary child variable such as Heard this means six pairs of numbers between 0 and 1 such that each pair sums to 1. These pairs of numbers are denoted as in (1). We interchangeably call such pairs, or even the above-mentioned 6-tuples of pairs, parameters in the plural or a parameter in the singular. By substituting values for s, s′ and a, one can restrict the parameter to one concrete value.

For example, let us assume that the agent is listening, the tiger is behind the right door, the probability of correct listening is 0.9 and the probability of the opposite listening outcome is 0.1. Then the following equalities would hold:

  P(s′_Heard | (s′, s, a)_{PA_G Heard}) = θ_Heard^{(right,listen)} = (0.9, 0.1).

The first equality holds because pa_G Heard = (s′, s, a)_{PA_G Heard} = (right, listen).

The agent does not know the value of this parameter, and, as was already explained in the previous section, the agent can treat parameters as random variables. This converts the POMDP with unknown parameters into a bigger POMDP with known parameters. The structure of this POMDP is shown in Figure 2.

[Figure 2: The tiger example extended with the parameter nodes Θ_Tiger, Θ_Heard and Θ_Reward, which are shared by all time slices.]

Notice that we do not have a copy of these parameter random variables for each time slice. All time slices share the same parameter, because the parameter does not change over time. Information about the possible values of these parameters, along with the possible structures and states, is maintained in the belief, which is described in the following sections.
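To make the parameter count concrete, the following sketch writes out one possible table for θ_Heard in the tiger example. Only the listening accuracy of 0.9 is taken from the running example above; all remaining numbers are assumptions made purely for illustration.

    # One possible table for theta_Heard in the tiger example (illustrative).
    # Keys: parent values (Tiger_t, A_t); values: P(Heard_{t+1} = left / right).
    # The listening accuracy 0.9 comes from the running example; the entries for
    # the non-listening actions are assumptions made only for illustration.
    theta_heard = {
        ("right", "listen"): {"left": 0.1, "right": 0.9},   # hears correctly w.p. 0.9
        ("left",  "listen"): {"left": 0.9, "right": 0.1},   # assumed symmetric
        ("left",  "left"):   {"left": 0.5, "right": 0.5},   # assumed: without listening
        ("right", "left"):   {"left": 0.5, "right": 0.5},   # the observation is
        ("left",  "right"):  {"left": 0.5, "right": 0.5},   # uninformative
        ("right", "right"):  {"left": 0.5, "right": 0.5},
    }
    # Six parent-value combinations (3 actions x 2 tiger positions), each giving a
    # distribution over the binary child variable, exactly as counted in the text.
    assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in theta_heard.values())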
4 Belief for a given structure

We begin by describing what the belief looks like when we are given the structure. It is exactly the same as described by Poupart & Vlassis [10]. The key component is the nice properties of the Dirichlet distribution.

Let us have one discrete random variable V which can take values from 1 up to K with probabilities (θ_1, θ_2, ..., θ_K), where ∑_i θ_i = 1. The usual way to estimate these parameters in Bayesian statistics is to compute the posterior when assuming that the prior follows the Dirichlet distribution. Its density is given by D(θ; n) = (1/B(n)) ∏_i θ_i^{n_i − 1}, where θ = {θ_i}_i and n = {n_i}_i are the hyperparameters and B(n) is a constant depending on them which makes the distribution integrate to one; it is known as the multinomial beta function.

The prior distribution which states that we have no evidence is expressed by setting all hyperparameters n_i equal to 1. It can be interpreted as evidence that all values were observed exactly once, or it can be thought of only as smoothing the posterior. The posterior probability that the variable V takes the value i is then equal to the proportion of how many times the value i was observed:

  P(V = i) = n_i / ∑_j n_j.   (3)

In a fully observable environment we could estimate all parameters in the whole structure by (3). We would use this estimate for each state variable X and for each combination of values pa_G X of its parents. That is the reason why we add X as a lower index and pa_G X as an upper index to θ, as in (1). Each set of parameters θ_X^{pa_G X} sums to one. From now on, θ without any indexes will denote all these sets of parameters together.

In this simplified case, when the whole history was observed, the density of the probability of being in the "information state" θ is

  ∏_{X, pa_G X} D(θ_X^{pa_G X}; n_X^{pa_G X}).
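A minimal numeric sketch of the bookkeeping behind (3): the hyperparameters act as pseudo-counts, observing a value increments its count, and the predictive probability is the normalized count (the values below are chosen for illustration only).

    # Dirichlet bookkeeping behind equation (3): hyperparameters are pseudo-counts.
    n = {"left": 1.0, "right": 1.0}          # uniform prior: every value "seen" once

    def observe(counts, value):
        counts[value] += 1.0                  # posterior Dirichlet: add one to the count

    def predictive(counts, value):
        return counts[value] / sum(counts.values())   # equation (3)

    for heard in ["right", "right", "left", "right"]:
        observe(n, heard)
    print(predictive(n, "right"))             # 4/6 after "right" was observed three times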
The problem that not all state features are observable can be overcome by the following theorem, proven by Poupart and Vlassis [10].

Theorem 1. If the prior is a mixture of products of Dirichlets,

  b(s, θ) = ∑_i c_{i,s} ∏_{X′, pa_G X′} D(θ_{X′}^{pa_G X′}; n_{X′,i,s}^{pa_G X′}),   (4)

then the posterior is also a mixture of products of Dirichlets,

  b_{a,o′}(s′, θ) = ∑_j c_{j,s′} ∏_{X′, pa_G X′} D(θ_{X′}^{pa_G X′}; n_{X′,j,s′}^{pa_G X′}).

Because the proof of Theorem 1 actually gives us the algorithm to update the belief, we repeat the proof in this article, but for better readability we split it into two lemmas. The first lemma is a technical observation which says that multiplication by θ_X^{pa_G X} only rescales the Dirichlet distribution.

Lemma 1. Let θ = (θ_1, ..., θ_K) and n = (n_1, ..., n_K). Then

  θ_j D(θ; n) = c D(θ; m)

for some constant c and vector m.

Proof.

  θ_j D(θ; n) = θ_j (1/B(n)) ∏_i θ_i^{n_i − 1}
             = (1/B(n)) ∏_i θ_i^{n_i − 1 + [i = j]}
             = (B(n + δ)/B(n)) D(θ; n + δ)
             = (n_j / ∑_{i=1}^{K} n_i) D(θ; n + δ),

where δ = (0, ..., 1, ..., 0) is a vector of zeroes with a one in the j-th position. The last equality follows from the property of B(n) saying that B(n) = ∏_{i=1}^{K} Γ(n_i) / Γ(∑_{i=1}^{K} n_i).

Lemma 2. P(s′|s, a, θ) b(s, θ) is a mixture of products of Dirichlets.

Proof.

  P(s′|s, a, θ) b(s, θ)
    = (∏_{X∈X} θ_X^{(s′,a,s)_{PA X}}) b(s, θ)   (5)
    = (∏_{X∈X} θ_X^{(s′,a,s)_{PA X}}) ∑_i c_{i,s} ∏_{X, pa X} D(θ_X^{pa X}; n_{X,i,s}^{pa X})
    = ∑_i c_{i,s} ∏_{X∈X} θ_X^{(s′,a,s)_{PA X}} ∏_{X, pa X} D(θ_X^{pa X}; n_{X,i,s}^{pa X})   (6)
    = ∑_i c′_{i,s} ∏_{X, pa X} D(θ_X^{pa X}; n′_{X,i,s}^{pa X})   (7)

Firstly, b(s, θ) in formula (5) is replaced by its definition. Then formula (6) is obtained by reordering the terms. Moreover, thanks to Lemma 1, θ_X^{(s′,a,s)_{PA X}} can be absorbed by the Dirichlet distribution D(θ_X^{pa X}; n_{X,i,s}^{pa X}) where pa X = (s′, a, s)_{PA X}. Let c and δ be as in Lemma 1; c_{i,s} must be multiplied by c, which is why it was replaced by c′_{i,s} = c · c_{i,s} in equation (7). Similarly, n′_{X,i,s}^{pa X} = n_{X,i,s}^{pa X} + δ when pa X = (s′, a, s)_{PA X}, and n′_{X,i,s}^{pa X} = n_{X,i,s}^{pa X} otherwise.

Proof of Theorem 1.

  b_{a,o′}(s′, θ) = k δ(s′_O = o′) ∑_s P(s′|s, a, θ) b(s, θ),

where k is a normalization constant. P(s′|s, a, θ) b(s, θ) is a mixture of products of Dirichlets by Lemma 2. Hence ∑_s P(s′|s, a, θ) b(s, θ) is also a mixture of products of Dirichlets, and mixtures of products of Dirichlets are closed under multiplication by a constant. It follows that b_{a,o′}(s′, θ) is a mixture of products of Dirichlets.

Theorem 1 implies that the belief b(s, θ) over the state s and the information state θ can be maintained in a closed form. But how can we, from this representation, get the probability of being in a specific state? We show in the following theorem that this can easily be done.

Theorem 2. θ in formula (4) for the mixture of products of Dirichlets can be integrated out in a closed form. More precisely, b(s) = ∫ b(s, θ) dθ = ∑_i c_{i,s}.

Proof.

  b(s) = ∫ b(s, θ) dθ
       = ∫ ∑_i c_{i,s} ∏_{X′, pa_G X′} D(θ_{X′}^{pa_G X′}; n_{X′,i,s}^{pa_G X′}) dθ
       = ∑_i c_{i,s} ∏_{X′, pa_G X′} ∫ D(θ_{X′}^{pa_G X′}; n_{X′,i,s}^{pa_G X′}) dθ_{X′}^{pa_G X′}
       = ∑_i c_{i,s}

The first equation defines the symbol b(s). The second follows from the definition of b(s, θ) by switching the sum and the integral. The third equation switches the integral and the multiplication, which is possible in this case because each factor depends on a different variable of the multidimensional integration. The density of every probability distribution (including the Dirichlet distribution) integrates to one, and a product of ones is one; that is why the last equation holds.
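The proof of Lemma 1 doubles as the update rule: multiplying a Dirichlet with counts n by θ_j yields the scaling constant n_j / ∑_i n_i and the counts n + δ. Below is a minimal sketch of this update for a single mixture component; the data layout and the toy numbers are assumptions chosen only for illustration.

    # Sketch of the update of one mixture component, following Lemma 1 and the
    # proof of Theorem 1. A component is a pair (coefficient c, counts), where
    # counts[(X, pa)] holds the Dirichlet hyperparameters of X under parent values pa.

    def absorb_theta(component, var, pa, value):
        """Multiply the component by theta_var^pa[value]: rescale c, add one count."""
        c, counts = component
        n = counts[(var, pa)]
        scale = n[value] / sum(n.values())                     # Lemma 1: c = n_j / sum_i n_i
        new_counts = dict(counts)
        new_counts[(var, pa)] = {**n, value: n[value] + 1.0}   # counts n + delta_j
        return (c * scale, new_counts)

    # Illustrative component: uniform prior counts over Heard' given (right, listen).
    comp = (1.0, {("Heard'", ("right", "listen")): {"left": 1.0, "right": 1.0}})
    comp = absorb_theta(comp, "Heard'", ("right", "listen"), "right")
    print(comp)   # coefficient 0.5, count for "right" raised to 2.0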
Despite these encouraging results, the number of components in the mixture grows exponentially. Luckily, the belief can be approximated by the approximation proposed by Poupart & Vlassis; this is described in Section 7. Firstly, in the next section, we describe the way the belief over possible structures is maintained.

5 Belief over structures

The simplest and most naive approach to maintaining the overall belief, which contains the probability of the structure, its parameters and the state, would be straightforward: it is sufficient to keep, for each structure, the belief conditioned on that structure together with the probability of the structure. The problem is that the number of graphs on a given number of vertices grows very fast with an increasing number of vertices, while we want to maintain only a small number of graphs.

We propose to remember only one randomly chosen structure, where the probability that the structure G is chosen is proportional to the probability P(G|history). The probability P(G|history) could be difficult to compute. But, as Ross & Pineau noted in the article [11] about MDPs, a Markov chain of graphs can be maintained using the Metropolis-Hastings algorithm. The algorithm ensures that the Markov chain converges to the distribution of graphs which is equal to P(G|history).

The Metropolis-Hastings algorithm needs to use P(history|G), which can be computed as follows: it is equal to P(o′|a, b, G) · P(h_{t−1}|G), where h_{t−1} denotes the history up to the previous time step. The idea of computing P(o′|a, b) is then as follows: P(o′|a, b) is equal, by (9), to the sum of P(s′|a, b) over the states s′ compatible with the observation o′, and P(s′|a, b) can be computed directly from the hyperparameters. For simplicity of notation we omit the conditioning on G.

  P(o′|a, b) = ∑_{s′∈S: s′_O = o′} P(s′|a, b)   (9)
            = ∑_{s′∈S: s′_O = o′} ∑_s ∫ P(s′|a, s, θ) b(s, θ) dθ   (10)
            = ∑_{s′∈S: s′_O = o′} ∑_s ∑_i c′_{i,s}   (11)

The conditioning on b can be replaced as in (10). P(s′|a, s, θ) b(s, θ) in formula (10) is a mixture of products of Dirichlets by Lemma 2; hence by Theorem 2 it can be replaced by the sum of its coefficients as in (11).

The algorithm for the belief update is given in Algorithm 2, where q(G′|G) is the probability of proposing a transition from graph G to graph G′. This distribution can be set in any arbitrary way which ensures that all graphs are reachable. P(G) is the prior distribution over graph structures, and the probability P(history|G′) can be computed as described above. Random transitions accepted with probability

  min(1, [P(history|G′) P(G′) q(G|G′)] / [P(history|G) P(G) q(G′|G)])

are well known under the name Metropolis-Hastings algorithm. This algorithm ensures that the Markov chain of graphs converges to the distribution P(G|history), which implies that our algorithm eventually learns either the correct structure or a structure which is equally good with respect to the encountered history.

One possible way of implementing the random changes and the distribution q(G′|G) is a local change of the graph structure, which can include deleting, reversing or adding an edge. If the change is local, affecting only a limited number of variables, then the necessary changes to the representation of b_{a,o} will also be local. This change depends on the distribution q(G′|G) and is not written explicitly in Algorithm 2; we assume that it is done on the line G := G′, which changes the structure.

Algorithm 2: updateBelief(b, a, o)
  Data: belief b (containing the structure G and the belief for that structure), action a, observation o
  Result: representation of the belief for the next time step
  G′ := random modification of G;
  with probability min(1, [P(history|G′) P(G′) q(G|G′)] / [P(history|G) P(G) q(G′|G)]): G := G′;
  b := b_{a,o};
  simplify the representation of b by an approximation from Section 7;

The final change of the belief inside the one given structure, b := b_{a,o}, can be done as in Theorem 1. This change includes the changes of the coefficients c_{i,s} and of the hyperparameters n_{X′,i,s}^{pa_G X′}; the changes are described in the proof of Theorem 1 and the associated lemmas.
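A minimal sketch of the Metropolis-Hastings step used in Algorithm 2; the proposal distribution, the history likelihood P(history|G) and the structure prior P(G) are passed in as functions and stand for whatever implementation is actually used.

    import math
    import random

    def mh_structure_step(G, propose, log_marginal, log_prior, log_q):
        """One Metropolis-Hastings move over graph structures (a sketch).

        propose(G)      -> a candidate graph, e.g. G with one edge added/deleted/reversed
        log_marginal(G) -> log P(history | G)
        log_prior(G)    -> log P(G)
        log_q(G2, G1)   -> log q(G2 | G1), the proposal probability
        """
        G2 = propose(G)
        log_ratio = (log_marginal(G2) + log_prior(G2) + log_q(G, G2)
                     - log_marginal(G) - log_prior(G) - log_q(G2, G))
        if random.random() < math.exp(min(0.0, log_ratio)):
            return G2   # accept: the chain moves to the proposed structure
        return G        # reject: keep the current structure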
6 Action selection

Which action should the agent choose? It should choose the action which maximizes its expected reward with respect to the probability distribution b(s) over the states. This expected reward can be tractably estimated by a recursive approach using a depth-limited Monte Carlo search.

For each action, the algorithm samples several new states. Each sampled state s′ is then restricted to the observation and used to update the belief. The value of the updated belief can then again be estimated by the same algorithm, using recursion up to some maximum depth.

This generates sampled walks (series of states) of equal length. Their rewards are summed, and at the maximum depth some simple estimate V(b) of the value of the belief b is added to the estimate, e.g. V(b) = ∑_s b(s) reward(s), where reward(s) = ∑_{R∈R} s_R. At each level of the recursion the best action is chosen. For clarity, this Monte Carlo search is shown in Algorithm 3.

Algorithm 3: selectAction(b, d)
  Data: belief b, depth d
  Static variable: the number of samples N
  Result: utility estimate of the belief state and the selected action
  if d = 0 then
    return V(b);
  maxQ := −∞;
  forall the actions a ∈ A do
    Q := 0;
    for i := 1 to N do
      s′ := sampleState(b_a);
      o′ := s′_O;
      b′ := updateBelief(b, a, o′);
      Q += (reward(s′) + γ · selectAction(b′, d − 1).utility) / N;
    if Q > maxQ then
      maxQ := Q;
      selectedAction := a;
  return (maxQ, selectedAction);

The state can be sampled by Algorithm 4. Firstly, it draws a graph according to the graph posterior; then it randomly draws a state with probabilities ∑_i c_{i,s}, which is correct as can be seen thanks to Theorem 2.

Algorithm 4: sampleState(b)
  Data: belief b(s)
  Result: sampled state s
  return a state sampled according to the weights ∑_i c_{i,s};

7 Approximate representation of the mixture of products of Dirichlets

As noted in Section 4, the number of components of the mixture representing the current belief for a given graph grows exponentially with time, which is intractable. Each component of the mixture is associated with a coefficient c_{i,s}. Naturally, some of these coefficients will be smaller while others will be bigger. We propose to handle this approximately and to keep only some components with bigger coefficients. There are several possibilities (the first two are sketched in code below):

1. Keep the k components with the greatest coefficients.

2. Sample k components with the coefficients used as probabilities.

3. Instead of generating a lot of components and then sampling only k of them, one can directly generate only these randomly chosen components.

For a more detailed description of these and other approximations, along with some of their advantages and disadvantages, we encourage the reader to see [10], where the same approximations are described.
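A minimal sketch of the first two options above; the data layout and the numbers are illustrative, and the weights stand for the coefficients c_{i,s} of the mixture components.

    import random

    def keep_top_k(components, k):
        """Option 1: keep the k components with the greatest coefficients."""
        return sorted(components, key=lambda comp: comp[0], reverse=True)[:k]

    def sample_k(components, k):
        """Option 2: sample k components, coefficients used as probabilities."""
        weights = [comp[0] for comp in components]
        return random.choices(components, weights=weights, k=k)

    # Components are (coefficient, hyperparameters) pairs as in Section 4.
    mixture = [(0.6, "n_1"), (0.3, "n_2"), (0.1, "n_3")]
    print(keep_top_k(mixture, 2))   # [(0.6, 'n_1'), (0.3, 'n_2')]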
8 Conclusion and Future Work

This article has shown the design of an agent which can be placed into an unknown partially observable environment and can learn its dynamics by interacting with it. During the interaction it takes the exploration-exploitation trade-off into account and chooses near-optimal actions. The dynamics of the environment learnt by our agent are represented by a dynamic Bayesian network whose structure is also learnt by our model. The algorithm can at any time return the graph of the dynamic Bayesian network it is currently maintaining as the estimate of the real structure.

This article has shown a possible way of combining the known techniques for reinforcement learning of the structure of an MDP and of the parameters of a POMDP in order to learn both the structure and the parameters of a POMDP simultaneously.

I am currently implementing and testing the proposed algorithms. One thing I am going to test is an alternative representation of the belief over possible structures: we proposed to maintain one Markov chain of possible graph structures, but the algorithm could alternatively maintain several such Markov chains, which could lead to more precise decisions. Another alternative is to use a particle filter [12] where each particle is a graph of a dynamic Bayesian network. I am looking forward to reporting empirical results in the near future.

Acknowledgement: This research was supported by SVV project number 260 104 and by GAČR No. P202/10/1333.

References

[1] Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

[2] Brunetto, R. (2013). Probabilistic modeling of dynamic systems. Informacne Technologie - Aplikacie a Teoria.

[3] Doshi-Velez, F. (2009). The infinite partially observable Markov decision process. In NIPS (pp. 477-485).

[4] Doshi, F., Wingate, D., Tenenbaum, J., & Roy, N. (2011). Infinite dynamic Bayesian networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 913-920).

[5] Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197-243.

[6] Itoh, H., & Nakamura, K. (2007). Partially observable Markov decision processes with imprecise parameters. Artificial Intelligence, 171(8), 453-490.

[7] Jaulmes, R., Pineau, J., & Precup, D. (2005). Active learning in partially observable Markov decision processes (pp. 601-608). Springer Berlin Heidelberg.

[8] Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1), 99-134.

[9] Pineau, J., Gordon, G., & Thrun, S. (2011). Anytime point-based approximations for large POMDPs. arXiv preprint arXiv:1110.0027.

[10] Poupart, P., & Vlassis, N. A. (2008). Model-based Bayesian reinforcement learning in partially observable domains. In ISAIM.

[11] Ross, S., & Pineau, J. (2012). Model-based Bayesian reinforcement learning in large structured domains. arXiv preprint arXiv:1206.3281.

[12] Russell, S., & Norvig, P. (2009). Artificial Intelligence: A Modern Approach (3rd ed.). Prentice Hall.

[13] Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5), 1071-1088.