V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, pp. 21-27, http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 R. Brunetto, M. Vomlelová

Acting and Bayesian reinforcement structure learning of partially observable environment

Robert Brunetto, Marta Vomlelová
Charles University, Czech Republic
robert@brunetto.cz

Abstract: This article shows how to learn both the structure and the parameters of a partially observable environment simultaneously while also performing, online, a near-optimal sequence of actions that takes the exploration-exploitation tradeoff into account. It combines two results of recent research: the former extends model-based Bayesian reinforcement learning of fully observable environments to larger domains by learning the structure; the latter shows how a known structure can be exploited for model-based Bayesian reinforcement learning of partially observable domains. This article shows that merging both approaches is possible without an excessive increase in computational complexity.

1 Introduction

Partially observable Markov decision processes (POMDPs) are a well-known framework for modeling and planning in partially observable stochastic environments [13], [8]. Such plans can be used in robotics, in dialogue management, or elsewhere. However, the range of their practical applicability is limited by several difficulties. Firstly, their usage is limited to small state spaces only (the curse of dimensionality), even when several approximations are applied [9]. Secondly, it can be difficult for an expert to specify the probabilities exactly when applying a POMDP to some domain. There have been several attempts to overcome this problem by learning the model [7], [3], [4] or by applying other techniques [2], [6]. Following recent research, this article proposes an approach that can address both problems at the same time.

Learning the model by interacting with the environment (reinforcement learning) has been researched extensively [1]. During the last decade, Jaulmes and Pineau learned unfactorized POMDPs [7] by assuming a prior probability distribution over all possible models. Poupart and Vlassis [10] then used a similar Bayesian reinforcement learning technique to learn the parameters of a dynamic Bayesian network (DBN) with a known structure. If the structure is not known, learning it is a challenging problem even in fully observable environments; Ross and Pineau [11] addressed it and proposed an online algorithm for it.

This article addresses a generalization of the tasks mentioned above. We show how to navigate an agent in an unknown stochastic, dynamic, partially observable environment using near-optimal actions. To do so, we combine the algorithm of [10] for learning the parameters of a POMDP with a known structure with the algorithm of [11] for learning the structure of an MDP.

We take the best ideas of each of them. After explaining the basic preliminaries in Section 2 and demonstrating them on an example in Section 3, we present an analytical representation of the belief over parameters (analogous to [10]) in Section 4. Similarly to [11], the belief over structures is maintained by a Markov chain Monte Carlo (MCMC) algorithm described in Section 5. This allows us to use the online action selection procedure described in Section 6. In Section 7 we review the approximations from [10] which can also be applied in our setting, making our model more tractable and compact.
2 Preliminaries

POMDPs usually model unknown environments as follows. At each time step the environment with the agent is in an unobserved state s ∈ S. At this time step the agent receives an observation o ∈ O and a reward r. Then the agent is free to choose an action a ∈ A. As time advances, the state changes to a new state s′ that stochastically depends on the previous state and the action.

We are interested in the special case of POMDPs where the state space factorizes along several state variables X ∈ X, hence each state s is a vector s = (s_1, ..., s_|X|) ∈ S = ∏_{X∈X} dom(X). Rewards and observations behave analogously: r ∈ ∏_{R∈R} dom(R) and o ∈ ∏_{O∈O} dom(O). Moreover, we can assume without loss of generality that O ⊆ X and R ⊆ X. This implies that the observation is the state restricted to the observation variables; it will be denoted o = s_O. Other restriction operations will be denoted analogously; for instance, the values of the reward variables in state s are denoted s_R.

We will always restrict vectors denoted by bold lower-case letters to sets of variables denoted by bold upper-case letters. A non-bold capital letter denotes a single variable and a non-bold lower-case letter denotes its value. In particular, when G is the graph of a Bayesian network, PA_G X denotes the set of variables that are parents of the variable X in graph G, and pa_G X denotes the values of the parents of X, where X ∈ X. (A variable V is called a parent of a variable W when there is an arrow from V to W in the graph of the Bayesian network; this is explained in greater detail in Section 3.)

When time advances to the next step, the state s changes to s′ according to an unknown transition probability P(s′|s, a). We assume that this transition probability factorizes according to a dynamic Bayesian network with unknown structure G and unknown parameters:

  P(s′|s, a) = ∏_{X∈X} P(s′_X | (s′, s, a)_{PA_G X})

A dynamic Bayesian network with structure G is described as a Bayesian network which has X′ ∪ X ∪ {A} as its nodes. By (s′, s, a)_{PA_G X} we denote (s′, s, a) restricted to the variables which are parents of the variable X according to the structure G.

To emphasize that we take P(s′_X | (s′, s, a)_{PA_G X}) as an unknown parameter, we denote it

  θ_X^{(s′,s,a)_{PA_G X}}.   (1)

It is a vector of real values between 0 and 1 summing to 1, containing for each value s′_X the probability P(s′_X | (s′, s, a)_{PA_G X}). As the indexes suggest, we have such a set of parameters for each state variable X and for each combination of values of its parents.
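To make the factorization concrete, the following minimal Python sketch evaluates P(s′|s, a) as a product of parameter entries for one fixed structure G. The variable names, the toy structure and the numbers are illustrative assumptions only, not part of the model definition above.

    # Minimal sketch (illustrative names and numbers): the factored transition
    # probability P(s'|s,a) for one fixed structure G, as a product over variables.

    # Structure G: for each next-step variable X', the ordered list of its parents.
    parents = {
        "Tiger'": ["Tiger", "A"],
        "Heard'": ["Tiger", "A"],
    }

    # Parameters: for each variable X' and combination of parent values, a
    # distribution over the values of X' -- the vector denoted in equation (1).
    theta = {
        ("Tiger'", ("left", "listen")): {"left": 1.0, "right": 0.0},
        ("Heard'", ("left", "listen")): {"left": 0.9, "right": 0.1},
        # ... one entry for every parent-value combination
    }

    def transition_prob(s_next, s, a, parents, theta):
        """P(s'|s,a) = prod_X P(s'_X | (s',s,a) restricted to PA_G X)."""
        joint = {**s, "A": a}
        for var, value in s_next.items():
            joint[var + "'"] = value
        p = 1.0
        for var, pa in parents.items():
            pa_values = tuple(joint[parent] for parent in pa)
            p *= theta[(var, pa_values)][s_next[var.rstrip("'")]]
        return p

    s = {"Tiger": "left", "Heard": "left"}
    print(transition_prob({"Tiger": "left", "Heard": "left"}, s, "listen", parents, theta))
    # 1.0 * 0.9 = 0.9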
Even though we do not assume the structure G to be known, we still assume that a prior probability distribution P(G) over structures G is known. This probability distribution can express the expert's prior knowledge, or it can simply prefer simple structures over more complicated ones. The transition probability can then be expressed as

  P(s′|s, a) = ∑_G P(G) · P(s′|s, a, G).   (2)

We assume that the parameters θ_X^{(s′,s,a)_{PA_G X}} follow the Dirichlet distribution, which is conjugate to the multinomial distribution. The hyperparameters are initially set to one, or set to values that preserve the likelihood of equivalent parameters in equivalent Bayesian networks (see [5]).

Each unknown parameter θ_X^{(s′,s,a)_{PA_G X}} can be regarded as an additional state feature. This way a POMDP with unknown parameters can be converted into a bigger POMDP without unknown parameters. A similar remark also holds for the unknown structure: it can also be regarded as a part of the state, hence a POMDP with unknown structure and parameters can be regarded as a bigger POMDP with known dynamics.

A POMDP agent usually maintains a so-called belief state b(s̃), which represents a probability distribution over possible states. In our case s̃ consists not only of the current state s but also of the graph structure and its parameters.

Surprisingly, as Poupart and Vlassis showed [10], even though there is an infinite number of possible parameters, the belief for a given structure can be maintained in a closed form. We review it in Section 4. There is only a finite number of possible graphical structures, implying that the whole belief b(s̃) can be maintained in a closed form. But the number of possible graphs may be large; that is why an approximation is introduced in Section 5.

The belief is a sufficient statistic for taking a decision. The agent's overall algorithm is summarized in Algorithm 1.

Algorithm 1: Agent's life cycle
  Data: b — initial belief over the current state, the structures and their parameters θ
  1 while not TerminationCondition() do
  2   a := selectAction(b).action;
  3   performAction(a);
  4   o := receiveObservation();
  5   b := updateBelief(b, a, o);
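As a reading aid, here is a minimal Python sketch mirroring Algorithm 1; the environment object and the helper functions are hypothetical placeholders for whatever implementations of selectAction and updateBelief are used.

    # Minimal sketch of Algorithm 1 (agent's life cycle).
    # env, select_action, update_belief and terminated are hypothetical placeholders.

    def run_agent(belief, env, select_action, update_belief, terminated):
        while not terminated():
            a = select_action(belief).action      # line 2: pick a (near-)optimal action
            env.perform_action(a)                 # line 3: act in the environment
            o = env.receive_observation()         # line 4: observe the visible variables
            belief = update_belief(belief, a, o)  # line 5: Bayesian belief update
        return belief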
3 Example

For better readability we illustrate the terms on a problem taken from [8]. Imagine the following scenario. The agent stands before two doors. Behind one of the doors is a fortune which would give the agent a big reward, but behind the second door is a tiger which would cause the agent an even higher negative reward. The agent does not know which door is which. It has three possibilities: it can walk through one door, walk through the second door, or wait and listen behind which door there is a bigger noise, and then face the same decision with the newly gained information.

This corresponds to Algorithm 1. Choosing and performing an action is on its lines 2 and 3. After the agent listens or opens a door, it knows what it has heard or whether it met a tiger or found a fortune; this is shown on line 4 of Algorithm 1. The agent then uses this observation to update its belief on line 5. The belief contains all the information the agent knows, i.e. the agent should count how many times it has heard the tiger on each side and estimate where the tiger is. It should also learn the usual tiger position and other facts about the environment's behaviour. This information is also stored and updated with the belief. We assume that after opening a door and finding a tiger or a fortune the experiment restarts itself and the tiger is again placed behind a random door. Everything repeats until a termination condition is reached (line 1). This unspecified condition allows, for example, infinite repetition of the experiment until the user terminates it.

This scenario can be modelled as follows. The set of actions A contains the actions for going left, going right and listening:

  A = {left, right, listen}

The set of reward variables R contains only one variable, namely the variable Reward:

  R = {Reward}

We assume that the domains of all variables are known, hence the domain of the variable Reward is also assumed to be known. Let us say that the cost of listening is 1, the cost of facing a tiger is 100 and the reward for finding a fortune is 10. Then Reward ∈ {−1, −100, 10}.

The set of state variables X contains the variables Tiger, Heard and Reward:

  X = {Tiger, Heard, Reward}

Tiger is the position of the tiger and Heard is a variable containing what the agent has heard. Of these three variables, Tiger is unobserved because the agent does not know the position of the tiger. The other two variables are observed:

  O = {Heard, Reward}

The agent knows neither the state transition probabilities nor the structure between the variables; it has to learn them first. The real graphical structure could be, for example, the one in Figure 1.

[Figure 1: The tiger example. A two-slice dynamic Bayesian network with the nodes A_t, Tiger_t, Heard_t, Reward_t and A_{t+1}, Tiger_{t+1}, Heard_{t+1}, Reward_{t+1}; only the arrows pointing into slice t+1 are drawn.]

The node shapes have their usual meaning: squares denote the decision (action) variables, circles denote random variables and diamonds denote the reward variable. As we are modelling a dynamic environment, we have a copy of each variable for each time step. The figure shows only two time slices, for times t and t+1, and only the arrows which point to time t+1; all other arrows are omitted.

An arrow starts in a node called the parent and goes to a node called the child. As each node is a variable, we use the terms parent node and parent variable interchangeably. The intuitive meaning of arrows is this: the child node (variable) X stochastically depends only on the combination of values of its parent variables. The more rigorous meaning of an arrow is that the child variable X is conditionally independent of all other variables given the values pa_G X of its parents PA_G X.

Let G denote the graph from Figure 1. Then PA_G Reward_{t+1} = {A_t, Tiger_t}, which is quite natural: the reward the agent will receive depends on the combination of where it decided to go and where the tiger has been. In our example this dependence is even deterministic; however, this is not general. For example, the variable Heard_{t+1} depends on its parents stochastically. The reason is simple: the agent could, independently of the current state, randomly hear incorrectly. PA_G Heard_{t+1} = {A_t, Tiger_t} because what the agent hears depends on where the tiger is and on whether the agent even listened.

In order to make it possible for the agent to learn the environment, it is necessary to let the agent interact with the environment repeatedly. So the experiment with the tiger restarts itself every time the agent opens a door, and the tiger is then again placed behind a random door. That is why PA_G Tiger_{t+1} = {A_t, Tiger_t}.

Each variable in this example has two parents: the ternary action variable and one binary variable. Hence, to specify the probability distribution over a child variable, we need to specify the probability of each of its values for all 2 · 3 = 6 combinations of values of the parent variables, i.e. we need to specify P(s′_X | (s′, s, a)_{PA_G X}) by six probability distributions over the values of the child variable. For a binary child variable such as Heard this means six pairs of numbers between 0 and 1 such that each pair sums to 1. These pairs of numbers are denoted as in (1). We interchangeably call such pairs, or even the above-mentioned 6-tuples of pairs, parameters in the plural or a parameter in the singular. By substituting values for s, s′ and a, one can restrict the parameter to one concrete value.

For example, let us assume that the agent is listening, the tiger is behind the right door, the probability of correct listening is 0.9 and the probability of the opposite listening outcome is 0.1. Then the following equalities would hold:

  P(s′_Heard | (s′, s, a)_{PA_G Heard}) = θ_Heard^{(right,listen)} = (0.9, 0.1).

The first equality holds because pa_G Heard = (s′, s, a)_{PA_G Heard} = (right, listen).

The agent does not know the value of this parameter, and, as was already explained in the previous section, the agent can treat parameters as random variables. This converts the POMDP with unknown parameters into a bigger POMDP with known parameters. The structure of this POMDP is shown in Figure 2.

[Figure 2: The tiger example extended with the parameter nodes Θ_Tiger, Θ_Heard and Θ_Reward, which are shared by all time slices.]

Notice that we do not have a copy of these parameter random variables for each time slice. All time slices share the same parameter, because the parameter does not change over time. Information about the possible values of these parameters, along with the possible structures and states, is maintained in the belief, which is described in the following sections.
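To make the parameter count concrete, the following sketch writes out one possible table for θ_Heard in the tiger example. Only the listening accuracy of 0.9 is taken from the running example above; all remaining numbers are assumptions made purely for illustration.

    # One possible table for theta_Heard in the tiger example (illustrative).
    # Keys: parent values (Tiger_t, A_t); values: P(Heard_{t+1} = left / right).
    # The listening accuracy 0.9 comes from the running example; the entries for
    # the non-listening actions are assumptions made only for illustration.
    theta_heard = {
        ("right", "listen"): {"left": 0.1, "right": 0.9},   # hears correctly w.p. 0.9
        ("left",  "listen"): {"left": 0.9, "right": 0.1},   # assumed symmetric
        ("left",  "left"):   {"left": 0.5, "right": 0.5},   # assumed: without listening
        ("right", "left"):   {"left": 0.5, "right": 0.5},   # the observation is
        ("left",  "right"):  {"left": 0.5, "right": 0.5},   # uninformative
        ("right", "right"):  {"left": 0.5, "right": 0.5},
    }
    # Six parent-value combinations (3 actions x 2 tiger positions), each giving a
    # distribution over the binary child variable, exactly as counted in the text.
    assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in theta_heard.values())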
4 Belief for a given structure

We begin by describing what the belief looks like when we are given the structure. It is exactly the same as described by Poupart & Vlassis [10]. The key component is the nice properties of the Dirichlet distribution.

Let us have one discrete random variable V which can take values from 1 up to K with probabilities (θ_1, θ_2, ..., θ_K), where ∑_i θ_i = 1. The usual way to estimate these parameters in Bayesian statistics is to compute the posterior when assuming that the prior follows the Dirichlet distribution. Its density is given by D(θ; n) = (1/B(n)) ∏_i θ_i^{n_i − 1}, where θ = {θ_i}_i and n = {n_i}_i are the hyperparameters and B(n) is a constant depending on them which makes the distribution integrate to one; it is known as the multinomial beta function.

The prior distribution which states that we have no evidence is expressed by setting all hyperparameters n_i equal to 1. It can be interpreted as evidence that all values were observed exactly once, or it can be thought of only as smoothing the posterior. The posterior probability that the variable V takes the value i is then equal to the proportion of how many times the value i was observed:

  P(V = i) = n_i / ∑_j n_j.   (3)

In a fully observable environment we could estimate all parameters in the whole structure by (3). We would use this estimate for each state variable X and for each combination of values pa_G X of its parents. That is the reason why we add X as a lower index and pa_G X as an upper index to θ, as in (1). Each set of parameters θ_X^{pa_G X} sums to one. From now on, θ without any indexes will denote all these sets of parameters together.

In this simplified case, when the whole history was observed, the density of the probability of being in the "information state" θ is

  ∏_{X, pa_G X} D(θ_X^{pa_G X}; n_X^{pa_G X}).
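A minimal numeric sketch of the bookkeeping behind (3): the hyperparameters act as pseudo-counts, observing a value increments its count, and the predictive probability is the normalized count (the values below are chosen for illustration only).

    # Dirichlet bookkeeping behind equation (3): hyperparameters are pseudo-counts.
    n = {"left": 1.0, "right": 1.0}          # uniform prior: every value "seen" once

    def observe(counts, value):
        counts[value] += 1.0                  # posterior Dirichlet: add one to the count

    def predictive(counts, value):
        return counts[value] / sum(counts.values())   # equation (3)

    for heard in ["right", "right", "left", "right"]:
        observe(n, heard)
    print(predictive(n, "right"))             # 4/6 after "right" was observed three times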
The problem that not all state features are observable can be overcome by the following theorem, proven by Poupart and Vlassis [10].

Theorem 1. If the prior is a mixture of products of Dirichlets,

  b(s, θ) = ∑_i c_{i,s} ∏_{X′, pa_G X′} D(θ_{X′}^{pa_G X′}; n_{X′,i,s}^{pa_G X′}),   (4)

then the posterior is also a mixture of products of Dirichlets,

  b_{a,o′}(s′, θ) = ∑_j c_{j,s′} ∏_{X′, pa_G X′} D(θ_{X′}^{pa_G X′}; n_{X′,j,s′}^{pa_G X′}).

Because the proof of Theorem 1 actually gives us the algorithm to update the belief, we repeat the proof in this article, but for better readability we split it into two lemmas. The first lemma is a technical observation which says that multiplication by θ_X^{pa_G X} only rescales the Dirichlet distribution.

Lemma 1. Let θ = (θ_1, ..., θ_K) and n = (n_1, ..., n_K). Then

  θ_j D(θ; n) = c D(θ; m)

for some constant c and vector m.

Proof.

  θ_j D(θ; n) = θ_j (1/B(n)) ∏_i θ_i^{n_i − 1}
             = (1/B(n)) ∏_i θ_i^{n_i − 1 + [i = j]}
             = (B(n + δ)/B(n)) D(θ; n + δ)
             = (n_j / ∑_{i=1}^{K} n_i) D(θ; n + δ),

where δ = (0, ..., 1, ..., 0) is a vector of zeroes with a one in the j-th position. The last equality follows from the property of B(n) saying that B(n) = ∏_{i=1}^{K} Γ(n_i) / Γ(∑_{i=1}^{K} n_i).

Lemma 2. P(s′|s, a, θ) b(s, θ) is a mixture of products of Dirichlets.

Proof.

  P(s′|s, a, θ) b(s, θ)
    = (∏_{X∈X} θ_X^{(s′,a,s)_{PA X}}) b(s, θ)   (5)
    = (∏_{X∈X} θ_X^{(s′,a,s)_{PA X}}) ∑_i c_{i,s} ∏_{X, pa X} D(θ_X^{pa X}; n_{X,i,s}^{pa X})
    = ∑_i c_{i,s} ∏_{X∈X} θ_X^{(s′,a,s)_{PA X}} ∏_{X, pa X} D(θ_X^{pa X}; n_{X,i,s}^{pa X})   (6)
    = ∑_i c′_{i,s} ∏_{X, pa X} D(θ_X^{pa X}; n′_{X,i,s}^{pa X})   (7)

Firstly, b(s, θ) in formula (5) is replaced by its definition. Then formula (6) is obtained by reordering the terms. Moreover, thanks to Lemma 1, θ_X^{(s′,a,s)_{PA X}} can be absorbed by the Dirichlet distribution D(θ_X^{pa X}; n_{X,i,s}^{pa X}) where pa X = (s′, a, s)_{PA X}. Let c and δ be as in Lemma 1; c_{i,s} must be multiplied by c, which is why it was replaced by c′_{i,s} = c · c_{i,s} in equation (7). Similarly, n′_{X,i,s}^{pa X} = n_{X,i,s}^{pa X} + δ when pa X = (s′, a, s)_{PA X}, and n′_{X,i,s}^{pa X} = n_{X,i,s}^{pa X} otherwise.

Proof of Theorem 1.

  b_{a,o′}(s′, θ) = k δ(s′_O = o′) ∑_s P(s′|s, a, θ) b(s, θ),

where k is a normalization constant. P(s′|s, a, θ) b(s, θ) is a mixture of products of Dirichlets by Lemma 2. Hence ∑_s P(s′|s, a, θ) b(s, θ) is also a mixture of products of Dirichlets, and mixtures of products of Dirichlets are closed under multiplication by a constant. It follows that b_{a,o′}(s′, θ) is a mixture of products of Dirichlets.

Theorem 1 implies that the belief b(s, θ) over the state s and the information state θ can be maintained in a closed form. But how can we, from this representation, get the probability of being in a specific state? We show in the following theorem that this can easily be done.

Theorem 2. θ in formula (4) for the mixture of products of Dirichlets can be integrated out in a closed form. More precisely, b(s) = ∫ b(s, θ) dθ = ∑_i c_{i,s}.

Proof.

  b(s) = ∫ b(s, θ) dθ
       = ∫ ∑_i c_{i,s} ∏_{X′, pa_G X′} D(θ_{X′}^{pa_G X′}; n_{X′,i,s}^{pa_G X′}) dθ
       = ∑_i c_{i,s} ∏_{X′, pa_G X′} ∫ D(θ_{X′}^{pa_G X′}; n_{X′,i,s}^{pa_G X′}) dθ_{X′}^{pa_G X′}
       = ∑_i c_{i,s}

The first equation defines the symbol b(s). The second follows from the definition of b(s, θ) by switching the sum and the integral. The third equation switches the integral and the multiplication, which is possible in this case because each factor depends on a different variable of the multidimensional integration. The density of every probability distribution (including the Dirichlet distribution) integrates to one, and a product of ones is one; that is why the last equation holds.
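The proof of Lemma 1 doubles as the update rule: multiplying a Dirichlet with counts n by θ_j yields the scaling constant n_j / ∑_i n_i and the counts n + δ. Below is a minimal sketch of this update for a single mixture component; the data layout and the toy numbers are assumptions chosen only for illustration.

    # Sketch of the update of one mixture component, following Lemma 1 and the
    # proof of Theorem 1. A component is a pair (coefficient c, counts), where
    # counts[(X, pa)] holds the Dirichlet hyperparameters of X under parent values pa.

    def absorb_theta(component, var, pa, value):
        """Multiply the component by theta_var^pa[value]: rescale c, add one count."""
        c, counts = component
        n = counts[(var, pa)]
        scale = n[value] / sum(n.values())                     # Lemma 1: c = n_j / sum_i n_i
        new_counts = dict(counts)
        new_counts[(var, pa)] = {**n, value: n[value] + 1.0}   # counts n + delta_j
        return (c * scale, new_counts)

    # Illustrative component: uniform prior counts over Heard' given (right, listen).
    comp = (1.0, {("Heard'", ("right", "listen")): {"left": 1.0, "right": 1.0}})
    comp = absorb_theta(comp, "Heard'", ("right", "listen"), "right")
    print(comp)   # coefficient 0.5, count for "right" raised to 2.0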
Despite these encouraging results, the number of components in the mixture grows exponentially. Luckily, the belief can be approximated by the approximation proposed by Poupart & Vlassis; this is described in Section 7. Firstly, in the next section, we describe the way the belief over possible structures is maintained.

5 Belief over structures

The simplest and most naive approach to maintaining the overall belief, which contains the probability of the structure, its parameters and the state, would be straightforward: it is sufficient to keep, for each structure, the belief conditioned on that structure together with the probability of the structure. The problem is that the number of graphs on a given number of vertices grows very fast with an increasing number of vertices, while we want to maintain only a small number of graphs.

We propose to remember only one randomly chosen structure, where the probability that the structure G is chosen is proportional to the probability P(G|history). The probability P(G|history) could be difficult to compute. But, as Ross & Pineau noted in the article [11] about MDPs, a Markov chain of graphs can be maintained using the Metropolis-Hastings algorithm. The algorithm ensures that the Markov chain converges to the distribution of graphs which is equal to P(G|history).

The Metropolis-Hastings algorithm needs to use P(history|G), which can be computed as follows: it is equal to P(o′|a, b, G) · P(h_{t−1}|G), where h_{t−1} denotes the history up to the previous time step. The idea of computing P(o′|a, b) is then as follows: P(o′|a, b) is equal, by (9), to the sum of P(s′|a, b) over the states s′ compatible with the observation o′, and P(s′|a, b) can be computed directly from the hyperparameters. For simplicity of notation we omit the conditioning on G.

  P(o′|a, b) = ∑_{s′∈S: s′_O = o′} P(s′|a, b)   (9)
            = ∑_{s′∈S: s′_O = o′} ∑_s ∫ P(s′|a, s, θ) b(s, θ) dθ   (10)
            = ∑_{s′∈S: s′_O = o′} ∑_s ∑_i c′_{i,s}   (11)

The conditioning on b can be replaced as in (10). P(s′|a, s, θ) b(s, θ) in formula (10) is a mixture of products of Dirichlets by Lemma 2; hence by Theorem 2 it can be replaced by the sum of its coefficients as in (11).

The algorithm for the belief update is given in Algorithm 2, where q(G′|G) is the probability of proposing a transition from graph G to graph G′. This distribution can be set in any arbitrary way which ensures that all graphs are reachable. P(G) is the prior distribution over graph structures, and the probability P(history|G′) can be computed as described above. Random transitions accepted with probability

  min(1, [P(history|G′) P(G′) q(G|G′)] / [P(history|G) P(G) q(G′|G)])

are well known under the name Metropolis-Hastings algorithm. This algorithm ensures that the Markov chain of graphs converges to the distribution P(G|history), which implies that our algorithm eventually learns either the correct structure or a structure which is equally good with respect to the encountered history.

One possible way of implementing the random changes and the distribution q(G′|G) is a local change of the graph structure, which can include deleting, reversing or adding an edge. If the change is local, affecting only a limited number of variables, then the necessary changes to the representation of b_{a,o} will also be local. This change depends on the distribution q(G′|G) and is not written explicitly in Algorithm 2; we assume that it is done on the line G := G′, which changes the structure.

Algorithm 2: updateBelief(b, a, o)
  Data: belief b (containing the structure G and the belief for that structure), action a, observation o
  Result: representation of the belief for the next time step
  G′ := random modification of G;
  with probability min(1, [P(history|G′) P(G′) q(G|G′)] / [P(history|G) P(G) q(G′|G)]): G := G′;
  b := b_{a,o};
  simplify the representation of b by an approximation from Section 7;

The final change of the belief inside the one given structure, b := b_{a,o}, can be done as in Theorem 1. This change includes the changes of the coefficients c_{i,s} and of the hyperparameters n_{X′,i,s}^{pa_G X′}; the changes are described in the proof of Theorem 1 and the associated lemmas.
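A minimal sketch of the Metropolis-Hastings step used in Algorithm 2; the proposal distribution, the history likelihood P(history|G) and the structure prior P(G) are passed in as functions and stand for whatever implementation is actually used.

    import math
    import random

    def mh_structure_step(G, propose, log_marginal, log_prior, log_q):
        """One Metropolis-Hastings move over graph structures (a sketch).

        propose(G)      -> a candidate graph, e.g. G with one edge added/deleted/reversed
        log_marginal(G) -> log P(history | G)
        log_prior(G)    -> log P(G)
        log_q(G2, G1)   -> log q(G2 | G1), the proposal probability
        """
        G2 = propose(G)
        log_ratio = (log_marginal(G2) + log_prior(G2) + log_q(G, G2)
                     - log_marginal(G) - log_prior(G) - log_q(G2, G))
        if random.random() < math.exp(min(0.0, log_ratio)):
            return G2   # accept: the chain moves to the proposed structure
        return G        # reject: keep the current structure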
6 Action selection

Which action should the agent choose? It should choose the action which maximizes its expected reward with respect to the probability distribution b(s) over the states. This expected reward can be tractably estimated by a recursive approach using a depth-limited Monte Carlo search.

For each action, the algorithm samples several new states. Each sampled state s′ is then restricted to the observation and used to update the belief. The value of the updated belief can then again be estimated by the same algorithm, using recursion up to some maximum depth.

This generates sampled walks (series of states) of equal length. Their rewards are summed, and at the maximum depth some simple estimate V(b) of the value of the belief b is added to the estimate, e.g. V(b) = ∑_s b(s) reward(s), where reward(s) = ∑_{R∈R} s_R. At each level of the recursion the best action is chosen. For clarity, this Monte Carlo search is shown in Algorithm 3.

Algorithm 3: selectAction(b, d)
  Data: belief b, depth d
  Static variable: the number of samples N
  Result: utility estimate of the belief state and the selected action
  if d = 0 then
    return V(b);
  maxQ := −∞;
  forall the actions a ∈ A do
    Q := 0;
    for i := 1 to N do
      s′ := sampleState(b_a);
      o′ := s′_O;
      b′ := updateBelief(b, a, o′);
      Q += (reward(s′) + γ · selectAction(b′, d − 1).utility) / N;
    if Q > maxQ then
      maxQ := Q;
      selectedAction := a;
  return (maxQ, selectedAction);

The state can be sampled by Algorithm 4. Firstly, it draws a graph according to the graph posterior; then it randomly draws a state with probabilities ∑_i c_{i,s}, which is correct as can be seen thanks to Theorem 2.

Algorithm 4: sampleState(b)
  Data: belief b(s)
  Result: sampled state s
  return a state sampled according to the weights ∑_i c_{i,s};

7 Approximate representation of the mixture of products of Dirichlets

As noted in Section 4, the number of components of the mixture representing the current belief for a given graph grows exponentially with time, which is intractable. Each component of the mixture is associated with a coefficient c_{i,s}. Naturally, some of these coefficients will be smaller while others will be bigger. We propose to handle this approximately and to keep only some components with bigger coefficients. There are several possibilities (the first two are sketched in code below):

1. Keep the k components with the greatest coefficients.

2. Sample k components with the coefficients used as probabilities.

3. Instead of generating a lot of components and then sampling only k of them, one can directly generate only these randomly chosen components.

For a more detailed description of these and other approximations, along with some of their advantages and disadvantages, we encourage the reader to see [10], where the same approximations are described.
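A minimal sketch of the first two options above; the data layout and the numbers are illustrative, and the weights stand for the coefficients c_{i,s} of the mixture components.

    import random

    def keep_top_k(components, k):
        """Option 1: keep the k components with the greatest coefficients."""
        return sorted(components, key=lambda comp: comp[0], reverse=True)[:k]

    def sample_k(components, k):
        """Option 2: sample k components, coefficients used as probabilities."""
        weights = [comp[0] for comp in components]
        return random.choices(components, weights=weights, k=k)

    # Components are (coefficient, hyperparameters) pairs as in Section 4.
    mixture = [(0.6, "n_1"), (0.3, "n_2"), (0.1, "n_3")]
    print(keep_top_k(mixture, 2))   # [(0.6, 'n_1'), (0.3, 'n_2')]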
8 Conclusion and Future Work

This article has shown the design of an agent which can be placed into an unknown partially observable environment and can learn its dynamics by interacting with it. During the interaction it takes the exploration-exploitation trade-off into account and chooses near-optimal actions. The dynamics of the environment learnt by our agent are represented by a dynamic Bayesian network whose structure is also learnt by our model. The algorithm can at any time return the graph of the dynamic Bayesian network it is currently maintaining as the estimate of the real structure.

This article has shown a possible way of combining the known techniques for reinforcement learning of the structure of an MDP and of the parameters of a POMDP in order to learn both the structure and the parameters of a POMDP simultaneously.

I am currently implementing and testing the proposed algorithms. One thing I am going to test is an alternative representation of the belief over possible structures: we proposed to maintain one Markov chain of possible graph structures, but the algorithm could alternatively maintain several such Markov chains, which could lead to more precise decisions. Another alternative is to use a particle filter [12] where each particle is a graph of a dynamic Bayesian network. I am looking forward to reporting empirical results in the near future.

Acknowledgement: This research was supported by SVV project number 260 104 and by GAČR No. P202/10/1333.

References

[1] Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

[2] Brunetto, R. (2013). Probabilistic modeling of dynamic systems. Informacne Technologie - Aplikacie a Teoria.

[3] Doshi-Velez, F. (2009). The infinite partially observable Markov decision process. In NIPS (pp. 477-485).

[4] Doshi, F., Wingate, D., Tenenbaum, J., & Roy, N. (2011). Infinite dynamic Bayesian networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 913-920).

[5] Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197-243.

[6] Itoh, H., & Nakamura, K. (2007). Partially observable Markov decision processes with imprecise parameters. Artificial Intelligence, 171(8), 453-490.

[7] Jaulmes, R., Pineau, J., & Precup, D. (2005). Active learning in partially observable Markov decision processes (pp. 601-608). Springer Berlin Heidelberg.

[8] Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1), 99-134.

[9] Pineau, J., Gordon, G., & Thrun, S. (2011). Anytime point-based approximations for large POMDPs. arXiv preprint arXiv:1110.0027.

[10] Poupart, P., & Vlassis, N. A. (2008). Model-based Bayesian reinforcement learning in partially observable domains. In ISAIM.

[11] Ross, S., & Pineau, J. (2012). Model-based Bayesian reinforcement learning in large structured domains. arXiv preprint arXiv:1206.3281.

[12] Russell, S., & Norvig, P. (2009). Artificial Intelligence: A Modern Approach (3rd ed.). Prentice Hall.

[13] Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5), 1071-1088.