Yanlin Han and Piotr Gmytrasiewicz. MAICS 2017, pp. 69–76

Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs

Yanlin Han, Piotr Gmytrasiewicz
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607

Abstract

Interactive partially observable Markov decision processes (I-POMDPs) provide a principled framework for planning and acting in a partially observable, stochastic, multi-agent environment. They extend POMDPs to multi-agent settings by including models of other agents in the state space and forming a hierarchical belief structure. In order to predict other agents' actions using I-POMDPs, we propose an approach that effectively uses Bayesian inference and sequential Monte Carlo (SMC) sampling to learn others' intentional models, which ascribe to them beliefs, preferences, and rationality in action selection. For problems of various complexities, empirical results show that our algorithm accurately learns models of other agents and has superior performance in comparison with other methods. Our approach serves as a generalized reinforcement learning algorithm that learns over other agents' transition, observation, and reward functions. It also effectively mitigates the belief space complexity due to the nested belief hierarchy.

Introduction

Partially observable Markov decision processes (POMDPs) (Kaelbling, Littman, and Cassandra 1998) provide a principled, decision-theoretic framework for planning under uncertainty in a partially observable, stochastic environment. An autonomous agent operates rationally in such settings by maintaining a belief over the physical state at any given time; in doing so it sequentially chooses the optimal actions that maximize its future rewards. Solutions of POMDPs are therefore mappings from an agent's beliefs to actions. Although POMDPs can be used in multi-agent settings, doing so requires the strong assumption that the effects of other agents' actions are implicitly treated as noise and folded into the state transitions, as in recent Bayes-adaptive POMDPs (Ross, Chaib-draa, and Pineau 2007), the infinite generalized policy representation (Liu, Liao, and Carin 2011), and infinite POMDPs (Doshi-Velez and Konidaris 2013). Consequently, an agent's beliefs about other agents are not part of the solutions of POMDPs.

Interactive POMDPs (I-POMDPs) (Gmytrasiewicz and Doshi 2005) generalize POMDPs to multi-agent settings by replacing POMDP belief spaces with interactive hierarchical belief systems. Specifically, an I-POMDP augments the plain beliefs about physical states in a POMDP by including models of other agents, which forms a hierarchical belief structure representing an agent's belief about the physical state, its belief about the other agents, and their beliefs about others' beliefs. The models of other agents included in the augmented state space are of two types: intentional models and subintentional models. The sophisticated intentional model ascribes beliefs, preferences, and rationality to other agents (Gmytrasiewicz and Doshi 2005), while the simpler subintentional model, such as a finite state controller (Panella and Gmytrasiewicz 2016), does not. Solutions of I-POMDPs map an agent's belief about the environment and the other agents' models to actions; the framework is therefore applicable to important agent, human, and mixed agent-human applications. It has been shown (Gmytrasiewicz and Doshi 2005) that the added sophistication of modeling others as rational agents results in a value function that dominates the one obtained by simply treating others as noise, which implies the modeling superiority of I-POMDPs over other approaches to multi-agent systems.

However, the interactive belief structure of I-POMDPs results in a drastic increase in belief space complexity, adding to the curse of dimensionality: the complexity of the belief representation grows with the belief dimensions, which increase exponentially with the nesting level of the agent models. Since exact solutions of POMDPs are PSPACE-complete for finite horizons and undecidable for infinite horizons (Papadimitriou and Tsitsiklis 1987), the time complexity of the more general I-POMDPs, which may contain multiple POMDPs and I-POMDPs of other agents, is at least PSPACE-complete for finite horizons and undecidable for infinite horizons. Due to this severe complexity, no complete belief update over the entire interactive belief space using sophisticated intentional models has yet been accomplished. There are only partial updates restricted to other agents' beliefs about the physical states (Doshi and Gmytrasiewicz 2009) and indirect approaches such as subintentional finite state controllers (Panella and Gmytrasiewicz 2016). Therefore, in order to unleash the full modeling power of intentional models and apply I-POMDPs to more realistic settings, a good approximation algorithm for computing the nested interactive belief and predicting other agents' actions is crucial to the trade-off between solution quality and computational complexity.

To address this issue, we propose a Bayesian approach that utilizes customized sequential Monte Carlo sampling algorithms (Doucet, De Freitas, and Gordon 2001) to obtain approximate solutions to I-POMDPs, and we implement the algorithms in a software package (https://github.com/solohan22/IPOMDP.git). Specifically, we assume that the models of other agents are unknown and are learned from imperfect observations of the other agents' behaviors. We parametrize other agents' intentional models and maintain a belief over them, making sequential Bayesian updates using only observations from the environment. Since this Bayesian inference task is analytically intractable, we devise a customized sequential Monte Carlo method to approximate the posterior distribution: it descends the belief hierarchy and samples all model parameters at each nesting level, starting from the interactive particle filter (I-PF) (Doshi and Gmytrasiewicz 2009) for the I-POMDP belief update.

Our approach, for the first time, successfully learns others' models over the entire intentional model space, which contains their initial beliefs and their transition, observation, and reward functions, making it a generalized reinforcement learning method for multi-agent settings. Our algorithm accurately predicts others' actions in various problem settings, and thereby enables the modeling agent to take the corresponding optimal actions that maximize its own rewards. By approximating Bayesian inference with a customized sequential Monte Carlo sampling method, we significantly mitigate the belief space complexity of I-POMDPs.

Background

POMDP

A partially observable Markov decision process (POMDP) (Kaelbling, Littman, and Cassandra 1998) is a general reinforcement learning model for planning and acting in a single-agent, partially observable, stochastic domain. It is defined for a single agent i as:

    POMDP_i = <S, A_i, Ω_i, T_i, O_i, R_i>    (1)

where the meaning of each element of the 6-tuple is:
• S is the set of states of the environment.
• A_i is the set of agent i's possible actions.
• Ω_i is the set of agent i's possible observations.
• T_i : S × A_i × S → [0, 1] is the state transition function.
• O_i : S × A_i × Ω_i → [0, 1] is the observation function.
• R_i : S × A_i → R is the reward function.

Given the definition above, an agent's belief about the state can be represented as a probability distribution over S. The belief update can be done using the following formula, where α is a normalizing constant:

    b'(s') = α O(s', a, o) Σ_{s∈S} T(s, a, s') b(s)    (2)

Given the agent's belief, the optimal action, a*, is simply an element of the set of optimal actions, OPT(b_i), for the belief state, defined as:

    OPT(b_i) = argmax_{a_i∈A_i} { Σ_{s∈S} b_i(s) R(s, a_i) + Σ_{o_i∈Ω_i} P(o_i | a_i, b_i) × U(SE(b_i, a_i, o_i)) }    (3)

where SE(·) denotes the belief update (state estimation) and U(·) the utility of the resulting belief state.

Particle Filter

The Markov chain Monte Carlo (MCMC) method (Gilks, Richardson, and Spiegelhalter 1996) is widely used to approximate probability distributions when they cannot be computed directly. It generates samples from a posterior distribution π(x) over state space x by simulating a Markov chain p(x'|x) whose state space is x and whose stationary distribution is π(x). The samples drawn from p converge to the target distribution π as the number of samples goes to infinity.

In order to make MCMC work on sequential inference tasks, especially sequential decision making under Markov assumptions, sequential versions of Monte Carlo methods have been proposed, some of which are capable of dealing with high-dimensional and/or complex problems, such as particle filters (Del Moral 1996). At each time step, a particle filter draws samples (particles) from a proposal distribution, commonly p(x_t | x_{t−1}), which is essentially the conditional distribution of the current state x_t given the previous state x_{t−1}; it then uses the observation function p(y_t | x_t) to compute an importance weight for each particle and resamples all particles according to the weights.

The Model

I-POMDP framework

An interactive POMDP of agent i, I-POMDP_i, is defined as:

    I-POMDP_i = <IS_{i,l}, A, Ω_i, T_i, O_i, R_i>    (4)

where IS_{i,l} is the set of interactive states of the environment, defined as IS_{i,l} = S × M_{j,l−1}, l ≥ 1, where S is the set of states, M_{j,l−1} is the set of possible models of agent j, and l is the strategy level. A specific class of models are the (l−1)th-level intentional models, Θ_{j,l−1}, of agent j: θ_{j,l−1} = <b_{j,l−1}, A, Ω_j, T_j, O_j, R_j, OC_j>, where b_{j,l−1} is agent j's belief nested to level (l−1), b_{j,l−1} ∈ Δ(IS_{j,l−1}), and OC_j is j's optimality criterion. The intentional model θ_{j,l−1}, sometimes referred to as a type, can be rewritten as θ_{j,l−1} = <b_{j,l−1}, θ̂_j>, where θ̂_j includes all elements of the intentional model other than the belief and is called agent j's frame.

IS_{i,l} can be defined inductively (note that when θ̂_j is known, θ_j reduces to b_j):

    IS_{i,0} = S,            θ_{j,0} = {<b_{j,0}, θ̂_j> : b_{j,0} ∈ Δ(S)}
    IS_{i,1} = S × θ_{j,0},  θ_{j,1} = {<b_{j,1}, θ̂_j> : b_{j,1} ∈ Δ(IS_{j,1})}
    ...                                                                        (5)
    IS_{i,l} = S × θ_{j,l−1},  θ_{j,l} = {<b_{j,l}, θ̂_j> : b_{j,l} ∈ Δ(IS_{j,l})}

All other components of an I-POMDP are similar to those of a POMDP:
• A = A_i × A_j is the set of joint actions of all agents.
• Ω_i is the set of agent i's possible observations.
• T_i : S × A_i × S → [0, 1] is the state transition function.
• O_i : S × A_i × Ω_i → [0, 1] is the observation function.
• R_i : IS × A_i → R is the reward function.

Interactive belief update

Given the definitions above, the interactive belief update can be performed as follows:

    b_i^t(is^t) = Pr(is^t | b_i^{t−1}, a_i^{t−1}, o_i^t)    (6)
                = α Σ_{is^{t−1}} b_i^{t−1}(is^{t−1}) Σ_{a_j^{t−1}} Pr(a_j^{t−1} | θ_j^{t−1}) T(s^{t−1}, a^{t−1}, s^t)
                  × O_i(s^t, a^{t−1}, o_i^t) Σ_{o_j^t} O_j(s^t, a^{t−1}, o_j^t) τ(b_j^{t−1}, a_j^{t−1}, o_j^t, b_j^t)

Unlike the plain belief update in a POMDP, the interactive belief update in an I-POMDP takes two additional sophistications into account. First, the probabilities of the other agent's actions given its models (the second summation) need to be computed, since the physical state now depends on both agents' actions. Second, the agent needs to update its belief based on its anticipation of what observations the other agent might receive and how the other agent would update its own belief (the third summation).

Then the optimal action, a*, for the case of the infinite-horizon criterion with discounting, is an element of the set of optimal actions, OPT(θ_i), for the belief state, defined as:

    OPT(θ_i) = argmax_{a_i∈A_i} { Σ_{is∈IS} b_i(is) ER_i(is, a_i) + Σ_{o_i∈Ω_i} P(o_i | a_i, b_i) × U(<SE_{θ_i}(b_i, a_i, o_i), θ̂_i>) }    (7)

Sampling Algorithms

The interactive particle filter (I-PF) (Doshi and Gmytrasiewicz 2009) was proposed as a filtering algorithm for the interactive belief update in I-POMDPs. It generalizes the classic particle filter to multi-agent settings and uses the state transition function as the proposal distribution, as in the bootstrap filter (Gordon, Salmond, and Smith 1993). However, due to the enormous belief space, the I-PF assumes that the other agent's frame θ̂_j is known to the modeling agent, which simplifies the belief update from S × Θ_{j,l−1} to the significantly smaller space S × b_{j,l−1}.

The intuition behind our algorithm is to assign appropriate prior distributions over all of agent j's possible models θ_j = <b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j> and to sample from each of their dimensions. At each time step, we update all samples using the perceived observations, namely computing and assigning a weight to each sample, and resample the samples according to the weights. Finally, since this is a randomized Monte Carlo method, to prevent the learning algorithm from converging to incorrect models we add another resampling step that samples from neighboring similar models given the current samples. Consequently, our algorithm maintains a probability distribution over the most probable models of the other agents and eventually learns their optimal actions.

Algorithm 1: Interactive Belief Update

b̃_{k,l}^t = InteractiveBeliefUpdate(b̃_{k,l}^{t−1}, a_k^{t−1}, o_k^t, l > 0)
 1  for is_k^{(n),t−1} = <s^{(n),t−1}, θ_{−k}^{(n),t−1}> ∈ b̃_{k,l}^{t−1}:
 2      sample a_{−k}^{t−1} ∼ P(A_{−k} | θ_{−k}^{(n),t−1})
 3      sample s^{(n),t} ∼ T_k(S^t | s^{(n),t−1}, a_k^{t−1}, a_{−k}^{t−1})
 4      for o_{−k}^t ∈ Ω_{−k}:
 5          if l = 1:
 6              b_{−k,0}^{(n),t} = Level0BeliefUpdate(b_{−k,0}^{(n),t−1}, a_{−k}^{t−1}, o_{−k}^t, θ_{−k}^{(n),t−1})
 7              θ_{−k}^{(n),t} = <b_{−k,0}^{(n),t}, θ̂_{−k}^{(n),t−1}>
 8              is_k^{(n),t} = <s^{(n),t}, θ_{−k}^{(n),t}>
 9          else:
10              b_{−k,l−1}^{(n),t} = InteractiveBeliefUpdate(b̃_{−k,l−1}^{t−1}, a_{−k}^{t−1}, o_{−k}^t, l−1)
11              θ_{−k}^{(n),t} = <b_{−k,l−1}^{(n),t}, θ̂_{−k}^{(n),t−1}>
12              is_k^{(n),t} = <s^{(n),t}, θ_{−k}^{(n),t}>
13      w_t^{(n)} = O_{−k}(o_{−k}^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1})
14      w_t^{(n)} = w_t^{(n)} × O_k(o_k^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1})
15      b̃_{k,l}^{temp} += <is_k^{(n),t}, w_t^{(n)}>
16  normalize all w_t^{(n)} so that Σ_{n=1}^N w_t^{(n)} = 1
17  resample from b̃_{k,l}^{temp} according to the normalized w_t^{(n)}
18  resample θ_{−k}^{(n),t} according to neighboring similar models
19  return b̃_{k,l}^t = {is_k^{(n),t}}

The interactive belief update described in Algorithm 1 is similar to the I-PF in terms of recursive Monte Carlo sampling and the nesting hierarchy, but it has three major differences. First, the belief update is over the entire intentional model space of the other agents; the initial set of N samples is therefore b̃_{k,l}^{t−1} = <b_{−k,l−1}^{(n),t−1}, A_{−k}, Ω_{−k}, T_{−k}^{(n)}, O_{−k}^{(n)}, R_{−k}^{(n)}, OC_{−k}>, where k denotes the modeling agent and −k denotes all other modeled agents. We only assume that the actions A_{−k}, the observations Ω_{−k}, and the optimality criteria OC_{−k} are known, since in a multi-agent game the rules are usually known to all agents or can be obtained through intelligence. Second, the observation function O_{−k}^{(n)}(o_{−k}^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1}) in line 13 is now randomized as well, since each sample carries a particular observation function of that agent. Lastly, we add another resampling step in line 18 in order to avoid divergence, resampling each dimension of the model samples from a Gaussian distribution centered at the current sample value. Intuitively, similar models are resampled from a relatively tight neighboring region around the current model samples to maintain learning accuracy.

Algorithm 1 can be viewed as two major steps. The importance sampling step (lines 1 to 16) samples from the belief priors b̃_{k,l}^{t−1}, propagates the samples forward using the corresponding proposal distributions, and computes the weight of each sample. The selection, or resampling, step (lines 17 to 18) resamples according to the weights and the similar models. Specifically, the algorithm starts from a set of initial priors is_k^{(n),t−1}; for each of them, it samples the other agents' optimal action a_{−k}^{t−1} from the policy P(A_{−k} | θ_{−k}^{(n),t−1}), which is solved using the efficient POMDP solver Perseus (Spaan and Vlassis 2005; http://www.st.ewi.tudelft.nl/~mtjspaan/pomdp/index_en.html). It then samples the physical state s^t using the state transition T_k(S^t | S^{(n),t−1}, a_k^{t−1}, a_{−k}^{t−1}). Once a_{−k}^{t−1} and s^t are sampled, the algorithm calls the level-0 belief update (lines 5 to 8), described in Algorithm 2, to update the other agents' plain beliefs b_{−k,0}^t if the current nesting level l is 1, or recursively calls itself at the lower level l−1 (lines 9 to 12) if the current nesting level is greater than 1. The sample weights w_t^{(n)} are computed according to the observation likelihoods of the modeling and modeled agents (lines 13, 14), and are then normalized so that they sum to 1 (line 16). Lastly, the algorithm resamples the intermediate samples according to the computed weights (line 17) and resamples once more from similar neighboring models (line 18).

Algorithm 2: Level-0 Belief Update

b_{k,0}^t = Level0BeliefUpdate(b_{k,0}^{t−1}, a_k^{t−1}, o_k^t, T_k^{(n)}, O_k^{(n)})
 1  P(a_{−k}^{t−1}) = 1/|A_{−k}|
 2  for s^t ∈ S:
 3      for s^{t−1} ∈ S:
 4          for a_{−k}^{t−1} ∈ A_{−k}:
 5              P^{(n)}(s^t | s^{t−1}, a_k^{t−1}) += T_k^{(n)}(s^t | s^{t−1}, a_k^{t−1}, a_{−k}^{t−1}) × P(a_{−k}^{t−1})
 6          sum += P^{(n)}(s^t | s^{t−1}, a_k^{t−1}) b_{k,0}^{t−1}(s^{t−1})
 7      for a_{−k}^{t−1} ∈ A_{−k}:
 8          P^{(n)}(o_k^t | s^t, a_k^{t−1}) += O_k^{(n)}(o_k^t | s^t, a_k^{t−1}, a_{−k}^{t−1}) P(a_{−k}^{t−1})
 9      b_{k,0}^t(s^t) = sum × P^{(n)}(o_k^t | s^t, a_k^{t−1})
10  normalize and return b_{k,0}^t

The level-0 belief update, described in Algorithm 2, is similar to the POMDP belief update but treats the other agents' actions as noise and takes the randomized state transition and observation functions as input parameters. It assumes that the other agents choose their actions according to a uniform distribution (line 1), and is therefore essentially a no-information model. For each possible action a_{−k}^{t−1}, it computes the effective state transition (line 5) and the effective observation function (line 8) by marginalizing over the others' actions, and returns the normalized belief b_{k,0}^t. Notice that the transition function T_k^{(n)}(s^t | s^{t−1}, a_k^{t−1}, a_{−k}^{t−1}) and the observation function O_k^{(n)}(o_k^t | s^t, a_k^{t−1}, a_{−k}^{t−1}) are now both samples taken from the input arguments, depending on the model parameters of the actual level-0 agent.

Figure 1: An illustration of the interactive belief update for two agents and 1-level nesting.

In Figure 1, we illustrate the interactive belief update using the problem discussed in the following section. Suppose there are two agents i and j in the environment, the sample size is 8, and the nesting level is 2; the subscripts in Figure 1 denote the corresponding agents, and each dot represents a particular belief sample. The propagate step corresponds to lines 2 to 12 in Algorithm 1, the weight step to lines 13 to 16, and the resample step to lines 17 and 18. The belief update for a particular level-0 model sample (θ_j = <b_j(s) = 0.5, p_T1 = 0.67, p_T2 = 0.5, p_O1 = 0.85, p_O2 = 0.5, p_R1 = −1, p_R2 = −100, p_R3 = 10>) is solved using Algorithm 2, and the optimal action is computed by calling the Perseus POMDP solver.

Experiments

Setup

We present results on the multi-agent tiger game (Gmytrasiewicz and Doshi 2005) with various settings. The multi-agent tiger game generalizes the classical single-agent tiger game (Kaelbling, Littman, and Cassandra 1998) by adding observations caused by the other agents' actions. The generalized multi-agent game contains additional observations regarding the other players, and its state transition and reward functions involve the others' actions as well.

Consider a specific game instance with known parameters: a tiger and a pile of gold are hidden behind two doors, respectively. Two players can both listen for a growl of the tiger and for a creak caused by the other player, or open a door, which resets the tiger's location with equal probability. The accuracies of their observations of the tiger and of the other player are both relatively high (0.85 and 0.9, respectively). No matter which player triggers it, the reward for listening is −1, for opening the tiger door is −100, and for opening the gold door is 10.

Table 1: Parameters for transition functions

    S   A   TL        TR
    TL  L   p_T1      1−p_T1
    TR  L   1−p_T1    p_T1
    *   OL  p_T2      1−p_T2
    *   OR  1−p_T2    p_T2

Table 2: Parameters for observation functions

    S   A   GL        GR
    TL  L   p_O1      1−p_O1
    TR  L   1−p_O1    p_O1
    *   OL  p_O2      1−p_O2
    *   OR  1−p_O2    p_O2

Table 3: Parameters for reward functions

    S   A   R
    *   L   p_R1
    TL  OL  p_R2
    TR  OR  p_R2
    TL  OR  p_R3
    TR  OL  p_R3

For the sake of brevity, we restrict the experiments to a two-agent setting and a nesting level of one, but the sampling algorithm extends to any number of agents and nesting levels in a straightforward manner. Recall that an interactive POMDP of agent i is defined as the six-tuple I-POMDP_i = <IS_{i,l}, A, Ω_i, T_i, O_i, R_i>. For the specific setting of the multi-agent tiger problem:
• IS_{i,1} = S × θ_{j,0}, where S = {tiger on the left (TL), tiger on the right (TR)} and θ_{j,0} = <b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j>.
• A = A_i × A_j is a combination of both agents' possible actions: listen (L), open the left door (OL), and open the right door (OR).
• Ω_i is all combinations of each agent's possible observations: growl from the left (GL) or right (GR), combined with creak from the left (CL), creak from the right (CR), or silence (S).
• T_i = T_j : S × A_i × A_j × S → [0, 1] is a joint state transition probability that involves both agents' actions.
• O_i : S × A_i × A_j × Ω_i → [0, 1] becomes a joint observation probability that involves both actions. O_j is symmetric to O_i with respect to the joint actions.
• R_i : IS × A_i × A_j → R: agent i receives the corresponding rewards when he listens, opens the wrong door, or opens the correct door; these are independent of j's actions.

Parameter Space

For the multi-agent tiger game experiments, we want to learn over all possible intentional models of the other agent j: θ_j = <b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j>. We make only the reasonable assumptions that A_j and Ω_j are known and that OC_j is the infinite-horizon criterion with discounting. What we actually want to learn is:
• b_j^0: the initial belief of agent j about the physical state.
• T_j: the transition function of agent j, which can be parametrized by two parameters, p_T1 and p_T2, as shown in Table 1.
• O_j: the observation function of agent j, which can be parametrized by two parameters, p_O1 and p_O2, as shown in Table 2.
• R_j: the reward function of agent j, which can be parametrized by three parameters, p_R1, p_R2, and p_R3, as shown in Table 3.

This is an enormous 8-dimensional parameter space to learn from: b_j^0 × p_T1 × p_T2 × p_O1 × p_O2 × p_R1 × p_R2 × p_R3, where b_j^0, p_T1, p_T2, p_O1, p_O2 ∈ [0, 1] ⊂ R and p_R1, p_R2, p_R3 ∈ (−∞, +∞) ⊂ R.

We reduce this huge space mainly by two means: utilizing Monte Carlo sampling methods, and assigning problem-specific priors that are not overly informative but still provide enough information for the algorithm to learn from.

Results

For the actual experiments, we fix the number of samples at 2000 and run the algorithm on a two-agent tiger game simulation as described above. We run experiments for learning three different models of agent j:
1. θ_j^1 = <0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10>
2. θ_j^2 = <0.5, 1.00, 0.5, 0.95, 0.5, −1, −10, 10>
3. θ_j^3 = <0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10>

These models are all special cases, carefully chosen in order to verify the correctness and evaluate the performance of our algorithm. For instance, the first model is a sophisticated one in which the other agent is actually modeling his opponent using a subintentional model, the second is a classic single-agent POMDP, and the third is a very simple one that nevertheless entails a large model space. We want to investigate whether our framework is able to learn these models correctly and efficiently.

Figure 2: Optimal policy of a no-information model.

The aim of the first experiment is to learn a relatively complicated model of agent j, with θ_j = <0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10>, who assumes that others' actions are drawn from a uniform distribution. Equivalently, agent j's actual policy, shown in Figure 2, is to listen for three consecutive growls from the same direction and then open the corresponding door. For this particular experiment we simulated the observation history of agent i in order to first verify the correctness of our algorithm while excluding the uncertainty of hearing accuracy. The simulated observation history is as follows: {GL,S GL,S GL,S GL,CR GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GL,CL GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S}

The priors we assign to the parameters are shown in Figure 3; specifically, they are uniform U(0,1) for b_j^0, Beta(5,3) with mode 0.67 for p_T1, Beta(5,5) for p_T2, Beta(3.5,1.4) with mode 0.85 for p_O1, Beta(5,5) for p_O2, Gaussian N(−1,2) for p_R1, N(−100,4) for p_R2, and N(10,2) for p_R3.

Figure 3: Assigned priors and learned posterior distributions over model parameters for model θ_j^1 = <0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10>.

After 50 time steps, the algorithm converges to a posterior distribution over agent j's intentional models; the results are also given in Figure 3. Since the parameter space of agent j's models is 8-dimensional, we show only the marginal distribution of each parameter as a histogram. We can easily see that the majority of samples are centered around the true parameter values.

We use principal component analysis (PCA) (Abdi and Williams 2010) to reduce the sample dimensionality to 2 and plot the samples in a 3-dimensional histogram, shown in Figure 4. Starting from a Gaussian-like prior, the samples gradually converge to the most likely models. Eventually the mean of this cluster, <0.49, 0.69, 0.49, 0.82, 0.51, −0.95, −99.23, 10.09>, is very close to the true model.

Figure 4: 3D histogram of all model samples.

Here we give two examples from the big cluster after 50 time steps: <0.56, 0.66, 0.49, 0.84, 0.59, −0.95, −101.37, 11.42> and <0.51, 0.68, 0.52, 0.89, 0.56, −1.33, −98.39, 12.55>. The former has a corresponding optimal policy of [0—OL—0.10—L—1], while the latter has [0—OL—0.09—L—0.91—OR—1]; both are extremely close to the optimal policy of the true model, [0—OL—0.1—L—0.9—OR—1]. Consequently, the framework is able to predict the other agent's actions with high accuracy.

We tested the performance of our algorithm in terms of the prediction accuracy of others' actions. We compared the results with other modeling approaches: a frequency-based approach, in which agent j is assumed to choose his actions according to a fixed but unknown distribution, and a no-information model, which treats j's actions purely as uniform noise. The results shown in Figure 5 are plots averaged over 10 random runs of 50 time steps each. They show clearly that the intentional I-POMDP approach has significantly lower error rates as agent i perceives more observations. The subintentional model assumes that j's actions are drawn from a uniform distribution and therefore has a fixed high error rate. The frequency-based approach has some learning ability but is far from sophisticated enough to model a fully rational agent.

Figure 5: Prediction error rate vs. observation length.

Figure 6: (a) Optimal policy for θ_j = <0.5, 1, 0.5, 0.95, 0.5, −1, −10, 10>. (b) Optimal policy for θ_j = <0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10>.

In the second experiment, we run our algorithm on actual observations for 30 time steps, until it converges, and try to learn the model of a simpler, classic POMDP agent with a high listening accuracy of 0.95 and a small penalty of −10; e.g., agent j alternately opens doors and listens, as shown in Figure 6(a). The actual model of j is θ_j = <0.5, 1, 0.5, 0.95, 0.5, −1, −10, 10>; the priors assigned to b_j^0, p_T1, p_T2, p_O1, p_O2, p_R1, p_R2, p_R3 are U(0,1), Beta(2,0.5), Beta(10,10), Beta(19,1), Beta(10,10), N(−1,1), N(−10,2), N(10,2); and the actual observation history is {GR,S GL,CR GL,S GL,CL GL,S GL,CR GL,S GL,CL GL,S GR,S GR,CL GR,CL GL,S GR,S GR,S GL,CL GR,S GL,CR GR,S GR,CR GR,CR GR,CL GL,S GL,S GL,S GL,CR GL,S GL,CL GR,S GR,S}.

Figure 7: Learned posterior distributions for model θ_j = <0.5, 1, 0.5, 0.95, 0.5, −1, −10, 10>.

Similarly, we report the learned posterior distributions over the model parameters in Figure 7. We observe an interesting pattern: while some parameters, such as b_j^0, p_T2, and p_O2, are concentrated around the actual values, others like p_T1 and p_O1 become more dispersed than their initial priors. The intuition behind this is that the penalty and reward are −10 and 10, so a single listen at a cost of −1 is enough for making the decision to open a door. That is to say, as long as the tiger likely remains behind the same door while the agent listens (the meaning of p_T1) and the hearing accuracy is reliable (the meaning of p_O1), there are many models that satisfy this particular observation sequence, and our algorithm learns them all.

In the last experiment, we want to learn the model θ_j = <0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10>, who always listens, since the listening reward is now equal to the gold-door reward, as shown in Figure 6(b). For brevity, we show only the marginal distributions over the model parameters, in Figure 8. The priors assigned to b_j^0, p_T1, p_T2, p_O1, p_O2, p_R1, p_R2, p_R3 are U(0,1), Beta(5,3), Beta(10,10), Beta(3.5,1.4), Beta(10,10), N(10,1), N(−100,2), N(10,2), and the actual observation history agent i learns from is {GL,S GL,S GR,S GL,S GL,CL GR,S GR,S GL,CL GR,S GL,S GL,S GR,S GL,S GL,S GL,S GL,CL GR,S GL,S GL,S GL,S}. We can see that all three reward parameters are correctly learned, while the samples of p_T1, p_T2, p_O1, and p_O2 are not very concentrated around their true values but stay close to their corresponding priors; intuitively, these parameters become less important and may lie in a relatively loose region due to the increased p_R1 = 10. Lastly, the performance comparison is given in Figure 9(b).

Figure 8: Learned posterior distributions for model θ_j = <0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10>.

For conciseness, we show the average prediction error rates for both the second and third experiments in Figure 9. Both results are averaged over 10 random runs of 30 time steps each. In the second experiment, Figure 9(a), the intentional I-POMDP approach still has significantly lower error rates than the others.

Figure 9: (a) Prediction error rate vs. observation length for θ_j = <0.5, 1, 0.5, 0.95, 0.5, −1, −10, 10>. (b) The same for θ_j = <0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10>.

Conclusions and Future Work

We have described a new approach to learning other agents' models by approximating the interactive belief update using Bayesian inference and Monte Carlo sampling methods. Our framework correctly learns others' models over the entire intentional model space and is therefore a generalized reinforcement learning algorithm for multi-agent settings. It also effectively mitigates the belief space complexity and achieves significantly better performance than other approaches in predicting others' actions.

In the future, more multi-agent problems of various sizes could be tested in order to fully evaluate practicality on larger problem spaces. Due to computational complexity, experiments at higher nesting levels are currently limited; thus, more effort could be devoted to nonparametric Bayesian methods, which inherently deal with nested belief structures.

References

Abdi, H., and Williams, L. J. 2010. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2(4): 433–459.
Del Moral, P. 1996. Non-linear filtering: interacting particle resolution. Markov Processes and Related Fields 2(4): 555–581.
Doshi, P., and Gmytrasiewicz, P. J. 2009. Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research 34: 297–337.
Doshi, P.; Zeng, Y.; and Chen, Q. 2009. Graphical models for interactive POMDPs: representations and solutions. Autonomous Agents and Multi-Agent Systems 18(3): 376–416.
Doshi-Velez, F., and Konidaris, G. 2013. Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. arXiv:1308.3513.
Doucet, A.; De Freitas, N.; and Gordon, N. 2001. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, 3–14. Springer New York.
Gilks, W. R.; Richardson, S.; and Spiegelhalter, D. J. 1996. Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice, 1–19.
Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research 24: 49–79.
Gordon, N. J.; Salmond, D. J.; and Smith, A. F. M. 1993. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing) 140(2): 107–113.
Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1): 99–134.
Liu, M.; Liao, X.; and Carin, L. 2011. The infinite regionalized policy representation. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 769–776.
Panella, A., and Gmytrasiewicz, P. 2016. Bayesian learning of other agents' finite controllers for interactive POMDPs. In Thirtieth AAAI Conference on Artificial Intelligence.
Papadimitriou, C. H., and Tsitsiklis, J. N. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3): 441–450.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Ross, S.; Chaib-draa, B.; and Pineau, J. 2007. Bayes-adaptive POMDPs. In Advances in Neural Information Processing Systems, 1225–1232.
Spaan, M. T. J., and Vlassis, N. 2005. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research 24: 195–220.
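The MCMC idea referenced in the background (simulating a Markov chain whose stationary distribution is the target π) can be illustrated with a minimal random-walk Metropolis sampler. This is our own sketch, not code from the paper; the target density and step size are purely illustrative:

```python
import math
import random

def metropolis(log_target, x0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis: simulate a Markov chain whose
    stationary distribution is the (unnormalized) target density."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, target(proposal) / target(x)),
        # done in log space for numerical stability.
        if math.log(rng.random() + 1e-300) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return samples

# Illustrative target: a standard normal, via its log density up to a constant.
samples = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
mean = sum(samples) / len(samples)
```

With enough samples, the chain's empirical mean and variance approach those of the target, which is the convergence property the paper relies on.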
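The propagate/weight/resample cycle of the bootstrap particle filter described above can be sketched for a tiny discrete hidden state. This is a generic illustration under assumed numbers (a two-state chain with 0.9 persistence and 0.85 observation accuracy), not the paper's model:

```python
import random

def bootstrap_step(particles, transition, obs_lik, y, rng):
    """One bootstrap-filter step: propagate, weight, resample."""
    # Propagate each particle through the proposal p(x_t | x_{t-1}).
    moved = [transition(x, rng) for x in particles]
    # Weight by the observation likelihood p(y_t | x_t).
    w = [obs_lik(y, x) for x in moved]
    total = sum(w)
    w = [wi / total for wi in w]
    # Resample particles in proportion to the normalized weights.
    return rng.choices(moved, weights=w, k=len(moved))

rng = random.Random(1)
# Two-state chain that stays put w.p. 0.9; observations are correct w.p. 0.85.
trans = lambda x, r: x if r.random() < 0.9 else 1 - x
lik = lambda y, x: 0.85 if y == x else 0.15
particles = [rng.choice([0, 1]) for _ in range(1000)]
for y in [0, 0, 0]:          # three consistent observations of state 0
    particles = bootstrap_step(particles, trans, lik, y, rng)
belief0 = particles.count(0) / len(particles)
```

After a few consistent observations, the particle population concentrates on the observed state, which is exactly the filtering behavior the I-PF generalizes to interactive states.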
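The POMDP belief update of Eq. (2) is a direct computation once T and O are given as tables. The sketch below uses illustrative tiger-style numbers (our own simplified matrices, not the paper's full specification):

```python
def pomdp_belief_update(b, a, o, T, O, states):
    """b'(s') ∝ O(s', a, o) * sum_s T(s, a, s') * b(s)   (Eq. 2)."""
    new_b = {}
    for s2 in states:
        new_b[s2] = O[(s2, a)][o] * sum(T[(s, a)][s2] * b[s] for s in states)
    alpha = sum(new_b.values())              # normalizing constant
    return {s: v / alpha for s, v in new_b.items()}

states = ["TL", "TR"]
# Listening leaves the tiger in place; the growl is heard correctly w.p. 0.85.
T = {(s, "L"): {s2: 1.0 if s2 == s else 0.0 for s2 in states} for s in states}
O = {("TL", "L"): {"GL": 0.85, "GR": 0.15},
     ("TR", "L"): {"GL": 0.15, "GR": 0.85}}
b = {"TL": 0.5, "TR": 0.5}
b = pomdp_belief_update(b, "L", "GL", T, O, states)
# Starting uniform, one GL observation moves the belief in TL to 0.85.
```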
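The normalize-and-resample step of Algorithm 1 (lines 16–17) is standard importance resampling. A minimal sketch using systematic resampling, one common low-variance variant (the specific variant is our assumption; the paper does not name one):

```python
import random

def systematic_resample(particles, weights, rng):
    """Resample N particles with probability proportional to their weights,
    using a single uniform offset (systematic resampling)."""
    n = len(particles)
    total = sum(weights)
    w = [wi / total for wi in weights]       # normalize so the weights sum to 1
    u = rng.random() / n                     # one uniform offset in [0, 1/n)
    cumulative, out, i = w[0], [], 0
    for k in range(n):
        pos = u + k / n
        while pos > cumulative:              # advance to the owning particle
            i += 1
            cumulative += w[i]
        out.append(particles[i])
    return out

rng = random.Random(7)
resampled = systematic_resample(["m1", "m2", "m3"], [0.1, 0.1, 0.8], rng)
```

A heavily weighted particle (here "m3" with weight 0.8) is duplicated, while low-weight particles tend to be discarded; this concentrates the sample set on the better-fitting models.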
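Algorithm 2's marginalization over the other agent's (assumed uniform) actions, followed by a plain POMDP update, can be sketched as follows. The tiger-style tables are illustrative simplifications (e.g. the transition ignores door resets), not the paper's exact functions:

```python
def level0_belief_update(b, a_k, o_k, T, O, states, other_actions):
    """Level-0 update (Algorithm 2 sketch): average T and O over the other
    agent's actions, assumed uniform, then do a plain belief update."""
    p_other = 1.0 / len(other_actions)
    new_b = {}
    for s2 in states:
        # Effective transition, marginalized over the other agent's action.
        trans = sum(p_other * sum(T[(s, a_k, a_j)][s2] * b[s] for s in states)
                    for a_j in other_actions)
        # Effective observation likelihood, marginalized the same way.
        obs = sum(p_other * O[(s2, a_k, a_j)][o_k] for a_j in other_actions)
        new_b[s2] = trans * obs
    alpha = sum(new_b.values())
    return {s: v / alpha for s, v in new_b.items()}

states, others = ["TL", "TR"], ["L", "OL", "OR"]
# Tiger stays put while agent k listens, regardless of what j does (simplified).
T = {(s, "L", aj): {s2: 1.0 if s2 == s else 0.0 for s2 in states}
     for s in states for aj in others}
# Growl accuracy 0.85, independent of j's action (creaks omitted for brevity).
O = {(s, "L", aj): {"GL": 0.85 if s == "TL" else 0.15,
                    "GR": 0.15 if s == "TL" else 0.85}
     for s in states for aj in others}
b = level0_belief_update({"TL": 0.5, "TR": 0.5}, "L", "GL", T, O, states, others)
```

Because the other agent's action is averaged out uniformly, this is the "no-information" treatment the text describes: the modeled agent plans as if others were noise.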