<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yanlin Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piotr Gmytrasiewicz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Illinois at Chicago, Chicago</institution>
          ,
          <addr-line>IL 60607</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>69</fpage>
      <lpage>76</lpage>
      <abstract>
        <p>Interactive partially observable Markov decision processes (I-POMDPs) provide a principled framework for planning and acting in a partially observable, stochastic and multiagent environment, extending POMDPs to multi-agent settings by including models of other agents in the state space and forming a hierarchical belief structure. In order to predict other agents' actions using I-POMDP, we propose an approach that effectively uses Bayesian inference and sequential Monte Carlo (SMC) sampling to learn others' intentional models which ascribe them beliefs, preferences and rationality in action selection. For problems of various complexities, empirical results show that our algorithm accurately learns models of other agents and has superior performance in comparison with other methods. Our approach serves as a generalized reinforcement learning algorithm that learns over other agents' transition, observation and reward functions. It also effectively mitigates the belief space complexity due to the nested belief hierarchy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Partially observable Markov decision processes (POMDPs)
        <xref ref-type="bibr" rid="ref11">(Kaelbling, Littman, and Cassandra 1998)</xref>
        provide a
principled, decision-theoretic framework for planning under
uncertainty in a partially observable, stochastic environment.
An autonomous agent operates rationally in such settings
by maintaining a belief over the physical state at any given
time and, in doing so, sequentially choosing the optimal
actions that maximize future rewards. Therefore, solutions
of POMDPs are mappings from an agent’s beliefs to
actions. Although POMDPs can be applied in multi-agent
settings, doing so requires the strong assumption that the effects
of other agents’ actions are implicitly treated as noise and
folded into the state transitions, as in the recent Bayes-adaptive
POMDPs
        <xref ref-type="bibr" rid="ref16">(Ross, Chaib-draa, and Pineau 2007)</xref>
        , infinite
generalized policy representation
        <xref ref-type="bibr" rid="ref13">(Liu, Liao, and Carin 2011)</xref>
        ,
infinite POMDPs
        <xref ref-type="bibr" rid="ref4">(Doshi-Velez et al. 2013)</xref>
        . Consequently, an
agent’s beliefs about other agents do not appear in the
solutions of POMDPs.
      </p>
      <p>
        Interactive POMDPs (I-POMDPs)
        <xref ref-type="bibr" rid="ref12 ref8 ref8 ref9 ref9">(Gmytrasiewicz and
Doshi 2005)</xref>
        are a generalization of POMDPs to multi-agent
settings, obtained by replacing POMDP belief spaces with
interactive hierarchical belief systems. Specifically, the framework
augments the plain beliefs about the physical states in a POMDP by
including models of other agents, which forms a hierarchical belief
structure representing an agent’s belief about the physical
state, its beliefs about the other agents, and their beliefs about
others’ beliefs. The models of other agents included in the
augmented state space are of two types:
intentional models and subintentional models. The sophisticated
intentional model ascribes beliefs, preferences, and
rationality to other agents
        <xref ref-type="bibr" rid="ref12 ref8 ref8 ref9 ref9">(Gmytrasiewicz and Doshi 2005)</xref>
        , while
the simpler subintentional model, such as finite state
controllers
        <xref ref-type="bibr" rid="ref14">(Panella and Gmytrasiewicz 2016)</xref>
        , does not.
Solutions of I-POMDPs map an agent’s belief about the
environment and the other agents’ models to actions. The framework
is therefore applicable to important agent, human, and mixed
agent-human applications. It has been clearly shown
        <xref ref-type="bibr" rid="ref12 ref8 ref8 ref9 ref9">(Gmytrasiewicz and Doshi 2005)</xref>
        that the added sophistication
of modeling others as rational agents results in a higher
value function, which dominates the one obtained by
simply treating others as noise; this implies the modeling
superiority of I-POMDPs over other approaches for multi-agent
systems.
      </p>
      <p>
        However, the interactive belief augmentation in
I-POMDPs results in a drastic increase of the belief space
complexity, adding to the curse of dimensionality: the
complexity of the belief representation grows with the
belief dimensions due to the exponential growth of agent
models as the nesting level increases. Since exact solutions
to POMDPs are proven to be PSPACE-complete for
finite horizons and undecidable for infinite horizons
(Papadimitriou and Tsitsiklis 1987), the time complexity of the
more general I-POMDPs, which may contain multiple
POMDPs and I-POMDPs of other agents, is at least
PSPACE-complete for finite horizons and undecidable for
infinite horizons. Due to this severe space
complexity, no complete belief update has yet been
accomplished using the sophisticated intentional models over the
entire interactive belief space. There are only partial
updates of other agents’ beliefs about the physical states alone
        <xref ref-type="bibr" rid="ref3 ref6">(Doshi and Gmytrasiewicz 2009)</xref>
        and indirect approaches such
as subintentional finite state controllers
        <xref ref-type="bibr" rid="ref14">(Panella and
Gmytrasiewicz 2016)</xref>
        . Therefore, in order to unleash the full
modeling power of intentional models and apply I-POMDPs
to more realistic settings, a good approximation algorithm
for computing the nested interactive beliefs and predicting
other agents’ actions is crucial for trading off
solution quality against computational complexity.
      </p>
      <p>
        To address this issue, we propose a Bayesian approach
that utilizes customized sequential Monte Carlo sampling
algorithms
        <xref ref-type="bibr" rid="ref2">(Doucet, De Freitas, and Gordon 2001)</xref>
        to
obtain approximate solutions to I-POMDPs and to
implement the algorithms in a software package.
Specifically, we assume that models of other agents are unknown
and learned from imperfect observations of the other agents’
behaviors. We parametrize other agents’ intentional
models and maintain a belief over them, making sequential
Bayesian updates using only observations from the
environment. Since this Bayesian inference task is analytically
intractable, to approximate the posterior distribution, we
devise a customized sequential Monte Carlo method to
descend the belief hierarchy and sample all model parameters
at each nesting level, starting from the interactive particle
filter (I-PF)
        <xref ref-type="bibr" rid="ref3 ref6">(Doshi and Gmytrasiewicz 2009)</xref>
        for I-POMDP
belief update.
      </p>
      <p>Our approach, for the first time, successfully learns
others’ models over the entire intentional model space, which
contains their initial beliefs and their transition, observation and
reward functions, making it a generalized reinforcement
learning method for multi-agent settings. Our algorithm
accurately predicts others’ actions in various problem settings,
and therefore enables the modeling agent to take the corresponding
optimal actions to maximize its own rewards. By
approximating Bayesian inference with a customized sequential Monte
Carlo sampling method, we significantly mitigate the belief
space complexity of I-POMDPs.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
    </sec>
    <sec id="sec-3">
      <title>POMDP</title>
      <p>
        A partially observable Markov decision process (POMDP)
        <xref ref-type="bibr" rid="ref11">(Kaelbling, Littman, and Cassandra 1998)</xref>
        is a general
reinforcement learning model for planning and acting in a
single-agent, partially observable, stochastic domain. It is
defined for a single agent i as:
      </p>
      <p>POMDPi = ⟨S, Ai, Ωi, Ti, Oi, Ri⟩
(1)
where the meaning of each element in the 6-tuple is:
• S is the set of states of the environment.
• Ai is the set of agent i’s possible actions.
• Ωi is the set of agent i’s possible observations.
• Ti : S × Ai × S → [0, 1] is the state transition function.
• Oi : S × Ai × Ωi → [0, 1] is the observation function.
• Ri : S × Ai → ℝ is the reward function.</p>
      <p>Given the definition above, an agent’s belief about the
state can be represented as a probability distribution over
S. The belief update can be done using the following
formula, where α is the normalizing constant:
b′(s′) = α O(o, s′, a) Σs∈S T(s′, a, s) b(s)
(2)
Given the agent’s belief, the optimal action, a*, is
simply a member of the set of optimal actions, OPT(bi), for the belief
state, defined as follows (our implementation is available at
https://github.com/solohan22/IPOMDP.git):</p>
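<p>For illustration, the belief update of Eq. 2 can be sketched in a few lines of NumPy; the array layouts T[a, s, s′] and O[a, s′, o] and the single-agent tiger numbers below are our own illustrative assumptions, not part of the paper’s implementation:</p>

```python
import numpy as np

# Discrete belief update b'(s') = alpha * O(o, s', a) * sum_s T(s', a, s) b(s).
# Array layouts T[a, s, s'] and O[a, s', o] are illustrative assumptions.
def belief_update(b, a, o, T, O):
    pred = b @ T[a]               # predicted belief: sum_s T[a, s, s'] * b(s)
    b_new = O[a][:, o] * pred     # weight by the observation likelihood
    return b_new / b_new.sum()    # alpha: normalize back to a distribution

# Single-agent tiger example: S = {TL, TR}, action 0 = listen,
# observation 0 = growl-from-left, 0.85 hearing accuracy.
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])      # listening leaves the tiger in place
O = np.array([[[0.85, 0.15], [0.15, 0.85]]])  # O[0, s', o]
b = np.array([0.5, 0.5])
b = belief_update(b, a=0, o=0, T=T, O=O)      # belief shifts toward TL
```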
      <p>OPT(bi) = argmax ai∈Ai { Σs∈S bi(s) R(s, ai)
+ γ Σo∈Ωi P(o | bi, ai) U(SE(bi, ai, o)) }
(3)</p>
      <p>
The Markov Chain Monte Carlo (MCMC) method
        <xref ref-type="bibr" rid="ref7">(Gilks
et al., 1996)</xref>
        is widely used to approximate probability
distributions when they cannot be computed directly. It
generates samples from a posterior distribution π(x) over a
state space x by simulating a Markov chain p(x′|x) whose
state space is x and whose stationary distribution is π(x). The
samples drawn from p converge to the target distribution π as
the number of samples goes to infinity.
      <p>
        In order to make MCMC work for sequential inference
tasks, especially sequential decision making under Markov
assumptions, sequential versions of Monte Carlo methods
have been proposed, some of which are capable of
dealing with high-dimensional and/or complex problems,
such as particle filters
        <xref ref-type="bibr" rid="ref5">(Del Moral 1996)</xref>
        . At each
time step, particle filters draw samples (or particles) from
a proposal distribution, commonly p(xt|xt−1), which is
essentially the conditional distribution of the current state xt
given the previous state xt−1, then use the observation function
p(yt|xt) to compute an importance weight for each particle,
and resample all particles according to the weights.
      </p>
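<p>The propose–weight–resample cycle described above can be sketched as follows for a one-dimensional Gaussian random walk; the dynamics and noise scales are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal bootstrap particle filter: propose from p(x_t | x_{t-1}),
# weight by the likelihood p(y_t | x_t), then resample by weight.
def particle_filter_step(particles, y, proc_std=1.0, obs_std=0.5):
    # 1. Propose: draw x_t ~ p(x_t | x_{t-1}) for every particle.
    particles = particles + rng.normal(0.0, proc_std, size=particles.shape)
    # 2. Weight: unnormalized Gaussian likelihood of the observation.
    w = np.exp(-0.5 * ((y - particles) / obs_std) ** 2)
    w = w / w.sum()
    # 3. Resample: draw particles in proportion to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

particles = rng.normal(0.0, 1.0, size=2000)   # prior samples
for y in [0.2, 0.4, 0.7]:                     # observation sequence
    particles = particle_filter_step(particles, y)
est = particles.mean()                        # posterior mean estimate
```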
    </sec>
    <sec id="sec-4">
      <title>The Model</title>
    </sec>
    <sec id="sec-5">
      <title>I-POMDP framework</title>
      <p>An interactive POMDP of agent i, I-POMDP i, is defined
as:</p>
      <p>I-POMDPi = ⟨ISi,l, A, Ωi, Ti, Oi, Ri⟩
(4)
where ISi,l is the set of interactive states of the
environment, defined as ISi,l = S × Mj,l−1, for l ≥ 1, where S is
the set of states, Mj,l−1 is the set of possible models
of agent j, and l is the strategy level. A specific class of
models is the (l−1)th-level intentional models, Θj,l−1, of
agent j: θj,l−1 = ⟨bj,l−1, A, Ωj, Tj, Oj, Rj, OCj⟩, where bj,l−1 is
agent j’s belief nested to level (l−1), bj,l−1 ∈ Δ(ISj,l−1),
and OCj is j’s optimality criterion. The intentional model
θj,l−1, sometimes referred to as a type, can be rewritten as
θj,l−1 = ⟨bj,l−1, θ̂j⟩, where θ̂j includes all elements of the
intentional model other than the belief and is called agent
j’s frame.</p>
      <p>The set ISi,l can be defined in an inductive manner (note
that when the frame θ̂j is known, θj reduces to bj):
ISi,0 = S,   Θj,0 = {⟨bj,0, θ̂j⟩ : bj,0 ∈ Δ(S)}
ISi,1 = S × Θj,0,   Θj,1 = {⟨bj,1, θ̂j⟩ : bj,1 ∈ Δ(ISj,1)}
......
ISi,l = S × Θj,l−1,   Θj,l = {⟨bj,l, θ̂j⟩ : bj,l ∈ Δ(ISj,l)}
(5)</p>
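<p>The nested hierarchy of Eq. 5 can be pictured as a recursive data structure; the class and field names below are illustrative, not taken from the paper’s software package:</p>

```python
from dataclasses import dataclass
from typing import Optional

# Recursive picture of the interactive-state hierarchy of Eq. 5.
@dataclass
class IntentionalModel:
    belief: dict        # b_j: a distribution (here over physical states at level 0)
    frame: tuple        # theta_hat_j = (A, Omega_j, T_j, O_j, R_j, OC_j)

@dataclass
class InteractiveState:
    s: str                               # physical state
    model_j: Optional[IntentionalModel]  # None at level 0, since IS_{i,0} = S

# Level-0 and level-1 interactive states for the tiger game:
is0 = InteractiveState("TL", None)
is1 = InteractiveState("TL", IntentionalModel({"TL": 0.5, "TR": 0.5}, frame=()))
```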
      <p>All other remaining components of an I-POMDP are
similar to those in a POMDP:
• A = Ai × Aj is the set of joint actions of all agents.
• Ωi is the set of agent i’s possible observations.
• Ti : S × A × S → [0, 1] is the state transition function.
• Oi : S × A × Ωi → [0, 1] is the observation function.
• Ri : ISi × A → ℝ is the reward function.</p>
      <p>Unlike the plain belief update in POMDPs, the interactive
belief update in I-POMDPs takes two additional
sophistications into account:
bti,l(ist) = β Σist−1 bt−1i,l(ist−1) Σat−1j P(at−1j | θt−1j,l−1) T(st−1, at−1, st)
Oi(oti | st, at−1) Σotj Oj(otj | st, at−1) τ(bt−1j,l−1, at−1j, otj, btj,l−1)
(6)
Firstly, the probabilities of the other agent’s actions
given its models (the second summation) need to be
computed, since the state of the physical environment now depends
on both agents’ actions. Secondly, the agent needs to update
its beliefs based on the anticipation of what observations the
other agent might get and how it updates its own beliefs (the third
summation).</p>
      <p>Then the optimal action, a*, for the case of the infinite
horizon criterion with discounting, is a member of the set of optimal
actions, OPT(θi), for the belief state, defined as:
OPT(θi) = argmax ai∈Ai { Σis∈IS bi(is) ERi(is, ai)
+ γ Σoi∈Ωi P(oi | bi, ai) U(⟨SE(bi, ai, oi), θ̂i⟩) }
(7)</p>
      <p>
The Interactive Particle Filter (I-PF)
        <xref ref-type="bibr" rid="ref3 ref6">(Doshi and
Gmytrasiewicz 2009)</xref>
        was proposed as a filtering algorithm for
interactive belief update in I-POMDP. It generalizes the
classic particle filter algorithm to multi-agent settings and uses
the state transition function as the proposal distribution,
which is usually used in a specific particle filter algorithm
called bootstrap filter
        <xref ref-type="bibr" rid="ref10">(Gordon et al. 1993)</xref>
        . However, due to
the enormous belief space, the I-PF assumes that the other agent’s
frame θ̂j is known to the modeling agent, which
simplifies the belief update from S × Θj,l−1 to the
significantly smaller space S × bj,l−1.
      </p>
      <p>The intuition of our algorithm is to assign appropriate
prior distributions over all of agent j’s possible models θj =
⟨bj(s), Aj, Ωj, Tj, Oj, Rj, OCj⟩ and to sample from each
dimension of them. At each time step, we update all samples
using perceived observations, namely computing and
assigning a weight to each sample, and resample them according to
the weights. Finally, since this is a randomized Monte Carlo
method, to prevent the learning algorithm from converging
to incorrect models we add another resampling step that
samples from neighboring similar models given the current
samples. Consequently, our algorithm is able to maintain a
probability distribution over the most probable models of other
agents and eventually learn their optimal actions.</p>
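<p>This prior–sample–weight–resample idea can be sketched as follows; the priors mirror those used later in the experiments, while the likelihood function, which in the full algorithm comes from the nested belief update and POMDP solver, is stubbed out here as an illustrative placeholder:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000

# Each particle is one candidate intentional model of agent j:
# theta_j = (b_j, pT1, pT2, pO1, pO2, pR1, pR2, pR3).
def sample_prior(n):
    return np.column_stack([
        rng.uniform(0, 1, n),          # b_j(s): initial belief
        rng.beta(5, 3, n),             # pT1
        rng.beta(5, 5, n),             # pT2
        rng.beta(3.5, 1.4, n),         # pO1
        rng.beta(5, 5, n),             # pO2
        rng.normal(-1, 2, n),          # pR1: listening reward
        rng.normal(-100, 4, n),        # pR2: tiger-door reward
        rng.normal(10, 2, n),          # pR3: gold-door reward
    ])

def smc_step(models, likelihood):
    w = np.array([likelihood(m) for m in models])   # weight each model sample
    w = w / w.sum()
    idx = rng.choice(len(models), size=len(models), p=w)
    models = models[idx]                            # resample by weight
    # Extra resampling step: jitter each dimension around the current value
    # so the filter keeps exploring neighboring similar models.
    return models + rng.normal(0.0, 0.01, size=models.shape)

models = sample_prior(N)
# Placeholder likelihood favoring pO1 near 0.85 (illustration only).
models = smc_step(models, likelihood=lambda m: np.exp(-abs(m[3] - 0.85)))
```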
      <p>The interactive belief update described in
Algorithm 1 is similar to the I-PF in terms of the
recursive Monte Carlo sampling and nesting hierarchy, but
it has three major differences. Firstly, the belief
update is over the entire intentional model space of the other
agents; therefore the initial set of N samples is b̃t−1−k,l−1 =
⟨b(n),t−1−k,l−1, A−k, Ω−k, T(n)−k, O(n)−k, R(n)−k, OC−k⟩, where k
denotes the modeling agent and −k denotes all other
modeled agents. We only assume that the actions A−k,
observations Ω−k and optimality criteria OC−k are known, since
in a multi-agent game the rules are usually known to
all agents or could be obtained through intelligence.
Secondly, the observation function
O(n)−k(ot−k | s(n),t, at−1k, at−1−k) in line 13 is now randomized
as well, as each sample of it is a particular observation function
of that agent. Lastly, we add another resampling step in line
18 in order to avoid divergence, by resampling each
dimension of the model samples from a Gaussian distribution
centered at the current sample value. Intuitively, similar models
are resampled from a relatively tight neighboring region of
the current model samples to maintain the learning accuracy.</p>
      <p>
        Algorithm 1 can be viewed as two major steps. The
importance sampling step (lines 1 to 16) samples from the belief
priors b̃t−1−k,l−1, propagates them forward using the related proposal
distributions, and computes the weights of all samples. The
selection, or resampling, step (lines 17 to 18)
resamples according to the weights and from similar models. Specifically,
the algorithm starts from a set of initial priors is(n),t−1k; for
each of them, it samples the other agents’ optimal action at−1−k
from their policy P(A−k | θ(n),t−1−k), which is solved using a
very efficient POMDP solver called Perseus (available at
http://www.st.ewi.tudelft.nl/~mtjspaan/pomdp/index_en.html)
        <xref ref-type="bibr" rid="ref12 ref8 ref8 ref9 ref9">(Spaan and
Vlassis 2005)</xref>
        . Then it samples the physical state st using
the state transition Tk(St | S(n),t−1, at−1k, at−1−k). Once at−1−k
and st are sampled, the algorithm calls the 0-level
belief update (lines 5 to 8), described in Algorithm 2, to update
the other agents’ plain beliefs bt−k,0 if the current nesting level l
is 1, or recursively calls itself at the lower level l−1 (lines
9 to 12) if the current nesting level is greater than 1. The
sample weights w(n)t are computed according to the observation
likelihoods of the modeling and modeled agents (lines 13, 14),
and then normalized so that they sum to 1 (line 16).
Lastly, the algorithm resamples the intermediate samples
according to the computed weights (line 17) and resamples
another time from similar neighboring models (line 18).
      </p>
      <p>Algorithm 2: Level-0 Belief Update
btk,0 = Level0BeliefUpdate(bt−1k,0, at−1k, otk, T(n)k, O(n)k)
1: P(at−1−k) = 1/|A−k|
2: for st ∈ S:
3:   for st−1 ∈ S:
4:     for at−1−k ∈ A−k:
5:       P(n)(st | st−1, at−1k) += T(n)k(st | st−1, at−1k, at−1−k) P(at−1−k)
......
8:       O(n)k(otk | st, at−1k) += O(n)k(otk | st, at−1k, at−1−k) P(at−1−k)
btk,0(st) = O(n)k(otk | st, at−1k) Σst−1 P(n)(st | st−1, at−1k) bt−1k,0(st−1)
normalize and return btk,0</p>
      <p>The 0-level belief update, described in Algorithm 2, is
similar to the POMDP belief update but treats other agents’
actions as noise and takes the randomized state transition
and observation functions as input parameters. It assumes that
other agents in the environment choose their actions
according to a uniform distribution (line 1), and is therefore essentially
a no-information model. For each possible action at−1−k, it
computes the actual state transition (line 5) and the actual
observation function (line 8) by marginalizing over others’
actions, and returns the normalized belief btk,0. Notice that the
transition function T(n)k(st | st−1, at−1k, at−1−k) and
observation function O(n)k(otk | st, at−1k, at−1−k) are now both samples
from the input arguments, depending on the model parameters of
the actual agent at the 0th level.</p>
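<p>A minimal sketch of this level-0 marginalization, under assumed array layouts T[s, ak, a−k, s′] and O[s′, ak, a−k, o] and an illustrative simplification of the tiger game, is:</p>

```python
import numpy as np

# Level-0 belief update: treat the other agent's actions as uniform noise
# and marginalize them out of the sampled transition and observation models.
def level0_belief_update(b, a_k, o_k, T, O):
    n_other = T.shape[2]
    p_other = np.full(n_other, 1.0 / n_other)   # line 1: uniform over others' actions
    # Marginal transition P(s' | s, a_k) and observation P(o | s', a_k),
    # averaged over the other agent's actions (lines 5 and 8).
    T_marg = np.tensordot(T[:, a_k, :, :], p_other, axes=([1], [0]))  # [s, s']
    O_marg = np.tensordot(O[:, a_k, :, :], p_other, axes=([1], [0]))  # [s', o]
    b_new = O_marg[:, o_k] * (b @ T_marg)
    return b_new / b_new.sum()

# Tiger example with |S|=2 and 3 actions per agent; for illustration the
# other agent's action is assumed not to change the listening dynamics.
T = np.tile(np.eye(2)[:, None, None, :], (1, 3, 3, 1))   # [s, ak, a-k, s']
obs = np.array([[0.85, 0.15], [0.15, 0.85]])
O = np.tile(obs[:, None, None, :], (1, 3, 3, 1))         # [s', ak, a-k, o]
b = level0_belief_update(np.array([0.5, 0.5]), a_k=0, o_k=0, T=T, O=O)
```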
      <p>In figure 1, we illustrate the interactive belief update
using the problem discussed in the following section. Suppose
there are two agents i and j in the environment, the sample
size is 8, and the nesting level is 2; the subscripts in figure 1
denote the corresponding agents, and each dot represents a
particular belief sample. The propagate step corresponds to
lines 2 to 12 in Algorithm 1, the weight step corresponds to
lines 13 to 16, and the resample step corresponds to lines 17 and
18. The belief update for a particular level-0 model
sample (θj = ⟨bj(s) = 0.5, pT1 = 0.67, pT2 = 0.5, pO1 =
0.85, pO2 = 0.5, pR1 = −1, pR2 = −100, pR3 = 10⟩) is
solved using Algorithm 2, and the optimal action is
computed by calling the Perseus POMDP solver.</p>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
    </sec>
    <sec id="sec-7">
      <title>Setup</title>
      <p>
        We present the results using the multi-agent tiger game
        <xref ref-type="bibr" rid="ref12 ref8 ref8 ref9 ref9">(Gmytrasiewicz and Doshi 2005)</xref>
        with various settings. The
multi-agent tiger game is a generalization of the classical
single agent tiger game
        <xref ref-type="bibr" rid="ref11">(Kaelbling, Littman, and Cassandra
1998)</xref>
        by adding observations caused by others’
actions. The generalized multi-agent game contains additional
observations regarding the other players, while the state
transition and reward functions involve others’ actions as well.
      </p>
      <p>Let’s see a specific game example with known
parameters: there are a tiger and a pile of gold behind two doors,
respectively; two players can both listen for a growl of the
tiger and a creak caused by the other player, or open doors,
which resets the tiger’s location with equal probability. Their
observation accuracies toward the tiger and the other player are both
relatively high (0.85 and 0.9, respectively). No matter
which player triggers it, the reward for the listening action is -1,
opening the tiger door is -100, and opening the gold door is
10.</p>
      <p>For the sake of brevity, we restrict the experiments to
a two-agent setting and a nesting level of one, but the
sampling algorithm extends to any number of agents and
nesting levels in a straightforward manner. Recall that an
interactive POMDP of agent i is defined as the six-tuple
I-POMDPi = ⟨ISi,l, A, Ωi, Ti, Oi, Ri⟩. Thus, for the
specific setting of the multi-agent tiger problem:
• ISi,1 = S × Θj,0, where S = {tiger on the
left (TL), tiger on the right (TR)} and θj,0 =
⟨bj(s), Aj, Ωj, Tj, Oj, Rj, OCj⟩.
• A = Ai × Aj is the set of combinations of both agents’
possible actions: listen (L), open left door (OL) and open right
door (OR).
• Ωi is the set of all combinations of each agent’s possible
observations: growl from left (GL) or right (GR), combined
with creak from left (CL), right (CR) or silence (S).
• Ti = Tj : S × Ai × Aj × S → [0, 1] is a joint state
transition probability that involves both agents’ actions.
• Oi : S × Ai × Aj × Ωi → [0, 1] becomes a joint
observation probability that involves both actions. Oj is
symmetric to Oi with respect to the joint actions.
• Ri : IS × Ai × Aj → ℝ: agent i gets the corresponding
rewards when he listens, opens the wrong door, and opens
the correct door, respectively; these are independent of j’s
actions.</p>
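<p>For concreteness, the joint transition Ti = Tj can be constructed as follows; the convention that any door opening resets the tiger’s location uniformly follows the game description above, while the array layout is an assumption:</p>

```python
import numpy as np

# Joint transition for the multi-agent tiger game: if either agent opens
# a door, the tiger's location is reset uniformly; if both listen, the
# state persists. Layout T[s, ai, aj, s'] is an illustrative convention.
S, A = 2, 3                      # states {TL, TR}; actions {L, OL, OR}
T = np.zeros((S, A, A, S))
for s in range(S):
    for ai in range(A):
        for aj in range(A):
            if ai == 0 and aj == 0:       # both listen: tiger stays put
                T[s, ai, aj, s] = 1.0
            else:                         # any door opening resets the game
                T[s, ai, aj, :] = 0.5
assert np.allclose(T.sum(axis=-1), 1.0)   # valid conditional distributions
```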
    </sec>
    <sec id="sec-8">
      <title>Parameter Space</title>
      <p>For the experiments with the multi-agent tiger game, we want to
learn over all possible intentional models of the other agent
j: θj = ⟨bj(s), Aj, Ωj, Tj, Oj, Rj, OCj⟩. We only make the
reasonable assumptions that Aj and Ωj are known and that OCj is the
infinite horizon criterion with discounting. What we really want
to learn is the following:
• bj0: the initial belief of agent j about the physical state.
• Tj: the transition function of agent j, which can be
parametrized by two parameters pT1 and pT2, as shown
in Table 1.
• Oj: the observation function of agent j, which can be
parametrized by two parameters pO1 and pO2, as shown
in Table 2.
• Rj: the reward function of agent j, which can be
parametrized by three parameters pR1, pR2 and pR3, as
shown in Table 3.
The resulting model space is thus bj0 × pT1 × pT2 × pO1 × pO2 ×
pR1 × pR2 × pR3, where bj0 ∈ [0, 1] ⊂ ℝ, pT1, pT2 ∈ [0, 1] ⊂ ℝ,
pO1, pO2 ∈ [0, 1] ⊂ ℝ, and pR1, pR2, pR3 ∈ (−∞, +∞) = ℝ.</p>
      <p>We mainly reduce this huge space by two means: utilizing
Monte Carlo sampling methods and giving them
problem-specific priors so that they are not overly informative but
provide enough information for the algorithm to learn from.</p>
    </sec>
    <sec id="sec-9">
      <title>Results</title>
      <p>For the actual experiments, we fix the number of samples
to 2000 and run the algorithm on a two-agent tiger game simulation
as described above. We run experiments for learning three
different models of agent j:
1. θj1 = ⟨0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10⟩
2. θj2 = ⟨0.5, 1.00, 0.5, 0.95, 0.5, −1, −10, 10⟩
3. θj3 = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10⟩</p>
      <p>These models are all very special cases, carefully
chosen in order to verify the correctness and evaluate the
performance of our algorithm. For instance, the first model is a
sophisticated one in which the other agent is actually modeling
his opponent using a subintentional model, while the
second is a classic single-agent POMDP and the third is a very
simple one that nevertheless spans a large model space. We want to
investigate whether our framework is able to correctly and efficiently
learn these models through these experiments.</p>
      <p>The aim of the first experiment is to learn a
relatively complicated model of agent j with θj =
⟨0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10⟩, who assumes that
others’ actions are drawn from a uniform distribution.
Equivalently, agent j’s actual policy, as shown in figure 2, is to
wait for three consecutive growls from the same direction and
then open the corresponding door. For this particular
experiment, we simulated the observation history for agent i in
order to first verify the correctness of our algorithm,
excluding the impact of uncertainties from hearing
accuracy. The simulated observation history is as follows: {GL,S
GL,S GL,S GL,CR GL,S GL,S GL,S GR,CR GL,S GL,S
GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S GR,S
GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL
GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S
GR,S GR,S GL,CL GL,S GL,S GL,S GR,CR GL,S GL,S
GL,S GR,CR GR,S GR,S}</p>
      <p>The priors we assign to the parameters are shown in
figure 3; specifically, they are uniform U(0,1) for bj0, Beta(5,3)
with mode 0.67 for pT1, Beta(5,5) for pT2, Beta(3.5,1.4)
with mode 0.85 for pO1, Beta(5,5) for pO2, Gaussian
N(−1,2) for pR1, N(−100,4) for pR2, and N(10,2) for pR3.</p>
      <p>After 50 time steps, the algorithm converges to a posterior
distribution over agent j’s intentional models; the results are
also given in figure 3. Since the parameter space of agent j’s
models is 8-dimensional, here we only show the marginal
distribution of each parameter in histograms. We can
easily see that the majority of samples are centered around
the true parameter values.</p>
      <p>
        We use principal component analysis (PCA)
        <xref ref-type="bibr" rid="ref1">(Abdi and
Williams 2010)</xref>
        to reduce the sample dimensionality to
two dimensions and plot the samples in a 3-dimensional histogram,
as shown in Figure 4. The distribution starts from a Gaussian-like prior and
gradually converges to the most likely models. Eventually
the mean value of this cluster, ⟨0.49, 0.69, 0.49, 0.82, 0.51,
−0.95, −99.23, 10.09⟩, is very close to the true model. Here
we give two examples from the big cluster after 50 time
steps: ⟨0.56, 0.66, 0.49, 0.84, 0.59, −0.95, −101.37, 11.42⟩
and ⟨0.51, 0.68, 0.52, 0.89, 0.56, −1.33, −98.39, 12.55⟩.
The former has a corresponding optimal policy of [0—
OL—0.10—L—1], while the latter has [0—OL—0.09—
L—0.91—OR—1]; both are extremely close to the
optimal policy of the true model: [0—OL—0.1—L—0.9—
OR—1]. Consequently, the framework is able to predict
other agents’ actions with high accuracy.
      </p>
      <p>We tested the performance of our algorithm in terms
of prediction accuracy for others’ actions. We
compared the results with other modeling approaches: a
frequency-based approach, in which agent j is assumed
to choose his actions according to a fixed but unknown
distribution, and a no-information model, which treats j’s
actions purely as uniform noise. The results shown in figure
5 are averaged over 10 random runs, each of which has
50 time steps. They show clearly that the intentional I-POMDP
approach achieves significantly lower error rates as agent i
perceives more observations. The subintentional model assumes
j’s actions are drawn from a uniform distribution and therefore has
a fixed, high error rate. The frequency-based approach has
some learning ability but is far from sophisticated enough
to model a fully rational agent.</p>
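<p>The frequency-based baseline can be sketched as a simple action counter; the add-one initialization is an illustrative smoothing assumption:</p>

```python
from collections import Counter

# Frequency-based baseline: agent j's actions are assumed to come from a
# fixed but unknown distribution, estimated by counting observed actions.
class FrequencyModel:
    def __init__(self, actions):
        self.counts = Counter({a: 1 for a in actions})   # add-one smoothing
    def update(self, observed_action):
        self.counts[observed_action] += 1
    def predict(self):
        # Predict the action observed most frequently so far.
        return max(self.counts, key=self.counts.get)

m = FrequencyModel(["L", "OL", "OR"])
for a in ["L", "L", "L", "OL"]:   # observed action sequence for agent j
    m.update(a)
prediction = m.predict()
```

Unlike the intentional model, this baseline cannot anticipate how j's action distribution shifts as j's own belief changes, which is why its error rate plateaus in figure 5.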
      <p>In the second experiment, we run our algorithm on
actual observations for 30 time steps, until it converges, and
try to learn the model of a simpler classic POMDP with a high
listening accuracy of 0.95 and a small penalty of −10; here
agent j alternately opens a door and listens, as shown in
Figure 6 (left). The actual model of j is θj = ⟨0.5, 1, 0.5,
0.95, 0.5, −1, −10, 10⟩, the priors assigned to bj0, pT1, pT2,
pO1, pO2, pR1, pR2, pR3 are U(0,1), Beta(2,0.5), Beta(10,10),
Beta(19,1), Beta(10,10), N(−1,1), N(−10,2), N(10,2), and the
actual observation history is {GR,S GL,CR GL,S GL,CL
GL,S GL,CR GL,S GL,CL GL,S GR,S GR,CL GR,CL GL,S
GR,S GR,S GL,CL GR,S GL,CR GR,S GR,CR GR,CR
GR,CL GL,S GL,S GL,S GL,CR GL,S GL,CL GR,S GR,S}.</p>
      <p>Similarly, we report the learned posterior distributions
over the model parameters in figure 7. We observe an
interesting pattern: while some parameters, such as bj,0, pT2 and
pO2, are concentrated around the actual values, others like
pT1 and pO1 become more dispersed than their initial priors. The
intuition is that the penalty and reward are −10 and
10, so a single listen with reward −1 is enough to make the
decision to open a door. That is to say, as long as the tiger likely
remains behind the same door while the agent listens (the
meaning of pT1) and the agent has a reliable hearing accuracy (the
meaning of pO1), there are many models which satisfy this particular
observation sequence, and hence our algorithm learns them all.</p>
      <p>For conciseness, we show the average prediction error
rates for both the second and third experiments in figure 9.
Both results are averaged over 10 random runs, each of
which has 30 time steps. In the second experiment, shown in
figure 9(a), the intentional I-POMDP approach still has
significantly lower error rates than the others.</p>
      <p>Figure 7: Learned posterior distributions for model θj =
⟨0.5, 1, 0.5, 0.95, 0.5, −1, −10, 10⟩.</p>
      <p>In the last experiment, we want to learn the model θj =
⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10⟩, in which agent j always listens,
since the listening penalty is now equal to the reward, as
shown in figure 6(b). For brevity, we only show the marginal
distributions over the model parameters in figure 8. The priors
assigned to bj0, pT1, pT2, pO1, pO2, pR1, pR2, pR3 are U(0,1),
Beta(5,3), Beta(10,10), Beta(3.5,1.4), Beta(10,10), N(10,1),
N(−100,2), N(10,2), and the actual observation history that agent i
learns from is {GL,S GL,S GR,S GL,S GL,CL GR,S GR,S
GL,CL GR,S GL,S GL,S GR,S GL,S GL,S GL,S GL,CL
GR,S GL,S GL,S GL,S}. We can see that all three reward
parameters are correctly learned, while the samples of pT1, pT2,
pO1 and pO2 are not very concentrated around their true values
but stay close to their corresponding priors; intuitively, these
parameters become less important and can lie in a relatively loose
region due to the increased pR1 = 10. Lastly, the performance
comparison is given in figure 9(b).</p>
    </sec>
    <sec id="sec-10">
      <title>Conclusions and Future Work</title>
      <p>We have described a new approach to learn other agents’
models by approximating the interactive belief update
using Bayesian inference and Monte Carlo sampling methods.
Our framework correctly learns others’ models over the
entire intentional model space and therefore is a generalized
reinforcement learning algorithm for multi-agent settings. It
also effectively mitigates the belief space complexity and
achieves significantly better performance than other approaches
in terms of predicting others’ actions.</p>
      <p>In the future, in order to fully evaluate practicability
on larger problem spaces, more multi-agent problems of
various sizes could be tested. Due to computational complexity,
experiments at higher nesting levels are currently limited.
Thus, more effort could be devoted to nonparametric
Bayesian methods, which inherently deal with nested belief
structures.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abdi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <year>2010</year>
          .
          <article-title>Principal component analysis</article-title>
          .
          <source>Wiley interdisciplinary reviews: computational statistics</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>433</fpage>
          -
          <lpage>459</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Doucet</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Freitas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gordon</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <year>2001</year>
          .
          <article-title>An introduction to sequential Monte Carlo methods</article-title>
          .
          <source>In Sequential Monte Carlo methods in practice</source>
          (pp.
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          ). Springer New York.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Doshi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gmytrasiewicz</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Monte Carlo sampling methods for approximating interactive POMDPs</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>34</volume>
          :
          <fpage>297</fpage>
          -
          <lpage>337</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Doshi-Velez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Konidaris</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations</article-title>
          .
          <source>arXiv:1308</source>
          .
          <fpage>3513</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Del Moral</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>1996</year>
          .
          <article-title>Non-linear filtering: interacting particle resolution</article-title>
          .
          <source>Markov processes and related fields</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>555</fpage>
          -
          <lpage>581</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Doshi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Graphical models for interactive POMDPs: representations and solutions</article-title>
          .
          <source>Autonomous Agents and Multi-Agent Systems 18.3</source>
          :
          <fpage>376</fpage>
          -
          <lpage>416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Gilks</surname>
            ,
            <given-names>W.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Spiegelhalter</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <year>1996</year>
          .
          <article-title>Introducing markov chain monte carlo</article-title>
          .
          <source>Markov chain Monte Carlo in practice, 1</source>
          , p.
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Gmytrasiewicz</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Doshi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>A framework for sequential planning in multi-agent settings</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>24</volume>
          (
          <issue>1</issue>
          ):
          <fpage>49</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Gmytrasiewicz</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Doshi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>A framework for sequential planning in multi-agent settings</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>24</volume>
          (
          <issue>1</issue>
          ):
          <fpage>49</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Gordon</surname>
            ,
            <given-names>N.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salmond</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          ,
          <year>1993</year>
          , April.
          <article-title>Novel approach to nonlinear/non-Gaussian Bayesian state estimation</article-title>
          .
          <source>In IEE Proceedings F (Radar and Signal Processing)</source>
          (Vol.
          <volume>140</volume>
          , No.
          <issue>2</issue>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>113</lpage>
          ).
          <source>IET Digital Library.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Kaelbling</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cassandra</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <year>1998</year>
          .
          <article-title>Planning and acting in partially observable stochastic domains</article-title>
          .
          <source>Artificial intelligence</source>
          ,
          <volume>101</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>99</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Spaan</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vlassis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <year>2005</year>
          . Perseus:
          <article-title>Randomized pointbased value iteration for POMDPs</article-title>
          .
          <source>Journal of artificial intelligence research</source>
          ,
          <volume>24</volume>
          , pp.
          <fpage>195</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Carin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <year>2011</year>
          .
          <article-title>The infinite regionalized policy representation</article-title>
          .
          <source>In Proceedings of the 28th International Conference on Machine Learning (ICML-11)</source>
          (pp.
          <fpage>769</fpage>
          -
          <lpage>776</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Panella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gmytrasiewicz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>2016</year>
          , March. Bayesian Learning of Other Agents'
          <article-title>Finite Controllers for Interactive POMDPs</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Pearl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>1988</year>
          .
          <article-title>Probabilistic reasoning in intelligent systems: Networks of plausible reasoning</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Chaib-draa,
          <string-name>
            <given-names>B.</given-names>
            and
            <surname>Pineau</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          ,
          <year>2007</year>
          .
          <article-title>Bayes-adaptive pomdps</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          (pp.
          <fpage>1225</fpage>
          -
          <lpage>1232</lpage>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>