=Paper=
{{Paper
|id=Vol-1964/MA3
|storemode=property
|title=Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs
|pdfUrl=https://ceur-ws.org/Vol-1964/MA3.pdf
|volume=Vol-1964
|authors=Yanlin Han,Piotr Gmytrasiewicz
|dblpUrl=https://dblp.org/rec/conf/maics/HanG17
}}
==Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs==
Yanlin Han and Piotr Gmytrasiewicz
Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607
MAICS 2017, pp. 69–76
Abstract

Interactive partially observable Markov decision processes (I-POMDPs) provide a principled framework for planning and acting in a partially observable, stochastic and multi-agent environment, extending POMDPs to multi-agent settings by including models of other agents in the state space and forming a hierarchical belief structure. In order to predict other agents' actions using I-POMDPs, we propose an approach that effectively uses Bayesian inference and sequential Monte Carlo (SMC) sampling to learn others' intentional models, which ascribe to them beliefs, preferences and rationality in action selection. For problems of various complexities, empirical results show that our algorithm accurately learns models of other agents and has superior performance in comparison with other methods. Our approach serves as a generalized reinforcement learning algorithm that learns over other agents' transition, observation and reward functions. It also effectively mitigates the belief space complexity due to the nested belief hierarchy.

Introduction

Partially observable Markov decision processes (POMDPs) (Kaelbling, Littman, and Cassandra 1998) provide a principled, decision-theoretic framework for planning under uncertainty in a partially observable, stochastic environment. An autonomous agent operates rationally in such settings by maintaining a belief over the physical state at any given time; in doing so it sequentially chooses the optimal actions that maximize future rewards. Therefore, solutions of POMDPs are mappings from an agent's beliefs to actions. Although POMDPs can be used in multi-agent settings, they do so under the strong assumption that the effects of other agents' actions are implicitly treated as noise and folded into the state transitions, as in recent Bayes-adaptive POMDPs (Ross, Chaib-draa, and Pineau 2007), the infinite generalized policy representation (Liu, Liao, and Carin 2011), and infinite POMDPs (Doshi-Velez et al. 2013). Consequently, an agent's beliefs about other agents are not part of the solutions of POMDPs.

The interactive POMDP (I-POMDP) (Gmytrasiewicz and Doshi 2005) is a generalization of the POMDP to multi-agent settings, obtained by replacing POMDP belief spaces with interactive hierarchical belief systems. Specifically, it augments the plain beliefs about the physical states in a POMDP by including models of other agents, which forms a hierarchical belief structure that represents an agent's belief about the physical state, beliefs about the other agents, and their beliefs about others' beliefs. The models of other agents included in the new augmented state space consist of two types: intentional models and subintentional models. The sophisticated intentional model ascribes beliefs, preferences, and rationality to other agents (Gmytrasiewicz and Doshi 2005), while the simpler subintentional model, such as a finite state controller (Panella and Gmytrasiewicz 2016), does not. Solutions of I-POMDPs map an agent's belief about the environment and other agents' models to actions. Therefore, the framework is applicable to all important agent, human, and mixed agent-human applications. It has been clearly shown (Gmytrasiewicz and Doshi 2005) that the added sophistication of modeling others as rational agents results in a higher value function which dominates the one obtained from simply treating others as noise, which implies the modeling superiority of I-POMDPs over other approaches to multi-agent systems.

However, the interactive belief modification for I-POMDPs results in a drastic increase of the belief space complexity, adding to the curse of dimensionality: the complexity of the belief representation is proportional to the belief dimensions, due to the exponential growth of agent models with increasing nesting level. Since exact solutions to POMDPs are proven to be PSPACE-complete for finite horizon and undecidable for infinite horizon (Papadimitriou and Tsitsiklis 1987), the time complexity of the more general I-POMDPs, which may contain multiple POMDPs and I-POMDPs of other agents, is greater than or equal to PSPACE-complete for finite horizon and undecidable for infinite horizon. Due to this severe space complexity, no complete belief update has yet been accomplished using the sophisticated intentional models over the entire interactive belief space; there are only partial updates over other agents' beliefs about the physical states alone (Doshi and Gmytrasiewicz 2009) and indirect approaches such as subintentional finite state controllers (Panella and Gmytrasiewicz 2016). Therefore, in order to unleash the full modeling power of intentional models and apply I-POMDPs to more realistic settings, a good approximation algorithm for computing the nested interactive belief and predicting other agents' actions is crucial to the trade-off between solution quality and computation complexity.
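To make the nested structure concrete, the hierarchy can be viewed as a recursive data type. The following minimal Python sketch is our own illustration, not code from the paper's package; the names `Model` and `interactive_state` are hypothetical, and the tiger-game states TL/TR from the experiments section are used only as example values:

```python
class Model:
    """Intentional model of another agent: a belief plus a frame.

    At level 0 the belief is a distribution over physical states S.
    At level l > 0 the belief is a distribution over (state, level l-1
    model) pairs, which yields the nested belief hierarchy of I-POMDPs.
    """
    def __init__(self, level, belief, frame):
        self.level = level    # strategy/nesting level
        self.belief = belief  # dict mapping outcomes to probabilities
        self.frame = frame    # everything except the belief (T, O, R, OC)

def interactive_state(physical_state, other_agent_model):
    """An interactive state pairs a physical state with a model of the other agent."""
    return (physical_state, other_agent_model)

# Level-0 model of agent j: a plain belief over S = {"TL", "TR"}.
m0 = Model(0, {"TL": 0.5, "TR": 0.5}, frame=None)

# Level-1 belief of agent i: a distribution over interactive states.
b1 = {interactive_state("TL", m0): 0.5,
      interactive_state("TR", m0): 0.5}
```

Each additional nesting level simply wraps another `Model` whose belief ranges over such pairs, which is why the representation grows exponentially with the strategy level.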
To address this issue, we propose a Bayesian approach that utilizes customized sequential Monte Carlo sampling algorithms (Doucet, De Freitas, and Gordon 2001) to obtain approximate solutions to I-POMDPs, and we implement the algorithms in a software package[1]. Specifically, we assume that the models of other agents are unknown and must be learned from imperfect observations of the other agents' behaviors. We parametrize other agents' intentional models and maintain a belief over them, making sequential Bayesian updates using only observations from the environment. Since this Bayesian inference task is analytically intractable, to approximate the posterior distribution we devise a customized sequential Monte Carlo method that descends the belief hierarchy and samples all model parameters at each nesting level, starting from the interactive particle filter (I-PF) (Doshi and Gmytrasiewicz 2009) for the I-POMDP belief update.

[1] https://github.com/solohan22/IPOMDP.git

Our approach, for the first time, successfully learns others' models over the entire intentional model space, which contains their initial beliefs, transition, observation and reward functions, making it a generalized reinforcement learning method for multi-agent settings. Our algorithm accurately predicts others' actions in various problem settings, and therefore enables the modeling agent to take the corresponding optimal actions that maximize its own rewards. By approximating Bayesian inference using a customized sequential Monte Carlo sampling method, we significantly mitigate the belief space complexity of I-POMDPs.

Background

POMDP

A partially observable Markov decision process (POMDP) (Kaelbling, Littman, and Cassandra 1998) is a general reinforcement learning model for planning and acting in a single-agent, partially observable, stochastic domain. It is defined for a single agent i as:

$$POMDP_i = \langle S, A_i, \Omega_i, T_i, O_i, R_i \rangle \quad (1)$$

where the meaning of each element in the 6-tuple is:
• S is the set of states of the environment.
• A_i is the set of agent i's possible actions.
• Ω_i is the set of agent i's possible observations.
• T_i : S × A_i × S → [0, 1] is the state transition function.
• O_i : S × A_i × Ω_i → [0, 1] is the observation function.
• R_i : S × A_i → ℝ is the reward function.

Given the definition above, an agent's belief about the state can be represented as a probability distribution over S. The belief update can be done using the following formula, where α is a normalizing constant:

$$b'(s') = \alpha \, O(o, s', a) \sum_{s \in S} T(s', a, s)\, b(s) \quad (2)$$

Given the agent's belief, the optimal action, a*, is simply part of the set of optimal actions, OPT(b_i), for the belief state, defined as:

$$OPT(b_i) = \arg\max_{a_i \in A_i} \Big\{ \sum_{s \in S} b_i(s)\, R(s, a_i) + \sum_{o_i \in \Omega_i} P(o_i \mid a_i, b_i)\, U(SE(b_i, a_i, o_i)) \Big\} \quad (3)$$

Particle Filter

The Markov chain Monte Carlo (MCMC) method (Gilks et al. 1996) is widely used to approximate probability distributions when they cannot be computed directly. It generates samples from a posterior distribution π(x) over state space x by simulating a Markov chain p(x'|x) whose state space is x and whose stationary distribution is π(x). The samples drawn from p converge to the target distribution π as the number of samples goes to infinity.

In order to make MCMC work on sequential inference tasks, especially sequential decision making under Markov assumptions, sequential versions of Monte Carlo methods have been proposed, and some of them are capable of dealing with high-dimensional and/or complex problems, such as particle filters (Del Moral 1996). At each time step, a particle filter draws samples (or particles) from a proposal distribution, commonly p(x_t | x_{t-1}), which is essentially the conditional distribution of the current state x_t given the previous state x_{t-1}; it then uses the observation function p(y_t | x_t) to compute an importance weight for each particle and resamples all particles according to the weights.

The Model

I-POMDP framework

An interactive POMDP of agent i, I-POMDP_i, is defined as:

$$I\text{-}POMDP_i = \langle IS_{i,l}, A, \Omega_i, T_i, O_i, R_i \rangle \quad (4)$$

where IS_{i,l} is the set of interactive states of the environment, defined as IS_{i,l} = S × M_{j,l-1}, l ≥ 1, where S is the set of states, M_{j,l-1} is the set of possible models of agent j, and l is the strategy level. A specific class of models are the (l-1)th-level intentional models, Θ_{j,l-1}, of agent j: θ_{j,l-1} = ⟨b_{j,l-1}, A, Ω_j, T_j, O_j, R_j, OC_j⟩, where b_{j,l-1} is agent j's belief nested to level (l-1), b_{j,l-1} ∈ Δ(IS_{j,l-1}), and OC_j is j's optimality criterion. The intentional model θ_{j,l-1}, sometimes referred to as a type, can be rewritten as θ_{j,l-1} = ⟨b_{j,l-1}, θ̂_j⟩, where θ̂_j includes all elements of the intentional model other than the belief and is called agent j's frame.

The set IS_{i,l} can be defined in an inductive manner (note that when θ̂_j is known, θ_j reduces to b_j):

$$IS_{i,0} = S, \qquad \theta_{j,0} = \{\langle b_{j,0}, \hat{\theta}_j \rangle : b_{j,0} \in \Delta(S)\}$$
$$IS_{i,1} = S \times \theta_{j,0}, \qquad \theta_{j,1} = \{\langle b_{j,1}, \hat{\theta}_j \rangle : b_{j,1} \in \Delta(IS_{j,1})\}$$
$$\ldots$$
$$IS_{i,l} = S \times \theta_{j,l-1}, \qquad \theta_{j,l} = \{\langle b_{j,l}, \hat{\theta}_j \rangle : b_{j,l} \in \Delta(IS_{j,l})\} \quad (5)$$

And all other remaining components in an I-POMDP are similar to those in a POMDP:
• A = A_i × A_j is the set of joint actions of all agents.
• Ω_i is the set of agent i's possible observations.
• T_i : S × A × S → [0, 1] is the state transition function.
• O_i : S × A × Ω_i → [0, 1] is the observation function.
• R_i : IS × A → ℝ is the reward function.

Interactive belief update

Given all the definitions above, the interactive belief update can be performed as follows:

$$b_i^t(is^t) = \Pr(is^t \mid b_i^{t-1}, a_i^{t-1}, o_i^t) \quad (6)$$
$$= \alpha \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1})\, T(s^{t-1}, a^{t-1}, s^t)$$
$$\times\; O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \tau(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)$$

Unlike the plain belief update in a POMDP, the interactive belief update in an I-POMDP takes two additional sophistications into account. Firstly, the probabilities of the other agent's actions given its models (the second summation) need to be computed, since the state of the physical environment now depends on both agents' actions. Secondly, the agent needs to update its beliefs based on the anticipation of what observations the other agent might get and how it updates (the third summation).

Then the optimal action, a*, for the case of the infinite-horizon criterion with discounting, is part of the set of optimal actions, OPT(θ_i), for the belief state, defined as:

$$OPT(\theta_i) = \arg\max_{a_i \in A_i} \Big\{ \sum_{is \in IS} b_i(is)\, ER_i(is, a_i) + \sum_{o_i \in \Omega_i} P(o_i \mid a_i, b_i)\, U(\langle SE_{\theta_i}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \quad (7)$$

Sampling Algorithms

The interactive particle filter (I-PF) (Doshi and Gmytrasiewicz 2009) was proposed as a filtering algorithm for the interactive belief update in I-POMDPs. It generalizes the classic particle filter to multi-agent settings and uses the state transition function as the proposal distribution, as is done in the specific particle filter algorithm called the bootstrap filter (Gordon, Salmond, and Smith 1993). However, due to the enormous belief space, the I-PF assumes that the other agent's frame θ̂_j is known to the modeling agent, and therefore simplifies the belief update process from S × Θ_{j,l-1} to the significantly smaller space S × b_{j,l-1}.

The intuition behind our algorithm is to assign appropriate prior distributions over all of agent j's possible models θ_j = ⟨b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j⟩ and sample from each of their dimensions. At each time step, we update all samples using perceived observations, namely computing and assigning a weight to each sample, and resample the samples according to the weights. Lastly, since this is a randomized Monte Carlo method, to prevent the learning algorithm from converging to incorrect models we add another resampling step that samples from the neighboring similar models given the current samples. Consequently, our algorithm is able to maintain a probability distribution over the most probable models of other agents and eventually learn their optimal actions.

Algorithm 1: Interactive Belief Update

b̃^t_{k,l} = InteractiveBeliefUpdate(b̃^{t-1}_{k,l}, a^{t-1}_k, o^t_k, l > 0)
1   for each is^{(n),t-1}_k = ⟨s^{(n),t-1}, θ^{(n),t-1}_{-k}⟩ ∈ b̃^{t-1}_{k,l}:
2       sample a^{t-1}_{-k} ~ P(A_{-k} | θ^{(n),t-1}_{-k})
3       sample s^{(n),t} ~ T_k(S^t | s^{(n),t-1}, a^{t-1}_k, a^{t-1}_{-k})
4       for each o^t_{-k} ∈ Ω_{-k}:
5           if l = 1:
6               b^{(n),t}_{-k,0} = Level0BeliefUpdate(b^{(n),t-1}_{-k,0}, a^{t-1}_{-k}, o^t_{-k}, θ^{(n),t-1}_{-k})
7               θ^{(n),t}_{-k} = ⟨b^{(n),t}_{-k,0}, θ̂^{(n),t-1}_{-k}⟩
8               is^{(n),t}_k = ⟨s^{(n),t}, θ^{(n),t}_{-k}⟩
9           else:
10              b^{(n),t}_{-k,l-1} = InteractiveBeliefUpdate(b̃^{t-1}_{-k,l-1}, a^{t-1}_{-k}, o^t_{-k}, l-1)
11              θ^{(n),t}_{-k} = ⟨b^{(n),t}_{-k,l-1}, θ̂^{(n),t-1}_{-k}⟩
12              is^{(n),t}_k = ⟨s^{(n),t}, θ^{(n),t}_{-k}⟩
13          w^{(n)}_t = O_{-k}(o^t_{-k} | s^{(n),t}, a^{t-1}_k, a^{t-1}_{-k})
14          w^{(n)}_t = w^{(n)}_t × O_k(o^t_k | s^{(n),t}, a^{t-1}_k, a^{t-1}_{-k})
15          add ⟨is^{(n),t}_k, w^{(n)}_t⟩ to b̃^{temp}_{k,l}
16  normalize all w^{(n)}_t so that Σ_{n=1..N} w^{(n)}_t = 1
17  resample from b̃^{temp}_{k,l} according to the normalized w^{(n)}_t
18  resample each θ^{(n),t}_{-k} from neighboring similar models
19  return b̃^t_{k,l} = {is^{(n),t}_k}

The interactive belief update described in Algorithm 1 is similar to the I-PF in terms of the recursive Monte Carlo sampling and the nesting hierarchy, but it has three major differences. Firstly, the belief update is over the entire intentional model space of the other agents; each of the initial N samples in b̃^{t-1}_{k,l} therefore has the form ⟨b^{(n),t-1}_{-k,l-1}, A^{(n)}_{-k}, Ω^{(n)}_{-k}, T^{(n)}_{-k}, O^{(n)}_{-k}, R^{(n)}_{-k}, OC_{-k}⟩, where k denotes the modeling agent and -k denotes all other modeled agents. We only assume that the actions A_{-k}, the observations Ω_{-k} and the optimality criteria OC_{-k} are known, since in a multi-agent game the rules are usually known to all agents or can be obtained through intelligence. Secondly, it is intuitive to see that the observation function O^{(n)}_{-k}(o^t_{-k} | s^{(n),t}, a^{t-1}_k, a^{t-1}_{-k}) in line 13 is now randomized as well, as each of them is a particular observation function of that agent. Lastly, we add another resampling step in line 18 in order to avoid divergence, by resampling each dimension of the model samples from a Gaussian distribution whose mean is the current sample value. Intuitively, similar models are resampled from a relatively tight neighboring region of the current model samples to maintain the learning accuracy.

Algorithm 1 can be viewed as two major steps. The importance sampling step (lines 1 to 16) samples from the belief priors b̃^{t-1}_{k,l}, propagates the samples forward using the related proposal distributions, and computes the weights of all samples. The selection, or resampling, step (lines 17 to 18) resamples according to the weights and to similar models. Specifically, the algorithm starts from a set of initial priors is^{(n),t-1}_k; for each of them, it samples the other agents' optimal action a^{t-1}_{-k} from their policy P(A_{-k} | θ^{(n),t-1}_{-k}), which is solved using a very efficient POMDP solver called Perseus[2] (Spaan and Vlassis 2005). It then samples the physical state s^t using the state transition T_k(S^t | s^{(n),t-1}, a^{t-1}_k, a^{t-1}_{-k}). Once a^{t-1}_{-k} and s^t are sampled, the algorithm calls the 0-level belief update (lines 5 to 8), described in Algorithm 2, to update the other agents' plain beliefs b^t_{-k,0} if the current nesting level l is 1, or recursively calls itself at a lower level l-1 (lines 9 to 12) if the current nesting level is greater than 1. The sample weights w^{(n)}_t are computed according to the observation likelihoods of the modeling and modeled agents (lines 13, 14) and then normalized so that they sum to 1 (line 16). Lastly, the algorithm resamples the intermediate samples according to the computed weights (line 17) and resamples another time from similar neighboring models (line 18).

[2] http://www.st.ewi.tudelft.nl/~mtjspaan/pomdp/index_en.html

Algorithm 2: Level-0 Belief Update

b^t_{k,0} = Level0BeliefUpdate(b^{t-1}_{k,0}, a^{t-1}_k, o^t_k, T^{(n)}_k, O^{(n)}_k)
1   P(a^{t-1}_{-k}) = 1/|A_{-k}|
2   for each s^t ∈ S:
3       for each s^{t-1} ∈ S:
4           for each a^{t-1}_{-k} ∈ A_{-k}:
5               P^{(n)}(s^t | s^{t-1}, a^{t-1}_k) += T^{(n)}_k(s^t | s^{t-1}, a^{t-1}_k, a^{t-1}_{-k}) × P(a^{t-1}_{-k})
6           sum += P^{(n)}(s^t | s^{t-1}, a^{t-1}_k) b^{t-1}_{k,0}(s^{t-1})
7       for each a^{t-1}_{-k} ∈ A_{-k}:
8           P^{(n)}(o^t_k | s^t, a^{t-1}_k) += O^{(n)}_k(o^t_k | s^t, a^{t-1}_k, a^{t-1}_{-k}) P(a^{t-1}_{-k})
9       b^t_{k,0}(s^t) = sum × P^{(n)}(o^t_k | s^t, a^{t-1}_k)
10  normalize and return b^t_{k,0}

The 0-level belief update, described in Algorithm 2, is similar to the POMDP belief update, but it treats other agents' actions as noise and takes randomized state transition and observation functions as input parameters. It assumes that the other agents in the environment choose their actions according to a uniform distribution (line 1), and is therefore essentially a no-information model. For each possible action a^{t-1}_{-k}, it computes the actual state transition (line 5) and the actual observation function (line 8) by marginalizing over the others' actions, and returns the normalized belief b^t_{k,0}. Notice that the transition function T^{(n)}_k(s^t | s^{t-1}, a^{t-1}_k, a^{t-1}_{-k}) and the observation function O^{(n)}_k(o^t_k | s^t, a^{t-1}_k, a^{t-1}_{-k}) are now both samples passed in as input arguments, depending on the model parameters of the actual agent at the 0th level.

Figure 1: An illustration of the interactive belief update for two agents and 1 level of nesting.

In figure 1, we illustrate the interactive belief update using the problem discussed in the following section. Suppose there are two agents i and j in the environment, the sample size is 8 and the nesting level is 2; the subscripts in figure 1 denote the corresponding agents and each dot represents a particular belief sample. The propagate step corresponds to lines 2 to 12 in Algorithm 1, the weight step corresponds to lines 13 to 16, and the resample step corresponds to lines 17 and 18. The belief update for a particular level-0 model sample (θ_j = ⟨b_j(s) = 0.5, pT1 = 0.67, pT2 = 0.5, pO1 = 0.85, pO2 = 0.5, pR1 = -1, pR2 = -100, pR3 = 10⟩) is solved using Algorithm 2, and the optimal action is computed by calling the Perseus POMDP solver.

Experiments

Setup

We present results on the multi-agent tiger game (Gmytrasiewicz and Doshi 2005) with various settings. The multi-agent tiger game is a generalization of the classical single-agent tiger game (Kaelbling, Littman, and Cassandra 1998) obtained by adding observations caused by others' actions. The generalized multi-agent game contains additional observations regarding the other players, while the state transition and reward functions involve others' actions as well.

Consider a specific game example with known parameters: a tiger and a pile of gold sit behind two doors, respectively. Both players can listen for a growl of the tiger and for a creak caused by the other player, or open a door, which resets the tiger's location with equal probability. Their observation accuracies regarding the tiger and the other player are both relatively high (0.85 and 0.9, respectively). No matter which player triggers it, the reward for the listening action is -1, for opening the tiger door is -100, and for opening the gold door is 10.
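As a concrete illustration, the level-0 update of Algorithm 2 — marginalizing the sampled transition and observation functions over a uniform distribution on the other agent's actions — can be sketched in Python as follows. This is a simplified rendering under assumed function signatures, not the authors' implementation:

```python
def level0_belief_update(b_prev, a_k, o_k, T, O, states, other_actions):
    """Level-0 belief update treating the other agent's actions as uniform noise.

    b_prev: dict state -> probability (previous belief of the modeled agent)
    a_k:    the agent's own previous action
    o_k:    the agent's current observation
    T(s_next, s, a_k, a_other) -> transition probability (a sampled T)
    O(o, s_next, a_k, a_other) -> observation probability (a sampled O)
    """
    p_other = 1.0 / len(other_actions)  # line 1: uniform prior on others' actions
    b_new = {}
    for s_next in states:
        # Line 5-6: marginalize the transition over the other agent's
        # actions and propagate the previous belief.
        pred = 0.0
        for s in states:
            t = sum(T(s_next, s, a_k, a_o) * p_other for a_o in other_actions)
            pred += t * b_prev[s]
        # Line 8: marginalize the observation function the same way.
        obs = sum(O(o_k, s_next, a_k, a_o) * p_other for a_o in other_actions)
        b_new[s_next] = pred * obs
    z = sum(b_new.values())  # line 10: normalize
    return {s: p / z for s, p in b_new.items()}
```

With a tiger-like setup (the tiger stays put when both agents listen, any door opening resets it with probability 0.5, growl accuracy 0.85), an agent hearing GL from a uniform belief ends up with belief 0.85 on TL, matching the stated hearing accuracy.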
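The weighting and double-resampling steps (lines 13–18 of Algorithm 1) follow the standard importance-resampling pattern, plus the Gaussian perturbation that draws neighboring similar models. A generic sketch, illustrative only: the jitter scale `jitter_sd` is an assumption, and clipping probability parameters back into [0, 1] is omitted for brevity:

```python
import random

def weight_and_resample(particles, weights, jitter_sd=0.05):
    """Resample particles by importance weight, then jitter model parameters.

    particles: list of dicts of real-valued model parameters (e.g. pT1, pO1, ...)
    weights:   unnormalized importance weights, one per particle
    """
    total = sum(weights)                 # line 16: normalize the weights
    probs = [w / total for w in weights]
    # Line 17: multinomial resampling according to the normalized weights.
    resampled = random.choices(particles, weights=probs, k=len(particles))
    # Line 18: resample each dimension from a Gaussian centered on the
    # current value, i.e. draw neighboring similar models to avoid
    # the particle set collapsing onto a few incorrect models.
    return [
        {name: random.gauss(value, jitter_sd) for name, value in p.items()}
        for p in resampled
    ]
```

The jitter plays the same role as the kernel-smoothing step in regularized particle filters: without it, repeated resampling of a finite model set can never recover a model that was never sampled.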
Table 1: Parameters for transition functions

S  | A  | TL    | TR
TL | L  | pT1   | 1-pT1
TR | L  | 1-pT1 | pT1
*  | OL | pT2   | 1-pT2
*  | OR | 1-pT2 | pT2

Table 2: Parameters for observation functions

S  | A  | GL    | GR
TL | L  | pO1   | 1-pO1
TR | L  | 1-pO1 | pO1
*  | OL | pO2   | 1-pO2
*  | OR | 1-pO2 | pO2

Table 3: Parameters for reward functions

S  | A  | R
*  | L  | pR1
TL | OL | pR2
TR | OR | pR2
TL | OR | pR3
TR | OL | pR3

For the sake of brevity, we restrict the experiments to a two-agent setting and a nesting level of one, but the sampling algorithm extends to any number of agents and nesting levels in a straightforward manner. Recall that an interactive POMDP of agent i is defined as a six-tuple I-POMDP_i = ⟨IS_{i,l}, A, Ω_i, T_i, O_i, R_i⟩. Thus, for the specific setting of the multi-agent tiger problem:

• IS_{i,1} = S × θ_{j,0}, where S = {tiger on the left (TL), tiger on the right (TR)} and θ_{j,0} = ⟨b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j⟩.
• A = A_i × A_j is a combination of both agents' possible actions: listen (L), open the left door (OL), and open the right door (OR).
• Ω_i is all combinations of each agent's possible observations: growl from the left (GL) or right (GR), combined with creak from the left (CL), right (CR), or silence (S).
• T_i = T_j : S × A_i × A_j × S → [0, 1] is a joint state transition probability that involves both actions.
• O_i : S × A_i × A_j × Ω_i → [0, 1] becomes a joint observation probability that involves both actions. O_j is symmetric to O_i with respect to the joint actions.
• R_i : IS × A_i × A_j → ℝ: agent i gets the corresponding rewards when he listens, opens the wrong door, and opens the correct door, respectively. They are independent of j's actions.

Parameter Space

For the multi-agent tiger game experiments, we want to learn over all possible intentional models of the other agent j: θ_j = ⟨b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j⟩. We only make the reasonable assumptions that A_j and Ω_j are known and that OC_j is the infinite-horizon criterion with discounting. What we actually want to learn is then:

• b0_j: the initial belief of agent j about the physical state.
• T_j: the transition function of agent j, which can be parametrized by the two parameters pT1 and pT2, as shown in Table 1.
• O_j: the observation function of agent j, which can be parametrized by the two parameters pO1 and pO2, as shown in Table 2.
• R_j: the reward function of agent j, which can be parametrized by the three parameters pR1, pR2 and pR3, as shown in Table 3.

It is easy to see that this is an enormous 8-dimensional parameter space to learn from: b0_j × pT1 × pT2 × pO1 × pO2 × pR1 × pR2 × pR3, where b_j, pT1, pT2, pO1, pO2 ∈ [0, 1] ⊂ ℝ and pR1, pR2, pR3 ∈ (-∞, +∞).

We reduce this huge space mainly by two means: utilizing Monte Carlo sampling methods, and giving the parameters problem-specific priors that are not overly informative but provide enough information for the algorithm to learn from.

Results

For the actual experiments, we fix the number of samples at 2000 and run the algorithm on a two-agent tiger game simulation as described above. We run experiments for learning three different models of agent j:

1. θ_j1 = ⟨0.5, 0.67, 0.5, 0.85, 0.5, -1, -100, 10⟩
2. θ_j2 = ⟨0.5, 1.00, 0.5, 0.95, 0.5, -1, -10, 10⟩
3. θ_j3 = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩

These models are all very special cases, carefully chosen in order to verify the correctness and evaluate the performance of our algorithm. For instance, the first model is a sophisticated one in which the other agent is actually modeling his opponent using a subintentional model, while the second is a classic single-agent POMDP and the third is a very simple one that nevertheless entails a large model space. We want to investigate whether our framework is able to correctly and efficiently learn these models through these experiments.

Figure 2: Optimal policy of a no-information model.
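Under the parametrization of Tables 1–3, a sampled 8-dimensional vector fully determines agent j's model. A sketch of unpacking one sample into concrete transition, observation and reward tables (a hypothetical helper of our own, mirroring the tables above):

```python
def build_tiger_model(pT1, pT2, pO1, pO2, pR1, pR2, pR3):
    """Instantiate agent j's T, O, R for the tiger game from Tables 1-3."""
    # Table 1: P(next state | state, action); '*' rows apply to both states.
    T = {
        ("TL", "L"): {"TL": pT1, "TR": 1 - pT1},
        ("TR", "L"): {"TL": 1 - pT1, "TR": pT1},
    }
    for s in ("TL", "TR"):
        T[(s, "OL")] = {"TL": pT2, "TR": 1 - pT2}
        T[(s, "OR")] = {"TL": 1 - pT2, "TR": pT2}
    # Table 2: P(growl | state, action).
    O = {
        ("TL", "L"): {"GL": pO1, "GR": 1 - pO1},
        ("TR", "L"): {"GL": 1 - pO1, "GR": pO1},
    }
    for s in ("TL", "TR"):
        O[(s, "OL")] = {"GL": pO2, "GR": 1 - pO2}
        O[(s, "OR")] = {"GL": 1 - pO2, "GR": pO2}
    # Table 3: pR1 = listening, pR2 = opening the tiger door,
    # pR3 = opening the gold door.
    R = {(s, "L"): pR1 for s in ("TL", "TR")}
    R[("TL", "OL")] = R[("TR", "OR")] = pR2
    R[("TL", "OR")] = R[("TR", "OL")] = pR3
    return T, O, R
```

For example, `build_tiger_model(0.67, 0.5, 0.85, 0.5, -1, -100, 10)` reproduces the first model θ_j1 above.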
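Initializing the particle set then amounts to drawing N = 2000 samples from the assigned priors. The hyperparameters below follow those reported for the first experiment (uniform and Beta priors for the probability parameters, Gaussians for the rewards); the function name is our own illustration:

```python
import random

def draw_model_particle():
    """Draw one 8-dimensional model sample from the priors used for theta_j1."""
    return {
        "b0":  random.uniform(0, 1),          # U(0,1) initial belief
        "pT1": random.betavariate(5, 3),      # Beta(5,3), mode 0.67
        "pT2": random.betavariate(5, 5),      # Beta(5,5)
        "pO1": random.betavariate(3.5, 1.4),  # Beta(3.5,1.4), mode ~0.85
        "pO2": random.betavariate(5, 5),      # Beta(5,5)
        "pR1": random.gauss(-1, 2),           # N(-1,2)
        "pR2": random.gauss(-100, 4),         # N(-100,4)
        "pR3": random.gauss(10, 2),           # N(10,2)
    }

# The experiments fix the particle count at 2000.
particles = [draw_model_particle() for _ in range(2000)]
```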
The aim of the first experiment is to learn a relatively complicated model of agent j, with θ_j = ⟨0.5, 0.67, 0.5, 0.85, 0.5, -1, -100, 10⟩, who assumes that others' actions are drawn from a uniform distribution. Equivalently, agent j's actual policy, shown in figure 2, is to look for three consecutive growls from the same direction and then open the corresponding door. For this particular experiment we simulated the observation history for agent i, for the sake of first verifying the correctness of our algorithm while excluding the impact of uncertainties from hearing accuracy. The simulated observation history is as follows: {GL,S GL,S GL,S GL,CR GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GL,CL GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S}

The priors we assign to each parameter are shown in figure 3; specifically, they are uniform U(0,1) for b0_j, Beta(5,3) with mode 0.67 for pT1, Beta(5,5) for pT2, Beta(3.5,1.4) with mode 0.85 for pO1, Beta(5,5) for pO2, Gaussian N(-1,2) for pR1, N(-100,4) for pR2, and N(10,2) for pR3.

After 50 time steps, the algorithm converges to a posterior distribution over agent j's intentional models; the results are also given in figure 3. Since the parameter space of agent j's models is 8-dimensional, we only show the marginal distributions over each parameter as histograms. We can easily see that the majority of samples are centered around the true parameter values.

Figure 3: Assigned priors and learned posterior distributions over model parameters for model θ_j1 = ⟨0.5, 0.67, 0.5, 0.85, 0.5, -1, -100, 10⟩.

We use principal component analysis (PCA) (Abdi and Williams 2010) to reduce the sample dimensionality to two and plot the samples in a 3-dimensional histogram, as shown in figure 4. The distribution starts from a Gaussian-like prior and gradually converges to the most likely models. Eventually, the mean value of this cluster, ⟨0.49, 0.69, 0.49, 0.82, 0.51, -0.95, -99.23, 10.09⟩, is very close to the true model. Here we give two examples from the big cluster after 50 time steps: ⟨0.56, 0.66, 0.49, 0.84, 0.59, -0.95, -101.37, 11.42⟩ and ⟨0.51, 0.68, 0.52, 0.89, 0.56, -1.33, -98.39, 12.55⟩. The former has a corresponding optimal policy of [0—OL—0.10—L—1], while the latter has [0—OL—0.09—L—0.91—OR—1]; both are extremely close to the optimal policy of the true model: [0—OL—0.1—L—0.9—OR—1]. Consequently, the framework is able to predict other agents' actions with high accuracy.

Figure 4: 3D histogram of all model samples.

We tested the performance of our algorithm in terms of prediction accuracy for others' actions. We compared the results with other modeling approaches: a frequency-based approach, in which agent j is assumed to choose his actions according to a fixed but unknown distribution, and a no-information model, which treats j's actions purely as uniform noise. The results shown in figure 5 are plots averaged over 10 random runs, each of which has 50 time steps. They show clearly that the intentional I-POMDP approach has significantly lower error rates as agent i perceives more observations. The subintentional model assumes that j's actions are drawn from a uniform distribution and therefore has a fixed, high error rate. The frequency-based approach has a certain learning ability but is far from sophisticated enough for modeling a fully rational agent.

Figure 5: Prediction error rate vs. observation length.

In the second experiment, we run our algorithm on actual observations for 30 time steps until it converges, and try to learn the model of a simpler classic POMDP with a high listening accuracy of 0.95 and a small penalty of -10; e.g., agent j alternately opens doors and listens, as shown in figure 6(a). The actual model of j is θ_j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩; the priors assigned to b0_j, pT1, pT2, pO1, pO2, pR1, pR2, pR3 are U(0,1), Beta(2,0.5), Beta(10,10), Beta(19,1), Beta(10,10), N(-1,1), N(-10,2), N(10,2); and the actual observation history is {GR,S GL,CR GL,S GL,CL GL,S GL,CR GL,S GL,CL GL,S GR,S GR,CL GR,CL GL,S GR,S GR,S GL,CL GR,S GL,CR GR,S GR,CR GR,CR GR,CL GL,S GL,S GL,S GL,CR GL,S GL,CL GR,S GR,S}.

Figure 6: (a) Optimal policy for θ_j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩. (b) Optimal policy for θ_j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩.

Similarly, we report the learned posterior distributions over model parameters in figure 7. We observe an interesting pattern: while some parameters, such as b_{j,0}, pT2 and pO2, are concentrated around the actual values, others like pT1 and pO1 become more dispersed than their initial priors. The intuition behind this is that the penalty and reward are -10 and 10, so a single listen at reward -1 is enough for making the decision to open a door. That is to say, as long as the tiger likely remains behind the same door when the agent listens (the meaning of pT1) and the agent has a reliably high hearing accuracy (the meaning of pO1), there are many models which satisfy this particular observation sequence; hence our algorithm learns them all.

Figure 7: Learned posterior distributions for model θ_j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩.

For conciseness, we show the average prediction error rates for both the second and third experiments in figure 9. Both results are averaged over 10 random runs, each of which has 30 time steps. In the second experiment, in figure 9(a), the intentional I-POMDP approach still has significantly lower error rates than the others.

In the last experiment, we want to learn a model θ_j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩, who always listens, since the listening reward is now equal to the reward for opening the gold door, as shown in figure 6(b). For brevity, we only show the marginal distributions over model parameters in figure 8. The priors assigned to b0_j, pT1, pT2, pO1, pO2, pR1, pR2, pR3 are U(0,1), Beta(5,3), Beta(10,10), Beta(3.5,1.4), Beta(10,10), N(10,1), N(-100,2), N(10,2), and the actual observation history that i learns from is {GL,S GL,S GR,S GL,S GL,CL GR,S GR,S GL,CL GR,S GL,S GL,S GR,S GL,S GL,S GL,S GL,CL GR,S GL,S GL,S GL,S}. We can see that all three reward parameters are correctly learned, while the samples of pT1, pT2, pO1 and pO2 are not very concentrated around their true values but remain close to their corresponding priors; intuitively, these parameters become less important and can lie in a relatively loose region due to the increased pR1 = 10. Lastly, the performance comparison is given in figure 9(b).
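The frequency-based baseline used in these comparisons assumes that j draws actions from a fixed but unknown distribution; a minimal count-based estimator of that distribution (our own illustration, not the authors' code) is:

```python
from collections import Counter

def frequency_predictor(observed_actions):
    """Estimate j's fixed action distribution from counts and predict the mode."""
    counts = Counter(observed_actions)
    total = sum(counts.values())
    dist = {a: c / total for a, c in counts.items()}
    prediction = max(dist, key=dist.get)  # predict the most frequent action
    return dist, prediction
```

Such an estimator can track a stationary action distribution, but, unlike the intentional I-POMDP approach, it cannot anticipate how a rational agent's action choice changes with its evolving belief.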
Figure 8: Learned posterior distributions for model θ_j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩.

Figure 9: (a) Prediction error rate vs. observation length for θ_j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩. (b) The same for θ_j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩.

Conclusions and Future Work

We have described a new approach for learning other agents' models by approximating the interactive belief update using Bayesian inference and Monte Carlo sampling methods. Our framework correctly learns others' models over the entire intentional model space and is therefore a generalized reinforcement learning algorithm for multi-agent settings. It also effectively mitigates the belief space complexity and has significantly better performance than other approaches in terms of predicting others' actions.

In the future, in order to fully evaluate practicability on larger problem spaces, more multi-agent problems of various sizes could be tested. Due to computational complexity, experiments on higher nesting levels are currently limited. Thus, more effort could be made on utilizing nonparametric Bayesian methods, which inherently deal with nested belief structures.

References

Abdi, H., and Williams, L. J. 2010. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2(4): 433-459.

Del Moral, P. 1996. Non-linear filtering: interacting particle resolution. Markov Processes and Related Fields 2(4): 555-581.

Doshi, P., and Gmytrasiewicz, P. J. 2009. Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research 34: 297-337.

Doshi, P.; Zeng, Y.; and Chen, Q. 2009. Graphical models for interactive POMDPs: representations and solutions. Autonomous Agents and Multi-Agent Systems 18(3): 376-416.

Doshi-Velez, F., and Konidaris, G. 2013. Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. arXiv:1308.3513.

Doucet, A.; De Freitas, N.; and Gordon, N. 2001. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, 3-14. Springer New York.

Gilks, W. R.; Richardson, S.; and Spiegelhalter, D. J. 1996. Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice, 1-19.

Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research 24(1): 49-79.

Gordon, N. J.; Salmond, D. J.; and Smith, A. F. 1993. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing) 140(2): 107-113.

Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1): 99-134.

Liu, M.; Liao, X.; and Carin, L. 2011. The infinite regionalized policy representation. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 769-776.

Panella, A., and Gmytrasiewicz, P. 2016. Bayesian learning of other agents' finite controllers for interactive POMDPs. In Thirtieth AAAI Conference on Artificial Intelligence.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.

Ross, S.; Chaib-draa, B.; and Pineau, J. 2007. Bayes-adaptive POMDPs. In Advances in Neural Information Processing Systems, 1225-1232.

Spaan, M. T. J., and Vlassis, N. 2005. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research 24: 195-220.