Yanlin Han and Piotr Gmytrasiewicz. MAICS 2017, pp. 69–76

Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs

Yanlin Han, Piotr Gmytrasiewicz
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607

Abstract

Interactive partially observable Markov decision processes (I-POMDPs) provide a principled framework for planning and acting in a partially observable, stochastic, multi-agent environment. They extend POMDPs to multi-agent settings by including models of other agents in the state space and forming a hierarchical belief structure. In order to predict other agents' actions using I-POMDPs, we propose an approach that effectively uses Bayesian inference and sequential Monte Carlo (SMC) sampling to learn others' intentional models, which ascribe to them beliefs, preferences, and rationality in action selection. For problems of various complexities, empirical results show that our algorithm accurately learns models of other agents and has superior performance in comparison with other methods. Our approach serves as a generalized reinforcement learning algorithm that learns over other agents' transition, observation, and reward functions. It also effectively mitigates the belief space complexity due to the nested belief hierarchy.

Introduction

Partially observable Markov decision processes (POMDPs) (Kaelbling, Littman, and Cassandra 1998) provide a principled, decision-theoretic framework for planning under uncertainty in a partially observable, stochastic environment. An autonomous agent operates rationally in such settings by maintaining a belief over the physical state at any given time; in doing so it sequentially chooses the optimal actions that maximize its future rewards. Solutions of POMDPs are therefore mappings from an agent's beliefs to actions. Although POMDPs can be used in multi-agent settings, doing so requires the strong assumption that the effects of other agents' actions are implicitly treated as noise and folded into the state transitions, as in recent Bayes-adaptive POMDPs (Ross, Chaib-draa, and Pineau 2007), the infinite generalized policy representation (Liu, Liao, and Carin 2011), and infinite POMDPs (Doshi-Velez and Konidaris 2013). Consequently, an agent's beliefs about other agents are not part of the solutions of POMDPs.

Interactive POMDPs (I-POMDPs) (Gmytrasiewicz and Doshi 2005) generalize POMDPs to multi-agent settings by replacing POMDP belief spaces with interactive hierarchical belief systems. Specifically, an I-POMDP augments the plain beliefs about physical states in a POMDP by including models of other agents, which forms a hierarchical belief structure representing an agent's belief about the physical state, its belief about the other agents, and their beliefs about others' beliefs. The models of other agents included in the augmented state space are of two types: intentional models and subintentional models. The sophisticated intentional model ascribes beliefs, preferences, and rationality to other agents (Gmytrasiewicz and Doshi 2005), while the simpler subintentional model, such as a finite state controller (Panella and Gmytrasiewicz 2016), does not. Solutions of I-POMDPs map an agent's belief about the environment and the other agents' models to actions; the framework is therefore applicable to important agent, human, and mixed agent-human applications. It has been shown (Gmytrasiewicz and Doshi 2005) that the added sophistication of modeling others as rational agents results in a value function that dominates the one obtained by simply treating others as noise, which implies the modeling superiority of I-POMDPs over other approaches to multi-agent systems.

However, the interactive belief structure of I-POMDPs results in a drastic increase in belief space complexity, adding to the curse of dimensionality: the complexity of the belief representation grows with the belief dimensions, which increase exponentially with the nesting level of the agent models. Since exact solutions of POMDPs are PSPACE-complete for finite horizons and undecidable for infinite horizons (Papadimitriou and Tsitsiklis 1987), the time complexity of the more general I-POMDPs, which may contain multiple POMDPs and I-POMDPs of other agents, is at least PSPACE-complete for finite horizons and undecidable for infinite horizons. Due to this severe complexity, no complete belief update over the entire interactive belief space using sophisticated intentional models has yet been accomplished. There are only partial updates restricted to other agents' beliefs about the physical states (Doshi and Gmytrasiewicz 2009) and indirect approaches such as subintentional finite state controllers (Panella and Gmytrasiewicz 2016). Therefore, in order to unleash the full modeling power of intentional models and apply I-POMDPs to more realistic settings, a good approximation algorithm for computing the nested interactive belief and predicting other agents' actions is crucial to the trade-off between solution quality and computational complexity.

To address this issue, we propose a Bayesian approach that utilizes customized sequential Monte Carlo sampling algorithms (Doucet, De Freitas, and Gordon 2001) to obtain approximate solutions to I-POMDPs, and we implement the algorithms in a software package (https://github.com/solohan22/IPOMDP.git). Specifically, we assume that the models of other agents are unknown and are learned from imperfect observations of the other agents' behaviors. We parametrize other agents' intentional models and maintain a belief over them, making sequential Bayesian updates using only observations from the environment. Since this Bayesian inference task is analytically intractable, we devise a customized sequential Monte Carlo method to approximate the posterior distribution: it descends the belief hierarchy and samples all model parameters at each nesting level, starting from the interactive particle filter (I-PF) (Doshi and Gmytrasiewicz 2009) for the I-POMDP belief update.

Our approach, for the first time, successfully learns others' models over the entire intentional model space, which contains their initial beliefs and their transition, observation, and reward functions, making it a generalized reinforcement learning method for multi-agent settings. Our algorithm accurately predicts others' actions in various problem settings, and thereby enables the modeling agent to take the corresponding optimal actions that maximize its own rewards. By approximating Bayesian inference with a customized sequential Monte Carlo sampling method, we significantly mitigate the belief space complexity of I-POMDPs.

Background

POMDP

A partially observable Markov decision process (POMDP) (Kaelbling, Littman, and Cassandra 1998) is a general reinforcement learning model for planning and acting in a single-agent, partially observable, stochastic domain. It is defined for a single agent i as:

    POMDP_i = <S, A_i, Ω_i, T_i, O_i, R_i>    (1)

where the meaning of each element of the 6-tuple is:
• S is the set of states of the environment.
• A_i is the set of agent i's possible actions.
• Ω_i is the set of agent i's possible observations.
• T_i : S × A_i × S → [0, 1] is the state transition function.
• O_i : S × A_i × Ω_i → [0, 1] is the observation function.
• R_i : S × A_i → R is the reward function.

Given the definition above, an agent's belief about the state can be represented as a probability distribution over S. The belief update can be done using the following formula, where α is a normalizing constant:

    b'(s') = α O(s', a, o) Σ_{s∈S} T(s, a, s') b(s)    (2)

Given the agent's belief, the optimal action, a*, is simply an element of the set of optimal actions, OPT(b_i), for the belief state, defined as:

    OPT(b_i) = argmax_{a_i∈A_i} { Σ_{s∈S} b_i(s) R(s, a_i) + Σ_{o_i∈Ω_i} P(o_i | a_i, b_i) × U(SE(b_i, a_i, o_i)) }    (3)

where SE(·) denotes the belief update (state estimation) and U(·) the utility of the resulting belief state.

Particle Filter

The Markov chain Monte Carlo (MCMC) method (Gilks, Richardson, and Spiegelhalter 1996) is widely used to approximate probability distributions when they cannot be computed directly. It generates samples from a posterior distribution π(x) over state space x by simulating a Markov chain p(x'|x) whose state space is x and whose stationary distribution is π(x). The samples drawn from p converge to the target distribution π as the number of samples goes to infinity.

In order to make MCMC work on sequential inference tasks, especially sequential decision making under Markov assumptions, sequential versions of Monte Carlo methods have been proposed, some of which are capable of dealing with high-dimensional and/or complex problems, such as particle filters (Del Moral 1996). At each time step, a particle filter draws samples (particles) from a proposal distribution, commonly p(x_t | x_{t−1}), which is essentially the conditional distribution of the current state x_t given the previous state x_{t−1}; it then uses the observation function p(y_t | x_t) to compute an importance weight for each particle and resamples all particles according to the weights.

The Model

I-POMDP framework

An interactive POMDP of agent i, I-POMDP_i, is defined as:

    I-POMDP_i = <IS_{i,l}, A, Ω_i, T_i, O_i, R_i>    (4)

where IS_{i,l} is the set of interactive states of the environment, defined as IS_{i,l} = S × M_{j,l−1}, l ≥ 1, where S is the set of states, M_{j,l−1} is the set of possible models of agent j, and l is the strategy level. A specific class of models are the (l−1)th-level intentional models, Θ_{j,l−1}, of agent j: θ_{j,l−1} = <b_{j,l−1}, A, Ω_j, T_j, O_j, R_j, OC_j>, where b_{j,l−1} is agent j's belief nested to level (l−1), b_{j,l−1} ∈ Δ(IS_{j,l−1}), and OC_j is j's optimality criterion. The intentional model θ_{j,l−1}, sometimes referred to as a type, can be rewritten as θ_{j,l−1} = <b_{j,l−1}, θ̂_j>, where θ̂_j includes all elements of the intentional model other than the belief and is called agent j's frame.

IS_{i,l} can be defined inductively (note that when θ̂_j is known, θ_j reduces to b_j):

    IS_{i,0} = S,            θ_{j,0} = {<b_{j,0}, θ̂_j> : b_{j,0} ∈ Δ(S)}
    IS_{i,1} = S × θ_{j,0},  θ_{j,1} = {<b_{j,1}, θ̂_j> : b_{j,1} ∈ Δ(IS_{j,1})}
    ...                                                                        (5)
    IS_{i,l} = S × θ_{j,l−1},  θ_{j,l} = {<b_{j,l}, θ̂_j> : b_{j,l} ∈ Δ(IS_{j,l})}

All other components of an I-POMDP are similar to those of a POMDP:
• A = A_i × A_j is the set of joint actions of all agents.
• Ω_i is the set of agent i's possible observations.
• T_i : S × A_i × S → [0, 1] is the state transition function.
• O_i : S × A_i × Ω_i → [0, 1] is the observation function.
• R_i : IS × A_i → R is the reward function.

Interactive belief update

Given the definitions above, the interactive belief update can be performed as follows:

    b_i^t(is^t) = Pr(is^t | b_i^{t−1}, a_i^{t−1}, o_i^t)    (6)
                = α Σ_{is^{t−1}} b_i^{t−1}(is^{t−1}) Σ_{a_j^{t−1}} Pr(a_j^{t−1} | θ_j^{t−1}) T(s^{t−1}, a^{t−1}, s^t)
                  × O_i(s^t, a^{t−1}, o_i^t) Σ_{o_j^t} O_j(s^t, a^{t−1}, o_j^t) τ(b_j^{t−1}, a_j^{t−1}, o_j^t, b_j^t)

Unlike the plain belief update in a POMDP, the interactive belief update in an I-POMDP takes two additional sophistications into account. First, the probabilities of the other agent's actions given its models (the second summation) need to be computed, since the physical state now depends on both agents' actions. Second, the agent needs to update its belief based on its anticipation of what observations the other agent might receive and how the other agent would update its own belief (the third summation).

Then the optimal action, a*, for the case of the infinite-horizon criterion with discounting, is an element of the set of optimal actions, OPT(θ_i), for the belief state, defined as:

    OPT(θ_i) = argmax_{a_i∈A_i} { Σ_{is∈IS} b_i(is) ER_i(is, a_i) + Σ_{o_i∈Ω_i} P(o_i | a_i, b_i) × U(<SE_{θ_i}(b_i, a_i, o_i), θ̂_i>) }    (7)

Sampling Algorithms

The interactive particle filter (I-PF) (Doshi and Gmytrasiewicz 2009) was proposed as a filtering algorithm for the interactive belief update in I-POMDPs. It generalizes the classic particle filter to multi-agent settings and uses the state transition function as the proposal distribution, as in the bootstrap filter (Gordon, Salmond, and Smith 1993). However, due to the enormous belief space, the I-PF assumes that the other agent's frame θ̂_j is known to the modeling agent, which simplifies the belief update from S × Θ_{j,l−1} to the significantly smaller space S × b_{j,l−1}.

The intuition behind our algorithm is to assign appropriate prior distributions over all of agent j's possible models θ_j = <b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j> and to sample from each of their dimensions. At each time step, we update all samples using the perceived observations, namely computing and assigning a weight to each sample, and resample the samples according to the weights. Finally, since this is a randomized Monte Carlo method, to prevent the learning algorithm from converging to incorrect models we add another resampling step that samples from neighboring similar models given the current samples. Consequently, our algorithm maintains a probability distribution over the most probable models of the other agents and eventually learns their optimal actions.

Algorithm 1: Interactive Belief Update

b̃_{k,l}^t = InteractiveBeliefUpdate(b̃_{k,l}^{t−1}, a_k^{t−1}, o_k^t, l > 0)
 1  for is_k^{(n),t−1} = <s^{(n),t−1}, θ_{−k}^{(n),t−1}> ∈ b̃_{k,l}^{t−1}:
 2      sample a_{−k}^{t−1} ∼ P(A_{−k} | θ_{−k}^{(n),t−1})
 3      sample s^{(n),t} ∼ T_k(S^t | s^{(n),t−1}, a_k^{t−1}, a_{−k}^{t−1})
 4      for o_{−k}^t ∈ Ω_{−k}:
 5          if l = 1:
 6              b_{−k,0}^{(n),t} = Level0BeliefUpdate(b_{−k,0}^{(n),t−1}, a_{−k}^{t−1}, o_{−k}^t, θ_{−k}^{(n),t−1})
 7              θ_{−k}^{(n),t} = <b_{−k,0}^{(n),t}, θ̂_{−k}^{(n),t−1}>
 8              is_k^{(n),t} = <s^{(n),t}, θ_{−k}^{(n),t}>
 9          else:
10              b_{−k,l−1}^{(n),t} = InteractiveBeliefUpdate(b̃_{−k,l−1}^{t−1}, a_{−k}^{t−1}, o_{−k}^t, l−1)
11              θ_{−k}^{(n),t} = <b_{−k,l−1}^{(n),t}, θ̂_{−k}^{(n),t−1}>
12              is_k^{(n),t} = <s^{(n),t}, θ_{−k}^{(n),t}>
13      w_t^{(n)} = O_{−k}(o_{−k}^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1})
14      w_t^{(n)} = w_t^{(n)} × O_k(o_k^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1})
15      b̃_{k,l}^{temp} += <is_k^{(n),t}, w_t^{(n)}>
16  normalize all w_t^{(n)} so that Σ_{n=1}^N w_t^{(n)} = 1
17  resample from b̃_{k,l}^{temp} according to the normalized w_t^{(n)}
18  resample θ_{−k}^{(n),t} according to neighboring similar models
19  return b̃_{k,l}^t = {is_k^{(n),t}}

The interactive belief update described in Algorithm 1 is similar to the I-PF in terms of recursive Monte Carlo sampling and the nesting hierarchy, but it has three major differences. First, the belief update is over the entire intentional model space of the other agents; the initial set of N samples is therefore b̃_{k,l}^{t−1} = <b_{−k,l−1}^{(n),t−1}, A_{−k}, Ω_{−k}, T_{−k}^{(n)}, O_{−k}^{(n)}, R_{−k}^{(n)}, OC_{−k}>, where k denotes the modeling agent and −k denotes all other modeled agents. We only assume that the actions A_{−k}, the observations Ω_{−k}, and the optimality criteria OC_{−k} are known, since in a multi-agent game the rules are usually known to all agents or can be obtained through intelligence. Second, the observation function O_{−k}^{(n)}(o_{−k}^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1}) in line 13 is now randomized as well, since each sample carries a particular observation function of that agent. Lastly, we add another resampling step in line 18 in order to avoid divergence, resampling each dimension of the model samples from a Gaussian distribution centered at the current sample value. Intuitively, similar models are resampled from a relatively tight neighboring region around the current model samples to maintain learning accuracy.

Algorithm 1 can be viewed as two major steps. The importance sampling step (lines 1 to 16) samples from the belief priors b̃_{k,l}^{t−1}, propagates the samples forward using the corresponding proposal distributions, and computes the weight of each sample. The selection, or resampling, step (lines 17 to 18) resamples according to the weights and the similar models. Specifically, the algorithm starts from a set of initial priors is_k^{(n),t−1}; for each of them, it samples the other agents' optimal action a_{−k}^{t−1} from the policy P(A_{−k} | θ_{−k}^{(n),t−1}), which is solved using the efficient POMDP solver Perseus (Spaan and Vlassis 2005; http://www.st.ewi.tudelft.nl/~mtjspaan/pomdp/index_en.html). It then samples the physical state s^t using the state transition T_k(S^t | S^{(n),t−1}, a_k^{t−1}, a_{−k}^{t−1}). Once a_{−k}^{t−1} and s^t are sampled, the algorithm calls the level-0 belief update (lines 5 to 8), described in Algorithm 2, to update the other agents' plain beliefs b_{−k,0}^t if the current nesting level l is 1, or recursively calls itself at the lower level l−1 (lines 9 to 12) if the current nesting level is greater than 1. The sample weights w_t^{(n)} are computed according to the observation likelihoods of the modeling and modeled agents (lines 13, 14), and are then normalized so that they sum to 1 (line 16). Lastly, the algorithm resamples the intermediate samples according to the computed weights (line 17) and resamples once more from similar neighboring models (line 18).

Algorithm 2: Level-0 Belief Update

b_{k,0}^t = Level0BeliefUpdate(b_{k,0}^{t−1}, a_k^{t−1}, o_k^t, T_k^{(n)}, O_k^{(n)})
 1  P(a_{−k}^{t−1}) = 1/|A_{−k}|
 2  for s^t ∈ S:
 3      for s^{t−1} ∈ S:
 4          for a_{−k}^{t−1} ∈ A_{−k}:
 5              P^{(n)}(s^t | s^{t−1}, a_k^{t−1}) += T_k^{(n)}(s^t | s^{t−1}, a_k^{t−1}, a_{−k}^{t−1}) × P(a_{−k}^{t−1})
 6          sum += P^{(n)}(s^t | s^{t−1}, a_k^{t−1}) b_{k,0}^{t−1}(s^{t−1})
 7      for a_{−k}^{t−1} ∈ A_{−k}:
 8          P^{(n)}(o_k^t | s^t, a_k^{t−1}) += O_k^{(n)}(o_k^t | s^t, a_k^{t−1}, a_{−k}^{t−1}) P(a_{−k}^{t−1})
 9      b_{k,0}^t(s^t) = sum × P^{(n)}(o_k^t | s^t, a_k^{t−1})
10  normalize and return b_{k,0}^t

The level-0 belief update, described in Algorithm 2, is similar to the POMDP belief update but treats the other agents' actions as noise and takes the randomized state transition and observation functions as input parameters. It assumes that the other agents choose their actions according to a uniform distribution (line 1), and is therefore essentially a no-information model. For each possible action a_{−k}^{t−1}, it computes the effective state transition (line 5) and the effective observation function (line 8) by marginalizing over the others' actions, and returns the normalized belief b_{k,0}^t. Notice that the transition function T_k^{(n)}(s^t | s^{t−1}, a_k^{t−1}, a_{−k}^{t−1}) and the observation function O_k^{(n)}(o_k^t | s^t, a_k^{t−1}, a_{−k}^{t−1}) are now both samples taken from the input arguments, depending on the model parameters of the actual level-0 agent.

Figure 1: An illustration of the interactive belief update for two agents and 1-level nesting.

In Figure 1, we illustrate the interactive belief update using the problem discussed in the following section. Suppose there are two agents i and j in the environment, the sample size is 8, and the nesting level is 2; the subscripts in Figure 1 denote the corresponding agents, and each dot represents a particular belief sample. The propagate step corresponds to lines 2 to 12 in Algorithm 1, the weight step to lines 13 to 16, and the resample step to lines 17 and 18. The belief update for a particular level-0 model sample (θ_j = <b_j(s) = 0.5, p_T1 = 0.67, p_T2 = 0.5, p_O1 = 0.85, p_O2 = 0.5, p_R1 = −1, p_R2 = −100, p_R3 = 10>) is solved using Algorithm 2, and the optimal action is computed by calling the Perseus POMDP solver.

Experiments

Setup

We present results on the multi-agent tiger game (Gmytrasiewicz and Doshi 2005) with various settings. The multi-agent tiger game generalizes the classical single-agent tiger game (Kaelbling, Littman, and Cassandra 1998) by adding observations caused by the other agents' actions. The generalized multi-agent game contains additional observations regarding the other players, and its state transition and reward functions involve the others' actions as well.

Consider a specific game instance with known parameters: a tiger and a pile of gold are hidden behind two doors, respectively. Two players can both listen for a growl of the tiger and for a creak caused by the other player, or open a door, which resets the tiger's location with equal probability. The accuracies of their observations of the tiger and of the other player are both relatively high (0.85 and 0.9, respectively). No matter which player triggers it, the reward for listening is −1, for opening the tiger door is −100, and for opening the gold door is 10.

Table 1: Parameters for transition functions

    S   A   TL        TR
    TL  L   p_T1      1−p_T1
    TR  L   1−p_T1    p_T1
    *   OL  p_T2      1−p_T2
    *   OR  1−p_T2    p_T2

Table 2: Parameters for observation functions

    S   A   GL        GR
    TL  L   p_O1      1−p_O1
    TR  L   1−p_O1    p_O1
    *   OL  p_O2      1−p_O2
    *   OR  1−p_O2    p_O2

Table 3: Parameters for reward functions

    S   A   R
    *   L   p_R1
    TL  OL  p_R2
    TR  OR  p_R2
    TL  OR  p_R3
    TR  OL  p_R3

For the sake of brevity, we restrict the experiments to a two-agent setting and a nesting level of one, but the sampling algorithm extends to any number of agents and nesting levels in a straightforward manner. Recall that an interactive POMDP of agent i is defined as the six-tuple I-POMDP_i = <IS_{i,l}, A, Ω_i, T_i, O_i, R_i>. For the specific setting of the multi-agent tiger problem:
• IS_{i,1} = S × θ_{j,0}, where S = {tiger on the left (TL), tiger on the right (TR)} and θ_{j,0} = <b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j>.
• A = A_i × A_j is a combination of both agents' possible actions: listen (L), open the left door (OL), and open the right door (OR).
• Ω_i is all combinations of each agent's possible observations: growl from the left (GL) or right (GR), combined with creak from the left (CL), creak from the right (CR), or silence (S).
• T_i = T_j : S × A_i × A_j × S → [0, 1] is a joint state transition probability that involves both agents' actions.
• O_i : S × A_i × A_j × Ω_i → [0, 1] becomes a joint observation probability that involves both actions. O_j is symmetric to O_i with respect to the joint actions.
• R_i : IS × A_i × A_j → R: agent i receives the corresponding rewards when he listens, opens the wrong door, or opens the correct door; these are independent of j's actions.

Parameter Space

For the multi-agent tiger game experiments, we want to learn over all possible intentional models of the other agent j: θ_j = <b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j>. We make only the reasonable assumptions that A_j and Ω_j are known and that OC_j is the infinite-horizon criterion with discounting. What we actually want to learn is:
• b_j^0: the initial belief of agent j about the physical state.
• T_j: the transition function of agent j, which can be parametrized by two parameters, p_T1 and p_T2, as shown in Table 1.
• O_j: the observation function of agent j, which can be parametrized by two parameters, p_O1 and p_O2, as shown in Table 2.
• R_j: the reward function of agent j, which can be parametrized by three parameters, p_R1, p_R2, and p_R3, as shown in Table 3.

This is an enormous 8-dimensional parameter space to learn from: b_j^0 × p_T1 × p_T2 × p_O1 × p_O2 × p_R1 × p_R2 × p_R3, where b_j^0, p_T1, p_T2, p_O1, p_O2 ∈ [0, 1] ⊂ R and p_R1, p_R2, p_R3 ∈ (−∞, +∞) ⊂ R.

We reduce this huge space mainly by two means: utilizing Monte Carlo sampling methods, and assigning problem-specific priors that are not overly informative but still provide enough information for the algorithm to learn from.

Results

For the actual experiments, we fix the number of samples at 2000 and run the algorithm on a two-agent tiger game simulation as described above. We run experiments for learning three different models of agent j:
1. θ_j^1 = <0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10>
2. θ_j^2 = <0.5, 1.00, 0.5, 0.95, 0.5, −1, −10, 10>
3. θ_j^3 = <0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10>

These models are all special cases, carefully chosen in order to verify the correctness and evaluate the performance of our algorithm. For instance, the first model is a sophisticated one in which the other agent is actually modeling his opponent using a subintentional model, the second is a classic single-agent POMDP, and the third is a very simple one that nevertheless entails a large model space. We want to investigate whether our framework is able to learn these models correctly and efficiently.

Figure 2: Optimal policy of a no-information model.

The aim of the first experiment is to learn a relatively complicated model of agent j, with θ_j = <0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10>, who assumes that others' actions are drawn from a uniform distribution. Equivalently, agent j's actual policy, shown in Figure 2, is to listen for three consecutive growls from the same direction and then open the corresponding door. For this particular experiment we simulated the observation history of agent i in order to first verify the correctness of our algorithm while excluding the uncertainty of hearing accuracy. The simulated observation history is as follows: {GL,S GL,S GL,S GL,CR GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GL,CL GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S}

The priors we assign to the parameters are shown in Figure 3; specifically, they are uniform U(0,1) for b_j^0, Beta(5,3) with mode 0.67 for p_T1, Beta(5,5) for p_T2, Beta(3.5,1.4) with mode 0.85 for p_O1, Beta(5,5) for p_O2, Gaussian N(−1,2) for p_R1, N(−100,4) for p_R2, and N(10,2) for p_R3.

Figure 3: Assigned priors and learned posterior distributions over model parameters for model θ_j^1 = <0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10>.

After 50 time steps, the algorithm converges to a posterior distribution over agent j's intentional models; the results are also given in Figure 3. Since the parameter space of agent j's models is 8-dimensional, we show only the marginal distribution of each parameter as a histogram. We can easily see that the majority of samples are centered around the true parameter values.

We use principal component analysis (PCA) (Abdi and Williams 2010) to reduce the sample dimensionality to 2 and plot the samples in a 3-dimensional histogram, shown in Figure 4. Starting from a Gaussian-like prior, the samples gradually converge to the most likely models. Eventually the mean of this cluster, <0.49, 0.69, 0.49, 0.82, 0.51, −0.95, −99.23, 10.09>, is very close to the true model.

Figure 4: 3D histogram of all model samples.

Here we give two examples from the big cluster after 50 time steps: <0.56, 0.66, 0.49, 0.84, 0.59, −0.95, −101.37, 11.42> and <0.51, 0.68, 0.52, 0.89, 0.56, −1.33, −98.39, 12.55>. The former has a corresponding optimal policy of [0—OL—0.10—L—1], while the latter has [0—OL—0.09—L—0.91—OR—1]; both are extremely close to the optimal policy of the true model, [0—OL—0.1—L—0.9—OR—1]. Consequently, the framework is able to predict the other agent's actions with high accuracy.

We tested the performance of our algorithm in terms of the prediction accuracy of others' actions. We compared the results with other modeling approaches: a frequency-based approach, in which agent j is assumed to choose his actions according to a fixed but unknown distribution, and a no-information model, which treats j's actions purely as uniform noise. The results shown in Figure 5 are plots averaged over 10 random runs of 50 time steps each. They show clearly that the intentional I-POMDP approach has significantly lower error rates as agent i perceives more observations. The subintentional model assumes that j's actions are drawn from a uniform distribution and therefore has a fixed high error rate. The frequency-based approach has some learning ability but is far from sophisticated enough to model a fully rational agent.

Figure 5: Prediction error rate vs. observation length.

Figure 6: (a) Optimal policy for θ_j = <0.5, 1, 0.5, 0.95, 0.5, −1, −10, 10>. (b) Optimal policy for θ_j = <0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10>.

In the second experiment, we run our algorithm on actual observations for 30 time steps, until it converges, and try to learn the model of a simpler, classic POMDP agent with a high listening accuracy of 0.95 and a small penalty of −10; e.g., agent j alternately opens doors and listens, as shown in Figure 6(a). The actual model of j is θ_j = <0.5, 1, 0.5, 0.95, 0.5, −1, −10, 10>; the priors assigned to b_j^0, p_T1, p_T2, p_O1, p_O2, p_R1, p_R2, p_R3 are U(0,1), Beta(2,0.5), Beta(10,10), Beta(19,1), Beta(10,10), N(−1,1), N(−10,2), N(10,2); and the actual observation history is {GR,S GL,CR GL,S GL,CL GL,S GL,CR GL,S GL,CL GL,S GR,S GR,CL GR,CL GL,S GR,S GR,S GL,CL GR,S GL,CR GR,S GR,CR GR,CR GR,CL GL,S GL,S GL,S GL,CR GL,S GL,CL GR,S GR,S}.

Figure 7: Learned posterior distributions for model θ_j = <0.5, 1, 0.5, 0.95, 0.5, −1, −10, 10>.

Similarly, we report the learned posterior distributions over the model parameters in Figure 7. We observe an interesting pattern: while some parameters, such as b_j^0, p_T2, and p_O2, are concentrated around the actual values, others like p_T1 and p_O1 become more dispersed than their initial priors. The intuition behind this is that the penalty and reward are −10 and 10, so a single listen at a cost of −1 is enough for making the decision to open a door. That is to say, as long as the tiger likely remains behind the same door while the agent listens (the meaning of p_T1) and the hearing accuracy is reliable (the meaning of p_O1), there are many models that satisfy this particular observation sequence, and our algorithm learns them all.

In the last experiment, we want to learn the model θ_j = <0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10>, who always listens, since the listening reward is now equal to the gold-door reward, as shown in Figure 6(b). For brevity, we show only the marginal distributions over the model parameters, in Figure 8. The priors assigned to b_j^0, p_T1, p_T2, p_O1, p_O2, p_R1, p_R2, p_R3 are U(0,1), Beta(5,3), Beta(10,10), Beta(3.5,1.4), Beta(10,10), N(10,1), N(−100,2), N(10,2), and the actual observation history agent i learns from is {GL,S GL,S GR,S GL,S GL,CL GR,S GR,S GL,CL GR,S GL,S GL,S GR,S GL,S GL,S GL,S GL,CL GR,S GL,S GL,S GL,S}. We can see that all three reward parameters are correctly learned, while the samples of p_T1, p_T2, p_O1, and p_O2 are not very concentrated around their true values but stay close to their corresponding priors; intuitively, these parameters become less important and may lie in a relatively loose region due to the increased p_R1 = 10. Lastly, the performance comparison is given in Figure 9(b).

Figure 8: Learned posterior distributions for model θ_j = <0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10>.

For conciseness, we show the average prediction error rates for both the second and third experiments in Figure 9. Both results are averaged over 10 random runs of 30 time steps each. In the second experiment, Figure 9(a), the intentional I-POMDP approach still has significantly lower error rates than the others.

Figure 9: (a) Prediction error rate vs. observation length for θ_j = <0.5, 1, 0.5, 0.95, 0.5, −1, −10, 10>. (b) The same for θ_j = <0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10>.

Conclusions and Future Work

We have described a new approach to learning other agents' models by approximating the interactive belief update using Bayesian inference and Monte Carlo sampling methods. Our framework correctly learns others' models over the entire intentional model space and is therefore a generalized reinforcement learning algorithm for multi-agent settings. It also effectively mitigates the belief space complexity and achieves significantly better performance than other approaches in predicting others' actions.

In the future, more multi-agent problems of various sizes could be tested in order to fully evaluate practicality on larger problem spaces. Due to computational complexity, experiments at higher nesting levels are currently limited; thus, more effort could be devoted to nonparametric Bayesian methods, which inherently deal with nested belief structures.

References

Abdi, H., and Williams, L. J. 2010. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2(4): 433–459.
Del Moral, P. 1996. Non-linear filtering: interacting particle resolution. Markov Processes and Related Fields 2(4): 555–581.
Doshi, P., and Gmytrasiewicz, P. J. 2009. Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research 34: 297–337.
Doshi, P.; Zeng, Y.; and Chen, Q. 2009. Graphical models for interactive POMDPs: representations and solutions. Autonomous Agents and Multi-Agent Systems 18(3): 376–416.
Doshi-Velez, F., and Konidaris, G. 2013. Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. arXiv:1308.3513.
Doucet, A.; De Freitas, N.; and Gordon, N. 2001. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, 3–14. Springer New York.
Gilks, W. R.; Richardson, S.; and Spiegelhalter, D. J. 1996. Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice, 1–19.
Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research 24: 49–79.
Gordon, N. J.; Salmond, D. J.; and Smith, A. F. M. 1993. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing) 140(2): 107–113.
Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1): 99–134.
Liu, M.; Liao, X.; and Carin, L. 2011. The infinite regionalized policy representation. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 769–776.
Panella, A., and Gmytrasiewicz, P. 2016. Bayesian learning of other agents' finite controllers for interactive POMDPs. In Thirtieth AAAI Conference on Artificial Intelligence.
Papadimitriou, C. H., and Tsitsiklis, J. N. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3): 441–450.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Ross, S.; Chaib-draa, B.; and Pineau, J. 2007. Bayes-adaptive POMDPs. In Advances in Neural Information Processing Systems, 1225–1232.
Spaan, M. T. J., and Vlassis, N. 2005. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research 24: 195–220.
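The MCMC idea referenced in the background (simulating a Markov chain whose stationary distribution is the target π) can be illustrated with a minimal random-walk Metropolis sampler. This is our own sketch, not code from the paper; the target density and step size are purely illustrative:

```python
import math
import random

def metropolis(log_target, x0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis: simulate a Markov chain whose
    stationary distribution is the (unnormalized) target density."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, target(proposal) / target(x)),
        # done in log space for numerical stability.
        if math.log(rng.random() + 1e-300) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return samples

# Illustrative target: a standard normal, via its log density up to a constant.
samples = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
mean = sum(samples) / len(samples)
```

With enough samples, the chain's empirical mean and variance approach those of the target, which is the convergence property the paper relies on.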
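The propagate/weight/resample cycle of the bootstrap particle filter described above can be sketched for a tiny discrete hidden state. This is a generic illustration under assumed numbers (a two-state chain with 0.9 persistence and 0.85 observation accuracy), not the paper's model:

```python
import random

def bootstrap_step(particles, transition, obs_lik, y, rng):
    """One bootstrap-filter step: propagate, weight, resample."""
    # Propagate each particle through the proposal p(x_t | x_{t-1}).
    moved = [transition(x, rng) for x in particles]
    # Weight by the observation likelihood p(y_t | x_t).
    w = [obs_lik(y, x) for x in moved]
    total = sum(w)
    w = [wi / total for wi in w]
    # Resample particles in proportion to the normalized weights.
    return rng.choices(moved, weights=w, k=len(moved))

rng = random.Random(1)
# Two-state chain that stays put w.p. 0.9; observations are correct w.p. 0.85.
trans = lambda x, r: x if r.random() < 0.9 else 1 - x
lik = lambda y, x: 0.85 if y == x else 0.15
particles = [rng.choice([0, 1]) for _ in range(1000)]
for y in [0, 0, 0]:          # three consistent observations of state 0
    particles = bootstrap_step(particles, trans, lik, y, rng)
belief0 = particles.count(0) / len(particles)
```

After a few consistent observations, the particle population concentrates on the observed state, which is exactly the filtering behavior the I-PF generalizes to interactive states.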
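The POMDP belief update of Eq. (2) is a direct computation once T and O are given as tables. The sketch below uses illustrative tiger-style numbers (our own simplified matrices, not the paper's full specification):

```python
def pomdp_belief_update(b, a, o, T, O, states):
    """b'(s') ∝ O(s', a, o) * sum_s T(s, a, s') * b(s)   (Eq. 2)."""
    new_b = {}
    for s2 in states:
        new_b[s2] = O[(s2, a)][o] * sum(T[(s, a)][s2] * b[s] for s in states)
    alpha = sum(new_b.values())              # normalizing constant
    return {s: v / alpha for s, v in new_b.items()}

states = ["TL", "TR"]
# Listening leaves the tiger in place; the growl is heard correctly w.p. 0.85.
T = {(s, "L"): {s2: 1.0 if s2 == s else 0.0 for s2 in states} for s in states}
O = {("TL", "L"): {"GL": 0.85, "GR": 0.15},
     ("TR", "L"): {"GL": 0.15, "GR": 0.85}}
b = {"TL": 0.5, "TR": 0.5}
b = pomdp_belief_update(b, "L", "GL", T, O, states)
# Starting uniform, one GL observation moves the belief in TL to 0.85.
```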
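The normalize-and-resample step of Algorithm 1 (lines 16–17) is standard importance resampling. A minimal sketch using systematic resampling, one common low-variance variant (the specific variant is our assumption; the paper does not name one):

```python
import random

def systematic_resample(particles, weights, rng):
    """Resample N particles with probability proportional to their weights,
    using a single uniform offset (systematic resampling)."""
    n = len(particles)
    total = sum(weights)
    w = [wi / total for wi in weights]       # normalize so the weights sum to 1
    u = rng.random() / n                     # one uniform offset in [0, 1/n)
    cumulative, out, i = w[0], [], 0
    for k in range(n):
        pos = u + k / n
        while pos > cumulative:              # advance to the owning particle
            i += 1
            cumulative += w[i]
        out.append(particles[i])
    return out

rng = random.Random(7)
resampled = systematic_resample(["m1", "m2", "m3"], [0.1, 0.1, 0.8], rng)
```

A heavily weighted particle (here "m3" with weight 0.8) is duplicated, while low-weight particles tend to be discarded; this concentrates the sample set on the better-fitting models.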
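Algorithm 2's marginalization over the other agent's (assumed uniform) actions, followed by a plain POMDP update, can be sketched as follows. The tiger-style tables are illustrative simplifications (e.g. the transition ignores door resets), not the paper's exact functions:

```python
def level0_belief_update(b, a_k, o_k, T, O, states, other_actions):
    """Level-0 update (Algorithm 2 sketch): average T and O over the other
    agent's actions, assumed uniform, then do a plain belief update."""
    p_other = 1.0 / len(other_actions)
    new_b = {}
    for s2 in states:
        # Effective transition, marginalized over the other agent's action.
        trans = sum(p_other * sum(T[(s, a_k, a_j)][s2] * b[s] for s in states)
                    for a_j in other_actions)
        # Effective observation likelihood, marginalized the same way.
        obs = sum(p_other * O[(s2, a_k, a_j)][o_k] for a_j in other_actions)
        new_b[s2] = trans * obs
    alpha = sum(new_b.values())
    return {s: v / alpha for s, v in new_b.items()}

states, others = ["TL", "TR"], ["L", "OL", "OR"]
# Tiger stays put while agent k listens, regardless of what j does (simplified).
T = {(s, "L", aj): {s2: 1.0 if s2 == s else 0.0 for s2 in states}
     for s in states for aj in others}
# Growl accuracy 0.85, independent of j's action (creaks omitted for brevity).
O = {(s, "L", aj): {"GL": 0.85 if s == "TL" else 0.15,
                    "GR": 0.15 if s == "TL" else 0.85}
     for s in states for aj in others}
b = level0_belief_update({"TL": 0.5, "TR": 0.5}, "L", "GL", T, O, states, others)
```

Because the other agent's action is averaged out uniformly, this is the "no-information" treatment the text describes: the modeled agent plans as if others were noise.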