<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Learning Others&apos; Intentional Models in Multi-Agent Settings Using Interactive POMDPs</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yanlin</forename><surname>Han</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
<orgName type="institution">University of Illinois at Chicago</orgName>
								<address>
									<postCode>60607</postCode>
									<region>IL</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Piotr</forename><surname>Gmytrasiewicz</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
<orgName type="institution">University of Illinois at Chicago</orgName>
								<address>
									<postCode>60607</postCode>
									<region>IL</region>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Learning Others&apos; Intentional Models in Multi-Agent Settings Using Interactive POMDPs</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C1E5AE1B853E8F550800109F7EA0850B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T02:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Interactive partially observable Markov decision processes (I-POMDPs) provide a principled framework for planning and acting in a partially observable, stochastic and multiagent environment, extending POMDPs to multi-agent settings by including models of other agents in the state space and forming a hierarchical belief structure. In order to predict other agents' actions using I-POMDP, we propose an approach that effectively uses Bayesian inference and sequential Monte Carlo (SMC) sampling to learn others' intentional models which ascribe them beliefs, preferences and rationality in action selection. For problems of various complexities, empirical results show that our algorithm accurately learns models of other agents and has superior performance in comparison with other methods. Our approach serves as a generalized reinforcement learning algorithm that learns over other agents' transition, observation and reward functions. It also effectively mitigates the belief space complexity due to the nested belief hierarchy.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Partially observable Markov decision processes (POMDPs) <ref type="bibr" target="#b9">(Kaelbling, Littman, and Cassandra 1998)</ref> provide a principled, decision-theoretic framework for planning under uncertainty in a partially observable, stochastic environment. An autonomous agent operates rationally in such settings by maintaining a belief over the physical state at any given time and sequentially choosing the optimal actions that maximize future rewards. Solutions of POMDPs are therefore mappings from an agent's beliefs to actions. Although POMDPs can be used in multi-agent settings, doing so requires the strong assumption that the effects of other agents' actions can be treated implicitly as noise and folded into the state transitions, as in recent Bayes-adaptive POMDPs <ref type="bibr" target="#b14">(Ross, Draa, and Pineau 2007)</ref>, the infinite generalized policy representation <ref type="bibr" target="#b11">(Liu, Liao, and Carin 2011)</ref>, and infinite POMDPs <ref type="bibr" target="#b3">(Doshi-Velez et al. 2013)</ref>. Consequently, an agent's beliefs about other agents are not part of the solutions of POMDPs.</p><p>The interactive POMDP (I-POMDP) <ref type="bibr">(Gmytrasiewicz and Doshi 2005</ref>) is a generalization of the POMDP to multi-agent settings that replaces POMDP belief spaces with interactive hierarchical belief systems. Specifically, it augments the plain beliefs about the physical states in a POMDP by including models of other agents, forming a hierarchical belief structure that represents an agent's belief about the physical state, its belief about the other agents, and their beliefs about others' beliefs. The models of other agents included in the augmented state space are of two types: intentional models and subintentional models. 
The sophisticated intentional model ascribes beliefs, preferences, and rationality to other agents <ref type="bibr">(Gmytrasiewicz and Doshi 2005)</ref>, while the simpler subintentional model, such as a finite state controller <ref type="bibr" target="#b12">(Panella and Gmytrasiewicz 2016)</ref>, does not. Solutions of I-POMDPs map an agent's belief about the environment and about other agents' models to actions, making the framework applicable to important agent, human, and mixed agent-human applications. It has been shown <ref type="bibr">(Gmytrasiewicz and Doshi 2005</ref>) that the added sophistication of modeling others as rational agents yields a value function that dominates the one obtained by simply treating others as noise, which implies the modeling superiority of I-POMDPs over other approaches for multi-agent systems.</p><p>However, the interactive belief augmentation in I-POMDPs drastically increases the belief space complexity, adding to the curse of dimensionality: the complexity of the belief representation grows with the belief dimensions, because the number of agent models grows exponentially with the nesting level. Since exact solutions to POMDPs are proven to be PSPACE-complete for finite horizons and undecidable for infinite horizons <ref type="bibr">(Papadimitriou and Tsitsiklis 1987)</ref>, the time complexity of the more general I-POMDPs, which may contain multiple POMDPs and I-POMDPs of other agents, is at least PSPACE-complete for finite horizons and undecidable for infinite horizons. Due to this severe space complexity, no complete belief update has yet been accomplished using the sophisticated intentional models over the entire interactive belief space. 
There are only partial updates of other agents' beliefs about the physical states <ref type="bibr" target="#b2">(Doshi and Gmytrasiewicz 2009)</ref> and indirect approaches such as subintentional finite state controllers <ref type="bibr" target="#b12">(Panella and Gmytrasiewicz 2016)</ref>. Therefore, in order to unleash the full modeling power of intentional models and apply I-POMDPs in more realistic settings, a good approximation algorithm for computing the nested interactive beliefs and predicting other agents' actions is crucial to the trade-off between solution quality and computational complexity.</p><p>To address this issue, we propose a Bayesian approach that uses customized sequential Monte Carlo sampling algorithms <ref type="bibr" target="#b1">(Doucet, De Freitas, and Gordon 2001)</ref> to obtain approximate solutions to I-POMDPs, and we implement the algorithms in a software package<ref type="foot" target="#foot_0">1</ref>. Specifically, we assume that the models of other agents are unknown and must be learned from imperfect observations of their behavior. We parametrize other agents' intentional models and maintain a belief over them, making sequential Bayesian updates using only observations from the environment. Since this Bayesian inference task is analytically intractable, we approximate the posterior distribution with a customized sequential Monte Carlo method that descends the belief hierarchy and samples all model parameters at each nesting level, starting from the interactive particle filter (I-PF) <ref type="bibr" target="#b2">(Doshi and Gmytrasiewicz 2009)</ref> for the I-POMDP belief update.</p><p>Our approach, for the first time, successfully learns others' models over the entire intentional model space, which contains their initial beliefs and their transition, observation, and reward functions, making it a generalized reinforcement learning method for multi-agent settings. 
Our algorithm accurately predicts others' actions in various problem settings, thereby enabling the modeling agent to take the corresponding optimal actions that maximize its own rewards. By approximating the Bayesian inference with a customized sequential Monte Carlo sampling method, we significantly mitigate the belief space complexity of I-POMDPs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Background POMDP</head><p>A partially observable Markov decision process (POMDP) <ref type="bibr" target="#b9">(Kaelbling, Littman, and Cassandra 1998</ref>) is a general reinforcement learning model for planning and acting in a single-agent, partially observable, stochastic domain. It is defined for a single agent i as:</p><formula xml:id="formula_0">POMDP_i = ⟨S, A_i, Ω_i, T_i, O_i, R_i⟩ (1)</formula><p>where the elements of the 6-tuple are:</p><p>• S is the set of states of the environment.</p><p>• A_i is the set of agent i's possible actions.</p><formula xml:id="formula_1">• Ω_i is the set of agent i's possible observations. • T_i : S × A_i × S → [0, 1] is the state transition function. • O_i : S × A_i × Ω_i → [0, 1] is the observation function. • R_i : S × A_i → ℝ is the reward function.</formula><p>Given the definition above, an agent's belief about the state can be represented as a probability distribution over S. The belief update is done using the following formula, where α is the normalizing constant:</p><formula xml:id="formula_2">b′(s′) = α O(s′, a, o) Σ_{s∈S} T(s, a, s′) b(s)<label>(2)</label></formula><p>Given the agent's belief, the optimal action, a*, is part of the set of optimal actions, OPT(b_i), for the belief state, defined as:</p><formula xml:id="formula_3">OPT(b_i) = argmax_{a_i∈A_i} { Σ_{s∈S} b_i(s) R(s, a_i) + Σ_{o_i∈Ω_i} P(o_i | a_i, b_i) × U(SE(b_i, a_i, o_i)) } (3)</formula></div>
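The belief update of Eq. (2) can be sketched numerically. This is a minimal illustration, assuming tabular transition and observation functions indexed by action (the dictionary layout of `T` and `O` is an implementation choice, not from the paper):

```python
import numpy as np

def pomdp_belief_update(b, a, o, T, O):
    """One step of the POMDP belief update (Eq. 2):
    b'(s') = alpha * O(s', a, o) * sum_s T(s, a, s') * b(s).
    T[a] is an |S| x |S| matrix with T[a][s, s'] = P(s' | s, a);
    O[a] is an |S| x |Omega| matrix with O[a][s', o] = P(o | s', a)."""
    b_next = O[a][:, o] * (b @ T[a])  # unnormalized posterior over s'
    return b_next / b_next.sum()      # dividing by the sum applies alpha
```

For instance, in a tiger-style problem with listening accuracy 0.85 and a tiger that stays put while the agent listens, hearing a growl from the left moves a uniform belief to (0.85, 0.15).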
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Particle Filter</head><p>The Markov chain Monte Carlo (MCMC) method <ref type="bibr" target="#b5">(Gilks et al., 1996)</ref> is widely used to approximate probability distributions that cannot be computed directly. It generates samples from a posterior distribution π(x) over a state space x by simulating a Markov chain p(x′|x) whose state space is x and whose stationary distribution is π(x). The samples drawn from p converge to the target distribution π as the number of samples goes to infinity.</p><p>In order to make MCMC work on sequential inference tasks, especially sequential decision making under Markov assumptions, sequential versions of Monte Carlo methods have been proposed, some of which can deal with high-dimensional and/or complex problems, such as particle filters <ref type="bibr">(Del Moral 1996)</ref>. At each time step, a particle filter draws samples (or particles) from a proposal distribution, commonly p(x_t|x_{t−1}), the conditional distribution of the current state x_t given the previous state x_{t−1}; it then uses the observation function p(y_t|x_t) to compute an importance weight for each particle and resamples all particles according to the weights.</p></div>
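The propagate-weight-resample loop just described can be sketched as a single bootstrap-filter step. This is a generic illustration, not the paper's implementation; `propagate` and `likelihood` stand in for p(x_t|x_{t−1}) and p(y_t|x_t):

```python
import numpy as np

def bootstrap_filter_step(particles, propagate, likelihood, y, rng):
    """One bootstrap-filter step: draw each particle from the proposal
    p(x_t | x_{t-1}), weight it by the observation likelihood
    p(y_t | x_t), then resample in proportion to the normalized weights."""
    proposed = np.array([propagate(x, rng) for x in particles])
    w = np.array([likelihood(y, x) for x in proposed])
    w /= w.sum()                                  # normalize importance weights
    idx = rng.choice(len(proposed), size=len(proposed), p=w)
    return proposed[idx]                          # resampled, equally weighted
```

On a 1-D Gaussian random walk, one step with an observation near x = 4 pulls an initially diffuse particle cloud toward the observation.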
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The Model I-POMDP framework</head><p>An interactive POMDP of agent i, I-POMDP_i, is defined as:</p><formula xml:id="formula_4">I-POMDP_i = ⟨IS_{i,l}, A, Ω_i, T_i, O_i, R_i⟩ (4)</formula><p>where IS_{i,l} is the set of interactive states of the environment, defined as IS_{i,l} = S × M_{j,l−1}, l ≥ 1, where S is the set of states, M_{j,l−1} is the set of possible models of agent j, and l is the strategy level. A specific class of models are the (l−1)th-level intentional models, Θ_{j,l−1}, of agent j: θ_{j,l−1} = ⟨b_{j,l−1}, A, Ω_j, T_j, O_j, R_j, OC_j⟩, where b_{j,l−1} is agent j's belief nested to level (l−1), b_{j,l−1} ∈ Δ(IS_{j,l−1}), and OC_j is j's optimality criterion. The intentional model θ_{j,l−1}, sometimes referred to as a type, can be rewritten as θ_{j,l−1} = ⟨b_{j,l−1}, θ̂_j⟩, where θ̂_j includes all elements of the intentional model other than the belief and is called agent j's frame.</p><p>IS_{i,l} can be defined inductively (note that when the frame θ̂_j is known, θ_j reduces to b_j):</p><formula xml:id="formula_5">IS_{i,0} = S, Θ_{j,0} = {⟨b_{j,0}, θ̂_j⟩ : b_{j,0} ∈ Δ(S)} IS_{i,1} = S × Θ_{j,0}, Θ_{j,1} = {⟨b_{j,1}, θ̂_j⟩ : b_{j,1} ∈ Δ(IS_{j,1})} ...... <label>(5)</label></formula><formula xml:id="formula_7">IS_{i,l} = S × Θ_{j,l−1}, Θ_{j,l} = {⟨b_{j,l}, θ̂_j⟩ : b_{j,l} ∈ Δ(IS_{j,l})}</formula><p>All other components of an I-POMDP are similar to those of a POMDP:</p><formula xml:id="formula_8">• A = A_i × A_j is the set of joint actions of all agents. • Ω_i is the set of agent i's possible observations. • T_i : S × A × S → [0, 1] is the state transition function. • O_i : S × A × Ω_i → [0, 1] is the observation function. • R_i : IS_i × A → ℝ is the reward function.</formula></div>
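The nesting of interactive states, intentional models, and frames defined above can be made concrete as a small data-structure sketch. The class and field names are illustrative choices, not from the paper:

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence

@dataclass
class Frame:
    """Agent j's frame (theta-hat_j): every element of the intentional
    model except the belief."""
    A_j: Sequence[str]           # actions
    Omega_j: Sequence[str]       # observations
    T_j: Dict[Any, Any]          # transition function
    O_j: Dict[Any, Any]          # observation function
    R_j: Dict[Any, Any]          # reward function
    OC_j: str = "infinite-horizon-discounted"  # optimality criterion

@dataclass
class IntentionalModel:
    """theta_{j,l-1} = <b_{j,l-1}, theta-hat_j>: a nested belief plus a frame."""
    belief: Any                  # b_{j,l-1} in Delta(IS_{j,l-1})
    frame: Frame

@dataclass
class InteractiveState:
    """An element of IS_{i,l} = S x Theta_{j,l-1}: a physical state paired
    with an intentional model of agent j."""
    s: str
    theta_j: IntentionalModel
```

A level-1 interactive state for the tiger game would pair a physical state such as "TL" with a level-0 intentional model of j whose belief is a distribution over S.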
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Interactive belief update</head><p>Given the definitions above, the interactive belief update can be performed as follows:</p><formula xml:id="formula_9">b_i^t(is^t) = Pr(is^t | b_i^{t−1}, a_i^{t−1}, o_i^t) (6) = α Σ_{is^{t−1}} b(is^{t−1}) Σ_{a_j^{t−1}} Pr(a_j^{t−1} | θ_j^{t−1}) T(s^{t−1}, a^{t−1}, s^t) × O_i(s^t, a^{t−1}, o_i^t) Σ_{o_j^t} O_j(s^t, a^{t−1}, o_j^t) τ(b_j^{t−1}, a_j^{t−1}, o_j^t, b_j^t)</formula><p>Unlike the plain belief update in a POMDP, the interactive belief update in an I-POMDP takes two additional sophistications into account. Firstly, the probabilities of the other agent's actions given its models (the second summation) need to be computed, since the state of the physical environment now depends on both agents' actions. Secondly, the agent needs to update its beliefs based on the anticipation of what observations the other agent might receive and how it updates its own beliefs (the third summation).</p><p>The optimal action, a*, for the case of the infinite horizon criterion with discounting, is part of the set of optimal actions, OPT(θ_i), for the belief state, defined as:</p><formula xml:id="formula_10">OPT(θ_i) = argmax_{a_i∈A_i} { Σ_{is∈IS} b_i(is) ER_i(is, a_i) + Σ_{o_i∈Ω_i} P(o_i | a_i, b_i) × U(⟨SE_{θ_i}(b_i, a_i, o_i), θ̂_i⟩) } (7)</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sampling Algorithms</head><p>The interactive particle filter (I-PF) <ref type="bibr" target="#b2">(Doshi and Gmytrasiewicz 2009)</ref> was proposed as a filtering algorithm for the interactive belief update in I-POMDPs. It generalizes the classic particle filter to multi-agent settings and uses the state transition function as the proposal distribution, as is done in the specific particle filter known as the bootstrap filter <ref type="bibr">(Gordon et al. 1993)</ref>. However, due to the enormous belief space, the I-PF assumes that the other agent's frame θ̂_j is known to the modeling agent, thereby simplifying the belief update from S × Θ_{j,l−1} to the significantly smaller space S × b_{j,l−1}. The intuition behind our algorithm is to assign appropriate prior distributions over all of agent j's possible models θ_j = ⟨b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j⟩ and to sample from each of their dimensions. At each time step, we update all samples using the perceived observations, namely computing and assigning a weight to each sample, and resample them according to the weights. Finally, since this is a randomized Monte Carlo method, to prevent the learning algorithm from converging to incorrect models we add another resampling step that samples from neighboring similar models given the current samples. 
Consequently, our algorithm is able to maintain a probability distribution over the most likely models of other agents and eventually learn their optimal actions.</p><formula xml:id="formula_11">Algorithm 1: Interactive Belief Update
b̃_{k,l}^t = InteractiveBeliefUpdate(b̃_{k,l}^{t−1}, a_k^{t−1}, o_k^t, l &gt; 0)
1 for is_k^{(n),t−1} = ⟨s^{(n),t−1}, θ_{−k}^{(n),t−1}⟩ ∈ b̃_{k,l}^{t−1}:
2   sample a_{−k}^{t−1} ~ P(A_{−k} | θ_{−k}^{(n),t−1})
3   sample s^{(n),t} ~ T_k(S^t | S^{(n),t−1}, a_k^{t−1}, a_{−k}^{t−1})
4   for o_{−k}^t ∈ Ω_{−k}:
5     if l = 1:
6       b_{−k,0}^{(n),t} = Level0BeliefUpdate(b_{−k,0}^{(n),t−1}, a_{−k}^{t−1}, o_{−k}^t, θ_{−k}^{(n),t−1})
7       θ_{−k}^{(n),t} = ⟨b_{−k,0}^{(n),t}, θ̂_{−k}^{(n),t−1}⟩
8       is_k^{(n),t} = ⟨s^{(n),t}, θ_{−k}^{(n),t}⟩
9     else:
10      b_{−k,l−1}^{(n),t} = InteractiveBeliefUpdate(b̃_{−k,l−1}^{t−1}, a_{−k}^{t−1}, o_{−k}^t, l−1)
11      θ_{−k}^{(n),t} = ⟨b_{−k,l−1}^{(n),t}, θ̂_{−k}^{(n),t−1}⟩
12      is_k^{(n),t} = ⟨s^{(n),t}, θ_{−k}^{(n),t}⟩
13    w_t^{(n)} = O_{−k}^{(n)}(o_{−k}^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1})
14    w_t^{(n)} = w_t^{(n)} × O_k(o_k^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1})
15    b̃_{k,l}^{temp} = ⟨is_k^{(n),t}, w_t^{(n)}⟩
16 normalize all w_t^{(n)} so that Σ_{n=1}^N w_t^{(n)} = 1
17 resample from b̃_{k,l}^{temp} according to the normalized w_t^{(n)}
18 resample θ_{−k}^{(n),t} from neighboring similar models
19 return b̃_{k,l}^t = {is_k^{(n),t}}</formula><p>The interactive belief update described in Algorithm 1 is similar to the I-PF in terms of its recursive Monte Carlo sampling and nesting hierarchy, but it has three major differences. Firstly, the belief update is over the entire intentional model space of other agents, so the initial set of N samples contains</p><formula xml:id="formula_12">θ_{−k}^{(n),t−1} = ⟨b_{−k,l−1}^{(n),t−1}, A_{−k}, Ω_{−k}, T_{−k}^{(n)}, O_{−k}^{(n)}, R_{−k}^{(n)}, OC_{−k}⟩,</formula><p>where k denotes the modeling agent and −k denotes all other modeled agents. We only assume that the actions A_{−k}, observations Ω_{−k} and optimality criteria OC_{−k} are known, as in a multi-agent game the rules are usually known to all agents or could be obtained through intelligence. 
Secondly, note that the observation function</p><formula xml:id="formula_13">O_{−k}^{(n)}(o_{−k}^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1})</formula><p>in line 13 is now randomized as well, since each one is a particular observation function of that agent. Lastly, we add another resampling step in line 18 in order to avoid divergence, resampling each dimension of the model samples from a Gaussian distribution whose mean is the current sample value. Intuitively, similar models are resampled from a relatively tight neighboring region of the current model samples to maintain learning accuracy.</p><p>Algorithm 1 can be viewed as two major steps. The importance sampling step (lines 1 to 16) samples from the belief priors b̃_{k,l}^{t−1}, propagates the samples forward using the related proposal distributions, and computes the weights of all samples. The selection, or resampling, step (lines 17 to 18) resamples according to the weights and the similar models. Specifically, the algorithm starts from a set of initial priors is_k^{(n),t−1}; for each of them, it samples the other agents' optimal action a_{−k}^{t−1} from their policy P(A_{−k} | θ_{−k}^{(n),t−1}), which is solved using a very efficient POMDP solver called Perseus<ref type="foot" target="#foot_1">2</ref>  <ref type="bibr" target="#b10">(Spaan and Vlassis 2005)</ref>. It then samples the physical state s^t using the state transition T_k(S^t | S^{(n),t−1}, a_k^{t−1}, a_{−k}^{t−1}). Once a_{−k}^{t−1} and s^t are sampled, the algorithm calls the 0-level belief update (lines 5 to 8), described in Algorithm 2, to update the other agents' plain beliefs b_{−k,0}^t if the current nesting level l is 1, or recursively calls itself at a lower level l−1 (lines 9 to 12) if the current nesting level is greater than 1. The sample weights w_t^{(n)} are computed according to the observation likelihoods of the modeling and modeled agents (lines 13, 14), and are then normalized so that they sum to 1 (line 16). 
Lastly, the algorithm resamples the intermediate samples according to the computed weights (line 17) and resamples another time from similar neighboring models (line 18).</p><formula xml:id="formula_14">Algorithm 2: Level-0 Belief Update
b_{−k,0}^t = Level0BeliefUpdate(b_{−k,0}^{t−1}, a_{−k}^{t−1}, o_{−k}^t, T_{−k}^{(n)}, O_{−k}^{(n)})
1 P(a_k^{t−1}) = 1/|A_k|
2 for s^t ∈ S:
3   for s^{t−1} ∈ S:
4     for a_k^{t−1} ∈ A_k:
5       P^{(n)}(s^t | s^{t−1}, a_{−k}^{t−1}) = T_{−k}^{(n)}(s^t | s^{t−1}, a_{−k}^{t−1}, a_k^{t−1}) × P(a_k^{t−1})
6       sum^{(n)} += P^{(n)}(s^t | s^{t−1}, a_{−k}^{t−1}) b_{−k,0}^{t−1}(s^{t−1})
7   for a_k^{t−1} ∈ A_k:
8     P^{(n)}(o_{−k}^t | s^t, a_{−k}^{t−1}) += O_{−k}^{(n)}(o_{−k}^t | s^t, a_{−k}^{t−1}, a_k^{t−1}) P(a_k^{t−1})
9   b_{−k,0}^t(s^t) = sum^{(n)} × P^{(n)}(o_{−k}^t | s^t, a_{−k}^{t−1})
10 normalize and return b_{−k,0}^t</formula><p>The 0-level belief update, described in Algorithm 2, is similar to the POMDP belief update but treats the other agents' actions as noise and randomizes the state transition and observation functions as input parameters. It assumes that the other agents in the environment choose their actions according to a uniform distribution (line 1), and is therefore essentially a no-information model. For each possible action a_k^{t−1}, it computes the actual state transition (line 5) and the actual observation function (line 8) by marginalizing over the others' actions, and returns the normalized belief b_{−k,0}^t. Notice that the transition function T_{−k}^{(n)}(s^t | s^{t−1}, a_{−k}^{t−1}, a_k^{t−1}) and observation function O_{−k}^{(n)}(o_{−k}^t | s^t, a_{−k}^{t−1}, a_k^{t−1}) are now both samples from the input arguments, depending on the model parameters of the actual agent at the 0th level. In figure <ref type="figure" target="#fig_0">1</ref>, we illustrate the interactive belief update using the problem discussed in the following section. 
Suppose there are two agents i and j in the environment, the sample size is 8, and the nesting level is 2; the subscripts in figure <ref type="figure" target="#fig_0">1</ref> denote the corresponding agents, and each dot represents a particular belief sample. The propagate step corresponds to lines 2 to 12 in Algorithm 1, the weight step to lines 13 to 16, and the resample step to lines 17 and 18. The belief update for a particular level-0 model sample (θ_j = ⟨b_j(s) = 0.5, p_T1 = 0.67, p_T2 = 0.5, p_O1 = 0.85, p_O2 = 0.5, p_R1 = −1, p_R2 = −100, p_R3 = 10⟩) is solved using Algorithm 2, and the optimal action is computed by calling the Perseus POMDP solver.</p></div>
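The marginalization at the heart of Algorithm 2 can be sketched compactly with matrix operations. This is an illustrative reading of the algorithm, not the paper's code; `T_j` and `O_j` hold one sampled transition/observation matrix per joint action:

```python
import numpy as np

def level0_belief_update(b, a_j, o_j, T_j, O_j, A_k):
    """Sketch of Algorithm 2: a level-0 agent j treats the other agent k's
    actions as uniform noise and marginalizes them out.
    T_j[(a_j, a_k)] is an |S| x |S| transition matrix and
    O_j[(a_j, a_k)] an |S| x |Omega_j| observation matrix, both sampled
    model parameters of the modeled agent."""
    p_ak = 1.0 / len(A_k)                         # line 1: uniform P(a_k)
    # lines 2-6: marginal transition under uniform a_k, applied to the prior
    T_marg = sum(T_j[(a_j, ak)] for ak in A_k) * p_ak
    pred = b @ T_marg                             # predicted distribution over s^t
    # lines 7-8: marginal observation likelihood under uniform a_k
    O_marg = sum(O_j[(a_j, ak)] for ak in A_k) * p_ak
    post = pred * O_marg[:, o_j]                  # line 9: combine prediction and
    return post / post.sum()                      # likelihood; line 10: normalize
```

When the sampled matrices do not actually depend on a_k, the result reduces to the single-agent POMDP update, as expected of a no-information model.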
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiments Setup</head><p>We present results on the multi-agent tiger game <ref type="bibr">(Gmytrasiewicz and Doshi 2005)</ref> with various settings. The multi-agent tiger game generalizes the classical single-agent tiger game <ref type="bibr" target="#b9">(Kaelbling, Littman, and Cassandra 1998)</ref> by adding observations caused by the other players' actions. The generalized multi-agent game contains additional observations regarding the other players, while the state transition and reward functions involve the others' actions as well.</p><p>Consider a specific game instance with known parameters: a tiger and a pile of gold are behind two doors, respectively; both players can listen for a growl of the tiger and a creak caused by the other player, or open a door, which resets the tiger's location with equal probability. The observation accuracies regarding the tiger and the other player are both relatively high (0.85 and 0.9, respectively). No matter which player triggers it, the reward for the listening action is -1, opening the tiger door yields -100, and opening the gold door yields 10.</p><formula xml:id="formula_15">Table 1 (the parametrized transition function T_j), with rows as ⟨state, action⟩ pairs and columns as next-state probabilities (TL, TR): ⟨TL, L⟩: (p_T1, 1−p_T1); ⟨TR, L⟩: (1−p_T1, p_T1); ⟨*, OL⟩: (p_T2, 1−p_T2); ⟨*, OR⟩: (1−p_T2, p_T2).</formula><p>For the sake of brevity, we restrict the experiments to a two-agent setting and a nesting level of one, but the sampling algorithm extends to any number of agents and nesting levels in a straightforward manner. Recall that an interactive POMDP of agent i is defined as the six-tuple</p><formula xml:id="formula_16">I-POMDP_i = ⟨IS_{i,l}, A, Ω_i, T_i, O_i, R_i⟩.</formula><p>Thus, for the specific setting of the multi-agent tiger problem:</p><formula xml:id="formula_17">• IS_{i,1} = S × Θ_{j,0}</formula><p>, where S = {tiger on the left (TL), tiger on the right (TR)} and Θ_{j,0} = {⟨b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j⟩}. 
• Ω_i is the set of all combinations of each agent's possible observations: growl from the left (GL) or right (GR), combined with creak from the left (CL), creak from the right (CR), or silence (S).</p><formula xml:id="formula_18">• A = A_i × A_j is the set of joint actions of both agents.</formula><p>• T_i = T_j : S × A_i × A_j × S → [0, 1] is a joint state transition probability that involves both agents' actions.</p><formula xml:id="formula_19">• O_i : S × A_i × A_j × Ω_i → [0, 1]</formula><p> becomes a joint observation probability that involves both agents' actions. O_j is symmetric to O_i with respect to the joint actions.</p><p>• R_i : IS × A_i × A_j → ℝ: agent i receives the corresponding rewards when it listens, opens the wrong door, and opens the correct door, respectively. These rewards are independent of j's actions.</p></div>
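One plausible reading of the Table 1 parametrization of T_j (reconstructed here under stated assumptions, since the table itself appears elsewhere in the paper) can be written out explicitly:

```python
import numpy as np

def tiger_transition_j(p_T1, p_T2):
    """Sketch of the Table 1 parametrization of T_j for the multi-agent
    tiger game: under listening (L) the tiger stays behind the same door
    with probability p_T1; opening a door (OL, OR) relocates the tiger
    according to p_T2, with p_T2 = 0.5 giving the usual equal-probability
    reset. States: 0 = tiger-left (TL), 1 = tiger-right (TR); rows index
    the current state, columns the next state."""
    return {
        "L":  np.array([[p_T1, 1 - p_T1],
                        [1 - p_T1, p_T1]]),
        "OL": np.array([[p_T2, 1 - p_T2],
                        [p_T2, 1 - p_T2]]),   # reset: independent of current state
        "OR": np.array([[1 - p_T2, p_T2],
                        [1 - p_T2, p_T2]]),
    }
```

With the true parameters of the first experiment (p_T1 = 0.67, p_T2 = 0.5), listening keeps the tiger in place with probability 0.67 and either door-opening action yields a uniform reset.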
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Parameter Space</head><p>For the multi-agent tiger game experiments, we want to learn over all possible intentional models of the other agent j: θ_j = ⟨b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j⟩. We make only the reasonable assumptions that A_j and Ω_j are known and that OC_j is the infinite horizon criterion with discounting. What we actually want to learn is the following:</p><p>• b_j^0: the initial belief of agent j about the physical state. • T_j: the transition function of agent j, which can be parametrized by the two parameters p_T1 and p_T2, as shown in Table <ref type="table" target="#tab_0">1</ref>.</p><p>• O_j: the observation function of agent j, which can be parametrized by the two parameters p_O1 and p_O2, as shown in Table <ref type="table" target="#tab_1">2</ref>.</p><p>• R_j: the reward function of agent j, which can be parametrized by the three parameters p_R1, p_R2 and p_R3, as shown in Table <ref type="table" target="#tab_2">3</ref>.</p><p>It is easy to see that this is an enormous 8-dimensional parameter space to learn from: b_j^0 × p_T1 × p_T2 × p_O1 × p_O2 × p_R1 × p_R2 × p_R3,</p><formula xml:id="formula_20">where b_j^0 ∈ [0, 1] ⊂ ℝ, p_T1 ∈ [0, 1] ⊂ ℝ, p_T2 ∈ [0, 1] ⊂ ℝ, p_O1 ∈ [0, 1] ⊂ ℝ, p_O2 ∈ [0, 1] ⊂ ℝ, p_R1 ∈ ℝ, p_R2 ∈ ℝ, p_R3 ∈ ℝ.</formula><p>We reduce this huge space mainly by two means: utilizing Monte Carlo sampling methods, and giving the parameters problem-specific priors that are not overly informative but provide enough information for the algorithm to learn from.</p></div>
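Drawing candidate models from problem-specific priors can be sketched as follows, using the prior hyperparameters quoted for the first experiment; the function name is illustrative, and we read N(μ, σ) as mean and scale, which is an assumption about the paper's notation:

```python
import numpy as np

def sample_model_priors(n, rng):
    """Draw n candidate models theta_j = <b_j^0, p_T1, p_T2, p_O1, p_O2,
    p_R1, p_R2, p_R3> from the first experiment's priors (one model per
    row; hyperparameters are those quoted in the text)."""
    return np.column_stack([
        rng.uniform(0, 1, n),       # b_j^0 ~ U(0, 1)
        rng.beta(5, 3, n),          # p_T1 ~ Beta(5, 3), mode ~0.67
        rng.beta(5, 5, n),          # p_T2 ~ Beta(5, 5)
        rng.beta(3.5, 1.4, n),      # p_O1 ~ Beta(3.5, 1.4), mode ~0.85
        rng.beta(5, 5, n),          # p_O2 ~ Beta(5, 5)
        rng.normal(-1, 2, n),       # p_R1 ~ N(-1, 2): listening reward
        rng.normal(-100, 4, n),     # p_R2 ~ N(-100, 4): tiger-door penalty
        rng.normal(10, 2, n),       # p_R3 ~ N(10, 2): gold-door reward
    ])
```

Each row is one sampled intentional model; with n = 2000 this matches the sample size used in the experiments, and the probability-valued dimensions stay inside [0, 1] by construction.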
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>For the actual experiments, we fix the number of samples at 2000 and run the algorithm on a two-agent tiger game simulation as described above. We run experiments for learning three different models of agent j:</p><p>1. θ_j1 = ⟨0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10⟩ 2. θ_j2 = ⟨0.5, 1.00, 0.5, 0.95, 0.5, −1, −10, 10⟩ 3. θ_j3 = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10⟩ These models are all special cases, carefully chosen in order to verify the correctness and evaluate the performance of our algorithm. For instance, the first model is a sophisticated one in which the other agent is actually modeling its opponent using a subintentional model, the second is a classic single-agent POMDP, and the third is a very simple one that nevertheless entails a large model space. We want to investigate whether our framework is able to correctly and efficiently learn these models through these experiments. The aim of the first experiment is to learn a relatively complicated model of agent j with θ_j = ⟨0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10⟩, who assumes that others' actions are drawn from a uniform distribution. Equivalently, agent j's actual policy, as shown in figure <ref type="figure" target="#fig_2">2</ref>, is to wait for three consecutive growls from the same direction and then open the corresponding door. For this particular experiment, we simulated the observation history for agent i in order to first verify the correctness of our algorithm, excluding the impact of uncertainties in hearing accuracy. 
The simulated observation history is as follows: {GL,S GL,S GL,S GL,CR GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GL,CL GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S} The priors we assign to each parameter are shown in figure 3; specifically, they are uniform U(0,1) for b_j^0, Beta(5,3) with mode 0.67 for p_T1, Beta(5,5) for p_T2, Beta(3.5,1.4) with mode 0.85 for p_O1, Beta(5,5) for p_O2, Gaussian N(-1,2) for p_R1, N(-100,4) for p_R2, and N(10,2) for p_R3. After 50 time steps, the algorithm converges to a posterior distribution over agent j's intentional models; the results are also given in figure <ref type="figure" target="#fig_3">3</ref>. Since the parameter space of agent j's models is 8-dimensional, we only show the marginal distributions of each parameter as histograms. We can easily see that the majority of samples are centered around the true parameter values.</p><p>We use principal component analysis (PCA) <ref type="bibr" target="#b0">(Abdi and Williams 2010)</ref> to reduce the sample dimensionality to two dimensions and plot the samples in a 3-dimensional histogram, as shown in Figure <ref type="figure" target="#fig_4">4</ref>. The distribution starts from a Gaussian-like prior and gradually converges to the most likely models. Eventually the mean value of this cluster, ⟨0.49, 0.69, 0.49, 0.82, 0.51, -0.95, -99.23, 10.09⟩, is very close to the true model. Here we give two examples from the big cluster after 50 time steps: ⟨0.56, 0.66, 0.49, 0.84, 0.59, -0.95, -101.37, 11.42⟩ and ⟨0.51, 0.68, 0.52, 0.89, 0.56, -1.33, -98.39, 12.55⟩. The former has a corresponding optimal policy of [0-OL-0.10-L-1], while the latter has [0-OL-0.09-L-0.91-OR-1], both of which are extremely close to the optimal policy of the true model: [0-OL-0.1-L-0.9-OR-1]. 
Consequently, the framework is able to predict other agents' actions with high accuracy.</p><p>We tested the performance of our algorithm in terms of the prediction accuracy for others' actions. We compared the results with other modeling approaches: a frequency-based approach, in which agent j is assumed to choose its actions according to a fixed but unknown distribution, and a no-information model that treats j's actions purely as uniform noise. The results shown in figure <ref type="figure" target="#fig_5">5</ref> are plots averaged over 10 random runs, each with 50 time steps. They show clearly that the intentional I-POMDP approach has significantly lower error rates as agent i perceives more observations. The subintentional model assumes that j's action is drawn from a uniform distribution and therefore has a fixed, high error rate. The frequency-based approach has some learning ability but is far from sophisticated enough to model a fully rational agent.</p><p>Figure <ref type="figure">6</ref>: (a) optimal policy for θ_j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩. (b) optimal policy for θ_j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩.</p><p>In the second experiment, we run our algorithm on actual observations for 30 time steps until it converges, and try to learn the model of a simpler classic POMDP with a high listening accuracy of 0.95 and a small penalty of -10, in which agent j alternately opens doors and listens, as shown in Figure <ref type="figure">6</ref>(a). 
The actual model of j is θ j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩; the priors assigned to b 0 j , p T 1 , p T 2 , p O1 , p O2 , p R1 , p R2 , p R3 are U(0,1), Beta(2,0.5), Beta(10,10), Beta(19,1), Beta(10,10), N(-1,1), N(-10,2), N(10,2), and the actual observation history is {GR,S GL,CR GL,S GL,CL GL,S GL,CR GL,S GL,CL GL,S GR,S GR,CL GR,CL GL,S GR,S GR,S GL,CL GR,S GL,CR GR,S GR,CR GR,CR GR,CL GL,S GL,S GL,S GL,CR GL,S GL,CL GR,S GR,S}.</p><p>Similarly, we report the learned posterior distributions over the model parameters in figure <ref type="figure">7</ref>. We observe an interesting pattern: while some parameters, such as b j,0 , p T 2 and p O2 , concentrate around their actual values, others, like p T 1 and p O1 , become more dispersed than their initial priors. The intuition is that, with the penalty and reward at -10 and 10, a single listen at a reward of -1 is enough to decide which door to open. That is, as long as the tiger likely remains behind the same door while the agent listens (the meaning of p T 1 ) and the hearing accuracy is reliable (the meaning of p O1 ), many models satisfy this particular observation sequence, and our algorithm learns them all.</p><p>For conciseness, we show the average prediction error rates for both the second and third experiments in figure <ref type="figure">9</ref>. Both results are averaged over 10 random runs, each of 30 time steps. In the second experiment, shown in figure 9(a), the intentional I-POMDP approach again has significantly lower error rates than the others.</p><p>In the last experiment, we want to learn a model θ j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩, in which j always listens since the listening penalty now equals the reward, as shown in figure 6(b). Figure <ref type="figure">7</ref>: Learned posterior distributions for model θ j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩. 
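The particle initialization implied by these priors can be sketched as follows; this is a minimal illustration (the function name is ours, and the second parameter of each Gaussian is treated as a standard deviation):

```python
import numpy as np

def sample_prior_particles(n, rng):
    """Draw n particles from the second experiment's priors
    (parameter order: b0_j, pT1, pT2, pO1, pO2, pR1, pR2, pR3)."""
    return np.column_stack([
        rng.uniform(0, 1, n),     # b0_j ~ U(0,1)
        rng.beta(2, 0.5, n),      # pT1  ~ Beta(2, 0.5)
        rng.beta(10, 10, n),      # pT2  ~ Beta(10, 10)
        rng.beta(19, 1, n),       # pO1  ~ Beta(19, 1), mode near 1
        rng.beta(10, 10, n),      # pO2  ~ Beta(10, 10)
        rng.normal(-1, 1, n),     # pR1  ~ N(-1, 1)
        rng.normal(-10, 2, n),    # pR2  ~ N(-10, 2)
        rng.normal(10, 2, n),     # pR3  ~ N(10, 2)
    ])

particles = sample_prior_particles(1000, np.random.default_rng(1))
```

Each row is one candidate intentional model of j; the interactive belief update then reweights these rows against j's observed behavior.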
For brevity, we only show the marginal distributions over the model parameters in figure <ref type="figure" target="#fig_7">8</ref>. The priors assigned to b 0 j , p T 1 , p T 2 , p O1 , p O2 , p R1 , p R2 , p R3 are U(0,1), Beta(5,3), Beta(10,10), Beta(3.5,1.4), Beta(10,10), N(10,1), N(-100,2), N(10,2), and the actual observation history i learns from is {GL,S GL,S GR,S GL,S GL,CL GR,S GR,S GL,CL GR,S GL,S GL,S GR,S GL,S GL,S GL,S GL,CL GR,S GL,S GL,S GL,S}. We can see that all three reward parameters are correctly learned, while the samples of p T 1 , p T 2 , p O1 and p O2 are not tightly concentrated around their true values and remain close to their priors: intuitively, these parameters become less important and may lie in a relatively loose region once p R1 = 10. Lastly, the performance comparison is given in figure <ref type="figure">9</ref>(b).</p></div>
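The Bayesian update over j's candidate models can be approximated with a standard sampling-importance-resampling step, sketched below. `action_prob` is a hypothetical helper standing in for the likelihood of j's observed action under a candidate model's solved policy, and `toy_action_prob` is purely illustrative:

```python
import numpy as np

def update_particles(particles, weights, observed_action, action_prob, rng):
    """One update step over j's model particles: reweight each candidate
    model by the likelihood of j's observed action, then resample to
    avoid weight degeneracy (a standard SIR step)."""
    w = weights * np.array([action_prob(m, observed_action) for m in particles])
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)  # resample
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# toy likelihood: models whose listening reward pR1 is near -1 "listen" often
def toy_action_prob(model, action):
    return 0.9 if abs(model[5] + 1) < 1 else 0.1

rng = np.random.default_rng(2)
particles = rng.standard_normal((200, 8))
particles[:, 5] = rng.normal(-1, 2, 200)      # loose prior on pR1
weights = np.full(200, 1.0 / 200)
particles, weights = update_particles(particles, weights, "L",
                                      toy_action_prob, rng)
```

After a few such steps the particle cloud concentrates on models consistent with the observed action sequence, which is exactly the convergence behavior reported in the histograms.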
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusions and Future Work</head><p>We have described a new approach to learning other agents' models by approximating the interactive belief update with Bayesian inference and Monte Carlo sampling methods. Our framework correctly learns others' models over the entire intentional model space and is therefore a generalized reinforcement learning algorithm for multi-agent settings. It also effectively mitigates the belief-space complexity and performs significantly better than other approaches at predicting others' actions.</p><p>In the future, to fully evaluate practicality on larger problem spaces, more multi-agent problems of various sizes could be tested. Due to computational complexity, experiments on higher nesting levels are currently limited; thus, more effort could be devoted to nonparametric Bayesian methods, which inherently deal with nested belief structures.</p><p>Figure <ref type="figure">9</ref>: (a) Prediction error rate vs. observation length for θ j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩. (b) The same for θ j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: An illustration of the interactive belief update for two agents and one level of nesting.</figDesc><graphic coords="4,324.75,122.79,206.79,142.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>a combination of both agents' possible actions: listen (L), open left door (OL) and open right door (OR).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Optimal policy of a no-information model.</figDesc><graphic coords="5,341.08,530.58,174.15,107.29" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Assigned priors and learned posterior distributions over model parameters for model θ j1 = ⟨0.5, 0.67, 0.5, 0.85, 0.5, -1, -100, 10⟩.</figDesc><graphic coords="6,78.06,246.55,215.51,353.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: 3D histogram of all model samples.</figDesc><graphic coords="6,320.40,117.91,215.52,159.48" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Prediction error rate vs observation length.</figDesc><graphic coords="7,120.51,65.51,130.61,97.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Learned posterior distributions for model θ j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩.</figDesc><graphic coords="8,78.06,65.50,215.51,356.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Parameters for transition functions</figDesc><table><row><cell>S</cell><cell>A</cell><cell>TL</cell><cell>TR</cell></row><row><cell>TL</cell><cell>L</cell><cell>p T 1</cell><cell>1 − p T 1</cell></row><row><cell>TR</cell><cell>L</cell><cell>1 − p T 1</cell><cell>p T 1</cell></row><row><cell>*</cell><cell>OL</cell><cell>p T 2</cell><cell>1 − p T 2</cell></row><row><cell>*</cell><cell>OR</cell><cell>1 − p T 2</cell><cell>p T 2</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Parameters for observation functions</figDesc><table><row><cell>S</cell><cell>A</cell><cell>GL</cell><cell>GR</cell></row><row><cell>TL</cell><cell>L</cell><cell>p O1</cell><cell>1 − p O1</cell></row><row><cell>TR</cell><cell>L</cell><cell>1 − p O1</cell><cell>p O1</cell></row><row><cell>*</cell><cell>OL</cell><cell>p O2</cell><cell>1 − p O2</cell></row><row><cell>*</cell><cell>OR</cell><cell>1 − p O2</cell><cell>p O2</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Parameters for reward functions</figDesc><table><row><cell>S</cell><cell>A</cell><cell>R</cell></row><row><cell>*</cell><cell>L</cell><cell>p R1</cell></row><row><cell>TL</cell><cell>OL</cell><cell>p R2</cell></row><row><cell>TR</cell><cell>OR</cell><cell>p R2</cell></row><row><cell>TL</cell><cell>OR</cell><cell>p R3</cell></row><row><cell>TR</cell><cell>OL</cell><cell>p R3</cell></row></table></figure>
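Tables 1-3 together parameterize agent j's frame. A sketch of assembling the transition, observation, and reward functions from the 8-parameter vector might look as follows; Table 1's data cells are assumed to mirror Table 2's layout, and mapping b 0 j to Pr(tiger-left) is likewise our assumption:

```python
import numpy as np

TL, TR = 0, 1        # tiger-left / tiger-right states
L, OL, OR = 0, 1, 2  # listen, open-left, open-right
GL, GR = 0, 1        # growl-left / growl-right observations

def tiger_model(theta):
    """Build belief, T, O, R for the tiger problem from the vector
    (b0, pT1, pT2, pO1, pO2, pR1, pR2, pR3), following Tables 1-3."""
    b0, pT1, pT2, pO1, pO2, pR1, pR2, pR3 = theta
    T = np.empty((3, 2, 2))                       # T[a, s, s']
    T[L]  = [[pT1, 1 - pT1], [1 - pT1, pT1]]      # Table 1 (assumed
    T[OL] = [[pT2, 1 - pT2], [1 - pT2, pT2]]      #  analogous to Table 2)
    T[OR] = [[1 - pT2, pT2], [pT2, 1 - pT2]]
    O = np.empty((3, 2, 2))                       # O[a, s', o], Table 2
    O[L]  = [[pO1, 1 - pO1], [1 - pO1, pO1]]
    O[OL] = [[pO2, 1 - pO2], [pO2, 1 - pO2]]      # same for both states
    O[OR] = [[1 - pO2, pO2], [1 - pO2, pO2]]
    R = np.empty((2, 3))                          # R[s, a], Table 3
    R[:, L] = pR1                                 # listening cost
    R[TL, OL] = R[TR, OR] = pR2                   # opened the tiger's door
    R[TL, OR] = R[TR, OL] = pR3                   # opened the safe door
    belief = np.array([b0, 1 - b0])               # b0 assumed = Pr(TL)
    return belief, T, O, R
```

Each sampled particle can be pushed through `tiger_model` and solved as an ordinary POMDP to obtain the policy used in the likelihood step.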
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/solohan22/IPOMDP.git</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.st.ewi.tudelft.nl/~mtjspaan/pomdp/index_en.html</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Principal component analysis</title>
		<author>
			<persName><forename type="first">H</forename><surname>Abdi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Williams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wiley interdisciplinary reviews: computational statistics</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="433" to="459" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">An introduction to sequential Monte Carlo methods</title>
		<author>
			<persName><forename type="first">A</forename><surname>Doucet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>De Freitas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gordon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Sequential Monte Carlo methods in practice</title>
				<meeting><address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="3" to="14" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Monte Carlo sampling methods for approximating interactive POMDPs</title>
		<author>
			<persName><forename type="first">P</forename><surname>Doshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Gmytrasiewicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="297" to="337" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations</title>
		<author>
			<persName><forename type="first">F</forename><surname>Doshi-Velez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Konidaris</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1308.3513</idno>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Non-linear filtering: interacting particle resolution</title>
		<author>
			<persName><forename type="first">P</forename><surname>Del Moral</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Markov processes and related fields</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="555" to="581" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Graphical models for interactive POMDPs: representations and solutions</title>
		<author>
			<persName><forename type="first">P</forename><surname>Doshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Autonomous Agents and Multi-Agent Systems</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="376" to="416" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Introducing Markov chain Monte Carlo</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">R</forename><surname>Gilks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Richardson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Spiegelhalter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Markov chain Monte Carlo in practice</title>
				<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">19</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A framework for sequential planning in multi-agent settings</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Gmytrasiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Doshi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="49" to="79" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>


<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Novel approach to nonlinear/non-Gaussian Bayesian state estimation</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">J</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Salmond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEE Proceedings F (Radar and Signal Processing)</title>
		<imprint>
			<biblScope unit="volume">140</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="107" to="113" />
			<date type="published" when="1993-04">1993. April</date>
			<publisher>IET Digital Library</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Planning and acting in partially observable stochastic domains</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">P</forename><surname>Kaelbling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Littman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Cassandra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial intelligence</title>
		<imprint>
			<biblScope unit="volume">101</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="99" to="134" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Perseus: Randomized point-based value iteration for POMDPs</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Spaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Vlassis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of artificial intelligence research</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="195" to="220" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The infinite regionalized policy representation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Carin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th International Conference on Machine Learning (ICML-11)</title>
				<meeting>the 28th International Conference on Machine Learning (ICML-11)</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="769" to="776" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Bayesian Learning of Other Agents&apos; Finite Controllers for Interactive POMDPs</title>
		<author>
			<persName><forename type="first">A</forename><surname>Panella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gmytrasiewicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Thirtieth AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Probabilistic reasoning in intelligent systems: Networks of plausible reasoning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pearl</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1988">1988</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Bayes-adaptive POMDPs</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chaib-Draa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pineau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="1225" to="1232" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
