<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Learning Others&apos; Intentional Models in Multi-Agent Settings Using Interactive POMDPs</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yanlin</forename><surname>Han</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
<orgName type="institution">University of Illinois at Chicago</orgName>
								<address>
									<postCode>60607</postCode>
									<region>IL</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Piotr</forename><surname>Gmytrasiewicz</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
<orgName type="institution">University of Illinois at Chicago</orgName>
								<address>
									<postCode>60607</postCode>
									<region>IL</region>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Learning Others&apos; Intentional Models in Multi-Agent Settings Using Interactive POMDPs</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C1E5AE1B853E8F550800109F7EA0850B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T02:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Interactive partially observable Markov decision processes (I-POMDPs) provide a principled framework for planning and acting in a partially observable, stochastic and multiagent environment, extending POMDPs to multi-agent settings by including models of other agents in the state space and forming a hierarchical belief structure. In order to predict other agents' actions using I-POMDP, we propose an approach that effectively uses Bayesian inference and sequential Monte Carlo (SMC) sampling to learn others' intentional models which ascribe them beliefs, preferences and rationality in action selection. For problems of various complexities, empirical results show that our algorithm accurately learns models of other agents and has superior performance in comparison with other methods. Our approach serves as a generalized reinforcement learning algorithm that learns over other agents' transition, observation and reward functions. It also effectively mitigates the belief space complexity due to the nested belief hierarchy.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Partially observable Markov decision processes (POMDPs) <ref type="bibr" target="#b9">(Kaelbling, Littman, and Cassandra 1998)</ref> provide a principled, decision-theoretic framework for planning under uncertainty in a partially observable, stochastic environment. An autonomous agent operates rationally in such settings by maintaining a belief over the physical state at any given time and sequentially choosing the optimal actions that maximize future rewards. Solutions of POMDPs are therefore mappings from an agent's beliefs to actions. Although POMDPs can be used in multi-agent settings, doing so requires the strong assumption that the effects of other agents' actions can be treated implicitly as noise and folded into the state transitions, as in recent Bayes-adaptive POMDPs <ref type="bibr" target="#b14">(Ross, Draa, and Pineau 2007)</ref>, the infinite generalized policy representation <ref type="bibr" target="#b11">(Liu, Liao, and Carin 2011)</ref>, and infinite POMDPs <ref type="bibr" target="#b3">(Doshi-Velez et al. 2013)</ref>. Consequently, an agent's beliefs about other agents are not part of the solutions of POMDPs.</p><p>The interactive POMDP (I-POMDP) <ref type="bibr">(Gmytrasiewicz and Doshi 2005</ref>) is a generalization of the POMDP to multi-agent settings that replaces POMDP belief spaces with interactive hierarchical belief systems. Specifically, it augments the plain beliefs about the physical states in a POMDP by including models of other agents, forming a hierarchical belief structure that represents an agent's belief about the physical state, its belief about the other agents, and their beliefs about others' beliefs. The models of other agents included in the augmented state space are of two types: intentional models and subintentional models. 
The sophisticated intentional model ascribes beliefs, preferences, and rationality to other agents <ref type="bibr">(Gmytrasiewicz and Doshi 2005)</ref>, while the simpler subintentional model, such as a finite state controller <ref type="bibr" target="#b12">(Panella and Gmytrasiewicz 2016)</ref>, does not. Solutions of I-POMDPs map an agent's belief about the environment and about other agents' models to actions, making the framework applicable to important agent, human, and mixed agent-human applications. It has been shown <ref type="bibr">(Gmytrasiewicz and Doshi 2005</ref>) that the added sophistication of modeling others as rational agents yields a value function that dominates the one obtained by simply treating others as noise, which implies the modeling superiority of I-POMDPs over other approaches for multi-agent systems.</p><p>However, the interactive belief augmentation in I-POMDPs drastically increases the belief space complexity, adding to the curse of dimensionality: the complexity of the belief representation grows with the belief dimensions, because the number of agent models grows exponentially with the nesting level. Since exact solutions to POMDPs are proven to be PSPACE-complete for finite horizons and undecidable for infinite horizons <ref type="bibr">(Papadimitriou and Tsitsiklis 1987)</ref>, the time complexity of the more general I-POMDPs, which may contain multiple POMDPs and I-POMDPs of other agents, is at least PSPACE-complete for finite horizons and undecidable for infinite horizons. Due to this severe space complexity, no complete belief update has yet been accomplished using the sophisticated intentional models over the entire interactive belief space. 
There are only partial updates of other agents' beliefs about the physical states <ref type="bibr" target="#b2">(Doshi and Gmytrasiewicz 2009)</ref> and indirect approaches such as subintentional finite state controllers <ref type="bibr" target="#b12">(Panella and Gmytrasiewicz 2016)</ref>. Therefore, in order to unleash the full modeling power of intentional models and apply I-POMDPs in more realistic settings, a good approximation algorithm for computing the nested interactive beliefs and predicting other agents' actions is crucial to the trade-off between solution quality and computational complexity.</p><p>To address this issue, we propose a Bayesian approach that uses customized sequential Monte Carlo sampling algorithms <ref type="bibr" target="#b1">(Doucet, De Freitas, and Gordon 2001)</ref> to obtain approximate solutions to I-POMDPs, and we implement the algorithms in a software package<ref type="foot" target="#foot_0">1</ref>. Specifically, we assume that the models of other agents are unknown and must be learned from imperfect observations of their behavior. We parametrize other agents' intentional models and maintain a belief over them, making sequential Bayesian updates using only observations from the environment. Since this Bayesian inference task is analytically intractable, we approximate the posterior distribution with a customized sequential Monte Carlo method that descends the belief hierarchy and samples all model parameters at each nesting level, starting from the interactive particle filter (I-PF) <ref type="bibr" target="#b2">(Doshi and Gmytrasiewicz 2009)</ref> for the I-POMDP belief update.</p><p>Our approach, for the first time, successfully learns others' models over the entire intentional model space, which contains their initial beliefs and their transition, observation, and reward functions, making it a generalized reinforcement learning method for multi-agent settings. 
Our algorithm accurately predicts others' actions in various problem settings, thereby enabling the modeling agent to take the corresponding optimal actions that maximize its own rewards. By approximating the Bayesian inference with a customized sequential Monte Carlo sampling method, we significantly mitigate the belief space complexity of I-POMDPs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Background POMDP</head><p>A partially observable Markov decision process (POMDP) <ref type="bibr" target="#b9">(Kaelbling, Littman, and Cassandra 1998</ref>) is a general reinforcement learning model for planning and acting in a single-agent, partially observable, stochastic domain. It is defined for a single agent i as:</p><formula xml:id="formula_0">POMDP_i = ⟨S, A_i, Ω_i, T_i, O_i, R_i⟩ (1)</formula><p>where the elements of the 6-tuple are:</p><p>• S is the set of states of the environment.</p><p>• A_i is the set of agent i's possible actions.</p><formula xml:id="formula_1">• Ω_i is the set of agent i's possible observations. • T_i : S × A_i × S → [0, 1] is the state transition function. • O_i : S × A_i × Ω_i → [0, 1] is the observation function. • R_i : S × A_i → ℝ is the reward function.</formula><p>Given the definition above, an agent's belief about the state can be represented as a probability distribution over S. The belief update is done using the following formula, where α is the normalizing constant:</p><formula xml:id="formula_2">b′(s′) = α O(s′, a, o) Σ_{s∈S} T(s, a, s′) b(s)<label>(2)</label></formula><p>Given the agent's belief, the optimal action, a*, is part of the set of optimal actions, OPT(b_i), for the belief state, defined as:</p><formula xml:id="formula_3">OPT(b_i) = argmax_{a_i∈A_i} { Σ_{s∈S} b_i(s) R(s, a_i) + Σ_{o_i∈Ω_i} P(o_i | a_i, b_i) × U(SE(b_i, a_i, o_i)) } (3)</formula></div>
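The belief update of Eq. (2) can be sketched numerically. This is a minimal illustration, assuming tabular transition and observation functions indexed by action (the dictionary layout of `T` and `O` is an implementation choice, not from the paper):

```python
import numpy as np

def pomdp_belief_update(b, a, o, T, O):
    """One step of the POMDP belief update (Eq. 2):
    b'(s') = alpha * O(s', a, o) * sum_s T(s, a, s') * b(s).
    T[a] is an |S| x |S| matrix with T[a][s, s'] = P(s' | s, a);
    O[a] is an |S| x |Omega| matrix with O[a][s', o] = P(o | s', a)."""
    b_next = O[a][:, o] * (b @ T[a])  # unnormalized posterior over s'
    return b_next / b_next.sum()      # dividing by the sum applies alpha
```

For instance, in a tiger-style problem with listening accuracy 0.85 and a tiger that stays put while the agent listens, hearing a growl from the left moves a uniform belief to (0.85, 0.15).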
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Particle Filter</head><p>The Markov chain Monte Carlo (MCMC) method <ref type="bibr" target="#b5">(Gilks et al., 1996)</ref> is widely used to approximate probability distributions that cannot be computed directly. It generates samples from a posterior distribution π(x) over a state space x by simulating a Markov chain p(x′|x) whose state space is x and whose stationary distribution is π(x). The samples drawn from p converge to the target distribution π as the number of samples goes to infinity.</p><p>In order to make MCMC work on sequential inference tasks, especially sequential decision making under Markov assumptions, sequential versions of Monte Carlo methods have been proposed, some of which can deal with high-dimensional and/or complex problems, such as particle filters <ref type="bibr">(Del Moral 1996)</ref>. At each time step, a particle filter draws samples (or particles) from a proposal distribution, commonly p(x_t|x_{t−1}), the conditional distribution of the current state x_t given the previous state x_{t−1}; it then uses the observation function p(y_t|x_t) to compute an importance weight for each particle and resamples all particles according to the weights.</p></div>
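The propagate-weight-resample loop just described can be sketched as a single bootstrap-filter step. This is a generic illustration, not the paper's implementation; `propagate` and `likelihood` stand in for p(x_t|x_{t−1}) and p(y_t|x_t):

```python
import numpy as np

def bootstrap_filter_step(particles, propagate, likelihood, y, rng):
    """One bootstrap-filter step: draw each particle from the proposal
    p(x_t | x_{t-1}), weight it by the observation likelihood
    p(y_t | x_t), then resample in proportion to the normalized weights."""
    proposed = np.array([propagate(x, rng) for x in particles])
    w = np.array([likelihood(y, x) for x in proposed])
    w /= w.sum()                                  # normalize importance weights
    idx = rng.choice(len(proposed), size=len(proposed), p=w)
    return proposed[idx]                          # resampled, equally weighted
```

On a 1-D Gaussian random walk, one step with an observation near x = 4 pulls an initially diffuse particle cloud toward the observation.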
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The Model I-POMDP framework</head><p>An interactive POMDP of agent i, I-POMDP_i, is defined as:</p><formula xml:id="formula_4">I-POMDP_i = ⟨IS_{i,l}, A, Ω_i, T_i, O_i, R_i⟩ (4)</formula><p>where IS_{i,l} is the set of interactive states of the environment, defined as IS_{i,l} = S × M_{j,l−1}, l ≥ 1, where S is the set of states, M_{j,l−1} is the set of possible models of agent j, and l is the strategy level. A specific class of models are the (l−1)th-level intentional models, Θ_{j,l−1}, of agent j: θ_{j,l−1} = ⟨b_{j,l−1}, A, Ω_j, T_j, O_j, R_j, OC_j⟩, where b_{j,l−1} is agent j's belief nested to level (l−1), b_{j,l−1} ∈ Δ(IS_{j,l−1}), and OC_j is j's optimality criterion. The intentional model θ_{j,l−1}, sometimes referred to as a type, can be rewritten as θ_{j,l−1} = ⟨b_{j,l−1}, θ̂_j⟩, where θ̂_j includes all elements of the intentional model other than the belief and is called agent j's frame.</p><p>IS_{i,l} can be defined inductively (note that when the frame θ̂_j is known, θ_j reduces to b_j):</p><formula xml:id="formula_5">IS_{i,0} = S, Θ_{j,0} = {⟨b_{j,0}, θ̂_j⟩ : b_{j,0} ∈ Δ(S)} IS_{i,1} = S × Θ_{j,0}, Θ_{j,1} = {⟨b_{j,1}, θ̂_j⟩ : b_{j,1} ∈ Δ(IS_{j,1})} ...... <label>(5)</label></formula><formula xml:id="formula_7">IS_{i,l} = S × Θ_{j,l−1}, Θ_{j,l} = {⟨b_{j,l}, θ̂_j⟩ : b_{j,l} ∈ Δ(IS_{j,l})}</formula><p>All other components of an I-POMDP are similar to those of a POMDP:</p><formula xml:id="formula_8">• A = A_i × A_j is the set of joint actions of all agents. • Ω_i is the set of agent i's possible observations. • T_i : S × A × S → [0, 1] is the state transition function. • O_i : S × A × Ω_i → [0, 1] is the observation function. • R_i : IS_i × A → ℝ is the reward function.</formula></div>
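The nesting of interactive states, intentional models, and frames defined above can be made concrete as a small data-structure sketch. The class and field names are illustrative choices, not from the paper:

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence

@dataclass
class Frame:
    """Agent j's frame (theta-hat_j): every element of the intentional
    model except the belief."""
    A_j: Sequence[str]           # actions
    Omega_j: Sequence[str]       # observations
    T_j: Dict[Any, Any]          # transition function
    O_j: Dict[Any, Any]          # observation function
    R_j: Dict[Any, Any]          # reward function
    OC_j: str = "infinite-horizon-discounted"  # optimality criterion

@dataclass
class IntentionalModel:
    """theta_{j,l-1} = <b_{j,l-1}, theta-hat_j>: a nested belief plus a frame."""
    belief: Any                  # b_{j,l-1} in Delta(IS_{j,l-1})
    frame: Frame

@dataclass
class InteractiveState:
    """An element of IS_{i,l} = S x Theta_{j,l-1}: a physical state paired
    with an intentional model of agent j."""
    s: str
    theta_j: IntentionalModel
```

A level-1 interactive state for the tiger game would pair a physical state such as "TL" with a level-0 intentional model of j whose belief is a distribution over S.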
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Interactive belief update</head><p>Given the definitions above, the interactive belief update can be performed as follows:</p><formula xml:id="formula_9">b_i^t(is^t) = Pr(is^t | b_i^{t−1}, a_i^{t−1}, o_i^t) (6) = α Σ_{is^{t−1}} b(is^{t−1}) Σ_{a_j^{t−1}} Pr(a_j^{t−1} | θ_j^{t−1}) T(s^{t−1}, a^{t−1}, s^t) × O_i(s^t, a^{t−1}, o_i^t) Σ_{o_j^t} O_j(s^t, a^{t−1}, o_j^t) τ(b_j^{t−1}, a_j^{t−1}, o_j^t, b_j^t)</formula><p>Unlike the plain belief update in a POMDP, the interactive belief update in an I-POMDP takes two additional sophistications into account. Firstly, the probabilities of the other agent's actions given its models (the second summation) need to be computed, since the state of the physical environment now depends on both agents' actions. Secondly, the agent needs to update its beliefs based on the anticipation of what observations the other agent might receive and how it updates its own beliefs (the third summation).</p><p>The optimal action, a*, for the case of the infinite horizon criterion with discounting, is part of the set of optimal actions, OPT(θ_i), for the belief state, defined as:</p><formula xml:id="formula_10">OPT(θ_i) = argmax_{a_i∈A_i} { Σ_{is∈IS} b_i(is) ER_i(is, a_i) + Σ_{o_i∈Ω_i} P(o_i | a_i, b_i) × U(⟨SE_{θ_i}(b_i, a_i, o_i), θ̂_i⟩) } (7)</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sampling Algorithms</head><p>The interactive particle filter (I-PF) <ref type="bibr" target="#b2">(Doshi and Gmytrasiewicz 2009)</ref> was proposed as a filtering algorithm for the interactive belief update in I-POMDPs. It generalizes the classic particle filter to multi-agent settings and uses the state transition function as the proposal distribution, as is done in the specific particle filter known as the bootstrap filter <ref type="bibr">(Gordon et al. 1993)</ref>. However, due to the enormous belief space, the I-PF assumes that the other agent's frame θ̂_j is known to the modeling agent, thereby simplifying the belief update from S × Θ_{j,l−1} to the significantly smaller space S × b_{j,l−1}. The intuition behind our algorithm is to assign appropriate prior distributions over all of agent j's possible models θ_j = ⟨b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j⟩ and to sample from each of their dimensions. At each time step, we update all samples using the perceived observations, namely computing and assigning a weight to each sample, and resample them according to the weights. Finally, since this is a randomized Monte Carlo method, to prevent the learning algorithm from converging to incorrect models we add another resampling step that samples from neighboring similar models given the current samples. 
Consequently, our algorithm is able to maintain a probability distribution over the most likely models of other agents and eventually learn their optimal actions.</p><formula xml:id="formula_11">Algorithm 1: Interactive Belief Update
b̃_{k,l}^t = InteractiveBeliefUpdate(b̃_{k,l}^{t−1}, a_k^{t−1}, o_k^t, l &gt; 0)
1 for is_k^{(n),t−1} = ⟨s^{(n),t−1}, θ_{−k}^{(n),t−1}⟩ ∈ b̃_{k,l}^{t−1}:
2   sample a_{−k}^{t−1} ~ P(A_{−k} | θ_{−k}^{(n),t−1})
3   sample s^{(n),t} ~ T_k(S^t | S^{(n),t−1}, a_k^{t−1}, a_{−k}^{t−1})
4   for o_{−k}^t ∈ Ω_{−k}:
5     if l = 1:
6       b_{−k,0}^{(n),t} = Level0BeliefUpdate(b_{−k,0}^{(n),t−1}, a_{−k}^{t−1}, o_{−k}^t, θ_{−k}^{(n),t−1})
7       θ_{−k}^{(n),t} = ⟨b_{−k,0}^{(n),t}, θ̂_{−k}^{(n),t−1}⟩
8       is_k^{(n),t} = ⟨s^{(n),t}, θ_{−k}^{(n),t}⟩
9     else:
10      b_{−k,l−1}^{(n),t} = InteractiveBeliefUpdate(b̃_{−k,l−1}^{t−1}, a_{−k}^{t−1}, o_{−k}^t, l−1)
11      θ_{−k}^{(n),t} = ⟨b_{−k,l−1}^{(n),t}, θ̂_{−k}^{(n),t−1}⟩
12      is_k^{(n),t} = ⟨s^{(n),t}, θ_{−k}^{(n),t}⟩
13    w_t^{(n)} = O_{−k}^{(n)}(o_{−k}^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1})
14    w_t^{(n)} = w_t^{(n)} × O_k(o_k^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1})
15    b̃_{k,l}^{temp} = ⟨is_k^{(n),t}, w_t^{(n)}⟩
16 normalize all w_t^{(n)} so that Σ_{n=1}^N w_t^{(n)} = 1
17 resample from b̃_{k,l}^{temp} according to the normalized w_t^{(n)}
18 resample θ_{−k}^{(n),t} from neighboring similar models
19 return b̃_{k,l}^t = {is_k^{(n),t}}</formula><p>The interactive belief update described in Algorithm 1 is similar to the I-PF in terms of its recursive Monte Carlo sampling and nesting hierarchy, but it has three major differences. Firstly, the belief update is over the entire intentional model space of other agents, so the initial set of N samples contains</p><formula xml:id="formula_12">θ_{−k}^{(n),t−1} = ⟨b_{−k,l−1}^{(n),t−1}, A_{−k}, Ω_{−k}, T_{−k}^{(n)}, O_{−k}^{(n)}, R_{−k}^{(n)}, OC_{−k}⟩,</formula><p>where k denotes the modeling agent and −k denotes all other modeled agents. We only assume that the actions A_{−k}, observations Ω_{−k} and optimality criteria OC_{−k} are known, as in a multi-agent game the rules are usually known to all agents or could be obtained through intelligence. 
Secondly, note that the observation function</p><formula xml:id="formula_13">O_{−k}^{(n)}(o_{−k}^t | s^{(n),t}, a_k^{t−1}, a_{−k}^{t−1})</formula><p>in line 13 is now randomized as well, since each one is a particular observation function of that agent. Lastly, we add another resampling step in line 18 in order to avoid divergence, resampling each dimension of the model samples from a Gaussian distribution whose mean is the current sample value. Intuitively, similar models are resampled from a relatively tight neighboring region of the current model samples to maintain learning accuracy.</p><p>Algorithm 1 can be viewed as two major steps. The importance sampling step (lines 1 to 16) samples from the belief priors b̃_{k,l}^{t−1}, propagates the samples forward using the related proposal distributions, and computes the weights of all samples. The selection, or resampling, step (lines 17 to 18) resamples according to the weights and the similar models. Specifically, the algorithm starts from a set of initial priors is_k^{(n),t−1}; for each of them, it samples the other agents' optimal action a_{−k}^{t−1} from their policy P(A_{−k} | θ_{−k}^{(n),t−1}), which is solved using a very efficient POMDP solver called Perseus<ref type="foot" target="#foot_1">2</ref>  <ref type="bibr" target="#b10">(Spaan and Vlassis 2005)</ref>. It then samples the physical state s^t using the state transition T_k(S^t | S^{(n),t−1}, a_k^{t−1}, a_{−k}^{t−1}). Once a_{−k}^{t−1} and s^t are sampled, the algorithm calls the 0-level belief update (lines 5 to 8), described in Algorithm 2, to update the other agents' plain beliefs b_{−k,0}^t if the current nesting level l is 1, or recursively calls itself at a lower level l−1 (lines 9 to 12) if the current nesting level is greater than 1. The sample weights w_t^{(n)} are computed according to the observation likelihoods of the modeling and modeled agents (lines 13, 14), and are then normalized so that they sum to 1 (line 16). 
Lastly, the algorithm resamples the intermediate samples according to the computed weights (line 17) and resamples another time from similar neighboring models (line 18).</p><formula xml:id="formula_14">Algorithm 2: Level-0 Belief Update
b_{−k,0}^t = Level0BeliefUpdate(b_{−k,0}^{t−1}, a_{−k}^{t−1}, o_{−k}^t, T_{−k}^{(n)}, O_{−k}^{(n)})
1 P(a_k^{t−1}) = 1/|A_k|
2 for s^t ∈ S:
3   for s^{t−1} ∈ S:
4     for a_k^{t−1} ∈ A_k:
5       P^{(n)}(s^t | s^{t−1}, a_{−k}^{t−1}) = T_{−k}^{(n)}(s^t | s^{t−1}, a_{−k}^{t−1}, a_k^{t−1}) × P(a_k^{t−1})
6       sum^{(n)} += P^{(n)}(s^t | s^{t−1}, a_{−k}^{t−1}) b_{−k,0}^{t−1}(s^{t−1})
7   for a_k^{t−1} ∈ A_k:
8     P^{(n)}(o_{−k}^t | s^t, a_{−k}^{t−1}) += O_{−k}^{(n)}(o_{−k}^t | s^t, a_{−k}^{t−1}, a_k^{t−1}) P(a_k^{t−1})
9   b_{−k,0}^t(s^t) = sum^{(n)} × P^{(n)}(o_{−k}^t | s^t, a_{−k}^{t−1})
10 normalize and return b_{−k,0}^t</formula><p>The 0-level belief update, described in Algorithm 2, is similar to the POMDP belief update but treats the other agents' actions as noise and randomizes the state transition and observation functions as input parameters. It assumes that the other agents in the environment choose their actions according to a uniform distribution (line 1), and is therefore essentially a no-information model. For each possible action a_k^{t−1}, it computes the actual state transition (line 5) and the actual observation function (line 8) by marginalizing over the others' actions, and returns the normalized belief b_{−k,0}^t. Notice that the transition function T_{−k}^{(n)}(s^t | s^{t−1}, a_{−k}^{t−1}, a_k^{t−1}) and observation function O_{−k}^{(n)}(o_{−k}^t | s^t, a_{−k}^{t−1}, a_k^{t−1}) are now both samples from the input arguments, depending on the model parameters of the actual agent at the 0th level. In figure <ref type="figure" target="#fig_0">1</ref>, we illustrate the interactive belief update using the problem discussed in the following section. 
Suppose there are two agents i and j in the environment, the sample size is 8, and the nesting level is 2; the subscripts in figure <ref type="figure" target="#fig_0">1</ref> denote the corresponding agents, and each dot represents a particular belief sample. The propagate step corresponds to lines 2 to 12 in Algorithm 1, the weight step to lines 13 to 16, and the resample step to lines 17 and 18. The belief update for a particular level-0 model sample (θ_j = ⟨b_j(s) = 0.5, p_T1 = 0.67, p_T2 = 0.5, p_O1 = 0.85, p_O2 = 0.5, p_R1 = −1, p_R2 = −100, p_R3 = 10⟩) is solved using Algorithm 2, and the optimal action is computed by calling the Perseus POMDP solver.</p></div>
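The marginalization at the heart of Algorithm 2 can be sketched compactly with matrix operations. This is an illustrative reading of the algorithm, not the paper's code; `T_j` and `O_j` hold one sampled transition/observation matrix per joint action:

```python
import numpy as np

def level0_belief_update(b, a_j, o_j, T_j, O_j, A_k):
    """Sketch of Algorithm 2: a level-0 agent j treats the other agent k's
    actions as uniform noise and marginalizes them out.
    T_j[(a_j, a_k)] is an |S| x |S| transition matrix and
    O_j[(a_j, a_k)] an |S| x |Omega_j| observation matrix, both sampled
    model parameters of the modeled agent."""
    p_ak = 1.0 / len(A_k)                         # line 1: uniform P(a_k)
    # lines 2-6: marginal transition under uniform a_k, applied to the prior
    T_marg = sum(T_j[(a_j, ak)] for ak in A_k) * p_ak
    pred = b @ T_marg                             # predicted distribution over s^t
    # lines 7-8: marginal observation likelihood under uniform a_k
    O_marg = sum(O_j[(a_j, ak)] for ak in A_k) * p_ak
    post = pred * O_marg[:, o_j]                  # line 9: combine prediction and
    return post / post.sum()                      # likelihood; line 10: normalize
```

When the sampled matrices do not actually depend on a_k, the result reduces to the single-agent POMDP update, as expected of a no-information model.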
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiments Setup</head><p>We present results on the multi-agent tiger game <ref type="bibr">(Gmytrasiewicz and Doshi 2005)</ref> with various settings. The multi-agent tiger game generalizes the classical single-agent tiger game <ref type="bibr" target="#b9">(Kaelbling, Littman, and Cassandra 1998)</ref> by adding observations caused by the other players' actions. The generalized multi-agent game contains additional observations regarding the other players, while the state transition and reward functions involve the others' actions as well.</p><p>Consider a specific game instance with known parameters: a tiger and a pile of gold are behind two doors, respectively; both players can listen for a growl of the tiger and a creak caused by the other player, or open a door, which resets the tiger's location with equal probability. The observation accuracies regarding the tiger and the other player are both relatively high (0.85 and 0.9, respectively). No matter which player triggers it, the reward for the listening action is -1, opening the tiger door yields -100, and opening the gold door yields 10.</p><formula xml:id="formula_15">Table 1 (the parametrized transition function T_j), with rows as ⟨state, action⟩ pairs and columns as next-state probabilities (TL, TR): ⟨TL, L⟩: (p_T1, 1−p_T1); ⟨TR, L⟩: (1−p_T1, p_T1); ⟨*, OL⟩: (p_T2, 1−p_T2); ⟨*, OR⟩: (1−p_T2, p_T2).</formula><p>For the sake of brevity, we restrict the experiments to a two-agent setting and a nesting level of one, but the sampling algorithm extends to any number of agents and nesting levels in a straightforward manner. Recall that an interactive POMDP of agent i is defined as the six-tuple</p><formula xml:id="formula_16">I-POMDP_i = ⟨IS_{i,l}, A, Ω_i, T_i, O_i, R_i⟩.</formula><p>Thus, for the specific setting of the multi-agent tiger problem:</p><formula xml:id="formula_17">• IS_{i,1} = S × Θ_{j,0}</formula><p>, where S = {tiger on the left (TL), tiger on the right (TR)} and Θ_{j,0} = {⟨b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j⟩}. 
• Ω_i is the set of all combinations of each agent's possible observations: growl from the left (GL) or right (GR), combined with creak from the left (CL), creak from the right (CR), or silence (S).</p><formula xml:id="formula_18">• A = A_i × A_j is the set of joint actions of both agents.</formula><p>• T_i = T_j : S × A_i × A_j × S → [0, 1] is a joint state transition probability that involves both agents' actions.</p><formula xml:id="formula_19">• O_i : S × A_i × A_j × Ω_i → [0, 1]</formula><p> becomes a joint observation probability that involves both agents' actions. O_j is symmetric to O_i with respect to the joint actions.</p><p>• R_i : IS × A_i × A_j → ℝ: agent i receives the corresponding rewards when it listens, opens the wrong door, and opens the correct door, respectively. These rewards are independent of j's actions.</p></div>
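One plausible reading of the Table 1 parametrization of T_j (reconstructed here under stated assumptions, since the table itself appears elsewhere in the paper) can be written out explicitly:

```python
import numpy as np

def tiger_transition_j(p_T1, p_T2):
    """Sketch of the Table 1 parametrization of T_j for the multi-agent
    tiger game: under listening (L) the tiger stays behind the same door
    with probability p_T1; opening a door (OL, OR) relocates the tiger
    according to p_T2, with p_T2 = 0.5 giving the usual equal-probability
    reset. States: 0 = tiger-left (TL), 1 = tiger-right (TR); rows index
    the current state, columns the next state."""
    return {
        "L":  np.array([[p_T1, 1 - p_T1],
                        [1 - p_T1, p_T1]]),
        "OL": np.array([[p_T2, 1 - p_T2],
                        [p_T2, 1 - p_T2]]),   # reset: independent of current state
        "OR": np.array([[1 - p_T2, p_T2],
                        [1 - p_T2, p_T2]]),
    }
```

With the true parameters of the first experiment (p_T1 = 0.67, p_T2 = 0.5), listening keeps the tiger in place with probability 0.67 and either door-opening action yields a uniform reset.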
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Parameter Space</head><p>For the multi-agent tiger game experiments, we want to learn over all possible intentional models of the other agent j: θ_j = ⟨b_j(s), A_j, Ω_j, T_j, O_j, R_j, OC_j⟩. We make only the reasonable assumptions that A_j and Ω_j are known and that OC_j is the infinite horizon criterion with discounting. What we actually want to learn is the following:</p><p>• b_j^0: the initial belief of agent j about the physical state. • T_j: the transition function of agent j, which can be parametrized by the two parameters p_T1 and p_T2, as shown in Table <ref type="table" target="#tab_0">1</ref>.</p><p>• O_j: the observation function of agent j, which can be parametrized by the two parameters p_O1 and p_O2, as shown in Table <ref type="table" target="#tab_1">2</ref>.</p><p>• R_j: the reward function of agent j, which can be parametrized by the three parameters p_R1, p_R2 and p_R3, as shown in Table <ref type="table" target="#tab_2">3</ref>.</p><p>It is easy to see that this is an enormous 8-dimensional parameter space to learn from: b_j^0 × p_T1 × p_T2 × p_O1 × p_O2 × p_R1 × p_R2 × p_R3,</p><formula xml:id="formula_20">where b_j^0 ∈ [0, 1] ⊂ ℝ, p_T1 ∈ [0, 1] ⊂ ℝ, p_T2 ∈ [0, 1] ⊂ ℝ, p_O1 ∈ [0, 1] ⊂ ℝ, p_O2 ∈ [0, 1] ⊂ ℝ, p_R1 ∈ ℝ, p_R2 ∈ ℝ, p_R3 ∈ ℝ.</formula><p>We reduce this huge space mainly by two means: utilizing Monte Carlo sampling methods, and giving the parameters problem-specific priors that are not overly informative but provide enough information for the algorithm to learn from.</p></div>
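Drawing candidate models from problem-specific priors can be sketched as follows, using the prior hyperparameters quoted for the first experiment; the function name is illustrative, and we read N(μ, σ) as mean and scale, which is an assumption about the paper's notation:

```python
import numpy as np

def sample_model_priors(n, rng):
    """Draw n candidate models theta_j = <b_j^0, p_T1, p_T2, p_O1, p_O2,
    p_R1, p_R2, p_R3> from the first experiment's priors (one model per
    row; hyperparameters are those quoted in the text)."""
    return np.column_stack([
        rng.uniform(0, 1, n),       # b_j^0 ~ U(0, 1)
        rng.beta(5, 3, n),          # p_T1 ~ Beta(5, 3), mode ~0.67
        rng.beta(5, 5, n),          # p_T2 ~ Beta(5, 5)
        rng.beta(3.5, 1.4, n),      # p_O1 ~ Beta(3.5, 1.4), mode ~0.85
        rng.beta(5, 5, n),          # p_O2 ~ Beta(5, 5)
        rng.normal(-1, 2, n),       # p_R1 ~ N(-1, 2): listening reward
        rng.normal(-100, 4, n),     # p_R2 ~ N(-100, 4): tiger-door penalty
        rng.normal(10, 2, n),       # p_R3 ~ N(10, 2): gold-door reward
    ])
```

Each row is one sampled intentional model; with n = 2000 this matches the sample size used in the experiments, and the probability-valued dimensions stay inside [0, 1] by construction.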
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>For the actual experiments, we fix the number of samples at 2000 and run the algorithm on a two-agent tiger game simulation as described above. We run experiments for learning three different models of agent j:</p><p>1. θ_j1 = ⟨0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10⟩ 2. θ_j2 = ⟨0.5, 1.00, 0.5, 0.95, 0.5, −1, −10, 10⟩ 3. θ_j3 = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, −100, 10⟩ These models are all special cases, carefully chosen in order to verify the correctness and evaluate the performance of our algorithm. For instance, the first model is a sophisticated one in which the other agent is actually modeling its opponent using a subintentional model, the second is a classic single-agent POMDP, and the third is a very simple one that nevertheless entails a large model space. We want to investigate whether our framework is able to correctly and efficiently learn these models through these experiments. The aim of the first experiment is to learn a relatively complicated model of agent j with θ_j = ⟨0.5, 0.67, 0.5, 0.85, 0.5, −1, −100, 10⟩, who assumes that others' actions are drawn from a uniform distribution. Equivalently, agent j's actual policy, as shown in figure <ref type="figure" target="#fig_2">2</ref>, is to wait for three consecutive growls from the same direction and then open the corresponding door. For this particular experiment, we simulated the observation history for agent i in order to first verify the correctness of our algorithm, excluding the impact of uncertainties in hearing accuracy. 
The simulated observation history is as follows: {GL,S GL,S GL,S GL,CR GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GR,CL GR,S GR,S GR,S GL,CL GL,S GL,S GL,S GR,CR GL,S GL,S GL,S GR,CR GR,S GR,S} The priors we assign to each parameter are shown in figure 3; specifically, they are uniform U(0,1) for b_j^0, Beta(5,3) with mode 0.67 for p_T1, Beta(5,5) for p_T2, Beta(3.5,1.4) with mode 0.85 for p_O1, Beta(5,5) for p_O2, Gaussian N(-1,2) for p_R1, N(-100,4) for p_R2, and N(10,2) for p_R3. After 50 time steps, the algorithm converges to a posterior distribution over agent j's intentional models; the results are also given in figure <ref type="figure" target="#fig_3">3</ref>. Since the parameter space of agent j's models is 8-dimensional, we only show the marginal distributions of each parameter as histograms. We can easily see that the majority of samples are centered around the true parameter values.</p><p>We use principal component analysis (PCA) <ref type="bibr" target="#b0">(Abdi and Williams 2010)</ref> to reduce the sample dimensionality to two dimensions and plot the samples in a 3-dimensional histogram, as shown in Figure <ref type="figure" target="#fig_4">4</ref>. The distribution starts from a Gaussian-like prior and gradually converges to the most likely models. Eventually the mean value of this cluster, ⟨0.49, 0.69, 0.49, 0.82, 0.51, -0.95, -99.23, 10.09⟩, is very close to the true model. Here we give two examples from the big cluster after 50 time steps: ⟨0.56, 0.66, 0.49, 0.84, 0.59, -0.95, -101.37, 11.42⟩ and ⟨0.51, 0.68, 0.52, 0.89, 0.56, -1.33, -98.39, 12.55⟩. The former has a corresponding optimal policy of [0-OL-0.10-L-1], while the latter has [0-OL-0.09-L-0.91-OR-1], both of which are extremely close to the optimal policy of the true model: [0-OL-0.1-L-0.9-OR-1]. 
Consequently, the framework is able to predict other agents' actions with high accuracy.</p><p>We tested the performance of our algorithm in terms of the prediction accuracy for others' actions. We compared the results with other modeling approaches: a frequency-based approach, in which agent j is assumed to choose its actions according to a fixed but unknown distribution, and a no-information model that treats j's actions purely as uniform noise. The results shown in figure <ref type="figure" target="#fig_5">5</ref> are plots averaged over 10 random runs, each with 50 time steps. They show clearly that the intentional I-POMDP approach has significantly lower error rates as agent i perceives more observations. The subintentional model assumes that j's action is drawn from a uniform distribution and therefore has a fixed, high error rate. The frequency-based approach has some learning ability but is far from sophisticated enough to model a fully rational agent.</p><p>Figure <ref type="figure">6</ref>: (a) optimal policy for θ_j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩. (b) optimal policy for θ_j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩.</p><p>In the second experiment, we run our algorithm on actual observations for 30 time steps until it converges, and try to learn the model of a simpler classic POMDP with a high listening accuracy of 0.95 and a small penalty of -10, in which agent j alternately opens doors and listens, as shown in Figure <ref type="figure">6</ref>(a). 
The actual model of j is θ j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩; the priors assigned to b 0 j , p T 1 , p T 2 , p O1 , p O2 , p R1 , p R2 , p R3 are U(0,1), Beta(2,0.5), Beta(10,10), Beta(19,1), Beta(10,10), N(-1,1), N(-10,2), N(10,2), and the actual observation history is {GR,S GL,CR GL,S GL,CL GL,S GL,CR GL,S GL,CL GL,S GR,S GR,CL GR,CL GL,S GR,S GR,S GL,CL GR,S GL,CR GR,S GR,CR GR,CR GR,CL GL,S GL,S GL,S GL,CR GL,S GL,CL GR,S GR,S}.</p><p>Similarly, we report the learned posterior distributions over the model parameters in figure <ref type="figure">7</ref>. We observe an interesting pattern: while some parameters, such as b j,0 , p T 2 and p O2 , concentrate around their actual values, others, like p T 1 and p O1 , become more dispersed than their initial priors. The intuition is that, with the penalty and reward at -10 and 10, a single listen at a reward of -1 is enough to decide which door to open. That is, as long as the tiger likely remains behind the same door while the agent listens (the meaning of p T 1 ) and the hearing accuracy is reliable (the meaning of p O1 ), many models satisfy this particular observation sequence, and our algorithm learns them all.</p><p>For conciseness, we show the average prediction error rates for both the second and third experiments in figure <ref type="figure">9</ref>. Both results are averaged over 10 random runs, each of 30 time steps. In the second experiment, shown in figure 9(a), the intentional I-POMDP approach again has significantly lower error rates than the others.</p><p>In the last experiment, we want to learn a model θ j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩, in which j always listens since the listening penalty now equals the reward, as shown in figure 6(b). Figure <ref type="figure">7</ref>: Learned posterior distributions for model θ j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩. 
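The particle initialization implied by these priors can be sketched as follows; this is a minimal illustration (the function name is ours, and the second parameter of each Gaussian is treated as a standard deviation):

```python
import numpy as np

def sample_prior_particles(n, rng):
    """Draw n particles from the second experiment's priors
    (parameter order: b0_j, pT1, pT2, pO1, pO2, pR1, pR2, pR3)."""
    return np.column_stack([
        rng.uniform(0, 1, n),     # b0_j ~ U(0,1)
        rng.beta(2, 0.5, n),      # pT1  ~ Beta(2, 0.5)
        rng.beta(10, 10, n),      # pT2  ~ Beta(10, 10)
        rng.beta(19, 1, n),       # pO1  ~ Beta(19, 1), mode near 1
        rng.beta(10, 10, n),      # pO2  ~ Beta(10, 10)
        rng.normal(-1, 1, n),     # pR1  ~ N(-1, 1)
        rng.normal(-10, 2, n),    # pR2  ~ N(-10, 2)
        rng.normal(10, 2, n),     # pR3  ~ N(10, 2)
    ])

particles = sample_prior_particles(1000, np.random.default_rng(1))
```

Each row is one candidate intentional model of j; the interactive belief update then reweights these rows against j's observed behavior.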
For brevity, we only show the marginal distributions over the model parameters in figure <ref type="figure" target="#fig_7">8</ref>. The priors assigned to b 0 j , p T 1 , p T 2 , p O1 , p O2 , p R1 , p R2 , p R3 are U(0,1), Beta(5,3), Beta(10,10), Beta(3.5,1.4), Beta(10,10), N(10,1), N(-100,2), N(10,2), and the actual observation history i learns from is {GL,S GL,S GR,S GL,S GL,CL GR,S GR,S GL,CL GR,S GL,S GL,S GR,S GL,S GL,S GL,S GL,CL GR,S GL,S GL,S GL,S}. We can see that all three reward parameters are correctly learned, while the samples of p T 1 , p T 2 , p O1 and p O2 are not tightly concentrated around their true values and remain close to their priors: intuitively, these parameters become less important and may lie in a relatively loose region once p R1 = 10. Lastly, the performance comparison is given in figure <ref type="figure">9</ref>(b).</p></div>
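The Bayesian update over j's candidate models can be approximated with a standard sampling-importance-resampling step, sketched below. `action_prob` is a hypothetical helper standing in for the likelihood of j's observed action under a candidate model's solved policy, and `toy_action_prob` is purely illustrative:

```python
import numpy as np

def update_particles(particles, weights, observed_action, action_prob, rng):
    """One update step over j's model particles: reweight each candidate
    model by the likelihood of j's observed action, then resample to
    avoid weight degeneracy (a standard SIR step)."""
    w = weights * np.array([action_prob(m, observed_action) for m in particles])
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)  # resample
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# toy likelihood: models whose listening reward pR1 is near -1 "listen" often
def toy_action_prob(model, action):
    return 0.9 if abs(model[5] + 1) < 1 else 0.1

rng = np.random.default_rng(2)
particles = rng.standard_normal((200, 8))
particles[:, 5] = rng.normal(-1, 2, 200)      # loose prior on pR1
weights = np.full(200, 1.0 / 200)
particles, weights = update_particles(particles, weights, "L",
                                      toy_action_prob, rng)
```

After a few such steps the particle cloud concentrates on models consistent with the observed action sequence, which is exactly the convergence behavior reported in the histograms.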
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusions and Future Work</head><p>We have described a new approach to learning other agents' models by approximating the interactive belief update with Bayesian inference and Monte Carlo sampling methods. Our framework correctly learns others' models over the entire intentional model space and is therefore a generalized reinforcement learning algorithm for multi-agent settings. It also effectively mitigates the belief-space complexity and performs significantly better than other approaches at predicting others' actions.</p><p>In the future, to fully evaluate practicality on larger problem spaces, more multi-agent problems of various sizes could be tested. Due to computational complexity, experiments on higher nesting levels are currently limited; thus, more effort could be devoted to nonparametric Bayesian methods, which inherently deal with nested belief structures.</p><p>Figure <ref type="figure">9</ref>: (a) Prediction error rate vs. observation length for θ j = ⟨0.5, 1, 0.5, 0.95, 0.5, -1, -10, 10⟩. (b) The same for θ j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: An illustration of the interactive belief update for two agents and one level of nesting.</figDesc><graphic coords="4,324.75,122.79,206.79,142.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>a combination of both agents' possible actions: listen (L), open left door (OL) and open right door (OR).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Optimal policy of a no-information model.</figDesc><graphic coords="5,341.08,530.58,174.15,107.29" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Assigned priors and learned posterior distributions over model parameters for model θ j1 = ⟨0.5, 0.67, 0.5, 0.85, 0.5, -1, -100, 10⟩.</figDesc><graphic coords="6,78.06,246.55,215.51,353.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: 3D histogram of all model samples.</figDesc><graphic coords="6,320.40,117.91,215.52,159.48" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Prediction error rate vs observation length.</figDesc><graphic coords="7,120.51,65.51,130.61,97.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Learned posterior distributions for model θ j = ⟨0.5, 0.66, 0.5, 0.85, 0.5, 10, -100, 10⟩.</figDesc><graphic coords="8,78.06,65.50,215.51,356.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Parameters for transition functions</figDesc><table><row><cell>S</cell><cell>A</cell><cell>TL</cell><cell>TR</cell></row><row><cell>TL</cell><cell>L</cell><cell>p T 1</cell><cell>1 − p T 1</cell></row><row><cell>TR</cell><cell>L</cell><cell>1 − p T 1</cell><cell>p T 1</cell></row><row><cell>*</cell><cell>OL</cell><cell>p T 2</cell><cell>1 − p T 2</cell></row><row><cell>*</cell><cell>OR</cell><cell>1 − p T 2</cell><cell>p T 2</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Parameters for observation functions</figDesc><table><row><cell>S</cell><cell>A</cell><cell>GL</cell><cell>GR</cell></row><row><cell>TL</cell><cell>L</cell><cell>p O1</cell><cell>1 − p O1</cell></row><row><cell>TR</cell><cell>L</cell><cell>1 − p O1</cell><cell>p O1</cell></row><row><cell>*</cell><cell>OL</cell><cell>p O2</cell><cell>1 − p O2</cell></row><row><cell>*</cell><cell>OR</cell><cell>1 − p O2</cell><cell>p O2</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Parameters for reward functions</figDesc><table><row><cell>S</cell><cell>A</cell><cell>R</cell></row><row><cell>*</cell><cell>L</cell><cell>p R1</cell></row><row><cell>TL</cell><cell>OL</cell><cell>p R2</cell></row><row><cell>TR</cell><cell>OR</cell><cell>p R2</cell></row><row><cell>TL</cell><cell>OR</cell><cell>p R3</cell></row><row><cell>TR</cell><cell>OL</cell><cell>p R3</cell></row></table></figure>
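Tables 1-3 together parameterize agent j's frame. A sketch of assembling the transition, observation, and reward functions from the 8-parameter vector might look as follows; Table 1's data cells are assumed to mirror Table 2's layout, and mapping b 0 j to Pr(tiger-left) is likewise our assumption:

```python
import numpy as np

TL, TR = 0, 1        # tiger-left / tiger-right states
L, OL, OR = 0, 1, 2  # listen, open-left, open-right
GL, GR = 0, 1        # growl-left / growl-right observations

def tiger_model(theta):
    """Build belief, T, O, R for the tiger problem from the vector
    (b0, pT1, pT2, pO1, pO2, pR1, pR2, pR3), following Tables 1-3."""
    b0, pT1, pT2, pO1, pO2, pR1, pR2, pR3 = theta
    T = np.empty((3, 2, 2))                       # T[a, s, s']
    T[L]  = [[pT1, 1 - pT1], [1 - pT1, pT1]]      # Table 1 (assumed
    T[OL] = [[pT2, 1 - pT2], [1 - pT2, pT2]]      #  analogous to Table 2)
    T[OR] = [[1 - pT2, pT2], [pT2, 1 - pT2]]
    O = np.empty((3, 2, 2))                       # O[a, s', o], Table 2
    O[L]  = [[pO1, 1 - pO1], [1 - pO1, pO1]]
    O[OL] = [[pO2, 1 - pO2], [pO2, 1 - pO2]]      # same for both states
    O[OR] = [[1 - pO2, pO2], [1 - pO2, pO2]]
    R = np.empty((2, 3))                          # R[s, a], Table 3
    R[:, L] = pR1                                 # listening cost
    R[TL, OL] = R[TR, OR] = pR2                   # opened the tiger's door
    R[TL, OR] = R[TR, OL] = pR3                   # opened the safe door
    belief = np.array([b0, 1 - b0])               # b0 assumed = Pr(TL)
    return belief, T, O, R
```

Each sampled particle can be pushed through `tiger_model` and solved as an ordinary POMDP to obtain the policy used in the likelihood step.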
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/solohan22/IPOMDP.git</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.st.ewi.tudelft.nl/~mtjspaan/pomdp/index_en.html</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Principal component analysis</title>
		<author>
			<persName><forename type="first">H</forename><surname>Abdi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Williams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wiley interdisciplinary reviews: computational statistics</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="433" to="459" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">An introduction to sequential Monte Carlo methods</title>
		<author>
			<persName><forename type="first">A</forename><surname>Doucet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>De Freitas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gordon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Sequential Monte Carlo methods in practice</title>
				<meeting><address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="3" to="14" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Monte Carlo sampling methods for approximating interactive POMDPs</title>
		<author>
			<persName><forename type="first">P</forename><surname>Doshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Gmytrasiewicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="297" to="337" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations</title>
		<author>
			<persName><forename type="first">F</forename><surname>Doshi-Velez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Konidaris</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1308.3513</idno>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Non-linear filtering: interacting particle resolution</title>
		<author>
			<persName><forename type="first">P</forename><surname>Del Moral</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Markov processes and related fields</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="555" to="581" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Graphical models for interactive POMDPs: representations and solutions</title>
		<author>
			<persName><forename type="first">P</forename><surname>Doshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Autonomous Agents and Multi-Agent Systems</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="376" to="416" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Introducing Markov chain Monte Carlo</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">R</forename><surname>Gilks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Richardson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Spiegelhalter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Markov chain Monte Carlo in practice</title>
				<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">19</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A framework for sequential planning in multi-agent settings</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Gmytrasiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Doshi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="49" to="79" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>


<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Novel approach to nonlinear/non-Gaussian Bayesian state estimation</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">J</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Salmond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEE Proceedings F (Radar and Signal Processing)</title>
		<imprint>
			<biblScope unit="volume">140</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="107" to="113" />
			<date type="published" when="1993-04">1993. April</date>
			<publisher>IET Digital Library</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Planning and acting in partially observable stochastic domains</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">P</forename><surname>Kaelbling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Littman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Cassandra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial intelligence</title>
		<imprint>
			<biblScope unit="volume">101</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="99" to="134" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Perseus: Randomized point-based value iteration for POMDPs</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Spaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Vlassis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of artificial intelligence research</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="195" to="220" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The infinite regionalized policy representation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Carin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th International Conference on Machine Learning (ICML-11)</title>
				<meeting>the 28th International Conference on Machine Learning (ICML-11)</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="769" to="776" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Bayesian Learning of Other Agents&apos; Finite Controllers for Interactive POMDPs</title>
		<author>
			<persName><forename type="first">A</forename><surname>Panella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gmytrasiewicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Thirtieth AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Probabilistic reasoning in intelligent systems: Networks of plausible reasoning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pearl</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1988">1988</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Bayes-adaptive POMDPs</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chaib-Draa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pineau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="1225" to="1232" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
