-

Temporally Extended Metrics for Markov Decision Processes

Marc G. Bellemare Google Brain

0 1 2 0 Doina Precup McGill University / DeepMind 1 Philip Amortila McGill University 2 Prakash Panangaden McGill University

Developing safe and efficient methods for state abstraction in reinforcement learning systems is an open research problem. We propose to address it by leveraging ideas from formal verification, namely, bisimulation. Specifically, we generalize the notion of bisimulation by considering arbitrary comparisons between states instead of strict reward matching. We further develop a notion of temporally extended metrics, which extend a base metric between states of an environment so as to reflect not just the current difference but the extent to which the distance is preserved through the course of transitions. We show that this property is not satisfied by bisimulation metrics, which were previously used to compare states with respect to their longterm rewards. A temporal extension can be defined for any base metric of interest, thus making the construction very flexible. The kernel of the temporally extended metrics corresponds precisely to exact bisimulation (thus these metrics form a larger class of bisimulation metrics). We provide bounds relating bisimulation and temporally extended metrics and also examine the couplings of state distributions which are induced.

Building safe AI systems is crucial for a wide range of applications. This is especially difficult when the system relies on reinforcement learning, in which an agent learns from its own experience how to behave in the world. Because in most applications of reinforcement learning, the environment is very large or perhaps continuous, an agent is required to abstract over its state space. However, this operation can result in putting together states that actually have very different values, which can lead to suboptimal or even dangerous behaviour. This is especially true if an agent relies only on immediate observations to determine the similarity between states, rather than long-term predictions or behaviour. For example, a camera mounted on a car may show two very similar images, but one may lead to the car going up the curb and the other may be safe driving.

In reinforcement learning, we can leverage some structure in the problem formulation to attempt to tackle this problem.

CIFAR Fellow Copyright held by author(s) Specifically, the environment of an agent is a Markov Decision Process (MDP) in which state similarity has been studied for a couple of decades. One long-studied notion used to capture behavioral similarity of states in a Markov Decision Process (MDP) is called bisimulation. Bisimulation has originated in the fields of concurrency and formal verification for provably verifying the correctness and safety of processes and systems (Milner 1980) . An extension to MDPs was proposed by (Givan, Dean, and Greig 2003) . Bisimulation is a canonical tool for analyzing the behaviour of transition systems and clustering equivalent states in overly large systems. In the context of MDPs, bisimulation requires an exact match of both rewards and transition distributions. As this criterion is too brittle, a metric approach was developed by (Ferns, Panangaden, and Precup 2004) , which allows one to measure “how bisimilar” two states are. These bisimulation metrics assign distance of zero to states if and only if they are bisimilar. Bisimulation metrics can be used for state abstraction (by aggregating states that are " away), and doing so provides one with formal (rather than statistical) safety guarantees on the difference between the true optimal value function and the approximated one. The bisimulation metric can be computed iteratively, but each step requires one to solve a linear program involving the transition distributions for every pair of states, which is not only computationally expensive but also requires a full model of the MDP. In this work, we investigate alternative metrics for behavioural equivalence, with the goal of maintaining the useful theoretical guarantees of bisimulation metrics while reducing the computational burden and the need for knowledge of the model.

Our contributions are two-fold: firstly, we propose a coupling-based generalization of bisimulation which allows for greater flexibility in the comparisons between states (instead of strict reward matching), and consequently in the properties being checked. Secondly, we consider the class of quantitative bisimulations and show how this defines a notion of temporally extended (TE) metrics. Intuitively, these metrics compute the minimum value of a chosen base metric for which the states s and s0 can remain in that range throughout their dynamics. The TE metrics assign distance 0 to states if and only if they are bisimilar, much like the bisimulation metrics previously defined. However, both the construction and the resulting metric are quite different. The rest of the paper is organized as follows. In the next section we provide some necessary background. In Section 3 we characterize bisimulation via couplings and define the extension of this characterization to arbitrary relations. In Section 4 we define the temporally extended metrics. Section 5 compares the two metrics by providing some bounds relating them and analyzing the couplings induced by the two metrics. Lastly, we wrap up with a discussion on the benefits and disadvantages of these metrics, highlighting directions for future work.

Background

Let D(S ) denote the set of probability distributions on SS. ;AA;Mfaprako:vS d!eciDsi(oSn)pgrao2cAe;srs :(MS DPA)i!saR5-t0u;ple Mwhe=re we emphasize that pa(s) 2 D(S ) is to be read as a probability distribution over the states (one for each action). The transition probability from s to a set of states X is written pa(s)(X ) = P cations, the state spaxc2eXopfat(hse)(Mx)D.IPn imsasinmypplyractotiocallaragpepltioallow one to compute the value functions exactly without the use of state abstraction. Bisimulation is a canonical example of a safe abstraction, in that the optimal policy and optimal value functions will be preserved in the aggregated MDP (Li, Walsh, and Littman 2006) . Bisimulation is defined in terms of equivalence classes, we write S=R for the set of equivalence classes of an equivalence relation R. Definition 2.1 (Bisimulation). A bisimulation relation U on S is an equivalence relation such that sU s0 implies: s0 if there 1. 8a 2 A; r(s; a) = r(s0; a) and 2. 8a 2 A; 8C 2 S=U ; pa(s)(C) = pa(s0)(C): We say that s and s0 are bisimilar and write s is some bisimulation relation U relating them.

Thus bisimilar states will have equal rewards and thereafter transition with equal probability to more bisimilar states.

Given a relation R between states, the lifting (R)# of that relation allows one to naturally extend R to distributions on states. Liftings are defined in terms of couplings, we will later see that this notion will allow us to generalize bisimulation.

Definition 2.2 (Couplings and Liftings). A coupling of two distributions ( ; ) on S is a joint distribution on S S such that the marginals are and : (s; S ) = (s) & (s0) = (S ; s0): Moreover, a coupling of distributions ( ; ) is a lifting of R if: (s; s0) > 0 ) sRs0, i.e. support( ) R:

When there exists a lifting of

(R)# . and as above we write A simple example: bisimulation and liftings We consider the example of the bisimilar states in Figure 1. To see that s0 and t0 are bisimilar, take the equivalence classes S =U = ffs0; t0g; fs1; t1g; fs2; t2; t3gg. All states in each class receive the same rewards and transition with equal probability to all other classes, so U is indeed a bisimulation relation. Now consider the coupling of (p(s0); p(t0)) given by the dashed arrows, i.e. (s1; t1) = (s2; t2) = (s2; t3) = 13 . This coupling is a lifting of U , since all supported states are U -related. This is not a coincidence: the fact that bisimulation is equivalent to the existence of a lifting is the subject of Section 3. Not all couplings are liftings: the trivial coupling !(si; tj ) = p(s0)(si)p(t0)(tj ) is not a lifting of U . Previous work on metric extensions of bisimulation is built on the Kantorovich metric K(d)( ; ) = min 2 ( ; )h ; di (also called the Wasserstein metric). The bisimulation metric d is the fixed point of the operator F (d)(s; s0) = maxaf(1 )jr(s; a) r(s0; a)j + K(d)(pa(s); pa(s0))g; we refer the reader to (Ferns, Panangaden, and Precup 2004) for further background.

Generalized Bisimulation via Liftings

In bisimulation, the rewards match at the first step and thereafter the states transition such that the bisimulation relation is preserved. This definition can also be captured in terms of couplings: our first result is that bisimulation is equivalent to the existence of a particular lifting of the states.

Theorem 3.1. A relation U is a bisimulation relation () sU s0 implies: 1. 8a r(s; a) = r(s0; a) 2. 8a pa(s)(U )#pa(s0) The proofs of this result and later results are given in the Appendix. The backward implication follows from the remarkable Strassen’s theorem on couplings (see e.g. (Lindvall 1999) ), which implies that for any R, (R)# () 8A S ; (A) (R(A)). Applied to an equivalence class C and using that U is symmetric gives the bisimulation property.

Building on the previous result, one can readily generalize the first condition by using a generic relation between states instead of demanding that the rewards match.

Definition 3.1 (R-bisimulation). Given a base relation R S S , an R-bisimulation relation U S S is a new relation where the states are R-related and their transition distributions are U -lifted. Formally, sU s0 implies: 1. sRs0 This allows one to define arbitrary properties that are preserved by the dynamics of the MDP in a systematic way. We remark, firstly, that the second condition is much stronger than merely requiring that the lifting is a coupling supported by R, that is, pa(s)(R)#pa(s0). Requiring U to be lifted also demands (in a coinductive manner) that the successor states after a transition are R-related and can themselves exhibit an appropriate coupling. Secondly, we note that R is well-defined, since the union of R-bisimulation relations is itself an R-bisimulation relation. The well-behavedness of R-bisimulations depends on their base relations, i.e. R is reflexive, symmetric, and transitive whenever R has the same property (see Appendix).

Temporally Extended Metrics

Although the framework presented in Section 3 is agnostic with respect to the base relation R, we will be focusing on the setting of quantitative relations. These are relations parametrized by the use of a real number ", which arise from a base pseudometric : S S ! R 0 on states (or state-action pairs). More formally, given a base metric : S S ! R 0, a quantitative relation " is the relation " := 1 ([0; "]) = f(s; s0) j (s; s0) "g. We note the distinction between the metric and the relations " derived from the metric. We call a bisimulation arising from such a quantitative relation a quantitative bisimulation, and will abuse notation by writing " rather than ". An example of quantitative relations is approximate reward equality s "s0 := maxa jr(s; a) r(s0; a)j "; derived from the base metric (s; s0) = maxa jr(s; a) r(s0; a)j. In the context of quantitative bisimulations, we can define a new metric by taking an infimum over the " parameter. We call these the temporally extended (TE) metrics. The TE metric finds the minimum " such that the states are "-bisimilar. That is, the two states are a distance of " away (in the base metric ) and can be coupled corecursively so that future states are " away and can themselves be coupled. A temporal extension can be defined for any base metric. Definition 4.1 (TE metric). Given a base metric and a corresponding collection of quantitative relations f "g" 0, the TE metric for is defined by n d (s; s0) = inf " j s

o " s0 :

This construction gives well-defined pseudometrics. The proof follows from the symmetry, transitivity, and additivity1 of the relations " (see Appendix for details). Theorem 4.1. Given a base pseudometric , the TE metric d is indeed a pseudometric on S .

Moreover, the temporally extended metrics assign distance 0 to states if and only if they are perfectly bisimilar in the base metric (i.e. they are 0-bisimilar). In the context of reward differences, this implies: Theorem 4.2. Classical bisimulation corresponds exactly to the kernel of the temporal extension of , i.e.

s s0 ()

d (s; s0) = 0: For reward differences, the bisimulation metrics share this property, although our metrics are more general. Furthermore, despite the kernels matching, the TE metrics are not the same as the bisimulation metrics, both in construction and in the distances they assign. A simple example revisited We consider the almostbisimilar states in Figure 2, examined with as the base metric. In this example, all states are 1-bisimilar, but not 0-bisimilar. This is captured by the metric: one needs to couple (s2; t1), since the marginal onto t1 has to equal 1=3 + " and s1 only has 1=3 to spare. Since (s2; t1) 2 support( ) and jr(s2) r(t1)j = 1, then d (s0; t0) = 1. This example highlights the discontinuous behaviour of the TE metric – the states can either be coupled with rewards 0 or 1.

Comparing Bisimulation Metrics and Temporally Extended Metrics

In this section, we compare the temporally extended metrics with the bisimulation metrics. Results in this section are given in terms of reward metric but can be generalized to arbitrary base metrics. The proofs (see Appendix) elucidate the very useful properties of liftings. The bound is tight, and equality need not hold, as Figure 2 shows. Consequently, using bounds from (Ferns, Panangaden, and Precup 2004) , the TE metric gives a guarantee on the difference in optimal value functions and on the approximation error for state-abstraction.

Corollary 5.1.1. Let V^ be the value function in the abstract MDP of any abstraction , not necessarily a bisimulation.2 Then, 8s; s0 2 S : 1 1 In Figure 2, the same coupling minimized both the bisimulation metric and the TE metric. Interestingly, the couplings chosen need not be the same in general. This is the content of the next theorem.

Theorem 5.2. A minimum coupling 2 argmin (pa(s);pa(s0)) K(d )(pa(s); pa(s0)) of the bisimulation metric need not be a lifting of the optimal bisimulation d (s;s0). Conversely, which lifts the optimal bisimulation d (s;s0) need not be a minimizer of K(d )(pa(s); pa(s0)). This highlights the different behaviours of the two metrics – the TE metric aims to minimize the reward difference between coupled states at every step so as to ensure that a single bisimulation relation holds, whereas the bisimulation metric is not preserving a single relation and is willing to couple large differences at an initial step. The couplings chosen by the bisimulation metric do not give a (generalized) bisimulation relation, and the best that one can do with a (generalized) bisimulation relation is given by the temporal extension.

Discussion

We have introduced the temporally extended metrics, a novel class of metrics for behavioural equivalence, which are based on a generalized notion of bisimulation. We have established bounds and other connections with the more familiar bisimulation metric, and seen that they neither compute the same values nor pick out the same couplings of 2we refer the reader to (Li, Walsh, and Littman 2006) for background on abstract MDPs state distributions. After completing this work we discovered that similar ideas for quantitative bisimulations have successfully appeared in the control-theory literature (Girard and Pappas 2007) , although in the different setting of non-deterministic (rather than probabilistic) transition systems without rewards.

This work marks the beginning of an investigation into formally safe, computationally tractable, and model-free metrics for behavioural equivalence. There are many interesting avenues for future work that we intend to pursue. For computational aspects, the TE metric involves the computation of a bisimulation relation rather than a bisimulation metric, which can be done exactly in O(jAjjS j3) via partition refinements as opposed to approximated upto a degree of accuracy " in O(jAjjS j4 log jS j log ") (Ferns, Panangaden, and Precup 2004) . Deriving an exact algorithm, however, is left for future work. The possibility of model-free computation is hypothesized since the metric requires only the existence of a lifting, as opposed to the exact weights of a coupling as does the Kantorovich metric, thus should be easier to estimate from samples.

A slightly disconcerting aspect of the TE metrics is their discontinuity with respect to the transition distributions, observed in Figure 2. This is because we require exact couplings - we are currently investigating the use of approximate couplings to remedy this, which have recently surfaced in the study of differential privacy (Barthe et al. 2016) . Finally, as the general notions of R-bisimulations and temporal extensions can be considered for arbitrary relations and metrics, examining the interplay between different notions of bisimulation (e.g. for optimal value functions or policy value functions) promises to be a fruitful direction. (1 +

n+1 )d (s; s0) X

n ) X id (s; s0) ! d (sk; sj ) = inf" sk " sj

d (s;s0), and we conclude that d (s; s0); 8sk; sj .

Thus the inequality holds for all n. Taking limits finishes the proof:

Theorem A.4. A minimum coupling 2 argmin (pa(s);pa(s0)) K(d )(pa(s); pa(s0)) of the bisimulation metric need not be a lifting of the optimal bisimulation d (s;s0). Conversely, which lifts the optimal bisimulation d (s;s0) need not be a minimizer of K(d )(pa(s); pa(s0)).

Proof. Consider the following MDP, taking as our base metric. Proofs

Theorem A.1. A relation U is a bisimulation relation implies: 1. 8a r(s; a) = r(s0; a) and 2. 8a pa(s)(U )#p(a()s0)sU s0 Proof. The first condition is immediate in both cases. For the forward implication, we pick the coupling a(s0; t0) = 1[s0Ut0]pa(s)(s0)pa(t)(t0)=pa(s)([s0]), where [s0] := ft0 j s0U t0g is the equivalence class of s0. The marginals match since: a(s0; S) = Pt0:s0Ut0 pa(s)(s0)pa(t)(t0)=pa(s)([s0]) = pa(s)(s0)pa(t)([s0])=pa(s)([s0]) = pa(s)(s0); as pa(s)([s0]) = pa(t)([s0]) by the second condition of bisimulation. Similarly for a(S; t0). To make a a lifting of U we still need to check that support( a) U , which is evident since a(s0; t0) is only non-zero when s0U t0. For the converse, we use Strassen’s theorem. Let C 2 S=U , note that we have pa(s)(C) pa(t)(C) since U (C) = C. By symmetry of U , we also have pa(t)(U )#pa(s), so that pa(t)(C) pa(s)(C). So s t.

Theorem A.2. Given a base pseudometric , the TE metric d is indeed a pseudometric on S.

Proof. 1. Note that s

0 s (via the identity coupling (s0; s00) = 1[s=s0]pa(s)(s0)), thus d (s; s) = inf" 0fs " sg = 0.

2. Note that s

" t ) t (s0; t0)), thus d (s; t) = inf"fs d (t; s). " tg = inf"ft " sg = So d (s; t) = inf(B) d (s; w) + d (w; t). 3. Let A = A1 + A2 = f"1 + "2js "1 w & w "2 tg, and

Theorem A.3. The temporal extension of

bisimulation metric: 8s; s0 2 S, upper bounds the Proof. We proceed by induction, showing that 8s; t, dn(s; s0) (1 ) Pin=0 id (s; s0): The base case is d1(s; s0) = maxa f(1 )jr(s; a) r(s0; a)jg (1 )d (s; s0), using maxa jr(s; a) r(s0; a)j d (s; s0). For the induction step we upper bound the min-cost coupling from the minimization problem with the liftings a 2 (pa(s); pa(s0)) given by d (s;s0) (one can show the infs are always attained). dn+1(s; s0) = maaxf1 )jr(s; a)

r(s0; a)j +

K(dn)(pa(s); pa(s0))g (1 (1 + a(sk; sj )dn(sk; sj )

n ) X

! Now we use the lifting property: the only non-zero terms in the summation are those for which (sk; sj ) 2 support( a) " s (via the mirror coupling (t0; s0) = One can verify that ! minimizes the bisimulation distance and that minimizes the temporally extended metric. Indeed, Pu;v2S ! (u; v)d (u; v) = "d (s1; t1) + (1 ")d (s2; t2) = 2"f2(1 )g and this coupling is optimal for K(d ). Meanwhile Pu;v2S (u; v)d (u; v) = "d (s1; t2) + "d (s2; t1) + (1 2")d (s2; t2) = "d (s1; t2) + "d (s2; t1) = 2"f(1 )(1 + + 2)g: On the other hand, using ! , the best relation that can be lifted is 2, since (s1; t1) 2 support(! ) and r(s1) r(t1) = 2. Meanwhile, lifts 1 and thus achieves the minimum lifting, since (s1; t2); (s2; t1); (s2; t2) all have reward differences of 1. Thus ! minimizes the bisimulation metric but not the temporal extension metric. Conversely, minimizes the temporal extension metric since it achieves the minimum lifting, but not the bisimulation metric.

Barthe , G. ; Fong, N. ; Gaboardi , M. ; Gre´goire, B.; Hsu , J.; and Strub, P.-Y. 2016 . Advanced probabilistic couplings for differential privacy . In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security , 55 - 67 . ACM.

Ferns , N. ; Panangaden , P. ; and Precup , D. 2004 . Metrics for finite Markov decision precesses . In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence , 162 - 169 .

Girard , A. , and Pappas , G. J. 2007 . Approximation metrics for discrete and continuous systems . IEEE Transactions on Automatic Control 52 ( 5 ): 782 - 798 .

Givan , R. ; Dean , T. ; and Greig, M. 2003 . Equivalence notions and model minimization in Markov decision processes . Artificial Intelligence 147 ( 1-2 ): 163 - 223 .

Li , L. ; Walsh , T. J.; and Littman , M. L. 2006 . Towards a unified theory of state abstraction for mdps.

Lindvall , T.

1999 . On Strassen's theorem on stochastic domination.

Electronic communications in probability 4 : 51 - 59 .

Milner , R.

1980 . A Calculus for Communicating Systems , volume 92 of Lecture Notes in Computer Science. Springer-Verlag.