<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Temporally Extended Metrics for Markov Decision Processes</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Philip</forename><surname>Amortila</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Marc</forename><forename type="middle">G</forename><surname>Bellemare</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Prakash</forename><surname>Panangaden</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Doina</forename><surname>Precup</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">McGill University</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">McGill University</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="institution">McGill University / DeepMind</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Temporally Extended Metrics for Markov Decision Processes</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">47EF314C2DC2ABF0AB385727843870A2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T06:29+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Developing safe and efficient methods for state abstraction in reinforcement learning systems is an open research problem. We propose to address it by leveraging ideas from formal verification, namely bisimulation. Specifically, we generalize the notion of bisimulation by considering arbitrary comparisons between states instead of strict reward matching. We further develop a notion of temporally extended metrics, which extend a base metric between states of an environment so as to reflect not just the current difference but the extent to which that distance is preserved through the course of transitions. We show that this property is not satisfied by bisimulation metrics, which were previously used to compare states with respect to their long-term rewards. A temporal extension can be defined for any base metric of interest, making the construction very flexible. The kernel of the temporally extended metrics corresponds precisely to exact bisimulation (thus these metrics form a larger class of bisimulation metrics). We provide bounds relating bisimulation and temporally extended metrics, and also examine the couplings of state distributions which are induced.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Building safe AI systems is crucial for a wide range of applications. This is especially difficult when the system relies on reinforcement learning, in which an agent learns from its own experience how to behave in the world. Because the environment in most applications of reinforcement learning is very large or even continuous, an agent is required to abstract over its state space. However, this operation can group together states that actually have very different values, which can lead to suboptimal or even dangerous behaviour. This is especially true if an agent relies only on immediate observations to determine the similarity between states, rather than long-term predictions or behaviour. For example, a camera mounted on a car may show two very similar images, but one may lead to the car going up the curb while the other corresponds to safe driving.</p><p>In reinforcement learning, we can leverage structure in the problem formulation to tackle this issue.</p><p>Specifically, the environment of an agent is a Markov Decision Process (MDP), in which state similarity has been studied for a couple of decades. One long-studied notion used to capture behavioural similarity of states in an MDP is called bisimulation. Bisimulation originated in the fields of concurrency and formal verification, where it is used to formally verify the correctness and safety of processes and systems <ref type="bibr" target="#b5">(Milner 1980</ref>). An extension to MDPs was proposed by <ref type="bibr" target="#b2">(Givan, Dean, and Greig 2003)</ref>. Bisimulation is a canonical tool for analyzing the behaviour of transition systems and clustering equivalent states in overly large systems. In the context of MDPs, bisimulation requires an exact match of both rewards and transition distributions. 
As this criterion is too brittle, a metric approach was developed by <ref type="bibr">(Ferns, Panangaden, and Precup 2004)</ref>, which allows one to measure "how bisimilar" two states are. These bisimulation metrics assign a distance of zero to states if and only if they are bisimilar. Bisimulation metrics can be used for state abstraction (by aggregating states that are at most ε apart), and doing so provides formal (rather than statistical) safety guarantees on the difference between the true optimal value function and the approximated one. The bisimulation metric can be computed iteratively, but each step requires solving a linear program involving the transition distributions of every pair of states, which is not only computationally expensive but also requires a full model of the MDP. In this work, we investigate alternative metrics for behavioural equivalence, with the goal of maintaining the useful theoretical guarantees of bisimulation metrics while reducing both the computational burden and the need for knowledge of the model.</p><p>Our contributions are two-fold. Firstly, we propose a coupling-based generalization of bisimulation which allows for greater flexibility in the comparisons between states (instead of strict reward matching), and consequently in the properties being checked. Secondly, we consider the class of quantitative bisimulations and show how this defines a notion of temporally extended (TE) metrics. Intuitively, these metrics compute the minimum value ε of a chosen base metric such that the states s and s′ remain within that distance throughout their dynamics. The TE metrics assign distance 0 to states if and only if they are bisimilar, much like the bisimulation metrics previously defined. However, both the construction and the resulting metric are quite different.</p><p>The rest of the paper is organized as follows. In the next section we provide some necessary background. 
In Section 3 we characterize bisimulation via couplings and define the extension of this characterization to arbitrary relations. In Section 4 we define the temporally extended metrics. Section 5 compares the two metrics by providing some bounds relating them and analyzing the couplings induced by the two metrics. Lastly, we wrap up with a discussion on the benefits and disadvantages of these metrics, highlighting directions for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head><p>Let D(S) denote the set of probability distributions on S. A Markov decision process (MDP) is a 5-tuple M = ⟨S, A, {p_a : S → D(S)}_{a∈A}, r : S × A → R_{≥0}, γ⟩, where we emphasize that p_a(s) ∈ D(S) is to be read as a probability distribution over states (one for each action). The transition probability from s to a set of states X is written p_a(s)(X) = Σ_{x∈X} p_a(s)(x). In many practical applications, the state space of the MDP is simply too large to allow one to compute value functions exactly without the use of state abstraction. Bisimulation is a canonical example of a safe abstraction, in that the optimal policy and optimal value functions are preserved in the aggregated MDP <ref type="bibr" target="#b3">(Li, Walsh, and Littman 2006)</ref>. Bisimulation is defined in terms of equivalence classes; we write S/R for the set of equivalence classes of an equivalence relation R. We say that s and s′ are bisimilar, and write s ∼ s′, if there is some bisimulation relation U relating them.</p><p>Thus bisimilar states have equal rewards and thereafter transition with equal probability to further bisimilar states.</p><p>Given a relation R between states, the lifting (R)_# of that relation allows one to naturally extend R to distributions on states. Liftings are defined in terms of couplings; we will later see that this notion allows us to generalize bisimulation. Definition 2.2 (Couplings and Liftings). A coupling of two distributions (µ, ν) on S is a joint distribution λ on S × S whose marginals are µ and ν:</p><formula xml:id="formula_0">λ(s, S) = µ(s) and λ(S, s′) = ν(s′). Moreover, a coupling λ of distributions (µ, ν) is a lifting of R if λ(s, s′) &gt; 0 ⇒ s R s′, i.e. support(λ) ⊆ R.</formula><p>When there exists a lifting of µ and ν as above, we write µ (R)_# ν.</p><p>A simple example: bisimulation and liftings. We consider the example of the bisimilar states in Figure <ref type="figure" target="#fig_1">1</ref>. To see that s_0 and t_0 are bisimilar, take the equivalence classes S/U = {{s_0, t_0}, {s_1, t_1}, {s_2, t_2, t_3}}. All states in each class receive the same rewards and transition with equal probability to all other classes, so U is indeed a bisimulation relation. Now consider the coupling λ of (p(s_0), p(t_0)) given by the dashed arrows, i.e. λ(s_1, t_1) = λ(s_2, t_2) = λ(s_2, t_3) = 1/3. This coupling is a lifting of U, since all supported pairs are U-related. This is not a coincidence: the fact that bisimulation is equivalent to the existence of a lifting is the subject of Section 3. Not all couplings are liftings: the trivial product coupling ω(s_i, t_j) = p(s_0)(s_i) · p(t_0)(t_j) is not a lifting of U.</p><formula xml:id="formula_1">[Diagram of Figure 1 omitted: two processes rooted at s_0 and t_0, with transition probabilities 1/3 and 2/3 and rewards in brackets; see the figure caption.]</formula></div>
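The marginal and support conditions in Definition 2.2 are mechanical to check on finite state spaces. The following Python sketch (illustrative, not from the paper) encodes the distributions p(s_0) and p(t_0) read off Figure 1 and verifies that the dashed-arrow coupling is a lifting of U, while the trivial product coupling is not:

```python
from fractions import Fraction as F

# Transition distributions p(s0) and p(t0) read off Figure 1.
mu = {"s1": F(1, 3), "s2": F(2, 3)}
nu = {"t1": F(1, 3), "t2": F(1, 3), "t3": F(1, 3)}

# The coupling given by the dashed arrows in the figure.
lam = {("s1", "t1"): F(1, 3), ("s2", "t2"): F(1, 3), ("s2", "t3"): F(1, 3)}

# The bisimulation relation U restricted to successor states:
# equivalence classes {s1, t1} and {s2, t2, t3}.
U = {("s1", "t1"), ("s2", "t2"), ("s2", "t3")}

def is_coupling(lam, mu, nu):
    """lam is a coupling of (mu, nu) iff its marginals are mu and nu."""
    left, right = {}, {}
    for (u, v), w in lam.items():
        left[u] = left.get(u, 0) + w
        right[v] = right.get(v, 0) + w
    return left == mu and right == nu

def is_lifting(lam, R):
    """lam lifts R iff support(lam) is contained in R."""
    return all(pair in R for pair, w in lam.items() if w > 0)

print(is_coupling(lam, mu, nu), is_lifting(lam, U))      # True True
omega = {(u, v): mu[u] * nu[v] for u in mu for v in nu}
print(is_coupling(omega, mu, nu), is_lifting(omega, U))  # True False
```

Exact rationals avoid the floating-point comparisons that would otherwise make the marginal check fragile.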
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Generalized Bisimulation via Liftings</head><p>In bisimulation, the rewards match at the first step and thereafter the states transition such that the bisimulation relation is preserved. This definition can also be captured in terms of couplings: our first result is that bisimulation is equivalent to the existence of a particular lifting of the transition distributions. Theorem 3.1. A relation U is a bisimulation relation ⇐⇒ s U s′ implies:</p><formula xml:id="formula_2">1. ∀a, r(s, a) = r(s′, a) 2. ∀a, p_a(s) (U)_# p_a(s′)</formula><p>The proofs of this result and later results are given in the Appendix. The backward implication follows from the remarkable theorem of Strassen on couplings (see e.g. <ref type="bibr" target="#b4">(Lindvall 1999)</ref>), which implies that for any R, µ (R)_# ν ⇐⇒ ∀A ⊆ S, µ(A) ≤ ν(R(A)). Applying this to an equivalence class C, and using that U is symmetric, gives the bisimulation property.</p><p>Building on the previous result, one can readily generalize the first condition by using a generic relation between states instead of demanding that the rewards match. This allows one to define, in a systematic way, arbitrary properties that are preserved by the dynamics of the MDP. We remark, firstly, that the second condition is much stronger than merely requiring that the transition distributions admit a coupling supported by R, that is, p_a(s) (R)_# p_a(s′). Requiring U itself to be lifted also demands (in a coinductive manner) that the successor states after a transition are R-related and can themselves exhibit an appropriate coupling. Secondly, we note that R∼ is well-defined, since the union of R-bisimulation relations is itself an R-bisimulation relation. The well-behavedness of R-bisimulations depends on their base relations: R∼ is reflexive, symmetric, and transitive whenever R has the same property (see Appendix).</p></div>
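Strassen's condition turns the existential statement "a lifting exists" into finitely many inequalities, which can be checked by brute force on small state spaces. A Python sketch (illustrative; the distributions and relation are those of Figure 1):

```python
from fractions import Fraction as F
from itertools import combinations

def strassen_liftable(mu, nu, R):
    """mu (R)# nu iff mu(A) <= nu(R(A)) for every subset A (Strassen)."""
    support = [u for u, w in mu.items() if w > 0]
    for k in range(len(support) + 1):
        for A in combinations(support, k):
            image = {v for (u, v) in R if u in set(A)}  # R(A)
            if sum(mu[u] for u in A) > sum(nu.get(v, 0) for v in image):
                return False
    return True

mu = {"s1": F(1, 3), "s2": F(2, 3)}
nu = {"t1": F(1, 3), "t2": F(1, 3), "t3": F(1, 3)}
U = {("s1", "t1"), ("s2", "t2"), ("s2", "t3")}

print(strassen_liftable(mu, nu, U))               # True: the dashed coupling exists
print(strassen_liftable(mu, nu, {("s1", "t1")}))  # False: s2's mass has nowhere to go
```

The subset enumeration is exponential in the support size, so this is a specification-level check rather than an algorithm one would run at scale.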
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Temporally Extended Metrics</head><p>Although the framework presented in Section 3 is agnostic with respect to the base relation R, we will focus on the setting of quantitative relations. These are relations parametrized by a real number ε, which arise from a base pseudometric δ : S × S → R_{≥0} on states (or state-action pairs). More formally, given a base metric δ : S × S → R_{≥0}, a quantitative relation δ_ε is the relation δ_ε := δ^{−1}([0, ε]) = {(s, s′) | δ(s, s′) ≤ ε}. We note the distinction between the metric δ and the relations δ_ε derived from the metric. We call a bisimulation arising from such a quantitative relation a quantitative bisimulation, and will abuse notation by writing s δ∼_ε s′ rather than s (δ_ε)∼ s′. In the context of quantitative bisimulations, we can define a new metric by taking an infimum over the parameter ε. We call these the temporally extended (TE) metrics. The TE metric finds the minimum ε such that the states are δ_ε-bisimilar. That is, the two states are at most ε apart (in the base metric δ) and can be coupled corecursively so that future states are at most ε apart and can themselves be coupled. A temporal extension can be defined for any base metric. Definition 4.1 (TE metric). Given a base metric δ and a corresponding collection of quantitative relations {δ_ε}_{ε≥0}, the TE metric for δ is defined by</p><formula xml:id="formula_3">d^δ_τ(s, s′) = inf { ε | s δ∼_ε s′ }.</formula><p>This construction gives well-defined pseudometrics. The proof follows from the symmetry, transitivity, and additivity¹ of the relations δ∼_ε (see Appendix for details). Theorem 4.1. Given a base pseudometric δ, the TE metric d^δ_τ is indeed a pseudometric on S. Moreover, the temporally extended metrics assign distance 0 to states if and only if they are perfectly bisimilar in the base metric (i.e. they are δ_0-bisimilar). In the context of reward differences, this implies: Theorem 4.2. Classical bisimulation corresponds exactly to the kernel of the temporal extension of ρ, i.e.</p><p>s ∼ s′ ⇐⇒ d^ρ_τ(s, s′) = 0.</p><p>The bisimulation metrics share this property for reward differences, although our metrics are more general. Furthermore, despite the kernels matching, the TE metrics are not the same as the bisimulation metrics, both in their construction and in the distances they assign. A simple example, revisited. We consider the almost-bisimilar states in Figure <ref type="figure" target="#fig_4">2</ref>, examined with ρ as the base metric. In this example, all states are ρ_1-bisimilar, but not ρ_0-bisimilar. This is captured by the metric: one needs to couple (s_2, t_1), since the marginal onto t_1 has to equal 1/3 + ε and s_1 only has 1/3 to spare. Since (s_2, t_1) ∈ support(λ) and |r(s_2) − r(t_1)| = 1, we get d^ρ_τ(s_0, t_0) = 1. This example highlights the discontinuous behaviour of the TE metric: the states can only be coupled at reward differences of 0 or 1.</p><formula xml:id="formula_4">[Diagram of Figure 2 omitted: as in Figure 1 but with perturbed transition probabilities 1/3 + ε and 1/3 − ε; see the figure caption.]</formula></div>
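To make Definition 4.1 concrete, here is a brute-force Python sketch of d^δ_τ with δ = ρ on a tiny hypothetical single-action MDP (the MDP and all names are illustrative, not from the paper): for each candidate ε we compute the largest δ_ε-bisimulation as a greatest fixed point, pruning pairs whose transition distributions admit no lifting (checked via Strassen's condition), and then take the least ε that keeps the pair related.

```python
from itertools import combinations

# Hypothetical single-action MDP: rewards on states, transitions as
# distributions (dicts). States "a" and "b" agree now but differ later.
R = {"a": 0, "b": 0, "a1": 0, "b1": 1}
P = {"a": {"a1": 1.0}, "b": {"b1": 1.0},
     "a1": {"a1": 1.0}, "b1": {"b1": 1.0}}
S = list(R)

def rho(s, t):
    """Base reward metric rho(s, t) = |r(s) - r(t)|."""
    return abs(R[s] - R[t])

def liftable(mu, nu, rel):
    """Strassen: a coupling supported inside rel exists iff
    mu(A) <= nu(rel(A)) for every subset A of mu's support."""
    support = [u for u in mu if mu[u] > 0]
    for k in range(len(support) + 1):
        for A in combinations(support, k):
            image = {v for (u, v) in rel if u in set(A)}
            if sum(mu[u] for u in A) > sum(nu.get(v, 0) for v in image) + 1e-12:
                return False
    return True

def eps_bisimulation(eps):
    """Greatest fixed point: start from the quantitative relation
    rho_eps and prune pairs whose transitions cannot be lifted."""
    rel = {(s, t) for s in S for t in S if rho(s, t) <= eps}
    while True:
        pruned = {(s, t) for (s, t) in rel if liftable(P[s], P[t], rel)}
        if pruned == rel:
            return rel
        rel = pruned

def te_metric(s, t):
    """d_tau(s, t): least candidate eps making s and t rho_eps-bisimilar.
    On a finite MDP the pairwise base distances suffice as candidates."""
    for eps in sorted({rho(u, v) for u in S for v in S}):
        if (s, t) in eps_bisimulation(eps):
            return eps

print(rho("a", "b"))        # 0: immediate rewards agree
print(te_metric("a", "b"))  # 1: the difference surfaces after one transition
```

The example shows the "temporal" aspect of the metric: the base distance ρ(a, b) is 0, but the TE metric charges the reward gap that appears one step later.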
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Comparing Bisimulation Metrics and Temporally Extended Metrics</head><p>In this section, we compare the temporally extended metrics with the bisimulation metrics. Results in this section are given in terms of the reward metric ρ, but can be generalized to arbitrary base metrics. The proofs (see Appendix) elucidate the very useful properties of liftings.</p><formula xml:id="formula_5">¹ s δ∼_α t and t δ∼_β w imply s δ∼_{α+β} w</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Bounds</head><p>Our first result relates the TE metric and the bisimulation metric with a bound. Theorem 5.1. The temporal extension of ρ upper bounds the bisimulation metric: ∀s, s′ ∈ S,</p><formula xml:id="formula_6">d∼(s, s′) ≤ d^ρ_τ(s, s′).</formula><p>The bound is tight, but equality need not hold, as Figure <ref type="figure" target="#fig_4">2</ref> shows. Consequently, using bounds from <ref type="bibr">(Ferns, Panangaden, and Precup 2004)</ref>, the TE metric gives a guarantee on the difference in optimal value functions and on the approximation error incurred by state abstraction.</p><p>Corollary 5.1.1. Let Ṽ be the value function in the abstract MDP of any abstraction φ, not necessarily a bisimulation.² Then, ∀s, s′ ∈ S:</p><formula xml:id="formula_7">|V*(s) − V*(s′)| ≤ (1 / (1 − γ)) d^ρ_τ(s, s′) and |Ṽ*(φ(s)) − V*(s)| ≤ (γ / (1 − γ)²) max_{φ(s′)} max_{s″ ∈ φ(s′)} (1 / |φ(s′)|) Σ_{s‴ ∈ φ(s′)} d^ρ_τ(s″, s‴)</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Optimal Couplings</head><p>In Figure <ref type="figure" target="#fig_4">2</ref>, the same coupling minimized both the bisimulation metric and the TE metric. Interestingly, the couplings chosen need not be the same in general. This is the content of the next theorem. Theorem 5.2. A minimum coupling λ ∈ argmin_{Λ(p_a(s), p_a(s′))} K(d∼)(p_a(s), p_a(s′)) of the bisimulation metric need not be a lifting of the optimal bisimulation ρ∼_{d^ρ_τ(s,s′)}. Conversely, a λ which lifts the optimal bisimulation ρ∼_{d^ρ_τ(s,s′)} need not be a minimizer of K(d∼)(p_a(s), p_a(s′)). This highlights the different behaviours of the two metrics: the TE metric aims to minimize the reward difference between coupled states at every step, so as to ensure that a single bisimulation relation holds, whereas the bisimulation metric does not preserve a single relation and is willing to couple large differences at an initial step. The couplings chosen by the bisimulation metric do not give a (generalized) bisimulation relation, and the best that one can do with a (generalized) bisimulation relation is given by the temporal extension.</p></div>
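Plugging numbers into the two transport costs computed in the proof of Theorem 5.2 (Appendix) makes the gap concrete. A small Python sketch, using illustrative values γ = 0.9 and ε = 0.05 (both hypothetical):

```python
def omega_cost(gamma, eps):
    """Cost of the coupling omega~ under d~:
    eps*d~(s1,t1) + (1-eps)*d~(s2,t2) = 2*eps*(2*(1-gamma)),
    as computed in the proof of Theorem 5.2."""
    return 2 * eps * (2 * (1 - gamma))

def lambda_cost(gamma, eps):
    """Cost of the lifting lambda_tau under d~:
    2*eps*((1-gamma)*(1+gamma+gamma**2))."""
    return 2 * eps * ((1 - gamma) * (1 + gamma + gamma ** 2))

gamma, eps = 0.9, 0.05
print(omega_cost(gamma, eps))   # 0.02 (up to float rounding)
print(lambda_cost(gamma, eps))  # ~0.0271
print(omega_cost(gamma, eps) < lambda_cost(gamma, eps))  # True
```

For these values ω∼ is the cheaper transport plan, yet only λ_τ lifts the smaller relation ρ∼_1; this is exactly the divergence between the two metrics that the theorem describes.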
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion</head><p>We have introduced the temporally extended metrics, a novel class of metrics for behavioural equivalence, which are based on a generalized notion of bisimulation. We have established bounds and other connections with the more familiar bisimulation metric, and seen that the two neither compute the same values nor pick out the same couplings of state distributions. (² We refer the reader to <ref type="bibr" target="#b3">(Li, Walsh, and Littman 2006)</ref> for background on abstract MDPs.) After completing this work we discovered that similar ideas for quantitative bisimulations have appeared in the control-theory literature <ref type="bibr" target="#b1">(Girard and Pappas 2007)</ref>, although in the different setting of non-deterministic (rather than probabilistic) transition systems without rewards. This work marks the beginning of an investigation into formally safe, computationally tractable, and model-free metrics for behavioural equivalence, and there are many interesting avenues for future work that we intend to pursue. On the computational side, the TE metric involves computing a bisimulation relation rather than a bisimulation metric, which can be done exactly in O(|A||S|³) via partition refinement, as opposed to being approximated up to a degree of accuracy ε in O(|A||S|⁴ log |S| log ε) <ref type="bibr">(Ferns, Panangaden, and Precup 2004)</ref>. Deriving an exact algorithm, however, is left for future work. The possibility of model-free computation is hypothesized since the metric requires only the existence of a lifting, as opposed to the exact weights of a coupling as the Kantorovich metric does, and thus should be easier to estimate from samples. A slightly disconcerting aspect of the TE metrics is their discontinuity with respect to the transition distributions, observed in Figure <ref type="figure" target="#fig_4">2</ref>. 
This is because we require exact couplings; to remedy this, we are currently investigating the use of approximate couplings, which have recently surfaced in the study of differential privacy <ref type="bibr" target="#b0">(Barthe et al. 2016)</ref>. Finally, as the general notions of R-bisimulations and δ-temporal extensions can be considered for arbitrary relations and metrics, examining the interplay between different notions of bisimulation (e.g. for optimal value functions or policy value functions) promises to be a fruitful direction.</p><p>Proof (of Theorem A.3). We proceed by induction, showing that ∀s, s′, d_n(s, s′) ≤</p><formula xml:id="formula_8">(1 − γ) Σ_{i=0}^{n} γ^i d^ρ_τ(s, s′). The base case is d_1(s, s′) = max_a {(1 − γ)|r(s, a) − r(s′, a)|} ≤ (1 − γ) d^ρ_τ(s, s′), using max_a |r(s, a) − r(s′, a)| ≤ d^ρ_τ(s, s′).</formula><p>For the induction step we upper bound the min-cost coupling in the minimization problem by the liftings λ_a ∈ Λ(p_a(s), p_a(s′)) given by ρ∼_{d^ρ_τ(s,s′)} (one can show the infima are always attained).</p><formula xml:id="formula_9">d_{n+1}(s, s′) = max_a {(1 − γ)|r(s, a) − r(s′, a)| + γ K(d_n)(p_a(s), p_a(s′))} ≤ (1 − γ) d^ρ_τ(s, s′) + γ Σ_{k,j} λ_a(s_k, s_j) d_n(s_k, s_j) ≤ (1 − γ) d^ρ_τ(s, s′) + γ Σ_{k,j} λ_a(s_k, s_j) (1 − γ) Σ_{i=0}^{n} γ^i d^ρ_τ(s_k, s_j) (induction hypothesis).</formula><formula xml:id="formula_10">Since λ_a lifts ρ∼_{d^ρ_τ(s,s′)}, every supported pair satisfies d^ρ_τ(s_k, s_j) = inf{ε | s_k ρ∼_ε s_j} ≤ d^ρ_τ(s, s′), ∀s_k, s_j. Hence d_{n+1}(s, s′) ≤ (1 − γ) d^ρ_τ(s, s′) + γ Σ_{k,j} λ_a(s_k, s_j) (1 − γ) Σ_{i=0}^{n} γ^i d^ρ_τ(s, s′) = (1 − γ) d^ρ_τ(s, s′) Σ_{i=0}^{n+1} γ^i</formula><p>Thus the inequality holds for all n. Taking limits finishes the proof:</p><formula xml:id="formula_11">d∼(s, s′) ≤ (1 − γ) · (1 / (1 − γ)) · d^ρ_τ(s, s′) = d^ρ_τ(s, s′),</formula><p>as desired.</p><p>Theorem A.4. A minimum coupling λ ∈ argmin_{Λ(p_a(s), p_a(s′))} K(d∼)(p_a(s), p_a(s′)) of the bisimulation metric need not be a lifting of the optimal bisimulation ρ∼_{d^ρ_τ(s,s′)}. 
Conversely, a λ which lifts the optimal bisimulation ρ∼_{d^ρ_τ(s,s′)} need not be a minimizer of K(d∼)(p_a(s), p_a(s′)).</p><p>Proof. Consider the following MDP, taking ρ as our base metric.</p><formula xml:id="formula_12">[Diagram omitted: two processes rooted at s_0 and t_0, each branching with probabilities ε and 1 − ε; rewards in brackets as referenced in the text below.]</formula><p>Consider the following two couplings ω∼, λ_τ ∈ Λ(p(s_0), p(t_0)). One can verify that ω∼ minimizes the bisimulation distance and that λ_τ minimizes the temporally extended metric. Indeed, Σ_{u,v∈S} ω∼(u, v) d∼(u, v) = ε d∼(s_1, t_1) + (1 − ε) d∼(s_2, t_2) = 2ε{2(1 − γ)}, and this coupling is optimal for K(d∼). Meanwhile, Σ_{u,v∈S} λ_τ(u, v) d∼(u, v) = ε d∼(s_1, t_2) + ε d∼(s_2, t_1) + (1 − 2ε) d∼(s_2, t_2) = ε d∼(s_1, t_2) + ε d∼(s_2, t_1) = 2ε{(1 − γ)(1 + γ + γ²)}. On the other hand, using ω∼, the best relation that can be lifted is ρ∼_2, since (s_1, t_1) ∈ support(ω∼) and r(s_1) − r(t_1) = 2. Meanwhile, λ_τ lifts ρ∼_1 and thus achieves the minimum lifting, since (s_1, t_2), (s_2, t_1), and (s_2, t_2) all have reward differences of at most 1. Thus ω∼ minimizes the bisimulation metric but not the temporal extension, and conversely, λ_τ minimizes the temporal extension, since it achieves the minimum lifting, but not the bisimulation metric.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Definition 2.1 (Bisimulation). A bisimulation relation U on S is an equivalence relation such that s U s′ implies: 1. ∀a ∈ A, r(s, a) = r(s′, a) and 2. ∀a ∈ A, ∀C ∈ S/U, p_a(s)(C) = p_a(s′)(C).</figDesc></figure>
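The exact O(|A||S|³)-style computation mentioned in the Discussion can be sketched by naive partition refinement. The following Python code (illustrative; the single-action MDP mirrors Figure 1, with the bracketed edge rewards simplified to state rewards) splits states first by reward and then repeatedly by transition mass into the current blocks:

```python
from fractions import Fraction as F

# Single-action MDP in the spirit of Figure 1 (state-based rewards are
# an illustrative simplification of the bracketed edge rewards).
R = {"s0": 0, "t0": 0, "s1": 1, "t1": 1, "s2": 0, "t2": 0, "t3": 0}
P = {"s0": {"s1": F(1, 3), "s2": F(2, 3)},
     "t0": {"t1": F(1, 3), "t2": F(1, 3), "t3": F(1, 3)},
     "s1": {"s2": F(1)}, "t1": {"t2": F(1)},
     "s2": {"s2": F(1)}, "t2": {"t2": F(1)}, "t3": {"t3": F(1)}}

def blocks(label):
    """Group states by their current block label."""
    groups = {}
    for s, b in label.items():
        groups.setdefault(b, set()).add(s)
    return {frozenset(g) for g in groups.values()}

def bisim_partition(R, P):
    """Partition refinement: start from the reward partition and split
    blocks until transition mass into every block agrees."""
    label = dict(R)  # initial labelling: reward values
    while True:
        sig = {}
        for s in R:
            mass = {}
            for t, w in P[s].items():
                mass[label[t]] = mass.get(label[t], 0) + w
            sig[s] = (label[s], tuple(sorted(mass.items())))
        order = sorted(set(sig.values()))
        new_label = {s: order.index(sig[s]) for s in R}
        if blocks(new_label) == blocks(label):
            return blocks(label)
        label = new_label

classes = bisim_partition(R, P)
print(sorted(sorted(c) for c in classes))
# [['s0', 't0'], ['s1', 't1'], ['s2', 't2', 't3']] -- matches S/U in the text
```

Each round is polynomial and at most |S| rounds are needed, in contrast with the iterative linear-programming computation of the bisimulation metric.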
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Rewards indicated in square brackets. Dashed arrows give the weights of the coupling of p(s 0 ) and p(t 0 ). Colours represent different equivalence classes of ∼.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>Definition 3.1 (R-bisimulation). Given a base relation R ⊆ S × S, an R-bisimulation relation U ⊆ S × S is a relation such that related states are R-related and their transition distributions are U-lifted. Formally, s U s′ implies: 1. s R s′ and 2. ∀a, p_a(s) (U)_# p_a(s′). We define R-bisimulation R∼ to be the largest R-bisimulation relation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Notation: we write s δ∼_ε s′ rather than s (δ_ε)∼ s′. An example of a quantitative relation is approximate reward equality, s ρ_ε s′ :⇔ max_a |r(s, a) − r(s′, a)| ≤ ε, derived from the base metric ρ(s, s′) = max_a |r(s, a) − r(s′, a)|.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Rewards indicated in square brackets. Dashed arrows give the weights of the coupling of p(s 0 ) and p(t 0 ).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>So d δ τ (s, t) = inf(B) ≤ inf(A) = inf(A1) + inf(A2) = d δ τ (s, w) + d δ τ (w, t). Theorem A.3. The temporal extension of ρ upper bounds the bisimulation metric: ∀s, s ∈ S, d∼(s, s ) ≤ d ρ τ (s, s ).</figDesc></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Proofs</head><p>Theorem A.1. A relation U is a bisimulation relation ⇐⇒ s U s′ implies: 1. ∀a, r(s, a) = r(s′, a) and 2. ∀a, p_a(s) (U)_# p_a(s′).</p><p>Proof. The first condition is immediate in both cases. For the forward implication, we pick the coupling λ_a(s′, t′) = p_a(s)(s′) p_a(t)(t′) / p_a(t)([s′]) if t′ ∈ [s′] and 0 otherwise, where [s′] is the equivalence class of s′. The marginals match since λ_a(s′, S) = p_a(s)(s′), and λ_a(S, t′) = p_a(s)([t′]) p_a(t)(t′) / p_a(t)([t′]) = p_a(t)(t′) by the second condition of bisimulation. To make λ_a a lifting of U we still need to check that support(λ_a) ⊆ U, which is evident since λ_a(s′, t′) is only non-zero when s′ U t′. For the converse, we use Strassen's theorem. Let C ∈ S/U; note that we have p_a(s)(C) ≤ p_a(t)(C) since U(C) = C. By symmetry of U, we also have p_a(t) (U)_# p_a(s), so that p_a(t)(C) ≤ p_a(s)(C). So s ∼ t.</p><p>Theorem A.2. Given a base pseudometric δ, the TE metric d^δ_τ is indeed a pseudometric on S.</p><p>Proof. 1. Note that s δ∼_0 s (via the identity coupling λ(s′, s′) = p_a(s)(s′)); transitivity uses the composed coupling λ_a(s′, t′) = Σ_{w′} λ_{a,1}(s′, w′) λ_{a,2}(w′, t′) / p_a(w)(w′).</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Advanced probabilistic couplings for differential privacy</title>
		<author>
			<persName><forename type="first">G</forename><surname>Barthe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Fong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gaboardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Grégoire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-Y</forename><surname>Strub</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security</title>
				<meeting>the 2016 ACM SIGSAC Conference on Computer and Communications Security</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Metrics for finite Markov decision processes</title>
		<author>
			<persName><forename type="first">N</forename><surname>Ferns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Panangaden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Precup</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence</title>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="162" to="169" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Approximation metrics for discrete and continuous systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Girard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">J</forename><surname>Pappas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Automatic Control</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="782" to="798" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Equivalence notions and model minimization in Markov decision processes</title>
		<author>
			<persName><forename type="first">R</forename><surname>Givan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Greig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">147</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="163" to="223" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Towards a unified theory of state abstraction for mdps</title>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">J</forename><surname>Walsh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Littman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">On Strassen&apos;s theorem on stochastic domination</title>
		<author>
			<persName><forename type="first">T</forename><surname>Lindvall</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electronic Communications in Probability</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="51" to="59" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A Calculus for Communicating Systems</title>
		<author>
			<persName><forename type="first">R</forename><surname>Milner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<biblScope unit="volume">92</biblScope>
			<date type="published" when="1980">1980</date>
			<publisher>Springer-Verlag</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
