Temporally Extended Metrics for Markov Decision Processes Philip Amortila Marc G. Bellemare Prakash Panangaden Doina Precup McGill University Google Brain∗ McGill University McGill University / DeepMind Abstract Specifically, the environment of an agent is a Markov Deci- sion Process (MDP) in which state similarity has been stud- Developing safe and efficient methods for state abstraction ied for a couple of decades. One long-studied notion used in reinforcement learning systems is an open research prob- to capture behavioral similarity of states in a Markov Deci- lem. We propose to address it by leveraging ideas from for- sion Process (MDP) is called bisimulation. Bisimulation has mal verification, namely, bisimulation. Specifically, we gen- eralize the notion of bisimulation by considering arbitrary originated in the fields of concurrency and formal verifica- comparisons between states instead of strict reward matching. tion for provably verifying the correctness and safety of pro- We further develop a notion of temporally extended metrics, cesses and systems (Milner 1980). An extension to MDPs which extend a base metric between states of an environment was proposed by (Givan, Dean, and Greig 2003). Bisimula- so as to reflect not just the current difference but the extent to tion is a canonical tool for analyzing the behaviour of transi- which the distance is preserved through the course of transi- tion systems and clustering equivalent states in overly large tions. We show that this property is not satisfied by bisimu- systems. In the context of MDPs, bisimulation requires an lation metrics, which were previously used to compare states exact match of both rewards and transition distributions. As with respect to their longterm rewards. A temporal extension this criterion is too brittle, a metric approach was developed can be defined for any base metric of interest, thus making by (Ferns, Panangaden, and Precup 2004), which allows one the construction very flexible. The kernel of the temporally extended metrics corresponds precisely to exact bisimulation to measure “how bisimilar” two states are. These bisimula- (thus these metrics form a larger class of bisimulation met- tion metrics assign distance of zero to states if and only if rics). We provide bounds relating bisimulation and tempo- they are bisimilar. Bisimulation metrics can be used for state rally extended metrics and also examine the couplings of state abstraction (by aggregating states that are ε away), and doing distributions which are induced. so provides one with formal (rather than statistical) safety guarantees on the difference between the true optimal value function and the approximated one. The bisimulation met- 1 Introduction ric can be computed iteratively, but each step requires one to solve a linear program involving the transition distribu- Building safe AI systems is crucial for a wide range of ap- tions for every pair of states, which is not only computation- plications. This is especially difficult when the system relies ally expensive but also requires a full model of the MDP. In on reinforcement learning, in which an agent learns from its this work, we investigate alternative metrics for behavioural own experience how to behave in the world. Because in most equivalence, with the goal of maintaining the useful theoret- applications of reinforcement learning, the environment is ical guarantees of bisimulation metrics while reducing the very large or perhaps continuous, an agent is required to ab- computational burden and the need for knowledge of the stract over its state space. However, this operation can re- model. sult in putting together states that actually have very differ- Our contributions are two-fold: firstly, we propose a ent values, which can lead to suboptimal or even dangerous coupling-based generalization of bisimulation which allows behaviour. This is especially true if an agent relies only on for greater flexibility in the comparisons between states (in- immediate observations to determine the similarity between stead of strict reward matching), and consequently in the states, rather than long-term predictions or behaviour. For properties being checked. Secondly, we consider the class of example, a camera mounted on a car may show two very quantitative bisimulations and show how this defines a no- similar images, but one may lead to the car going up the tion of temporally extended (TE) metrics. Intuitively, these curb and the other may be safe driving. metrics compute the minimum value of a chosen base met- In reinforcement learning, we can leverage some structure in ric for which the states s and s0 can remain in that range the problem formulation to attempt to tackle this problem. throughout their dynamics. The TE metrics assign distance 0 to states if and only if they are bisimilar, much like the bisim- ∗ CIFAR Fellow ulation metrics previously defined. However, both the con- Copyright held by author(s) struction and the resulting metric are quite different. The rest of the paper is organized as follows. In the next sec- see that s0 and t0 are bisimilar, take the equivalence classes tion we provide some necessary background. In Section 3 S/U = {{s0 , t0 }, {s1 , t1 }, {s2 , t2 , t3 }}. All states in each we characterize bisimulation via couplings and define the class receive the same rewards and transition with equal extension of this characterization to arbitrary relations. In probability to all other classes, so U is indeed a bisimula- Section 4 we define the temporally extended metrics. Sec- tion relation. Now consider the coupling λ of (p(s0 ), p(t0 )) tion 5 compares the two metrics by providing some bounds given by the dashed arrows, i.e. λ(s1 , t1 ) = λ(s2 , t2 ) = relating them and analyzing the couplings induced by the λ(s2 , t3 ) = 31 . This coupling is a lifting of U, since all sup- two metrics. Lastly, we wrap up with a discussion on the ported states are U-related. This is not a coincidence: the benefits and disadvantages of these metrics, highlighting di- fact that bisimulation is equivalent to the existence of a lift- rections for future work. ing is the subject of Section 3. Not all couplings are liftings: the trivial coupling ω(si , tj ) = p(s0 )(si )p(t0 )(tj ) is not a lifting of U. 2 Background 1, [1] 1, [1] Let D(S) denote the set of probability distributions on 1 S. A Markov decision process (MDP) is a 5-tuple M = 3 s1 t1 1 , S, A, {pa : S → D(S)}a∈A , r : S × A → R≥0 , γ where ] [0 3 [0 3 , 1 ] we emphasize that pa (s) ∈ D(S) is to be read as a proba- bility distribution over the states (one for each action). The 1 3 , [0] P from s to a set of states X is writ- transition probability s0 t2 t0 ten pa (s)(X) = x∈X pa (s)(x). In many practical appli- 1 cations, the state space of the MDP is simply too large to 1, [0] 3 1, [0] 1, [0] allow one to compute the value functions exactly without 1 2 , ] [0 3 the use of state abstraction. Bisimulation is a canonical ex- s2 3 [0 t3 3 , 1 ] ample of a safe abstraction, in that the optimal policy and optimal value functions will be preserved in the aggregated MDP (Li, Walsh, and Littman 2006). Bisimulation is defined Figure 1: Rewards indicated in square brackets. Dashed ar- in terms of equivalence classes, we write S/R for the set of rows give the weights of the coupling of p(s0 ) and p(t0 ). equivalence classes of an equivalence relation R. Colours represent different equivalence classes of ∼. Definition 2.1 (Bisimulation). A bisimulation relation U on S is an equivalence relation such that sUs0 implies: Previous work on metric extensions of bisimulation is built 1. ∀a ∈ A, r(s, a) = r(s0 , a) and on the Kantorovich metric K(d)(µ, ν) = minλ∈Λ(µ,ν) hλ, di 2. ∀a ∈ A, ∀C ∈ S/U, pa (s)(C) = pa (s0 )(C). (also called the Wasserstein metric). The bisimulation met- ric d∼ is the fixed point of the operator F (d)(s, s0 ) = We say that s and s0 are bisimilar and write s ∼ s0 if there maxa {(1 − γ)|r(s, a) − r(s0 , a)| + γK(d)(pa (s), pa (s0 ))}, is some bisimulation relation U relating them. we refer the reader to (Ferns, Panangaden, and Precup 2004) Thus bisimilar states will have equal rewards and there- for further background. after transition with equal probability to more bisimilar states. 3 Generalized Bisimulation via Liftings Given a relation R between states, the lifting (R)# of that relation allows one to naturally extend R to distributions on In bisimulation, the rewards match at the first step and there- states. Liftings are defined in terms of couplings, we will after the states transition such that the bisimulation relation later see that this notion will allow us to generalize bisimu- is preserved. This definition can also be captured in terms of lation. couplings: our first result is that bisimulation is equivalent to Definition 2.2 (Couplings and Liftings). A coupling of two the existence of a particular lifting of the states. distributions (µ, ν) on S is a joint distribution λ on S × S Theorem 3.1. A relation U is a bisimulation relation ⇐⇒ such that the marginals are µ and ν: sUs0 implies: λ(s, S) = µ(s) & ν(s0 ) = λ(S, s0 ). 1. ∀a r(s, a) = r(s0 , a) Moreover, a coupling of distributions (µ, ν) is a lifting of R 2. ∀a pa (s)(U)# pa (s0 ) if: The proofs of this result and later results are given in the λ(s, s0 ) > 0 ⇒ sRs0 , i.e. support(λ) ⊆ R. Appendix. The backward implication follows from the re- When there exists a lifting of µ and ν as above we write markable Strassen’s theorem on couplings (see e.g. (Lind- µ(R)# ν. vall 1999)), which implies that for any R, µ(R)# ν ⇐⇒ ∀A ⊆ S, µ(A) ≤ ν(R(A)). Applied to an equivalence A simple example: bisimulation and liftings We con- class C and using that U is symmetric gives the bisimula- sider the example of the bisimilar states in Figure 1. To tion property. Building on the previous result, one can readily generalize This construction gives well-defined pseudometrics. The the first condition by using a generic relation between states proof follows from the symmetry, transitivity, and additiv- instead of demanding that the rewards match. δ ity1 of the relations ∼ε (see Appendix for details). Definition 3.1 (R-bisimulation). Given a base relation R ⊆ Theorem 4.1. Given a base pseudometric δ, the TE metric S × S, an R-bisimulation relation U ⊆ S × S is a new dδτ is indeed a pseudometric on S. relation where the states are R-related and their transition distributions are U-lifted. Formally, sUs0 implies: Moreover, the temporally extended metrics assign distance 0 0 to states if and only if they are perfectly bisimilar in the base 1. sRs metric (i.e. they are δ0 -bisimilar). In the context of reward 2. ∀a pa (s)(U)# pa (s0 ) differences, this implies: Theorem 4.2. Classical bisimulation corresponds exactly to R We define R-bisimulation ∼ to be the largest R- the kernel of the temporal extension of ρ, i.e. bisimulation relation. s ∼ s0 ⇐⇒ dρτ (s, s0 ) = 0. This allows one to define arbitrary properties that are pre- For reward differences, the bisimulation metrics share this served by the dynamics of the MDP in a systematic way. We property, although our metrics are more general. Further- remark, firstly, that the second condition is much stronger more, despite the kernels matching, the TE metrics are not than merely requiring that the lifting is a coupling supported the same as the bisimulation metrics, both in construction by R, that is, pa (s)(R)# pa (s0 ). Requiring U to be lifted and in the distances they assign. also demands (in a coinductive manner) that the successor states after a transition are R-related and can themselves ex- 1, [1] 1, [1] R hibit an appropriate coupling. Secondly, we note that ∼ is 1 well-defined, since the union of R-bisimulation relations is 3 1 itself an R-bisimulation relation. The well-behavedness of s1 t1 3 ] + [0 R ε, R-bisimulations depends on their base relations, i.e. ∼ is re- 3 , 1 [0 flexive, symmetric, and transitive whenever R has the same ] 1 3 , [0] property (see Appendix). ε s0 t2 t0 1 1, [0] 4 Temporally Extended Metrics 3 1, [0] 1, [0] ] [0 1 3 −ε 2 , Although the framework presented in Section 3 is agnos- ε, 3 s2 [0 t3 3 − tic with respect to the base relation R, we will be focusing ] 1 on the setting of quantitative relations. These are relations parametrized by the use of a real number ε, which arise from a base pseudometric δ : S × S → R≥0 on states Figure 2: Rewards indicated in square brackets. Dashed ar- (or state-action pairs). More formally, given a base metric rows give the weights of the coupling of p(s0 ) and p(t0 ). δ : S × S → R≥0 , a quantitative relation δε is the rela- tion δε := δ −1 ([0, ε]) = {(s, s0 ) | δ(s, s0 ) ≤ ε}. We note A simple example revisited We consider the almost- the distinction between the metric δ and the relations δε de- bisimilar states in Figure 2, examined with ρ as the base rived from the metric. We call a bisimulation arising from metric. In this example, all states are ρ1 -bisimilar, but such a quantitative relation a quantitative bisimulation, and not ρ0 -bisimilar. This is captured by the metric: one δ δε needs to couple (s2 , t1 ), since the marginal onto t1 has will abuse notation by writing ∼ε rather than ∼. An exam- to equal 1/3 + ε and s1 only has 1/3 to spare. Since ple of quantitative relations is approximate reward equality (s2 , t1 ) ∈ support(λ) and |r(s2 ) − r(t1 )| = 1, then sρε s0 := maxa |r(s, a) − r(s0 , a)| ≤ ε, derived from the dρτ (s0 , t0 ) = 1. This example highlights the discontinuous base metric ρ(s, s0 ) = maxa |r(s, a) − r(s0 , a)|. behaviour of the TE metric – the states can either be coupled In the context of quantitative bisimulations, we can define a with rewards 0 or 1. new metric by taking an infimum over the ε parameter. We call these the temporally extended (TE) metrics. The TE met- ric finds the minimum ε such that the states are δε -bisimilar. 5 Comparing Bisimulation Metrics and That is, the two states are a distance of ε away (in the base metric δ) and can be coupled corecursively so that future Temporally Extended Metrics states are ε away and can themselves be coupled. A tempo- In this section, we compare the temporally extended met- ral extension can be defined for any base metric. rics with the bisimulation metrics. Results in this section are Definition 4.1 (TE metric). Given a base metric δ and a cor- given in terms of reward metric ρ but can be generalized to responding collection of quantitative relations {δε }ε≥0 , the arbitrary base metrics. The proofs (see Appendix) elucidate TE metric for δ is defined by the very useful properties of liftings. n o δ dδτ (s, s0 ) = inf ε | s ∼ε s0 . 1 δ δ s ∼α t & t ∼β w ⇒ s ∼α+β w δ 5.1 Bounds state distributions. After completing this work we discov- ered that similar ideas for quantitative bisimulations have Our first result relates the TE metric and the bisimulation successfully appeared in the control-theory literature (Gi- metric with a bound. rard and Pappas 2007), although in the different setting of Theorem 5.1. The temporal extension of ρ upper bounds non-deterministic (rather than probabilistic) transition sys- the bisimulation metric: ∀s, s0 ∈ S, tems without rewards. d∼ (s, s0 ) ≤ dρτ (s, s0 ). This work marks the beginning of an investigation into for- mally safe, computationally tractable, and model-free met- rics for behavioural equivalence. There are many interesting The bound is tight, and equality need not hold, as Figure avenues for future work that we intend to pursue. For com- 2 shows. Consequently, using bounds from (Ferns, Panan- putational aspects, the TE metric involves the computation gaden, and Precup 2004), the TE metric gives a guarantee of a bisimulation relation rather than a bisimulation metric, on the difference in optimal value functions and on the ap- which can be done exactly in O(|A||S|3 ) via partition re- proximation error for state-abstraction. finements as opposed to approximated upto a degree of ac- Corollary 5.1.1. Let V̂ be the value function in the abstract curacy ε in O(|A||S|4 log |S| log ε) (Ferns, Panangaden, and MDP of any abstraction φ, not necessarily a bisimulation.2 Precup 2004). Deriving an exact algorithm, however, is left Then, ∀s, s0 ∈ S: for future work. The possibility of model-free computation 1 is hypothesized since the metric requires only the existence |V ∗ (s)−V ∗ (s0 )| ≤ dρ (s, s0 ) and of a lifting, as opposed to the exact weights of a coupling as 1−γ τ does the Kantorovich metric, thus should be easier to esti- |V̂ ∗ (φ(s))−V ∗ (s)| ≤ mate from samples. γ 1 X A slightly disconcerting aspect of the TE metrics is their max max dρτ (s0 , s00 ) discontinuity with respect to the transition distributions, ob- (1 − γ)2 φ(s0 ) s0 ∈φ(s0 ) |φ(s0 )| 00 s ∈φ(s) served in Figure 2. This is because we require exact cou- plings - we are currently investigating the use of approxi- 5.2 Optimal Couplings mate couplings to remedy this, which have recently surfaced in the study of differential privacy (Barthe et al. 2016). In Figure 2, the same coupling minimized both the bisimu- Finally, as the general notions of R-bisimulations and δ- lation metric and the TE metric. Interestingly, the couplings temporal extensions can be considered for arbitrary rela- chosen need not be the same in general. This is the content tions and metrics, examining the interplay between differ- of the next theorem. ent notions of bisimulation (e.g. for optimal value functions Theorem 5.2. A minimum coupling λ ∈ or policy value functions) promises to be a fruitful direc- argminΛ(pa (s),pa (s0 )) K(d∼ )(pa (s), pa (s0 )) of the bisimu- tion. lation metric need not be a lifting of the optimal bisimulation ρ ∼dρτ (s,s0 ) . Conversely, λ which lifts the optimal bisimulation ρ References ∼dρτ (s,s0 ) need not be a minimizer of K(d∼ )(pa (s), pa (s0 )). This highlights the different behaviours of the two metrics – Barthe, G.; Fong, N.; Gaboardi, M.; Grégoire, B.; Hsu, J.; and the TE metric aims to minimize the reward difference be- Strub, P.-Y. 2016. Advanced probabilistic couplings for differential tween coupled states at every step so as to ensure that a privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 55–67. ACM. single bisimulation relation holds, whereas the bisimulation metric is not preserving a single relation and is willing to Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for finite couple large differences at an initial step. The couplings cho- Markov decision precesses. In Proceedings of the 20th Conference sen by the bisimulation metric do not give a (generalized) on Uncertainty in Artificial Intelligence, 162–169. bisimulation relation, and the best that one can do with a (generalized) bisimulation relation is given by the temporal Girard, A., and Pappas, G. J. 2007. Approximation metrics for extension. discrete and continuous systems. IEEE Transactions on Automatic Control 52(5):782–798. 6 Discussion Givan, R.; Dean, T.; and Greig, M. 2003. Equivalence notions and model minimization in Markov decision processes. Artificial We have introduced the temporally extended metrics, a novel Intelligence 147(1-2):163–223. class of metrics for behavioural equivalence, which are Li, L.; Walsh, T. J.; and Littman, M. L. 2006. Towards a unified based on a generalized notion of bisimulation. We have es- theory of state abstraction for mdps. tablished bounds and other connections with the more fa- miliar bisimulation metric, and seen that they neither com- Lindvall, T. 1999. On Strassen’s theorem on stochastic domination. pute the same values nor pick out the same couplings of Electronic communications in probability 4:51–59. 2 we refer the reader to (Li, Walsh, and Littman 2006) for back- Milner, R. 1980. A Calculus for Communicating Systems, vol- ground on abstract MDPs ume 92 of Lecture Notes in Computer Science. Springer-Verlag. ρ ρ A Proofs ∼dρτ (s,s0 ) . Thus (sk , sj ) ∈ ∼dρτ (s,s0 ) , and we conclude that ρ dρτ (sk , sj ) = inf ε sk ∼ε sj ≤ dρτ (s, s0 ), ∀sk , sj . Theorem A.1. A relation U is a bisimulation relation ⇐⇒ sUs0 implies: 1. ∀a r(s, a) = r(s0 , a) and 2. ∀a pa (s)(U)# pa (s0 ) dn+1 (s, s0 ) ≤ (1 − γ)dρτ (s, s0 ) n ! X X Proof. The first condition is immediate in both cases. For +γ λa (sk , sj ) (1 − γ) γ i dρτ (s, s0 ) the forward implication, we pick the coupling λa (s0 , t0 ) = k,j i=0 1[s0 U t0 ] pa (s)(s0 )pa (t)(t0 )/pa (s)([s0 ]), where [s0 ] := {t0 | s0 Ut0 } n+1 is the equivalencePclass of s0 . The marginals match since: X = (1 − γ)dρτ (s, s0 ) γi λa (s0 , S) = 0 0 0 t0 :s0 U t0 pa (s)(s )pa (t)(t )/pa (s)([s ]) = i=0 0 0 0 0 pa (s)(s )pa (t)([s ])/pa (s)([s ]) = pa (s)(s ), as pa (s)([s0 ]) = pa (t)([s0 ]) by the second condition of bisim- Thus the inequality holds for all n. Taking limits finishes the proof: ulation. Similarly for λa (S, t0 ). To make λa a lifting of U we 1−γ ρ still need to check that support(λa ) ⊆ U, which is evident d∼ (s, s0 ) ≤ dτ (s, s0 ) = dρτ (s, s0 ), since λa (s0 , t0 ) is only non-zero when s0 Ut0 . For the converse, 1−γ we use Strassen’s theorem. Let C ∈ S/U, note that we have as desired. pa (s)(C) ≤ pa (t)(C) since U(C) = C. By symmetry of U, we also have pa (t)(U)# pa (s), so that pa (t)(C) ≤ pa (s)(C). So Theorem A.4. A minimum coupling λ ∈ s ∼ t. argminΛ(pa (s),pa (s0 )) K(d∼ )(pa (s), pa (s0 )) of the bisimu- lation metric need not be a lifting of the optimal bisimulation Theorem A.2. Given a base pseudometric δ, the TE metric dδτ is ρ ∼dρτ (s,s0 ) . Conversely, λ which lifts the optimal bisimulation indeed a pseudometric on S. ρ ∼dρτ (s,s0 ) need not be a minimizer of K(d∼ )(pa (s), pa (s0 )). δ Proof. 1. Note that s ∼0 s (via the identity coupling λ(s0 , s00 ) = Proof. Consider the following MDP, taking ρ as our base metric. 1[s=s0 ] pa (s)(s )), thus dδτ (s, s) = inf ε≥0 {s ∼δ ε s} = 0. 0 δ δ 2. Note that s ∼ε t ⇒ t ∼ε s (via the mirror coupling ψ(t0 , s0 ) = 1 − ε, [0] s0 ε, [0] ε, [0] t0 1 − ε, [0] 0 0 δ δ λ(s , t )), thus dδτ (s, t) = inf ε {s ∼ε t} = inf ε {t ∼ε s} = s2 s1 t1 t2 dδτ (t, s). δ δ [1] [0] [2] [1] 3. Let A = A1 + A2 = {ε1 + ε2 |s ∼ε1 w & w ∼ε2 t}, and δ δ 0 0 B = {ε|s ∼ε t}. Note A ⊆ B since s ∼ε1 +ε2 t (via the transitive s2 s01 t01 t2 λa,1 (s0 ,w0 )λa,2 (w0 ,t0 ) coupling λa (s0 , t0 ) = P w0 ∈support(pa (w)) pa (w)(w0 ) ). [1] [1] So dδτ (s, t) = inf(B) ≤ inf(A) = inf(A1 ) + inf(A2 ) = 00 [0] [0] 00 s2 t2 dδτ (s, w) + dδτ (w, t). [1] [1] Theorem A.3. The temporal extension of ρ upper bounds the 000 000 s2 t2 bisimulation metric: ∀s, s0 ∈ S, d∼ (s, s0 ) ≤ dρτ (s, s0 ). [0] [0] 0 Proof. WeP proceed by induction, showing that ∀s, t, dn (s, s ) ≤ And consider the following two couplings ω∼ , λτ ∈ (1 − γ) n i ρ 0 i=0 γ dτ (s, s ). The base case is d1 (s, s ) 0 = Λ(p(s0 ), p(t0 )). 0 ρ 0 maxa {(1 − γ)|r(s, a) − r(s , a)|} ≤ (1 − γ)dτ (s, s ), using maxa |r(s, a) − r(s0 , a)| ≤ dρτ (s, s0 ). For the induction step we ω∼ s1 s2 λτ s1 s2 upper bound the min-cost coupling from the minimization problem t1 ε 0 t1 0 ε ρ with the liftings λa ∈ Λ (pa (s), pa (s0 )) given by ∼dρτ (s,s0 ) (one t2 0 1−ε t2 ε 1 − 2ε can show the infs are always attained). One can verify that ω∼ minimizes the bisimulation distance dn+1 (s, s0 ) = max{1 − γ)|r(s, a) − r(s0 , a)| and that λτ minimizes the temporally extended metric. Indeed, a P u,v∈S ω∼ (u, v)d∼ (u, v) = εd∼ (s1 , t1 ) + (1 − ε)d∼ (s2 , t2 ) = + γK(dn )(pa (s), pa (s0 ))} 2ε{2(1 − γ)} and this coupling is optimal for K(d∼ ). Meanwhile P u,v∈S λτ (u, v)d∼ (u, v) = εd∼ (s1 , t2 ) + εd∼ (s2 , t1 ) + (1 − X ≤ (1 − γ)dρτ (s, s0 ) + γ λa (sk , sj )dn (sk , sj ) k,j 2ε)d∼ (s2 , t2 ) = εd∼ (s1 , t2 ) + εd∼ (s2 , t1 ) = 2ε{(1 − γ)(1 + γ + γ 2 )}. ≤ (1 − γ)dρτ (s, s0 ) On the other hand, using ω∼ , the best relation that can be lifted is n ! ρ X X ∼2 , since (s1 , t1 ) ∈ support(ω∼ ) and r(s1 ) − r(t1 ) = 2. Mean- +γ λa (sk , sj ) (1 − γ) γ i dρτ (sk , sj ) ρ while, λτ lifts ∼1 and thus achieves the minimum lifting, since k,j i=0 (s1 , t2 ), (s2 , t1 ), (s2 , t2 ) all have reward differences of 1. Thus ω∼ (induction hypothesis) minimizes the bisimulation metric but not the temporal extension metric. Conversely, λτ minimizes the temporal extension metric since it achieves the minimum lifting, but not the bisimulation met- Now we use the lifting property: the only non-zero terms in ric. the summation are those for which (sk , sj ) ∈ support(λa ) ⊆