Temporally Extended Metrics for Markov Decision Processes

    Philip Amortila                      Marc G. Bellemare             Prakash Panangaden                     Doina Precup
     McGill University                        Google Brain∗                McGill University           McGill University / DeepMind


                              Abstract                                 Specifically, the environment of an agent is a Markov Deci-
                                                                       sion Process (MDP) in which state similarity has been stud-
  Developing safe and efficient methods for state abstraction          ied for a couple of decades. One long-studied notion used
  in reinforcement learning systems is an open research prob-          to capture behavioral similarity of states in a Markov Deci-
  lem. We propose to address it by leveraging ideas from for-          sion Process (MDP) is called bisimulation. Bisimulation has
  mal verification, namely, bisimulation. Specifically, we gen-
  eralize the notion of bisimulation by considering arbitrary
                                                                       originated in the fields of concurrency and formal verifica-
  comparisons between states instead of strict reward matching.        tion for provably verifying the correctness and safety of pro-
  We further develop a notion of temporally extended metrics,          cesses and systems (Milner 1980). An extension to MDPs
  which extend a base metric between states of an environment          was proposed by (Givan, Dean, and Greig 2003). Bisimula-
  so as to reflect not just the current difference but the extent to   tion is a canonical tool for analyzing the behaviour of transi-
  which the distance is preserved through the course of transi-        tion systems and clustering equivalent states in overly large
  tions. We show that this property is not satisfied by bisimu-        systems. In the context of MDPs, bisimulation requires an
  lation metrics, which were previously used to compare states         exact match of both rewards and transition distributions. As
  with respect to their longterm rewards. A temporal extension         this criterion is too brittle, a metric approach was developed
  can be defined for any base metric of interest, thus making          by (Ferns, Panangaden, and Precup 2004), which allows one
  the construction very flexible. The kernel of the temporally
  extended metrics corresponds precisely to exact bisimulation
                                                                       to measure “how bisimilar” two states are. These bisimula-
  (thus these metrics form a larger class of bisimulation met-         tion metrics assign distance of zero to states if and only if
  rics). We provide bounds relating bisimulation and tempo-            they are bisimilar. Bisimulation metrics can be used for state
  rally extended metrics and also examine the couplings of state       abstraction (by aggregating states that are ε away), and doing
  distributions which are induced.                                     so provides one with formal (rather than statistical) safety
                                                                       guarantees on the difference between the true optimal value
                                                                       function and the approximated one. The bisimulation met-
                      1    Introduction                                ric can be computed iteratively, but each step requires one
                                                                       to solve a linear program involving the transition distribu-
Building safe AI systems is crucial for a wide range of ap-            tions for every pair of states, which is not only computation-
plications. This is especially difficult when the system relies        ally expensive but also requires a full model of the MDP. In
on reinforcement learning, in which an agent learns from its           this work, we investigate alternative metrics for behavioural
own experience how to behave in the world. Because in most             equivalence, with the goal of maintaining the useful theoret-
applications of reinforcement learning, the environment is             ical guarantees of bisimulation metrics while reducing the
very large or perhaps continuous, an agent is required to ab-          computational burden and the need for knowledge of the
stract over its state space. However, this operation can re-           model.
sult in putting together states that actually have very differ-        Our contributions are two-fold: firstly, we propose a
ent values, which can lead to suboptimal or even dangerous             coupling-based generalization of bisimulation which allows
behaviour. This is especially true if an agent relies only on          for greater flexibility in the comparisons between states (in-
immediate observations to determine the similarity between             stead of strict reward matching), and consequently in the
states, rather than long-term predictions or behaviour. For            properties being checked. Secondly, we consider the class of
example, a camera mounted on a car may show two very                   quantitative bisimulations and show how this defines a no-
similar images, but one may lead to the car going up the               tion of temporally extended (TE) metrics. Intuitively, these
curb and the other may be safe driving.                                metrics compute the minimum value of a chosen base met-
In reinforcement learning, we can leverage some structure in           ric for which the states s and s0 can remain in that range
the problem formulation to attempt to tackle this problem.             throughout their dynamics. The TE metrics assign distance 0
                                                                       to states if and only if they are bisimilar, much like the bisim-
  ∗
    CIFAR Fellow                                                       ulation metrics previously defined. However, both the con-
Copyright held by author(s)                                            struction and the resulting metric are quite different.
The rest of the paper is organized as follows. In the next sec-   see that s0 and t0 are bisimilar, take the equivalence classes
tion we provide some necessary background. In Section 3           S/U = {{s0 , t0 }, {s1 , t1 }, {s2 , t2 , t3 }}. All states in each
we characterize bisimulation via couplings and define the         class receive the same rewards and transition with equal
extension of this characterization to arbitrary relations. In     probability to all other classes, so U is indeed a bisimula-
Section 4 we define the temporally extended metrics. Sec-         tion relation. Now consider the coupling λ of (p(s0 ), p(t0 ))
tion 5 compares the two metrics by providing some bounds          given by the dashed arrows, i.e. λ(s1 , t1 ) = λ(s2 , t2 ) =
relating them and analyzing the couplings induced by the          λ(s2 , t3 ) = 31 . This coupling is a lifting of U, since all sup-
two metrics. Lastly, we wrap up with a discussion on the          ported states are U-related. This is not a coincidence: the
benefits and disadvantages of these metrics, highlighting di-     fact that bisimulation is equivalent to the existence of a lift-
rections for future work.                                         ing is the subject of Section 3. Not all couplings are liftings:
                                                                  the trivial coupling ω(si , tj ) = p(s0 )(si )p(t0 )(tj ) is not a
                                                                  lifting of U.
                    2    Background                                                    1, [1]                1, [1]
Let D(S) denote the set of probability distributions on                                           1
S. A Markov decision process (MDP) is a 5-tuple M =                                               3
                                                                                       s1                         t1


                                                                                                                                1 ,
  S, A, {pa : S → D(S)}a∈A , r : S × A → R≥0 , γ where


                                                                              ]
                                                                            [0


                                                                                                                                  3
                                                                                                                                  [0
                                                                          3 ,
                                                                            1


                                                                                                                                     ]
we emphasize that pa (s) ∈ D(S) is to be read as a proba-
bility distribution over the states (one for each action). The                                                         1
                                                                                                                       3 , [0]
                   P from s to a set of states X is writ-
transition probability                                                    s0                                      t2                  t0
ten pa (s)(X) = x∈X pa (s)(x). In many practical appli-                                           1
cations, the state space of the MDP is simply too large to                           1, [0]       3


                                                                                                                       1, [0]
                                                                                                         1, [0]
allow one to compute the value functions exactly without                                          1


                                                                          2 ,


                                                                                                                                   ]
                                                                                                                                 [0
                                                                            3
the use of state abstraction. Bisimulation is a canonical ex-                          s2
                                                                                                  3


                                                                             [0
                                                                                                                  t3


                                                                                                                                3 ,
                                                                                                                                  1
                                                                                ]
ample of a safe abstraction, in that the optimal policy and
optimal value functions will be preserved in the aggregated
MDP (Li, Walsh, and Littman 2006). Bisimulation is defined        Figure 1: Rewards indicated in square brackets. Dashed ar-
in terms of equivalence classes, we write S/R for the set of      rows give the weights of the coupling of p(s0 ) and p(t0 ).
equivalence classes of an equivalence relation R.                 Colours represent different equivalence classes of ∼.
Definition 2.1 (Bisimulation). A bisimulation relation U on
S is an equivalence relation such that sUs0 implies:
                                                                  Previous work on metric extensions of bisimulation is built
1. ∀a ∈ A, r(s, a) = r(s0 , a) and
                                                                  on the Kantorovich metric K(d)(µ, ν) = minλ∈Λ(µ,ν) hλ, di
2. ∀a ∈ A, ∀C ∈ S/U, pa (s)(C) = pa (s0 )(C).                     (also called the Wasserstein metric). The bisimulation met-
                                                                  ric d∼ is the fixed point of the operator F (d)(s, s0 ) =
We say that s and s0 are bisimilar and write s ∼ s0 if there      maxa {(1 − γ)|r(s, a) − r(s0 , a)| + γK(d)(pa (s), pa (s0 ))},
is some bisimulation relation U relating them.                    we refer the reader to (Ferns, Panangaden, and Precup 2004)
Thus bisimilar states will have equal rewards and there-          for further background.
after transition with equal probability to more bisimilar
states.
                                                                      3    Generalized Bisimulation via Liftings
Given a relation R between states, the lifting (R)# of that
relation allows one to naturally extend R to distributions on     In bisimulation, the rewards match at the first step and there-
states. Liftings are defined in terms of couplings, we will       after the states transition such that the bisimulation relation
later see that this notion will allow us to generalize bisimu-    is preserved. This definition can also be captured in terms of
lation.                                                           couplings: our first result is that bisimulation is equivalent to
Definition 2.2 (Couplings and Liftings). A coupling of two        the existence of a particular lifting of the states.
distributions (µ, ν) on S is a joint distribution λ on S × S      Theorem 3.1. A relation U is a bisimulation relation ⇐⇒
such that the marginals are µ and ν:                              sUs0 implies:
          λ(s, S) = µ(s)     & ν(s0 ) = λ(S, s0 ).                1. ∀a r(s, a) = r(s0 , a)

Moreover, a coupling of distributions (µ, ν) is a lifting of R    2. ∀a pa (s)(U)# pa (s0 )
if:                                                               The proofs of this result and later results are given in the
      λ(s, s0 ) > 0 ⇒ sRs0 , i.e. support(λ) ⊆ R.                 Appendix. The backward implication follows from the re-
When there exists a lifting of µ and ν as above we write          markable Strassen’s theorem on couplings (see e.g. (Lind-
µ(R)# ν.                                                          vall 1999)), which implies that for any R, µ(R)# ν ⇐⇒
                                                                  ∀A ⊆ S, µ(A) ≤ ν(R(A)). Applied to an equivalence
A simple example: bisimulation and liftings We con-               class C and using that U is symmetric gives the bisimula-
sider the example of the bisimilar states in Figure 1. To         tion property.
Building on the previous result, one can readily generalize       This construction gives well-defined pseudometrics. The
the first condition by using a generic relation between states    proof follows from the symmetry, transitivity, and additiv-
instead of demanding that the rewards match.                                            δ
                                                                  ity1 of the relations ∼ε (see Appendix for details).
Definition 3.1 (R-bisimulation). Given a base relation R ⊆        Theorem 4.1. Given a base pseudometric δ, the TE metric
S × S, an R-bisimulation relation U ⊆ S × S is a new              dδτ is indeed a pseudometric on S.
relation where the states are R-related and their transition
distributions are U-lifted. Formally, sUs0 implies:               Moreover, the temporally extended metrics assign distance 0
       0
                                                                  to states if and only if they are perfectly bisimilar in the base
1. sRs                                                            metric (i.e. they are δ0 -bisimilar). In the context of reward
2. ∀a pa (s)(U)# pa (s0 )                                         differences, this implies:
                                                                  Theorem 4.2. Classical bisimulation corresponds exactly to
                                R
We define R-bisimulation ∼ to be the largest R-                   the kernel of the temporal extension of ρ, i.e.
bisimulation relation.                                                              s ∼ s0 ⇐⇒ dρτ (s, s0 ) = 0.
This allows one to define arbitrary properties that are pre-      For reward differences, the bisimulation metrics share this
served by the dynamics of the MDP in a systematic way. We         property, although our metrics are more general. Further-
remark, firstly, that the second condition is much stronger       more, despite the kernels matching, the TE metrics are not
than merely requiring that the lifting is a coupling supported    the same as the bisimulation metrics, both in construction
by R, that is, pa (s)(R)# pa (s0 ). Requiring U to be lifted      and in the distances they assign.
also demands (in a coinductive manner) that the successor
states after a transition are R-related and can themselves ex-                              1, [1]                       1, [1]
                                                          R
hibit an appropriate coupling. Secondly, we note that ∼ is
                                                                                                          1
well-defined, since the union of R-bisimulation relations is                                              3


                                                                                                                                            1
itself an R-bisimulation relation. The well-behavedness of                                   s1                           t1


                                                                                                                                           3
                                                                                   ]


                                                                                                                                               +
                                                                               [0
                                                       R


                                                                                                                                                ε,
R-bisimulations depends on their base relations, i.e. ∼ is re-


                                                                              3 ,
                                                                                1


                                                                                                                                                   [0
flexive, symmetric, and transitive whenever R has the same


                                                                                                                                                     ]
                                                                                                                                  1
                                                                                                                                  3 , [0]
property (see Appendix).                                                                                  ε
                                                                              s0                                          t2                       t0
                                                                                                          1
                                                                                            1, [0]
           4   Temporally Extended Metrics                                                                3


                                                                                                                                  1, [0]
                                                                                                                1, [0]


                                                                                                                                                    ]
                                                                                                                                                   [0
                                                                                                         1
                                                                                                         3 −ε
                                                                              2 ,


Although the framework presented in Section 3 is agnos-


                                                                                                                                                ε,
                                                                                3


                                                                                             s2
                                                                                [0


                                                                                                                          t3


                                                                                                                                           3 −
tic with respect to the base relation R, we will be focusing
                                                                                    ]


                                                                                                                                            1
on the setting of quantitative relations. These are relations
parametrized by the use of a real number ε, which arise
from a base pseudometric δ : S × S → R≥0 on states                Figure 2: Rewards indicated in square brackets. Dashed ar-
(or state-action pairs). More formally, given a base metric       rows give the weights of the coupling of p(s0 ) and p(t0 ).
δ : S × S → R≥0 , a quantitative relation δε is the rela-
tion δε := δ −1 ([0, ε]) = {(s, s0 ) | δ(s, s0 ) ≤ ε}. We note    A simple example revisited We consider the almost-
the distinction between the metric δ and the relations δε de-     bisimilar states in Figure 2, examined with ρ as the base
rived from the metric. We call a bisimulation arising from        metric. In this example, all states are ρ1 -bisimilar, but
such a quantitative relation a quantitative bisimulation, and     not ρ0 -bisimilar. This is captured by the metric: one
                                  δ               δε              needs to couple (s2 , t1 ), since the marginal onto t1 has
will abuse notation by writing ∼ε rather than ∼.     An exam-     to equal 1/3 + ε and s1 only has 1/3 to spare. Since
ple of quantitative relations is approximate reward equality      (s2 , t1 ) ∈ support(λ) and |r(s2 ) − r(t1 )| = 1, then
sρε s0 := maxa |r(s, a) − r(s0 , a)| ≤ ε, derived from the        dρτ (s0 , t0 ) = 1. This example highlights the discontinuous
base metric ρ(s, s0 ) = maxa |r(s, a) − r(s0 , a)|.               behaviour of the TE metric – the states can either be coupled
In the context of quantitative bisimulations, we can define a     with rewards 0 or 1.
new metric by taking an infimum over the ε parameter. We
call these the temporally extended (TE) metrics. The TE met-
ric finds the minimum ε such that the states are δε -bisimilar.       5       Comparing Bisimulation Metrics and
That is, the two states are a distance of ε away (in the base
metric δ) and can be coupled corecursively so that future
                                                                                Temporally Extended Metrics
states are ε away and can themselves be coupled. A tempo-         In this section, we compare the temporally extended met-
ral extension can be defined for any base metric.                 rics with the bisimulation metrics. Results in this section are
Definition 4.1 (TE metric). Given a base metric δ and a cor-      given in terms of reward metric ρ but can be generalized to
responding collection of quantitative relations {δε }ε≥0 , the    arbitrary base metrics. The proofs (see Appendix) elucidate
TE metric for δ is defined by                                     the very useful properties of liftings.
                                   n           o
                                          δ
                 dδτ (s, s0 ) = inf ε | s ∼ε s0 .                    1    δ             δ
                                                                         s ∼α t & t ∼β w ⇒ s ∼α+β w
                                                                                                     δ
5.1     Bounds                                                         state distributions. After completing this work we discov-
                                                                       ered that similar ideas for quantitative bisimulations have
Our first result relates the TE metric and the bisimulation            successfully appeared in the control-theory literature (Gi-
metric with a bound.                                                   rard and Pappas 2007), although in the different setting of
Theorem 5.1. The temporal extension of ρ upper bounds                  non-deterministic (rather than probabilistic) transition sys-
the bisimulation metric: ∀s, s0 ∈ S,                                   tems without rewards.
                       d∼ (s, s0 ) ≤ dρτ (s, s0 ).                     This work marks the beginning of an investigation into for-
                                                                       mally safe, computationally tractable, and model-free met-
                                                                       rics for behavioural equivalence. There are many interesting
The bound is tight, and equality need not hold, as Figure
                                                                       avenues for future work that we intend to pursue. For com-
2 shows. Consequently, using bounds from (Ferns, Panan-
                                                                       putational aspects, the TE metric involves the computation
gaden, and Precup 2004), the TE metric gives a guarantee
                                                                       of a bisimulation relation rather than a bisimulation metric,
on the difference in optimal value functions and on the ap-
                                                                       which can be done exactly in O(|A||S|3 ) via partition re-
proximation error for state-abstraction.
                                                                       finements as opposed to approximated upto a degree of ac-
Corollary 5.1.1. Let V̂ be the value function in the abstract          curacy ε in O(|A||S|4 log |S| log ε) (Ferns, Panangaden, and
MDP of any abstraction φ, not necessarily a bisimulation.2             Precup 2004). Deriving an exact algorithm, however, is left
Then, ∀s, s0 ∈ S:                                                      for future work. The possibility of model-free computation
                               1                                       is hypothesized since the metric requires only the existence
      |V ∗ (s)−V ∗ (s0 )| ≤      dρ (s, s0 )         and               of a lifting, as opposed to the exact weights of a coupling as
                              1−γ τ
                                                                       does the Kantorovich metric, thus should be easier to esti-
|V̂ ∗ (φ(s))−V ∗ (s)| ≤                                                mate from samples.
                  γ                        1       X                   A slightly disconcerting aspect of the TE metrics is their
                        max    max                   dρτ (s0 , s00 )   discontinuity with respect to the transition distributions, ob-
             (1 − γ)2 φ(s0 ) s0 ∈φ(s0 ) |φ(s0 )| 00
                                                      s ∈φ(s)          served in Figure 2. This is because we require exact cou-
                                                                       plings - we are currently investigating the use of approxi-
5.2     Optimal Couplings                                              mate couplings to remedy this, which have recently surfaced
                                                                       in the study of differential privacy (Barthe et al. 2016).
In Figure 2, the same coupling minimized both the bisimu-              Finally, as the general notions of R-bisimulations and δ-
lation metric and the TE metric. Interestingly, the couplings          temporal extensions can be considered for arbitrary rela-
chosen need not be the same in general. This is the content            tions and metrics, examining the interplay between differ-
of the next theorem.                                                   ent notions of bisimulation (e.g. for optimal value functions
Theorem 5.2. A minimum coupling λ                               ∈      or policy value functions) promises to be a fruitful direc-
argminΛ(pa (s),pa (s0 )) K(d∼ )(pa (s), pa (s0 )) of the bisimu-       tion.
lation metric need not be a lifting of the optimal bisimulation
 ρ
∼dρτ (s,s0 ) . Conversely, λ which lifts the optimal bisimulation
 ρ                                                                                               References
∼dρτ (s,s0 ) need not be a minimizer of K(d∼ )(pa (s), pa (s0 )).
This highlights the different behaviours of the two metrics –          Barthe, G.; Fong, N.; Gaboardi, M.; Grégoire, B.; Hsu, J.; and
the TE metric aims to minimize the reward difference be-               Strub, P.-Y. 2016. Advanced probabilistic couplings for differential
tween coupled states at every step so as to ensure that a              privacy. In Proceedings of the 2016 ACM SIGSAC Conference on
                                                                       Computer and Communications Security, 55–67. ACM.
single bisimulation relation holds, whereas the bisimulation
metric is not preserving a single relation and is willing to           Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for finite
couple large differences at an initial step. The couplings cho-        Markov decision precesses. In Proceedings of the 20th Conference
sen by the bisimulation metric do not give a (generalized)             on Uncertainty in Artificial Intelligence, 162–169.
bisimulation relation, and the best that one can do with a
(generalized) bisimulation relation is given by the temporal           Girard, A., and Pappas, G. J. 2007. Approximation metrics for
extension.                                                             discrete and continuous systems. IEEE Transactions on Automatic
                                                                       Control 52(5):782–798.

                         6     Discussion                              Givan, R.; Dean, T.; and Greig, M. 2003. Equivalence notions
                                                                       and model minimization in Markov decision processes. Artificial
We have introduced the temporally extended metrics, a novel            Intelligence 147(1-2):163–223.
class of metrics for behavioural equivalence, which are                Li, L.; Walsh, T. J.; and Littman, M. L. 2006. Towards a unified
based on a generalized notion of bisimulation. We have es-             theory of state abstraction for mdps.
tablished bounds and other connections with the more fa-
miliar bisimulation metric, and seen that they neither com-            Lindvall, T. 1999. On Strassen’s theorem on stochastic domination.
pute the same values nor pick out the same couplings of                Electronic communications in probability 4:51–59.
   2
     we refer the reader to (Li, Walsh, and Littman 2006) for back-    Milner, R. 1980. A Calculus for Communicating Systems, vol-
ground on abstract MDPs                                                ume 92 of Lecture Notes in Computer Science. Springer-Verlag.
                                                                                         ρ                                               ρ
                                  A    Proofs                                            ∼dρτ (s,s0 ) . Thus (sk , sj ) ∈ ∼dρτ (s,s0 ) , and we conclude that
                                                                                                                         ρ
                                                                                         dρτ (sk , sj ) = inf ε sk ∼ε sj ≤ dρτ (s, s0 ), ∀sk , sj .
Theorem A.1. A relation U is a bisimulation relation ⇐⇒ sUs0
implies: 1. ∀a r(s, a) = r(s0 , a) and 2. ∀a pa (s)(U)# pa (s0 )                             dn+1 (s, s0 ) ≤ (1 − γ)dρτ (s, s0 )
                                                                                                                                                              n
                                                                                                                                                                                       !
                                                                                                                         X                                    X
Proof. The first condition is immediate in both cases. For                                                         +γ              λa (sk , sj ) (1 − γ)            γ i dρτ (s, s0 )
the forward implication, we pick the coupling λa (s0 , t0 ) =                                                            k,j                                  i=0
1[s0 U t0 ] pa (s)(s0 )pa (t)(t0 )/pa (s)([s0 ]), where [s0 ] := {t0 | s0 Ut0 }                                                              n+1
is the equivalencePclass of s0 . The marginals match since:
                                                                                                                                             X
                                                                                                               = (1 − γ)dρτ (s, s0 )               γi
λa (s0 , S)       =                             0         0            0
                            t0 :s0 U t0 pa (s)(s )pa (t)(t )/pa (s)([s ])   =                                                                i=0
            0           0                0                            0
pa (s)(s )pa (t)([s ])/pa (s)([s ])                =         pa (s)(s ),    as
pa (s)([s0 ]) = pa (t)([s0 ]) by the second condition of bisim-                          Thus the inequality holds for all n. Taking limits finishes the proof:
ulation. Similarly for λa (S, t0 ). To make λa a lifting of U we                                                                   1−γ ρ
still need to check that support(λa ) ⊆ U, which is evident                                              d∼ (s, s0 ) ≤                 dτ (s, s0 ) = dρτ (s, s0 ),
since λa (s0 , t0 ) is only non-zero when s0 Ut0 . For the converse,                                                               1−γ
we use Strassen’s theorem. Let C ∈ S/U, note that we have                                as desired.
pa (s)(C) ≤ pa (t)(C) since U(C) = C. By symmetry of U, we
also have pa (t)(U)# pa (s), so that pa (t)(C) ≤ pa (s)(C). So                           Theorem          A.4.    A    minimum        coupling   λ       ∈
s ∼ t.                                                                                   argminΛ(pa (s),pa (s0 )) K(d∼ )(pa (s), pa (s0 )) of the bisimu-
                                                                                         lation metric need not be a lifting of the optimal bisimulation
Theorem A.2. Given a base pseudometric δ, the TE metric dδτ is                            ρ
                                                                                         ∼dρτ (s,s0 ) . Conversely, λ which lifts the optimal bisimulation
indeed a pseudometric on S.                                                              ρ
                                                                                         ∼dρτ (s,s0 ) need not be a minimizer of K(d∼ )(pa (s), pa (s0 )).
                             δ
Proof. 1. Note that s ∼0 s (via the identity coupling λ(s0 , s00 ) =                     Proof. Consider the following MDP, taking ρ as our base metric.
1[s=s0 ] pa (s)(s )), thus dδτ (s, s) = inf ε≥0 {s ∼δ ε s} = 0.
                  0

                  δ              δ
2. Note that s ∼ε t ⇒ t ∼ε s (via the mirror coupling ψ(t0 , s0 ) =                              1 − ε, [0]        s0        ε, [0]                ε, [0]     t0     1 − ε, [0]
    0   0                                     δ                          δ
λ(s , t )), thus dδτ (s, t)      = inf ε {s ∼ε t} = inf ε {t ∼ε s} =
                                                                                                        s2                   s1                       t1             t2
dδτ (t, s).
                                                   δ                 δ                                       [1]         [0]                            [2]                [1]
3. Let A = A1 + A2 = {ε1 + ε2 |s ∼ε1 w & w ∼ε2 t}, and
             δ                                    δ                                                      0                                                            0
B = {ε|s ∼ε t}. Note A ⊆ B since s ∼ε1 +ε2 t (via the transitive                                        s2                   s01                      t01            t2
                                              λa,1 (s0 ,w0 )λa,2 (w0 ,t0 )
coupling λa (s0 , t0 ) =
                         P
                          w0 ∈support(pa (w))        pa (w)(w0 )
                                                                           ).                                [1]                                                           [1]
So dδτ (s, t) = inf(B) ≤ inf(A) = inf(A1 ) + inf(A2 ) =                                                  00                  [0]                     [0]              00
                                                                                                        s2                                                           t2
dδτ (s, w) + dδτ (w, t).
                                                                                                             [1]                                                           [1]
Theorem A.3. The temporal extension of ρ upper bounds the                                                000                                                          000
                                                                                                       s2                                                            t2
bisimulation metric: ∀s, s0 ∈ S,
                            d∼ (s, s0 ) ≤ dρτ (s, s0 ).                                                 [0]                                                          [0]

                                                                                0
Proof. WeP  proceed by induction, showing that ∀s, t, dn (s, s ) ≤
                                                                                         And consider the following two couplings ω∼ , λτ                                                  ∈
(1 − γ) n          i ρ      0
              i=0 γ dτ (s, s ). The base case is d1 (s, s )
                                                              0
                                                                    =                    Λ(p(s0 ), p(t0 )).
                                0                     ρ     0
maxa {(1 − γ)|r(s, a) − r(s , a)|} ≤ (1 − γ)dτ (s, s ), using
maxa |r(s, a) − r(s0 , a)| ≤ dρτ (s, s0 ). For the induction step we                                    ω∼          s1          s2            λτ        s1       s2
upper bound the min-cost coupling from the minimization problem                                         t1          ε           0             t1        0         ε
                                                       ρ
with the liftings λa ∈ Λ (pa (s), pa (s0 )) given by ∼dρτ (s,s0 ) (one                                  t2          0          1−ε            t2        ε      1 − 2ε
can show the infs are always attained).
                                                                                         One can verify that ω∼ minimizes the bisimulation distance
 dn+1 (s, s0 ) = max{1 − γ)|r(s, a) − r(s0 , a)|                                         and that λτ minimizes the temporally extended metric. Indeed,
                       a                                                                 P
                                                                                            u,v∈S ω∼ (u, v)d∼ (u, v) = εd∼ (s1 , t1 ) + (1 − ε)d∼ (s2 , t2 ) =
                 + γK(dn )(pa (s), pa (s0 ))}                                            2ε{2(1    − γ)} and this coupling is optimal for K(d∼ ). Meanwhile
                                                                                         P
                                                                                            u,v∈S λτ (u, v)d∼ (u, v) = εd∼ (s1 , t2 ) + εd∼ (s2 , t1 ) + (1 −
                                           X
                 ≤ (1 − γ)dρτ (s, s0 ) + γ    λa (sk , sj )dn (sk , sj )
                                                  k,j                                    2ε)d∼ (s2 , t2 ) = εd∼ (s1 , t2 ) + εd∼ (s2 , t1 ) = 2ε{(1 − γ)(1 +
                                                                                         γ + γ 2 )}.
                 ≤ (1 − γ)dρτ (s, s0 )                                                   On the other hand, using ω∼ , the best relation that can be lifted is
                                                          n
                                                                                     !    ρ
                           X                              X                              ∼2 , since (s1 , t1 ) ∈ support(ω∼ ) and r(s1 ) − r(t1 ) = 2. Mean-
                      +γ         λa (sk , sj ) (1 − γ)          γ i dρτ (sk , sj )                             ρ
                                                                                         while, λτ lifts ∼1 and thus achieves the minimum lifting, since
                           k,j                            i=0
                                                                                         (s1 , t2 ), (s2 , t1 ), (s2 , t2 ) all have reward differences of 1. Thus ω∼
                                                        (induction hypothesis)
                                                                                         minimizes the bisimulation metric but not the temporal extension
                                                                                         metric. Conversely, λτ minimizes the temporal extension metric
                                                                                         since it achieves the minimum lifting, but not the bisimulation met-
Now we use the lifting property: the only non-zero terms in                              ric.
the summation are those for which (sk , sj ) ∈ support(λa ) ⊆