<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Temporally Extended Metrics for Markov Decision Processes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marc G. Bellemare Google Brain</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Doina Precup McGill University / DeepMind</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Philip Amortila McGill University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Prakash Panangaden McGill University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Developing safe and efficient methods for state abstraction in reinforcement learning systems is an open research problem. We propose to address it by leveraging ideas from formal verification, namely, bisimulation. Specifically, we generalize the notion of bisimulation by considering arbitrary comparisons between states instead of strict reward matching. We further develop a notion of temporally extended metrics, which extend a base metric between states of an environment so as to reflect not just the current difference but the extent to which the distance is preserved through the course of transitions. We show that this property is not satisfied by bisimulation metrics, which were previously used to compare states with respect to their longterm rewards. A temporal extension can be defined for any base metric of interest, thus making the construction very flexible. The kernel of the temporally extended metrics corresponds precisely to exact bisimulation (thus these metrics form a larger class of bisimulation metrics). We provide bounds relating bisimulation and temporally extended metrics and also examine the couplings of state distributions which are induced.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Building safe AI systems is crucial for a wide range of
applications. This is especially difficult when the system relies
on reinforcement learning, in which an agent learns from its
own experience how to behave in the world. Because in most
applications of reinforcement learning, the environment is
very large or perhaps continuous, an agent is required to
abstract over its state space. However, this operation can
result in putting together states that actually have very
different values, which can lead to suboptimal or even dangerous
behaviour. This is especially true if an agent relies only on
immediate observations to determine the similarity between
states, rather than long-term predictions or behaviour. For
example, a camera mounted on a car may show two very
similar images, but one may lead to the car going up the
curb and the other may be safe driving.</p>
      <p>In reinforcement learning, we can leverage some structure in
the problem formulation to attempt to tackle this problem.</p>
      <p>
        CIFAR Fellow
Copyright held by author(s)
Specifically, the environment of an agent is a Markov
Decision Process (MDP) in which state similarity has been
studied for a couple of decades. One long-studied notion used
to capture behavioral similarity of states in a Markov
Decision Process (MDP) is called bisimulation. Bisimulation has
originated in the fields of concurrency and formal
verification for provably verifying the correctness and safety of
processes and systems
        <xref ref-type="bibr" rid="ref8">(Milner 1980)</xref>
        . An extension to MDPs
was proposed by
        <xref ref-type="bibr" rid="ref4">(Givan, Dean, and Greig 2003)</xref>
        .
Bisimulation is a canonical tool for analyzing the behaviour of
transition systems and clustering equivalent states in overly large
systems. In the context of MDPs, bisimulation requires an
exact match of both rewards and transition distributions. As
this criterion is too brittle, a metric approach was developed
by
        <xref ref-type="bibr" rid="ref2">(Ferns, Panangaden, and Precup 2004)</xref>
        , which allows one
to measure “how bisimilar” two states are. These
bisimulation metrics assign distance of zero to states if and only if
they are bisimilar. Bisimulation metrics can be used for state
abstraction (by aggregating states that are " away), and doing
so provides one with formal (rather than statistical) safety
guarantees on the difference between the true optimal value
function and the approximated one. The bisimulation
metric can be computed iteratively, but each step requires one
to solve a linear program involving the transition
distributions for every pair of states, which is not only
computationally expensive but also requires a full model of the MDP. In
this work, we investigate alternative metrics for behavioural
equivalence, with the goal of maintaining the useful
theoretical guarantees of bisimulation metrics while reducing the
computational burden and the need for knowledge of the
model.
      </p>
      <p>Our contributions are two-fold: firstly, we propose a
coupling-based generalization of bisimulation which allows
for greater flexibility in the comparisons between states
(instead of strict reward matching), and consequently in the
properties being checked. Secondly, we consider the class of
quantitative bisimulations and show how this defines a
notion of temporally extended (TE) metrics. Intuitively, these
metrics compute the minimum value of a chosen base
metric for which the states s and s0 can remain in that range
throughout their dynamics. The TE metrics assign distance 0
to states if and only if they are bisimilar, much like the
bisimulation metrics previously defined. However, both the
construction and the resulting metric are quite different.
The rest of the paper is organized as follows. In the next
section we provide some necessary background. In Section 3
we characterize bisimulation via couplings and define the
extension of this characterization to arbitrary relations. In
Section 4 we define the temporally extended metrics.
Section 5 compares the two metrics by providing some bounds
relating them and analyzing the couplings induced by the
two metrics. Lastly, we wrap up with a discussion on the
benefits and disadvantages of these metrics, highlighting
directions for future work.</p>
      <p>2</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Let D(S ) denote the set of probability distributions on
SS. ;AA;Mfaprako:vS d!eciDsi(oSn)pgrao2cAe;srs :(MS DPA)i!saR5-t0u;ple Mwhe=re
we emphasize that pa(s) 2 D(S ) is to be read as a
probability distribution over the states (one for each action). The
transition probability from s to a set of states X is
written pa(s)(X ) = P
cations, the state spaxc2eXopfat(hse)(Mx)D.IPn
imsasinmypplyractotiocallaragpepltioallow one to compute the value functions exactly without
the use of state abstraction. Bisimulation is a canonical
example of a safe abstraction, in that the optimal policy and
optimal value functions will be preserved in the aggregated
MDP
        <xref ref-type="bibr" rid="ref5">(Li, Walsh, and Littman 2006)</xref>
        . Bisimulation is defined
in terms of equivalence classes, we write S=R for the set of
equivalence classes of an equivalence relation R.
Definition 2.1 (Bisimulation). A bisimulation relation U on
S is an equivalence relation such that sU s0 implies:
s0 if there
1. 8a 2 A; r(s; a) = r(s0; a) and
2. 8a 2 A; 8C 2 S=U ; pa(s)(C) = pa(s0)(C):
We say that s and s0 are bisimilar and write s
is some bisimulation relation U relating them.
      </p>
      <p>Thus bisimilar states will have equal rewards and
thereafter transition with equal probability to more bisimilar
states.</p>
      <p>Given a relation R between states, the lifting (R)# of that
relation allows one to naturally extend R to distributions on
states. Liftings are defined in terms of couplings, we will
later see that this notion will allow us to generalize
bisimulation.</p>
      <p>Definition 2.2 (Couplings and Liftings). A coupling of two
distributions ( ; ) on S is a joint distribution on S S
such that the marginals are and :
(s; S ) =
(s)
&amp;
(s0) =
(S ; s0):
Moreover, a coupling of distributions ( ; ) is a lifting of R
if:
(s; s0) &gt; 0 ) sRs0, i.e. support( )
R:</p>
      <sec id="sec-2-1">
        <title>When there exists a lifting of</title>
        <p>
          (R)# .
and
as above we write
A simple example: bisimulation and liftings We
consider the example of the bisimilar states in Figure 1. To
see that s0 and t0 are bisimilar, take the equivalence classes
S =U = ffs0; t0g; fs1; t1g; fs2; t2; t3gg. All states in each
class receive the same rewards and transition with equal
probability to all other classes, so U is indeed a
bisimulation relation. Now consider the coupling of (p(s0); p(t0))
given by the dashed arrows, i.e. (s1; t1) = (s2; t2) =
(s2; t3) = 13 . This coupling is a lifting of U , since all
supported states are U -related. This is not a coincidence: the
fact that bisimulation is equivalent to the existence of a
lifting is the subject of Section 3. Not all couplings are liftings:
the trivial coupling !(si; tj ) = p(s0)(si)p(t0)(tj ) is not a
lifting of U .
Previous work on metric extensions of bisimulation is built
on the Kantorovich metric K(d)( ; ) = min 2 ( ; )h ; di
(also called the Wasserstein metric). The bisimulation
metric d is the fixed point of the operator F (d)(s; s0) =
maxaf(1 )jr(s; a) r(s0; a)j + K(d)(pa(s); pa(s0))g;
we refer the reader to
          <xref ref-type="bibr" rid="ref2">(Ferns, Panangaden, and Precup 2004)</xref>
          for further background.
        </p>
        <p>3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Generalized Bisimulation via Liftings</title>
      <p>In bisimulation, the rewards match at the first step and
thereafter the states transition such that the bisimulation relation
is preserved. This definition can also be captured in terms of
couplings: our first result is that bisimulation is equivalent to
the existence of a particular lifting of the states.</p>
      <p>
        Theorem 3.1. A relation U is a bisimulation relation ()
sU s0 implies:
1. 8a r(s; a) = r(s0; a)
2. 8a pa(s)(U )#pa(s0)
The proofs of this result and later results are given in the
Appendix. The backward implication follows from the
remarkable Strassen’s theorem on couplings (see e.g.
        <xref ref-type="bibr" rid="ref6">(Lindvall 1999)</xref>
        ), which implies that for any R, (R)# ()
8A S ; (A) (R(A)). Applied to an equivalence
class C and using that U is symmetric gives the
bisimulation property.
      </p>
      <p>Building on the previous result, one can readily generalize
the first condition by using a generic relation between states
instead of demanding that the rewards match.</p>
      <p>Definition 3.1 (R-bisimulation). Given a base relation R
S S , an R-bisimulation relation U S S is a new
relation where the states are R-related and their transition
distributions are U -lifted. Formally, sU s0 implies:
1. sRs0
This allows one to define arbitrary properties that are
preserved by the dynamics of the MDP in a systematic way. We
remark, firstly, that the second condition is much stronger
than merely requiring that the lifting is a coupling supported
by R, that is, pa(s)(R)#pa(s0). Requiring U to be lifted
also demands (in a coinductive manner) that the successor
states after a transition are R-related and can themselves
exhibit an appropriate coupling. Secondly, we note that R is
well-defined, since the union of R-bisimulation relations is
itself an R-bisimulation relation. The well-behavedness of
R-bisimulations depends on their base relations, i.e. R is
reflexive, symmetric, and transitive whenever R has the same
property (see Appendix).</p>
      <p>4</p>
    </sec>
    <sec id="sec-4">
      <title>Temporally Extended Metrics</title>
      <p>Although the framework presented in Section 3 is
agnostic with respect to the base relation R, we will be focusing
on the setting of quantitative relations. These are relations
parametrized by the use of a real number ", which arise
from a base pseudometric : S S ! R 0 on states
(or state-action pairs). More formally, given a base metric
: S S ! R 0, a quantitative relation " is the
relation " := 1 ([0; "]) = f(s; s0) j (s; s0) "g. We note
the distinction between the metric and the relations "
derived from the metric. We call a bisimulation arising from
such a quantitative relation a quantitative bisimulation, and
will abuse notation by writing " rather than ". An
example of quantitative relations is approximate reward equality
s "s0 := maxa jr(s; a) r(s0; a)j "; derived from the
base metric (s; s0) = maxa jr(s; a) r(s0; a)j.
In the context of quantitative bisimulations, we can define a
new metric by taking an infimum over the " parameter. We
call these the temporally extended (TE) metrics. The TE
metric finds the minimum " such that the states are "-bisimilar.
That is, the two states are a distance of " away (in the base
metric ) and can be coupled corecursively so that future
states are " away and can themselves be coupled. A
temporal extension can be defined for any base metric.
Definition 4.1 (TE metric). Given a base metric and a
corresponding collection of quantitative relations f "g" 0, the
TE metric for is defined by
n
d (s; s0) = inf " j s</p>
      <p>o
" s0 :</p>
      <p>This construction gives well-defined pseudometrics. The
proof follows from the symmetry, transitivity, and
additivity1 of the relations " (see Appendix for details).
Theorem 4.1. Given a base pseudometric , the TE metric
d is indeed a pseudometric on S .</p>
      <p>Moreover, the temporally extended metrics assign distance 0
to states if and only if they are perfectly bisimilar in the base
metric (i.e. they are 0-bisimilar). In the context of reward
differences, this implies:
Theorem 4.2. Classical bisimulation corresponds exactly to
the kernel of the temporal extension of , i.e.</p>
      <p>s
s0
()</p>
      <p>d (s; s0) = 0:
For reward differences, the bisimulation metrics share this
property, although our metrics are more general.
Furthermore, despite the kernels matching, the TE metrics are not
the same as the bisimulation metrics, both in construction
and in the distances they assign.
A simple example revisited We consider the
almostbisimilar states in Figure 2, examined with as the base
metric. In this example, all states are 1-bisimilar, but
not 0-bisimilar. This is captured by the metric: one
needs to couple (s2; t1), since the marginal onto t1 has
to equal 1=3 + " and s1 only has 1=3 to spare. Since
(s2; t1) 2 support( ) and jr(s2) r(t1)j = 1, then
d (s0; t0) = 1. This example highlights the discontinuous
behaviour of the TE metric – the states can either be coupled
with rewards 0 or 1.</p>
      <p>5</p>
    </sec>
    <sec id="sec-5">
      <title>Comparing Bisimulation Metrics and</title>
    </sec>
    <sec id="sec-6">
      <title>Temporally Extended Metrics</title>
      <p>
        In this section, we compare the temporally extended
metrics with the bisimulation metrics. Results in this section are
given in terms of reward metric but can be generalized to
arbitrary base metrics. The proofs (see Appendix) elucidate
the very useful properties of liftings.
The bound is tight, and equality need not hold, as Figure
2 shows. Consequently, using bounds from
        <xref ref-type="bibr" rid="ref2">(Ferns,
Panangaden, and Precup 2004)</xref>
        , the TE metric gives a guarantee
on the difference in optimal value functions and on the
approximation error for state-abstraction.
      </p>
      <p>Corollary 5.1.1. Let V^ be the value function in the abstract
MDP of any abstraction , not necessarily a bisimulation.2
Then, 8s; s0 2 S :
1
1
In Figure 2, the same coupling minimized both the
bisimulation metric and the TE metric. Interestingly, the couplings
chosen need not be the same in general. This is the content
of the next theorem.</p>
      <p>Theorem 5.2. A minimum coupling 2
argmin (pa(s);pa(s0)) K(d )(pa(s); pa(s0)) of the
bisimulation metric need not be a lifting of the optimal bisimulation
d (s;s0). Conversely, which lifts the optimal bisimulation
d (s;s0) need not be a minimizer of K(d )(pa(s); pa(s0)).
This highlights the different behaviours of the two metrics –
the TE metric aims to minimize the reward difference
between coupled states at every step so as to ensure that a
single bisimulation relation holds, whereas the bisimulation
metric is not preserving a single relation and is willing to
couple large differences at an initial step. The couplings
chosen by the bisimulation metric do not give a (generalized)
bisimulation relation, and the best that one can do with a
(generalized) bisimulation relation is given by the temporal
extension.</p>
      <p>6</p>
    </sec>
    <sec id="sec-7">
      <title>Discussion</title>
      <p>
        We have introduced the temporally extended metrics, a novel
class of metrics for behavioural equivalence, which are
based on a generalized notion of bisimulation. We have
established bounds and other connections with the more
familiar bisimulation metric, and seen that they neither
compute the same values nor pick out the same couplings of
2we refer the reader to
        <xref ref-type="bibr" rid="ref5">(Li, Walsh, and Littman 2006)</xref>
        for
background on abstract MDPs
state distributions. After completing this work we
discovered that similar ideas for quantitative bisimulations have
successfully appeared in the control-theory literature
        <xref ref-type="bibr" rid="ref3">(Girard and Pappas 2007)</xref>
        , although in the different setting of
non-deterministic (rather than probabilistic) transition
systems without rewards.
      </p>
      <p>
        This work marks the beginning of an investigation into
formally safe, computationally tractable, and model-free
metrics for behavioural equivalence. There are many interesting
avenues for future work that we intend to pursue. For
computational aspects, the TE metric involves the computation
of a bisimulation relation rather than a bisimulation metric,
which can be done exactly in O(jAjjS j3) via partition
refinements as opposed to approximated upto a degree of
accuracy " in O(jAjjS j4 log jS j log ")
        <xref ref-type="bibr" rid="ref2">(Ferns, Panangaden, and
Precup 2004)</xref>
        . Deriving an exact algorithm, however, is left
for future work. The possibility of model-free computation
is hypothesized since the metric requires only the existence
of a lifting, as opposed to the exact weights of a coupling as
does the Kantorovich metric, thus should be easier to
estimate from samples.
      </p>
      <p>
        A slightly disconcerting aspect of the TE metrics is their
discontinuity with respect to the transition distributions,
observed in Figure 2. This is because we require exact
couplings - we are currently investigating the use of
approximate couplings to remedy this, which have recently surfaced
in the study of differential privacy
        <xref ref-type="bibr" rid="ref1">(Barthe et al. 2016)</xref>
        .
Finally, as the general notions of R-bisimulations and
temporal extensions can be considered for arbitrary
relations and metrics, examining the interplay between
different notions of bisimulation (e.g. for optimal value functions
or policy value functions) promises to be a fruitful
direction.
(1
+
      </p>
      <p>n+1
)d (s; s0) X</p>
      <p>n
) X
id (s; s0)
!
d (sk; sj ) = inf" sk
" sj</p>
      <p>d (s;s0), and we conclude that
d (s; s0); 8sk; sj .</p>
      <p>Thus the inequality holds for all n. Taking limits finishes the proof:</p>
      <p>Theorem A.4. A minimum coupling 2
argmin (pa(s);pa(s0)) K(d )(pa(s); pa(s0)) of the
bisimulation metric need not be a lifting of the optimal bisimulation
d (s;s0). Conversely, which lifts the optimal bisimulation
d (s;s0) need not be a minimizer of K(d )(pa(s); pa(s0)).</p>
      <sec id="sec-7-1">
        <title>Proof. Consider the following MDP, taking as our base metric.</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Proofs</title>
      <p>Theorem A.1. A relation U is a bisimulation relation
implies: 1. 8a r(s; a) = r(s0; a) and 2. 8a pa(s)(U )#p(a()s0)sU s0
Proof. The first condition is immediate in both cases. For
the forward implication, we pick the coupling a(s0; t0) =
1[s0Ut0]pa(s)(s0)pa(t)(t0)=pa(s)([s0]), where [s0] := ft0 j s0U t0g
is the equivalence class of s0. The marginals match since:
a(s0; S) = Pt0:s0Ut0 pa(s)(s0)pa(t)(t0)=pa(s)([s0]) =
pa(s)(s0)pa(t)([s0])=pa(s)([s0]) = pa(s)(s0); as
pa(s)([s0]) = pa(t)([s0]) by the second condition of
bisimulation. Similarly for a(S; t0). To make a a lifting of U we
still need to check that support( a) U , which is evident
since a(s0; t0) is only non-zero when s0U t0. For the converse,
we use Strassen’s theorem. Let C 2 S=U , note that we have
pa(s)(C) pa(t)(C) since U (C) = C. By symmetry of U , we
also have pa(t)(U )#pa(s), so that pa(t)(C) pa(s)(C). So
s t.</p>
      <p>Theorem A.2. Given a base pseudometric , the TE metric d is
indeed a pseudometric on S.</p>
      <p>Proof. 1. Note that s</p>
      <p>0 s (via the identity coupling (s0; s00) =
1[s=s0]pa(s)(s0)), thus d (s; s) = inf" 0fs
" sg = 0.</p>
      <sec id="sec-8-1">
        <title>2. Note that s</title>
        <p>" t ) t
(s0; t0)), thus d (s; t) = inf"fs
d (t; s).
" tg = inf"ft
" sg =
So d (s; t) = inf(B)
d (s; w) + d (w; t).
3. Let A = A1 + A2 = f"1 + "2js
"1 w &amp; w
"2 tg, and</p>
      </sec>
      <sec id="sec-8-2">
        <title>Theorem A.3. The temporal extension of</title>
        <p>bisimulation metric: 8s; s0 2 S,
upper bounds the
Proof. We proceed by induction, showing that 8s; t, dn(s; s0)
(1 ) Pin=0 id (s; s0): The base case is d1(s; s0) =
maxa f(1 )jr(s; a) r(s0; a)jg (1 )d (s; s0), using
maxa jr(s; a) r(s0; a)j d (s; s0). For the induction step we
upper bound the min-cost coupling from the minimization problem
with the liftings a 2 (pa(s); pa(s0)) given by d (s;s0) (one
can show the infs are always attained).
dn+1(s; s0) = maaxf1
)jr(s; a)</p>
        <p>r(s0; a)j
+</p>
        <p>K(dn)(pa(s); pa(s0))g
(1
(1
+
a(sk; sj )dn(sk; sj )</p>
        <p>n
) X</p>
        <p>!
Now we use the lifting property: the only non-zero terms in
the summation are those for which (sk; sj ) 2 support( a)
" s (via the mirror coupling (t0; s0) =
One can verify that ! minimizes the bisimulation distance
and that minimizes the temporally extended metric. Indeed,
Pu;v2S ! (u; v)d (u; v) = "d (s1; t1) + (1 ")d (s2; t2) =
2"f2(1 )g and this coupling is optimal for K(d ). Meanwhile
Pu;v2S (u; v)d (u; v) = "d (s1; t2) + "d (s2; t1) + (1
2")d (s2; t2) = "d (s1; t2) + "d (s2; t1) = 2"f(1 )(1 +
+ 2)g:
On the other hand, using ! , the best relation that can be lifted is
2, since (s1; t1) 2 support(! ) and r(s1) r(t1) = 2.
Meanwhile, lifts 1 and thus achieves the minimum lifting, since
(s1; t2); (s2; t1); (s2; t2) all have reward differences of 1. Thus !
minimizes the bisimulation metric but not the temporal extension
metric. Conversely, minimizes the temporal extension metric
since it achieves the minimum lifting, but not the bisimulation
metric.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Barthe</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Fong,
          <string-name>
            <given-names>N.</given-names>
            ;
            <surname>Gaboardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ; Gre´goire, B.;
            <surname>Hsu</surname>
          </string-name>
          , J.; and Strub, P.-Y.
          <year>2016</year>
          .
          <article-title>Advanced probabilistic couplings for differential privacy</article-title>
          .
          <source>In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security</source>
          ,
          <fpage>55</fpage>
          -
          <lpage>67</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Ferns</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Panangaden</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Precup</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Metrics for finite Markov decision precesses</article-title>
          .
          <source>In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence</source>
          ,
          <fpage>162</fpage>
          -
          <lpage>169</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Girard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pappas</surname>
            ,
            <given-names>G. J.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Approximation metrics for discrete and continuous systems</article-title>
          .
          <source>IEEE Transactions on Automatic Control</source>
          <volume>52</volume>
          (
          <issue>5</issue>
          ):
          <fpage>782</fpage>
          -
          <lpage>798</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Givan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and Greig,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2003</year>
          .
          <article-title>Equivalence notions and model minimization in Markov decision processes</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>147</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>163</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Walsh</surname>
          </string-name>
          , T. J.; and
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Towards a unified theory of state abstraction for mdps.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Lindvall</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>On Strassen's theorem on stochastic domination.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>Electronic communications in probability 4</source>
          :
          <fpage>51</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Milner</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>1980</year>
          .
          <article-title>A Calculus for Communicating Systems</article-title>
          , volume
          <volume>92</volume>
          of Lecture Notes in Computer Science. Springer-Verlag.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>